Tuesday, April 30, 2013

Can Computers Replace Teachers?

Can/Should computers replace teachers? It is clear to me that computers are not capable at this time of eliminating the role of teachers. But it is also evident that the number of students one teacher ought to be able to teach should be increased by use of technology for the mass presentation part of lessons. That should free up teachers for the parts that computers aren't yet able to handle, like open ended Q&A time, evaluating the quality of essays, projects and talks, and perhaps providing individual tutoring. The human aspect of teaching, which goes beyond the narrow bounds of "lessons" and covers things like motivating, comforting and counseling are additional weak spots in anticipating what a computer can do at this time (but do see the video linked to my blog post "Reading Faces"). It isn't so clear to me how a computer would cope with an unmotivated unruly student.

Not surprisingly, teachers are upset at the notion that fewer teachers would be needed in a school equipped with sufficient technology and that increasing the student/teacher ratio could result in big budget savings. But it is great fun to watch a good fight. To that end, I share with you today this 21-minute talk "Can Computers Replace Teachers?" by Katherine Mangu-Ward, and in rebuttal, see this blog post by a teacher, Joe Bower: http://www.joebower.org/2013/04/can-computers-replace-teachers.html. He provides a link to another article that is severely critical of the KIPP schools: "Poor Teaching for Poor Children...". I briefly talked about KIPP schools in my blog post about the book "How Children Succeed".

You may have noticed in my blog article that there were already differences between the approach of a KIPP school in an impoverished neighborhood vs. the approach taken by the more upscale Riverdale Country School. I suppose the moral of the story is that different kids have different needs and the "system" (whether that means a computer system, a school system or a teacher) needs to be able to sense and adapt to those needs. Perhaps this argues in favor of establishing separate tracks for different academic abilities and perhaps even for different social backgrounds. Now there's a controversial area! If the tracks are separate, how to assure that they are, in some sense, equal to each other? How "equal" ought the tracks to be?

I suspect the actual savings provided by use of technology is significantly smaller than the vendors project it to be. The "Total Cost of Ownership" over the long term has to factor in the cost of maintaining the network and systems and replacing obsolete equipment and keeping the software up to date. Prefabricated, canned, lessons need to be refreshed frequently to keep up with changes in the world. Arguably, math is a subject that doesn't need much change from year to year, but social studies clearly needs ties to current events to stay at all relevant. Biology is an example of a field with rapid change from year to year. Unchanging lessons in such a subject will be terribly far from state of the art in just a few years if significant effort isn't budgeted to keep the material current.

Perhaps in the future the pipeline for educating teachers will have different tracks for teachers interested in working directly with the students vs. folks interested in developing the class materials (the software). Of course the 2 sides need to stay in touch with each other, but I can see the need for fairly distinct skill sets.

As I've noted in past blog posts on the topic of "education", I'm a software engineer, not a professional teacher. I'm particularly interested in hearing other points of view on this topic of technology as a possible way to displace teachers. Feel free to share opinions in the comments section here or share a link to a blog post of your own if you'd like to have a less confined space to write in. e.g. Given a "canned" self-paced course like Udacity CS101 or some other MOOC course, what role, if any, is there for a locally available human teacher?

Tuesday, April 23, 2013

Robot Wars

Udacity.com CS101 has a unifying theme of building a web crawler and search engine as you learn Python. It's a well chosen project because it introduces HTML, and string processing and then segues into iteration, functions and recursion. As the index grows larger, data structures, hashing and the Python dictionary type are introduced, and improving the quality of search results gets some attention.

This article started out as a discussion thread in the CS101 "forum", but has been substantially re-worked and updated for this blog post. The original thread is here: http://forums.udacity.com/questions/100000043/robot-wars#cs101.

Having learned how to construct a web crawler in this course, I was amused in December to learn there's a company out there intent on protecting the web sites of their customers from being crawled. See: http://www.distil.it/the-dirty-secrets-about-robot-txt/#.ULo0_4c8CSo. A web site may post a robots.txt file on the root of its site to request web crawlers to stay out. But it's a request with no teeth. An ill-mannered or unscrupulous web crawling program might just ignore the robots.txt file and explore pages it was asked not to. The goal of Distil Networks is to block such crawlers. They don't say much about how they'd distinguish the web crawler traffic from real user traffic. Usage patterns, geographic location by IP address, that sort of stuff.

Well, I guess there's risk in publishing anything. IMHO, the web is for sharing information publicly. Hardly the right place to post anything you don't want others to see. The worry seems to be that if the info is too easily copyable by a crawler, that someone will duplicate your site and somehow profit from your work without compensating you.

Seems to me that a crawler that could alert you to duplicates of your page content would be useful. Of course, it would be pretty hard to do as the style wrapped around it could have changed so recognizing "your" content wouldn't necessarily be easy. It's a similar problem to those sites that do plagiarism detection in homework assignments.

Of course, if people are gathering up info from your site and not republishing it, but are finding a way to profit from the knowledge you shared, that seems fair. If they are spamming your customers, then maybe your site needs to keep customer contact info a bit closer to the vest. I believe there was an old business practice of salting confidential lists of customers with fake entries with phone numbers to trap any outsider trying to mine a stolen copy of the list. Similarly, I've heard tell of mapmakers putting some deliberate harmless error in their maps, so a copy could be recognized. Arguably, Google took somewhat that same approach to prove that Microsoft's Bing was copying Google search results. Microsoft, of course, strongly denied the allegation, but the matter did get the attention of even the Comedy Central cable TV network: http://mashable.com/2011/02/04/colbert-bing-google/.

Meanwhile, I think I see coming an escalating battle between sites and crawlers and expect I'll hate repeated interactions like captchas to see if I'm human, or complicated enough javascript tests to see if the query came from a real web browser, but slowing down the time to get to see the web content of interest and just raising the bar on how much like a browser the behavior of a web crawler needs to be.

To be fair, I freely admit that I have not tried the services of distil.it, so I have no real idea of how transparent their protections perhaps are, or any clue as to how easy is it for a sneaky web crawler to work around the restrictions. If you have any actual experience, good or bad, with such protections, please add some comments to this post to get the rest of us up to speed.

Sunday, April 14, 2013

Creeping up on OOP in Python, Part 3.

In Part 1 of this series, I looked at a solution to Project Euler problem 22 using the object-based pyparsing library. In Part 2 of this series I refined that code by introducing a simple object of my own creation. The change had the salulatory effect of moving the name-scoring initialization into a logical spot instead of leaving it up in the main routine.

Today, in Part 3, I look at a classroom assignment about multiple inheritance. On the one hand, working through this assignment did leave me with a new and better understanding of multiple inheritance. But, I confess to more than a little discomfort with this sort of assignment, dealing in obviously artificial A,B,C,D classes. What the assignment doesn't do is provide me with any insight as to why I might want as complex a class hierarchy as the A, B, C and D of this assignment. Perhaps I need to spend some time with some OOP text books to find practical examples of this sort of complexity. If someday the Aha! light clicks "on" for this matter for me, I promise to return with a Part 4 that has some less synthetic examples of multiple inheritance. Meanwhile, I confess that I've been looking at the Go programming language and it seems to have a much more obvious mechanism for "inheritance". But, not having written line 1 of Golang code, I'll not expound on that opinion here at this time.

The original problem posed is here: http://pastebin.com/ZgQAYPxw. It's just the kind of homework problem that I hate. It deals in a complex hierarchy of classes A, B, C and D, but does nothing to help me understand why a program might want such a hierarchy. I'm not exactly sure where the problem was posed, but I'll guess it was from the MIT 6.00x MOOC. At the point where my friend asked for help, he already had the correct answers and was concerned that he couldn't work out how they got those answers. On my first try I too came up with incorrect answers, so I had to dig a little deeper.

If you know something of Python multiple inheritance, now would be a good time to try to work out your answers to the questions in that assignment.

If you've read Part 1 and Part 2 of this series, you know I've had very little experience with classes and objects in Python, and have never dabbled at all in multiple inheritance. What I know is that inheritance is a way to make a specialized subclass of a class. e.g. suppose you have an employee object. Employee's have an ID number, a title, a location/room, office phone and so forth. Now suppose you want to define a manager object. A manager is an employee, but perhaps has an associated list of directly reporting employees (the folks he manages). So, to save having to re-implement all the basic stuff for the employee aspect of the manager, you instead derive the class for manager from the class for employee.

Fine. Where I get confused is when looking for good examples of multiple inheritance. Sitting here trying to make one up: Suppose we had a Stash class that knew how to pickle an object and write it to nonvolatile storage, returning a key value that you could later present to a method of Stash to retrieve the stored object. A "permanent" employee might usefully be derived from both the employee class and the stash class. But why multiple inheritance instead of deriving employee from stash and then manager from employee? The crystal grows dark. Perhaps if I get around to Part 4, I'll have good answers. If you can give me a clue that will further my education on this topic, do feel free to add a comment to this blog or if you'd like more room, post a link to a blog response of your own.

Back to A, B, C and D of the homework problem. The language is Python 2.7. That's potentially important because the method resolution order (MRO) changed in the move from Python 2.2 to Python 2.3. On http://www.python.org/download/releases/2.3/mro/, all the hairy details are explained and python code is given to inspect a class hierarchy. So I cross-bred that print_mro code with the original assignment and massaged the lines so the file was executable Python code. That gave me this: http://pastebin.com/CYmEeGds. My thinking was that by actually executing the code from the questions, I could see for certain if the official correct answers do come out, and then I can explore the test code to understand why those answers came out.

My first big surprise is that as given, print_mro only works on classes, not on objects. As I understand it, an object in python knows what class it is an instance of, so I believe that print_mro could be readily souped up to start from an object and report on the class MRO anyhow. I didn't bother - just lived with the limitation of only handing classes, not objects to the print_mro routine.

As the code now stands, this works:

obj = D()
print "D.__dict__=", D.__dict__

but print_mro(obj) fails.

A class has a set of built-in class attributes. CLASS.__dict__ is a dictionary of all those class attributes. See: http://www.tutorialspoint.com/python/python_classes_objects.htm. For all the foofaraw about MRO's the print_mro routine showed no surprises. The general rule is to find a method, start at the root of the hierarchy and work down and left to right. First match to the sought method name wins.

So the class attributes include __bases__, the parent classes of the class.

An object also has a set of built-in attributes: OBJECT.__class__ is the name of the object's class, and OBJECT.__dict__ is a dictionary of the object's attributes (just data attributes, like a, b, c, d in our example). Notice that __bases__ isn't lugged along with each instance of the class, just the class name and you'd have to look at the class's built in attributes for further info.

The output of running inherit.py is here: http://pastebin.com/FKhzJGRu Lines 76-77 show the linear list of the classes that make up class D. I see no surprise there. It's D, C, B, A. So when looking for a method (i.e. function) to invoke from an instance of D, that's the order of the classes that it'll look through in questing for that method.

But by peeking at the obj.__dict__ I see only simple values in that instance. e.g. {'a': 2, 'c': 5, 'b': 3, 'd': 6}, not the complicated class hierarchy I was expecting to see reflected in the data too. I was expecting to see an "a" attribute associated with class B and another from class A. But it seems the only "a" is for obj's class D. So the __init__ methods get invoked when obj is instantiated and the last to run is the one for class A, so the value of the "a" attribute set by B.__init__() is the one that prevails. A.__init__ sets "a" to 1, but it was invoked by B.__init__ and later in B.__init__, the "a" gets set to 2.

So that's quite different from what I expected to see, but perhaps is simpler than I was looking for. In any case, it does explain why obj.a ends up being 2.

Looking a bit further up in the hierarchy, D.__init__ invokes C.__init__ which sets "a" to 4, but next thing D.__init__ does is invoke B.__init__ so the "a" changes to 1 and then to 2 as I just described.

All those complex resolution order rules apply for find a method to apply to an object, but don't apply at all for a reference to a data attribute. If you are looking for an attribute named "a" the object's dictionary either has an entry for "a" or you are out of luck.

I still believe it'd be much more valuable if a real use for this stuff was contrived for the examples here instead of A, B, C and D. I haven't done any significant OOP, so I have no real experience with inheritance. Maybe if I had such experience, the merits of the Python implementation of multiple inheritance would be more obvious to me. Seems to me that having all the data attributes in one dictionary implies that in modifying any of the code of any of the classes in the complex hierarchy that a certain amount of care is needed to not introduce unintended name conflicts across the classes. My best guess is that someday when working a far more complex problem in Python that I'll suddenly recognize a burning need for multiple inheritance, but until then, I intend to follow the KISS principle.

If you know more about multiple inheritance in Python then I do (setting a very low bar...), please point out any errors in this article. I'm just sharing what I saw in my experiments, so there's plenty of room for me to get some or all of it outright wrong.

Wednesday, April 10, 2013

Reading Faces

For sure I'm going to have to re-watch Blade Runner. Not so surprising that a next logical step after pretty much working out facial recognition is to work on "reading" faces.


Could this become the killer app that sells Google Glass (the way that Lotus 1-2-3 [spreadsheet software] sold DOS PC's?)?

A tip of the hat to Mark Bruce for pointing the linked item out in the Google+ STEM community.

Tuesday, April 2, 2013

Creeping up on OOP in Python - Part 2.

Introduction to part 2 and Spoiler alert

In my previous blog post, I showed my first solution to Project Euler problem 22. For those of you who want to solve the problem on your own, I cautioned that my posting was a "spoiler". In today's article, part 2, I refine the code to use a Python object of my own. I believe the result is better code, but again, a "spoiler" if you'd rather develop your own independent implementation. Sorry, but if you want to remain untarnished by my prior art, please don't read any futher in this article and do not follow any of the links to the source code under discussion.

A classier solution to Problem 22

Well this has been fun! My improved version of problem22 is here: http://pastebin.com/agBBSQ9Z

In the previous version of the code, you may remember I had main() initialize valuedict so the value dictionary could be passed in to the namescorer routine. As I mentioned in part 1, I wasn't entirely happy with that arrangment as the valuedict was clearly an internal implementation detail of the namescorer routine. So, why should main() even know about that stuff?

The first big change was to add a NameScorer class. When you instantiate it, you tell it the alphabet of symbols that the NameScorer object will be asked to process. The first symbol is counted as value 1, the 2nd symbol is given a value of 2, etc. The __init__ routine of NameScorer builds the valuedict, which when you come down to it, is just an implementation detail of how we compute the score for a name. Assuming ASCII and using the ord(c)-ord('A') approach is another way to do it. The new approach removes the valuedict from main(). That looked out of place in main() and I either had to make it global (which is considered bad form) or do what my previous version did: pass the valuedict in to the nameScore routine each time it was called!. Happy to be rid of that.

What I learned as I wrote my first class is that a self. qualifier is very important when trying to refer to a symbol that is part of the instance of the class. Multiple times the tests aborted with a message that there's no such global symbol as valdict. I'd have to correct the code to say self.valdict and hit my forehead while exclaiming Doh! I kind of wish for a short cut like import has so you don't have to repeatedly qualify a symbol reference with the module name.

The next big change didn't actually survive to the final version. I had "for" loops that iterated over the alphabet or the name list, but I also had a counter that had to tick along with each iteration. That meant an initialization statement before the "for" and an incrementing statement in the body of the loop. There isn't wonderful documentation of the syntax in the Python books I have here, but you can say

     for (i, name) in

and thus have both the names and the counter automatically handled by the "for". That left almost nothing inside the "for" and it was then that I realized, the initialization of the value dictionary, could be handled by a "list comprehension" generating the tuples that then just become an argument to dict(). And the totaling of the weighted name scores in main could be done by invoking

    sum(i*ns.score(name) for (i, name) in
        zip(range(1,len(nameList)+1), namelist))

Best explanation of this that I've since found is Ned Batchelder's blog post "Loop like a native".

The books I have here say that current Python implementations implement list comprehensions like that by actually generating the list of tuples and iterating over that. Apparently the intent is that some day the tuples will be generated as needed. But heck. Memory is plentiful.

zip by the way is a function that given 2 lists, generates tuples by pairing together the i-th element in one list with the i-th element in the other list. It only generates as many tuples as the shorter of the 2 lists, but both my lists were the same length.

Somewhat invisible in the code is that while making these changes, I learned a few more basic tricks with the git version control system. I checked in yesterday's problem22.py to git. That version was implicitly the master branch.

Then I created an experimental branch (yeah, this is overkill for a one person operation, but remember I'm doing this to learn). Each time I had a working change to problem22.py, I'd check it into the experimental branch. My last incremental change was in recognition of the ease of getting old versions thanks to git and frequent well commented logs of what changed. So I decided that my old habit of leaving test prints, etc. embedded into the code, albeit commented out, was just making it harder to read the program. So I weeded that commented-out scaffolding out of the final version.

Premature optimization is a bad idea

One disapointment was I was hoping the list comprehensions would get interpreted faster than the for loops, albeit at the price of using more memory. But it runs no faster, maybe even 10ms slower, but that's in the noise range, so more tests would be needed to pin down what is actually happening. I did accidently give myself one clue. I added print statements to show the type of the grammar variable and of the grammar.parsefile(...) result that I had to convert to a vanilla list before I could sort the list. I was lazy, so rather than tear apart the return(list(grammar.parsefile(...))) so I'd have an intermediate result that I could pass to type() to see what it was, I left the return statement alone and preceded it with a print statement that changed list(...) to type(...). That meant I parsed the file twice. The total CPU time nearly doubled, so without resorting to grand profiling tools (which I confess I don't yet know anyhow - and profiling a 320ms run is not such a great thing to do!), but we have some evidence that the vast majority of the run time is going into pyparsing as it reads and parses the 5000 or so names in the input file. So hoping that tweeking the other loops and argument passing would improve CPU time was naive on my part. Oops.

How many times do I have to hear measure first before you try to optimize, before it really sinks into my brain?

Oh, and the type of the grammar is "pyparsing.And", which agrees with the top level of delimited list generating an expression using the overloaded + operator. I'm quite certain that the type of "grammar" is a subclass of parser element, so a grander grammer could be built up with this as just one part of it, Mercifully, I didn't need anything grander.

The type of grammar.parseFile(...) is 'pyparsing.ParseResults". I didn't dig deeper to see why the thing looks like a list, but won't sort. It occurs to me that sort depends heavily on the comparison operators and presumably copy operators to swap values around. Perhaps pyparsing overloads those operators for its ParseResults type. If and when future problems cause me to revisit pyparsing, I'll take a closer look.

The entire listing now handily fits on a single page.


I ran the file through sloccount and it says 15 non-comment, non-blank lines of source code. It also estimated that creating 15 lines of Python code like this would take about 0.65 months, which I figure is 2-3 weeks, which is scarily close to right. But if I wasn't taking the time to learn a new language and its modules and git, all at the same time, I'd like to believe I could knock out 15 lines of code in much less time.

(Of course, working on my own, my time is not getting sapped into Affirmative Action meetings, design reviews, questions from the peanut gallery and all those other things that motivated me to work at home when I was doing real design or real coding. So maybe in a real corporate environment, that 2-3 week estimate still would apply, even for a seasoned Pythonista, which I don't claim to be yet!)

My one regret is that I'm still not doing test-driven-development.

Python has the annoying property of accepting just about anything at "compile-time" and not telling you of problems until you drive execution into the problem code at runtime. That makes it super important to have excellent test cases to drive the execution into as many nooks and crannies of the code as you reasonably can manage to do.

My programs don't have built-in self-testing capability to watch out for any regressions as changes are made. I'll have to force myself to try that stuff in some future problems. At this point, I'm not even sure where the test code would go. A separate file in place of "__main__"? And, just to keep life interesting, Python has multiple unit-test modules available, so picking one intelligently will take some work. One of the unit-test packages is called "nose". If I pick that one will you go "Ewww!"?