Sunday, March 31, 2013

Creeping up on OOP in Python - Part 1

Background and Spoiler Alert

Project Euler is a challenge set of problems, hundreds of them, that you probably need a program to solve each problem. You are allowed to write the program in any programming language of your choice. Some of the problems require reading an input which is given in a file linked to the problem description page. All of the programs produce an answer - a single numeric output (sometimes the answer is a very big number with many digits in it). The site poses the problems and you get to tell the site the number that is the answer. The site provides no guidance beyond feedback as to whether your answer is right or wrong until you give it the right answer. A right answer unlocks access to a bulletin board where you can share how you solved the problem. Lots of bragging there from folks about how few lines of code they used or how real mathematical insight made the program so much easier than it appeared to be at first blush.

My background is software development. I've worked with mathematicians, but make no claim to being a mathematician myself. So solutions that are founded on real mathematical insight are not the sorts of things to expect from me. Sometimes, armed with the mathematical insight explained on the bulletin board, I've found it worthwhile to revisit my solution to make my code better. Sometimes I revisit my code with a critical eye just to make it better to satisfy myself. I do miss having talented co-workers who can review code and make suggestions at the lunch table. But as a retiree, it falls on me to be my own reviewer. Fortunately, I have always had strong opinions of the quality of code and a thick skin, so I can provide my own harsh comments to myself and get some cleaner code out of the introspection.

This series of articles is going to look at Problem 22 of Project Euler and is, quite frankly, a spoiler. If you'd rather do it yourself without seeing somebody else's solution, do not open any of the links to the code samples here, and do not read the rest of this blog post if you consider description of the code of a solution to be off-limits too. Before posting this article, I Googled for:

project euler problem 22 solution

and can see I'm by no means the only person breaking the wall and revealing a solution on the Web.

My motivation for tackling problems in Project Euler was to learn the Python programming language. Shortly before my retirement, David James, then of the Bell Labs statistics department of the Murray Hill Math research center, had suggested to me that Python looked like a great programming language to learn. I had no reason to doubt that, but was knee deep in projects in other languages, so I didn't have time just then to stop and learn something entirely new to me. Now that I'm retired, I have time for such adventures. Sadly, except for whatever value there is in telling the tales of such adventures here around the campfire, it isn't exactly clear what good will come from my learning yet-another-programming-language. If nothing else, it has certainly shown to me that David James has an eye for value in programming languages. His recommendation of Python appears to have been well founded indeed. I'm still a long way from becoming an expert in Python programming, but I find it is certainly a pleasant programming language to use to solve problems. Python code is readable (unlike, say, Perl or APL) and tends to be quite compact (unlike Java). The language supports a broad range of programming styles and brings more possibilities for control structures than any language I've worked in except maybe for assembler language, where the language brings no real structure to the program and anything goes that you can dream up - but document your assembler code well so the folks who come later can figure out what you did!

Old School: Structured Programming

My roots are in "structured programming" from the days when "Goto considered harmful" was still considered to be a controversial position. Happily, I've long since accepted the truth of that position and so don't miss at all that Python has no goto statement.

Quite some time ago, I laid out a list of things I believed I needed to master to consider myself a Python expert. Perhaps surprisingly, the language itself is only a small part of the things to learn. I blogged my list in an article that started out to be about "literate programming", but that mutated as I created that article into my Python study guide.

So, my solution of Project Euler problem 22 started out as a small, reasonably clean structured program.

The problem statement for the problem is here: http://projecteuler.net/problem=22, complete with a link to the names.txt file that the program is to process.

By the way, I've been working with Python 2.7. Availability of libraries for Python 3 has been my reason for hanging back with Python 2. As time goes by, that presumably will become less and less of a good reason.

Wrapper Main Program

Since all of the Project Euler problems produce a single numeric answer, I base each of my Project Euler programs on a wrapper main routine that invokes the code for the problem, expecting the called code to return a value. The wrapper main program has a print statement to print the returned value and infrastructure to time the runtime for the called routine and print a report of the runtime. My code for the wrapper main routine can be found here: http://pastebin.com/trk7E1n9. The wrapper main routine expects a single argument from the command line to tell it where the actual program code is located. The wrapper assumes that the program to be run is in the file named by the command line argument and that the name of the program to be invoked from that file is main().

I've not been entirely delighted with my wrapper main program. It reports elapsed time for the test, not CPU time. And the time routines from the library seem rather coarse in their granularity, but Project Euler's guidance on runtimes is only that the data for these problems shouldn't take more than a minute to process on a typical PC or you should rethink your algorithm. My timer code may be a little crude, but it can certainly show whether I've met that guideline.

I've written my program but should it take days to get to the answer?


Absolutely not! Each problem has been designed according to a "one-minute rule", which means that although it may take several hours to design a successful algorithm with more difficult problems, an efficient implementation will allow a solution to be obtained on a modestly powered computer in less than one minute.

If you really want to think about measuring time in your program, you ought to study this page: http://stackoverflow.com/questions/85451/python-time-clock-vs-time-time-accuracy

The Rest of the Problem Solution

My first-version code for problem 22 is here: http://pastebin.com/SgZ7y54b

There are commented-out print statements in the file that I used when I was doing my testing of the code.

One other tool that I learned to work with as part of my study of Python is the source-code-control tool git as provided by github. As I got used to the power of git, I found that the code really reads better if you remove the debugging print statements, not merely comment them out. But the edition of problem22.py that I'm showing is from earlier in the process than my reaching that realization.

The description of the input file for Problem 22 says the input is a sequence of names, each enclosed in double-quotes and separated from each other by commas. Now Python is great at that kind of parsing of input, but I dis-like writing and debugging the fiddly code needed to carefully parse an input file, particularly if the parser is to make reasonable recovery from errors in the input's punctuation. So, I invoked the computer science principle that "there's no problem so big or so complicated that it can't be run away from". My escape hatch was to apply the pyparsing library module. pyparsing is, in my opinion, very Pythonic in its approach. The library allows you to define a grammar in terms of Python objects and then apply that grammar against your input. There are library alternatives that may give better performance, but I found pyparsing to be reasonably well documented and a joy to apply to the straight-forward problem at hand. I ended up having to write delightfully little code to solve the problem. (I concede that later, as I worked through CS101 in Udacity.com that I came to understand that Python may be good enough at parsing input strings that perhaps I should have just gritted my teeth and written the straight-forward code to parse the input file. But I do suspect that familiarity with pyparsing will be useful to me in the future if I ever am faced with parsing a more complex input structure.

This pyparsing library was my first serious encounter with the objected-oriented-programming (OOP) aspects of the Python language. Python is gentle in providing for such modern stuff, yet not forcing you to really grok the feature before letting you write reasonable software in Python.

The Devil is in the Details

Python is a relatively small programming language. As I mentioned, I strongly believe that the Python language itself is only a small part of what I believe I have to learn to be a Python expert. Nevertheless, there were still details of the language that I sort of stumbled into understanding as I worked this problem. I kept notes that I'll share with you here...

The ParseInput routine effectively has only 4 lines of code:

    def ParseInput(filename):
         from pyparsing import (delimitedList, QuotedString)
         grammar=delimitedList(QuotedString('""))
         return(list(grammar.parsefile(filename, parseAll=True)))

The define statement is the boilerplate to define a function. The from import () statement is where I import the pyparsing module from the library. My first try was to say:

     from pyparsing import *

which I expected would import all the "public" names from the module into ParseInput's name space. That shortcut is frowned upon as a change in the library's module could bring a name conflict into my function, but I figured it would save me the trouble of having to name every little thing I wanted to try. To my surprise, the import * was rejected as only usable at a module level. I think its saying that if I'm writing my own module, I can import * all the names in another module, but if I'm just inside a function, I have to be more specific. Rats, shortcut denied. Next try I said:

     from pyparsing import (delimitedList, QuotedString, parseFile)

but it objected that I couldn't import parseFile like that. delimitedList and QuotedString are classes defined in the pyparsing module. They are subclasses of the ParseElement class. The advantage of importing the names into my function are that I can refer to the names without having to qualify the name with the pyparsing module name. pyparsing.delimitedList is kind of long winded in composing a grammar. parseFile though is different. It's a method (invocable function attribute) of the grammar. I'd have to dig into the source code to be sure I'm saying this right, but I think that the grammar built from the parserElements is itself a parserElement, so I believe parseFile and its brother, parseString, are methods that can be invoked for a parserElement.

In the 3rd line of code, I creatively name my grammar, "grammar". delimitedList defaults to a list of comma separated items. The grammar for each item is the argument you pass to delimitedList, so I passed in QuotedString('"'). The argument to Quoted String is the quote character it is to look for. Default is that it strips out the quotes and just returns the stuff enclosed in the quotes. My first try I didn't know that QuotedString required an argument, so I coded it as QuotedString() and got a message about a class being invoked with one argument where it expected two.

The object-oriented stuff in Python always implies an argument of "self", the object on which the method is being invoked. So, my missing argument was the missing "2nd" argument.

I mistakenly reasoned that maybe I should just pass in QuotedString without the (). That got me an obscure (to me) message that I can't "combine type 'type' with a ParserElement". I eventually figured out what it meant. QuotedString is a subclass of Token and Token is a subclass of ParserElement, but QuotedString with no arguments is a class definition, a type. That is type(QuotedString) is 'type'. I tried

     qs=QuotedString('"')

and got type(qs) is "QuotedString" and isinstance(qs, ParserElement) was true. Whew! Good thing this is all OpenSource, as I don't think I'd have made much progress at all if I couldn't follow Obiwan's advice and "Use the source, Luke".

delimitedList internally uses + to construct the grammar for what it is looking for. (pretty straight forward. It wants to find a match to the argument grammar, followed by a comma, which it suppresses [gobbles up], followed by another match to the argument grammar OR it will settle for a match to the argument grammar, not followed by the delimiter [to cover the last item in the list. No need for a trailing comma after the last item]. I'm oversimplifying as the list can have an arbitrary number of comma separated items in it, and optionally, you can name an alternative delimiter instead of sticking with the comma).

parserElement redefines the + operator to make it into an "and" concatenation of parserElements, but it didn't want to concatenate a type to a parserElement. Giving an argument to QuotedString causes an instantiation of the class so I have an object that is an instance of a ParserElement and so is acceptable to the overloaded + operator to concatenate that object to another ParserElement object.

The 4th line of code started out as:

     return(grammer.parseFile(filename, parseAll=True)

(The parseAll=True tells parseFile to throw an exception if the grammar doesn't successfully parse the whole of the input file. Default is it parses as far as the grammar will match in the file and leaves things such that another grammar can be applied next.

That return statement actually seemed to work, but the returned value wouldn't re-order when I invoked the sort method that is associated with lists. I used the type function to see what the type of the returned value was. It was a pyparsing.<something> type, which I'm guessing is a subclass of the built-in list type. It had no objection to my invoking a sort method on it, but it didn't sort the list. Solution was to explicitly convert the returned value to a plain old list using the list() function which you'll now find in that 4th line of code.

So 4 lines of code written, and I'm sure glad no one is measuring my lines of code per hour productivity.

nameScore Routine

The next function is nameScore. 5 lines of code that given a name computes a score for it. A=1, B=2, C=3, .... Score is the sum of the values for the characters in the name. Straight forward iteration over list(name). List knows how to bust a string up into the sequence of characters that I want to process one by one.

valuedict is a nifty Pythonic feature. Basically its a hash table. I wasn't quite sure where to define it, so I just put the initialization into main() and passed the valuedict as an argument to nameScore. Maybe if I was more object-oriented in my thinking, I'd have made nameScore into some kind of a class with an __init__ function that built that valuedict that stays with the nameScore object. In Part 2 of this article I'll go back and try it that way just to see what it looks like.

The names.txt file for this problem is fairly massive - 46KB containing 5000 names all on one very long line of text. For debugging, I constructed a smalllist.txt file that had only a few names in it. smalllist.txt is a small file in the syntax of the file given with the program. The real file had some 5000 names in it, all on one line. Looking at the discussion board for the problem, a disappointing number of people pre-processed the names.txt file (e.g. using Excel) to pull the words apart and alphabetize them. Some even left the words with the double quotes still there, just giving " a zero value in scoring the names.

main() Routine

That brings us to the main() routine. Line 30 runs ParseInput on the input file to get the list of names into nameList. Line 36 applies the sort method of lists to put nameList into alphabetical order. Line 46 to 50 initializes the valuedict. The given problem data only has upper case letters. I threw in values for lower case letters, numbers, and space and period so I could put "R. Drew Davis" into my smalllist.txt test data file. Lines 53-58 are the loop to tally up the weighted sum of all the name scores. And line 59 is where I returned the result. 14 lines of code in main(). Clear and straight-forward.

Object Lesson

The objective of this article was to get closer to understanding Object Oriented Programming and why it is useful in writing clean code. I haven't tried to provide a tutorial of the mechanics of using OOP in Python. Wesley Chun's book "Core Python Programming", 2nd edition, Chapter 13 is a decent reference for learning this stuff. Or if you are too cheap to invest in a reference book, dig into the Web with Google or your favorite search engine. There are plenty of web pages to be found that talk about Objects in Python.

In this article we looked at use of the pyparsing library's objects to define and apply a grammar. I haven't implemented an object and class on my own yet. We'll get to that in Part 2.

I'm still learning this stuff. If you spot anything that I've gotten wrong, please add a comment to let me know so I can correct both the article and my understanding of the topic. Thank-you.