Introduction to part 2 and Spoiler alert
In my previous blog post, I showed my first solution to Project Euler problem 22. For those of you who want to solve the problem on your own, I cautioned that my posting was a "spoiler". In today's article, part 2, I refine the code to use a Python object of my own. I believe the result is better code, but again, a "spoiler" if you'd rather develop your own independent implementation. Sorry, but if you want to remain untarnished by my prior art, please don't read any further in this article and do not follow any of the links to the source code under discussion.
A classier solution to Problem 22
Well, this has been fun! My improved version of problem22 is here: http://pastebin.com/agBBSQ9Z
In the previous version of the code, you may remember I had main() initialize valuedict so the value dictionary could be passed in to the namescorer routine. As I mentioned in part 1, I wasn't entirely happy with that arrangement, as valuedict was clearly an internal implementation detail of the namescorer routine. So why should main() even know about that stuff?
The first big change was to add a NameScorer class. When you instantiate it, you tell it the alphabet of symbols that the NameScorer object will be asked to process. The first symbol is counted as value 1, the 2nd symbol is given a value of 2, etc. The __init__ routine of NameScorer builds the valuedict, which, when you come down to it, is just an implementation detail of how we compute the score for a name. (Assuming ASCII and using the ord(c)-ord('A') approach is another way to do it.) The new approach removes the valuedict from main(), where it looked out of place: I either had to make it global (which is considered bad form) or do what my previous version did and pass the valuedict in to the nameScore routine each time it was called! Happy to be rid of that.
What I learned as I wrote my first class is that the self. qualifier is very important when referring to a symbol that is part of the instance of the class. Multiple times the tests aborted with a message that there's no such global symbol as valuedict. I'd have to correct the code to say self.valuedict and hit my forehead while exclaiming "Doh!" I kind of wish for a shortcut like import has, so you don't have to repeatedly qualify a symbol reference with the module name.
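To make that concrete, here is a minimal sketch of roughly how such a class looks; the real version is in the pastebin link above, and the details here are reconstructed rather than copied:

    class NameScorer:
        def __init__(self, alphabet):
            # The first symbol is worth 1, the second 2, and so on.
            # A list comprehension generates the (symbol, value) tuples,
            # which then just become the argument to dict().
            self.valuedict = dict([(c, v) for (v, c) in
                                   zip(range(1, len(alphabet)+1), alphabet)])

        def score(self, name):
            # Add up the values of the symbols making up the name.
            return sum(self.valuedict[c] for c in name)

With ns = NameScorer('ABCDEFGHIJKLMNOPQRSTUVWXYZ'), ns.score('COLIN') comes out to 53, the worked example from the problem statement.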
The next big change didn't actually survive to the final version. I had "for" loops that iterated over the alphabet or the name list, but I also had a counter that had to tick along with each iteration. That meant an initialization statement before the "for" and an incrementing statement in the body of the loop. The Python books I have here don't document the syntax wonderfully, but you can say

    for (i, name) in zip(range(1, len(nameList)+1), nameList):

and thus have both the names and the counter automatically handled by the "for". That left almost nothing inside the "for", and it was then that I realized that the initialization of the value dictionary could be handled by a "list comprehension" generating the tuples that then just become an argument to dict(). And the totaling of the weighted name scores in main() could be done by invoking

    sum(i*ns.score(name) for (i, name) in zip(range(1, len(nameList)+1), nameList))

The best explanation of this that I've since found is Ned Batchelder's blog post "Loop like a native".
The books I have here say that current Python implementations handle an expression like that by actually generating the whole list of tuples and then iterating over it. Apparently the intent is that some day the tuples will be generated only as needed. But heck, memory is plentiful.
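As an aside, the syntax itself already draws that line: square brackets build the whole list up front, while parentheses, as in the sum(...) expression above, make a generator expression whose items are produced only as they are consumed. A quick illustration:

    squares_list = [x*x for x in range(5)]   # the whole list exists in memory
    squares_gen  = (x*x for x in range(5))   # items are produced on demand
    print(sum(squares_list))                 # 30
    print(sum(squares_gen))                  # 30, computed one item at a time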
zip, by the way, is a function that, given 2 lists, generates tuples by pairing the i-th element of one list with the i-th element of the other. It only generates as many tuples as the shorter of the 2 lists, but both of my lists were the same length.
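A quick demonstration of that pairing (the list() wrapper is there because zip can hand back its tuples lazily rather than as a materialized list):

    print(list(zip([1, 2, 3], ['a', 'b'])))
    # [(1, 'a'), (2, 'b')] -- stops at the end of the shorter list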
Somewhat invisible in the code is that while making these changes,
I learned a few more basic tricks with the git version control system.
I checked in yesterday's problem22.py to git. That version was implicitly
the master branch.
Then I created an experimental branch (yeah, this is overkill for a one-person operation, but remember I'm doing this to learn). Each time I had a working change to problem22.py, I'd check it into the experimental branch. My last incremental change came from recognizing how easy git makes it to retrieve old versions, given frequent, well-commented logs of what changed. I decided that my old habit of leaving test prints and the like embedded in the code, albeit commented out, was just making the program harder to read, so I weeded that commented-out scaffolding out of the final version.
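For the record, the git round trip looked roughly like this (the commit messages are placeholders, not what I actually wrote):

    git add problem22.py
    git commit -m "problem 22: working version from part 1"
    git checkout -b experimental      # create and switch to the experimental branch
    # ...edit problem22.py, rerun it, confirm the answer is unchanged...
    git commit -am "fold the counters into the for loops with zip"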
My one regret is that I'm still not doing test-driven development.
Python has the annoying property of accepting just about anything at "compile time" and not telling you of problems until you drive execution into the problem code at runtime. That makes it super important to have excellent test cases that drive execution into as many nooks and crannies of the code as you reasonably can manage.
My programs don't have built-in self-testing capability to watch out for
any regressions as changes are made. I'll have to force myself to try
that stuff in some future problems. At this point, I'm not even sure
where the test code would go. A separate file in place of "__main__"?
And, just to keep life interesting, Python has multiple unit-test modules
available, so picking one intelligently will take some work. One of the
unit-test packages is called "nose".
If I pick that one, will you go "Ewww!"?
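If I do try it, a minimal unittest version might look something like this (the file layout and import are my guesses at how I'd arrange things, not anything from the current code):

    # test_problem22.py
    import unittest
    from problem22 import NameScorer

    class TestNameScorer(unittest.TestCase):
        def test_colin(self):
            ns = NameScorer('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
            # COLIN is the worked example from the problem statement:
            # 3 + 15 + 12 + 9 + 14 = 53
            self.assertEqual(ns.score('COLIN'), 53)

    if __name__ == '__main__':
        unittest.main()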
Premature optimization is a bad idea
One disappointment: I was hoping the list comprehensions would get interpreted faster than the for loops, albeit at the price of using more memory. But the program runs no faster, maybe even 10ms slower; that's in the noise range, though, so more tests would be needed to pin down what is actually happening. I did accidentally give myself one clue. I added print statements to show the type of the grammar variable and of the grammar.parseFile(...) result that I had to convert to a vanilla list before I could sort it. I was lazy, so rather than tear apart the return(list(grammar.parseFile(...))) to get an intermediate result I could pass to type(), I left the return statement alone and preceded it with a print statement that changed list(...) to type(...). That meant I parsed the file twice, and the total CPU time nearly doubled. So without resorting to grand profiling tools (which I confess I don't know yet anyhow, and profiling a 320ms run is not such a great thing to do!), we have some evidence that the vast majority of the run time is going into pyparsing as it reads and parses the 5000 or so names in the input file. Hoping that tweaking the other loops and argument passing would improve CPU time was naive on my part. Oops. How many times do I have to hear "measure first, before you try to optimize" before it really sinks into my brain?

Oh, and the type of the grammar is "pyparsing.And", which agrees with the top level of the delimited list generating an expression using the overloaded + operator. I'm quite certain that the type of "grammar" is a subclass of ParserElement, so a grander grammar could be built up with this as just one part of it. Mercifully, I didn't need anything grander. The type of grammar.parseFile(...) is "pyparsing.ParseResults". I didn't dig deeper to see why the thing looks like a list but won't sort. It occurs to me that sort depends heavily on the comparison operators, and presumably copy operations, to swap values around. Perhaps pyparsing overloads those operators for its ParseResults type. If and when future problems cause me to revisit pyparsing, I'll take a closer look.
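Roughly what that type-printing experiment looked like, with grammar standing in for the real parser expression and names.txt for the input file (both reconstructed, not copied from the pastebin code):

    print(type(grammar))                      # <class 'pyparsing.And'>
    result = grammar.parseFile('names.txt')   # this was the accidental second parse
    print(type(result))                       # <class 'pyparsing.ParseResults'>
    names = list(result)                      # convert to a vanilla list...
    names.sort()                              # ...which sorts just fine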
Metrics
I ran the file through sloccount and it says 15 non-comment, non-blank lines of source code. It also estimated that creating 15 lines of Python code like this would take about 0.65 months, which I figure is 2-3 weeks, which is scarily close to right. But if I weren't taking the time to learn a new language, its modules, and git all at the same time, I'd like to believe I could knock out 15 lines of code in much less time. The entire listing now handily fits on a single page.

(Of course, working on my own, my time is not getting sapped into Affirmative Action meetings, design reviews, questions from the peanut gallery, and all those other things that motivated me to work at home when I was doing real design or real coding. So maybe in a real corporate environment, that 2-3 week estimate would still apply, even for a seasoned Pythonista, which I don't claim to be yet!)