Monday, February 3, 2014

Python Generators


Back in May, 2013, I posted "Cheap, but not Free, Modular Code has a Price". In that blog article I compared the performance of my solution of Project Euler problem 14 to the performance of Nico Ekkart's solution of that same problem. As I explained in that article last May, Project Euler's guidance on performance is that a solution should be able to complete on a typical PC in less than a minute. My initial solution ran well over a minute and I applied a cache to speed my program up. Once I'd gotten it down below the one minute limit, I had concluded "good enough" and went on to other problems. But Nico honed his solution a little tighter and outperformed my program by a factor of 4.

Looks to me that we had basically the same solution, but his was more of an in-line implementation, while I used a Python generator to produce the sequences of numbers the processing of which is the heart of problem 14. I concluded that the Python structuring facilities like generators do add overhead but I opined that the added clarity of packaging the sequence generation as a "generator" made the overhead worthwhile. Nico's code for his cache was so little code that I didn't really defend my use of a Python "class" for my cache implementation. Some further experiments with the code to determine how much of my extra overhead came from the generator vs. how much came from my cache implementation might be interesting. I've posted the source code and invite you to experiment and share your findings.

But that isn't what I came here to talk about today

This blog post today is aimed at expanding your knowledge of Python Generators and share with you a neat trick that I've recently learned: a generator of dictionaries.

Python Dictionaries

Back in January, 2013, I posted "Dictionary Implementation". In that article I pointed to where the Udacity CS101 MOOC introduces Python dictionaries and pointed to further places to look to learn how Python implements dictionaries, hash tables. If you aren't comfortable with Python's notion of a dictionary, I'll sum it up as an easy to use collection of name-value pairs, much like an array is an easy to use collection of subscript-value pairs.

Given an integer value as a subscript into an array, you can easily ask for the value associated with that subscript. (In Python such an array is called a "list". But if you have a dictionary of name-value pairs, in Python it is easy to ask for the value associated with a name. e.g. What value is associated with "eggplant", where "eggplant" in this example is a possible "name" entry in the dictionary. If this is all still mysterious to you, my suggestion is to dig for more info on the web. Google search is your friend. A Google search for:

python dictionary

turned up more than 5,000,000 web pages today and Google looks to have done a decent job of sorting the most helpful ones to the front of the search results.

Or perhaps you should work your way through Udacity CS101 to learn this concept of dictionaries and more.

Python Generators

But as I already said, the topic of my blog post today is Python Generators. Rather than take the time and trouble to create a fresh new tutorial about Python Generators, I'll pause here to provide a link to an article from April 2013 from Jeff Knupp: "Improve your Python: 'Yield' and Generators Explained". That article resonated with me as written in an easy to understand way. It's about 19 screens long, so it isn't a quick read, but you can get back to Facebook and YouTube later.

Knupp's article should certainly get you to the point of being comfortable with my Project Euler problem 14 generator. My generator is very vanilla, just conjuring up a sequence of numbers. The one tricky aspect is my program often doesn't pursue the sequence to the end, but sometimes decides it needs to see a new sequence. I was a little nervous when I did it that way, but now understand that Python doesn't mind dangling generators like that, and even should know enough to be able to garbage collect their saved state, if memory gets tight and it is clear that there's no way the program could get back into that old sequence. Powerful stuff and a feature unlike anything available available in C, Fortran, Basic or Cobol to name 4 examples of older more traditional programming languages.

But Wait! There's more!

If you've read this far but are yawning, thinking that hardly ever do you, yourself, have occasion to use complex logic to generate a sequence of numbers in the programs that you write, well hang on and see what else generators are good for.

Igor Vasilcovsky recently pointed out to me an interesting tutorial from David Beazley about generators. The meat is the slide deck under the presentation link on that page. Even with 2 slides to a page, that slide set runs to 79 pages so it probably is too much to absorb in one sitting. So try to work through it in several sessions. Here's my summary of the key elements I think you should get from wading through that presentation:

  1. A generator can return any kind of Python object. That is, it isn't limited to returning the numbers of a sequence. If you paid close attention to my Pythonic Python from July, 2013, you may remember that one of the points I made there was that Python functions can return large complex objects, not limited to just a number or just a pointer. Better still, Python takes care of the book keeping to automatically garbage collect the returned object if you are no longer using it.

    Beazley has examples with a generator returning strings (lines from log files), and then takes an unexpected turn to have the generator parse the strings into values and stash the values into a dictionary with a name for each value and then return the resulting dictionary as the generator's output. (A dictionary per line of the log file!) My initial reaction to that was to be horrified at the overhead, but if you think about it, looking up a name in a dictionary is quite central to Python's run time modus operandi, so dictionary objects being generated is perhaps more general than generating some more specific Python class object, and looking up values by name is pretty much the same as using a name of a member of a class object to get to a value in the class object. So given that Python is able to get decent performance from run-time interpretive execution, generating dictionaries is a lot more sensible then my initial gut reaction took it to be.

  2. A generator is just Python software so it can do anything that any Python software might want to do to provide the next value in the sequence. It can compute the next value using a formula, or it could access a file to learn the next value. It is even able to make use of another generator to get an input sequence that it manipulates and passes along. So generators can be chained together into long pipelines.
  3. If your program is basically about iterative processing of a series of stuff, then chaining together pipelines of generators can fundamentally shape a clean structure for your program and with a little practice you'll soon have a collection of useful reusable filters that you can use to construct future programs that you haven't even thought of yet. Share them with your friends. If you own the code, consider making it (and it's documentation) GPL licensed and freely shared on Github or some other such public place.
I'm seeing hints on the Internet that future Python releases will have even more use of generators. e.g. standard approaches built in to the language to process streams of events. e.g. Brett Cannon's 40 minute talk: "Python 3.3: Trust me, it is better than 2.7". He foreshadows a Guido Van Rossum talk about Python 3.4 scheduled for that same conference. If you can find a link to a video of Guido's talk about 3.4 and new roles for generators, please share it in the comments section here.