Friday, December 21, 2012

Concerns of Real Search Engine Developers

Concerns of Real Search Engine Developers

Udacity updates the course material from time to time. A unifying thread that ran through the course was development of a web search engine. Some while after I'd completed working through unit 6, they added a set of Q&A sessions. Q&A 6-6 is here:

Q&A 6-6 is a discussion of the current concerns of real search engine developers.

After viewing that clip, it occurred to me that a major concern of real search engine developers hadn't been included. That led to an exchange of e-mail with the professor, and at his request, I posted that exchange to the CS101 forum.

The original forum thread is here:

Here is what I posted:

I sent some e-mail recently to Professor Evans:

I just stumbled upon Q&A 6. I guess you guys added it since I'd already completed chapter 6. Q&A 6-6 about what are the challenges for search engines these days could use a little expansion. A major challenge that I don't think you mentioned is how to keep the search engine attractively useful while also figuring out how to make a buck. A recent discussion on facebook started out with a friend mentioning he'd recently purchased some fresh turmeric root and was wondering what to do with it. I suggested he just Google for:
turmeric root recipe

and pick what he wanted from there. I even pointed to a specific page that was early in the search results and looked like a good answer to his question. But the thread degenerated into a lot of gripes about how Google clutters things up with ads, and wishing that someone would do better. Not sure if Facebook will block you from accessing it, but here's the URL for that thread: That discussion is now up to 54 replies. Clearly a topic about which people have strong opinions. (To be fair, Barry's friends probably are more geek- inclined than a random sample of the population would be. As evidence I offer 2 examples of search terms mentioned in the gripes:


double walled heat shrink tubing


Prof. Evans graciously replied:

Thanks, Drew. Yes, there is definitely a lot more that goes into this, and a lot of things lead to conflicts between commercial interests and good search results. DuckDuckGo seems to do pretty well on these searches.

Happy to have discussion on this in our forums.


--- Dave

I'm happy to open the matter up to forum discussion as he suggests. How are the search engines doing, in your opinion, at balancing the interests of providing good search results (to keep us users happy) vs. making a buck to keep the lights on (and the stock-holders happy)?


The posting stirred up much less controversy on the forum than I'd expected. Sentiment among the CS101 students who bothered to respond seemed to be that Google was doing a decent job of balancing promotional info and search results, and that DuckDuckGo was a comparatively weaker search engine.

Sunday, December 16, 2012

CS102, let's make it happen!

One of the CS101 students posted a question to the forum asking what to study beyond CS101. Here's a link to that original thread:

The question asked if an Algorithms course should be next in the sequence.

My reply wasn't in terms of actual Udacity courses, but attempted to describe what I believe to be a reasonable curriculum path to pursue the study of computer science, starting with a CS102 to cover more of Python and Object Oriented Programming and Test Driven Development than CS101 covered. Of all my postings to the CS101 forum, this was the one that got the most thumbs-up votes from folks. My reply (which actually pre-dates my reaching the end of CS101):

That depends. A Donald Knuth-style algorithms course is going to have a fair amount of math involved. A student just out of CS101 might need more time to work through enough math courses to be ready for it. Not sure how that would work in a course-at-a-time program like Udacity. I could imagine a course that rolls the teaching of the math in with the teaching of algorithms.

An Artsy next course might have a title like Computers and Society. What can computers do vs. what can people do? How do computers effect society (e.g. economics, privacy, security, censorship, cryptography, digital rights, intellectual property, ...). I remember taking such a course and nicknaming it Computers: Threat or Menace. Assigned reading included "The Moon is a Harsh Mistress". Somewhere in the Pythonic humor of the question is asked "Whatever made you think that 'import skynet' was the right thing to do?"

Another possible direction for the curriculum would be to introduce another programming language. Looking at Cornell U's catalog, I see it now sports CS202 Transition to Java. A more general variation on this would be something I'd call Comparative Programming Languages. Not much interesting to say on Python vs. Perl vs. Ruby, but Python vs. Java vs. C++ vs. Haskell vs. Erlang would make for an interesting series of lectures on automatic memory management vs. do-it-yourself memory management, interpreted vs. compiled, hard typing vs. duck typing, imperative programming vs. functional programming, and how to get along with multi-core processors, among major topics. Library differences are a much more subtle topic. There still may be times when Fortran is the right choice. A Java course would be appropriate if the curriculum was headed in the direction of teaching you how to make stuff for the Web. These days, I would hope it would also at least touch on Javascript too. Despite the mention of J-A-V-A in most names, Java is not Javascript. The Comparative Programming Languages course is a pre-req to learning to implement programming languages (Compilers, Interpreters, and Assembler, Oh my!).

Yet another direction is Computer System Architecture. What is CPU clock speed? Why have cache? what is the difference between 64-bit and 32-bit machines? Internals of data representation? (Floating point vs. decimal vs. binary. ASCII vs. Unicode). Stack machines vs. register machines. Real memory vs. virtual memory. RAM vs. disk vs. tape vs. solid-state. Fundamental limits on storing bits. Parity bits. Error correcting codes. Why are matters of power and cooling important? Quantum computing? Arguably this is a more hardware oriented course than most, but no soldering is required. It's more at the level of concepts and block diagrams and a little bit of Physics. A case can be made that knowledge of computer system architecture is a pre-req for really appreciating issues of multi-core. Computer System Architecture is certainly needed to be able to appreciate an Operating Systems Principles class where the task at hand is to manage and share out the components provided by the hardware. File systems, for example, presumably would be covered in the OS class, but hard to do in a sensible way if you don't know seek time, rotational latency, and so forth. Understanding of Computer System Architecture is also probably needed before you'll be comfy in a Compilers course.

Udacity's CS101 (the end of which I have not yet reached) seems to under teach data types, object oriented programming and such. So, there is certainly room for a CS102 that continues to teach Python and maybe gets into test-driven-development. It would expose you to more algorithms too, which would be good prep for a subsequent algorithms and data structures course.

Hey, computer graphics, data networks, computer data bases and collaborative software development (working as a member of a team) are other topics that don't really fit into any of the above courses.

Bottom line conclusion: a comprehensive study of computer science doesn't really fit into a 4-year program. Maybe that's why there are Master's degrees - to give you another year or 2 to take it all in. Fascinating field.

Donald Knuth set out many, many years ago to write a book "The Art of Computer Programming". He outlined it as 7 chapters. It's been his life's work ever since, and he's only up to Volume 4A so far. Promises to be a 7 volume (or so) set. And that doesn't count the separate textbook "Concrete Mathematics", which expanded the overly terse math review of chapter 1 into a whole book of its own (By Ron Graham, D. Knuth and O. Patashnik).

I hope that helps. I didn't dig into Udacity's actual course offerings and certainly don't know what direction they have in mind. A lot of "what should come next" depends on where you are trying to go. Lots of people can bang out code to print the payroll and know nothing of Directed Acyclic Graphs. In my opinion, "university level courses" are not "trade school" and should aim to convey broad knowledge more than specific skills. Udacity should not be the "Close Cover Before Striking" School of Computer Programming, as advertised on matchbook covers.

Were it up to me, I think I'd provide a CS102 to follow CS101, and next offer an Algorithms plus Data Structures course that embeds the teaching of enough math to hold it together. Then Computer Architectures followed by Comparative Programming Languages next. But I can already hear whining: 5 courses into a Computer Science curriculum and I still haven't been shown how to implement an Ajax application! Life is hard.

My background is in engineering, so I'm probably underestimating the importance of a softer Artsy Computers and Society course. A future senator would have much more use for the material in a Computers and Society course than in mastering test-driven-development or Agile programming.

If you were looking for different information than this, please post again to clarify what you seek.

Sunday, December 9, 2012


A student in Udacity cs101, a seeker of truth, posted a question to the forum asking what is actually true and what is false in Python. To read the original question, see:

Here is a reposting of my reply to that question:

Your last example:

c = []
random_operation(c) #we do not know if this operation 
                    #  actually adds to list 'c'
if c:               #we only want to execute the following 
                    #  if list 'c' is not empty
    return c[2]     #because if it is empty this will return 
                    #  an IndexError: out of range
troubles me. The if you show doesn't really assure that return c[2] will avoid raising an IndexError. A better guard test would be to ask if len(c) > 2 so you can be sure that c[2] isn't past the end of c.

Somehow your contemplation of what is true, reminded me of this old Gahan Wilson cartoon: Is Nothing Sacred? though if your goal is mastery of Python and not so much mastery of philosophy, there may be more to gain in contemplating this stackoverflow question and its diverse answers: Python elegant assignment based on true false values.

I was surprised at how much digging it took to find a definitive specification of exactly what is true and what is false in Python. The answer I found is here: Python truth values.

By providing the interactive Python interpreter via the course's browser window Udacity avoids saying much about versions of the Python language, perhaps avoids too much - but you should be aware that while there is much discussion of Python 2 vs. Python 3 on the web, that the big deal there is that there were changes made to the language in Python 3 that would break old code (that is, the new version was not upward compatible with the old), but within the sequence of Python 2.x releases, there have also been interesting evolutionary changes to the language.

In looking into things for your question tonight, one surprise was how recently True and False were added to the Python language: PEP 285 - A boolean type, new in Python 2.3. And the language continues to evolve. e.g. PEP 308 - Conditional expressions , new in Python 2.5. But I've never seen a statement of this form in actual use:

x = true_value if condition else false_value
except tonight I gave it a try:
print 42 if towels > 0 else None
It actually seems more reminiscent of Perl than Python, but if you are running a 2.5 or newer version of Python, apparently you are entitled to write code in that form. If you have a towel, the answer is 42, but if towels is 0, the answer is None...

Which brought me to wondering exactly what version of Python is in use in Udacity's CS101. Adding this code to a homework problem:

import sys
print sys.version
got me the answer to that question:

2.6.6 (r266:84292, Sep 12 2011, 14:03:02) 
[GCC 4.4.5 20110214 (Red Hat 4.4.5-6)]
So, it's a pretty recent version (2.6.6, from August 2010) of Python 2, but it is not 2.7 (not 2.7.3, nor 2.6.8 either, both of which were released in April 2012, if you want to look closer still). If you're curious to know what is the difference between 2.6 and 2.7, see: What's new in Python 2.7. Release 2.6.8 is a maintenance release to provide some security fixes. The history of the dictionary hashing problem is interesting reading, and shows how subtle a flaw can be a security hole these days.

Saturday, December 8, 2012

Yucca Mountain Post-Mortem

It is clear to me that the US needs more domestic sources of energy. Fracking to get more natural gas, perhaps at the price of ruining underground aquifers and with the obvious consequence of increased carbon emissions, looks like a bad choice. More coal and more drilling for domestic oil isn't a big win either. Energy without increased carbon emissions means nuclear, but Obama seems to have kept his 2008 promise to oppose that. (Obama and obstruction have a nice alliterative compatibility). I haven't seen any serious growth in research of alternatives to nuclear fission (nuclear fusion is at least a hope, but it seems to not be on an urgent-need fast-track). Here's an article that got me pissed off at both major political parties, but mostly at the Democrats. The Republicans angered me by not speaking plainly with a counter-proposal even though the alternative will draw clear opposition. They campaigned on more domestic hydrocarbons as the solution for the next 100 years. Humbug!

Not clear to me what was so wrong with Yucca Mountain, Hanover Reservation, and middle-of-nowhere Texas. But, if there's not going to be a domestic solution to the problem, how about negotiating internationally? A highway to far northern Canada, perhaps? Maybe even build some large nuclear power stations up there and powerlines to ship the power back to the US? Arguably, power lines are less environmentally risky than really long oil pipelines. Are the deserts of Mexico geologically stable enough to be worth considering for nuclear waste disposal? Are we sure we've completely evaluated the recycling possibilities of the "spent" fuel? If it's going to be dangerous for 10,000 years, doesn't that imply there's additional energy that could somehow be extracted from the spent fuel? Sounds like a topic that could put some scientists and engineers to work towards creating a solution where so far the money has just been going to lawyers and politicians to oppose progress.

May 10, 2016 Update

3+ years later and I don't think I've really found the right places to publicize this particular blog article of mine. Google's stats for the page say it has only had 28 page views so far. Seems that fracking may be contributing to a surprising increase in the number of earthquakes in the US midwest. I've seen nothing to make me think I should embrace fracking as a good idea. If anything, fracking now is looking to be an even worse idea than I'd thought it was. I hope you smirk as much as I did at this official US gov denial of there being any connection between fracking and earthquakes. Myths and Misconceptions. They explain that it isn't fracking that is causing the earthquakes, it is waste water disposal that is causing the earthquakes, and unless you read the piece carefully, you might not even notice where all that wastewater is coming from. It comes from the oil wells that the fracking made profitable, and for whatever reason, Uncle Sam let's them pump the hazardous waste water into deep wells where they promise it won't bother anybody (unless you are the namby pamby sort of person who gets upset by, say, earthquakes or the prospect of that hazardous waste water mixing in with your drinking water well-water some day). How about requiring the hazardous water be processed to make pure water and some reduced quantity of hazardous waste that then has to be disposed of in some safe (though perhaps expensive) manner. Given ten thousand years, nuclear waste at least decays, eventually leaving just "rocks". How long does fracking waste need to be safely contained and monitored before it loses its toxicity and other dangerous properties? You say my proposal for handling the fracking waste seriously reduces the profitability of fracking? Aww... Break my heart. It's nature's way of telling you something's wrong... <- a song from 1970. Everything old is new again? <- a song from 1978

I did recently come across a related article from Boston and am encouraged to learn there are people in the nuclear power business who are seriously pursuing the notion that there is much more energy available to extract from today's nuclear waste. A New Way to Get Nuclear Power, Reducing Waste in the Process. A pleasant surprise is that the proposed reactor not only can be fueled with "spent" fuel, from today's reactors, but that in the process of extracting more energy from that spent fuel, although it still eventually results in its own "spent fuel" nuclear waste to dispose of, but the half-life of the twice-spent waste is merely hundreds of years, not ten thousand years. Sounds like their molten salt reactor (a revival of a 1960's design idea?) is many years away from becoming a real part of our energy supply, but it at least let's me feel like less of a crack-pot. (In case it wasn't already obvious to you, I am not a nuclear engineer at all. I have an engineering education, but went into the business of software development, which I might term "software engineering", but there are "real" engineers out there who would regard that as a controversial claim. An essay topic for another day...).

So, please, publically speak up in opposition to fracking, and in favor of non-carbon energy sources. If you can help bring this little essay to the attention of some more people so more people would actually read this, that would make me feel like I at least did my part.

Wednesday, December 5, 2012

Where to get Python?

In the Udacity CS101 forum, a frequent question from students has been where to get Python for their own computer. For example, see

Here is my reply to that question:

It's called an interpreter in Python's case, not a compiler. A true compiler translates source code to executable machine code. More dynamic languages, e.g. Python, Perl and Java, translate the source code into a more compact form and then have an interpreter that runs the program for you. The exact meanings of even "simple" stuff like the + operator depends on the types of the operands and the types in Python might not be known until runtime. The traditional thinking is that an interpreted language (e.g. Python) is doomed to have worse performance than a compiled language (e.g. C), but it turns out that's a lot more interesting an argument than it was just a few years ago. If you want to take a deep dive into a technical computer science discussion of this matter, I suggest you spend some time watching Dynamic Languages Strike Back. It's a very technical talk, and I, for one, had to look up quite a few things mentioned in the talk. Don't let that throw you. Take notes while you view the video and, later, Google up more info on the topics you hear mentioned but didn't really know. If you are interested in computer science, I assure you its a worthwhile talk.

Getting to your question: The version of Python used in udacity's CS101 is Python 2.6.6. The most current version of 2.6 at this time is 2.6.8. There are multiple versions available and the details of how to install it depend on your choice of operating system. If you run Linux or a Mac, it is very likely that there's already a Python interpreter on your computer. If you are running Windows, there are choices to make. The cygwin package from provides Unix-like commands on Windows and optionally can install a nice Python environment on your Windows PC. But be forwarned that if you want to write software that is closely wedded to Windows, cygwin may not be the right choice. There are still more choices and most all of them are free. Google for

python windows

and it'll find lots of pages for you. There's a packaging called Portable Python that comes with lots of extra modules bundled in. You can install it on a USB flash-drive and run from there without having to really install anything on whatever Windows PC you walk up to. Just plug in your USB drive and "wherever you go, there you are". I mostly run Ubuntu Linux here, so I have little experience with Native Windows Python. So, I don't really know how hard it is to incorporate another module into the Portable Python environment. I did install Python and wxpython and the Eric IDE on my wife's Windows PC. Compared to the ease of pulling in optional packages on Ubuntu, Windows doesn't make it easy to pull your python environment together. Keep good notes of exactly what you install and where you got it from and what the steps were to install the pieces. Unless you have a specific piece in mind that doesn't come with Portable Python, I think I'd pick Portable Python if Cygwin isn't suitable for you.

Not entirely free, but with much more formal support available, there's a company called ActiveState that markets a rich bundle of Python and Python modules all integrated together and with commercial support. Probably not the right choice for an individual student or hobbyist, but if you wanted Python for Windows for a commercial application, ActiveState is certainly a company you'd want to talk to. Or, if you want to write for the Microsoft .NET environment, then by all means talk to Microsoft about Iron Python. Bring money if that's what you want, because it isn't free.

Another alternative to consider is Python in the Cloud. PythonAnywhere is built atop Amazon's cloud computing service. For small stuff, accounts are free. It's a reasonably rich "batteries included" environment. It has the advantage of nothing for you to install at your end, and the promise that if your application becomes a Farmville-like success, that the infrastructure can handle your needs (but at some point in your huge success, you will have crossed the line from "free" to you-need-to-pay-for-your-account). Google Apps is another way to do Python in the Cloud. I've barely dabbled in that...

And you probably thought you were asking a simple question. My apologies if my answer has muddied the water. If you want a simple answer for Windows, look at cygwin and at Portable Python. This may be a good time to wave bye-bye to Mr. Gates and move to Linux instead of Windows.

Does that help?


Since writing the above, I was pleasantly surprised to learn from other postings to the Udacity forum how many Python-in-the-cloud interpreters there are, at least for small development use. The list at may be helpful.

The visual debugging at is particularly special, at least if you are dealing with small amounts of code. For larger scale programs, particularly programs that import other modules, I stand by my suggestion of PythonAnywhere or GoogleApps.

What is Python good for?

Another student asked on the Udacity CS101 forum about what Python is good for. If you'd like to see the original question, see:

Here was my reply to that question:

The major limitation of Python traditionally is performance, but you can generally dodge that by coding the really hot loops of your program in compiled C routines that you invoke from Python. The great advantage of Python is that you'll have working code while the guy working in C++ or whatever is still debugging.

There are some notable Python features that the course didn't delve into: Python user-defined classes and modules - and a few data types: imaginary numbers, decimal numbers, rational numbers. The language has pleasant surprises in its lack of limitations.

It's really not that huge of a language so there's not all that much more for you to know to know the language, but behind the language there's a gi-normous collection of library routines for you to draw upon. Google is your friend for discovering what is out there, but whatever your field of endeavor, it is worthwhile to check to see what library code is out there to help you along. There are routines to help you write data parsers, routines to help you write AI programs, routines to help with math and the list goes on.

There's a style to pulling Python code together that you need to get used to if you want to consider yourself to be a Python expert. The gotcha is that Python will accept code that is garbage and not tell you about the junk being junk until you drive your program into trying to run the junk, so Python gets along well with test-driven-development. You need to develop test cases to drive the program though as much of the possible execution paths as you can. There are Python library modules to facilitate that approach to the code. Getting facile with test-driven-development is still on my to-do list, but I'm now taking CS258 from in hopes it will at least give me a sound appreciation of the limitations of the approach.

Moving further away from problems I understand and into problems I'm only aware are out there to grab me and thrash me about: I've heard that Python has problems with making use of n processors at once and that there's reason to believe that the future is in multi-core systems where n processors is going to feature bigger and bigger n as time goes by. There are at least theoretical reasons to believe that Haskell or perhaps Erlang are going to fit more comfortably into that world of the future.

If, like me, you enjoy learning new stuff, there's no reason to worry about running out of new stuff to learn! Yay for computer science!


Tuesday, December 4, 2012

Scooping the loop snooper

This was another discussion item I posted to the Udacity cs101 forum. To see the original post, see:

Your computer science education isn't complete until you've debugged an infinite loop or 2 or 3 or 4 or 5 or ... Perhaps you wondered why Python let's things run until a time limit is hit instead of detecting that your code has an infinite loop in it and issuing an error message before burning up a wad of CPU time. Well, it could try to do that, but here's proof that such detection can't work all the time. The problem is called the "Halting Problem". You can use Google search to find learned papers on the topic, but here's a simple proof in the style of Doctor Seuss.

Enjoy! Simple, but I do wish it had illustrations in his style. Where are Thing 1 and Thing 2 when you need them?


Monday, December 3, 2012

Is the Udacity CS101 course watered down?

April 21, 2016 note

This article was originally published on December 03. 2012. It's now more than 3 years later and this is one of the most popular articles on my blog, having accumulated more than 2000 pageviews, so I re-visited the article to see how it's holding up. Alas, the link ( to the original forum question in the opening paragraph no longer works as Udacity has re-done their forum. Worse still, far as I can tell, not all of the old forum Q&A items survived that re-do. I think I finally found a link that works. Sigh. And the forum re-do broke other links in this article too. I've done my best to replace them with updated links. If you find any problems with broken links or whatever be sure to leave a comment about the problem so I can do what I can to fix it.

I haven't retaken the CS101 course lately, but my understanding is that it now has some additional content and is now, nominally, a 3 month course. It is still self-paced and you can still take it for free, but they offer an option ("Udacity Connect") for $99/month to have face-to-face (Skype?) support, so you can personally ask questions, get human assessment of your work, have a sense of accountability for your progress. If you find you aren't as self-driven, self-motivated as you thought, perhaps that face-to-face option is worth $99/month to help keep you on track. It certainly is less than college tuition.

The original December 03, 2012 article, as amended

A Udacity CS101 student asked on the forum if the course is watered down compared to taking an introductory programming course from Prof. Evans on campus in Virginia. To see the original question, see: Is this course [CS101] watered down

Here is my reply to that question:

It's an 8-week course intended to fit into your life - that is, it isn't 8 weeks of intense focus every waking hour. It doesn't cover 100% of the Python language and Python environment, but it is not, IMHO, watered down. It covers what it intends to cover and does a good job of it. At the beginning of Unit 7, Dave gives a quick recap of what the course has covered. I won't rehash that here, but will try to describe from a different angle what you should get out of CS101:

  1. Python programming language. A start on programming, specifically in Python.
  2. Introduction to computer science. BNF, abstraction, recursion, iteration, even an introduction to performance analysis (e.g. linear run time, quadratic run time, ...)
  3. Research skills. Hone your use of Google or the search engine of your choice to find more information than the lectures have covered.
  4. Communication skills. The forum is an important part of the course. You should be asking and answering questions there. It should motivate you into working on those research skills mentioned as item 3.
  5. Time management. Much like getting through all the hoops to earn a college degree, getting a certificate of completion from CS101 shows that you've managed your time well enough to stay on track and keep moving forward for the 8 or so weeks. That's an important skill that will serve you well on the longer haul to a 4-year college degree.
  6. An idea of whether computer programming and computer science is a field that interests you enough to focus on it for years to come. If you hate it, you can always go in other directions (Hamburger U.?) and will only have sacrificed a fraction of 8 weeks of your time. I hope you find the course stimulating and enjoyable and that it leaves you hungry for more. I assure you there is no shortage of more.
Expanding a bit on items 3 & 4 from the above list:

One of the things you should master while you are in this course is how to dig up additional information from the web. This forum is one place to learn more than the lectures cover. In part, reading the forum helps. But to really get the most out of it, you should attempt to answer questions too. You don't need to know the answers off the top of your head. They may require some digging and testing. If you aren't really sure of the answer explain what you think and what you're still unsure of. Someone else might then be able to help you and the original poster too. One of the creators of the Stack Overflow web site (the software on which this forum apparently is founded) had a very good essay on his blog explaining the importance of communication skills in programming:

But what should you study next?:

Elsewhere in the CS101 forum is a discussion of what might come after CS101. This next write-up isn't in terms of actual Udacity course offerings, but more like wishing for what Udacity could offer in the future. You can read that here:

Let's make it happen - CS102. Or see my blog article of the same title: [Blogged edition] - Let's make it happen - CS102. Looking at the Udacity forum today [April 21, 2016], it appears that the CS102 discussion continued here:

Also elsewhere in this forum is a really good list of additional sites to read. You can read the list here:

I haven't read everything on that list (yet), but I do particularly commend the Peter Norvig essay:

I've been programming now for approximately 40 years and I guess I'm a little slow as I'm still finding things to learn. In my humble opinion, the field's endless supply of new things to dig into is what has held my interest for so long.

I won't dwell here on enumerating things outside the scope of CS101. Some time ago, when "learn the Python language" was an entirely not-done thing on my to-do list, I wrote a blog entry that was nominally about 'literate programming', but spun off into cataloging the many things that I thought I'd have to learn to feel I'd become an expert Python programmer. The language itself is just one of those things. If you'd like to read my blog entry, see:

I confess that despite this being several years later, I still feel I have a lot more to learn (though I'm happy with my progress and delighted with the language as a tool for software development). I've created a Pearltree of links to more info about software development and a sub-tree there is specifically about Python, largely shaped by that catalog of topics from my old blog posting. Here's a URL for the Python Pearltree:

I'm still dabbling in Python, but starting to think the next thing I should tackle is the quite different programming language, Haskell. At this time, "learning Haskell" is in the entirely not-done category of my to-do list. Here's the essay that convinced me it was worth learning yet another programming language:

I hope my reply here has convinced you that CS101 is time well spent even though it is well short of finishing a degree in Computer Science. Even a degree in Computer Science is, at best, just a start at becoming a really Good programmer.

'It's not just a job, it's an adventure.'


Sunday, December 2, 2012

Do your ancestors suffer from unplanned mutations?

This was a discussion item I posted to the CS101 forum. If you want to see the original posting see:

If you want to see the referenced homework problem, see:

I learned more than I intended from Homework6-1starred. I wrote up a note about it to my once-upon-a-time (well, 1973) college roommate and cc'ed it to Professor Evans. Professor Evans suggested that I'm probably not the only person to run into this snag, and suggested that I make a forum post out of my note. So, here it is.

Unit 6 in udacity's CS101 is basically about recursion. One of the homework problems gives a "dictionary" that is keyed by names and for each name gives a list of that person's parents. It asks me to write a procedure "ancestors", that given a name, returns a list of all the known ancestors of that person or an empty list if the genealogy dictionary gives no ancestors for that name. Sounds straight forward. Strangely, it said, the robo-grader wouldn't mind duplicates in the output and was indifferent to the order of the output list. The "indifferent to the order" part made sense. I could put 1 parent on the list and then the ancestors of that parent, followed by the other parent and then the ancestors of that other parent, or I could put both parents on the list and then the 2 sets of ancestors. So, various output orderings are possible, but duplicates? Unless there is something funky in the genealogy, how could there be duplicates?

So I tossed together a straightforward recursive procedure to return an empty list if the given person is not in the dictionary, and otherwise, put that person's parents onto a list I called thegang, and then for each parent call ancestors to get grandparents, etc. If a non-null list comes back, then thegang gets those additional ancestors tagged onto the list. Much to my surprise, my output differed from the expected results in that I had duplicates. Just to make it easier to check results I added a line of code to the part that adds the additional ancestors to thegang to skip ones that are already in thegang. My output now matched the expected results, but I wanted to know where those duplicates came from. The genealogy was the family of Lady Ada Lovelace, Lord Byron, et al.

Here's the version of the program that the robo-grader declared as "correct", when I submitted it, but that I now know is flawed.

def ancestors(genealogy, person):
    if person in genealogy:
        return []
    for p in parents:
        for name in q:
            if name not in thegang:
    return thegang

The "if name not in thegang" test explicitly rules out adding duplicates to the output.

Here's a listing of the program that produces duplicates in the output.

def ancestors(genealogy, person):
    if person in genealogy:
        return []
    for p in parents:
        for name in q:
    return thegang

Took a lot of experimenting to find the source of the duplicates, but in the end I was able to fix the program to get rid of the duplicates without the brute force approach (albeit "brute force" being one line of code thanks to Python) of checking for duplicates and not putting 2nd copies into the list. You'd have laughed if you'd watched me at the library here. More than once I stood up at my cubbie desk and silently declared to myself "This can't be!', walked around a little bit and then settled back down to experiment some more. In the end, the problem turned out to be aliased lists, like I was fretting about in recent e-mail.

Here's a listing of the debugged program,

def ancestors(genealogy, person):
    if person in genealogy:
        return []
    for p in parents:
        for name in q:
    return thegang

The significant change is "thegang=parents" became "thegang=list(parents)".

The problem was that I set parents by looking the person's name up in the genealogy dictionary. So, parents is an alias for the list of the person's parents. And then I initialize thegang by assigning parents to thegang, so now thegang is an alias too for the list of the person's parents as given in the genealogy dictionary. And then I do my recursions and add more names to thegang. But, when I do so, I'm packing those additional names into the original list of parents in the genealogy dictionary. So, when I am working on the next person, if there is any common ancestory, I don't get 2 parents, but a much longer list of parents. Fix is that when I initialize thegang, instead of saying thegang=parents, I instead say thegang=list(parents). That way, thegang isn't an alias of parents, but a copy of the parent's list, a fresh list of its own. Cheaper copy than you might think as the underlying strings are immutable and so get re-used without having to be copied themselves. That is, all that is copied are the 2 string references into a new list of those 2 string references.

But in thinking about it, it seems to me that the robo-grader should enforce a rule that thou shalt not screw up the genealogy data structure. Easy enough for it to have a private deep-copy of the original data structure and then compare the data structure in the end to the way it is supposed to be and see if anything changed. A similar watch for corruption of the index structure of the web search engine that is the recurring theme of the course would also be sensible.

Anyhow, it was a worthwhile mistake as it further educated me on the sort of bugs that can come from having aliased lists. Will I dodge the next such bug? Or at least be quicker to discover what is wrong? Well, time will tell....

Did you run into any such bug in working hw6star-1? Repeating a line I used on another question about lists, mutation of lists is a very powerful feature of Python, but with great power comes great responsibility.