Sunday, December 2, 2012

Do your ancestors suffer from unplanned mutations?

This was a discussion item I posted to the CS101 forum. If you want to see the original posting see: http://forums.udacity.com/questions/2017355/do-your-ancestors-suffer-from-unplanned-mutations#cs101

If you want to see the referenced homework problem, see: http://www.udacity.com/view#Course/cs101/CourseRev/apr2012/Unit/709001/Nugget/710001

I learned more than I intended from Homework6-1starred. I wrote up a note about it to my once-upon-a-time (well, 1973) college roommate and cc'ed it to Professor Evans. Professor Evans suggested that I'm probably not the only person to run into this snag, and suggested that I make a forum post out of my note. So, here it is.

Unit 6 in udacity's CS101 is basically about recursion. One of the homework problems gives a "dictionary" that is keyed by names and for each name gives a list of that person's parents. It asks me to write a procedure "ancestors", that given a name, returns a list of all the known ancestors of that person or an empty list if the genealogy dictionary gives no ancestors for that name. Sounds straight forward. Strangely, it said, the robo-grader wouldn't mind duplicates in the output and was indifferent to the order of the output list. The "indifferent to the order" part made sense. I could put 1 parent on the list and then the ancestors of that parent, followed by the other parent and then the ancestors of that other parent, or I could put both parents on the list and then the 2 sets of ancestors. So, various output orderings are possible, but duplicates? Unless there is something funky in the genealogy, how could there be duplicates?

So I tossed together a straightforward recursive procedure to return an empty list if the given person is not in the dictionary, and otherwise, put that person's parents onto a list I called thegang, and then for each parent call ancestors to get grandparents, etc. If a non-null list comes back, then thegang gets those additional ancestors tagged onto the list. Much to my surprise, my output differed from the expected results in that I had duplicates. Just to make it easier to check results I added a line of code to the part that adds the additional ancestors to thegang to skip ones that are already in thegang. My output now matched the expected results, but I wanted to know where those duplicates came from. The genealogy was the family of Lady Ada Lovelace, Lord Byron, et al.

Here's the version of the program that the robo-grader declared as "correct", when I submitted it, but that I now know is flawed.

def ancestors(genealogy, person):
    if person in genealogy:
        parents=genealogy[person]
    else:
        return []
    thegang=parents
    for p in parents:
        q=ancestors(genealogy,p)
        for name in q:
            if name not in thegang:
                thegang.append(name)
    return thegang

The "if name not in thegang" test explicitly rules out adding duplicates to the output.

Here's a listing of the program that produces duplicates in the output.

def ancestors(genealogy, person):
    if person in genealogy:
        parents=genealogy[person]
    else:
        return []
    thegang=parents
    for p in parents:
        q=ancestors(genealogy,p)
        for name in q:
            thegang.append(name)
    return thegang

Took a lot of experimenting to find the source of the duplicates, but in the end I was able to fix the program to get rid of the duplicates without the brute force approach (albeit "brute force" being one line of code thanks to Python) of checking for duplicates and not putting 2nd copies into the list. You'd have laughed if you'd watched me at the library here. More than once I stood up at my cubbie desk and silently declared to myself "This can't be!', walked around a little bit and then settled back down to experiment some more. In the end, the problem turned out to be aliased lists, like I was fretting about in recent e-mail.

Here's a listing of the debugged program,

def ancestors(genealogy, person):
    if person in genealogy:
        parents=genealogy[person]
    else:
        return []
    thegang=list(parents)
    for p in parents:
        q=ancestors(genealogy,p)
        for name in q:
            thegang.append(name)
    return thegang

The significant change is "thegang=parents" became "thegang=list(parents)".

The problem was that I set parents by looking the person's name up in the genealogy dictionary. So, parents is an alias for the list of the person's parents. And then I initialize thegang by assigning parents to thegang, so now thegang is an alias too for the list of the person's parents as given in the genealogy dictionary. And then I do my recursions and add more names to thegang. But, when I do so, I'm packing those additional names into the original list of parents in the genealogy dictionary. So, when I am working on the next person, if there is any common ancestory, I don't get 2 parents, but a much longer list of parents. Fix is that when I initialize thegang, instead of saying thegang=parents, I instead say thegang=list(parents). That way, thegang isn't an alias of parents, but a copy of the parent's list, a fresh list of its own. Cheaper copy than you might think as the underlying strings are immutable and so get re-used without having to be copied themselves. That is, all that is copied are the 2 string references into a new list of those 2 string references.

But in thinking about it, it seems to me that the robo-grader should enforce a rule that thou shalt not screw up the genealogy data structure. Easy enough for it to have a private deep-copy of the original data structure and then compare the data structure in the end to the way it is supposed to be and see if anything changed. A similar watch for corruption of the index structure of the web search engine that is the recurring theme of the course would also be sensible.

Anyhow, it was a worthwhile mistake as it further educated me on the sort of bugs that can come from having aliased lists. Will I dodge the next such bug? Or at least be quicker to discover what is wrong? Well, time will tell....

Did you run into any such bug in working hw6star-1? Repeating a line I used on another question about lists, mutation of lists is a very powerful feature of Python, but with great power comes great responsibility.

Drew