Saturday, January 5, 2013

Why don't we use a bucket for every keyword?

As I explained in my first 2013 blog posting,'s CS101 course has a project running through it where you learn to program by building a web crawler and a search engine. The web crawler builds a data structure that the search engine then uses to find web pages. The initial implementation does sequential searches to find keyword matches in the data structure. Around unit 5 of the course, the scalability problem of that approach is pointed out, and hash tables are introduced as a more time efficient way to search for keyword matches. Hash tables have N "buckets" and a hash function that assigns any given keyword to a bucket. Sometimes, more than one keyword will hash to the same bucket so the hash table implementation has to know what to do with such collisions.

One student reasoned that if N buckets cuts down the search time to find a keyword match, that it would be even faster to allocate a separate bucket for each keyword. That question appears here:

So I explained why in my forum response:

Underneath it all, the computer memory is addressed numerically - the first storage location is numbered 0, the next is numbered 1 and so forth. Each storage location in the computer's memory typically can hold one character, so a string is a sequence of storage locations. If you set up a series of entries, each with a fixed length, and want to access the Nth entry, given the starting address of the series in memory, the length of the entry and N, it is a simple matter of arithmetic to compute the address of the Nth entry. But if you want the address of the entry for keyword "Eggplant", how do you map that string to a particular address? A hashtable is one technique. We treat the underlying representation of the letters in the string as numbers and run them through a function that computes a number that falls into a range (the number of buckets). The exact details of the computation can vary from one implementation to another, so long as the same result comes out for the same keyword each time for the same hash table.

Just now, for fun, I wrote a little program to tell me what 'Eggplant' works out to if you look at its underlying bits as a single number. In decimal that works out to: 5,139,324,068,239,476,530. 'Eggplant' isn't even a big word. How many buckets are you prepared to bring to the party? So we need a technique that'll reproducibly map that down to a number in a more convenient range and that's what we call a hash function.

There are other possible approaches than hashtables. b-trees, for instance. But that would take more explaining then I think would fit comfortably in this forum. In the end it all comes down to determining a numeric address that corresponds to the entry you seek.

There are programs that given a fixed list of words will create a "perfect hash function" - one with no collisions for the given words. Not applicable to the web search engine project, because the list of words isn't know-able in advance, but suppose you were writing a program to process Python programs. There is a fixed set of language keywords (if, else, print, def, ...) and an open-ended realm of possibilities for variable names. So a perfect hash to make it easy to look up the keywords might be useful, but the variable names will have to be handled separately.

So there are many ways to organize and store a collection of information. The amount of work to add an entry to the collection, the work to look up an entry and the space to store the collection vary with different techniques. Requirements vary, so good programmers need a large bag to hold all the possibilities they might pull out to apply to whatever problem. There are books where such things can be looked up and library modules often save you from having to reinvent things from scratch. Python embodies a tremendous long history of techniques and makes it look easy. Hurrah for progress and not needing to do pointer arithmetic. Indeed, even lists of variable length strings are pretty amazing when you think about what that involves behind the scenes.

Some day you may learn some less evolved programming languages such as C or assembler language where the underlying hardware is more visible to you. Nice place to visit, but keep your ruby slippers handy to take you back to the ethereal world of Python when you're ready to leave the registers and instruction codes behind. Layers and layers of abstraction, all of it know-able, but if you try to think at the machine code level all the time, you won't be very productive when working on some higher level problem.

The user gave my answer a "thumbs up" vote and "accepted" the answer as resolving his question. Valuable karma points. At least it made me feel I'd helped the guy out.