Sunday, January 13, 2013

Udacity Unit 4 - Index data structure

In CS101, Unit 4, the notion of data structures is expanded beyond simple strings and simple lists. In Lesson 4-7, the web crawler is expanded into a function:


where index is the data structure that is passed to the web-page finder to find pages by keyword, and seed is the URL where the web crawler should start its crawl.
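The code snippet itself didn't survive in this copy, but from the description above the shape of the function is clear: it takes a seed URL and returns the index. Here is a hypothetical stub (the body is left as a comment, since filling it in is exactly the quiz):

```python
# Hypothetical sketch of the function shape described above: crawl_web
# takes a seed URL and returns the index data structure. The body is a
# stub; the course builds it up step by step.
def crawl_web(seed):
    tocrawl = [seed]  # pages still to visit
    crawled = []      # pages already visited
    index = []        # the data structure discussed below
    # ... visit each page, add its keywords and URL to index ...
    return index
```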

One student asked in the forum, "What is index to be?" Fair 'nuff question, and they specifically added that they didn't want an answer that solved the Lesson 4-7 quiz, just to understand what is wanted. The original forum thread is here:

Here's what I replied:

The crawl_web function is supposed to construct the index data structure that your search engine will use to find web pages. The truth is that the details of that structure will evolve between Lesson 4-7 and the end of the course. I suggest you re-watch the earlier lessons in Unit 4, especially 4-2.

I'm not sure I understand the notation in your question, but as a broad hint I'll tell you that the whole content of a web page doesn't belong in the index. crawl_web's job is to extract from the web pages the information you'll need to find again the pages that are relevant to some search word. Think about the index of a book. Say you want to find mentions of eggplant in the book. The book's index is an alphabetized list of the important words in the book, and for each word it gives you a list of page numbers where that word is mentioned. Web pages don't have page numbers, but they do have URLs. So what you want from the index data structure is something your search function can use to tell at a glance which URLs belong to web pages that mention the search term you are looking up (eggplant, perhaps).
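Using only strings and lists, one plausible shape for such a book-style index (a sketch for illustration, not the quiz answer — the keywords and URLs here are made up) is a list of [keyword, list-of-URLs] pairs:

```python
# A sketch of an index built from lists and strings: each entry pairs a
# keyword with the list of URLs where that keyword appears. The URLs and
# keywords are invented for illustration.
index = [
    ['eggplant', ['http://example.com/recipes', 'http://example.com/garden']],
    ['python',   ['http://example.com/snakes']],
]

def lookup(index, keyword):
    # Return the list of URLs recorded for keyword, or None if absent.
    for entry in index:
        if entry[0] == keyword:
            return entry[1]
    return None
```

With this shape, lookup('eggplant') is a simple scan down the list, just like flipping to a word in a book's index.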

At this point in the course, there are only a small number of data structures to consider: strings and lists, plus whatever grander things you can compose from them (lists of strings, lists of lists, ...). You've seen that the content of a web page can be regarded as one big string. With a fair amount of effort, your program can rummage through that string to find keywords and links (URLs). You don't want to repeat all that work every time you want to look up eggplant, so the idea is to do the work once and capture a summary of what you saw in the index data structure. Think ahead to looking things up: what kind of index would make that job easiest?
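The "do the work once" idea can be sketched as a helper that records each keyword/URL pair as the crawler encounters it. The name add_to_index and the list-of-pairs shape are my assumptions for illustration, following the book-index analogy:

```python
# Sketch of the "do the work once" idea: as the crawler finds each word
# on a page, it records that page's URL under the word. The structure is
# a list of [keyword, list-of-URLs] pairs, built up incrementally.
def add_to_index(index, keyword, url):
    for entry in index:
        if entry[0] == keyword:
            if url not in entry[1]:   # avoid duplicate URLs per keyword
                entry[1].append(url)
            return
    # First time we've seen this keyword: start a new entry.
    index.append([keyword, [url]])

index = []
add_to_index(index, 'eggplant', 'http://example.com/recipes')
add_to_index(index, 'eggplant', 'http://example.com/garden')
```

After those two calls, the index holds one 'eggplant' entry with both URLs, ready for fast lookups later.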

I hope that helps. A marvelous thing about Python is that a function can return a large, complicated data structure as its result. Many older languages insist that a function return only simple stuff, such as a single number. If you find yourself forced to work in a more old-fashioned programming language some day, you'll have a deeper appreciation of Python's generous approach.
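To make that last point concrete, here is a trivial demonstration (names and URLs invented) of a Python function handing a whole nested structure back to its caller:

```python
# A Python function can return an arbitrarily nested structure; the
# caller receives the whole thing in one assignment.
def make_tiny_index():
    return [['eggplant', ['http://example.com/recipes']]]

result = make_tiny_index()
```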