Tuesday, April 23, 2013

Robot Wars

Udacity.com CS101 has a unifying theme of building a web crawler and search engine as you learn Python. It's a well-chosen project because it introduces HTML and string processing, then segues into iteration, functions, and recursion. As the index grows larger, data structures, hashing, and the Python dictionary type are introduced, and improving the quality of search results gets some attention.
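To give the flavor of that project, here's a toy crawler-plus-index in the same spirit, using an in-memory "web" instead of real HTTP so it runs anywhere. The function and variable names are mine, not the course's, and the word-splitting is deliberately naive.

```python
# A toy CS101-style crawler and index. The "web" is an in-memory dict
# mapping URL -> page content, so no network access is needed.

def get_links(page):
    """Extract href targets from a page with simple string processing."""
    links, pos = [], 0
    while True:
        start = page.find('<a href="', pos)
        if start == -1:
            return links
        start += len('<a href="')
        end = page.find('"', start)
        links.append(page[start:end])
        pos = end

def crawl(seed, pages):
    """Breadth-first crawl; build an index mapping word -> list of URLs.

    Naively splits raw page text on whitespace, so markup tokens get
    indexed too -- fine for a sketch, not for production.
    """
    to_crawl, crawled, index = [seed], set(), {}
    while to_crawl:
        url = to_crawl.pop(0)
        if url in crawled or url not in pages:
            continue
        crawled.add(url)
        for word in pages[url].split():
            index.setdefault(word, []).append(url)
        to_crawl.extend(get_links(pages[url]))
    return index

web = {
    'a.html': 'python course <a href="b.html">next</a>',
    'b.html': 'python crawler',
}
index = crawl('a.html', web)
print(index['python'])  # -> ['a.html', 'b.html']
```

The dictionary keyed by word is exactly where the course's discussion of hashing pays off: lookup time stays flat as the index grows.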

This article started out as a discussion thread in the CS101 "forum", but has been substantially re-worked and updated for this blog post. The original thread is here: http://forums.udacity.com/questions/100000043/robot-wars#cs101.

Having learned how to construct a web crawler in this course, I was amused in December to learn there's a company out there intent on protecting its customers' web sites from being crawled. See: http://www.distil.it/the-dirty-secrets-about-robot-txt/#.ULo0_4c8CSo. A web site may post a robots.txt file at the root of its site to request that web crawlers stay out. But it's a request with no teeth: an ill-mannered or unscrupulous web crawling program might just ignore the robots.txt file and explore pages it was asked not to. The goal of Distil Networks is to block such crawlers. They don't say much about how they'd distinguish web crawler traffic from real user traffic; usage patterns, geographic location by IP address, that sort of stuff, presumably.
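For the well-mannered side of that story: Python's standard library can parse robots.txt for you. This sketch feeds the parser a sample file rather than fetching a real one; the example.com URLs and the "MyCrawler" user-agent name are just placeholders.

```python
# Honoring robots.txt with the standard library's robotparser.
from urllib import robotparser

rp = robotparser.RobotFileParser()
# In a real crawler you'd fetch the live file instead:
#   rp.set_url("http://example.com/robots.txt"); rp.read()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("MyCrawler", "http://example.com/private/page.html"))  # False
print(rp.can_fetch("MyCrawler", "http://example.com/public/page.html"))   # True
```

Of course, nothing forces a crawler to call can_fetch before requesting a page, which is exactly the "no teeth" problem.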

Well, I guess there's risk in publishing anything. IMHO, the web is for sharing information publicly. Hardly the right place to post anything you don't want others to see. The worry seems to be that if the info is too easily copyable by a crawler, that someone will duplicate your site and somehow profit from your work without compensating you.

Seems to me that a crawler that could alert you to duplicates of your page content would be useful. Of course, it would be pretty hard to do: the styling wrapped around the content could have changed, so recognizing "your" content wouldn't necessarily be easy. It's a similar problem to the one faced by sites that detect plagiarism in homework assignments.
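One style-insensitive approach, sketched below under my own assumptions: strip the markup down to bare words, form overlapping word "shingles", and compare the shingle sets with Jaccard similarity. The shingle size and the idea of a similarity threshold are arbitrary choices here, not a tested recipe.

```python
# Style-insensitive duplicate detection via word shingles.
import re

def text_words(page):
    """Strip markup tags, lowercase, and keep only runs of letters."""
    stripped = re.sub(r"<[^>]+>", " ", page.lower())
    return re.findall(r"[a-z']+", stripped)

def shingles(page, k=3):
    """Set of overlapping k-word shingles from the visible text."""
    words = text_words(page)
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a, b, k=3):
    """Jaccard similarity of the two pages' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

original = "Our secret sauce is careful crawling and honest indexing."
restyled = "<b>Our</b> secret sauce is careful crawling and honest indexing!"
print(similarity(original, restyled))  # -> 1.0, despite the different markup
```

Because the comparison happens after the markup is discarded, a copied page keeps scoring high even if the thief re-skins it with a new template.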

Of course, if people are gathering up info from your site and not republishing it, but are finding a way to profit from the knowledge you shared, that seems fair. If they are spamming your customers, then maybe your site needs to keep customer contact info a bit closer to the vest. I believe there was an old business practice of salting confidential customer lists with fake entries and phone numbers to trap any outsider trying to mine a stolen copy of the list. Similarly, I've heard tell of mapmakers putting deliberate, harmless errors in their maps so a copy could be recognized. Arguably, Google took much the same approach to prove that Microsoft's Bing was copying Google search results. Microsoft, of course, strongly denied the allegation, but the matter did get the attention of even the Comedy Central cable TV network: http://mashable.com/2011/02/04/colbert-bing-google/.

Meanwhile, I think I see an escalating battle coming between sites and crawlers, and I expect I'll hate the result: repeated interactions like captchas to see if I'm human, or JavaScript tests complicated enough to verify the query came from a real web browser, all slowing down the time to get to the web content of interest and just raising the bar on how browser-like a web crawler's behavior needs to be.

To be fair, I freely admit that I have not tried the services of distil.it, so I have no real idea how transparent their protections are, or any clue as to how easy it is for a sneaky web crawler to work around the restrictions. If you have any actual experience, good or bad, with such protections, please add a comment to this post to get the rest of us up to speed.