Resource discovery: Crawling on the web
With millions of servers and billions of web pages, the problem of finding a document without already knowing exactly where it resides is like the proverbial search for a needle in a haystack.
When the Internet was just getting started and the number of sites and documents was relatively small, it was arguably an easier task to publish lists of accessible sites, and at those sites to provide lists of the available files. The information was relatively shallow, based on file names and perhaps brief descriptions, but a knowledgeable user could ferret out information with a little effort.
However, the rapid growth of networked resources soon swamped these early cataloguing efforts, and the job became too big for any one individual. The emergence of the World Wide Web created a demand for a more comprehensive form of cataloguing: an on-line version of the library card catalogue.
When the Yahoo general index site first appeared, many users were confused: what was the purpose of the site? It didn’t provide any content of its own, but rather consisted almost entirely of links to other sites. However, the utility of such an index quickly became apparent.
But the size of the web quickly exceeded the capacity of humans to quantify it and catalogue its contents. Recognition of this fact sparked research into means by which the web could be searched automatically by computers, using robot-like programs specifically designed to explore the far regions of this ethereal world and report back their findings.
The early automated efforts to explore the web were described as “robots”, akin to those used to explore the solar system and outer space. However, such programs were soon rechristened in a more web-like manner, and became known as “spiders” and “web crawlers”.
Recall the observation made previously that if a file has any significance then someone will know of it and will include a link to that document from a web page, which in turn is linked to from other pages. Under this assumption, any and every important file can eventually be tracked down by following the links.
The problem with this idea is that it involves a fair amount of redundancy, as some sites receive thousands of links from other sites, and the crawling process will take considerable time and generate massive amounts of data.
The program begins with a file, or a small group of files, to initiate the crawling process. The first file is opened and every word is examined to see whether it fits the profile of a file name. If the word (or, more accurately, the string of characters) fits the profile and is thus recognized as a candidate file name (or, later, a URL), it is added to the list of further files to examine.
When the contents of the current file are exhausted, the program proceeds to the next file in the list and continues the candidate file name detection process, and in so doing "crawls" through the accessible files. If a candidate file does not exist, the candidate string is discarded and a new candidate is taken from the front of the list.
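The loop below is a minimal sketch of this bare crawling process (an illustration only, not the original example). It assumes a hypothetical helper, looks_like_url(), standing in for the "profile" test that recognizes candidate URLs.

```python
# A minimal sketch of the crawl loop described above (illustration only).
# looks_like_url() is a hypothetical stand-in for the "profile" test.
from collections import deque
from urllib.request import urlopen
from urllib.error import URLError

def looks_like_url(token):
    # Crude profile check: treat any http(s) token as a candidate URL.
    return token.startswith("http://") or token.startswith("https://")

def crawl(seed_urls):
    to_visit = deque(seed_urls)            # the list of files still to examine
    while to_visit:
        candidate = to_visit.popleft()     # take the next candidate from the front
        try:
            text = urlopen(candidate).read().decode("utf-8", errors="ignore")
        except (URLError, ValueError):
            continue                       # candidate does not exist: discard it
        for token in text.split():         # examine every word in the file
            if looks_like_url(token):
                to_visit.append(token)     # add the new candidate to the list
```

Nothing here prevents the same URL from being added to the list more than once; the hash-table check described below addresses that.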
It is interesting to note that this approach does not guarantee that every file will be examined; only those that are mentioned in other files will find their way onto the list of files to be examined. Indeed, it is possible that the crawler will find no candidates in the original file, in which case the crawling ceases after examining only one file.
The crawler as described in the example above doesn’t actually do anything but crawl through the files; there is no indexing that might facilitate a subsequent search inquiry. Crawling and indexing features are integrated in the next example, thus producing a small-scale example of a search engine.
As each new word is extracted from a file, it is first examined to determine whether it is a candidate to be added to the list of “places to visit”. If so, then a hash table is checked to see if the candidate has been encountered before. If it has not been seen previously, then it is added to the list of “places”.
Words are also added to a tree of indexed words … the index could be a term-document matrix, or an inverted index associating each word with the names of the files in which it appears.
Is it necessary to index all words? For the purposes of a search program, the answer is no. There are many words, and entire parts of speech, that add no value to a search inquiry because they are too common to provide any qualitative distinction. Thus, in our example, a word should first be checked against an "exclusion list" and indexed only if it is not excluded.
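Putting these pieces together, here is a minimal sketch of the combined crawler/indexer (again an illustration, not the original example). The fetch() helper, the tiny exclusion list, and the URL profile test are all simplifying assumptions.

```python
# Sketch of the combined crawler/indexer described above (illustration only).
# fetch(url) is assumed to return the document text, or None if the fetch fails.
from collections import deque

EXCLUDED = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}  # toy exclusion list

def crawl_and_index(seed_urls, fetch):
    to_visit = deque(seed_urls)
    seen = set()                     # hash table of candidates already encountered
    index = {}                       # inverted index: word -> set of file names
    while to_visit:
        url = to_visit.popleft()
        text = fetch(url)
        if text is None:
            continue                 # candidate does not exist; discard it
        for token in text.split():
            if token.startswith("http://") or token.startswith("https://"):
                if token not in seen:        # only add "places" not seen before
                    seen.add(token)
                    to_visit.append(token)
            else:
                word = token.lower().strip(".,;:!?\"'()")
                if word and word not in EXCLUDED:
                    index.setdefault(word, set()).add(url)
    return index
```

A single-term search is then just a dictionary lookup, e.g. index.get("needle", set()), which returns the names of the files in which the word occurs.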
Crawlers and spider programs can wreak havoc as they move through a web site, analyzing and indexing the information found there. The traffic generated by the crawler can potentially disrupt the server, creating the indexing version of a “denial of service” attack on the site. The situation gets much worse if multiple crawlers are visiting simultaneously, or if the crawlers visit the site on a routine basis to maintain “fresh” information.
There is also the problem of crawlers visiting sites that would prefer not to be indexed, perhaps because the information found there is of purely local interest, or is private, or because the contents are volatile enough that indexing them would be neither reasonable nor useful.
The problems associated with inappropriate crawler "behavior" led to the articulation of an informal community standard dubbed "A Standard for Robot Exclusion" [1]. The solution strategy is quite simple: each server maintains a special file called "robots.txt" ("robot" being synonymous with "crawler" in this respect) that tells visiting crawlers which parts of the site they should not retrieve. [1] The robot exclusion protocol can be found at http://www.robotstxt.org/wc/norobots.html.
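As a brief illustration, the sketch below shows how a well-behaved crawler might consult a server's robots.txt before fetching a page, using Python's standard urllib.robotparser module; the site address and user-agent name are placeholders.

```python
# Sketch: checking robots.txt before fetching a page
# (the site and user-agent names are placeholders).
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")    # the server's exclusion file
rp.read()                                          # fetch and parse the rules

# A robots.txt file might contain, for example:
#   User-agent: *
#   Disallow: /private/
url = "http://www.example.com/private/notes.html"
if rp.can_fetch("ExampleCrawler", url):
    print("permitted: fetch and index", url)
else:
    print("excluded by robots.txt: skip", url)
```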
A compendium of the latest news and developments pertaining to web search engines can be found at the Search Engine Watch site: http://searchenginewatch.com.
A consideration of the problem of prioritized crawling can be found in the paper “Efficient Crawling through URL Ordering”, by Cho, Garcia-Molina, and Page, at http://www7.scu.edu.au/programme/fullpapers/1919/com1919.htm.
A thorough (but slightly technical) introduction to the issues and challenges of web searching can be found in the paper “Searching the Web” by Arasu, Cho, Garcia-Molina, Paepcke, and Raghavan: http://dbpubs.stanford.edu:8090/pub/2000-37.
The paper that outlined the original strategy for what became the Google search engine is “The Anatomy of a Large-Scale Hypertextual Web Search Engine” by Brin and Page: http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm.