Web Crawling and Automatic Discovery
Donna Bergmark
Cornell Information Systems
bergmark@cs.cornell.edu
CS502 Web Information Systems
Web Resource Discovery
• Finding info on the Web
  • Surfing (random strategy; goal is serendipity)
  • Searching (inverted indices; specific info)
  • Crawling (follow links; “all” the info)
• Uses for crawling
  • Find stuff
  • Gather stuff
  • Check stuff
Definition
Spider = robot = crawler
Crawlers are computer programs that roam the Web, automating specific Web-related tasks.
Crawlers and Internet History
• 1991: HTTP
• 1992: 26 servers
• 1993: 60+ servers; self-register; Archie
• 1994 (early): first crawlers
• 1996: search engines abound
• 1998: focused crawling
• 1999: web graph studies
• 2002: use for digital libraries
So, Why Not Write a Robot?
You’d think a crawler would be easy to write:
• Pick up the next URL
• Connect to the server
• GET the URL
• When the page arrives, get its links (optionally do other stuff)
• REPEAT
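A minimal sketch of that naive loop in Java. The seed URL is a placeholder, and the regex-based link extraction stands in for a real HTML parser; everything the rest of the lecture covers (politeness, robot exclusion, traps) is deliberately missing:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.*;
    import java.util.regex.*;

    public class NaiveCrawler {
        // crude href extraction; a real crawler uses an HTML parser
        private static final Pattern HREF =
            Pattern.compile("href=[\"']([^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

        public static void main(String[] args) throws Exception {
            Deque<String> frontier = new ArrayDeque<>(List.of("https://example.com/"));
            Set<String> seen = new HashSet<>(frontier);
            HttpClient client = HttpClient.newHttpClient();
            while (!frontier.isEmpty()) {
                String url = frontier.poll();               // pick up the next URL
                HttpResponse<String> resp = client.send(    // connect to the server; GET the URL
                    HttpRequest.newBuilder(URI.create(url)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
                Matcher m = HREF.matcher(resp.body());      // when the page arrives, get its links
                while (m.find()) {
                    String link = URI.create(url).resolve(m.group(1)).toString();
                    if (seen.add(link)) frontier.add(link); // REPEAT on unseen links
                }
            }
        }
    }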
The Central Crawler Function
[Diagram: the fetcher draws URLs from per-server queues (Server 1 queue, Server 2 queue, Server 3 queue)]
• URL -> IP address via DNS
• Connect a socket to the server; send the HTTP request
• Wait for the response: an HTML page
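A socket-level sketch of that fetch step, assuming plain HTTP on port 80. An industrial crawler like Mercator additionally caches DNS results per host and handles status codes, redirects, and timeouts:

    import java.io.*;
    import java.net.*;

    public class RawFetch {
        static String fetch(String host, String path) throws IOException {
            InetAddress ip = InetAddress.getByName(host);    // URL -> IP address via DNS
            try (Socket sock = new Socket(ip, 80)) {         // connect a socket to the server
                PrintWriter out = new PrintWriter(sock.getOutputStream());
                out.print("GET " + path + " HTTP/1.0\r\n"    // send HTTP request
                        + "Host: " + host + "\r\n\r\n");
                out.flush();
                BufferedReader in = new BufferedReader(      // wait for the response:
                    new InputStreamReader(sock.getInputStream()));
                StringBuilder page = new StringBuilder();    // an HTML page (headers included here)
                for (String line; (line = in.readLine()) != null; )
                    page.append(line).append('\n');
                return page.toString();
            }
        }
    }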
Handling the HTTP Response
[Flowchart: FETCH -> document seen before? -> if No: process this document — extract text, extract links]
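One common way to implement the “document seen before?” test is to fingerprint the page content and keep a set of fingerprints. A sketch (MD5 is an illustrative choice here, not necessarily what any particular crawler uses):

    import java.security.MessageDigest;
    import java.util.HashSet;
    import java.util.Set;

    public class ContentSeen {
        private final Set<String> fingerprints = new HashSet<>();

        // returns true if an identical page body was processed before
        boolean seenBefore(String pageBody) throws Exception {
            byte[] digest = MessageDigest.getInstance("MD5")
                                         .digest(pageBody.getBytes("UTF-8"));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return !fingerprints.add(hex.toString()); // add() is false if already present
        }
    }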
LINK Extraction
• Finding the links is easy (sequential scan)
• Need to clean them up and canonicalize them
• Need to filter them
• Need to check for robot exclusion
• Need to check for duplicates
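A sketch of the cleanup and canonicalization step: resolve relative links against the base URL, lower-case the scheme and host, drop fragments, and filter out non-HTTP schemes. Exact canonicalization rules vary from crawler to crawler:

    import java.net.URI;
    import java.util.Optional;

    public class LinkCleaner {
        static Optional<String> canonicalize(String baseUrl, String rawHref) {
            try {
                URI u = URI.create(baseUrl).resolve(rawHref.trim());
                String scheme = u.getScheme();
                if (!"http".equals(scheme) && !"https".equals(scheme))
                    return Optional.empty();          // filter mailto:, javascript:, ftp:, ...
                URI clean = new URI(scheme.toLowerCase(), null,
                        u.getHost().toLowerCase(), u.getPort(),
                        (u.getPath() == null || u.getPath().isEmpty()) ? "/" : u.getPath(),
                        u.getQuery(), null);          // null fragment: drop "#..."
                return Optional.of(clean.toString());
            } catch (Exception e) {
                return Optional.empty();              // malformed link: discard it
            }
        }
    }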
Update the Frontier
[Diagram: FETCH -> PROCESS -> FRONTIER; newly discovered URLs (URL1, URL2, URL3, ...) are appended to the frontier]
Crawler Issues
• System considerations
• The URL itself
• Politeness
• Visit order
• Robot traps
• The hidden web
Standard for Robot Exclusion
• Martijn Koster (1994)
• http://any-server:80/robots.txt
• Maintained by the webmaster
• Forbids access to pages, directories
• Commonly excluded: /cgi-bin/
• Adherence is voluntary for the crawler
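A hedged sketch of a robots.txt check that handles only the User-agent: * section and its Disallow lines; real parsers also honor per-agent sections, Allow rules, and Crawl-delay:

    import java.net.URI;
    import java.net.http.*;

    public class RobotsCheck {
        static boolean allowed(String host, String path) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            HttpResponse<String> resp = client.send(
                HttpRequest.newBuilder(URI.create("http://" + host + "/robots.txt")).build(),
                HttpResponse.BodyHandlers.ofString());
            if (resp.statusCode() != 200) return true;      // no robots.txt: no restrictions
            boolean forAll = false;
            for (String line : resp.body().split("\n")) {
                line = line.trim();
                if (line.toLowerCase().startsWith("user-agent:"))
                    forAll = line.substring(11).trim().equals("*");
                else if (forAll && line.toLowerCase().startsWith("disallow:")) {
                    String prefix = line.substring(9).trim();
                    if (!prefix.isEmpty() && path.startsWith(prefix)) return false;
                }
            }
            return true;
        }
    }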
Visit Order
• The frontier
• Breadth-first: FIFO queue
• Depth-first: LIFO queue
• Best-first: priority queue
• Random
• Refresh rate
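The visit order falls directly out of the data structure backing the frontier: the same crawl loop becomes breadth-first, depth-first, or best-first depending only on the queue discipline. A sketch (the Scored record and its scores are illustrative):

    import java.util.*;

    public class Frontiers {
        record Scored(String url, double score) {}  // score: estimated page value

        public static void main(String[] args) {
            Queue<String> breadthFirst = new ArrayDeque<>();  // FIFO: add at tail, poll from head
            Deque<String> depthFirst   = new ArrayDeque<>();  // LIFO: push and pop the head
            PriorityQueue<Scored> bestFirst = new PriorityQueue<>(
                Comparator.comparingDouble((Scored s) -> -s.score)); // highest score first

            breadthFirst.add("https://example.com/a");
            depthFirst.push("https://example.com/b");
            bestFirst.add(new Scored("https://example.com/c", 0.87));
        }
    }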
Robot Traps
• Cycles in the Web graph
• Infinite links on a page
• Traps set out by the Webmaster
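Traps cannot be detected in general, so crawlers bound the damage with heuristics. A sketch with illustrative limits on crawl depth, URL length, and pages fetched per host (the thresholds are mine, not from any particular crawler):

    import java.util.HashMap;
    import java.util.Map;

    public class TrapGuard {
        private static final int MAX_DEPTH = 20, MAX_URL_LEN = 256, MAX_PER_HOST = 5000;
        private final Map<String, Integer> perHost = new HashMap<>();

        // admit a URL into the frontier only if it stays within all limits
        boolean admit(String url, String host, int depth) {
            if (depth > MAX_DEPTH || url.length() > MAX_URL_LEN) return false;
            return perHost.merge(host, 1, Integer::sum) <= MAX_PER_HOST;
        }
    }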
The Hidden Web
• Dynamic pages increasing
• Subscription pages
• Username and password pages
• Research in progress on how crawlers can “get into” the hidden web
MERCATOR
Mercator Features
• One file configures a crawl
• Written in Java
• Can add your own code
  • Extend one or more of Mercator's base classes
  • Add totally new classes called by your own
• Industrial-strength crawler: uses its own DNS and java.net package
The Web is a BIG Graph
• “Diameter” of the Web
• Even the static part cannot be crawled completely
• New technology: the focused crawl
Crawling and Crawlers
• The Web overlays the internet
• A crawl overlays the Web
[Diagram: a crawl spreading outward over the Web graph from a seed]
Focused Crawling
Focused Crawling
[Side-by-side cartoons: a breadth-first crawl visits pages 1-7 in level order from root R; a focused crawl visits pages 1-5, pruning off-topic links (marked X)]
Focused Crawling
• Recall the cartoon for a focused crawl
• A simple way to do it is with 2 “knobs”
Focusing the Crawl
• Threshold: a page is on-topic if its correlation to the closest centroid is above this value
• Cutoff: follow links from pages whose “distance” from the closest on-topic ancestor is less than this value
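A sketch of the two knobs (constant names, values, and the distance convention are mine, for illustration): correlation is cosine similarity between the page's term vector and the nearest centroid, and distance counts links traversed since the last on-topic ancestor, mirroring the definitions above:

    public class FocusKnobs {
        static final double THRESHOLD = 0.3;  // illustrative; tuned per crawl
        static final int CUTOFF = 1;

        // cosine correlation between a page's term vector and a centroid
        static double correlation(double[] page, double[] centroid) {
            double dot = 0, np = 0, nc = 0;
            for (int i = 0; i < page.length; i++) {
                dot += page[i] * centroid[i];
                np  += page[i] * page[i];
                nc  += centroid[i] * centroid[i];
            }
            return dot / (Math.sqrt(np * nc) + 1e-12);
        }

        // 0 if the page itself is on-topic, else one more than its parent
        static int distance(double corr, int parentDistance) {
            return corr >= THRESHOLD ? 0 : parentDistance + 1;
        }

        // follow links only while distance from an on-topic ancestor < CUTOFF
        static boolean followLinks(int dist) { return dist < CUTOFF; }
    }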
Illustration
[Crawl-tree diagram with Corr >= threshold and Cutoff = 1: links from on-topic pages are followed; pages past the cutoff distance (marked X) are not expanded]
Correlation vs. Crawl Length
[Charts: correlation plotted against crawl length for the “closest” and “furthest” variants]
Fall 2002 Student Project
[Architecture diagram: a query produces a centroid; Mercator crawls HTML pages into term vectors; together with the centroids, dictionary, and Chebyshev polynomials, the system produces collection URLs and a collection description]
Conclusion
• We covered crawling: history, technology, deployment
• Focused crawling with tunneling
• We have a good experimental setup for exploring automatic collection synthesis
http://mercator.comm.nsdlib.org