Web Crawling and Automatic Discovery
Donna Bergmark
Cornell Information Systems
bergmark@cs.cornell.edu
CS502 Web Information Systems
Web Resource Discovery
• Finding info on the Web
  • Surfing (random strategy; goal is serendipity)
  • Searching (inverted indices; specific info)
  • Crawling (follow links; “all” the info)
• Uses for crawling
  • Find stuff
  • Gather stuff
  • Check stuff
Definition
Spider = robot = crawler
Crawlers are computer programs that roam the Web, automating specific Web-related tasks.
Crawlers and Internet History
• 1991: HTTP
• 1992: 26 servers
• 1993: 60+ servers; self-register; Archie
• 1994 (early): first crawlers
• 1996: search engines abound
• 1998: focused crawling
• 1999: web graph studies
• 2002: use for digital libraries
So, Why Not Write a Robot?
You’d think a crawler would be easy to write:
• Pick up the next URL
• Connect to the server
• GET the URL
• When the page arrives, get its links (optionally do other stuff)
• REPEAT
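A minimal sketch of that naive loop in Java. The seed URL is a placeholder, and the regex-based link extraction stands in for a real HTML parser; everything the rest of the lecture covers (politeness, robot exclusion, traps) is deliberately missing:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.*;
    import java.util.regex.*;

    public class NaiveCrawler {
        // crude href extraction; a real crawler uses an HTML parser
        private static final Pattern HREF =
            Pattern.compile("href=[\"']([^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

        public static void main(String[] args) throws Exception {
            Deque<String> frontier = new ArrayDeque<>(List.of("https://example.com/"));
            Set<String> seen = new HashSet<>(frontier);
            HttpClient client = HttpClient.newHttpClient();
            while (!frontier.isEmpty()) {
                String url = frontier.poll();               // pick up the next URL
                HttpResponse<String> resp = client.send(    // connect to the server; GET the URL
                    HttpRequest.newBuilder(URI.create(url)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
                Matcher m = HREF.matcher(resp.body());      // when the page arrives, get its links
                while (m.find()) {
                    String link = URI.create(url).resolve(m.group(1)).toString();
                    if (seen.add(link)) frontier.add(link); // REPEAT on unseen links
                }
            }
        }
    }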
The Central Crawler Function
[Diagram: the fetcher draws URLs from per-server queues (Server 1 queue, Server 2 queue, Server 3 queue)]
• URL -> IP address via DNS
• Connect a socket to the server; send the HTTP request
• Wait for the response: an HTML page
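A socket-level sketch of that fetch step, assuming plain HTTP on port 80. An industrial crawler like Mercator additionally caches DNS results per host and handles status codes, redirects, and timeouts:

    import java.io.*;
    import java.net.*;

    public class RawFetch {
        static String fetch(String host, String path) throws IOException {
            InetAddress ip = InetAddress.getByName(host);    // URL -> IP address via DNS
            try (Socket sock = new Socket(ip, 80)) {         // connect a socket to the server
                PrintWriter out = new PrintWriter(sock.getOutputStream());
                out.print("GET " + path + " HTTP/1.0\r\n"    // send HTTP request
                        + "Host: " + host + "\r\n\r\n");
                out.flush();
                BufferedReader in = new BufferedReader(      // wait for the response:
                    new InputStreamReader(sock.getInputStream()));
                StringBuilder page = new StringBuilder();    // an HTML page (headers included here)
                for (String line; (line = in.readLine()) != null; )
                    page.append(line).append('\n');
                return page.toString();
            }
        }
    }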
Handling the HTTP Response
[Flowchart: FETCH -> document seen before? -> if No: process this document — extract text, extract links]
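One common way to implement the “document seen before?” test is to fingerprint the page content and keep a set of fingerprints. A sketch (MD5 is an illustrative choice here, not necessarily what any particular crawler uses):

    import java.security.MessageDigest;
    import java.util.HashSet;
    import java.util.Set;

    public class ContentSeen {
        private final Set<String> fingerprints = new HashSet<>();

        // returns true if an identical page body was processed before
        boolean seenBefore(String pageBody) throws Exception {
            byte[] digest = MessageDigest.getInstance("MD5")
                                         .digest(pageBody.getBytes("UTF-8"));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return !fingerprints.add(hex.toString()); // add() is false if already present
        }
    }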
LINK Extraction
• Finding the links is easy (sequential scan)
• Need to clean them up and canonicalize them
• Need to filter them
• Need to check for robot exclusion
• Need to check for duplicates
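A sketch of the cleanup and canonicalization step: resolve relative links against the base URL, lower-case the scheme and host, drop fragments, and filter out non-HTTP schemes. Exact canonicalization rules vary from crawler to crawler:

    import java.net.URI;
    import java.util.Optional;

    public class LinkCleaner {
        static Optional<String> canonicalize(String baseUrl, String rawHref) {
            try {
                URI u = URI.create(baseUrl).resolve(rawHref.trim());
                String scheme = u.getScheme();
                if (!"http".equals(scheme) && !"https".equals(scheme))
                    return Optional.empty();          // filter mailto:, javascript:, ftp:, ...
                URI clean = new URI(scheme.toLowerCase(), null,
                        u.getHost().toLowerCase(), u.getPort(),
                        (u.getPath() == null || u.getPath().isEmpty()) ? "/" : u.getPath(),
                        u.getQuery(), null);          // null fragment: drop "#..."
                return Optional.of(clean.toString());
            } catch (Exception e) {
                return Optional.empty();              // malformed link: discard it
            }
        }
    }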
Update the Frontier
[Diagram: FETCH -> PROCESS -> FRONTIER; newly discovered URLs (URL1, URL2, URL3, ...) are appended to the frontier]
Crawler Issues
• System considerations
• The URL itself
• Politeness
• Visit order
• Robot traps
• The hidden web
Standard for Robot Exclusion
• Martijn Koster (1994)
• http://any-server:80/robots.txt
• Maintained by the webmaster
• Forbids access to pages, directories
• Commonly excluded: /cgi-bin/
• Adherence is voluntary for the crawler
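A hedged sketch of a robots.txt check that handles only the User-agent: * section and its Disallow lines; real parsers also honor per-agent sections, Allow rules, and Crawl-delay:

    import java.net.URI;
    import java.net.http.*;

    public class RobotsCheck {
        static boolean allowed(String host, String path) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            HttpResponse<String> resp = client.send(
                HttpRequest.newBuilder(URI.create("http://" + host + "/robots.txt")).build(),
                HttpResponse.BodyHandlers.ofString());
            if (resp.statusCode() != 200) return true;      // no robots.txt: no restrictions
            boolean forAll = false;
            for (String line : resp.body().split("\n")) {
                line = line.trim();
                if (line.toLowerCase().startsWith("user-agent:"))
                    forAll = line.substring(11).trim().equals("*");
                else if (forAll && line.toLowerCase().startsWith("disallow:")) {
                    String prefix = line.substring(9).trim();
                    if (!prefix.isEmpty() && path.startsWith(prefix)) return false;
                }
            }
            return true;
        }
    }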
Visit Order
• The frontier
• Breadth-first: FIFO queue
• Depth-first: LIFO queue
• Best-first: priority queue
• Random
• Refresh rate
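The visit order falls directly out of the data structure backing the frontier: the same crawl loop becomes breadth-first, depth-first, or best-first depending only on the queue discipline. A sketch (the Scored record and its scores are illustrative):

    import java.util.*;

    public class Frontiers {
        record Scored(String url, double score) {}  // score: estimated page value

        public static void main(String[] args) {
            Queue<String> breadthFirst = new ArrayDeque<>();  // FIFO: add at tail, poll from head
            Deque<String> depthFirst   = new ArrayDeque<>();  // LIFO: push and pop the head
            PriorityQueue<Scored> bestFirst = new PriorityQueue<>(
                Comparator.comparingDouble((Scored s) -> -s.score)); // highest score first

            breadthFirst.add("https://example.com/a");
            depthFirst.push("https://example.com/b");
            bestFirst.add(new Scored("https://example.com/c", 0.87));
        }
    }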
Robot Traps
• Cycles in the Web graph
• Infinite links on a page
• Traps set out by the Webmaster
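Traps cannot be detected in general, so crawlers bound the damage with heuristics. A sketch with illustrative limits on crawl depth, URL length, and pages fetched per host (the thresholds are mine, not from any particular crawler):

    import java.util.HashMap;
    import java.util.Map;

    public class TrapGuard {
        private static final int MAX_DEPTH = 20, MAX_URL_LEN = 256, MAX_PER_HOST = 5000;
        private final Map<String, Integer> perHost = new HashMap<>();

        // admit a URL into the frontier only if it stays within all limits
        boolean admit(String url, String host, int depth) {
            if (depth > MAX_DEPTH || url.length() > MAX_URL_LEN) return false;
            return perHost.merge(host, 1, Integer::sum) <= MAX_PER_HOST;
        }
    }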
The Hidden Web
• Dynamic pages increasing
• Subscription pages
• Username and password pages
• Research in progress on how crawlers can “get into” the hidden web
MERCATOR
Mercator Features
• One file configures a crawl
• Written in Java
• Can add your own code
  • Extend one or more of Mercator's base classes
  • Add totally new classes called by your own
• Industrial-strength crawler: uses its own DNS and java.net package
The Web is a BIG Graph
• “Diameter” of the Web
• Even the static part cannot be crawled completely
• New technology: the focused crawl
Crawling and Crawlers
• The Web overlays the internet
• A crawl overlays the Web
[Diagram: a crawl spreading outward over the Web graph from a seed]
Focused Crawling
Focused Crawling
[Side-by-side cartoons: a breadth-first crawl visits pages 1-7 in level order from root R; a focused crawl visits pages 1-5, pruning off-topic links (marked X)]
Focused Crawling
• Recall the cartoon for a focused crawl
• A simple way to do it is with 2 “knobs”
Focusing the Crawl
• Threshold: a page is on-topic if its correlation to the closest centroid is above this value
• Cutoff: follow links from pages whose “distance” from the closest on-topic ancestor is less than this value
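A sketch of the two knobs (constant names, values, and the distance convention are mine, for illustration): correlation is cosine similarity between the page's term vector and the nearest centroid, and distance counts links traversed since the last on-topic ancestor, mirroring the definitions above:

    public class FocusKnobs {
        static final double THRESHOLD = 0.3;  // illustrative; tuned per crawl
        static final int CUTOFF = 1;

        // cosine correlation between a page's term vector and a centroid
        static double correlation(double[] page, double[] centroid) {
            double dot = 0, np = 0, nc = 0;
            for (int i = 0; i < page.length; i++) {
                dot += page[i] * centroid[i];
                np  += page[i] * page[i];
                nc  += centroid[i] * centroid[i];
            }
            return dot / (Math.sqrt(np * nc) + 1e-12);
        }

        // 0 if the page itself is on-topic, else one more than its parent
        static int distance(double corr, int parentDistance) {
            return corr >= THRESHOLD ? 0 : parentDistance + 1;
        }

        // follow links only while distance from an on-topic ancestor < CUTOFF
        static boolean followLinks(int dist) { return dist < CUTOFF; }
    }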
Illustration
[Crawl-tree diagram with Corr >= threshold and Cutoff = 1: links from on-topic pages are followed; pages past the cutoff distance (marked X) are not expanded]
Correlation vs. Crawl Length
[Charts: correlation plotted against crawl length for the “closest” and “furthest” variants]
Fall 2002 Student Project
[Architecture diagram: a query produces a centroid; Mercator crawls HTML pages into term vectors; together with the centroids, dictionary, and Chebyshev polynomials, the system produces collection URLs and a collection description]
Conclusion
• We covered crawling: history, technology, deployment
• Focused crawling with tunneling
• We have a good experimental setup for exploring automatic collection synthesis
http://mercator.comm.nsdlib.org