Focused Crawling and Collection Synthesis
Donna Bergmark
Cornell Information Systems
CUL Metadata WG Meeting

Outline
• Crawlers
• Collection Synthesis
• Focused Crawling
• Some Results
• Student Project (Fall 2002)

Definition
Spider = robot = crawler
Crawlers are computer programs that roam the Web with the goal of automating specific tasks related to the Web.

Crawlers – some background
• Resource discovery
• Crawlers and internet history
• Crawling and crawlers
• Mercator

Resource Discovery
• Finding info on the Web
  • Surfing (random strategy, goal is serendipity)
  • Searching (inverted indices; specific info)
  • Crawling (“all” the info)
• Uses for crawling
  • Find stuff
  • Gather stuff
  • Check stuff

Crawlers and internet history
• 1991: HTTP
• 1992: 26 servers
• 1993: 60+ servers; self-register; Archie
• 1994 (early): first crawlers
• 1996: search engines abound
• 1998: focused crawling
• 1999: web graph studies
• 2002: use for digital libraries

Crawling and Crawlers
• Web overlays the internet
• A crawl overlays the web
[Figure label: seed]

Crawler Issues
• The web is so big
• Visit Order
• The URL itself
• Politeness
• Robot Traps
• The hidden web
• System Considerations

Standard for Robot Exclusion
• Martijn Koster (1994)
• http://any-server:80/robots.txt
• Maintained by the webmaster
• Forbids access to pages, directories
• Commonly excluded: /cgi-bin/
• Adherence is voluntary for the crawler

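A minimal sketch of how a polite crawler might consult robots.txt before fetching, using Python's standard `urllib.robotparser`. The robots.txt content, the crawler name `ExampleCrawler`, and the URLs are hypothetical; a real crawler would fetch the file from the server rather than parse a literal string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as a webmaster might publish it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set that can be queried per URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# "ExampleCrawler" is an assumed user-agent name, used only for illustration.
allowed_home = parser.can_fetch("ExampleCrawler", "http://any-server/index.html")
allowed_cgi = parser.can_fetch("ExampleCrawler", "http://any-server/cgi-bin/search")
print(allowed_home, allowed_cgi)
```

Because adherence is voluntary, nothing enforces the check; the crawler has to make the `can_fetch` call itself before each download.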
Robot Traps
• Cycles in the Web graph
• Infinite links on a page
• Traps set out by the Webmaster

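One common guard against the first two kinds of trap can be sketched as follows. This is an illustration under assumptions, not how any particular crawler handles traps: the link graph `LINKS` is a hypothetical in-memory stand-in for the Web, a visited set breaks cycles, and a depth cap (an assumed defence, not from the slides) bounds endlessly generated link chains.

```python
from collections import deque

# Hypothetical link graph containing a cycle: a -> b -> c -> a.
LINKS = {
    "a": ["b"],
    "b": ["c"],
    "c": ["a", "d"],
    "d": [],
}

def crawl(seed, links, max_depth=10):
    """Breadth-first crawl that visits each page once and stops expanding at max_depth."""
    seen = {seed}                 # pages already queued; breaks cycles
    order = []                    # pages in the order they were visited
    queue = deque([(seed, 0)])
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if depth == max_depth:    # depth cap: do not follow links any further
            continue
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order

visited = crawl("a", LINKS)
shallow = crawl("a", LINKS, max_depth=1)
print(visited, shallow)
```

Neither guard helps against traps deliberately set by a webmaster; those usually call for per-site limits or manual exclusion lists.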
The Hidden Web
• Dynamic pages increasing
• Subscription pages
• Username and password pages
• Research in progress on how crawlers can “get into” the hidden web

System Issues
• Crawlers are complicated systems
• Efficiency is of utmost importance
• Crawlers are demanding of system and network resources

Mercator Features
• Written in Java
• One file configures a crawl
• Can add your own code
  • Extend one or more of Mercator’s base classes
  • Add totally new classes called by your own
• Industrial-strength crawler:
  • uses its own DNS and java.net package

Collection Synthesis
• The NSDL
  • National Science Digital Library
  • Educational materials for K-thru-grave
  • A collection of digital collections
• Collection (automatically derived)
  • 20-50 items on a topic, represented by their URLs, expository in nature, precision trumps recall

Crawler is the Key
• A general search engine is good for precise results, few in number
• A search engine must cover all topics, not just scientific
• For automatic collection assembly, a Web crawler is needed
• A focused crawler is the key

Focused Crawling

Focused Crawling
[Figure: two crawl trees rooted at R. Focused crawl: nodes 1–5, with two nodes marked X. Breadth-first crawl: nodes 1–7.]

Collections and Clusters
• Traditional – document universe is divided into clusters, or collections
• Each collection represented by its centroid
• Web – size of document universe is infinite
• Agglomerative clustering is used instead
• Two aspects:
  • Collection descriptor
  • Rule for when items belong to that Collection

[Figure: two panels labeled Q = 0.2 and Q = 0.6]

The Setup
[Figure: a virtual collection of items about Chebyshev Polynomials]

Adding a Centroid
[Figure: an empty collection of items about Chebyshev Polynomials]

Document Vector Space
• Classic information retrieval technique
• Each word is a dimension in N-space
• Each document is a vector in N-space
Example: <0, 0.003, 0, 0, 0.01, 0.984, 0, 0.001>
• Normalize the weights
Both the “centroid” and the downloaded document are term vectors

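The vector-space idea can be sketched in a few lines. This is an illustration under assumptions: weights are raw term counts scaled to unit length (the slide says only "normalize the weights"), tokenization is a plain whitespace split, and the two sample texts are made up.

```python
import math
from collections import Counter

def term_vector(text):
    """Unit-length term-frequency vector, stored sparsely as {word: weight}."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()} if norm else {}

def correlation(u, v):
    """Cosine correlation of two unit-length vectors, i.e. their dot product."""
    return sum(weight * v.get(word, 0.0) for word, weight in u.items())

# Hypothetical centroid text and downloaded document.
centroid = term_vector("chebyshev polynomials orthogonal polynomials")
doc = term_vector("chebyshev polynomials are orthogonal")
score = correlation(centroid, doc)
print(round(score, 4))
```

Storing only the non-zero dimensions keeps each vector small even though N, the vocabulary size, is large.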
Agglomerate
[Figure: a collection with 3 items about Chebyshev Polynomials]

Where does the Centroid come from?
[Figure labels: “?”; “Chebyshev Polynomials”; “A really good centroid for a collection about C.P.’s”]

Building a Centroid
1. Google(“Chebyshev Polynomials”) → {url1 … urln}
2. Let H be a hash (k, v) where k = word, v = frequency
3. For each url in {url1 … urln} do
     D ← download(url)
     V ← term vector(D)
     For each term t in V do
       If t not in H, add it with value 0
       H(t)++
4. Compute tf-idf weights. C ← top 20 terms.

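The steps above can be sketched as runnable code under some loud assumptions. The search and download steps are replaced by a list of already-fetched texts; the weighting `tf * log(N / df)` is one common tf-idf form, not necessarily the one used in the project; `doc_freq` and `corpus_size` are hypothetical background statistics; terms missing from `doc_freq` are assumed to have a document frequency of 1.

```python
import math
from collections import Counter

def build_centroid(documents, doc_freq, corpus_size, top_k=20):
    """Return the top_k (term, weight) pairs by tf-idf over the given documents."""
    freq = Counter()                       # H: word -> total frequency
    for text in documents:                 # stands in for download(url)
        freq.update(text.lower().split())  # add this document's term counts to H
    weights = {
        term: tf * math.log(corpus_size / doc_freq.get(term, 1))
        for term, tf in freq.items()
    }
    # Highest weight first; ties broken alphabetically for a stable result.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]

# Hypothetical inputs: two fetched pages and made-up document frequencies.
pages = [
    "chebyshev polynomials of the first kind",
    "the chebyshev polynomials are orthogonal",
]
background = {"the": 900, "of": 800, "are": 700}
top_terms = [t for t, _ in build_centroid(pages, background, 1000, top_k=2)]
print(top_terms)
```

The idf factor is what pushes common words such as "the" out of the top terms even when their raw frequency is high.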
Dictionary
• Given centroids C1, C2, C3 …
• Dictionary is C1 + C2 + C3 …
• Terms are the union of the terms in the Ci
• Term frequency is the total frequency across the Ci
• Document frequency of a term t is the number of Ci that contain t
• Term IDF is as from Berkeley
• Dictionary is 300-500 terms

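A minimal sketch of merging centroids into a dictionary, assuming each centroid is available as a `{term: frequency}` mapping. The two sample centroids are made up, and the IDF step is omitted because the slide does not spell out the formula.

```python
from collections import Counter

def build_dictionary(centroids):
    """Merge centroids ({term: frequency}) into term and document frequencies."""
    term_freq = Counter()   # total frequency of each term across all centroids
    doc_freq = Counter()    # number of centroids that contain each term
    for centroid in centroids:
        term_freq.update(centroid)         # adds the per-centroid frequencies
        doc_freq.update(centroid.keys())   # counts each term once per centroid
    return term_freq, doc_freq

# Hypothetical centroids for two topics.
c1 = {"chebyshev": 5, "polynomial": 3}
c2 = {"polynomial": 4, "fourier": 2}
tf, df = build_dictionary([c1, c2])
print(sorted(tf.items()), sorted(df.items()))
```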
Focused Crawling
• Recall the cartoon for a focused crawl
• A simple way to do it is with 2 “knobs”

Focusing the Crawl
• Threshold: page is on-topic if correlation to the closest centroid is above this value
• Cutoff: follow links from pages whose “distance” from closest on-topic ancestor is less than the cutoff

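The two knobs can be sketched on a toy graph. Everything here is an assumption made for illustration: `links` and `corr` are hypothetical in-memory stand-ins for the Web and for the correlation to the closest centroid, an on-topic page is given distance 0, an off-topic page gets its parent's distance plus 1, and the comparisons (`>=` for the threshold, strictly less than for the cutoff) are one literal reading of the wording; an actual crawler may draw the boundaries differently.

```python
from collections import deque

def focused_crawl(seed, links, corr, threshold, cutoff):
    """Return the on-topic pages reached from seed under the two knobs."""
    on_topic = []
    seen = {seed}
    queue = deque([(seed, 0)])   # (page, its parent's distance)
    while queue:
        page, parent_distance = queue.popleft()
        if corr[page] >= threshold:
            on_topic.append(page)
            distance = 0                      # on-topic page resets the distance
        else:
            distance = parent_distance + 1    # one more step past the last on-topic page
        if distance < cutoff:                 # only then follow this page's links
            for target in links.get(page, []):
                if target not in seen:
                    seen.add(target)
                    queue.append((target, distance))
    return on_topic

# Hypothetical graph: s -> a -> b -> c and s -> d, with made-up correlations.
links = {"s": ["a", "d"], "a": ["b"], "b": ["c"]}
corr = {"s": 0.9, "a": 0.1, "b": 0.1, "c": 0.8, "d": 0.5}
tight = focused_crawl("s", links, corr, threshold=0.3, cutoff=1)
loose = focused_crawl("s", links, corr, threshold=0.3, cutoff=3)
print(tight, loose)
```

With the larger cutoff the crawl passes through the two off-topic pages `a` and `b` and finds `c`, which the tighter setting never reaches.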
Illustration
[Figure: crawl tree with numbered nodes, two marked X; labels: Corr >= threshold, Cutoff = 1]

[Figure: axis labels Closest, Furthest]

Collection “Evaluation”
• Assume higher correlations are good
• With human relevance assessments, one can also compute a “precision” curve
• Precision P(n) after considering the n most highly ranked items is the number of relevant items among them, divided by n

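The precision curve can be computed directly from ranked relevance judgments. A minimal sketch, assuming the judgments arrive as a list of booleans ordered from the highest-ranked item down; the sample judgments are made up.

```python
def precision_at(ranked_relevance, n):
    """P(n): fraction of the n most highly ranked items judged relevant."""
    if n <= 0 or n > len(ranked_relevance):
        raise ValueError("n must be between 1 and the number of ranked items")
    return sum(ranked_relevance[:n]) / n

# Hypothetical human assessments for the top five items of a collection.
judgments = [True, True, False, True, False]
curve = [precision_at(judgments, n) for n in range(1, len(judgments) + 1)]
print(curve)
```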
[Figure: results for Cutoff = 0, Threshold = 0.3]

Tunneling with Cutoff
• Nugget – dud – dud … – dud – nugget
  Notation: 0 – X – X … – X – 0
• Fixed cutoff: 0 – X1 – X2 – … – Xc
• Adaptive cutoff: 0 – X1 – X2 – … – X?

Statistics Collected
• 500,000 documents
• Number of seeds: 4
• Path data for all but seeds
• 6620 completed paths (0-x…x-0)
• 100,000s of incomplete paths (0-x…x..)

Nuggets that are x steps from a nugget

Nuggets that are x steps from a seed and/or a nugget

Better parents have better children.

Using the Empirical Observations
• Use the path history
• Use the page quality (cosine correlation)
• Current distance should increase exponentially as you get away from quality nodes
Distance = 0 if this is a nugget, otherwise: 1 or (1 − corr) × exp(2 × parent’s distance / cutoff)

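A hedged sketch of the adaptive distance. The slide writes "1 or" without saying how the choice between 1 and the exponential term is made, so the sketch leaves that rule to the caller through a hypothetical `combine` argument; the `max` and `min` calls below only show the two candidates and are not a claim about which one was used.

```python
import math

def adaptive_distance(corr, parent_distance, cutoff, is_nugget, combine):
    """Distance of the current page; `combine` picks between 1 and the exponential term."""
    if is_nugget:
        return 0.0
    # Grows exponentially with the parent's distance, damped by page quality.
    growth = (1.0 - corr) * math.exp(2.0 * parent_distance / cutoff)
    return combine(1.0, growth)

# Made-up values: a weakly correlated page whose parent is at distance 5, cutoff 20.
d_max = adaptive_distance(0.2, 5.0, 20, False, max)
d_min = adaptive_distance(0.2, 5.0, 20, False, min)
d_nugget = adaptive_distance(0.9, 5.0, 20, True, max)
print(d_max, d_min, d_nugget)
```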
Results
• Details in the ECDL paper
• Smaller frontier → more docs/second
• More documents downloaded in same time
• Higher-scoring documents were downloaded
• Cutoff of 20 averaged 7 steps at the cutoff

Fall 2002 Student Project
[Diagram labels: Query; Centroid; Centroids, Dictionary; Mercator; Term vectors; Collection URLs; Collection Description; Chebyshev P.s; HTML]

Conclusion
• We’ve covered crawling – history, technology, use
• Focused crawling with tunneling
• Adaptive cutoff with tunneling
• We have a good experimental setup for exploring automatic collection synthesis