Focused Crawling and Collection Synthesis

Donna Bergmark
Cornell Information Systems
CUL Metadata WG Meeting

Outline
• Crawlers
• Collection Synthesis
• Focused Crawling
• Some Results
• Student Project (Fall 2002)

Definition
Spider = robot = crawler
Crawlers are computer programs that roam the Web with the goal of automating specific tasks related to the Web.

Crawlers – some background
• Resource discovery
• Crawlers and internet history
• Crawling and crawlers
• Mercator

Resource Discovery
• Finding info on the Web
  • Surfing (random strategy, goal is serendipity)
  • Searching (inverted indices; specific info)
  • Crawling (“all” the info)
• Uses for crawling
  • Find stuff
  • Gather stuff
  • Check stuff

Crawlers and internet history
• 1991: HTTP
• 1992: 26 servers
• 1993: 60+ servers; self-register; archie
• 1994 (early): first crawlers
• 1996: search engines abound
• 1998: focused crawling
• 1999: web graph studies
• 2002: use for digital libraries

Crawling and Crawlers
• Web overlays the internet
• A crawl overlays the web
[Figure: a crawl starting from a node labeled "seed"]

Crawler Issues
• The web is so big
• Visit Order
• The URL itself
• Politeness
• Robot Traps
• The hidden web
• System Considerations

Standard for Robot Exclusion
• Martijn Koster (1994)
• http://any-server:80/robots.txt
• Maintained by the webmaster
• Forbids access to pages and directories
• Commonly excluded: /cgi-bin/
• Adherence is voluntary for the crawler
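A polite crawler checks robots.txt before fetching a page. The sketch below uses Python's standard urllib.robotparser module; the robots.txt content, the user-agent name "example-crawler", and the URLs are made up for illustration, and the rules are parsed from a string so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, as a webmaster might publish it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""

def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = make_parser(ROBOTS_TXT)

# The crawler asks before each fetch; honoring the answer is voluntary.
blocked = parser.can_fetch("example-crawler", "http://any-server/cgi-bin/search")
allowed = parser.can_fetch("example-crawler", "http://any-server/papers/index.html")
print(blocked, allowed)
```

In a live crawl one would instead call set_url() with the server's robots.txt address and then read() to download the rules.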

Robot Traps
• Cycles in the Web graph
• Infinite links on a page
• Traps set out by the Webmaster
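Two common defenses against traps are a set of already-seen URLs (which breaks cycles) and a bound on crawl depth (which limits endless chains of generated links). The sketch below shows both on a small in-memory link graph; the graph, the function name crawl, and the depth limit are illustrative assumptions, not part of any particular crawler.

```python
from collections import deque

def crawl(graph, seed, max_depth=3):
    """Breadth-first traversal with a seen-set and a depth bound."""
    seen = {seed}
    queue = deque([(seed, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # depth bound: do not expand pages this deep
        for link in graph.get(url, []):
            if link not in seen:  # seen-set: never enqueue a URL twice
                seen.add(link)
                queue.append((link, depth + 1))
    return order

# Hypothetical link graph with cycles (a <-> b, c <-> d) and a long chain.
GRAPH = {
    "a": ["b", "c"],
    "b": ["a"],
    "c": ["d"],
    "d": ["c", "e"],
    "e": ["f"],
    "f": [],
}

visit_order = crawl(GRAPH, "a", max_depth=3)
print(visit_order)
```

Neither guard detects a deliberately set trap on its own; real crawlers typically add per-host limits and URL-pattern checks as well.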

The Hidden Web
• Dynamic pages increasing
• Subscription pages
• Username and password pages
• Research in progress on how crawlers can “get into” the hidden web

System Issues
• Crawlers are complicated systems
• Efficiency is of utmost importance
• Crawlers are demanding of system and network resources


Mercator Features
• Written in Java
• One file configures a crawl
• Can add your own code
  • Extend one or more of Mercator’s base classes
  • Add totally new classes called by your own
• Industrial-strength crawler:
  • uses its own DNS and java.net package

Collection Synthesis
• The NSDL
  • National Science Digital Library
  • Educational materials for K-thru-grave
  • A collection of digital collections
• Collection (automatically derived)
  • 20-50 items on a topic, represented by their URLs, expository in nature; precision trumps recall

Crawler is the Key
• A general search engine is good for precise results, few in number
• A search engine must cover all topics, not just scientific
• For automatic collection assembly, a Web crawler is needed
• A focused crawler is the key

Focused Crawling

Focused Crawling
[Figure: two crawl trees rooted at R with nodes numbered in visit order, one labeled "Focused crawl" (with some nodes marked X) and one labeled "Breadth-first crawl"]

Collections and Clusters
• Traditional – document universe is divided into clusters, or collections
  • Each collection represented by its centroid
• Web – size of document universe is infinite
  • Agglomerative clustering is used instead
• Two aspects:
  • Collection descriptor
  • Rule for when items belong to that Collection

[Figure: panels labeled Q = 0.2 and Q = 0.6]

The Setup
[Figure: a virtual collection of items about Chebyshev Polynomials]

Adding a Centroid
[Figure: an empty collection of items about Chebyshev Polynomials]

Document Vector Space
• Classic information retrieval technique
• Each word is a dimension in N-space
• Each document is a vector in N-space
  Example: <0, 0.003, 0, 0, 0.01, 0.984, 0, 0.001>
• Normalize the weights
Both the “centroid” and the downloaded document are term vectors.
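A minimal sketch of the vector-space idea follows. It stores each term vector sparsely as a dictionary, normalizes it to unit length, and computes the cosine correlation of two vectors as a dot product. Using raw term counts as weights and whitespace tokenization are simplifying assumptions made here; the function names are hypothetical.

```python
import math
from collections import Counter

def term_vector(text):
    """Unit-length term vector (term -> weight) built from raw counts."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {term: c / norm for term, c in counts.items()}

def correlation(u, v):
    """Cosine correlation of two unit-length term vectors."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())

centroid = term_vector("chebyshev polynomials chebyshev")
page = term_vector("chebyshev nodes")
unrelated = term_vector("fourier series")

print(correlation(centroid, page))
print(correlation(centroid, unrelated))
```

Because both vectors are normalized, the correlation lies between 0 (no shared terms) and 1 (same direction).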

Agglomerate
[Figure: a collection with 3 items about Chebyshev polynomials]

Where does the Centroid come from?
[Figure: the query “Chebyshev Polynomials”, a question mark, and a really good centroid for a collection about Chebyshev polynomials]

Building a Centroid
1. Google(“Chebyshev Polynomials”) → {url1 … url-n}
2. Let H be a hash (k, v) where k = word, v = frequency
3. For each url in {url1 … url-n} do
     D ← download(url)
     V ← term vector(D)
     For each term t in V do
       If t is not in H, add it with value 0
       H(t)++
4. Compute tf-idf weights. C ← top 20 terms.
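The steps above can be sketched in Python as follows. To keep the example self-contained, it starts from document texts that have already been downloaded rather than calling a search engine. The idf table is a hypothetical caller-supplied mapping (terms missing from it get a weight of 1.0), and the tokenizer is a simple assumption; neither is taken from the original project.

```python
import re
from collections import Counter

def build_centroid(documents, idf, top_k=20):
    """Return the top_k terms by tf-idf weight as a dict term -> weight."""
    freq = Counter()  # H in the slide: word -> frequency
    for text in documents:
        # Assumed tokenizer: lowercase alphabetic runs.
        freq.update(re.findall(r"[a-z]+", text.lower()))
    weights = {term: f * idf.get(term, 1.0) for term, f in freq.items()}
    # Sort by descending weight; break ties alphabetically for determinism.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:top_k])

# Hypothetical inputs for illustration only.
DOCS = [
    "Chebyshev polynomials are orthogonal",
    "The Chebyshev polynomials",
]
IDF = {"the": 0.0, "are": 0.1}

centroid_terms = build_centroid(DOCS, IDF, top_k=2)
print(centroid_terms)
```

The common words "the" and "are" drop out because their assumed idf values are low, which is the purpose of the tf-idf step.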

Dictionary
• Given centroids C1, C2, C3 …
• Dictionary is C1 + C2 + C3 …
• Terms are the union of the terms in the Ci
• Term frequency is the total frequency across the Ci
• Document frequency of a term t is the number of centroids that contain t
• Term IDF is as from Berkeley
• Dictionary is 300-500 terms
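A rough sketch of the merge described above, assuming each centroid is available as a mapping from term to frequency. The function name and the sample centroids are hypothetical, and the IDF step is left out because the slide does not define it.

```python
from collections import Counter

def build_dictionary(centroids):
    """Merge centroids into total term frequencies and document frequencies."""
    term_freq = Counter()  # total frequency of each term across centroids
    doc_freq = Counter()   # number of centroids containing each term
    for centroid in centroids:
        term_freq.update(centroid)        # adds the per-term frequencies
        doc_freq.update(centroid.keys())  # adds 1 per centroid per term
    return term_freq, doc_freq

# Hypothetical centroids for illustration only.
C1 = {"chebyshev": 5, "polynomial": 3}
C2 = {"polynomial": 2, "fourier": 4}

term_freq, doc_freq = build_dictionary([C1, C2])
print(sorted(term_freq))  # the dictionary's terms: union over centroids
```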

Focused Crawling
• Recall the cartoon for a focused crawl:
  [Figure: crawl tree rooted at R with nodes numbered in visit order and some nodes marked X]
• A simple way to do it is with 2 “knobs”

Focusing the Crawl
• Threshold: a page is on-topic if its correlation to the closest centroid is above this value
• Cutoff: follow links from pages whose “distance” from the closest on-topic ancestor is less than the cutoff
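The two knobs can be sketched on an in-memory link graph. Several details below are assumptions made for illustration: distance is counted in link steps from the nearest on-topic ancestor, a page counts as on-topic when its correlation is at least the threshold, on-topic pages are always expanded, and the graph, scores, and function name are made up.

```python
from collections import deque

def focused_crawl(links, corr, seed, threshold, cutoff):
    """Breadth-first crawl that stops expanding paths that stay off-topic."""
    collection, visited = [], []
    seen = {seed}
    queue = deque([(seed, 0)])  # (url, distance to use if page is off-topic)
    while queue:
        url, distance = queue.popleft()
        visited.append(url)
        on_topic = corr[url] >= threshold
        if on_topic:
            collection.append(url)
            distance = 0  # an on-topic page resets the distance
        # Assumption: on-topic pages are always expanded; off-topic pages
        # are expanded only while their distance is below the cutoff.
        if on_topic or distance < cutoff:
            for child in links.get(url, []):
                if child not in seen:
                    seen.add(child)
                    queue.append((child, distance + 1))
    return collection, visited

# Hypothetical link graph and correlation scores.
LINKS = {"seed": ["a", "b"], "a": ["c"], "b": ["d"], "c": ["e"]}
CORR = {"seed": 0.9, "a": 0.1, "b": 0.5, "c": 0.1, "d": 0.6, "e": 0.8}

narrow, narrow_visited = focused_crawl(LINKS, CORR, "seed", 0.3, cutoff=2)
wide, wide_visited = focused_crawl(LINKS, CORR, "seed", 0.3, cutoff=3)
print(narrow, wide)
```

With the smaller cutoff the crawl gives up after two off-topic pages in a row (a, then c) and never reaches e; the larger cutoff lets it continue through them and find it.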

Illustration
[Figure: crawl tree with nodes numbered in visit order and some marked X, annotated "Corr >= threshold" and "Cutoff = 1"]

[Figure: labels "Closest" and "Furthest"]

Collection “Evaluation”
• Assume higher correlations are good
• With human relevance assessments, one can also compute a “precision” curve
• Precision P(n) after considering the n most highly ranked items is the number of relevant items among them, divided by n
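The precision curve can be computed directly from a ranked list of relevance judgments. In this sketch the judgments are a made-up list of booleans ordered from the highest-ranked item down; the function name is hypothetical.

```python
def precision_at(ranked_relevance, n):
    """P(n): fraction of the n highest-ranked items judged relevant."""
    if not 1 <= n <= len(ranked_relevance):
        raise ValueError("n must be between 1 and the number of ranked items")
    return sum(ranked_relevance[:n]) / n

# Hypothetical human judgments, best-ranked item first.
JUDGMENTS = [True, True, False, True]

curve = [precision_at(JUDGMENTS, n) for n in range(1, len(JUDGMENTS) + 1)]
print(curve)
```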

[Figure: results for Cutoff = 0, Threshold = 0.3]


Tunneling with Cutoff
• Nugget – dud – dud … – dud – nugget
  Notation: 0 – X – X … – X – 0
• Fixed cutoff: 0 – X1 – X2 – … Xc
• Adaptive cutoff: 0 – X1 – X2 – … X?

Statistics Collected
• 500,000 documents
• Number of seeds: 4
• Path data for all but seeds
• 6620 completed paths (0-x…x-0)
• Hundreds of thousands of incomplete paths (0-x…x..)

[Chart: Nuggets that are x steps from a nugget]

[Chart: Nuggets that are x steps from a seed and/or a nugget]

Better parents have better children.

Using the Empirical Observations
• Use the path history
• Use the page quality (cosine correlation)
• Current distance should increase exponentially as you get away from quality nodes
Distance = 0 if this is a nugget; otherwise: 1 or (1 − corr) × exp(2 × parent’s distance / cutoff)
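One way to read the distance rule is sketched below. The slide's "1 or" is ambiguous; this sketch treats it as an upper bound of 1 on the exponential expression, which is an interpretation and not something the slide states. The function name, the threshold parameter used to decide what counts as a nugget, and the sample values are also assumptions.

```python
import math

def adaptive_distance(corr, parent_distance, cutoff, threshold, cap=1.0):
    """Distance of a page given its correlation and its parent's distance."""
    if corr >= threshold:
        return 0.0  # a nugget resets the distance
    # Grows exponentially with the parent's distance; low-correlation
    # pages start out further away. Capping at `cap` is an assumed reading.
    growth = (1.0 - corr) * math.exp(2.0 * parent_distance / cutoff)
    return min(cap, growth)

# Hypothetical values: cutoff 20, threshold 0.3.
near = adaptive_distance(0.1, 0.0, cutoff=20, threshold=0.3)
further = adaptive_distance(0.1, 1.0, cutoff=20, threshold=0.3)
nugget = adaptive_distance(0.5, 1.0, cutoff=20, threshold=0.3)
print(near, further, nugget)
```

Under this reading, a poor page directly below a nugget already gets a distance close to 1, and the value rises with each further off-topic step until it hits the cap.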

Results
• Details in the ECDL paper
• Smaller frontier → more docs/second
• More documents downloaded in same time
• Higher-scoring documents were downloaded
• Cutoff of 20 averaged 7 steps at the cutoff

Fall 2002 Student Project
[Figure: system diagram with components labeled Query, Centroid, Centroids, Dictionary, Term vectors, Mercator, Collection URLs, Collection Description, Chebyshev P.s, and HTML]

Conclusion
• We’ve covered crawling – history, technology, use
• Focused crawling with tunneling
• Adaptive cutoff with tunneling
• We have a good experimental setup for exploring automatic collection synthesis
