Topical Crawlers for Building Digital Library Collections. Presenter: Qiaozhu Mei. 1. J.Qin et al. Building Domain Specific Web Collections for Scientific Digital Libraries: A Meta-Search Enhanced Focused Crawling Method
Topical Crawlers for Building Digital Library Collections Presenter: Qiaozhu Mei
1. J. Qin et al. Building Domain Specific Web Collections for Scientific Digital Libraries: A Meta-Search Enhanced Focused Crawling Method • 2. G. Pant et al. Panorama: Extending Digital Libraries with Topical Crawlers • (JCDL 2004)
Outline • Problem Description • Research Background • Their Approaches • Designing Classifier • Enhancing Meta-Search • Identifying Communities in Collection • Experiments • Discussion
Problem Description • Problem: Collect domain-specific documents from the Web and manage the resulting literature collection. • Topical crawlers (focused crawlers) are designed to collect domain-specific documents. • Open question: how to bridge the gap between Web communities.
Digital Library vs. Search Engine • Digital Library • Domain specific, serves literature study • High quality • Topical crawler + collection management • Knowledge discovery • Search Engine • General, serves Web search • High quantity • General crawler + online retrieval • Indexing, retrieval performance, etc.
Research Background • Domain definition: expand training set, get starting URLs • BFS, best-first search, tree pruning, multiple starting URLs, tunneling • VSM, Naïve Bayes, SVM, etc. • TF-IDF, k-means, etc. • PageRank, HITS, etc.
Design Classifier (Pant et al.) • Motivation: define the domain and distinguish relevant & non-relevant documents • Approach: • Query Google with title & reference to construct positive/negative example set (training set) • Use Vector Space Model to represent documents, use TF-IDF as term weights • Use Naïve Bayesian Classifier to estimate Pr(c+|q), which is used for ranking
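To make the ranking step concrete, here is a minimal sketch of a two-class Naïve Bayes classifier that returns Pr(c+|q) for a tokenized document q. It is an illustration under stated assumptions, not the classifier from Pant et al.: it uses raw term counts with add-one smoothing rather than TF-IDF weights, assumes both example sets are non-empty, and the names `train_nb` and `prob_positive` are hypothetical.

```python
import math
from collections import Counter

def train_nb(pos_docs, neg_docs):
    """Fit a two-class multinomial Naive Bayes model with add-one smoothing."""
    vocab = {t for d in pos_docs + neg_docs for t in d}
    total = len(pos_docs) + len(neg_docs)
    model = {"vocab": vocab}
    for label, docs in (("pos", pos_docs), ("neg", neg_docs)):
        counts = Counter(t for d in docs for t in d)
        n_terms = sum(counts.values())
        model[label] = {
            "log_prior": math.log(len(docs) / total),
            # Smoothed log-likelihood of each vocabulary term given the class.
            "log_like": {
                t: math.log((counts[t] + 1) / (n_terms + len(vocab)))
                for t in vocab
            },
        }
    return model

def prob_positive(model, doc):
    """Return Pr(c+ | doc); terms outside the training vocabulary are ignored."""
    scores = {}
    for label in ("pos", "neg"):
        cls = model[label]
        scores[label] = cls["log_prior"] + sum(
            cls["log_like"][t] for t in doc if t in model["vocab"]
        )
    # Normalize in log space to avoid underflow on long documents.
    m = max(scores.values())
    exp = {k: math.exp(v - m) for k, v in scores.items()}
    return exp["pos"] / (exp["pos"] + exp["neg"])

# Toy example sets (made up for illustration).
pos = [["crawler", "topic"], ["crawler", "library"]]
neg = [["football", "score"], ["football", "match"]]
model = train_nb(pos, neg)
p_relevant = prob_positive(model, ["crawler"])
```

A crawler could then sort candidate pages by this probability, so pages that look more like the positive examples are visited first.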
Design Classifier (cont.) • TF-IDF weighting:
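The formula from this slide did not survive extraction. As a stand-in, the sketch below uses one common TF-IDF variant (term frequency normalized by document length, multiplied by log(N / df)); the exact variant used by Pant et al. may differ, and the function name `tf_idf` is an assumption.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Toy documents (made up for illustration).
docs = [
    "focused crawler digital library".split(),
    "digital library collection".split(),
    "web search engine".split(),
]
w = tf_idf(docs)
```

Terms that occur in few documents ("focused") receive a higher weight than terms shared across documents ("digital"), and a term that occurs in every document receives weight zero.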
Enhancing Meta-Search (Qin et al.) • Motivation: Overcome the limitations of local search algorithms in crawling, and bridge distributed Web communities • Approach: • Manually provide domain-specific queries • Query a meta-search engine to get multiple starting URLs.
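The seeding step can be sketched as follows: send each domain query to several search engines, interleave the ranked results, and drop duplicates. The engine functions and URLs below are hypothetical stand-ins, not real search APIs, and the interleaving policy is an assumption rather than the method described by Qin et al.

```python
from itertools import zip_longest

def collect_seed_urls(queries, engines, per_query=10):
    """Merge results from several engines into one de-duplicated seed list."""
    seeds, seen = [], set()
    for query in queries:
        result_lists = [engine(query)[:per_query] for engine in engines]
        # Interleave the ranked lists so no single engine dominates the seeds.
        for tier in zip_longest(*result_lists):
            for url in tier:
                if url is not None and url not in seen:
                    seen.add(url)
                    seeds.append(url)
    return seeds

# Hypothetical stand-ins for real search-engine clients.
def engine_a(query):
    return ["http://a.example/1", "http://shared.example/x"]

def engine_b(query):
    return ["http://shared.example/x", "http://b.example/2", "http://b.example/3"]

seeds = collect_seed_urls(["domain query"], [engine_a, engine_b])
```

Because the seeds come from independent engines, they can land in Web communities that are not linked to each other, which is the gap a purely link-following crawler cannot cross.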
Identifying Communities in Collection (Pant et al.) • Motivation: analyze the latent structures in the collection, summarize and represent potential communities • Approach: • Use k-means for content clustering • Use HITS for structural clustering • Label clusters by TF-IDF filtering
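For the structural side, a minimal sketch of the HITS iteration is shown below. It only computes hub and authority scores on a small link graph; how Pant et al. turn those scores into clusters is not covered here, and the toy graph is made up.

```python
import math

def hits(links, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[d] for d in links.get(p, ())) for p in pages}
        # Normalize both vectors to unit length so scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Toy link graph: a and b both point to c, and c points to d.
links = {"a": ["c"], "b": ["c"], "c": ["d"]}
hub, auth = hits(links)
```

In this toy graph, c ends up as the strongest authority, while a and b share the top hub score.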
Experiments (Qin et al.) • Experiment design • Compare with Google and a domain-specific search engine. • 996,028 pages, 1/3 from the meta-search method. • pre@20. • Compare meta-search enhanced crawling with traditional crawling in terms of precision. • 997,632 pages from the baseline method. • pre@10. • Experts define the queries and judge the results.
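Assuming pre@20 and pre@10 denote precision over the top 20 and top 10 results, the metric can be sketched as below. The document identifiers and relevance judgments are made up, and dividing by k (rather than by the number of results returned) is one common convention, not necessarily the one used in the paper.

```python
def precision_at_k(ranked_results, relevant, k):
    """Fraction of the top-k results that experts judged relevant."""
    top_k = ranked_results[:k]
    if not top_k:
        return 0.0
    return sum(1 for r in top_k if r in relevant) / k

# Toy ranking and expert judgments (made up for illustration).
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
p_at_4 = precision_at_k(ranked, relevant, 4)
```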
Experiments (Qin et al.) cont. • Experiment Result 1: • Their approach: • General SE: • Domain Specific: • The meta-search enhanced method performs better than the general search engine and the traditional domain-specific search engine. • Experiment Result 2: • Experts rate results on a scale of 1-4 • Meta-search enhanced: 2.77 • Baseline: 2.51 • Among the top 100 results from the meta-search enhanced collection: • from meta-search: 3.22, rest: 2.61
Experiments (Pant et al.) • Experiments Design: • Test Bed: from CiteSeer. 94 papers as initial documents. • Use one (expanded by querying Google) for building positive example set, 93 for building negative example set. • Compare with a BFS crawler. • Harvest rate:
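The harvest-rate formula from this slide did not survive extraction. One common definition is the fraction of crawled pages that are judged relevant; the sketch below uses that definition as an assumption, and Pant et al. may define it differently (for example, as an average classifier score).

```python
def harvest_rate(relevance_flags):
    """Fraction of crawled pages judged relevant (flags are True/False)."""
    if not relevance_flags:
        return 0.0
    return sum(relevance_flags) / len(relevance_flags)

def running_harvest_rate(relevance_flags):
    """Harvest rate after each crawled page, useful for plotting over time."""
    rates, hits = [], 0
    for n, flag in enumerate(relevance_flags, start=1):
        hits += bool(flag)
        rates.append(hits / n)
    return rates

# Relevance of pages in crawl order (made up for illustration).
flags = [True, False, True, True]
final_rate = harvest_rate(flags)
curve = running_harvest_rate(flags)
```

Plotting the running curve for two crawlers on the same axes is a simple way to compare how quickly each one finds relevant pages.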
Experiments (Pant et al.) cont. • Experiment results: • InterWeave: A middleware system for distributed shared states
Conclusions • System overview for building literature collections with topical Web crawlers • Classifier-enhanced best-first search performs better than breadth-first search. • A meta-search enhanced topical crawler performs better than topical crawlers without meta-search. • A clustering-based method to represent latent community structures in a collection
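The difference between best-first and breadth-first crawling comes down to the frontier: best-first keeps a priority queue ordered by estimated relevance instead of a FIFO queue. The sketch below runs on a toy in-memory graph; the graph, the scores, and the function name `best_first_crawl` are hypothetical, and scoring a link directly is a simplification, since a real crawler would estimate it from the page where the link was found.

```python
import heapq

def best_first_crawl(seeds, get_links, score, max_pages):
    """Visit up to max_pages URLs, always expanding the highest-scored one."""
    counter = 0  # tie-breaker so equal scores pop in insertion order
    frontier, seen = [], set(seeds)
    for url in seeds:
        heapq.heappush(frontier, (-score(url), counter, url))
        counter += 1
    visited = []
    while frontier and len(visited) < max_pages:
        _, _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                # Negate the score because heapq is a min-heap.
                heapq.heappush(frontier, (-score(link), counter, link))
                counter += 1
    return visited

# Toy in-memory "web" and relevance scores (both hypothetical).
graph = {"seed": ["sports", "crawler"], "crawler": ["library"], "sports": ["scores"]}
relevance = {"seed": 0.5, "sports": 0.1, "crawler": 0.9, "library": 0.8, "scores": 0.05}
order = best_first_crawl(["seed"], lambda u: graph.get(u, []), relevance.get, max_pages=4)
```

On this graph the best-first order is seed, crawler, library, sports, whereas a breadth-first crawler with the same budget would visit seed, sports, crawler, scores and spend half of it on low-relevance pages.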
Discussion • Contributions of the two papers • Qin et al.: use meta-search to obtain multiple starting URLs • Pant et al.: clarify and implement a sound system structure; propose a way to discover latent communities in a collection • Limitations • No significant theoretical contribution • Experiments are not convincing