
Topical Crawlers for Building Digital Library Collections

Presentation Transcript


  1. Topical Crawlers for Building Digital Library Collections Presenter: Qiaozhu Mei

  2. 1. J. Qin et al. Building Domain Specific Web Collections for Scientific Digital Libraries: A Meta-Search Enhanced Focused Crawling Method • 2. G. Pant et al. Panorama: Extending Digital Libraries with Topical Crawlers • (JCDL 2004)

  3. Outline • Problem Description • Research Background • Their Approaches • Designing Classifier • Enhancing Meta-Search • Identifying Communities in Collection • Experiments • Discussion

  4. Problem Description • Problem: Collect domain-specific documents from the Web and manage the resulting literature collection. • Topical crawlers (focused crawlers) are designed to collect domain-specific documents. • How to bridge the gaps between Web communities?

  5. Digital Library vs. Search Engine • Digital Library • Domain-specific, serving literature study • High quality • Topical crawler + collection management • Knowledge discovery • Search Engine • General-purpose, serving Web search • High quantity • General crawler + online retrieval • Indexing, retrieval performance, etc.

  6. Research Background • Domain definition • Expand training set, get starting URLs • BFS, best-first search, tree pruning, multiple starting URLs, tunneling • VSM, Naïve Bayes, SVM, etc. • TF-IDF, k-means, etc. • PageRank, HITS, etc.

  7. Why doesn't a general crawler with breadth-first search work?

  8. Web Communities

  9. Design Classifier (Pant et al.) • Motivation: define the domain and distinguish relevant from non-relevant documents • Approach: • Query Google with title & reference to construct the positive/negative example sets (training set) • Use the Vector Space Model to represent documents, with TF-IDF as term weights • Use a Naïve Bayes classifier to estimate Pr(c+|q), which is used for ranking
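The classifier step on this slide can be sketched as follows. This is a minimal illustration, not the implementation from Pant et al.: it assumes a multinomial Naive Bayes model over raw term counts with add-one smoothing (the TF-IDF weighting mentioned on the slide is left out for brevity), and the names `train`, `prob_relevant` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def train(pos_docs, neg_docs):
    """Fit a two-class multinomial Naive Bayes model from tokenized documents."""
    vocab = {t for d in pos_docs + neg_docs for t in d}
    n_docs = len(pos_docs) + len(neg_docs)
    model = {}
    for label, docs in (("+", pos_docs), ("-", neg_docs)):
        counts = Counter(t for d in docs for t in d)
        total = sum(counts.values())
        model[label] = {
            "prior": math.log(len(docs) / n_docs),
            # Add-one (Laplace) smoothing so unseen terms get non-zero probability.
            "loglik": {t: math.log((counts[t] + 1) / (total + len(vocab))) for t in vocab},
        }
    return model

def prob_relevant(model, doc):
    """Return the posterior probability of the positive class for a tokenized page."""
    scores = {}
    for label, m in model.items():
        # Terms outside the training vocabulary are ignored.
        scores[label] = m["prior"] + sum(m["loglik"][t] for t in doc if t in m["loglik"])
    top = max(scores.values())
    exp = {label: math.exp(s - top) for label, s in scores.items()}
    return exp["+"] / (exp["+"] + exp["-"])

# Toy training data (hypothetical).
pos = [["focused", "crawler", "web"], ["topical", "crawler"]]
neg = [["football", "score"], ["recipe", "cake", "web"]]
model = train(pos, neg)
p_on_topic = prob_relevant(model, ["crawler", "web"])
p_off_topic = prob_relevant(model, ["football"])
```

A crawler would use the returned probability to rank unvisited links, so pages that look on-topic are fetched first.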

  10. Design Classifier (cont.) • TF-IDF weighting:
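The formula from this slide did not survive in the transcript. As a rough illustration only, the sketch below computes one common TF-IDF variant (relative term frequency times the log of inverse document frequency); the weighting actually used in the paper may differ, and the function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [
    ["focused", "crawler", "web"],
    ["web", "search", "engine"],
    ["digital", "library", "web"],
]
w = tf_idf(docs)
```

With this variant, a term that appears in every document (here "web") gets weight zero, while terms unique to one document get the largest weights.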

  11. Enhancing Meta-Search (Qin et al.) • Motivation: overcome the limitations of local search algorithms in crawling and bridge distributed Web communities • Approach: • Manually provide domain-specific queries • Query a meta-search engine to get multiple starting URLs.
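One way to turn the results of several search engines into a single list of starting URLs is sketched below. It is an assumption-laden illustration rather than the method of Qin et al.: the result lists are hard-coded instead of fetched, the interleave-and-deduplicate policy is a guess, and `merge_seed_urls` and the example URLs are hypothetical.

```python
from itertools import zip_longest
from urllib.parse import urlsplit

def merge_seed_urls(result_lists):
    """Interleave ranked result lists and drop duplicate URLs, keeping first occurrences."""
    seen, seeds = set(), []
    # zip_longest walks all lists rank by rank, padding shorter lists with None.
    for same_rank in zip_longest(*result_lists):
        for url in same_rank:
            if url is None:
                continue
            parts = urlsplit(url)
            # Treat host case and a trailing slash as irrelevant for duplicates.
            key = (parts.netloc.lower(), parts.path.rstrip("/"))
            if key not in seen:
                seen.add(key)
                seeds.append(url)
    return seeds

engine_a = ["http://a.example/x", "http://b.example/"]
engine_b = ["http://A.example/x/", "http://c.example/p"]
seeds = merge_seed_urls([engine_a, engine_b])
```

Interleaving by rank keeps the top hits of every engine near the front of the seed list, so no single engine dominates the starting points.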

  12. Identifying Communities in Collection (Pant et al.) • Motivation: analyze the latent structures in the collection, summarize and represent potential communities • Approach: • Use k-means for content clustering • Use HITS for structural clustering • Label clusters by TF-IDF filtering
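For readers unfamiliar with HITS, the following is a minimal sketch of the standard hub/authority iteration on a small link graph. How the paper applies HITS within clusters is not described in this transcript, so the graph, the fixed iteration count, and the function name `hits` are illustrative assumptions.

```python
import math

def hits(links, iterations=50):
    """links: {page: [pages it links to]}. Returns (hub, authority) score dicts."""
    nodes = set(links) | {v for targets in links.values() for v in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {n: sum(hub[u] for u in links if n in links[u]) for n in nodes}
        norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {n: a / norm for n, a in auth.items()}
        # Hub score: sum of authority scores of the pages linked to.
        hub = {n: sum(auth[v] for v in links.get(n, ())) for n in nodes}
        norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {n: h / norm for n, h in hub.items()}
    return hub, auth

# Two pages that both point at a third one (toy graph).
links = {"a": ["c"], "b": ["c"], "c": []}
hub, auth = hits(links)
```

In this toy graph page "c" ends up as the only authority, and "a" and "b" share equal hub scores.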

  13. Experiments (Qin et al.) • Experiment Design • Compare with Google and a domain-specific search engine. • 996,028 pages, 1/3 from the meta-search method. • precision@20. • Compare meta-search enhanced crawling with the traditional method, by means of precision. • 997,632 pages from the baseline method. • precision@10 • Experts define queries and judge results.
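Assuming the "@20" and "@10" measures on this slide denote precision over the top 20 and top 10 results, the metric can be sketched as below. The function name `precision_at_k` and the judgment list are hypothetical; they are not data from the paper.

```python
def precision_at_k(judgments, k):
    """judgments: ranked list of 0/1 expert relevance labels. Returns precision over the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Results missing below rank len(judgments) count as non-relevant.
    return sum(judgments[:k]) / k

# Hypothetical expert judgments for one query, best-ranked result first.
judged = [1, 1, 0, 1, 0]
p_at_5 = precision_at_k(judged, 5)
p_at_2 = precision_at_k(judged, 2)
```

Dividing by k rather than by the number of returned results penalizes a system that returns fewer than k results.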

  14. Experiments (Qin et al.) cont. • Experiment Result 1: • Their approach: • General SE: • Domain Specific: • Meta-search enhanced method better than general search engine and traditional domain specific search engine. • Experiment Result 2 • Expert ranking results in range 1-4 • Meta-search Enhanced: 2.77 • Baseline: 2.51 • In Top 100 results from Meta-search Enhanced collection: • from meta-search: 3.22, rest: 2.61

  15. Experiments (Pant et al.) • Experiment Design: • Test bed: from CiteSeer. 94 papers as initial documents. • Use one paper (expanded by querying Google) to build the positive example set and the other 93 to build the negative example set. • Compare with a BFS crawler. • Harvest rate:
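The comparison on this slide can be illustrated with a toy, in-memory crawl. The sketch below is not the experimental setup of Pant et al.: the link graph and relevance scores are made up, harvest rate is assumed to mean the fraction of crawled pages whose relevance reaches a threshold, and each link inherits the score of the page it was found on. All names (`crawl`, `harvest_rate`, `graph`, `relevance`) are hypothetical.

```python
import heapq
from collections import deque

def crawl(graph, seeds, score, budget, best_first=True):
    """Visit up to `budget` pages of an in-memory link graph; return the visit order."""
    visited, order = set(), []
    if best_first:
        # Max-priority queue via negated scores; seeds get top priority.
        frontier = [(-1.0, url) for url in seeds]
        heapq.heapify(frontier)
    else:
        frontier = deque(seeds)  # plain FIFO queue for breadth-first search
    while frontier and len(order) < budget:
        url = heapq.heappop(frontier)[1] if best_first else frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for link in graph.get(url, []):
            if link in visited:
                continue
            if best_first:
                # The link is unfetched, so it inherits the parent page's score.
                heapq.heappush(frontier, (-score(url), link))
            else:
                frontier.append(link)
    return order

def harvest_rate(order, relevance, threshold=0.5):
    """Fraction of crawled pages whose relevance is at least `threshold`."""
    return sum(relevance[u] >= threshold for u in order) / len(order)

graph = {"seed": ["off1", "rel1"], "off1": ["off2", "off3"], "rel1": ["rel2", "rel3"]}
relevance = {"seed": 1.0, "rel1": 0.9, "rel2": 0.9, "rel3": 0.9,
             "off1": 0.1, "off2": 0.1, "off3": 0.1}

best = crawl(graph, ["seed"], relevance.__getitem__, budget=5, best_first=True)
bfs = crawl(graph, ["seed"], relevance.__getitem__, budget=5, best_first=False)
best_rate = harvest_rate(best, relevance)
bfs_rate = harvest_rate(bfs, relevance)
```

On this toy graph the best-first crawler spends its budget on links found on relevant pages, while breadth-first search follows off-topic links as soon as they are queued, so its harvest rate is lower.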

  16. Experiments (Pant et al.) cont. • Experiment results: • InterWeave: A middleware system for distributed shared states

  17. Conclusions • System overview for building a literature collection with topical Web crawlers • Classifier-enhanced best-first search performs better than breadth-first search. • A meta-search enhanced topical crawler performs better than topical crawlers without meta-search. • A clustering-based method to represent latent community structures in the collection

  18. Discussion • Contribution of these two papers • Qin et al.: enhance meta-search to get multiple starting URLs • Pant et al.: clarify and implement a sound system structure; propose a way to discover latent communities in the collection • Constraints • No significant theoretical contribution • Experiments not convincing

  19. Thanks!
