
Recent Results in Automatic Web Resource Discovery

Soumen Chakrabarti. Presentation by Cui Tao.


Presentation Transcript


  1. Recent Results in Automatic Web Resource Discovery • Soumen Chakrabarti • Presentation by Cui Tao

  2. Introduction • Classical IR: • Indexing a collection of documents • Answering queries by returning a ranked list of relevant documents • Problems in retrieving online documents • Ambiguity • Context sensitivity • Synonymy • Polysemy • Large number of relevant Web pages

  3. Introduction • Directory-based topic browsing: tree-like structure • Mostly maintained by human experts • Advantages: exemplary, influential • Disadvantages: slow, subjective and noisy

  4. Introduction • Standard crawlers and search engines • 1997: covered 35-40% of 340 million Web pages • 1999: covered 18% of 800 million Web pages • Cannot be used for maintaining generic portals or for automatic resource discovery

  5. Introduction • Focused crawler: • Selectively seeks out pages that are relevant to a pre-defined set of topics • Preferred by experts and researchers • Two modules: • Classifier: analyzes the text in and links around a given web page and automatically assigns it to suitable directories in a web catalog • Distiller: identifies the centrality of crawled pages to determine visit priorities
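The crawl loop described on slide 5 can be sketched as a best-first traversal. This is a minimal sketch under stated assumptions, not the system from the talk: the in-memory `toy_web` stands in for real page fetching, `keyword_relevance` is a hypothetical stand-in for the classifier, a link's priority is simply the relevance of the page it was found on, and the distiller is omitted.

```python
import heapq

def keyword_relevance(text, topic_words=frozenset({"bicycle", "cycling"})):
    """Hypothetical classifier: fraction of words that are topic words."""
    words = text.lower().split()
    return sum(w in topic_words for w in words) / len(words) if words else 0.0

def focused_crawl(web, seeds, relevance, max_pages=10, min_relevance=0.5):
    """Visit pages best-first; only relevant pages have their links followed.

    web maps url -> (text, outlinks) and replaces real HTTP fetching.
    """
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        text, outlinks = web.get(url, ("", []))
        score = relevance(text)
        visited.append((url, score))
        if score < min_relevance:
            continue  # irrelevant page: do not expand its links
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return visited

toy_web = {
    "seed": ("bicycle cycling clubs", ["bike", "news"]),
    "bike": ("bicycle repair", ["parts"]),
    "news": ("stock market news", ["finance"]),
    "parts": ("bicycle parts", []),
    "finance": ("bond prices", []),
}
crawl_order = [url for url, _ in focused_crawl(toy_web, ["seed"], keyword_relevance)]
```

In this toy run the off-topic page `news` is fetched but not expanded, so `finance` is never visited. That pruning is what separates a focused crawler from a standard one.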

  6. Distillation techniques • Google: • Simulates a random walk on the Web • Pages are ranked by pre-computed popularity and visitation rate • Fast
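The random-walk ranking on slide 6 can be sketched as a power iteration over the link graph. This is an illustrative sketch, not Google's actual implementation: the damping factor of 0.85, the iteration count, the toy graph, and the function name `pagerank` are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate how often a random surfer visits each page.

    links maps each page to the list of pages it links to.
    damping (assumed 0.85) is the probability of following a link
    instead of jumping to a page chosen uniformly at random.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the random-jump share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_links)
```

Because the scores depend only on the link graph, they can be computed once ahead of time and reused for every query, which is why the slide calls this approach fast.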

  7. Distillation techniques • HITS (Hyperlink Induced Topic Search): • Depends on a search engine • Combines two scores: • Authorities: identify pages with useful information about a topic • Hubs: identify pages that contain many links to pages with useful information on the topic • Query dependent and slow • May lead to topic contamination or drift
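The mutual reinforcement between hubs and authorities on slide 7 can be sketched as an alternating update. This is a minimal sketch assuming the query-specific subgraph has already been fetched from a search engine; the toy graph and the fixed iteration count are assumptions.

```python
import math

def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a small link graph.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both vectors to unit length so scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

toy_graph = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub_scores, auth_scores = hits(toy_graph)
```

The iteration runs per query on a freshly assembled subgraph, which is the source of the "query dependent and slow" cost noted on the slide.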

  8. Distillation techniques • ARC and CLEVER: • ARC (Automatic Resource Compiler): part of CLEVER • Root set is expanded by 2 links instead of 1 link (including all pages that are at link-distance two or less from at least one page in the root set) • Assigns weights to the hyperlinks, based on the match between the query and the text surrounding the hyperlink in the source document
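The link weighting on slide 8 can be illustrated with a small scoring function. The formula used here, one plus the number of query-term occurrences in the text around the link, is an assumption chosen for illustration; the slide does not state the exact weighting, and the function name `anchor_weight` is hypothetical.

```python
def anchor_weight(query, anchor_window):
    """Weight a hyperlink by how well its surrounding text matches the query.

    Assumed formula: 1 + number of query-term occurrences in the window,
    so a link with no matching text still keeps a baseline weight of 1.
    """
    query_terms = set(query.lower().split())
    window_words = anchor_window.lower().split()
    matches = sum(1 for w in window_words if w in query_terms)
    return 1 + matches

strong = anchor_weight("power supply", "buy a switching power supply here")
weak = anchor_weight("power supply", "click here for our home page")
```

In a weighted variant of the hub and authority computation, each link would contribute its weight times the source score rather than the source score alone, so links with on-topic anchor text count for more.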

  9. Distillation techniques • Outlier filtering: • Computes relevance weights for pages using the Vector Space Model • All pages whose weights are below a threshold are pruned • Effectively prunes away outlier nodes in the neighborhood, thus avoiding contamination
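The pruning step on slide 9 can be sketched with cosine similarity over term-frequency vectors. This is a sketch under assumptions: the slide does not say what pages are compared against, so a plain query string is used as the reference, raw term counts are used instead of any particular term weighting, and the threshold of 0.2 is arbitrary.

```python
import math
from collections import Counter

def tf_vector(text):
    """Represent text as a bag of lowercase words with raw counts."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def prune_outliers(query, pages, threshold=0.2):
    """Keep only pages whose similarity to the query meets the threshold."""
    q = tf_vector(query)
    return {url: text for url, text in pages.items()
            if cosine(q, tf_vector(text)) >= threshold}

toy_pages = {
    "p1": "bicycle repair and bicycle parts",
    "p2": "cheap flights and hotel deals",
}
kept = prune_outliers("bicycle repair", toy_pages)
```

Running the distillation only on the surviving pages keeps an off-topic but heavily linked neighborhood from dominating the scores.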

  10. Topic distillation vs. Resource discovery • Topic distillation: • Depends on large, comprehensive Web crawls and indices (post-processing) • Can it be used to generate a Web taxonomy? • Set a keyword query for each node in the taxonomy • Run a distillation program • Simple, but has some problems

  11. Topic distillation vs. Resource discovery • Problems: • Constructing the query: involves trial, error and complicated thought • Query: “North American telecommunication companies” • Query for the Yahoo! node /Business&Economy /Companies /Electronics /PowerSupplies: +"power suppl*" "switch* mode" smps -multiprocessor* "uninterrupt* power suppl*" ups -parcel • To match directory-based browsing quality: • Yahoo!: 7.03 terms and 4.34 operators • Alta Vista: 2.35 terms and 0.41 operators

  12. Topic distillation vs. Resource discovery • Problems: • Contamination • Stop-sites: not automatic • Term weighting • Edge weighting: no precise algorithm to set the weights • Topic distillation by itself is not enough for resource discovery

  13. Hypertext classification: learning from examples • Adding example pages and their distance-1 neighbors to the graph to be distilled improves the result • The contents of the given examples and their neighbors provide a way to compute the decision boundary for classification • Nearest-neighbor (NN), Bayesian and support vector classifiers
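Learning from examples, as on slide 13, can be illustrated with a small Bayesian text classifier. This is a minimal multinomial naive Bayes sketch with add-one smoothing, chosen as one instance of the "Bayesian" family the slide names; the training examples and labels are made up, and neighbor text is not modeled (it could be appended to each example's text as a crude approximation).

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples is a list of (text, label) pairs."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_docs[label] += 1
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    return class_docs, word_counts, vocab

def classify(model, text):
    """Return the label with the highest log-posterior for the text."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)  # class prior
        for w in text.lower().split():
            if w in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((word_counts[label][w] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

toy_examples = [
    ("bicycle gears and bicycle frames", "cycling"),
    ("mountain bike trails", "cycling"),
    ("stock market prices", "finance"),
    ("bond market yields", "finance"),
]
model = train_naive_bayes(toy_examples)
```

With a model like this, `classify(model, page_text)` could play the role of the relevance judgment that a focused crawler needs for each fetched page.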

  14. Hypertext classification • Link-based features: important • Circular topic influence • The topic of a page influences its text and the topics of its neighboring pages • Knowledge of the linked vicinity’s topics provides clues to the test document’s topic • Bibliometric, more general than the simple linear endorsement model used in topic distillation

  15. Putting it together for resource discovery

  16. Conclusion • Emphasized the importance of scalable automatic resource discovery • Argued that common search engines are not adequate for resource discovery • Introduced the recently invented focused crawling system

  17. Future Work • How to derive the training examples automatically? • How to personalize the output of a focused crawler for users?
