Recent Results in Automatic Web Resource Discovery • Soumen Chakrabarti • Presentation by Cui Tao
Introduction • Classical IR: • Indexing a collection of documents • Answering queries by returning a ranked list of relevant documents • Problems in retrieving online documents • Ambiguity • Context sensitivity • Synonymy • Polysemy • Large number of relevant Web pages
Introduction • Directory-based topic browsing: tree-like structure • Mostly maintained by human experts • Advantages: exemplary, influential • Disadvantages: slow, subjective and noisy
Introduction • Standard crawlers and search engines • 1997: covered 35-40% of 340 million Web pages • 1999: covered 18% of 800 million Web pages • Cannot be used for maintaining generic portals or for automatic resource discovery
Introduction • Focused crawler: • Selectively seeks out pages that are relevant to a pre-defined set of topics • Preferred by experts and researchers • Two modules: • Classifier: analyzes the text in and the links around a given Web page and automatically assigns it to suitable directories in a Web catalog • Distiller: identifies the centrality of crawled pages to determine visit priorities
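The crawl loop described on this slide can be sketched as a priority-queue traversal. This is only a minimal illustration under stated assumptions: `focused_crawl`, `fetch`, `keyword_relevance`, `TOY_WEB`, and the `min_relevance` cutoff are hypothetical names and values, the keyword test is a stand-in for a trained classifier, and the parent page's relevance is used as the visit priority in place of a real distiller.

```python
import heapq

def focused_crawl(seeds, fetch, relevance, max_pages=10, min_relevance=0.5):
    """Visit pages in priority order; expand only pages judged on-topic."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        text, out_links = fetch(url)
        score = relevance(text)          # classifier step (stand-in)
        visited.append((url, score))
        if score < min_relevance:
            continue                     # do not follow links of off-topic pages
        for link in out_links:
            if link not in seen:
                seen.add(link)
                # Assumption: a link inherits its parent's relevance as priority.
                heapq.heappush(frontier, (-score, link))
    return visited

# Hypothetical in-memory "Web": url -> (page text, out-links).
TOY_WEB = {
    "seed": ("cycling bike clubs", ["bikes", "travel"]),
    "bikes": ("bike repair and cycling gear", ["gear"]),
    "travel": ("cheap flights and hotels", ["casino"]),
    "gear": ("cycling helmets", []),
    "casino": ("poker", []),
}
TOPIC_WORDS = {"cycling", "bike"}

def keyword_relevance(text):
    """Stand-in classifier: 1.0 if any topic word occurs, else 0.0."""
    return 1.0 if TOPIC_WORDS & set(text.split()) else 0.0

visited = focused_crawl(["seed"], TOY_WEB.__getitem__, keyword_relevance)
visited_urls = [url for url, _ in visited]
```

In this toy run the off-topic "travel" page is fetched but not expanded, so "casino" is never visited.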
Distillation techniques • Google (PageRank): • Simulates a random walk on the Web • Pages are ranked by pre-computed popularity and visitation rate • Fast
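The random-walk idea can be illustrated with a small power iteration in the style of PageRank. This is a sketch, not Google's implementation: the damping factor of 0.85, the iteration count, and the `toy_web` graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread uniformly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n
                    for p in pages}
        # Each page passes an equal share of its rank along every out-link.
        for p, targets in links.items():
            for q in targets:
                new_rank[q] += damping * rank[p] / len(targets)
        rank = new_rank
    return rank

# Hypothetical four-page Web graph.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_web)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Because the scores depend only on the link graph, they can be computed once ahead of time, which is why this kind of ranking is fast at query time.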
Distillation techniques • HITS (Hyperlink Induced Topic Search): • Depends on a search engine • Combines two scores: • Authorities: pages with useful information about a topic • Hubs: pages that contain many links to pages with useful information on the topic • Query dependent and slow • May lead to topic contamination or drift
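The mutual reinforcement between hubs and authorities can be sketched as an iterative update over a link graph. This is a minimal version, assuming the query-specific subgraph has already been collected; the edge list and the iteration count are hypothetical.

```python
import math

def hits(edges, iterations=50):
    """edges: list of (source, target) links in the query-specific subgraph."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # A page's hub score is the sum of the authority scores it links to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        hub = new_hub
        # Normalize both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Hypothetical subgraph: three hub-like pages pointing at two resources.
toy_edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"),
             ("h2", "a2"), ("h3", "a2")]
hub_scores, auth_scores = hits(toy_edges)
best_authority = max(auth_scores, key=auth_scores.get)
```

Since the subgraph is built per query, this computation runs at query time, which is one reason the slide calls HITS query dependent and slow.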
Distillation techniques • ARC and CLEVER: • ARC (Automatic Resource Compiler): part of CLEVER • Root set is expanded by 2 links instead of 1 link (including all pages that are at link-distance two or less from at least one page in the root set) • Assigns weights to the hyperlinks, based on the match between the query and the text surrounding the hyperlink in the source document
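One simple way to turn the match between the query and the text around a link into an edge weight is to count matching terms. The sketch below assumes a weight of 1 plus the number of query-term occurrences in the anchor window; the function name, the whitespace tokenization, and the sample strings are hypothetical.

```python
def anchor_weight(query, anchor_window):
    """Edge weight: 1 plus the number of query-term occurrences near the link."""
    query_terms = set(query.lower().split())
    window_terms = anchor_window.lower().split()
    # Every link gets a base weight of 1; matching terms raise it.
    return 1 + sum(1 for term in window_terms if term in query_terms)

strong_link = anchor_weight("mountain bike", "best mountain bike trails")
plain_link = anchor_weight("mountain bike", "click here")
```

These weights would then replace the uniform edge weights in a hub/authority iteration, so links whose surrounding text matches the query carry more influence.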
Distillation techniques • Outlier filtering: • Computes relevance weights for pages using the Vector Space Model • All pages whose weights fall below a threshold are pruned • Effectively prunes away outlier nodes in the neighborhood, thus avoiding contamination
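The pruning step can be sketched with raw term-frequency vectors and cosine similarity. This is an illustration only: the topic description, the threshold of 0.2, and the page texts are assumptions, and a fuller version would typically use TF-IDF weights.

```python
import math
from collections import Counter

def tf_vector(text):
    """Bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def prune_outliers(pages, topic_text, threshold=0.2):
    """Keep only pages whose relevance weight reaches the threshold."""
    topic = tf_vector(topic_text)
    return {url: text for url, text in pages.items()
            if cosine(tf_vector(text), topic) >= threshold}

# Hypothetical neighborhood of a root set about cycling.
neighborhood = {
    "p1": "mountain bike trails and bike repair",
    "p2": "bike racing news",
    "p3": "cheap flights and hotel deals",
}
kept = prune_outliers(neighborhood, "bike cycling trails")
```

Only the pages that survive the filter would be passed on to the distillation step.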
Topic distillation vs. Resource discovery • Topic distillation: • Depends on large, comprehensive Web crawls and indices (post-processing) • Can it be used to generate a Web taxonomy? • Set a keyword query for each node in the taxonomy • Run a distillation program • Simple, but has some problems
Topic distillation vs. Resource discovery • Problems: • Constructing the query: involves trial, error and complicated thought • Query: “North American telecommunication companies” • Query: +"power suppl*" "switch* mode" smps -multiprocessor* "uninterrupt* power suppl*" ups -parcel (for the Yahoo! node /Business&Economy/Companies/Electronics/PowerSupplies) • To match the directory-based browsing quality of: • Yahoo!: 7.03 terms and 4.34 operators • Alta Vista: 2.35 terms and 0.41 operators
Topic distillation vs. Resource discovery • Problems: • Contamination • Stop-sites: not automatic • Term weighting • Edge weighting: no precise algorithm to set the weights • Topic distillation by itself is not enough for resource discovery
Hypertext classification: learning from examples • Adding example pages and their distance-1 neighbors to the graph to be distilled improves the result • The contents of the given examples and their neighbors provide a way to compute the decision boundary for classification • Nearest-neighbor (NN), Bayesian and support vector classifiers
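As one concrete instance of learning from examples, a Bayesian text classifier can be sketched in a few lines. This is a minimal multinomial naive Bayes with add-one smoothing, not the classifier used in the original work; the training examples and topic labels are hypothetical, and a hypertext version would also fold in text or topic labels from neighboring pages.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (text, label) pairs supplied as training pages."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        word_counts[label].update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab

def classify(model, text):
    """Return the label with the highest posterior log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((word_counts[label][word] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical example pages for two taxonomy nodes.
training_pages = [
    ("bike trails and cycling gear", "cycling"),
    ("road bike racing", "cycling"),
    ("stock market prices", "finance"),
    ("bank interest rates and stock funds", "finance"),
]
model = train_naive_bayes(training_pages)
predicted_topic = classify(model, "cycling bike shop")
```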
Hypertext classification • Link-based features: important • Circular topic influence • Topic of one page influences its text and its neighbor page’s topic • Knowledge of the linked vicinity’s topic provides clues for the test document’s topic • Bibliometric, more general than the simple linear endorsement model used in topic distillation
Conclusion • Emphasized the importance of scalable automatic resource discovery • Argued that common search engines are not adequate for resource discovery • Introduced the recently invented focused crawling system
Future Work • How can the training examples be derived automatically? • How can the output of a focused crawler be personalized for users?