Explore the distinctive characteristics and challenges of Web IR, including data heterogeneity, duplication, and non-running text. Understand user query behaviors, document languages, and the 'HITS' scoring method.
Internet Resources Discovery (IRD) Web IR T.Sharon - A.Frank
Web IR
• What’s Different about Web IR?
• Web IR Queries
• How to Compare Web Search Engines?
• The ‘HITS’ Scoring Method

What’s different about the Web?
• Bulk: 500M; growth at 20M/month
• Lack of stability: estimates range from 1%/day to 1%/week
• Heterogeneity
  • Types of documents: text, pictures, audio, scripts...
  • Quality
  • Document languages: 100+
• Duplication
• Non-running text
• High linkage: 8 links/page average

Taxonomy of Web Document Languages
[Figure: taxonomy diagram with three groups (metalanguages, languages, style sheets) covering SGML, XML, HyTime, DSSSL, XSL, CSS, TEI Lite, HTML, XHTML, RDF, MathML, and SMIL]

Non-running Text

What’s different about the Web Users?
• Make poor queries
  • short (2.35 terms average)
  • imprecise terms
  • sub-optimal syntax (80% without operators)
  • low effort
• Wide variance in
  • Needs
  • Expectations
  • Knowledge
  • Bandwidth
• Specific behavior
  • 85% look over one result screen only
  • 78% of queries are not modified

Why don’t the Users get what they Want?
Example:
• User need: “I need to get rid of mice in the basement”
• User request (verbalized): “What’s the best way to trap mice alive?”
• Query to IR system: “Mouse trap”
• Results: computer supplies, software, etc.
• Translation problems arise between the need, the request, and the query; polysemy and synonymy affect the step from query to results.

Alta Vista: Mouse trap

Alta Vista: Mice trap

Challenges on the Web
• Distributed data
• Dynamic data
• Large volume
• Unstructured and redundant data
• Data quality
• Heterogeneous data

Web IR Advantages
• High Linkage
• Interactivity
• Statistics
  • easy to gather
  • large sample sizes

Evaluation in the Web Context
• Quality of pages varies widely
• Relevance is not enough
• We need both relevance and high quality; together they give the value of a page

Example of Web IR Query Results

How to Compare Web Search Engines?
• Search engines hold huge repositories!
• Search engines hold different resources!
• Solution: Precision at top 10
  • % of top 10 pages that are relevant (“ranking quality”)
[Figure: the Retrieved (Ret) set and the Relevant Returned (RR) set within Resources]

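Precision at top 10 can be computed directly from a ranked result list and a set of relevance judgments. The sketch below is illustrative only: the function name `precision_at_k`, the page identifiers, and the choice to divide by k even when fewer than k results come back are assumptions, not something the slides specify.

```python
def precision_at_k(ranked_results, relevant, k=10):
    """Fraction of the top-k ranked results that are judged relevant.

    Assumption: the denominator is always k, so an engine that returns
    fewer than k results is not rewarded for a short list.
    """
    top_k = ranked_results[:k]
    hits = sum(1 for page in top_k if page in relevant)
    return hits / k


# Hypothetical ranking of ten pages and a hand-made set of relevant pages.
ranking = ["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8", "p9", "p10"]
judged_relevant = {"p1", "p3", "p4", "p8", "p42"}

print(precision_at_k(ranking, judged_relevant))  # 4 of the top 10 are relevant
```

Because the measure only looks at the first result screen, it can be applied to engines with different repositories without knowing the full set of relevant pages on the Web.
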
The ‘HITS’ Scoring Method
• New method from 1998:
  • improved quality
  • reduced number of retrieved documents
• Based on the Web’s high linkage
• Google (www.google.com) uses a related link-analysis method, PageRank
• Advanced implementation in Clever
• Reminder: hypertext is a nonlinear graph structure

‘HITS’ Definitions
• Authorities (A): good sources of content
• Hubs (H): good sources of links

‘HITS’ Intuition
• Authority comes from in-edges. Being a hub comes from out-edges.
• Better authority comes from in-edges from hubs. Being a better hub comes from out-edges to authorities.

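The first half of this intuition amounts to counting edges: in-degree as a crude authority signal and out-degree as a crude hub signal. A minimal sketch, assuming the link graph is given as a list of (source, target) pairs; the name `degree_scores` and the page names are made up for illustration.

```python
from collections import Counter


def degree_scores(edges):
    """Return (in_degree, out_degree) counters for a list of (src, dst) links."""
    in_degree = Counter(dst for _, dst in edges)   # crude authority signal
    out_degree = Counter(src for src, _ in edges)  # crude hub signal
    return in_degree, out_degree


# Hypothetical link graph: h1 and h2 link out, a1 and a2 are linked to.
links = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
in_degree, out_degree = degree_scores(links)
print(in_degree["a1"], out_degree["h1"])
```

Plain degree counts treat every link as equally valuable; the second half of the intuition is what weights a link by the quality of the page at its other end.
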
‘HITS’ Algorithm
Repeat until HUB and AUTH converge:
  Normalize HUB and AUTH
  HUB[v] := Σ AUTH[ui] for all ui with Edge(v,ui)
  AUTH[v] := Σ HUB[wi] for all wi with Edge(wi,v)
[Figure: node v with in-edges from w1, w2, ..., wk and out-edges to u1, u2, ..., uk]

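A minimal Python sketch of this iteration, assuming each update is a sum over neighbouring pages and that scores are scaled to unit Euclidean length. The function name `hits`, the edge-list input, and the `max_iter` and `tol` parameters are illustrative assumptions; normalization is done at the end of each pass rather than at the start, which changes only the scale of the scores, not their ranking.

```python
import math


def _normalize(scores):
    """Scale a score dict to unit Euclidean length (left as-is if all zero)."""
    norm = math.sqrt(sum(x * x for x in scores.values()))
    if norm == 0.0:
        return dict(scores)
    return {node: x / norm for node, x in scores.items()}


def hits(edges, max_iter=100, tol=1e-9):
    """Iterate hub/authority updates over (src, dst) links until they settle."""
    nodes = {n for edge in edges for n in edge}
    if not nodes:
        return {}, {}
    hub = _normalize({n: 1.0 for n in nodes})
    auth = _normalize({n: 1.0 for n in nodes})
    for _ in range(max_iter):
        # HUB[v] := sum of AUTH[u] over all edges (v, u)
        new_hub = {n: 0.0 for n in nodes}
        for v, u in edges:
            new_hub[v] += auth[u]
        # AUTH[v] := sum of HUB[w] over all edges (w, v)
        new_auth = {n: 0.0 for n in nodes}
        for w, v in edges:
            new_auth[v] += new_hub[w]
        new_hub, new_auth = _normalize(new_hub), _normalize(new_auth)
        # Largest change in any score since the previous pass.
        delta = max(
            max(abs(new_hub[n] - hub[n]) for n in nodes),
            max(abs(new_auth[n] - auth[n]) for n in nodes),
        )
        hub, auth = new_hub, new_auth
        if delta < tol:
            break
    return hub, auth


# Hypothetical graph: three hub-like pages pointing at two content pages.
graph = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
hub_scores, auth_scores = hits(graph)
print(max(auth_scores, key=auth_scores.get))  # page with the top authority score
print(max(hub_scores, key=hub_scores.get))    # page with the top hub score
```

In this toy graph, a1 receives links from all three hubs and h1 links to both content pages, so they end up with the highest authority and hub scores respectively.
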
Google Output: Princess Diana

Prototype Implementation (Clever)
1. Selecting documents using index (root)
2. Adding linked documents
3. Iterating to find hubs and authorities
[Figure: the Root set nested inside the larger Base set]

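Steps 1 and 2 can be sketched as growing a root set into a base set by following links in both directions. This is an assumption-laden illustration, not Clever's actual code: the name `build_base_set`, the dictionary-based link tables, and the `max_in_links` cap on pages pointing into each root page are all hypothetical.

```python
def build_base_set(root, out_links, in_links, max_in_links=50):
    """Expand a root set of pages with the pages they link to and from.

    root:       iterable of page ids returned by the index for a query
    out_links:  dict mapping a page to the pages it links to
    in_links:   dict mapping a page to the pages that link to it
    max_in_links: assumed cap so that one very popular root page
                  cannot flood the base set with its in-links
    """
    base = set(root)
    for page in root:
        base.update(out_links.get(page, ()))
        # Sort only to make the truncation deterministic in this sketch.
        base.update(sorted(in_links.get(page, ()))[:max_in_links])
    return base


# Hypothetical root set and link tables.
root_set = {"r1", "r2"}
links_out = {"r1": ["x", "y"], "r2": ["y"]}
links_in = {"r1": ["p"], "r2": ["q1", "q2", "q3"]}

base_set = build_base_set(root_set, links_out, links_in, max_in_links=2)
print(sorted(base_set))
```

Step 3 then runs the hub and authority iteration on the links among the pages of the base set only.
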
By-products
• Separates Web sites into clusters.
• Reveals the underlying structure of the World Wide Web.