
Discovering the Unique Challenges of Web Information Retrieval

Explore the distinctive characteristics and challenges of Web IR, including data heterogeneity, duplication, and non-running text. Understand user query behaviors, document languages, and the 'HITS' scoring method.


Presentation Transcript


  1. Internet Resources Discovery (IRD) Web IR T.Sharon - A.Frank

  2. Web IR • What’s Different about Web IR? • Web IR Queries • How to Compare Web Search Engines? • The ‘HITS’ Scoring Method

  3. What’s different about the Web? • Bulk: ~500M pages, growing at 20M/month • Lack of stability: change estimates range from 1%/day to 1%/week • Heterogeneity • Types of documents: text, pictures, audio, scripts... • Quality • Document languages: 100+ • Duplication • Non-running text • High linkage: 8 links/page on average

  4. Taxonomy of Web Document Languages • Metalanguages: SGML, XML • Languages: HyTime, TEI Lite, HTML (SGML-based); XHTML, RDF, MathML, SMIL (XML-based) • Style sheets: DSSSL, XSL, CSS

  5. Non-running Text

  6. What’s different about the Web Users? • Make poor queries: short (2.35 terms on average), imprecise terms, sub-optimal syntax (80% without operators), low effort • Wide variance in needs, expectations, knowledge, and bandwidth • Specific behavior: 85% look at only one result screen; 78% of queries are not modified

  7. Why don’t the Users get what they Want? Example: • User need: I need to get rid of mice in the basement • User request (verbalized): What’s the best way to trap mice alive? (translation problems) • Query to IR system: mouse trap (polysemy, synonymy) • Results: computer supplies, software, etc.

  8. AltaVista: Mouse trap

  9. AltaVista: Mice trap

  10. Challenges on the Web • Distributed data • Dynamic data • Large volume • Unstructured and redundant data • Data quality • Heterogeneous data

  11. Web IR Advantages • High Linkage • Interactivity • Statistics • easy to gather • large sample sizes

  12. Evaluation in the Web Context • Quality of pages varies widely • Relevance alone is not enough • The value of a page requires both relevance and high quality

  13. Example of Web IR Query Results

  14. How to Compare Web Search Engines? • Search engines hold huge repositories! • Search engines hold different resources! • Solution: precision at top 10 • % of the top 10 returned pages that are relevant (“ranking quality”) [Diagram: Resources, Retrieved (Ret), Relevant, Relevant Returned (RR)]
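
To make the “precision at top 10” comparison concrete, here is a minimal Python sketch; the function name, the sample URLs, and the relevance judgments are illustrative assumptions, not part of the slides.

    # Precision at top k: fraction of the top-k returned pages judged relevant.
    def precision_at_k(ranked_urls, relevant_urls, k=10):
        top_k = ranked_urls[:k]
        if not top_k:
            return 0.0
        hits = sum(1 for url in top_k if url in relevant_urls)
        return hits / len(top_k)

    # Hypothetical usage: score one engine's top-10 ranking against human judgments.
    engine_results = ["u1", "u2", "u3", "u4", "u5", "u6", "u7", "u8", "u9", "u10"]
    judged_relevant = {"u1", "u3", "u4", "u7"}
    print(precision_at_k(engine_results, judged_relevant))  # 0.4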

  15. The ‘HITS’ Scoring Method • New method from 1998: • improved quality • reduced number of retrieved documents • Based on the Web’s high linkage • Simplified implementation in Google (www.google.com) • Advanced implementation in Clever • Reminder: hypertext is a nonlinear graph structure

  16. ‘HITS’ Definitions • Authorities: good sources of content • Hubs: good sources of links

  17. ‘HITS’ Intuition • Authority comes from in-edges; being a hub comes from out-edges. • Better authority comes from in-edges from hubs; being a better hub comes from out-edges to authorities.

  18. ‘HITS’ Algorithm • Repeat until HUB and AUTH converge: • Normalize HUB and AUTH • HUB[v] := Σ AUTH[u_i] for all u_i with Edge(v, u_i) • AUTH[v] := Σ HUB[w_i] for all w_i with Edge(w_i, v)
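
The update rules above can be read as a simple iterative procedure. Below is a hedged Python sketch of that iteration, assuming the Web neighbourhood is given as an adjacency list mapping each page to the pages it links to; the fixed iteration count stands in for the convergence test on the slide.

    import math

    def hits(graph, iterations=50):
        """Compute hub and authority scores for a directed graph
        given as {page: [pages it links to]}."""
        nodes = set(graph) | {v for targets in graph.values() for v in targets}
        hub = {n: 1.0 for n in nodes}
        auth = {n: 1.0 for n in nodes}

        for _ in range(iterations):
            # HUB[v] := sum of AUTH[u] over all u with Edge(v, u)
            new_hub = {n: 0.0 for n in nodes}
            for v, targets in graph.items():
                for u in targets:
                    new_hub[v] += auth[u]
            # AUTH[v] := sum of HUB[w] over all w with Edge(w, v)
            new_auth = {n: 0.0 for n in nodes}
            for w, targets in graph.items():
                for v in targets:
                    new_auth[v] += new_hub[w]
            # Normalize HUB and AUTH so the scores stay bounded
            hub_norm = math.sqrt(sum(x * x for x in new_hub.values())) or 1.0
            auth_norm = math.sqrt(sum(x * x for x in new_auth.values())) or 1.0
            hub = {n: x / hub_norm for n, x in new_hub.items()}
            auth = {n: x / auth_norm for n, x in new_auth.items()}

        return hub, auth

    # Tiny illustrative graph: "h1" and "h2" act as hubs, "a1" as the strongest authority.
    example = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
    hub_scores, auth_scores = hits(example)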

  19. Google Output: Princess Diana

  20. Prototype Implementation (Clever) 1. Selecting documents using the index (the root set) 2. Adding linked documents (expanding to the base set) 3. Iterating over the base set to find hubs and authorities (a sketch of the first two steps follows below)
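
A hedged sketch of the root-set/base-set construction (steps 1 and 2), using plain dictionaries to stand in for the search index and the link database; the names index_lookup, forward_links, and back_links are illustrative assumptions, not an actual Clever API.

    def build_base_set(query, index_lookup, forward_links, back_links, max_back=50):
        root = set(index_lookup.get(query, []))               # 1. select documents using the index
        base = set(root)
        for page in root:                                     # 2. add linked documents
            base.update(forward_links.get(page, []))          #    pages the root page points to
            base.update(back_links.get(page, [])[:max_back])  #    a bounded sample of in-linking pages
        return base                                           # 3. HITS is then iterated over this base set

    # Tiny illustrative data
    index_lookup = {"jaguar": ["p1", "p2"]}
    forward_links = {"p1": ["p3"], "p2": ["p3", "p4"]}
    back_links = {"p1": ["p5"], "p2": []}
    print(build_base_set("jaguar", index_lookup, forward_links, back_links))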

  21. By-products • Separates Web sites into clusters. • Reveals the underlying structure of the World Wide Web.
