
IS530 Lesson 12 Boolean vs. Statistical Retrieval Systems


nura


Presentation Transcript


  1. IS530 Lesson 12 Boolean vs. Statistical Retrieval Systems

  2. Boolean or Statistical? • Most web search engines default to statistical retrieval and offer Boolean for advanced searching • Most proprietary online systems default to Boolean and offer statistical retrieval as an alternative • Statistical search engine vs. relevance ranking of Boolean results

  3. Web Search Engines • Databases generated by robotic programs (non-human) • spiders, wanderers, web walkers, agents • Full-text indexing of website contents • Support advanced, complex search strategies

  4. 3 Parts of a Web Search Engine 1. Spider or web-crawler • reads webpage, follows links 2. Index • catalogs webpages read by spider 3. Search engine software • matches queries • lists most relevant site first
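
The index and the query-matching software described above can be sketched in a few lines. This is a minimal illustration, not how any particular engine works: the `pages` dictionary stands in for text a spider has already fetched, and the URLs, `build_index`, and `search` are hypothetical names chosen for the example. Ranking here is simply the number of distinct query words a page contains.

```python
from collections import defaultdict

# Hypothetical pages a spider might already have fetched (URL -> text).
pages = {
    "a.example/ice": "slipping on ice and snow",
    "b.example/risk": "assumption of risk in winter",
    "c.example/snow": "snow removal and risk",
}

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Rank documents by how many distinct query words they contain."""
    scores = defaultdict(int)
    for word in set(query.lower().split()):
        for doc_id in index.get(word, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken alphabetically by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))

index = build_index(pages)
results = search(index, "snow risk")
# "c.example/snow" contains both query words, so it is listed first.
```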

  5. 3 Parts of an Online System 1) Database building software (dataware) (follows rules with known fields) 2) Index/dictionary file (list of all words and sometimes phrases in the indexed fields) 3) Search engine software (matches queries; Boolean or statistical; output in LIFO or relevance order)

  6. Boolean Operators • AND: limits search, decreases hits, increases precision • OR: expands search, increases hits, decreases precision • NOT: limits search; seldom used because it is too strong • Proximity operators (Adj, (N)ear, (W)ith): limit a search, increase precision
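
One way to see why AND narrows a search and OR broadens it is to treat each term's index entry as a set of document numbers. The sketch below uses a made-up `postings` dictionary; the terms and document numbers are illustrative only.

```python
# Hypothetical index entries: term -> set of document numbers containing it.
postings = {
    "ice": {1, 2, 5},
    "snow": {2, 3},
    "risk": {2, 5, 7},
}

# AND keeps only documents that contain both terms (fewer hits).
and_hits = postings["ice"] & postings["risk"]    # {2, 5}

# OR keeps documents that contain either term (more hits).
or_hits = postings["ice"] | postings["snow"]     # {1, 2, 3, 5}

# NOT removes every document that contains the excluded term.
not_hits = postings["risk"] - postings["snow"]   # {5, 7}
```

Document 2 is dropped by the NOT query even though it mentions risk, which illustrates why NOT is often considered too strong.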

  7. Command Interface Boolean Searching (Westlaw) • Find information about the assumption of risk involving people who fall after slipping in wintry conditions. • assum! /5 risk /p (ic* or snow****) /p (slip! or fell or fall***)
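
The truncation and proximity parts of the query above can be approximated with regular expressions. This is a rough sketch under simplifying assumptions: `!` is taken to expand a root to any ending, each `*` is taken to stand for at most one character, and `/5` is taken to mean "within five words". The `/p` (same paragraph) connector is not modeled, and `term_to_regex` and `within` are hypothetical helper names.

```python
import re

def term_to_regex(term):
    """Translate a truncated term into a compiled regular expression.

    Assumed semantics: a trailing '!' matches any ending;
    each '*' matches zero or one word character.
    """
    if term.endswith("!"):
        pattern = re.escape(term[:-1]) + r"\w*"
    else:
        pattern = "".join(r"\w?" if ch == "*" else re.escape(ch) for ch in term)
    return re.compile(pattern, re.IGNORECASE)

def within(words, left, right, max_gap):
    """True if a word matching `left` lies within `max_gap` words of one matching `right`."""
    left_pos = [i for i, w in enumerate(words) if left.fullmatch(w)]
    right_pos = [i for i, w in enumerate(words) if right.fullmatch(w)]
    return any(abs(i - j) <= max_gap for i in left_pos for j in right_pos)

text = "The plaintiff assumed the risk when she slipped on the icy steps"
words = re.findall(r"\w+", text.lower())

# Rough equivalent of: assum! /5 risk
match = within(words, term_to_regex("assum!"), term_to_regex("risk"), 5)
```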

  8. Natural Language and Relevance Ranking (WIN) • I need information on assumption of risk involving a person who has fallen on ice or snow.

  9. Non-Boolean Retrieval Systems • Statistical (associative, probabilistic, or relevance systems) • Linguistic (semantic)

  10. Statistical Retrieval Systems • Incorporate relevance ranking • May incorporate relevance feedback • May have natural language interface • Used by almost all web search engines

  11. Algorithm • From Medieval Latin algorismus, after al-Khwarizmi, Persian mathematician (fl. AD 825) • Step-by-step procedure for solving mathematical problems • Merriam-Webster http://www.m-w.com/ • Statistical search engines use weighting algorithms to compute relevance

  12. Statistical Search Engines • Weighting algorithms are proprietary • Search engines differ in how they assign weights and compute relevance ranking • Search results differ • studies found only about 40% overlap
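
Overlap between two engines' results can be quantified in several ways; the slide does not say which measure the studies used. The sketch below assumes one common choice, the share of distinct results that both engines returned (Jaccard overlap). The result lists are invented.

```python
def overlap(results_a, results_b):
    """Fraction of all distinct results that appear in both lists."""
    a, b = set(results_a), set(results_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical top-5 results from two engines for the same query.
engine_a = ["u1", "u2", "u3", "u4", "u5"]
engine_b = ["u2", "u4", "u6", "u7", "u8"]

score = overlap(engine_a, engine_b)  # 2 shared out of 8 distinct = 0.25
```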

  13. Statistical Web Retrieval Factors • Popularity: number of other sites that link to a site; authoritative sites given heavier weight (e.g., Google) • Meta-tags may boost ranking (e.g., Inktomi/Overture) • Direct Hit may boost ranking (e.g., HotBot)
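
The popularity factor can be illustrated by counting incoming links. This is only a toy sketch: it counts each linking site once and does not weight authoritative sites more heavily, so it is not Google's actual algorithm. The `links` graph and the site names are made up.

```python
from collections import Counter

# Hypothetical link graph: site -> sites it links to.
links = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "d.example": ["c.example", "b.example"],
}

def inlink_counts(link_graph):
    """Count how many distinct other sites link to each site."""
    counts = Counter()
    for source, targets in link_graph.items():
        for target in set(targets):
            if target != source:  # ignore self-links
                counts[target] += 1
    return counts

popularity = inlink_counts(links)
ranked = [site for site, _ in popularity.most_common()]
# c.example has three inbound links, b.example has two.
```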

  14. Linguistic Retrieval System • Natural Language & Relevance Ranking • WIN - (Westlaw Is Natural) has some elements • I need information on assumption of risk involving a person who has fallen on ice or snow.

  15. WIN Steps 1. Enter query in plain English 2. System removes stop phrases 3. Matches legal phrases from thesaurus, adjusts weighting 4. Removes stop words

  16. WIN Steps (cont.) 5. Stemming 6. Searches database indexes in OR relationship 7. Statistical comparison applied 8. Results placed in ranked order
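
The eight steps can be sketched as a small pipeline. Everything here is a stand-in chosen for illustration: the stop-phrase list, the one-entry "thesaurus", the stop words, the suffix-stripping `crude_stem`, and the scoring rule are assumptions, not Westlaw's actual data or algorithm.

```python
import re

STOP_PHRASES = ["i need information on"]   # hypothetical stop phrases
LEGAL_PHRASES = ["assumption of risk"]     # hypothetical thesaurus entries
STOP_WORDS = {"a", "an", "the", "of", "on", "or", "who", "has", "involving"}

def crude_stem(word):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "en", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def prepare_query(query):
    """Steps 1-5: drop stop phrases, pull out legal phrases, drop stop words, stem."""
    text = query.lower()
    for phrase in STOP_PHRASES:
        text = text.replace(phrase, " ")
    phrases = [p for p in LEGAL_PHRASES if p in text]
    for p in phrases:
        text = text.replace(p, " ")
    words = [w for w in re.findall(r"[a-z]+", text) if w not in STOP_WORDS]
    return phrases, [crude_stem(w) for w in words]

def rank(docs, phrases, terms, phrase_weight=2.0):
    """Steps 6-8: match any term (OR), score each document, sort by score."""
    scored = []
    for doc_id, body in docs.items():
        lowered = body.lower()
        stems = [crude_stem(w) for w in re.findall(r"[a-z]+", lowered)]
        score = sum(stems.count(t) for t in terms)
        score += phrase_weight * sum(lowered.count(p) for p in phrases)
        if score > 0:
            scored.append((score, doc_id))
    return [doc_id for score, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]

docs = {
    "case1": "Assumption of risk where the plaintiff fell on ice.",
    "case2": "Snow removal contract dispute.",
    "case3": "Trademark infringement claim.",
}
phrases, terms = prepare_query(
    "I need information on assumption of risk involving a person "
    "who has fallen on ice or snow."
)
ranking = rank(docs, phrases, terms)
```

With these toy inputs the query reduces to the phrase "assumption of risk" plus the stems person, fall, ice, and snow; `case1` ranks above `case2`, and `case3` is not retrieved because it matches nothing.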

  17. Factors in Determining Relevance • Proximity of query words to each other • Position of query words (keywords in the title, in a headline, or near the top rank higher) • Relative length of document (“normalization”) • Stemming

  18. Factors in Determining Relevance (cont.) • Ignore very frequent terms • Inverse document frequency (terms that are rare across the collection weigh more) • Relevance feedback • Stop words • Query expansion/thesaurus
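
Two of the factors listed, document-length normalization and down-weighting of very frequent terms, can be shown with a textbook TF-IDF score. This is a generic sketch; real engines keep their weighting formulas proprietary, and the three documents below are invented.

```python
import math
from collections import Counter

# Hypothetical mini-collection.
docs = {
    "d1": "ice on the steps",
    "d2": "the ice and the snow on the long road to the town",
    "d3": "the risk of the contract",
}

def tokenize(text):
    return text.lower().split()

def idf(term, tokenized):
    """Inverse document frequency: terms found in fewer documents weigh more."""
    df = sum(1 for words in tokenized.values() if term in words)
    return math.log(len(tokenized) / df) if df else 0.0

def score(query, tokenized):
    """Sum over query terms of (term count / document length) * idf."""
    scores = {}
    for doc_id, words in tokenized.items():
        counts = Counter(words)
        total = 0.0
        for term in tokenize(query):
            total += (counts[term] / len(words)) * idf(term, tokenized)
        scores[doc_id] = total
    return scores

tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
scores = score("the ice", tokenized)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Because "the" occurs in every document its weight is zero, so only "ice" contributes. The short document `d1` outranks the longer `d2` even though both mention "ice" once, which is the effect of length normalization.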

  19. Features Users Can Control • Designating “bound phrases” • Flagging terms that must be present* • Specifying truncat? • Indicating (synonym groups) • Synonym dictionaries

  20. Web Sites that list search engines and features: www.pandia.com www.searchenginewatch.com http://notess.com
