
Statistical and AI Techniques in Information Retrieval

Discover how statistical and AI techniques enhance information retrieval, including keyword importance and phrase discovery. Learn about Boolean and ranking models, statistical measures, AI techniques, and more in this comprehensive lecture. Explore the advantages and disadvantages of Boolean and ranked retrieval systems.


Presentation Transcript


  1. CS533 Information Retrieval Dr. Michal Cutler Lecture #2 January 27, 2000

  2. This Lecture • 1.2 Statistical/artificial intelligence techniques • 1.3 Boolean and ranking models • The Boolean model

  3. 1.2 Statistical/artificial intelligence techniques • Statistical techniques • Artificial intelligence

  4. Statistical • No attempt to understand text • Based on statistical measures • Which keywords are important for retrieval? • How do we discover phrases?

  5. Statistical measures • Words occurring many times in document - are important for retrieval • Frequency of occurrences of words • Adjacent pairs of words occurring many times in any order in database - are useful phrases • Frequency of occurrences of pairs
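
The two measures on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions not in the lecture: the tokenizer, the function names, and the two sample sentences are made up, and pairs are counted without regard to word order, as the slide suggests.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(text):
    """Count how often each word occurs in a single document."""
    return Counter(tokenize(text))


def pair_frequencies(texts):
    """Count adjacent word pairs across a collection, ignoring order within a pair."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for left, right in zip(tokens, tokens[1:]):
            counts[tuple(sorted((left, right)))] += 1
    return counts


# Hypothetical two-document "database".
docs = [
    "Information retrieval systems rank documents.",
    "Retrieval of information is the goal of information retrieval.",
]
word_counts = word_frequencies(docs[1])
pair_counts = pair_frequencies(docs)
print(word_counts.most_common(3))
print(pair_counts.most_common(3))
```

Words or pairs with high counts would then be candidates for index terms and phrases; a real system would also filter out very common function words.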

  6. Artificial Intelligence (AI) • Techniques which should enhance retrieval results • Knowledge representation • Natural Language Understanding • Machine learning • Machine translation

  7. Knowledge representation • Knowledge representation - one of most active research areas • Hard problems • Representing common sense knowledge • Updating and scaling up the knowledge • IR model which will be covered

  8. Natural Language Understanding (NLU) • Template filling in narrow domains • (attributes of terrorist stories, for example). • Can answer some questions • Sentence parsing and phrase generation will be covered

  9. Machine learning • Most frequently used techniques in IR include: • Symbolic, inductive learning algorithms such as ID3 • Multiple-layered, feed-forward neural networks • Evolution-based genetic algorithms

  10. Statistical or AI Techniques • Most successful IR systems based on: • statistical techniques, and • some limited AI • Web-based intelligent agents use both • Specialized for a given domain

  11. AI and IR • Started at about the same time • Feigenbaum and Feldman - “Computers and Thought” McGraw Hill, 1963. • Minsky - “Semantic Information Processing” MIT Press, 1968. • Salton - “Automatic Information Organization and Retrieval” McGraw Hill, 1968.

  12. The early AI approach • Build small domain specific intelligent systems • Use knowledge learned from small domain to build intelligent system for general domain

  13. Some early AI systems • Bobrow - Natural language input for a computer problem solving system (STUDENT solved Algebra problems) • Raphael - SIR Semantic Information Retrieval could accept input statements in a very narrow subset of English and answer questions

  14. Natural Language Understanding (NLU) • Problems with approach - scaling up • To understand natural language need to represent common knowledge • Minsky claimed that 100,000 facts should suffice to represent all such knowledge

  15. Capability • IR systems can retrieve from data bases of millions of documents • NLU based systems retrieve from data bases of thousands of documents

  16. 1.3 Boolean and ranked retrieval • Characteristics, advantages and disadvantages of Boolean systems • The concept of rank • Ranking retrieval models

  17. Characteristics • Query formulated by joining: keywords with Boolean operators such as AND, OR and NOT • Vocabulary is often controlled • Thesaurus may be available

  18. Advantages of Boolean Systems • Easy to understand behavior • Enables formulating complex very specific queries

  19. Disadvantages of Boolean Systems • Difficult to formulate complex Boolean query • Output order is not by relevance

  20. Disadvantages of Boolean Systems • All or nothing systems • When users specify (A and B and C and D) should an item with A, B, and C but not D be rejected? • Are all query terms equally important? • Difficult to control size of output. • Too much or too little

  21. The concept of rank • Retrieved documents ordered by decreasing "goodness" (increasing rank) • Rank often computed using a similarity function that compares a document and a query
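
As a rough illustration of ranking by a similarity function, the sketch below scores each document by the fraction of query terms it contains. The slide does not name a specific function, so this overlap measure, the function names, and the three toy documents are all assumptions.

```python
def similarity(query_terms, doc_terms):
    """Hypothetical similarity: fraction of query terms present in the document."""
    query = set(query_terms)
    if not query:
        return 0.0
    return len(query & set(doc_terms)) / len(query)


def rank(query_terms, docs):
    """Return document ids ordered by decreasing similarity to the query."""
    scored = [(similarity(query_terms, terms), doc_id) for doc_id, terms in docs.items()]
    # Sort by score descending; break ties by document id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


# Made-up collection: document id -> index terms.
docs = {
    1: ["curve", "fitting", "interpolation"],
    2: ["curve", "complexity"],
    3: ["algorithm", "evaluation"],
}
ranking = rank(["curve", "fitting"], docs)
print(ranking)
```

Unlike a Boolean AND, document 2 is still returned even though it matches only one of the two query terms; it simply appears lower in the list.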

  22. Advantages of ranked systems • In successful IR systems a high percentage of the top documents are useful to users

  23. Disadvantages of ranked systems • Behavior of system harder to understand

  24. Ranking IR system - models • Vector space • Fuzzy Boolean • Probabilistic

  25. Ranking IR system - models • Knowledge based • Latent semantic indexing • Inference nets • Neural network and genetic algorithms

  26. Boolean Operators • The and operator (A and B) • Presence of both keywords in document required • For example: “president and Clinton” • both “president” and “Clinton” must occur in a retrieved document

  27. Boolean Operators • The or operator (A or B) • Presence of either or both components • For example: “president or Clinton” • Either “president” or “Clinton” (or both) must occur in a retrieved document

  28. Boolean Operators • The not operator (not A) • Requires the absence of a keyword • For example: “president and (not Clinton)” • Requires the presence of “president” and the absence of “Clinton” in a retrieved document • Used in conjunction with and in retrieval

  29. Boolean Operators • Other operators • xor • adjacency (“Information adjacent retrieval”, “curriculum within 5 words of information”, “logic and inference in the same paragraph”)
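
A proximity operator such as "within 5 words" can be sketched by comparing token positions. The exact semantics vary between systems; this sketch assumes that distance is the difference in word positions and that either order is accepted, and both the helper name `within` and the sample sentence are invented for illustration.

```python
def within(tokens, first, second, max_distance):
    """True if `first` and `second` occur at most `max_distance` positions apart, in either order."""
    first_positions = [i for i, token in enumerate(tokens) if token == first]
    second_positions = [i for i, token in enumerate(tokens) if token == second]
    return any(
        abs(i - j) <= max_distance
        for i in first_positions
        for j in second_positions
    )


# Hypothetical document text, already split into words.
tokens = "the curriculum covers logic and information retrieval".split()

adjacent = within(tokens, "information", "retrieval", 1)   # adjacency: distance 1
nearby = within(tokens, "curriculum", "information", 5)    # within 5 words
too_far = within(tokens, "curriculum", "retrieval", 2)
print(adjacent, nearby, too_far)
```

Supporting such operators requires the index to store word positions, not only document identifiers.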

  30. Boolean Queries • Parentheses are needed to specify precedence among operations • ((A and not B) or C) • (A and not (B or C)) • (A or C) and (not B)
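
To see why the parentheses matter, the three queries on this slide can be evaluated against one document. The nested-tuple representation and the `matches` function below are assumptions made for this sketch, not a notation used in the lecture.

```python
def matches(query, words):
    """Evaluate a nested Boolean query against the set of words in one document."""
    if isinstance(query, str):
        return query in words
    op, *args = query
    if op == "not":
        return not matches(args[0], words)
    if op == "and":
        return all(matches(arg, words) for arg in args)
    if op == "or":
        return any(matches(arg, words) for arg in args)
    raise ValueError(f"unknown operator: {op}")


# A document that contains all three keywords.
doc = {"A", "B", "C"}

q1 = ("or", ("and", "A", ("not", "B")), "C")   # ((A and not B) or C)
q2 = ("and", "A", ("not", ("or", "B", "C")))   # (A and not (B or C))
q3 = ("and", ("or", "A", "C"), ("not", "B"))   # (A or C) and (not B)

results = [matches(q, doc) for q in (q1, q2, q3)]
print(results)
```

The same three keywords and operators give different answers for the same document, depending only on how the expression is grouped.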

  31. Document Indexing • Consider the document: “Algorithm complexity evaluation with curve fitting and interpolation” • Dictionary- all words excluding “with” and “and”

  32. Document dictionary • algorithm • complexity • curve • evaluation • fitting • interpolation
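
The dictionary for this one document can be reproduced with a short sketch. The stop-word set contains only the two words the slide excludes; the function name and the whitespace tokenization are assumptions.

```python
# The two words the slide excludes from the dictionary.
STOP_WORDS = {"with", "and"}


def build_dictionary(text):
    """Return the sorted, de-duplicated index terms of a document, minus stop words."""
    words = text.lower().split()
    return sorted(set(words) - STOP_WORDS)


document = "Algorithm complexity evaluation with curve fitting and interpolation"
dictionary = build_dictionary(document)
print(dictionary)
```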

  33. The inverted index file • The file contains: • The dictionary and • The lists of document identifiers in which each index term occurred

  34. Inverted-index file - example • In a database with 250 documents: • curve: 12, 25, 36, 89, 125, 128, 215 • fitting: 11, 12, 17, 36, 78, 136, 215 • interpolation: 11, 18, 36, 125, 132
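
One possible way to build such a file is sketched below. The three toy documents are invented; their identifiers are chosen only to be consistent with the lists on the slide, whose actual document texts are not given.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each index term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sort terms (the dictionary) and each list of document identifiers.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


# Hypothetical documents, already reduced to their index terms.
docs = {
    11: "fitting interpolation",
    12: "curve fitting",
    36: "curve fitting interpolation",
}
index = build_inverted_index(docs)
for term, doc_ids in index.items():
    print(term, doc_ids)
```

Keeping each list sorted by document identifier makes the later intersection and union steps straightforward.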

  35. The retrieval • Only the inverted index file and the query are used • For query (A and B) - use the set intersection operation on the two inverted lists

  36. The retrieval • For query (A or B) - use the set union operation on the two inverted lists. • For query (A and (not B)) - use the set difference operation on the two inverted lists
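
The three list operations from the two retrieval slides might look as follows for sorted lists of document identifiers. The linear merge used for the intersection is one common implementation choice, not something the lecture prescribes, and the lists `a` and `b` are made-up data.

```python
def intersect(a, b):
    """(A and B): linear merge of two sorted inverted lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """(A or B): every identifier that appears in either list."""
    return sorted(set(a) | set(b))


def difference(a, b):
    """(A and (not B)): identifiers in a that do not appear in b."""
    excluded = set(b)
    return [doc_id for doc_id in a if doc_id not in excluded]


# Hypothetical inverted lists for two terms A and B.
a = [1, 3, 5, 8]
b = [3, 4, 8, 9]
print(intersect(a, b), union(a, b), difference(a, b))
```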

  37. Example • Let the query be ((curve and fitting) or interpolation)

  38. Example • The lists for these terms are: • curve:{12, 25, 36, 89, 125, 128, 215} • fitting:{11, 12, 17, 36, 78, 136, 215} • interpolation: {11, 18, 36, 125, 132} • For (curve and fitting) the intersection results in {12, 36, 215} • For ((curve and fitting) or interpolation) the union is {11, 12, 18, 36, 125, 132, 215}
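
The worked example can be checked with Python's built-in set type, using exactly the three lists given on the slide. Only the variable names are assumptions.

```python
# Inverted lists from the example, stored as sets of document identifiers.
index = {
    "curve": {12, 25, 36, 89, 125, 128, 215},
    "fitting": {11, 12, 17, 36, 78, 136, 215},
    "interpolation": {11, 18, 36, 125, 132},
}

# (curve and fitting): set intersection.
curve_and_fitting = index["curve"] & index["fitting"]

# ((curve and fitting) or interpolation): set union with the third list.
result = curve_and_fitting | index["interpolation"]

print(sorted(curve_and_fitting))  # [12, 36, 215]
print(sorted(result))             # [11, 12, 18, 36, 125, 132, 215]
```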
