CS533 Information Retrieval. Dr. Michal Cutler Lecture #2 January 27, 2000. This Lecture. 1.2 Statistical/artificial intelligence techniques 1.3 Boolean and ranking models The Boolean model. 1.2 Statistical/artificial intelligence techniques. Statistical techniques
CS533 Information Retrieval Dr. Michal Cutler Lecture #2 January 27, 2000
This Lecture 1.2 Statistical/artificial intelligence techniques 1.3 Boolean and ranking models The Boolean model
1.2 Statistical/artificial intelligence techniques • Statistical techniques • Artificial intelligence
Statistical • No attempt to understand text • Based on statistical measures • Which keywords are important for retrieval? • How do we discover phrases?
Statistical measures • Words that occur many times in a document are important for retrieval • Measured by the frequency of occurrence of words • Adjacent pairs of words that occur many times, in any order, in the database are useful phrases • Measured by the frequency of occurrence of pairs
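The two measures above can be sketched in a few lines of Python. This is only an illustration: the tokenizer, the function names and the three sample texts are assumptions made for the example, not part of the lecture.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic words in order."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(text):
    """Count how many times each word occurs in one document."""
    return Counter(tokenize(text))


def pair_frequencies(texts):
    """Count adjacent word pairs over a collection, ignoring word order."""
    pairs = Counter()
    for text in texts:
        words = tokenize(text)
        for left, right in zip(words, words[1:]):
            # Sorting the pair makes ("fitting", "curve") and
            # ("curve", "fitting") count as the same candidate phrase.
            pairs[tuple(sorted((left, right)))] += 1
    return pairs


# Hypothetical sample texts, chosen only to exercise the functions.
sample_texts = [
    "curve fitting and interpolation",
    "fitting curve data",
    "curve fitting methods",
]
print(word_frequencies("curve fitting curve").most_common(1))
print(pair_frequencies(sample_texts).most_common(1))
```

Sorting each pair before counting is one way to honor the "in any order" condition; a real system would also need thresholds to decide when a count is high enough to matter.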
Artificial Intelligence (AI) • Techniques which should enhance retrieval results • Knowledge representation • Natural Language Understanding • Machine learning • Machine translation
Knowledge representation • Knowledge representation - one of the most active research areas • Hard problems • Representing common sense knowledge • Updating and scaling up the knowledge • IR model which will be covered
Natural Language Understanding (NLU) • Template filling in narrow domains • (attributes of terrorist stories, for example). • Can answer some questions • Sentence parsing and phrase generation will be covered
Machine learning • Most frequently used techniques in IR include: • Symbolic, inductive learning algorithms such as ID3 • Multiple-layered, feed-forward neural networks • Evolution-based genetic algorithms
Statistical or AI Techniques • Most successful IR systems based on: • statistical techniques, and • some limited AI • Web-based intelligent agents use both • Specialized for a given domain
AI and IR • Started at about the same time • Feigenbaum and Feldman - “Computers and Thought” McGraw Hill, 1963. • Minsky - “Semantic Information Processing” MIT Press, 1968. • Salton - “Automatic Information Organization and Retrieval” McGraw Hill, 1968.
The early AI approach • Build small domain specific intelligent systems • Use knowledge learned from small domain to build intelligent system for general domain
Some early AI systems • Bobrow - Natural language input for a computer problem solving system (STUDENT solved Algebra problems) • Raphael - SIR Semantic Information Retrieval could accept input statements in a very narrow subset of English and answer questions
Natural Language Understanding (NLU) • Problems with approach - scaling up • To understand natural language, a system needs to represent common knowledge • Minsky claimed that 100,000 facts should suffice to represent all such knowledge
Capability • IR systems can retrieve from databases of millions of documents • NLU based systems retrieve from databases of thousands of documents
1.3 Boolean and ranked retrieval • Characteristics, advantages and disadvantages of Boolean systems • The concept of rank • Ranking retrieval models
Characteristics • Query formulated by joining: keywords with Boolean operators such as AND, OR and NOT • Vocabulary is often controlled • Thesaurus may be available
Advantages of Boolean Systems • Easy to understand behavior • Enables formulating complex, very specific queries
Disadvantages of Boolean Systems • Difficult to formulate complex Boolean query • Output order is not by relevance
Disadvantages of Boolean Systems • All or nothing systems • When users specify (A and B and C and D) should an item with A, B, and C but not D be rejected? • Are all query terms equally important? • Difficult to control size of output. • Too much or too little
The concept of rank • Retrieved documents ordered by decreasing "goodness" (increasing rank) • Rank often computed using a similarity function that compares a document and a query
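As a rough illustration of ranking, the sketch below orders documents by a deliberately simple similarity function: the number of query terms a document contains. This function is an assumption chosen for brevity, not one of the ranking models named in this lecture, and the document ids and term lists are hypothetical.

```python
def similarity(query_terms, doc_terms):
    """Toy similarity: how many distinct query terms the document contains."""
    return len(set(query_terms) & set(doc_terms))


def rank(query_terms, docs):
    """Return ids of matching documents, best match first.

    `docs` maps a document id to its list of terms. Documents that share
    no term with the query are left out; ties are broken by document id.
    """
    scored = []
    for doc_id, doc_terms in docs.items():
        score = similarity(query_terms, doc_terms)
        if score > 0:
            scored.append((score, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


# Hypothetical collection used only for this example.
docs = {
    1: ["curve", "fitting"],
    2: ["curve", "fitting", "interpolation"],
    3: ["algorithm"],
    4: ["interpolation"],
}
ranking = rank(["curve", "fitting", "interpolation"], docs)
print(ranking)  # [2, 1, 4]
```

Unlike the all-or-nothing Boolean behavior, document 1 is still returned although it lacks one of the three query terms; it is simply placed below document 2.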
Advantages of ranked systems • In successful IR systems a high percentage of the top documents are useful to users
Disadvantages of ranked systems • Behavior of system harder to understand
Ranking IR system - models • Vector space • Fuzzy Boolean • Probabilistic
Ranking IR system - models • Knowledge based • Latent semantic indexing • Inference nets • Neural network and genetic algorithms
Boolean Operators • The and operator (A and B) • Presence of both keywords in document required • For example: “president and Clinton” • both “president” and “Clinton” must occur in a retrieved document
Boolean Operators • The or operator (A or B) • Presence of either or both components • For example: “president or Clinton” • Either “president” or “Clinton” (or both) must occur in a retrieved document
Boolean Operators • The not operator (not A) • Requires the absence of a keyword • For example: “president and (not Clinton)” • Requires the presence of “president” and the absence of “Clinton” in a retrieved document • Used in conjunction with the and operator in retrieval
Boolean Operators • Other operators • xor • adjacency (“Information adjacent retrieval”, “curriculum within 5 words of information”, “logic and inference in the same paragraph”)
Boolean Queries • Parentheses are needed to specify precedence among operations • ((A and not B) or C) • (A and not (B or C)) • ((A or C) and (not B))
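To see why the parentheses matter, the three queries above can be evaluated on Python sets of document identifiers. The sets A, B and C and the eight-document collection are hypothetical values picked so that each grouping gives a different answer.

```python
# Hypothetical collection of eight documents and the documents
# that contain keywords A, B and C.
all_docs = set(range(1, 9))
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}
C = {4, 6, 7}

# ((A and not B) or C)
q1 = (A - B) | C

# (A and not (B or C))
q2 = A - (B | C)

# ((A or C) and (not B)); "not B" is taken relative to the collection
q3 = (A | C) & (all_docs - B)

print(sorted(q1))  # [1, 2, 4, 6, 7]
print(sorted(q2))  # [1, 2]
print(sorted(q3))  # [1, 2, 7]
```

The same three keywords and operators produce three different result sets, so an unparenthesized query would be ambiguous.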
Document Indexing • Consider the document: “Algorithm complexity evaluation with curve fitting and interpolation” • Dictionary: all words excluding “with” and “and”
Document dictionary • algorithm • complexity • curve • evaluation • fitting • interpolation
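The dictionary for this document can be reproduced with a short sketch. The function name and the splitting on whitespace are assumptions for the example; the excluded words are the two named on the previous slide.

```python
# Words excluded from the dictionary, as stated in the lecture.
STOP_WORDS = {"with", "and"}


def build_dictionary(document):
    """Return the sorted, distinct index terms of one document."""
    words = document.lower().split()
    return sorted(set(words) - STOP_WORDS)


document = "Algorithm complexity evaluation with curve fitting and interpolation"
dictionary = build_dictionary(document)
print(dictionary)
# ['algorithm', 'complexity', 'curve', 'evaluation', 'fitting', 'interpolation']
```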
The inverted index file • The file contains: • The dictionary and • The lists of document identifiers in which each index term occurred
Inverted-index file - example • In a database with 250 documents: • curve: 12, 25, 36, 89, 125, 128, 215 • fitting: 11, 12, 17, 36, 78, 136, 215 • interpolation: 11, 18, 36, 125, 132
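A minimal sketch of how such a file could be built is shown below. The four document texts are invented for the example (the lecture gives only the resulting lists), and they were chosen so that the identifiers agree with the lists above.

```python
from collections import defaultdict


def build_inverted_index(docs, stop_words=frozenset({"with", "and"})):
    """Map each index term to the sorted list of ids of documents containing it.

    `docs` maps a document id to its text.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word not in stop_words:
                index[word].add(doc_id)
    # Sort the terms (the dictionary) and each list of identifiers.
    return {term: sorted(ids) for term, ids in sorted(index.items())}


# Hypothetical texts for four of the 250 documents.
docs = {
    11: "fitting and interpolation",
    12: "curve fitting",
    18: "interpolation",
    25: "curve",
}
index = build_inverted_index(docs)
for term, ids in index.items():
    print(term, ids)
```

Keeping each list sorted by document identifier is a common design choice because it lets the lists of two terms be combined in a single pass.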
The retrieval • Only the inverted index file and the query are used • For query (A and B) - use the set intersection operation on the two inverted lists
The retrieval • For query (A or B) - use the set union operation on the two inverted lists. • For query (A and (not B)) - use the set difference operation on the two inverted lists
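The lecture does not say how the set operations are implemented. One possible sketch, assuming each inverted list is kept sorted by document identifier, walks both lists once in merge fashion. The function names and the two short lists are hypothetical.

```python
def intersect(a, b):
    """(A and B): ids present in both sorted lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """(A or B): ids present in either sorted list, without duplicates."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def difference(a, b):
    """(A and (not B)): ids in sorted list a that are absent from sorted list b."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            j += 1
    out.extend(a[i:])
    return out


# Hypothetical inverted lists for keywords A and B.
list_a = [1, 3, 5, 7]
list_b = [3, 4, 7, 9]
print(intersect(list_a, list_b))   # [3, 7]
print(union(list_a, list_b))       # [1, 3, 4, 5, 7, 9]
print(difference(list_a, list_b))  # [1, 5]
```

Each function reads every identifier at most once, and none of them needs the documents themselves, which matches the point that only the inverted index file and the query are used.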
Example • Let the query be ((curve and fitting) or interpolation)
Example • The lists for these terms are: • curve:{12, 25, 36, 89, 125, 128, 215} • fitting:{11, 12, 17, 36, 78, 136, 215} • interpolation: {11, 18, 36, 125, 132} • For (curve and fitting) the intersection results in {12, 36, 215} • For ((curve and fitting) or interpolation) the union is {11, 12, 18, 36, 125, 132, 215}
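The worked example can be checked with a small recursive evaluator over Python sets. The tuple encoding of queries and the function name are assumptions made for this sketch, as is numbering the 250 documents from 1 to 250.

```python
def evaluate(query, index, all_docs):
    """Evaluate a nested Boolean query against an inverted index.

    A query is either a keyword (str) or a tuple:
    ("and", q1, q2), ("or", q1, q2) or ("not", q).
    """
    if isinstance(query, str):
        return index.get(query, set())
    op = query[0]
    if op == "not":
        return all_docs - evaluate(query[1], index, all_docs)
    left = evaluate(query[1], index, all_docs)
    right = evaluate(query[2], index, all_docs)
    if op == "and":
        return left & right
    if op == "or":
        return left | right
    raise ValueError(f"unknown operator: {op!r}")


# Inverted lists from the example above.
index = {
    "curve": {12, 25, 36, 89, 125, 128, 215},
    "fitting": {11, 12, 17, 36, 78, 136, 215},
    "interpolation": {11, 18, 36, 125, 132},
}
all_docs = set(range(1, 251))  # assumed ids for the 250 documents

inner = ("and", "curve", "fitting")
query = ("or", inner, "interpolation")
print(sorted(evaluate(inner, index, all_docs)))  # [12, 36, 215]
print(sorted(evaluate(query, index, all_docs)))  # [11, 12, 18, 36, 125, 132, 215]
```

The nesting of the tuples plays the role of the parentheses in the written query, so no separate precedence rules are needed.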