This lecture covers statistical and artificial intelligence techniques for information retrieval, including the Boolean model and ranking models. It explores how statistical measures can determine important keywords and discover useful phrases, while AI techniques enhance retrieval results through knowledge representation, natural language understanding, machine learning, and machine translation.
CS533 Information Retrieval Dr. Michal Cutler Lecture #2 January 27, 1999
This Lecture • 1.2 Statistical/artificial intelligence techniques • 1.3 Boolean and ranking models • The Boolean model
1.2 Statistical/artificial intelligence techniques • Statistical techniques • Artificial intelligence
Statistical • No attempt to understand text • Based on statistical measures • Which keywords are important for retrieval? • How do we discover phrases?
Statistical measures • Words occurring many times in a document are important for retrieval • Measure: frequency of occurrences of words • Adjacent pairs of words occurring many times, in any order, in the database are useful phrases • Measure: frequency of occurrences of pairs
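The two measures on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular system: the tokenizer, the function names, and the three sample sentences are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(text):
    """Count how many times each word occurs in one document."""
    return Counter(tokenize(text))


def pair_frequencies(documents):
    """Count adjacent word pairs across all documents, ignoring order."""
    counts = Counter()
    for text in documents:
        words = tokenize(text)
        for left, right in zip(words, words[1:]):
            # Sort the pair so "retrieval information" and
            # "information retrieval" are counted together.
            counts[tuple(sorted((left, right)))] += 1
    return counts


# Hypothetical three-document database, invented for this sketch.
docs = [
    "Information retrieval systems rank documents.",
    "Retrieval information is stored in an inverted index.",
    "Boolean information retrieval uses set operations.",
]
word_counts = word_frequencies(docs[0])
pair_counts = pair_frequencies(docs)
top_pair = pair_counts.most_common(1)[0]
print(top_pair)
```

In this toy database the pair (information, retrieval) is the only one that occurs more than once, so it would be the first phrase candidate.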
Artificial Intelligence (AI) • Techniques which should enhance retrieval results • Knowledge representation • Natural Language Understanding • Machine learning • Machine translation
Knowledge representation • Knowledge representation - one of the most active research areas • Hard problems • Representing common sense knowledge • Updating and scaling up the knowledge • IR model which will be covered
Natural Language Understanding (NLU) • Template filling in narrow domains • (attributes of terrorist stories, for example). • Can answer some questions • Sentence parsing and phrase generation will be covered
Machine learning • Most frequently used techniques in IR include: • Symbolic, inductive learning algorithms such as ID3 • Multiple-layered, feed-forward neural networks • Evolution-based genetic algorithms
Statistical or AI Techniques • Most successful IR systems based on: • statistical techniques, and • some limited AI • Web-based intelligent agents use both • Specialized for a given domain
AI and IR • Started at about the same time • Feigenbaum and Feldman - “Computers and Thought” McGraw Hill, 1963. • Minsky - “Semantic Information Processing” MIT Press, 1968. • Salton - “Automatic Information Organization and Retrieval” McGraw Hill, 1968.
The early AI approach • Build small domain specific intelligent systems • Use knowledge learned from small domain to build intelligent system for general domain
Some early AI systems • Bobrow - Natural language input for a computer problem solving system (STUDENT solved Algebra problems) • Raphael - SIR Semantic Information Retrieval could accept input statements in a very narrow subset of English and answer questions
Natural Language Understanding (NLU) • Problems with approach - scaling up • To understand natural language need to represent common knowledge • Minsky claimed that 100,000 facts should suffice to represent all such knowledge
Capability • IR systems can retrieve from databases of millions of documents • NLU based systems retrieve from databases of thousands of documents
1.3 Boolean and ranked retrieval • Characteristics, advantages and disadvantages of Boolean systems • The concept of rank • Ranking retrieval models
Characteristics • Query formulated by joining: keywords with Boolean operators such as AND, OR and NOT • Vocabulary is often controlled • Thesaurus may be available
Advantages of Boolean Systems • Easy to understand behavior • Enables formulating complex very specific queries
Disadvantages of Boolean Systems • Difficult to formulate complex Boolean query • Output order is not by relevance
Disadvantages of Boolean Systems • All or nothing systems • When users specify (A and B and C and D) should an item with A, B, and C but not D be rejected? • Are all query terms equally important? • Difficult to control size of output. • Too much or too little
The concept of rank • Retrieved documents ordered by decreasing "goodness" (increasing rank) • Rank often computed using a similarity function that compares a document and a query
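The slide does not say which similarity function is used. As one possible illustration, the sketch below ranks documents by cosine similarity over raw term counts, a common choice in vector space models. The function names and the three tiny documents are assumptions made for the example.

```python
import math
from collections import Counter


def cosine_similarity(query_terms, doc_terms):
    """Cosine of the angle between two term-count vectors."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0


def rank(query_terms, documents):
    """Return ids of matching documents, best match first."""
    scored = [(cosine_similarity(query_terms, terms), doc_id)
              for doc_id, terms in documents.items()]
    # Decreasing score; ties are broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0]


# Hypothetical documents, already reduced to index terms.
documents = {
    1: ["curve", "fitting", "interpolation"],
    2: ["curve", "sketching"],
    3: ["algorithm", "complexity"],
}
ranking = rank(["curve", "fitting"], documents)
print(ranking)
```

Document 1 contains both query terms and is listed first, document 2 contains one and follows, and document 3 shares no term with the query and is not retrieved.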
Advantages of ranked systems • In successful IR systems a high percentage of the top documents are useful to users
Disadvantages of ranked systems • Behavior of system harder to understand
Ranking IR system - models • Vector space • Fuzzy Boolean • Probabilistic
Ranking IR system - models • Knowledge based • Latent semantic indexing • Inference nets • Neural network and genetic algorithms
Boolean Operators • The and operator (A and B) • Presence of both keywords in document required • For example: “president and Clinton” • both “president” and “Clinton” must occur in a retrieved document
Boolean Operators • The or operator (A or B) • Presence of either or both components • For example: “president or Clinton” • Either “president” or “Clinton” (or both) must occur in a retrieved document
Boolean Operators • The not operator (not A) • Requires the absence of a keyword • For example: “president and (not Clinton)” • Requires the presence of “president” and the absence of “Clinton” in a retrieved document • Used in conjunction with and in retrieval
Boolean Operators • Other operators • xor • adjacency (“Information adjacent retrieval”, “curriculum within 5 words of information”, “logic and inference in the same paragraph”)
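Adjacency and "within k words" operators need word positions, not only the presence of a term. The sketch below checks both on a single tokenized document. It assumes that "within k words" means the two positions differ by at most k, and that adjacency means the first term is immediately followed by the second; real systems may define these differently. The function names and the sample sentence are invented for the example.

```python
def positions(tokens, term):
    """Return every position at which the term occurs."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def adjacent(tokens, first, second):
    """True if `first` is immediately followed by `second` somewhere."""
    return any(j - i == 1
               for i in positions(tokens, first)
               for j in positions(tokens, second))


def within(tokens, a, b, k):
    """True if some occurrences of a and b are at most k positions apart."""
    return any(abs(i - j) <= k
               for i in positions(tokens, a)
               for j in positions(tokens, b))


# Hypothetical document text.
tokens = "the curriculum covers information retrieval and logic".split()

print(adjacent(tokens, "information", "retrieval"))
print(within(tokens, "curriculum", "information", 5))
print(within(tokens, "curriculum", "logic", 2))
```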
Boolean Queries • Parentheses are needed to specify precedence among operations • ((A and not B) or C) • (A and not (B or C)) • ((A or C) and (not B))
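To see why the parentheses matter, the three queries can be evaluated on small sets of document identifiers. The sets A, B and C below are made up for the illustration; "and not" is modeled as set difference.

```python
# Hypothetical document-id sets for keywords A, B and C.
A = {1, 2, 3}
B = {2, 4}
C = {3, 4}

q1 = (A - B) | C   # ((A and not B) or C)
q2 = A - (B | C)   # (A and not (B or C))
q3 = (A | C) - B   # ((A or C) and (not B))

print(sorted(q1), sorted(q2), sorted(q3))
```

The same three keywords and operators give three different answers, {1, 3, 4}, {1} and {1, 3}, depending only on how the query is grouped.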
Document Indexing • Consider the document: “Algorithm complexity evaluation with curve fitting and interpolation” • Dictionary - all words excluding “with” and “and”
Document dictionary • algorithm • complexity • curve • evaluation • fitting • interpolation
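A minimal sketch of how such a dictionary could be produced. The stop-word set holds only the two words the slide excludes, and the lowercasing and alphabetical sort are assumptions made for the example.

```python
import re

# Assumption: only the two words the slide excludes are stop words.
STOP_WORDS = {"with", "and"}


def build_dictionary(text, stop_words=STOP_WORDS):
    """Return the sorted distinct words of the text, minus stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return sorted(set(words) - stop_words)


document = "Algorithm complexity evaluation with curve fitting and interpolation"
dictionary = build_dictionary(document)
print(dictionary)
```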
The inverted index file • The file contains: • The dictionary and • The lists of document identifiers in which each index term occurred
Inverted-index file - example • In a database with 250 documents: • curve: 12, 25, 36, 89, 125, 128, 215 • fitting: 11, 12, 17, 36, 78, 136, 215 • interpolation: 11, 18, 36, 125, 132
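The sketch below shows one way an inverted index of this shape could be built. The three document texts are invented; only their identifiers are borrowed from the lists on the slide. The function name and the whitespace tokenization are assumptions.

```python
from collections import defaultdict


def build_inverted_index(documents, stop_words=frozenset()):
    """Map each index term to the sorted list of ids of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        terms = set(documents[doc_id].lower().split()) - stop_words
        for term in sorted(terms):
            # Ids are visited in increasing order, so each list stays sorted.
            index[term].append(doc_id)
    return dict(index)


# Hypothetical texts for three of the 250 documents.
documents = {
    11: "fitting interpolation",
    12: "curve fitting",
    36: "curve fitting and interpolation",
}
index = build_inverted_index(documents, stop_words={"and", "with"})
print(index)
```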
The retrieval • Only the inverted index file and the query are used • For query (A and B) - use the set intersection operation on the two inverted lists
The retrieval • For query (A or B) - use the set union operation on the two inverted lists • For query (A and (not B)) - use the set difference operation on the two inverted lists
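The slides do not say how the set operations are implemented. When inverted lists are kept sorted by document identifier, one common approach is a single merge pass over both lists, sketched below. The function names and the two sample lists are assumptions.

```python
def intersect(a, b):
    """Ids present in both sorted lists (A and B)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """Ids present in either sorted list (A or B), without duplicates."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    # One list is exhausted; the rest of the other is appended as is.
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def difference(a, b):
    """Ids in sorted list a that are not in sorted list b (A and (not B))."""
    i = j = 0
    out = []
    while i < len(a):
        if j >= len(b) or a[i] < b[j]:
            out.append(a[i])
            i += 1
        elif a[i] == b[j]:
            i += 1
            j += 1
        else:
            j += 1
    return out


# Hypothetical inverted lists for keywords A and B.
list_a = [1, 3, 5, 7]
list_b = [3, 4, 7, 9]
print(intersect(list_a, list_b), union(list_a, list_b), difference(list_a, list_b))
```

Each function reads every entry of the two lists at most once, so the cost grows with the combined length of the lists and not with the size of the database.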
Example • Let the query be ((curve and fitting) or interpolation)
Example • The lists for these terms are: • curve:{12, 25, 36, 89, 125, 128, 215} • fitting:{11, 12, 17, 36, 78, 136, 215} • interpolation: {11, 18, 36, 125, 132} • For (curve and fitting) the intersection results in {12, 36, 215} • For ((curve and fitting) or interpolation) the union is {11, 12, 18, 36, 125, 132, 215}
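The example can be checked with a small recursive evaluator over the three inverted lists. The nested-tuple query format and the `evaluate` function are assumptions made for this sketch, not part of the lecture.

```python
# Inverted lists from the example, stored as sets of document ids.
index = {
    "curve": {12, 25, 36, 89, 125, 128, 215},
    "fitting": {11, 12, 17, 36, 78, 136, 215},
    "interpolation": {11, 18, 36, 125, 132},
}


def evaluate(query, index):
    """Evaluate a query that is a term or an (operator, left, right) tuple."""
    if isinstance(query, str):
        return index.get(query, set())
    op, left, right = query
    lhs, rhs = evaluate(left, index), evaluate(right, index)
    if op == "and":
        return lhs & rhs
    if op == "or":
        return lhs | rhs
    if op == "andnot":
        return lhs - rhs
    raise ValueError(f"unknown operator: {op}")


inner = sorted(evaluate(("and", "curve", "fitting"), index))
query = ("or", ("and", "curve", "fitting"), "interpolation")
result = sorted(evaluate(query, index))
print(inner)
print(result)
```

Only the index and the query are consulted; the document texts are never read during retrieval.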