Statistical and AI Techniques in Information Retrieval

CS533 Information Retrieval Dr. Michal Cutler Lecture #2 January 27, 2000

This Lecture • 1.2 Statistical/artificial intelligence techniques • 1.3 Boolean and ranking models • The Boolean model

1.2 Statistical/artificial intelligence techniques • Statistical techniques • Artificial intelligence

Statistical • No attempt to understand text • Based on statistical measures • Which keywords are important for retrieval? • How do we discover phrases?

Statistical measures • Words occurring many times in a document are important for retrieval • Frequency of occurrences of words • Adjacent pairs of words occurring many times, in any order, in the database are useful phrases • Frequency of occurrences of pairs
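The two measures on this slide can be sketched in a few lines of Python. This is only an illustration: the tokenizer, the function names, and the sample documents are assumptions, and the slides do not say how frequent a word or pair must be before it counts as important.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(text):
    """Count how many times each word occurs in one document."""
    return Counter(tokenize(text))


def pair_frequencies(texts):
    """Count adjacent word pairs over all documents, ignoring word order."""
    counts = Counter()
    for text in texts:
        words = tokenize(text)
        for left, right in zip(words, words[1:]):
            # Sorting the pair makes (a, b) and (b, a) the same key.
            counts[tuple(sorted((left, right)))] += 1
    return counts


# Hypothetical sample data, not taken from the lecture.
docs = [
    "information retrieval is fun",
    "retrieval information systems",
    "curve fitting",
]
pairs = pair_frequencies(docs)
print(pairs[("information", "retrieval")])  # 2
```

A real system would then keep only the words or pairs whose counts pass some threshold; choosing that threshold is left open here.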

Artificial Intelligence (AI) • Techniques which should enhance retrieval results • Knowledge representation • Natural Language Understanding • Machine learning • Machine translation

Knowledge representation • Knowledge representation - one of the most active research areas • Hard problems • Representing common sense knowledge • Updating and scaling up the knowledge • IR model which will be covered

Natural Language Understanding (NLU) • Template filling in narrow domains (attributes of terrorist stories, for example) • Can answer some questions • Sentence parsing and phrase generation will be covered

Machine learning • Most frequently used techniques in IR include: • Symbolic, inductive learning algorithms such as ID3 • Multiple-layered, feed-forward neural networks • Evolution-based genetic algorithms

Statistical or AI Techniques • Most successful IR systems based on: • statistical techniques, and • some limited AI • Web-based intelligent agents use both • Specialized for a given domain

AI and IR • Started at about the same time • Feigenbaum and Feldman - “Computers and Thought” McGraw Hill, 1963. • Minsky - “Semantic Information Processing” MIT Press, 1968. • Salton - “Automatic Information Organization and Retrieval” McGraw Hill, 1968.

The early AI approach • Build small domain specific intelligent systems • Use knowledge learned from small domain to build intelligent system for general domain

Some early AI systems • Bobrow - Natural language input for a computer problem solving system (STUDENT solved algebra problems) • Raphael - SIR (Semantic Information Retrieval) could accept input statements in a very narrow subset of English and answer questions

Natural Language Understanding (NLU) • Problems with approach - scaling up • To understand natural language, a system needs to represent common knowledge • Minsky claimed that 100,000 facts should suffice to represent all such knowledge

Capability • IR systems can retrieve from databases of millions of documents • NLU based systems retrieve from databases of thousands of documents

1.3 Boolean and ranked retrieval • Characteristics, advantages and disadvantages of Boolean systems • The concept of rank • Ranking retrieval models

Characteristics • Query formulated by joining keywords with Boolean operators such as AND, OR and NOT • Vocabulary is often controlled • Thesaurus may be available

Advantages of Boolean Systems • Easy to understand behavior • Enables formulating complex, very specific queries

Disadvantages of Boolean Systems • Difficult to formulate complex Boolean query • Output order is not by relevance

Disadvantages of Boolean Systems • All or nothing systems • When users specify (A and B and C and D) should an item with A, B, and C but not D be rejected? • Are all query terms equally important? • Difficult to control size of output. • Too much or too little

The concept of rank • Retrieved documents ordered by decreasing "goodness" (increasing rank) • Rank often computed using a similarity function that compares a document and a query
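As a rough sketch of ranking, the code below orders documents by a similarity score. The slides do not name a particular similarity function, so the one used here (the number of query terms a document contains) is an assumption chosen for simplicity, and the documents are hypothetical.

```python
def similarity(doc_terms, query_terms):
    """Assumed similarity: how many distinct query terms the document contains."""
    return len(set(doc_terms) & set(query_terms))


def rank(documents, query_terms):
    """Return document ids by decreasing similarity, dropping zero scores."""
    scored = [
        (similarity(terms, query_terms), doc_id)
        for doc_id, terms in documents.items()
    ]
    matching = [(score, doc_id) for score, doc_id in scored if score > 0]
    # Higher scores first; ties are broken by document id.
    return [doc_id for score, doc_id in sorted(matching, key=lambda p: (-p[0], p[1]))]


# Hypothetical documents, each given as its list of index terms.
documents = {
    1: ["a", "b", "c", "d"],
    2: ["a", "b", "c"],
    3: ["a"],
    4: ["e"],
}
ranking = rank(documents, ["a", "b", "c", "d"])
print(ranking)  # [1, 2, 3]
```

Document 2 contains A, B, and C but not D. A strict Boolean AND would reject it, while this ranking places it second.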

Advantages of ranked systems • In successful IR systems a high percentage of the top documents are useful to users

Disadvantages of ranked systems • Behavior of system harder to understand

Ranking IR system - models • Vector space • Fuzzy Boolean • Probabilistic

Ranking IR system - models • Knowledge based • Latent semantic indexing • Inference nets • Neural network and genetic algorithms

Boolean Operators • The and operator (A and B) • Presence of both keywords in document required • For example: “president and Clinton” • both “president” and “Clinton” must occur in a retrieved document

Boolean Operators • The or operator (A or B) • Presence of either or both components • For example: “president or Clinton” • Either “president” or “Clinton” (or both) must occur in a retrieved document

Boolean Operators • The not operator (not A) • Requires the absence of a keyword • For example: “president and (not Clinton)” • Requires the presence of “president” and the absence of “Clinton” in a retrieved document • Used in conjunction with the and operator in retrieval

Boolean Operators • Other operators • xor • adjacency (“Information adjacent retrieval”, “curriculum within 5 words of information”, “logic and inference in the same paragraph”)
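Adjacency and "within N words" operators need word positions, not just word presence. The sketch below works on one tokenized document. The function names are hypothetical, and it assumes that "adjacent" means the second word immediately follows the first, which the slide does not specify.

```python
def adjacent(tokens, first, second):
    """True if `second` immediately follows `first` somewhere in the tokens."""
    return any(a == first and b == second for a, b in zip(tokens, tokens[1:]))


def within(tokens, first, second, max_distance):
    """True if the two words occur at most `max_distance` positions apart."""
    first_positions = [i for i, t in enumerate(tokens) if t == first]
    second_positions = [i for i, t in enumerate(tokens) if t == second]
    return any(
        abs(i - j) <= max_distance
        for i in first_positions
        for j in second_positions
    )


# Hypothetical document text.
tokens = "the curriculum covers information retrieval".split()
print(adjacent(tokens, "information", "retrieval"))   # True
print(within(tokens, "curriculum", "information", 5))  # True
```

A "same paragraph" operator could be handled the same way by recording a paragraph number for each word position.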

Boolean Queries • Parentheses are needed to specify precedence among operations • ((A and not B) or C) • (A and not (B or C)) • (A or C) and (not B)
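To see why the parentheses matter, the three groupings on this slide can be evaluated over small sets of document identifiers. The sets A, B, and C below are made-up data, and "not" is taken relative to an assumed collection of six documents.

```python
# Hypothetical collection and the documents containing each keyword.
all_docs = {1, 2, 3, 4, 5, 6}
A = {1, 2, 3, 4}
B = {2, 3, 5}
C = {3, 4, 6}

# ((A and not B) or C)
q1 = (A & (all_docs - B)) | C

# (A and not (B or C))
q2 = A & (all_docs - (B | C))

# (A or C) and (not B)
q3 = (A | C) & (all_docs - B)

print(sorted(q1))  # [1, 3, 4, 6]
print(sorted(q2))  # [1]
print(sorted(q3))  # [1, 4, 6]
```

The same three keywords and operators give three different result sets, so the grouping must be stated explicitly.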

Document Indexing • Consider the document: “Algorithm complexity evaluation with curve fitting and interpolation” • Dictionary - all words excluding “with” and “and”

Document dictionary • algorithm • complexity • curve • evaluation • fitting • interpolation
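A minimal sketch of how this dictionary could be produced from the example document. The stop list holds only the two words the slide excludes; the function name and the whitespace tokenization are assumptions.

```python
# Only the two words the slide excludes; a real stop list would be longer.
STOP_WORDS = {"with", "and"}


def build_dictionary(text):
    """Return the sorted, distinct, lowercased words that are not stop words."""
    words = text.lower().split()
    return sorted(set(words) - STOP_WORDS)


document = "Algorithm complexity evaluation with curve fitting and interpolation"
dictionary = build_dictionary(document)
print(dictionary)
# ['algorithm', 'complexity', 'curve', 'evaluation', 'fitting', 'interpolation']
```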

The inverted index file • The file contains: • The dictionary and • The lists of document identifiers in which each index term occurred

Inverted-index file - example • In a database with 250 documents: • curve: 12, 25, 36, 89, 125, 128, 215 • fitting: 11, 12, 17, 36, 78, 136, 215 • interpolation: 11, 18, 36, 125, 132
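One way such a file could be built is sketched below: each index term maps to the sorted list of identifiers of the documents containing it. The three tiny documents are hypothetical stand-ins, since the 250-document database from the slide is not available, and the function name is an assumption.

```python
from collections import defaultdict


def build_inverted_index(documents, stop_words=frozenset()):
    """Map each index term to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            if word not in stop_words:
                postings[word].add(doc_id)
    # Sort the terms (the dictionary) and each list of document ids.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


# Hypothetical documents keyed by document id.
documents = {
    1: "curve fitting",
    2: "interpolation and curve fitting",
    3: "algorithm complexity",
}
index = build_inverted_index(documents, stop_words={"and", "with"})
print(index["curve"])  # [1, 2]
```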

The retrieval • Only the inverted index file and the query are used • For query (A and B) - use the set intersection operation on the two inverted lists

The retrieval • For query (A or B) - use the set union operation on the two inverted lists. • For query (A and (not B)) - use the set difference operation on the two inverted lists
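The three set operations can be sketched directly on inverted lists. The intersection below walks both sorted lists in step, which is one common approach and an assumption here, since the slides only name the set operations. The two lists are the curve and fitting lists from the inverted-index example.

```python
def intersect(a, b):
    """(A and B): ids present in both sorted lists, found by a linear merge."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def union(a, b):
    """(A or B): ids present in either list, returned in sorted order."""
    return sorted(set(a) | set(b))


def difference(a, b):
    """(A and (not B)): ids in the first list that are absent from the second."""
    excluded = set(b)
    return [doc_id for doc_id in a if doc_id not in excluded]


curve = [12, 25, 36, 89, 125, 128, 215]
fitting = [11, 12, 17, 36, 78, 136, 215]
print(intersect(curve, fitting))   # [12, 36, 215]
print(difference(curve, fitting))  # [25, 89, 125, 128]
```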

Example • Let the query be ((curve and fitting) or interpolation)

Example • The lists for these terms are: • curve:{12, 25, 36, 89, 125, 128, 215} • fitting:{11, 12, 17, 36, 78, 136, 215} • interpolation: {11, 18, 36, 125, 132} • For (curve and fitting) the intersection results in {12, 36, 215} • For ((curve and fitting) or interpolation) the union is {11, 12, 18, 36, 125, 132, 215}
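The example can be checked with a small recursive evaluator over the three lists. The nested-tuple query format and the `evaluate` function are illustrative assumptions, as is numbering the 250 documents from 1 to 250 to support "not".

```python
# Inverted lists from the example, stored as sets of document ids.
INDEX = {
    "curve": {12, 25, 36, 89, 125, 128, 215},
    "fitting": {11, 12, 17, 36, 78, 136, 215},
    "interpolation": {11, 18, 36, 125, 132},
}
# Assumption: the 250 documents are numbered 1 to 250.
ALL_DOCS = set(range(1, 251))


def evaluate(query):
    """Evaluate a keyword or a nested (operator, operand, ...) tuple."""
    if isinstance(query, str):
        return INDEX.get(query, set())
    operator, *operands = query
    if operator == "and":
        left, right = operands
        return evaluate(left) & evaluate(right)
    if operator == "or":
        left, right = operands
        return evaluate(left) | evaluate(right)
    if operator == "not":
        (operand,) = operands
        return ALL_DOCS - evaluate(operand)
    raise ValueError(f"unknown operator: {operator}")


result = evaluate(("or", ("and", "curve", "fitting"), "interpolation"))
print(sorted(result))  # [11, 12, 18, 36, 125, 132, 215]
```

The nesting of the tuples plays the role of the parentheses in the written query.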
