CS533 Information Retrieval

CS533 Information Retrieval Dr. Michal Cutler Lecture #2 January 27, 2000

This Lecture • 1.2 Statistical/artificial intelligence techniques • 1.3 Boolean and ranking models • The Boolean model

1.2 Statistical/artificial intelligence techniques • Statistical techniques • Artificial intelligence

Statistical • No attempt to understand text • Based on statistical measures • Which keywords are important for retrieval? • How do we discover phrases?

Statistical measures • Words occurring many times in a document are important for retrieval • Measure: frequency of occurrences of words • Adjacent pairs of words occurring many times, in any order, in the database are useful phrases • Measure: frequency of occurrences of pairs
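The two measures on this slide can be sketched in a few lines. This is a minimal illustration, not the method of any particular system: the tokenizer, the function names and the three sample documents are assumptions made for the example.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def word_frequencies(text):
    """Count how many times each word occurs in one document."""
    return Counter(tokenize(text))

def pair_frequencies(documents):
    """Count adjacent word pairs across all documents, ignoring order."""
    counts = Counter()
    for text in documents:
        words = tokenize(text)
        for left, right in zip(words, words[1:]):
            # Sorting the pair makes ("retrieval", "information") and
            # ("information", "retrieval") count as the same pair.
            counts[tuple(sorted((left, right)))] += 1
    return counts

# Hypothetical three-document database.
docs = [
    "Information retrieval systems rank documents",
    "Fast retrieval, information on demand",
    "Information retrieval and curve fitting",
]
word_counts = word_frequencies(docs[0])
pair_counts = pair_frequencies(docs)
print(pair_counts.most_common(1))
```

Pairs with a high count, such as "information retrieval" here, are the candidates for useful phrases.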

Artificial Intelligence (AI) • Techniques which should enhance retrieval results • Knowledge representation • Natural Language Understanding • Machine learning • Machine translation

Knowledge representation • Knowledge representation - one of the most active research areas • Hard problems • Representing common sense knowledge • Updating and scaling up the knowledge • IR model which will be covered

Natural Language Understanding (NLU) • Template filling in narrow domains (attributes of terrorist stories, for example) • Can answer some questions • Sentence parsing and phrase generation will be covered

Machine learning • Most frequently used techniques in IR include: • Symbolic, inductive learning algorithms such as ID3 • Multiple-layered, feed-forward neural networks • Evolution-based genetic algorithms

Statistical or AI Techniques • Most successful IR systems based on: • statistical techniques, and • some limited AI • Web-based intelligent agents use both • Specialized for a given domain

AI and IR • Started at about the same time • Feigenbaum and Feldman - “Computers and Thought” McGraw Hill, 1963. • Minsky - “Semantic Information Processing” MIT Press, 1968. • Salton - “Automatic Information Organization and Retrieval” McGraw Hill, 1968.

The early AI approach • Build small domain specific intelligent systems • Use knowledge learned from small domain to build intelligent system for general domain

Some early AI systems • Bobrow - Natural language input for a computer problem solving system (STUDENT solved Algebra problems) • Raphael - SIR Semantic Information Retrieval could accept input statements in a very narrow subset of English and answer questions

Natural Language Understanding (NLU) • Problems with approach - scaling up • To understand natural language need to represent common knowledge • Minsky claimed that 100,000 facts should suffice to represent all such knowledge

Capability • IR systems can retrieve from databases of millions of documents • NLU based systems retrieve from databases of thousands of documents

1.3 Boolean and ranked retrieval • Characteristics, advantages and disadvantages of Boolean systems • The concept of rank • Ranking retrieval models

Characteristics • Query formulated by joining keywords with Boolean operators such as AND, OR and NOT • Vocabulary is often controlled • Thesaurus may be available

Advantages of Boolean Systems • Easy to understand behavior • Enables formulating complex, very specific queries

Disadvantages of Boolean Systems • Difficult to formulate complex Boolean query • Output order is not by relevance

Disadvantages of Boolean Systems • All or nothing systems • When users specify (A and B and C and D) should an item with A, B, and C but not D be rejected? • Are all query terms equally important? • Difficult to control size of output. • Too much or too little

The concept of rank • Retrieved documents ordered by decreasing "goodness" (increasing rank) • Rank often computed using a similarity function that compares a document and a query
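A minimal sketch of ranking by a similarity function. The similarity used here (the fraction of query terms found in the document) is only a stand-in chosen for illustration; the function names and the three sample documents are likewise assumptions.

```python
def similarity(document_terms, query_terms):
    """Hypothetical similarity: fraction of query terms found in the document."""
    query = set(query_terms)
    return len(query & set(document_terms)) / len(query)

def rank(documents, query_terms):
    """Return (doc_id, score) pairs ordered by decreasing similarity."""
    scored = [(doc_id, similarity(terms, query_terms))
              for doc_id, terms in documents.items()]
    # Sort by score descending; ties are broken by document identifier.
    return sorted(scored, key=lambda item: (-item[1], item[0]))

# Hypothetical documents, each given as a list of index terms.
documents = {
    1: ["a", "b", "c"],
    2: ["a", "b", "c", "d"],
    3: ["a", "e"],
}
ranking = rank(documents, ["a", "b", "c", "d"])
print(ranking)  # [(2, 1.0), (1, 0.75), (3, 0.25)]
```

Document 1 contains A, B and C but not D. A strict Boolean (A and B and C and D) would reject it, while the ranked output places it second.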

Advantages of ranked systems • In successful IR systems a high percentage of the top documents are useful to users

Disadvantages of ranked systems • Behavior of system harder to understand

Ranking IR system - models • Vector space • Fuzzy Boolean • Probabilistic

Ranking IR system - models • Knowledge based • Latent semantic indexing • Inference nets • Neural network and genetic algorithms

Boolean Operators • The and operator (A and B) • Presence of both keywords in document required • For example: “president and Clinton” • both “president” and “Clinton” must occur in a retrieved document

Boolean Operators • The or operator (A or B) • Presence of either or both components • For example: “president or Clinton” • Either “president” or “Clinton” (or both) must occur in a retrieved document

Boolean Operators • The not operator (not A) • Requires the absence of a keyword • For example: “president and (not Clinton)” • Requires the presence of “president” and the absence of “Clinton” in a retrieved document • Used in conjunction with the and operator in retrieval

Boolean Operators • Other operators • xor • adjacency (“Information adjacent retrieval”, “curriculum within 5 words of information”, “logic and inference in the same paragraph”)
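Adjacency and proximity operators need word positions, not just word presence. The sketch below is one possible reading of "adjacent" and "within N words"; the function names, the exact distance convention and the sample sentence are assumptions, and real systems differ in how they count distance.

```python
import re

def positions(text):
    """Map each lowercase word to the list of positions where it occurs."""
    index = {}
    for pos, word in enumerate(re.findall(r"[a-z]+", text.lower())):
        index.setdefault(word, []).append(pos)
    return index

def within(text, first, second, max_distance):
    """True if some occurrence of `first` is at most `max_distance` positions from `second`."""
    index = positions(text)
    return any(abs(p - q) <= max_distance
               for p in index.get(first, [])
               for q in index.get(second, []))

def adjacent(text, first, second):
    """True if `second` immediately follows `first` somewhere in the text."""
    index = positions(text)
    following = set(index.get(second, []))
    return any(p + 1 in following for p in index.get(first, []))

# Hypothetical sample sentence.
text = "The curriculum covers information retrieval and logic"
print(adjacent(text, "information", "retrieval"))    # True
print(within(text, "curriculum", "information", 5))  # True
print(within(text, "curriculum", "logic", 2))        # False
```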

Boolean Queries • Parentheses are needed to specify precedence among operations • ((A and not B) or C) • (A and not (B or C)) • (A or C) and (not B)
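To see why the parentheses matter, the three queries can be evaluated over sets of document identifiers. The sets A, B and C below are hypothetical; each stands for the documents that contain the corresponding keyword.

```python
# Hypothetical sets of document identifiers containing keywords A, B and C.
A = {1, 2, 3}
B = {2, 4}
C = {3, 4, 5}

# ((A and not B) or C)
q1 = (A - B) | C
# (A and not (B or C))
q2 = A - (B | C)
# (A or C) and (not B)
q3 = (A | C) - B

print(sorted(q1))  # [1, 3, 4, 5]
print(sorted(q2))  # [1]
print(sorted(q3))  # [1, 3, 5]
```

The same three keywords and operators give three different result sets, depending only on the grouping.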

Document Indexing • Consider the document: “Algorithm complexity evaluation with curve fitting and interpolation” • Dictionary- all words excluding “with” and “and”

Document dictionary • algorithm • complexity • curve • evaluation • fitting • interpolation
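A minimal sketch of deriving this dictionary from the document. The stop list holds only the two words the slide excludes, and the function name is an assumption for the example.

```python
# Stop list limited to the two words excluded on the slide.
STOP_WORDS = {"with", "and"}

def build_dictionary(text):
    """Return the sorted, de-duplicated index terms of a document."""
    words = text.lower().split()
    return sorted(set(words) - STOP_WORDS)

title = "Algorithm complexity evaluation with curve fitting and interpolation"
dictionary = build_dictionary(title)
print(dictionary)
# ['algorithm', 'complexity', 'curve', 'evaluation', 'fitting', 'interpolation']
```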

The inverted index file • The file contains: • The dictionary and • The lists of document identifiers in which each index term occurred

Inverted-index file - example • In a database with 250 documents: • curve: 12, 25, 36, 89, 125, 128, 215 • fitting: 11, 12, 17, 36, 78, 136, 215 • interpolation: 11, 18, 36, 125, 132
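One way to build such a file is sketched below, assuming whitespace tokenization and a small stop list. The three sample documents and the function name are hypothetical and are not the 250-document database of the example.

```python
from collections import defaultdict

def build_inverted_index(documents, stop_words=frozenset()):
    """Map each index term to the sorted list of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        # set() so each document identifier is listed once per term.
        for term in set(documents[doc_id].lower().split()) - stop_words:
            index[term].append(doc_id)
    # Return the dictionary in alphabetical order of index terms.
    return dict(sorted(index.items()))

# Hypothetical three-document database.
documents = {
    1: "curve fitting with interpolation",
    2: "algorithm complexity",
    3: "curve evaluation and interpolation",
}
index = build_inverted_index(documents, stop_words=frozenset({"with", "and"}))
for term, doc_ids in index.items():
    print(term, doc_ids)
```

Processing documents in increasing identifier order keeps every list sorted, which the set operations used at retrieval time can exploit.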

The retrieval • Only the inverted index file and the query are used • For query (A and B) - use the set intersection operation on the two inverted lists

The retrieval • For query (A or B) - use the set union operation on the two inverted lists. • For query (A and (not B)) - use the set difference operation on the two inverted lists
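When the inverted lists are kept sorted by document identifier, all three operations can be done in a single pass over both lists. The sketch below assumes sorted lists without duplicates; the function names and the two sample lists are hypothetical.

```python
def intersect(a, b):
    """(A and B): identifiers present in both sorted lists."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

def union(a, b):
    """(A or B): identifiers present in either sorted list."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            result.append(a[i])
            i += 1
        else:
            result.append(b[j])
            j += 1
    # One list is exhausted; the rest of the other belongs to the union.
    result.extend(a[i:])
    result.extend(b[j:])
    return result

def difference(a, b):
    """(A and (not B)): identifiers in list a that are absent from list b."""
    result, i, j = [], 0, 0
    while i < len(a):
        if j == len(b) or a[i] < b[j]:
            result.append(a[i])
            i += 1
        elif a[i] == b[j]:
            i += 1
            j += 1
        else:
            j += 1
    return result

# Hypothetical inverted lists for keywords A and B.
list_a = [1, 3, 5, 7]
list_b = [3, 4, 7, 9]
print(intersect(list_a, list_b))   # [3, 7]
print(union(list_a, list_b))       # [1, 3, 4, 5, 7, 9]
print(difference(list_a, list_b))  # [1, 5]
```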

Example • Let the query be ((curve and fitting) or interpolation)

Example • The lists for these terms are: • curve:{12, 25, 36, 89, 125, 128, 215} • fitting:{11, 12, 17, 36, 78, 136, 215} • interpolation: {11, 18, 36, 125, 132} • For (curve and fitting) the intersection results in {12, 36, 215} • For ((curve and fitting) or interpolation) the union is {11, 12, 18, 36, 125, 132, 215}
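The example can be checked with Python's built-in set operators. The variable names are assumptions; the lists are the ones given on the slide.

```python
# Inverted lists from the example, stored as sets of document identifiers.
index = {
    "curve": {12, 25, 36, 89, 125, 128, 215},
    "fitting": {11, 12, 17, 36, 78, 136, 215},
    "interpolation": {11, 18, 36, 125, 132},
}

# (curve and fitting): set intersection
curve_and_fitting = index["curve"] & index["fitting"]
# ((curve and fitting) or interpolation): set union
result = curve_and_fitting | index["interpolation"]

print(sorted(curve_and_fitting))  # [12, 36, 215]
print(sorted(result))             # [11, 12, 18, 36, 125, 132, 215]
```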
