CS533 Information Retrieval

CS533 Information Retrieval Dr. Michal Cutler Lecture #26 May 11, 2000

AI and IR • Started at about the same time • Feigenbaum and Feldman - “Computers and Thought” McGraw Hill, 1963. • Minsky - “Semantic Information Processing” MIT Press, 1968. • Salton - “Automatic Information Organization and Retrieval” McGraw Hill, 1968.

Advantages of Boolean Systems • Easy to understand behavior • Enables formulating complex, very specific queries

Disadvantages of Boolean Systems • Difficult to formulate a complex Boolean query • Output order is not by relevance

Disadvantages of Boolean Systems • All or nothing systems • When users specify (A and B and C and D) should an item with A, B, and C but not D be rejected? • Are all query terms equally important? • Difficult to control size of output. • Too much or too little

The concept of rank • Retrieved documents ordered by decreasing "goodness" (increasing rank) • Rank often computed using a similarity function that compares a document and a query

Advantages of ranked systems • In successful IR systems a high percentage of the top documents are useful to users

Disadvantages of ranked systems • Behavior of system harder to understand

Ranking IR system - models • Vector space • Fuzzy Boolean • Probabilistic

Ranking IR system - models • Knowledge based • Latent semantic indexing • Inference nets • Neural network and genetic algorithms *

The Concept of Relevance • Relevance of a document D to a query Q is subjective • Different users will have different judgements • Same users may judge differently at different times • Degree of relevance of different documents will vary

The Concept of Relevance • In evaluating IR systems it is assumed that: • A subset of the documents of the database (DB) are relevant • A document is either relevant or not

Indexing Effectiveness • Indexing exhaustivity • Term specificity

Stop lists • A stop list is a list of terms which are not included in an index • Traditionally the most frequently occurring English words • Also collection-specific frequent terms, e.g. “computer, machine, program, source, language” in a computer science collection • Some loss of content, e.g. “to be or not to be”
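
As a rough illustration of the stop list idea, the sketch below drops stop-listed terms before indexing. The `STOP_WORDS` set and the `index_terms` helper are hypothetical names chosen for this example, and the tiny word list is an assumption (real stop lists are much longer).

```python
# Hypothetical, deliberately tiny stop list; real lists hold a few hundred words.
STOP_WORDS = {"a", "an", "and", "be", "in", "is", "not", "of", "or", "the", "to"}

def index_terms(text, stop_words=STOP_WORDS):
    """Lowercase the text, split on whitespace, and drop stop-listed terms."""
    return [term for term in text.lower().split() if term not in stop_words]

print(index_terms("The history of information retrieval"))
# The phrase below consists only of stop words, so nothing is left to index.
print(index_terms("to be or not to be"))
```

The second call shows the loss of content mentioned on the slide: every word of the phrase is on the stop list, so a query for it cannot match anything in the index.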

Stemming is used to: • Enhance query formulation (and improve recall) by providing term variants • Reduce size of index files by combining term variants into single index term

n-grams • Fixed length consecutive series of “n” characters • Bigrams: • Sea colony -> (se ea co ol lo on ny) • Trigrams • Sea colony -> (sea col olo lon ony), or -> (#se sea ea# #co col olo lon ony ny#)
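
The bigram and trigram examples above can be reproduced with a short sketch. The function name `word_ngrams` and the `pad` flag are assumptions made for this illustration; `#` is used as the word-boundary marker, as on the slide.

```python
def word_ngrams(text, n, pad=False):
    """Return the character n-grams of each word in text, in order.

    With pad=True each word is wrapped in '#' boundary markers first.
    """
    grams = []
    for word in text.lower().split():
        if pad:
            word = "#" + word + "#"
        # Slide a window of width n across the word.
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

print(word_ngrams("Sea colony", 2))
print(word_ngrams("Sea colony", 3))
print(word_ngrams("Sea colony", 3, pad=True))
```

N-grams are computed per word here, so no n-gram spans the space between "sea" and "colony", which matches the slide's examples.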

Usage of n-grams • Used in World War II by cryptographers • Spell checking • Text compression • Signature files • Stemming

The Vector Space Model • Queries and documents are represented by vectors • Assumes document terms and query terms are independent • Term weight • Variants and meaning of tf and idf • Different normalization schemes
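
A minimal sketch of vector space ranking follows. It assumes one particular weighting variant, raw tf times log(N/df), with cosine normalization; other tf, idf, and normalization schemes are equally valid. The three-document collection and all function names are made up for the example.

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, n_docs):
    """Sparse tf-idf vector: raw term frequency times log(N / df)."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf if df.get(t)}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy collection (an assumption for illustration only).
docs = {
    "d1": "information retrieval ranks documents".split(),
    "d2": "boolean retrieval returns unranked sets".split(),
    "d3": "genetic algorithms evolve solutions".split(),
}
# Document frequency: in how many documents each term occurs.
df = Counter(t for tokens in docs.values() for t in set(tokens))
vectors = {d: tfidf_vector(tokens, df, len(docs)) for d, tokens in docs.items()}

query = tfidf_vector("information retrieval".split(), df, len(docs))
ranking = sorted(docs, key=lambda d: cosine(query, vectors[d]), reverse=True)
print(ranking)
```

Each term is treated as its own dimension, which is where the independence assumption on the slide enters: the dot product has no cross-term contributions.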

Probabilistic information retrieval • Binary independence model • Non-binary independence models

Binary independence model

Fuzzy Boolean Models • Limitations of the Boolean model • Fuzzy models • basic • MMM • Paice • p-norm
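
The difference between the basic fuzzy model and MMM (Mixed Min and Max) can be sketched as below. The softness coefficients `c_and` and `c_or` are illustrative values chosen for this example, not values from the lecture, and the Paice and p-norm models are not shown.

```python
def fuzzy_and(weights):
    """Basic fuzzy AND: the minimum of the term weights."""
    return min(weights)

def fuzzy_or(weights):
    """Basic fuzzy OR: the maximum of the term weights."""
    return max(weights)

def mmm_and(weights, c_and=0.7):
    """MMM AND: mostly the minimum, softened by the maximum."""
    return c_and * min(weights) + (1 - c_and) * max(weights)

def mmm_or(weights, c_or=0.7):
    """MMM OR: mostly the maximum, softened by the minimum."""
    return c_or * max(weights) + (1 - c_or) * min(weights)

# Document weights for query terms A, B, C, D; term D is absent (weight 0).
weights = [0.9, 0.8, 0.7, 0.0]
print(fuzzy_and(weights), mmm_and(weights))
print(fuzzy_or(weights), mmm_or(weights))
```

For the query (A and B and C and D), the basic model still scores the document 0 because D is missing, while MMM gives it partial credit, which addresses the "all or nothing" limitation of strict Boolean retrieval.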

Latent semantic indexing • Designed to overcome: • The language variability problem, where a user expresses a concept with different words than those used in a document • The multiple meanings of words • Uses SVD or two-mode factor analysis

Knowledge Based IR • Knowledge based information retrieval attempts to identify the occurrence of high level concepts in documents • Concepts and their relationships represent the knowledge needed for retrieval • Evidential reasoning provides the link between a document and its concepts

Inference Networks for IR • Turtle and Croft introduced the inference network model for information retrieval • This is a probability-based method • Ranks documents by probability of satisfying a user's information need.

Evaluation • Fallout • Recall and precision • 11 point recall/precision • Average precision
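
The measures listed above can be sketched for a single query as follows. This assumes binary relevance judgements, a non-empty result list, at least one relevant and one non-relevant document, and the non-interpolated definition of average precision (averaged over all relevant documents). The 11-point interpolated curve is not shown, and the function names and sample data are made up for the example.

```python
def evaluate(ranked, relevant, collection_size):
    """Return (recall, precision, fallout) for one retrieved list."""
    hits = len(set(ranked) & relevant)
    recall = hits / len(relevant)
    precision = hits / len(ranked)
    # Fallout: fraction of the non-relevant documents that were retrieved.
    fallout = (len(ranked) - hits) / (collection_size - len(relevant))
    return recall, precision, fallout

def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant document.

    Relevant documents that are never retrieved contribute zero.
    """
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

ranked = ["d3", "d7", "d1", "d9"]   # system output, best first
relevant = {"d3", "d1", "d5"}       # judged relevant for this query
print(evaluate(ranked, relevant, collection_size=10))
print(average_precision(ranked, relevant))
```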

Building inverted files • Memory based • Sort based • Text partitioning • Lexical partitioning (FASTINV)
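
Of the approaches listed, the memory-based one is the simplest to sketch: the whole index is accumulated in a dictionary before being written out. The version below is an illustration only, with a hypothetical function name and toy documents; the sort-based and partitioning methods exist for collections too large for this to fit in memory.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted postings list of (doc_id, term_frequency)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    # Freeze into sorted postings lists, with terms in lexicographic order.
    return {term: sorted(postings.items()) for term, postings in sorted(index.items())}

docs = {1: "sea colony", 2: "sea sea shore"}
index = build_inverted_index(docs)
for term, postings in index.items():
    print(term, postings)
```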

Signature file • Alternative to inverted index • A compressed representation of documents • Uses n-grams and hashing • Enables searching for prefixes and parts of words • No ranks • Techniques to increase efficiency

An alternative data structure to using inverted files • Patricia trees (also called suffix trees) • PAT arrays (also called suffix arrays)

Search engines • Robots and indexing • Using hypertext links to improve retrieval • PageRank - importance of documents • Hubs and Authorities • Webor
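
As one illustration of link-based importance, here is a minimal power-iteration sketch of PageRank. The damping factor 0.85, the fixed iteration count, the uniform handling of pages without out-links, and the three-page graph are all assumptions made for this example rather than details from the lecture.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no out-links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            for q in targets:
                new_rank[q] += damping * rank[p] / len(targets)
        rank = new_rank
    return rank

# Toy web graph: a and b both link to c, and c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Page c ends up most important because two pages link to it, and a ranks above b because it is linked from the important page c while nothing links to b.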

Metasearch Engine Two observations about search engines: • Web pages a user needs are frequently stored in multiple search engines. • The coverage of each search engine is limited. • Combining multiple search engines may increase the coverage. A metasearch engine is a good mechanism for solving these problems.

Metasearch Engines • Data selection problem • Query formulation problem • Result merging problem

Clustering • Some clustering algorithms • Document clustering • Term clustering • Cluster based retrieval

Phrases and Thesaurus • Usages • Phrase generation and recognition • Techniques for automatic building of corpus based thesaurus

Relevance feedback • The main idea • Issues • Query modification examples

Extracts/intelligent abstracts • IR Extracts are lists of fragments of text • IE extracts - extracts words/phrases to generate an abstract • Intelligent abstracts re-phrase content coherently (no redundant text, may use generalizations, etc.)

Themes and text traversals • Text traversals provide a reader with a path of text excerpts • User can specify how large a text traversal should be • The traversal can also be in response to a query
