This lecture discusses the advantages and disadvantages of Boolean and ranked information retrieval systems. It examines the behavior, complexity, and output order of Boolean systems, as well as the understandability and usefulness of ranked systems. Various ranking models, such as vector space, fuzzy Boolean, probabilistic, knowledge-based, latent semantic indexing, inference networks, neural networks, and genetic algorithms, are explored. The concept of relevance, indexing effectiveness, stemming, n-grams, the vector space model, probabilistic information retrieval, fuzzy Boolean models, latent semantic indexing, knowledge-based IR, inference networks, and evaluation methods are also covered. The lecture concludes with discussions on building inverted files, alternative data structures, search engines, and metasearch engines.
CS533 Information Retrieval Dr. Michal Cutler Lecture #26 May 11, 2000
AI and IR • Started at about the same time • Feigenbaum and Feldman - “Computers and Thought” McGraw Hill, 1963. • Minsky - “Semantic Information Processing” MIT Press, 1968. • Salton - “Automatic Information Organization and Retrieval” McGraw Hill, 1968.
Advantages of Boolean Systems • Easy to understand behavior • Enables formulating complex, very specific queries
Disadvantages of Boolean Systems • Difficult to formulate a complex Boolean query • Output order is not by relevance
Disadvantages of Boolean Systems • All or nothing systems • When users specify (A and B and C and D) should an item with A, B, and C but not D be rejected? • Are all query terms equally important? • Difficult to control size of output. • Too much or too little
The concept of rank • Retrieved documents ordered by decreasing "goodness" (increasing rank) • Rank often computed using a similarity function that compares a document and a query
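The contrast between all-or-nothing Boolean matching and ranking can be sketched with a deliberately simple similarity function. The sketch below assumes documents and queries are plain lists of terms, and uses the count of shared terms as the score; the function names and the toy collection are illustrative and not taken from the lecture.

```python
def boolean_and(doc_terms, query_terms):
    """Strict Boolean AND: every query term must occur in the document."""
    return set(query_terms) <= set(doc_terms)


def overlap_score(doc_terms, query_terms):
    """Toy similarity (an assumption): number of distinct query terms in the document."""
    return len(set(query_terms) & set(doc_terms))


def rank_by_overlap(docs, query_terms):
    """Return ids of documents with a non-zero score, best score first (ties by id)."""
    scored = [(overlap_score(terms, query_terms), doc_id)
              for doc_id, terms in docs.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ordered if score > 0]


docs = {
    "d1": ["a", "b", "c", "d"],
    "d2": ["a", "b", "c"],      # has A, B and C but not D
    "d3": ["a"],
    "d4": ["x"],
}
query = ["a", "b", "c", "d"]

strict_hits = [d for d, terms in docs.items() if boolean_and(terms, query)]
ranked_hits = rank_by_overlap(docs, query)
```

Here the strict query keeps only `d1`, while the ranked output still returns `d2` in second place instead of rejecting it.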
Advantages of ranked systems • In successful IR systems a high percentage of the top documents are useful to users
Disadvantages of ranked systems • Behavior of system harder to understand
Ranking IR system - models • Vector space • Fuzzy Boolean • Probabilistic
Ranking IR system - models • Knowledge based • Latent semantic indexing • Inference nets • Neural network and genetic algorithms *
The Concept of Relevance • Relevance of a document D to a query Q is subjective • Different users will have different judgements • The same user may judge differently at different times • Degree of relevance of different documents will vary
The Concept of Relevance • In evaluating IR systems it is assumed that: • A subset of the documents of the database (DB) are relevant • A document is either relevant or not
Indexing Effectiveness • Indexing exhaustivity • Term specificity
Stop lists • A stop list is a list of terms which are not included in an index • Traditionally most frequently occurring English words. • “computer, machine, program, source, language” in a computer science collection • Some loss of content “to be or not to be”
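A minimal sketch of stop-list filtering, assuming whitespace tokenization and a tiny hand-picked stop list (both are illustrative assumptions, not the lecture's actual list):

```python
# Hypothetical stop list for illustration only; real lists are much longer.
STOP_WORDS = frozenset(
    {"a", "an", "and", "be", "in", "is", "not", "of", "or", "the", "to"}
)


def index_terms(text, stop_words=STOP_WORDS):
    """Lowercase the text, split on whitespace, and drop stop-list terms."""
    return [term for term in text.lower().split() if term not in stop_words]


kept = index_terms("The history of the computer")
lost = index_terms("To be or not to be")
```

With this list, `lost` is empty: every word of the phrase is a stop word, which is the loss of content the slide refers to.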
Stemming is used to: • Enhance query formulation (and improve recall) by providing term variants • Reduce size of index files by combining term variants into single index term
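How stemming combines term variants into a single index term can be sketched with a naive suffix stripper. This is not the Porter algorithm or any stemmer named in the lecture; the suffix list and the minimum stem length of 3 are assumptions chosen for illustration.

```python
# Illustrative suffix list, checked in this order (an assumption).
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


variants = ["retrieves", "retrieved", "retrieving"]
stems = {naive_stem(w) for w in variants}
```

All three variants map to the same stem, so one index entry serves them all and a query using any variant matches documents using the others.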
n-grams • Fixed length consecutive series of “n” characters • Bigrams: • Sea colony -> (se ea co ol lo on ny) • Trigrams • Sea colony -> (sea col olo lon ony), or -> (#se sea ea# #co col olo lon ony ny#)
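The bigram and trigram examples above can be reproduced with a short sketch. It assumes n-grams are taken within each whitespace-separated word, and that `#` marks word boundaries when padding is requested, as in the slide; the function name is illustrative.

```python
def char_ngrams(text, n, pad=False):
    """Return the character n-grams of each word in text, in order.

    With pad=True each word is wrapped in '#' boundary markers first.
    """
    grams = []
    for word in text.lower().split():
        if pad:
            word = "#" + word + "#"
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


bigrams = char_ngrams("Sea colony", 2)
# ['se', 'ea', 'co', 'ol', 'lo', 'on', 'ny']
trigrams = char_ngrams("Sea colony", 3)
# ['sea', 'col', 'olo', 'lon', 'ony']
padded = char_ngrams("Sea colony", 3, pad=True)
# ['#se', 'sea', 'ea#', '#co', 'col', 'olo', 'lon', 'ony', 'ny#']
```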
Usage of n-grams • Used in World War II by cryptographers • Spell checking • Text compression • Signature files • Stemming
The Vector Space Model • Queries and documents are represented by vectors • Assumes document terms and query terms are independent • Term weight • Variants and meaning of tf and idf • Different normalization schemes
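As a rough sketch of the vector space model, the code below weights terms by raw term frequency times idf = log(N / df) and compares vectors with cosine similarity. This is only one of the many tf, idf and normalization variants the slide alludes to; the function names and the toy collection are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs maps doc id -> list of terms. Returns (tf-idf vectors, idf table)."""
    n_docs = len(docs)
    df = Counter(term for terms in docs.values() for term in set(terms))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = {
        doc_id: {term: tf * idf[term] for term, tf in Counter(terms).items()}
        for doc_id, terms in docs.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_cosine(docs, query_terms):
    """Return doc ids ordered by decreasing cosine similarity to the query."""
    vectors, idf = tfidf_vectors(docs)
    query = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query_terms).items()}
    scores = {doc_id: cosine(query, vec) for doc_id, vec in vectors.items()}
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


docs = {
    "d1": ["boolean", "query", "model"],
    "d2": ["vector", "space", "model"],
    "d3": ["vector", "query", "ranking"],
}
ranking = rank_by_cosine(docs, ["vector", "space"])
```

In this toy collection `d2` contains both query terms and ranks first, `d3` shares one term, and `d1` shares none.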
Probabilistic information retrieval • Binary independence model • Non-binary independence models
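One common way to turn the binary independence model into term weights is the Robertson/Sparck Jones relevance weight with 0.5 smoothing. The sketch below assumes that formulation; whether the lecture used exactly this variant is not stated, and the parameter names follow the usual convention: `N` documents in the collection, `n` containing the term, `R` known relevant documents, `r` of them containing the term.

```python
import math


def rsj_weight(N, n, R, r):
    """Smoothed Robertson/Sparck Jones weight for a single term."""
    numerator = (r + 0.5) * (N - n - R + r + 0.5)
    denominator = (n - r + 0.5) * (R - r + 0.5)
    return math.log(numerator / denominator)


def bim_score(doc_terms, query_terms, weights):
    """Sum the weights of query terms that are present in the document.

    Presence is binary: repeated occurrences of a term do not add to the score.
    """
    doc_terms = set(doc_terms)
    return sum(weights[t] for t in set(query_terms)
               if t in doc_terms and t in weights)
```

With no relevance information (`R = r = 0`) the weight reduces to an idf-like quantity, so rare terms score higher than common ones.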
Fuzzy Boolean Models • Limitations of the Boolean model • Fuzzy models • basic • MMM • Paice • p-norm
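The basic fuzzy model and the MMM (Mixed Min and Max) model can be contrasted in a few lines. The sketch assumes each document has a weight in [0, 1] for every query term, and uses a single coefficient per operator with the second coefficient set to one minus the first; the default of 0.7 is an illustrative assumption.

```python
def fuzzy_and(weights):
    """Basic fuzzy AND: the minimum of the term weights."""
    return min(weights)


def fuzzy_or(weights):
    """Basic fuzzy OR: the maximum of the term weights."""
    return max(weights)


def mmm_and(weights, c_and=0.7):
    """MMM AND: mostly the minimum, softened by the maximum."""
    return c_and * min(weights) + (1.0 - c_and) * max(weights)


def mmm_or(weights, c_or=0.7):
    """MMM OR: mostly the maximum, softened by the minimum."""
    return c_or * max(weights) + (1.0 - c_or) * min(weights)


# Weights of terms A, B, C, D in one document; D is absent.
weights = [0.9, 0.8, 0.7, 0.0]
strict = fuzzy_and(weights)
soft = mmm_and(weights)
```

For the query (A and B and C and D), the basic model scores this document 0 because D is missing, while MMM still gives it a small positive score.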
Latent semantic indexing • Designed to overcome: • The language variability problem, where a user expresses a concept with different words than those used in a document • The multiple meanings of words • Uses SVD or two-mode factor analysis
Knowledge Based IR • Knowledge based information retrieval attempts to identify the occurrence of high level concepts in documents • Concepts and their relationships represent the knowledge needed for retrieval • Evidential reasoning provides the link between a document and its concepts
Inference Networks for IR • Turtle and Croft introduced the inference network model for information retrieval • This is a probability-based method • Ranks documents by probability of satisfying a user's information need.
Evaluation • Fallout • Recall and precision • 11 point recall/precision • Average precision
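The measures listed above can be sketched for a single query, assuming binary relevance judgements. Average precision has several variants; the one below divides by the total number of relevant documents, which is an assumption about the intended definition.

```python
def precision_recall_fallout(retrieved, relevant, collection_size):
    """Set-based measures for one query; relevant is a set of doc ids."""
    retrieved = set(retrieved)
    hits = len(retrieved & relevant)
    non_relevant = collection_size - len(relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    fallout = (len(retrieved) - hits) / non_relevant if non_relevant else 0.0
    return precision, recall, fallout


def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    hits = 0
    total = 0.0
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}
p, r, f = precision_recall_fallout(ranking, relevant, collection_size=10)
ap = average_precision(ranking, relevant)
```

In this example two of the five retrieved documents are relevant, so precision is 2/5, recall is 2/3, and fallout is 3/7 (three of the seven non-relevant documents were retrieved).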
Building inverted files • Memory based • Sort based • Text partitioning • Lexical partitioning (FASTINV)
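The memory-based approach is the simplest of these to sketch: the whole index is accumulated in a dictionary before being written out in sorted order. The code assumes whitespace tokenization and postings of the form (document id, term frequency); it does not attempt the sort-based or partitioned methods.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs maps doc id -> text. Returns term -> sorted list of (doc id, tf)."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term][doc_id] = postings[term].get(doc_id, 0) + 1
    # Sort terms and postings so the result resembles an on-disk inverted file.
    return {term: sorted(entries.items())
            for term, entries in sorted(postings.items())}


index = build_inverted_index({1: "sea colony", 2: "sea sea shore"})
```

This only works while the full index fits in memory, which is the limitation that motivates the other methods on the slide.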
Signature file • Alternative to inverted index • A compressed representation of documents • Uses n-grams and hashing • Enable searching for prefix and part of words • No ranks • Techniques to increase efficiency
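A minimal sketch of a signature built from hashed n-grams, assuming superimposed coding: each trigram sets one bit and all bits are OR-ed into a fixed-width signature. The 64-bit width, the use of `zlib.crc32` as the hash, and the function names are assumptions chosen for illustration.

```python
import zlib

SIGNATURE_BITS = 64  # illustrative width; real systems tune this


def ngram_bits(word, n=3, pad=True):
    """Bit mask with one bit set per hashed character n-gram of word."""
    text = word.lower()
    if pad:
        text = "#" + text + "#"   # boundary markers for whole-word matching
    mask = 0
    for i in range(len(text) - n + 1):
        gram = text[i:i + n].encode("utf-8")
        mask |= 1 << (zlib.crc32(gram) % SIGNATURE_BITS)
    return mask


def document_signature(text):
    """OR together the n-gram bits of every word in the document."""
    signature = 0
    for word in text.split():
        signature |= ngram_bits(word)
    return signature


def may_contain(signature, query, whole_word=True):
    """True if every query bit is set. False positives are possible.

    A fragment shorter than n characters yields no n-grams and always matches.
    """
    query_mask = ngram_bits(query, pad=whole_word)
    return (signature & query_mask) == query_mask


sig = document_signature("sea colony")
```

A `True` answer only means the document is a candidate and still has to be checked against the text, whereas a `False` answer is definite. Setting `whole_word=False` lets a fragment such as `olon` match inside `colony`.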
An alternative data structure to using inverted files • Patricia trees (also called suffix trees) • PAT arrays (also called suffix arrays)
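A suffix (PAT) array can be sketched as the list of all suffix start positions sorted by the suffix text, searched with binary search. The construction below sorts full suffixes, which is simple but not efficient for large texts; it is an illustration, not the construction method from the lecture.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic suffix order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return sorted start positions of pattern using two binary searches."""
    m = len(pattern)

    def prefix(k):
        start = suffix_array[k]
        return text[start:start + m]

    # First suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


text = "sea colony sea"
sa = build_suffix_array(text)
positions = find_occurrences(text, sa, "sea")
```

Because every suffix is indexed, any substring can be located, not only whole words as with a word-based inverted file.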
Search engines • Robots and indexing • Using hypertext links to improve retrieval • PageRank - importance of documents • Hubs and Authorities • Webor
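The idea behind PageRank can be sketched with a plain power iteration over a small link graph. The damping factor of 0.85, the fixed iteration count, and the handling of pages without out-links (their rank is spread evenly over all pages) are assumptions of this sketch.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps page -> list of pages it links to. Returns page -> rank."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's damped rank equally among its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its damped rank over every page.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

Page `c` is linked from both `a` and `b` and ends up with the highest rank, while `b`, linked only from `a`, ends up lowest.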
Metasearch Engine • Two observations about search engines: • Web pages a user needs are frequently stored in multiple search engines. • The coverage of each search engine is limited. • Combining multiple search engines may increase the coverage. • A metasearch engine is a good mechanism for solving these problems.
Metasearch Engines • Data selection problem • Query formulation problem • Result merging problem
Clustering • Some clustering algorithms • Document clustering • Term clustering • Cluster based retrieval
Phrases and Thesaurus • Usages • Phrase generation and recognition • Techniques for automatic building of corpus based thesaurus
Relevance feedback • The main idea • Issues • Query modification examples
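A standard example of query modification is the Rocchio formula, which moves the query vector toward documents judged relevant and away from those judged non-relevant. The sketch assumes sparse vectors stored as dicts; the coefficient values and the dropping of non-positive weights are common choices but are assumptions here, not values from the lecture.

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return alpha*q + beta*mean(relevant) - gamma*mean(nonrelevant).

    All vectors are dicts mapping term -> weight.
    Terms whose new weight is not positive are dropped.
    """
    terms = set(query)
    for vector in relevant + nonrelevant:
        terms |= set(vector)

    new_query = {}
    for term in terms:
        weight = alpha * query.get(term, 0.0)
        if relevant:
            weight += beta * sum(v.get(term, 0.0) for v in relevant) / len(relevant)
        if nonrelevant:
            weight -= gamma * sum(v.get(term, 0.0) for v in nonrelevant) / len(nonrelevant)
        if weight > 0:
            new_query[term] = weight
    return new_query


modified = rocchio(
    query={"jaguar": 1.0},
    relevant=[{"jaguar": 1.0, "car": 1.0}],
    nonrelevant=[{"jaguar": 1.0, "cat": 1.0}],
)
```

The modified query gains the term `car` from the relevant document, while `cat` receives a negative weight and is dropped.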
Extracts/intelligent abstracts • IR Extracts are lists of fragments of text • IE extracts - extracts words/phrases to generate an abstract • Intelligent abstracts re-phrase content coherently (no redundant text, may use generalizations, etc.)
Themes and text traversals • Text traversals provide a reader with a path of text excerpts • The user can specify how large the text traversal should be • The traversal can also be in response to a query