Learn the fundamentals of Information Retrieval (IR) and gain insights into the current issues and topics in the field. Topics include system architecture, tokenization, Boolean queries, term weighting, ranking algorithms, and evaluation. Explore the history of IR and understand the challenges and advancements in retrieving relevant information.
Current Topics in Information Access: IR Background • Marti Hearst • Fall '98
Last Time • The problem of information access • Matching task and search type • Why text is tough • Current topics and issues
Today • Background on IR Basics • System architecture • Tokenization • Boolean queries • Term weighting and Ranking Algorithms • Inverted Indices • Evaluation
Some IR History • Roots in the scientific “Information Explosion” following WWII • Interest in computer-based IR from mid 1950’s • H.P. Luhn at IBM (1958) • Probabilistic models at Rand (Maron & Kuhns) (1960) • Boolean system development at Lockheed (‘60s) • Vector Space Model (Salton at Cornell 1965) • Statistical Weighting methods and theoretical advances (‘70s) • Refinements and Advances in application (‘80s) • User Interfaces, Large-scale testing and application (‘90s)
Information Retrieval • Task Statement Build a system that retrieves documents that users are likely to find relevant to their information needs.
Structure of an IR System • [Figure, adapted from Soergel, p. 19: an information storage and retrieval system with a storage line and a search line. Storage line: documents and data undergo indexing (descriptive and subject) and are stored as document representations (Store 2). Search line: interest profiles and queries are formulated in terms of descriptors and stored as profiles/search requests (Store 1). The "rules of the game" = rules for subject indexing + a thesaurus (lead-in vocabulary and indexing language). A comparison/matching step between the two stores yields potentially relevant documents.]
[Figure, built up over several slides: IR system architecture. The user's information need is entered as text input and parsed into a query; the document collections are pre-processed and indexed; a rank-or-match step compares the query against the index; query reformulation feeds the results back into the query.]
Steps in a “typical IR System” • Document preprocessing • Tokenization • Stemming/Normalizing • Query processing • Tokenization • Interpretation of Syntax • (Query Expansion) • Retrieval of documents according to similarity to query • Presentation of Retrieval Results • Relevance Feedback/Query reformulation
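The steps above can be sketched end-to-end in a few lines of Python. This is a minimal illustration with hypothetical helper names, not a real system: stemming, syntax interpretation, and relevance feedback are omitted, and scoring is a bare term-overlap count.

```python
# A minimal sketch of the "typical IR system" steps above.
import re
from collections import Counter

def tokenize(text):
    # Tokenization: lowercase and split on non-letter characters.
    return re.findall(r"[a-z]+", text.lower())

def preprocess(doc):
    # Document preprocessing: tokenize into a bag of terms
    # (stemming/normalizing omitted for brevity).
    return Counter(tokenize(doc))

def retrieve(query, docs):
    # Retrieval by similarity to the query: here, the score is simply
    # the number of query-term occurrences in each document.
    q_terms = tokenize(query)
    scores = {doc_id: sum(bag[t] for t in q_terms)
              for doc_id, bag in docs.items()}
    # Presentation of results: rank by descending score.
    return sorted(scores, key=scores.get, reverse=True)

docs = {1: preprocess("cats and dogs"),
        2: preprocess("dog collars and leashes")}
print(retrieve("dog collar", docs))  # doc 2 ranks first
```

Note that doc 1 scores zero even though it mentions "dogs": without stemming, "dog" and "dogs" do not match, which motivates the next slides.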
Stemming and Morphological Analysis • Goal: "normalize" similar words • Morphology ("form" of words) • Inflectional morphology • E.g., inflected verb endings and noun number • Never changes grammatical class • dog, dogs • Derivational morphology • Derives one word from another • Often changes grammatical class • build, building; health, healthy
Automated Methods • Powerful multilingual tools exist for morphological analysis • PCKimmo, Xerox Lexical technology • Require a grammar and dictionary • Use "two-level" automata • Stemmers: • Very dumb rules work well (for English) • Porter stemmer: iteratively remove suffixes • Improvement: pass results through a lexicon • (Storing a full dictionary was once considered too expensive)
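A toy suffix-stripping stemmer in the spirit of the "very dumb rules" above. This is NOT the full Porter algorithm (which applies staged rules with measure conditions); it is a sketch with an invented rule list, just to show the iterate-and-strip idea.

```python
# Toy rule-based stemmer: strip the first matching suffix, with a crude
# length guard so short words like "is" are left alone.
def stem(word):
    for suffix in ("ational", "ing", "ies", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            if suffix == "ies":
                return word[:-3] + "y"   # ponies -> pony
            return word[:-len(suffix)]
    return word

print(stem("dogs"))      # "dog"
print(stem("building"))  # "build"
```

Passing the outputs through a lexicon, as the slide suggests, would catch over-stemming errors that rules like these inevitably make.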
Query Languages • A way to express the question (information need) • Types: • Boolean • Natural Language • Stylized Natural Language • Form-Based (GUI)
Simple query language: Boolean • Terms + Connectors • terms • words • normalized (stemmed) words • phrases • thesaurus terms • connectors • AND • OR • NOT
Boolean Queries • Cat • Cat OR Dog • Cat AND Dog • (Cat AND Dog) • (Cat AND Dog) OR Collar • (Cat AND Dog) OR (Collar AND Leash) • (Cat OR Dog) AND (Collar OR Leash)
Boolean Queries • (Cat OR Dog) AND (Collar OR Leash) • Combinations that match: any document containing at least one of {Cat, Dog} and at least one of {Collar, Leash} (e.g. Cat + Collar, Dog + Leash, Cat + Dog + Leash) • Combinations that do not match: documents containing terms from only one side of the AND (e.g. Cat alone, Cat + Dog, Collar + Leash)
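Boolean retrieval maps directly onto set operations over the documents containing each term. A small sketch with made-up document sets, evaluating the query above:

```python
# For each term, the set of document ids containing it (toy data).
docs_with = {
    "cat":    {1, 2, 5},
    "dog":    {2, 3},
    "collar": {3, 4, 5},
    "leash":  {4},
}

# (Cat OR Dog) AND (Collar OR Leash):
# OR is set union (|), AND is set intersection (&).
result = ((docs_with["cat"] | docs_with["dog"])
          & (docs_with["collar"] | docs_with["leash"]))
print(result)  # docs containing a pet term and a restraint term
```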
Boolean Logic • [Figure: Venn diagram of two sets A and B, illustrating AND as the intersection, OR as the union, and NOT as the complement.]
Boolean Searching • Information need: "Measurement of the width of cracks in prestressed concrete beams" • Formal query: cracks AND beams AND width_measurement AND prestressed_concrete • Relaxed query (any three of the four concepts Cracks, Beams, Width measurement, Prestressed concrete): (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
Boolean Problems • Disjunctive (OR) queries lead to too many results, often off-target • Conjunctive (AND) queries lead to reduced, and commonly zero, result sets • Not intuitive to most people
Advantages and Disadvantages of the Boolean Model • Advantages: complete expressiveness for any identifiable subset of the collection; exact and simple to program; the whole panoply of Boolean algebra is available • Disadvantages: complex query syntax is often misunderstood (if understood at all); problems of null output and information overload; output is not ordered in any useful fashion
Pseudo-Boolean Queries • A new notation, from web search • +cat dog +collar leash • Does not mean the same thing as the Boolean query! • Need a way to group combinations • Phrases: • "stray cat" AND "frayed collar" • +"stray cat" +"frayed collar"
Boolean Extensions • Fuzzy Logic • Adds weights to each term/concept • ta AND tb is interpreted as MIN(w(ta),w(tb)) • ta OR tb is interpreted as MAX (w(ta),w(tb)) • Proximity/Adjacency operators • Interpreted as additional constraints on Boolean AND
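The fuzzy-logic interpretation above is just min/max over per-term weights. A sketch with made-up weights:

```python
# Fuzzy-logic Boolean connectives: AND -> min of the term weights,
# OR -> max, as described on the slide (weights are toy values in [0,1]).
w = {"ta": 0.8, "tb": 0.3}

and_score = min(w["ta"], w["tb"])  # ta AND tb
or_score  = max(w["ta"], w["tb"])  # ta OR tb
print(and_score, or_score)
```

Unlike strict Boolean matching, these scores vary continuously, so documents can be ranked rather than just accepted or rejected.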
Ranking Algorithms • Assign weights to the terms in the query. • Assign weights to the terms in the documents. • Compare the weighted query terms to the weighted document terms. • Rank order the results.
Indexing and Representation: The Vector Space Model • Document represented by a vector of terms • Words (or word stems) • Phrases (e.g. computer science) • Removes words on a "stop list" • Documents aren't about "the" • Terms are often assumed to be uncorrelated • Correlations between term vectors imply a similarity between documents • For efficiency, an inverted index of terms is often stored
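The inverted index mentioned in the last bullet is a map from each term to the set of documents containing it, so matching never has to scan every document. A minimal sketch (real indices also store positions, frequencies, and weights):

```python
# Build a minimal inverted index: term -> set of document ids.
from collections import defaultdict

def build_inverted_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

index = build_inverted_index({1: "nova galaxy", 2: "galaxy heat"})
print(sorted(index["galaxy"]))  # documents containing "galaxy"
```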
Document Representation: What values to use for terms • Boolean (term present/absent) • tf (term frequency): count of times the term occurs in the document • The more times a term t occurs in document d, the more likely it is that t is relevant to the document • Used alone, favors common words and long documents • df (document frequency): the more a term t occurs throughout all documents, the more poorly t discriminates between documents • tf-idf (term frequency * inverse document frequency): a high value indicates that the word occurs more often in this document than average
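These three quantities can be computed directly for a toy collection. The sketch below uses the common tf * log(N/df) weighting; the slides do not fix one formula, and many variants (smoothing, log-tf) exist.

```python
# tf, df, and tf-idf over a toy three-document collection.
import math

docs = {
    1: ["cat", "cat", "dog"],
    2: ["dog", "collar"],
    3: ["cat", "leash"],
}
N = len(docs)  # collection size

def tf(term, doc_id):
    # Term frequency: occurrences of term in the document.
    return docs[doc_id].count(term)

def df(term):
    # Document frequency: number of documents containing the term.
    return sum(1 for terms in docs.values() if term in terms)

def tf_idf(term, doc_id):
    # High when the term is frequent here but rare collection-wide.
    return tf(term, doc_id) * math.log(N / df(term))

print(tf_idf("cat", 1))     # tf=2, df=2 of 3 docs
print(tf_idf("collar", 2))  # tf=1, df=1 of 3 docs
```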
Vector Representation • Documents and Queries are represented as vectors. • Position 1 corresponds to term 1, position 2 to term 2, position t to term t
Document Vectors • [Table: term-weight vectors for documents A through I over the terms nova, galaxy, heat, h'wood, film, role, diet, fur; each filled cell holds a weight between 0.1 and 1.0, and most cells are empty, i.e. the vectors are sparse.]
Assigning Weights • Want to weight terms highly if they are • frequent in relevant documents … BUT • infrequent in the collection as a whole
Assigning Weights • tf x idf measure: • term frequency (tf) • inverse document frequency (idf)
tf x idf normalization • Normalize the term weights (so longer documents are not unfairly given more weight) • To normalize usually means to force all values to fall within a certain range, usually between 0 and 1, inclusive
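One standard way to do this (an assumption here, since the slide does not pin down a formula) is cosine length normalization: divide each weight by the vector's Euclidean length, so every document vector has length 1 regardless of how long the document is.

```python
# Cosine length normalization of a term-weight vector.
import math

def normalize(weights):
    # Divide each weight by the vector's Euclidean length.
    length = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / length for t, w in weights.items()}

vec = normalize({"cat": 3.0, "dog": 4.0})  # length 5
print(vec)
```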
Vector Space Similarity Measure • Combine tf x idf weights into a similarity measure
Computing Similarity Scores • [Figure: 2-D plot with axes from 0.2 to 1.0 illustrating how a similarity score is computed between vectors in term space.]
Documents in Vector Space • [Figure: documents D1 through D11 plotted as points in a space whose axes are the terms t1, t2, t3; nearby points represent similar documents.]
Similarity Measures • Simple matching (coordination level match) • Dice's Coefficient • Jaccard's Coefficient • Cosine Coefficient • Overlap Coefficient
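For binary term vectors these measures reduce to set formulas, sketched below (weighted variants of each also exist; the test data are made up):

```python
# The listed similarity measures in their binary (set) forms,
# for term sets X and Y.
import math

def simple_match(X, Y):        # coordination level match
    return len(X & Y)

def dice(X, Y):                # 2|X∩Y| / (|X| + |Y|)
    return 2 * len(X & Y) / (len(X) + len(Y))

def jaccard(X, Y):             # |X∩Y| / |X∪Y|
    return len(X & Y) / len(X | Y)

def cosine(X, Y):              # |X∩Y| / sqrt(|X||Y|)
    return len(X & Y) / math.sqrt(len(X) * len(Y))

def overlap(X, Y):             # |X∩Y| / min(|X|, |Y|)
    return len(X & Y) / min(len(X), len(Y))

X = {"cat", "dog", "collar"}
Y = {"dog", "collar", "leash", "fur"}
print(jaccard(X, Y))
```

All but simple matching normalize the raw overlap by some function of the set sizes, which is why they behave similarly in practice.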
Problems with Vector Space • There is no real theoretical basis for the assumption of a term space • it is more a visualization aid than something with any real basis • most similarity measures work about the same regardless of model • Terms are not really orthogonal dimensions • Terms are not independent of all other terms
Probabilistic Models • A rigorous formal model that attempts to predict the probability that a given document will be relevant to a given query • Ranks retrieved documents according to this probability of relevance (the Probability Ranking Principle) • Relies on accurate estimates of probabilities for accurate results
Probabilistic Retrieval • Goes back to the 1960s (Maron and Kuhns) • Robertson's "Probabilistic Ranking Principle" • Retrieved documents should be ranked in decreasing probability that they are relevant to the user's query • How to estimate these probabilities? • Several methods (Model 1, Model 2, Model 3) with different emphases on how the estimates are done
Probabilistic Models: Some Notation • D = all present and future documents • Q = all present and future queries • (Di, Qj) = a document/query pair • x = class of similar documents • y = class of similar queries • Relevance is a relation: R = {(Di, Qj) : document Di is judged relevant by the submitter of query Qj}
Probabilistic Models: Logistic Regression • The probability of relevance is estimated by logistic regression: a sample set of documents with relevance judgments is used to fit the coefficient values • At retrieval time the probability estimate is obtained from the six X attribute measures shown next
Probabilistic Models: Logistic Regression attributes • Average Absolute Query Frequency • Query Length • Average Absolute Document Frequency • Document Length • Average Inverse Document Frequency • Inverse Document Frequency • Number of terms in common between query and document (logged)
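The formula omitted from the previous slide is presumably the standard logistic form; a hedged reconstruction, with coefficients c_k fit by regression on the sample set and X_k the attribute measures listed above:

```latex
\log O(R \mid Q_j, D_i) = c_0 + \sum_{k=1}^{6} c_k X_k,
\qquad
P(R \mid Q_j, D_i) = \frac{e^{\log O(R \mid Q_j, D_i)}}{1 + e^{\log O(R \mid Q_j, D_i)}}
```

The log-odds is a linear function of the attributes, and the logistic transform maps it back to a probability in [0, 1], which is then used to rank documents.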
Probabilistic Models • Advantages: strong theoretical basis; in principle should supply the best predictions of relevance given the available information; can be implemented similarly to the vector model • Disadvantages: relevance information is required, or must be "guesstimated"; important indicators of relevance may not be terms, though usually only terms are used; optimally requires ongoing collection of relevance information
Vector and Probabilistic Models • Support “natural language” queries • Treat documents and queries the same • Support relevance feedback searching • Support ranked retrieval • Differ primarily in theoretical basis and in how the ranking is calculated • Vector assumes relevance • Probabilistic relies on relevance judgments or estimates
Simple Presentation of Results • Order by similarity • Decreasing order of presumed relevance • Items retrieved early in the search may help generate relevance feedback • Select the top k documents • Select documents within a similarity threshold of the query