This course covers the principles and techniques of information organization and retrieval systems, including storage, indexing, querying, and retrieval models. Topics include Boolean retrieval, vector space model, and probabilistic models.
SIMS 202: Information Organization and Retrieval Prof. Marti Hearst and Prof. Ray Larson UC Berkeley SIMS Tues/Thurs 9:30-11:00am Fall 2000
Midterm Review (Most slides taken from earlier lectures)
Structure of an IR System (adapted from Soergel, p. 19) • Storage line: documents & data are indexed (descriptive and subject) and stored as document representations (Store 2) • Search line: interest profiles & queries are formulated in terms of descriptors and stored as profiles/search requests (Store 1) • The two stores are compared/matched to yield potentially relevant documents • "Rules of the game" = rules for subject indexing + thesaurus (which consists of a lead-in vocabulary and an indexing language)
Search is an Iterative Process (diagram: the searcher cycles between goals, a workspace, and repositories)
Cognitive (Human) Aspects of Information Access and Retrieval • “Finding Out About” (FOA) • types of information needs • specifying information needs (queries) • the process of information access • search strategies • “sensemaking” • Relevance • Modeling the User
Retrieval Models • Boolean Retrieval • Ranked Retrieval • Vector Space Model • Probabilistic Models
Boolean Queries • (Cat OR Dog) AND (Collar OR Leash) • A combination of terms satisfies this statement when it includes at least one of {Cat, Dog} and at least one of {Collar, Leash}, e.g.: • Cat + Collar • Cat + Leash • Dog + Collar • Dog + Leash • Cat + Dog + Collar + Leash
Boolean Queries • (Cat OR Dog) AND (Collar OR Leash) • A combination fails when either clause is left entirely unmatched, e.g.: • Cat alone • Dog alone • Collar alone • Leash alone • Cat + Dog (no Collar or Leash) • Collar + Leash (no Cat or Dog)
Boolean Queries • Usually expressed as INFIX operators in IR • ((a AND b) OR (c AND b)) • NOT is a UNARY PREFIX operator • ((a AND b) OR (c AND (NOT b))) • AND and OR can be n-ary operators • (a AND b AND c AND d) • Some rules (De Morgan revisited): • NOT(a) AND NOT(b) = NOT(a OR b) • NOT(a) OR NOT(b) = NOT(a AND b) • NOT(NOT(a)) = a
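The Boolean combinations and De Morgan rules above can be sketched in a few lines of Python (a minimal sketch; the document collection and helper names below are invented for illustration):

```python
# Minimal sketch: evaluate (Cat OR Dog) AND (Collar OR Leash) against
# per-document term sets. The docs and names are illustrative, not from the lecture.
docs = {
    1: {"cat", "collar"},
    2: {"dog", "leash"},
    3: {"cat", "dog"},        # neither collar nor leash -> should not match
    4: {"collar", "leash"},   # neither cat nor dog -> should not match
}

def matches(terms):
    """(cat OR dog) AND (collar OR leash)"""
    return (("cat" in terms) or ("dog" in terms)) and \
           (("collar" in terms) or ("leash" in terms))

hits = sorted(doc_id for doc_id, terms in docs.items() if matches(terms))

# De Morgan check on this data: NOT(a) AND NOT(b) == NOT(a OR b)
demorgan_ok = all(
    (("cat" not in t) and ("dog" not in t)) == (not ("cat" in t or "dog" in t))
    for t in docs.values()
)
```

Only documents 1 and 2 satisfy both clauses, matching the satisfying/failing combinations listed above.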
Boolean Searching • “Measurement of the width of cracks in prestressed concrete beams” • Formal query: cracks AND beams AND width_measurement AND prestressed_concrete • Relaxed query, requiring any three of Cracks (C), Beams (B), Width measurement (W), Prestressed concrete (P): (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
Boolean Logic (Venn diagram: three terms t1, t2, t3 partition documents D1–D11 into eight minterms m1–m8, one for each combination of term presence/absence, from t1 AND NOT t2 AND NOT t3 through t1 AND t2 AND t3)
Precedence Ordering • In what order do we evaluate the components of the Boolean expression? • Parentheses are evaluated first • (a or b) and (c or d) • (a or (b and c) or d) • Usually start from the left and work right (in case of ties) • Usually (if there are no parentheses): • NOT before AND • AND before OR
Boolean Problems • Disjunctive (OR) queries lead to information overload • Conjunctive (AND) queries lead to reduced, and commonly empty, result sets • Conjunctive queries imply a reduction in Recall
Vector Space Model • Documents are represented as vectors in term space • Terms are usually stems • In the simplest case, documents are represented by binary vectors of terms • Queries are represented the same way as documents • Query and document weights are based on the length and direction of their vectors • A vector distance measure between the query and documents is used to rank retrieved documents
Vector Space with Term Weights and Cosine Matching • Di = (di1, wdi1; di2, wdi2; …; dit, wdit) • Q = (qi1, wqi1; qi2, wqi2; …; qit, wqit) • Example in a two-term space (Term A, Term B): Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7)
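The cosine match on this two-term example can be worked through directly (a minimal sketch; only the Q, D1, D2 values come from the slide):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# The two-term example from the slide:
Q  = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

sim1 = cosine(Q, D1)   # ~0.73
sim2 = cosine(Q, D2)   # ~0.98
```

D2 points in nearly the same direction as Q, so it ranks above D1 even though D1's vector is longer: cosine matching compares direction, not magnitude.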
Assigning Weights to Terms • Binary Weights • Raw term frequency • tf x idf • Recall the Zipf distribution • Want to weight terms highly if they are • frequent in relevant documents … BUT • infrequent in the collection as a whole • Automatically derived thesaurus terms
Binary Weights • Only the presence (1) or absence (0) of a term is included in the vector
Raw Term Weights • The frequency of occurrence for the term in each document is included in the vector
Assigning Weights • tf x idf measure: • term frequency (tf) • inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution • Goal: assign a tf * idf weight to each term in each document
Inverse Document Frequency • IDF provides high values for rare words and low values for common words
tf x idf normalization • Normalize the term weights (so longer documents are not unfairly given more weight) • to normalize usually means to force all values to fall within a certain range, usually between 0 and 1 inclusive.
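Putting tf, idf, and normalization together, a toy sketch might look like this (the collection is invented, and idf = log(N/df) is assumed as one common convention among several):

```python
import math

# Toy collection; df = number of documents containing a term.
docs = [
    ["cat", "cat", "collar"],
    ["dog", "leash"],
    ["cat", "dog", "leash"],
]
N = len(docs)

def tf_idf_vector(doc):
    """tf * idf weights for one document, cosine-normalized to unit length."""
    terms = set(doc)
    df = {t: sum(1 for d in docs if t in d) for t in terms}
    tf = {t: doc.count(t) for t in terms}
    w = {t: tf[t] * math.log(N / df[t]) for t in terms}  # assumed: idf = log(N/df)
    length = math.sqrt(sum(x * x for x in w.values())) or 1.0
    return {t: x / length for t, x in w.items()}

v = tf_idf_vector(docs[0])
```

Note how "collar" (rare: df = 1) outweighs "cat" (common: df = 2) even though "cat" occurs twice in the document, which is exactly the Zipf-motivated behavior described above.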
Vector Space Similarity (use the weights to compare the documents)
Vector Space Similarity Measure • combine tf x idf weights into a similarity measure
Relevance Feedback • aka query modification • aka “more like this”
Query Modification • Problem: how to reformulate the query? • Thesaurus expansion: • Suggest terms similar to query terms • Relevance feedback: • Suggest terms (and documents) similar to retrieved documents that have been judged to be relevant
Relevance Feedback • Main Idea: • Modify existing query based on relevance judgements • Extract terms from relevant documents and add them to the query • and/or re-weight the terms already in the query • Two main approaches: • Automatic (pseudo-relevance feedback) • Users select relevant documents • Users/system select terms from an automatically-generated list
Relevance Feedback • Usually do both: • expand query with new terms • re-weight terms in query • There are many variations • usually positive weights for terms from relevant docs • sometimes negative weights for terms from non-relevant docs
Rocchio Method • Rocchio automatically • re-weights terms • adds in new terms (from relevant docs) • have to be careful when using negative terms • Rocchio is not a machine learning algorithm • Most methods perform similarly • results heavily dependent on test collection • Machine learning methods are proving to work better than standard IR approaches like Rocchio
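The standard Rocchio update, q' = αq + βμ(relevant) − γμ(non-relevant), can be sketched as follows (the α/β/γ values and example vectors are illustrative, not from the lecture):

```python
# Sketch of Rocchio re-weighting: boost terms from relevant documents,
# penalize terms from non-relevant ones. Vectors are dicts of term -> weight.
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    terms = set(query)
    for d in relevant + nonrelevant:
        terms |= set(d)
    new_q = {}
    for t in terms:
        rel = sum(d.get(t, 0.0) for d in relevant) / len(relevant) if relevant else 0.0
        non = sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant) if nonrelevant else 0.0
        w = alpha * query.get(t, 0.0) + beta * rel - gamma * non
        if w > 0:          # negative weights are usually dropped (be careful with them)
            new_q[t] = w
    return new_q

q = rocchio({"cracks": 1.0},
            relevant=[{"cracks": 0.5, "beams": 0.8}],
            nonrelevant=[{"bridges": 0.6}])
```

The call both re-weights "cracks" upward and adds the new term "beams" from the relevant document, while "bridges" (seen only in a non-relevant document) ends up negative and is dropped.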
Using Relevance Feedback • Known to improve results • in TREC-like conditions (no user involved) • What about with a user in the loop? • How might you measure this? • Let’s examine a user study of relevance feedback by Koenemann & Belkin 1996.
Content Analysis • Automated transformation of raw text into a form that represents some aspect(s) of its meaning • Including, but not limited to: • Automated Thesaurus Generation • Phrase Detection • Categorization • Clustering • Summarization
Techniques for Content Analysis • Statistical • Single Document • Full Collection • Linguistic • Syntactic • Semantic • Pragmatic • Knowledge-Based (Artificial Intelligence) • Hybrid (Combinations)
Text Processing • Standard Steps: • Recognize document structure • titles, sections, paragraphs, etc. • Break into tokens • usually space and punctuation delineated • special issues with Asian languages • Stemming/morphological analysis • Store in inverted index (to be discussed later)
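The tokenization and stemming steps above can be sketched as follows (the suffix-stripping stemmer is a crude stand-in for a real morphological analyzer such as the Porter stemmer):

```python
import re

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (space/punctuation delineated)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def stem(token):
    """Toy suffix stripping; a real system would use a proper stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = [stem(t) for t in tokenize("The beams cracked; measuring crack widths.")]
```

Note how "cracked" and "crack" conflate to the same stem, which is the point of stemming for retrieval; the price is non-words like "measur".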
Document Processing Steps Figure from Baeza-Yates & Ribeiro-Neto
Statistical Properties of Text • Token occurrences in text are not uniformly distributed • They are also not normally distributed • They do exhibit a Zipf distribution
Plotting Word Frequency by Rank • Main idea: count • How many tokens occur 1 time • How many tokens occur 2 times • How many tokens occur 3 times … • Now rank these according to how often they occur: this is called the rank.
Plotting Word Frequency by Rank • Say for a text with 100 tokens • Count • How many tokens occur 1 time (50) • How many tokens occur 2 times (20) … • How many tokens occur 7 times (10) … • How many tokens occur 12 times (1) • How many tokens occur 14 times (1) • So things that occur the most times have the highest rank (rank 1). • Things that occur the fewest times have the lowest rank (rank n).
Observation: MANY phenomena can be characterized this way. • Words in a text collection • Library book checkout patterns • Incoming Web Page Requests (Nielsen) • Outgoing Web Page Requests (Cunha & Crovella) • Document Size on Web (Cunha & Crovella)
Zipf Distribution(linear and log scale) Illustration by Jacob Nielsen
Zipf Distribution • The product of the frequency of words (f) and their rank (r) is approximately constant • Rank = order of words’ frequency of occurrence • Another way to state this is with an approximately correct rule of thumb: • Say the most common term occurs C times • The second most common occurs C/2 times • The third most common occurs C/3 times • …
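The rule of thumb can be checked on an idealized word-count table (the counts below are synthetic, constructed to be exactly Zipfian for illustration):

```python
from collections import Counter

# Frequency * rank should be roughly constant under Zipf's law.
# Synthetic counts: the most common word occurs C=600 times, the
# second C/2=300, the third C/3=200, and so on.
counts = Counter({"the": 600, "of": 300, "and": 200, "to": 150, "a": 120})
ranked = counts.most_common()          # sorted by frequency, most common first
products = [freq * rank for rank, (word, freq) in enumerate(ranked, start=1)]
# For this idealized data every product equals the constant 600.
```

On real text the products wobble rather than matching exactly, but they stay within a narrow band over several orders of magnitude of rank.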
Zipf Distribution • The Important Points: • a few elements occur veryfrequently • a medium number of elements have medium frequency • manyelements occur very infrequently
Word Frequency vs. Resolving Power (from van Rijsbergen 79) The most frequent words are not the most descriptive.
Consequences of Zipf • There are always a few very frequent tokens that are not good discriminators. • Called “stop words” in IR • Usually correspond to linguistic notion of “closed-class” words • English examples: to, from, on, and, the, ... • Grammatical classes that don’t take on new members. • There are always a large number of tokens that occur only once and can mess up algorithms. • Medium frequency words are the most descriptive.
Inverted Indexes We have seen “Vector files” conceptually. An Inverted File is a vector file “inverted” so that rows become columns and columns become rows
How Are Inverted Files Created • Documents are parsed to extract tokens; these are saved with the document ID. • Doc 1: “Now is the time for all good men to come to the aid of their country” • Doc 2: “It was a dark and stormy night in the country manor. The time was past midnight”
How Inverted Files are Created • After all documents have been parsed the inverted file is sorted alphabetically.
How Inverted Files are Created • Multiple term entries for a single document are merged. • Within-document term frequency information is compiled.
How Inverted Files are Created • Then the file can be split into • A Dictionary file and • A Postings file
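The whole pipeline above, parsing into (term, doc ID) pairs, sorting, merging duplicates into within-document frequencies, and splitting into dictionary and postings, can be sketched as follows (using the two example documents from the earlier slide):

```python
# Sketch of inverted-file construction from the two example documents.
docs = {
    1: "now is the time for all good men to come to the aid of their country",
    2: "it was a dark and stormy night in the country manor the time was past midnight",
}

# Step 1-2: parse into (term, doc_id) pairs and sort alphabetically.
pairs = sorted((term, doc_id) for doc_id, text in docs.items()
               for term in text.split())

# Step 3: merge multiple entries for a term/doc into a within-document frequency.
postings = {}
for term, doc_id in pairs:
    postings.setdefault(term, {})
    postings[term][doc_id] = postings[term].get(doc_id, 0) + 1

# Step 4: split into a "dictionary" (term -> document frequency) and
# "postings" (term -> {doc_id: within-document frequency}).
dictionary = {term: len(plist) for term, plist in postings.items()}
```

For example, "country" appears in both documents (document frequency 2), and "the" occurs twice within each document, so its postings carry a within-document frequency of 2 for both.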