This course covers the principles and techniques of information organization and retrieval systems, including storage, indexing, querying, and retrieval models. Topics include Boolean retrieval, vector space model, and probabilistic models.
SIMS 202: Information Organization and Retrieval Prof. Marti Hearst and Prof. Ray Larson UC Berkeley SIMS Tues/Thurs 9:30-11:00am Fall 2000
Midterm Review (Most slides taken from earlier lectures)
Structure of an IR System (adapted from Soergel, p. 19) • Storage line: documents & data are indexed (descriptive and subject) and stored as document representations (Store 2) • Search line: interest profiles & queries are formulated in terms of descriptors and stored as profiles/search requests (Store 1) • The two stores are compared/matched to yield potentially relevant documents • "Rules of the game" = rules for subject indexing + thesaurus (which consists of a lead-in vocabulary and an indexing language)
Search is an Iterative Process (diagram: the searcher cycles between goals, a workspace, and repositories)
Cognitive (Human) Aspects of Information Access and Retrieval • “Finding Out About” (FOA) • types of information needs • specifying information needs (queries) • the process of information access • search strategies • “sensemaking” • Relevance • Modeling the User
Retrieval Models • Boolean Retrieval • Ranked Retrieval • Vector Space Model • Probabilistic Models
Boolean Queries • (Cat OR Dog) AND (Collar OR Leash) • A combination of terms satisfies this statement when it includes at least one of {Cat, Dog} and at least one of {Collar, Leash}, e.g.: • Cat + Collar • Cat + Leash • Dog + Collar • Dog + Leash • Cat + Dog + Collar + Leash
Boolean Queries • (Cat OR Dog) AND (Collar OR Leash) • A combination fails when either clause is left entirely unmatched, e.g.: • Cat alone • Dog alone • Collar alone • Leash alone • Cat + Dog (no Collar or Leash) • Collar + Leash (no Cat or Dog)
Boolean Queries • Usually expressed as INFIX operators in IR • ((a AND b) OR (c AND b)) • NOT is a UNARY PREFIX operator • ((a AND b) OR (c AND (NOT b))) • AND and OR can be n-ary operators • (a AND b AND c AND d) • Some rules (De Morgan revisited): • NOT(a) AND NOT(b) = NOT(a OR b) • NOT(a) OR NOT(b) = NOT(a AND b) • NOT(NOT(a)) = a
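The Boolean combinations and De Morgan rules above can be sketched in a few lines of Python (a minimal sketch; the document collection and helper names below are invented for illustration):

```python
# Minimal sketch: evaluate (Cat OR Dog) AND (Collar OR Leash) against
# per-document term sets. The docs and names are illustrative, not from the lecture.
docs = {
    1: {"cat", "collar"},
    2: {"dog", "leash"},
    3: {"cat", "dog"},        # neither collar nor leash -> should not match
    4: {"collar", "leash"},   # neither cat nor dog -> should not match
}

def matches(terms):
    """(cat OR dog) AND (collar OR leash)"""
    return (("cat" in terms) or ("dog" in terms)) and \
           (("collar" in terms) or ("leash" in terms))

hits = sorted(doc_id for doc_id, terms in docs.items() if matches(terms))

# De Morgan check on this data: NOT(a) AND NOT(b) == NOT(a OR b)
demorgan_ok = all(
    (("cat" not in t) and ("dog" not in t)) == (not ("cat" in t or "dog" in t))
    for t in docs.values()
)
```

Only documents 1 and 2 satisfy both clauses, matching the satisfying/failing combinations listed above.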
Boolean Searching • “Measurement of the width of cracks in prestressed concrete beams” • Formal query: cracks AND beams AND width_measurement AND prestressed_concrete • Relaxed query, requiring any three of Cracks (C), Beams (B), Width measurement (W), Prestressed concrete (P): (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
Boolean Logic (Venn diagram: three terms t1, t2, t3 partition documents D1–D11 into eight minterms m1–m8, one for each combination of term presence/absence, from t1 AND NOT t2 AND NOT t3 through t1 AND t2 AND t3)
Precedence Ordering • In what order do we evaluate the components of the Boolean expression? • Parentheses are evaluated first • (a or b) and (c or d) • (a or (b and c) or d) • Usually start from the left and work right (in case of ties) • Usually (if there are no parentheses): • NOT before AND • AND before OR
Boolean Problems • Disjunctive (OR) queries lead to information overload • Conjunctive (AND) queries lead to reduced, and commonly empty, result sets • Conjunctive queries imply a reduction in Recall
Vector Space Model • Documents are represented as vectors in term space • Terms are usually stems • In the simplest case, documents are represented by binary vectors of terms • Queries are represented the same way as documents • Query and document weights are based on the length and direction of their vectors • A vector distance measure between the query and documents is used to rank retrieved documents
Vector Space with Term Weights and Cosine Matching • Di = (di1, wdi1; di2, wdi2; …; dit, wdit) • Q = (qi1, wqi1; qi2, wqi2; …; qit, wqit) • Example in a two-term space (Term A, Term B): Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7)
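The cosine match on this two-term example can be worked through directly (a minimal sketch; only the Q, D1, D2 values come from the slide):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# The two-term example from the slide:
Q  = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

sim1 = cosine(Q, D1)   # ~0.73
sim2 = cosine(Q, D2)   # ~0.98
```

D2 points in nearly the same direction as Q, so it ranks above D1 even though D1's vector is longer: cosine matching compares direction, not magnitude.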
Assigning Weights to Terms • Binary Weights • Raw term frequency • tf x idf • Recall the Zipf distribution • Want to weight terms highly if they are • frequent in relevant documents … BUT • infrequent in the collection as a whole • Automatically derived thesaurus terms
Binary Weights • Only the presence (1) or absence (0) of a term is included in the vector
Raw Term Weights • The frequency of occurrence for the term in each document is included in the vector
Assigning Weights • tf x idf measure: • term frequency (tf) • inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution • Goal: assign a tf * idf weight to each term in each document
Inverse Document Frequency • IDF provides high values for rare words and low values for common words
tf x idf normalization • Normalize the term weights (so longer documents are not unfairly given more weight) • to normalize usually means to force all values to fall within a certain range, usually between 0 and 1 inclusive.
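Putting tf, idf, and normalization together, a toy sketch might look like this (the collection is invented, and idf = log(N/df) is assumed as one common convention among several):

```python
import math

# Toy collection; df = number of documents containing a term.
docs = [
    ["cat", "cat", "collar"],
    ["dog", "leash"],
    ["cat", "dog", "leash"],
]
N = len(docs)

def tf_idf_vector(doc):
    """tf * idf weights for one document, cosine-normalized to unit length."""
    terms = set(doc)
    df = {t: sum(1 for d in docs if t in d) for t in terms}
    tf = {t: doc.count(t) for t in terms}
    w = {t: tf[t] * math.log(N / df[t]) for t in terms}  # assumed: idf = log(N/df)
    length = math.sqrt(sum(x * x for x in w.values())) or 1.0
    return {t: x / length for t, x in w.items()}

v = tf_idf_vector(docs[0])
```

Note how "collar" (rare: df = 1) outweighs "cat" (common: df = 2) even though "cat" occurs twice in the document, which is exactly the Zipf-motivated behavior described above.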
Vector Space Similarity (use the weights to compare the documents)
Vector Space Similarity Measure • combine tf x idf weights into a similarity measure
Relevance Feedback • aka query modification • aka “more like this”
Query Modification • Problem: how to reformulate the query? • Thesaurus expansion: • Suggest terms similar to query terms • Relevance feedback: • Suggest terms (and documents) similar to retrieved documents that have been judged to be relevant
Relevance Feedback • Main Idea: • Modify existing query based on relevance judgements • Extract terms from relevant documents and add them to the query • and/or re-weight the terms already in the query • Two main approaches: • Automatic (pseudo-relevance feedback) • Users select relevant documents • Users/system select terms from an automatically-generated list
Relevance Feedback • Usually do both: • expand query with new terms • re-weight terms in query • There are many variations • usually positive weights for terms from relevant docs • sometimes negative weights for terms from non-relevant docs
Rocchio Method • Rocchio automatically • re-weights terms • adds in new terms (from relevant docs) • have to be careful when using negative terms • Rocchio is not a machine learning algorithm • Most methods perform similarly • results heavily dependent on test collection • Machine learning methods are proving to work better than standard IR approaches like Rocchio
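The standard Rocchio update, q' = αq + βμ(relevant) − γμ(non-relevant), can be sketched as follows (the α/β/γ values and example vectors are illustrative, not from the lecture):

```python
# Sketch of Rocchio re-weighting: boost terms from relevant documents,
# penalize terms from non-relevant ones. Vectors are dicts of term -> weight.
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    terms = set(query)
    for d in relevant + nonrelevant:
        terms |= set(d)
    new_q = {}
    for t in terms:
        rel = sum(d.get(t, 0.0) for d in relevant) / len(relevant) if relevant else 0.0
        non = sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant) if nonrelevant else 0.0
        w = alpha * query.get(t, 0.0) + beta * rel - gamma * non
        if w > 0:          # negative weights are usually dropped (be careful with them)
            new_q[t] = w
    return new_q

q = rocchio({"cracks": 1.0},
            relevant=[{"cracks": 0.5, "beams": 0.8}],
            nonrelevant=[{"bridges": 0.6}])
```

The call both re-weights "cracks" upward and adds the new term "beams" from the relevant document, while "bridges" (seen only in a non-relevant document) ends up negative and is dropped.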
Using Relevance Feedback • Known to improve results • in TREC-like conditions (no user involved) • What about with a user in the loop? • How might you measure this? • Let’s examine a user study of relevance feedback by Koenemann & Belkin 1996.
Content Analysis • Automated transformation of raw text into a form that represents some aspect(s) of its meaning • Including, but not limited to: • Automated Thesaurus Generation • Phrase Detection • Categorization • Clustering • Summarization
Techniques for Content Analysis • Statistical • Single Document • Full Collection • Linguistic • Syntactic • Semantic • Pragmatic • Knowledge-Based (Artificial Intelligence) • Hybrid (Combinations)
Text Processing • Standard Steps: • Recognize document structure • titles, sections, paragraphs, etc. • Break into tokens • usually space and punctuation delineated • special issues with Asian languages • Stemming/morphological analysis • Store in inverted index (to be discussed later)
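The tokenization and stemming steps above can be sketched as follows (the suffix-stripping stemmer is a crude stand-in for a real morphological analyzer such as the Porter stemmer):

```python
import re

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (space/punctuation delineated)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def stem(token):
    """Toy suffix stripping; a real system would use a proper stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = [stem(t) for t in tokenize("The beams cracked; measuring crack widths.")]
```

Note how "cracked" and "crack" conflate to the same stem, which is the point of stemming for retrieval; the price is non-words like "measur".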
Document Processing Steps Figure from Baeza-Yates & Ribeiro-Neto
Statistical Properties of Text • Token occurrences in text are not uniformly distributed • They are also not normally distributed • They do exhibit a Zipf distribution
Plotting Word Frequency by Rank • Main idea: count • How many tokens occur 1 time • How many tokens occur 2 times • How many tokens occur 3 times … • Now rank these according to how often they occur: this is called the rank.
Plotting Word Frequency by Rank • Say for a text with 100 tokens • Count • How many tokens occur 1 time (50) • How many tokens occur 2 times (20) … • How many tokens occur 7 times (10) … • How many tokens occur 12 times (1) • How many tokens occur 14 times (1) • So things that occur the most times have the highest rank (rank 1). • Things that occur the fewest times have the lowest rank (rank n).
Observation: MANY phenomena can be characterized this way. • Words in a text collection • Library book checkout patterns • Incoming Web Page Requests (Nielsen) • Outgoing Web Page Requests (Cunha & Crovella) • Document Size on Web (Cunha & Crovella)
Zipf Distribution(linear and log scale) Illustration by Jacob Nielsen
Zipf Distribution • The product of the frequency of words (f) and their rank (r) is approximately constant • Rank = order of words’ frequency of occurrence • Another way to state this is with an approximately correct rule of thumb: • Say the most common term occurs C times • The second most common occurs C/2 times • The third most common occurs C/3 times • …
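The rule of thumb can be checked on an idealized word-count table (the counts below are synthetic, constructed to be exactly Zipfian for illustration):

```python
from collections import Counter

# Frequency * rank should be roughly constant under Zipf's law.
# Synthetic counts: the most common word occurs C=600 times, the
# second C/2=300, the third C/3=200, and so on.
counts = Counter({"the": 600, "of": 300, "and": 200, "to": 150, "a": 120})
ranked = counts.most_common()          # sorted by frequency, most common first
products = [freq * rank for rank, (word, freq) in enumerate(ranked, start=1)]
# For this idealized data every product equals the constant 600.
```

On real text the products wobble rather than matching exactly, but they stay within a narrow band over several orders of magnitude of rank.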
Zipf Distribution • The Important Points: • a few elements occur veryfrequently • a medium number of elements have medium frequency • manyelements occur very infrequently
Word Frequency vs. Resolving Power (from van Rijsbergen 79) The most frequent words are not the most descriptive.
Consequences of Zipf • There are always a few very frequent tokens that are not good discriminators. • Called “stop words” in IR • Usually correspond to linguistic notion of “closed-class” words • English examples: to, from, on, and, the, ... • Grammatical classes that don’t take on new members. • There are always a large number of tokens that occur only once and can mess up algorithms. • Medium frequency words are the most descriptive.
Inverted Indexes We have seen “Vector files” conceptually. An Inverted File is a vector file “inverted” so that rows become columns and columns become rows
How Are Inverted Files Created • Documents are parsed to extract tokens; these are saved with the document ID. • Doc 1: “Now is the time for all good men to come to the aid of their country” • Doc 2: “It was a dark and stormy night in the country manor. The time was past midnight”
How Inverted Files are Created • After all documents have been parsed the inverted file is sorted alphabetically.
How Inverted Files are Created • Multiple term entries for a single document are merged. • Within-document term frequency information is compiled.
How Inverted Files are Created • Then the file can be split into • A Dictionary file and • A Postings file
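The whole pipeline above, parsing into (term, doc ID) pairs, sorting, merging duplicates into within-document frequencies, and splitting into dictionary and postings, can be sketched as follows (using the two example documents from the earlier slide):

```python
# Sketch of inverted-file construction from the two example documents.
docs = {
    1: "now is the time for all good men to come to the aid of their country",
    2: "it was a dark and stormy night in the country manor the time was past midnight",
}

# Step 1-2: parse into (term, doc_id) pairs and sort alphabetically.
pairs = sorted((term, doc_id) for doc_id, text in docs.items()
               for term in text.split())

# Step 3: merge multiple entries for a term/doc into a within-document frequency.
postings = {}
for term, doc_id in pairs:
    postings.setdefault(term, {})
    postings[term][doc_id] = postings[term].get(doc_id, 0) + 1

# Step 4: split into a "dictionary" (term -> document frequency) and
# "postings" (term -> {doc_id: within-document frequency}).
dictionary = {term: len(plist) for term, plist in postings.items()}
```

For example, "country" appears in both documents (document frequency 2), and "the" occurs twice within each document, so its postings carry a within-document frequency of 2 for both.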