Vector and Probabilistic Ranking Ray Larson & Marti Hearst University of California, Berkeley School of Information Management and Systems SIMS 202: Information Organization and Retrieval
Review • Inverted files • The Vector Space Model • Term weighting
Inverted indexes • Permit fast search for individual terms • For each term, you get a list consisting of: • document ID • frequency of term in doc (optional) • position of term in doc (optional) • These lists can be used to solve Boolean queries: • country -> d1, d2 • manor -> d2 • country AND manor -> d2 • Also used for statistical ranking algorithms
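A minimal Python sketch of the idea above, intersecting posting lists to answer a Boolean AND query; the document texts are invented for the example:

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to {doc_id: term frequency in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

docs = {
    "d1": "a tour of the country",   # hypothetical documents
    "d2": "a manor in the country",
}
index = build_index(docs)

# Boolean AND = intersect the posting lists of the query terms.
hits = set(index["country"]) & set(index["manor"])
print(sorted(hits))  # ['d2']
```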
How Inverted Files are Used Query on “time” AND “dark” 2 docs with “time” in dictionary -> IDs 1 and 2 from posting file 1 doc with “dark” in dictionary -> ID 2 from posting file Therefore, only doc 2 satisfies the query. [figure: dictionary and postings file]
Vector Space Model • Documents are represented as vectors in term space • Terms are usually stems • Documents represented by binary vectors of terms • Queries represented the same as documents • Query and Document weights are based on length and direction of their vector • A vector distance measure between the query and documents is used to rank retrieved documents
Documents in 3D Space Assumption: Documents that are “close together” in space are similar in meaning.
Vector Space Documents and Queries [figure: documents D1–D11 and a query vector Q plotted on term axes t1, t2, t3; Boolean term combinations shown as regions] Q is a query – also represented as a vector
Documents in Vector Space [figure: documents D1–D11 plotted on term axes t1, t2, t3]
Assigning Weights to Terms • Binary Weights • Raw term frequency • tf x idf • Recall the Zipf distribution • Want to weight terms highly if they are • frequent in relevant documents … BUT • infrequent in the collection as a whole • Automatically derived thesaurus terms
Binary Weights • Only the presence (1) or absence (0) of a term is included in the vector
Raw Term Weights • The frequency of occurrence for the term in each document is included in the vector
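A short sketch contrasting the two schemes just described; the vocabulary and counts are made up:

```python
# Hypothetical vocabulary and term counts for a single document.
vocab = ["factory", "manor", "country"]
counts = {"country": 3, "manor": 1}

raw_tf = [counts.get(t, 0) for t in vocab]                  # [0, 1, 3]
binary = [1 if counts.get(t, 0) > 0 else 0 for t in vocab]  # [0, 1, 1]
```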
Assigning Weights • tf x idf measure: • term frequency (tf) • inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution • Goal: assign a tf x idf weight to each term in each document
tf x idf • w_ik = tf_ik · log(N / n_k), where tf_ik = frequency of term k in document i, N = total number of documents in the collection, and n_k = number of documents containing term k
Inverse Document Frequency • IDF provides high values for rare words and low values for common words • For a collection of 10,000 documents: a term appearing in only 1 document gets idf = log(10000/1) = 4, one in 100 documents gets idf = 2, and one in all 10,000 gets idf = 0 (using base-10 logs)
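A sketch of the idf values for that 10,000-document collection; base-10 logs are an assumption here (any base preserves the ranking):

```python
import math

N = 10_000  # collection size from the slide's example

def idf(n_k, base=10):
    # n_k = number of documents containing the term
    return math.log(N / n_k, base)

for n_k in (1, 100, 10_000):
    print(f"{n_k:>6} docs -> idf = {idf(n_k):.1f}")
# 1 doc -> 4.0 (rare term, high weight); 100 -> 2.0; 10000 -> 0.0
```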
Similarity Measures • Simple matching (coordination level match) • Dice’s Coefficient • Jaccard’s Coefficient • Cosine Coefficient • Overlap Coefficient
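For binary (presence/absence) vectors, these coefficients reduce to simple set operations; a sketch with invented query and document term sets (weighted variants replace set sizes with sums of weights):

```python
def simple_match(a, b):          # coordination level match
    return len(a & b)

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a, b):
    return len(a & b) / len(a | b)

def cosine(a, b):
    return len(a & b) / (len(a) ** 0.5 * len(b) ** 0.5)

def overlap(a, b):
    return len(a & b) / min(len(a), len(b))

q = {"time", "dark"}                  # hypothetical query terms
d = {"time", "dark", "side", "moon"}  # hypothetical document terms
print(jaccard(q, d), round(cosine(q, d), 2))  # 0.5 0.71
```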
Vector Space Visualization
Text Clustering Clustering is “The art of finding groups in data.” -- Kaufman and Rousseeuw [figure: documents grouped in a two-term space (Term 1 vs. Term 2)]
K-Means Clustering • 1 Create a pair-wise similarity measure • 2 Find K centers using agglomerative clustering • take a small sample • group bottom up until K groups found • 3 Assign each document to nearest center, forming new clusters • 4 Repeat 3 as necessary
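A sketch of steps 3–4 (assign and recompute); for brevity the agglomerative seeding of step 2 is replaced by simply taking the first K documents as initial centers:

```python
def dist(u, v):
    """Euclidean distance between two document vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def kmeans(vectors, k, iters=10):
    centers = [list(v) for v in vectors[:k]]  # crude seeding (step 2 simplified)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:                     # step 3: assign to nearest center
            nearest = min(range(k), key=lambda c: dist(v, centers[c]))
            clusters[nearest].append(v)
        for i, members in enumerate(clusters):  # recompute each center as the mean
            if members:
                centers[i] = [sum(xs) / len(members) for xs in zip(*members)]
    return clusters
```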
Scatter/Gather Cutting, Pedersen, Tukey & Karger 92, 93 Hearst & Pedersen 95 • Cluster sets of documents into general “themes”, like a table of contents • Display the contents of the clusters by showing topical terms and typical titles • User chooses subsets of the clusters and re-clusters the documents within • Resulting new groups have different “themes”
S/G Example: query on “star” Encyclopedia text 14 sports 8 symbols 47 film, tv 68 film, tv (p) 7 music 97 astrophysics 67 astronomy (p) 12 stellar phenomena 10 flora/fauna 49 galaxies, stars 29 constellations 7 miscellaneous Clustering and re-clustering is entirely automated
Another use of clustering • Use clustering to map the entire huge multidimensional document space into a large number of small clusters • “Project” these onto a 2D graphical representation:
Clustering Multi-Dimensional Document Space (image from Wise et al. 95)
Concept “Landscapes” [figure: landscape regions labeled Disease, Pharmacology, Anatomy, Legal, Hospitals] • (e.g., Lin, Chen, Wise et al.) • Too many concepts, or too coarse • Single concept per document • No titles • Browsing without search
Clustering • Advantages: • See some main themes • Disadvantage: • Many ways documents could group together are hidden • Thinking point: what is the relationship to classification systems and facets?
Today • Vector Space Ranking • Probabilistic Models and Ranking • (lots of math)
tf x idf normalization • Normalize the term weights (so longer documents are not unfairly given more weight) • normalize usually means force all values to fall within a certain range, usually between 0 and 1, inclusive.
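Combining this with the tf x idf weight gives the usual cosine-normalized form; this is the standard textbook reconstruction, since the slide's own formula image is not preserved:

```latex
w_{ik} = \frac{tf_{ik}\,\log(N/n_k)}
             {\sqrt{\sum_{k=1}^{t}\left(tf_{ik}\right)^{2}\left(\log(N/n_k)\right)^{2}}}
```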
Vector space similarity (use the weights to compare the documents): sim(D_i, D_j) = Σ_k w_ik · w_jk (the inner product of the two weight vectors)
Vector Space Similarity Measure: combine tf x idf into a similarity measure. With the cosine measure, sim(Q, D_i) = Σ_j w_qj · w_ij / (√(Σ_j w_qj²) · √(Σ_j w_ij²))
To Think About • How does this ranking algorithm behave? • Make a set of hypothetical documents consisting of terms and their weights • Create some hypothetical queries • How are the documents ranked, depending on the weights of their terms and the queries’ terms?
Computing Similarity Scores [figure: query and document vectors plotted on axes scaled 0 to 1.0]
Computing a similarity score For example, with Q = (0.4, 0.8) and D2 = (0.2, 0.7): sim(Q, D2) = (0.4·0.2 + 0.8·0.7) / (√(0.4² + 0.8²) · √(0.2² + 0.7²)) = 0.64 / (0.894 · 0.728) ≈ 0.98
Vector Space with Term Weights and Cosine Matching Di = (di1, wdi1; di2, wdi2; …; dit, wdit) Q = (qi1, wqi1; qi2, wqi2; …; qit, wqit) [figure: Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7) plotted on Term A / Term B axes]
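A quick check of the cosine match using the vectors on this slide:

```python
def cos_sim(q, d):
    dot = sum(a * b for a, b in zip(q, d))
    length = lambda v: sum(x * x for x in v) ** 0.5
    return dot / (length(q) * length(d))

Q, D1, D2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
print(round(cos_sim(Q, D1), 2))  # 0.73
print(round(cos_sim(Q, D2), 2))  # 0.98 -> D2 ranks above D1
```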
Weighting schemes • We have seen something of • Binary • Raw term weights • TF*IDF • There are many other possibilities • IDF alone • Normalized term frequency
Term Weights in SMART • SMART is an experimental IR system developed by Gerard Salton (and continued by Chris Buckley) at Cornell. • Designed for laboratory experiments in IR • Easy to mix and match different weighting methods • Really terrible user interface • Intended for use by code hackers (and even they have trouble using it)
Term Weights in SMART • In SMART, weights are decomposed into three factors: weight = (term frequency component) × (collection weighting component) ÷ (normalization component)
SMART Freq Components • binary: 1 if tf_ik > 0, else 0 • maxnorm: tf_ik / max_k tf_ik • augmented: 0.5 + 0.5 · (tf_ik / max_k tf_ik) • log: ln(tf_ik) + 1
Collection Weighting in SMART • inverse: log(N / n_k) • squared: (log(N / n_k))² • probabilistic: log((N − n_k) / n_k) • frequency: N / n_k
Term Normalization in SMART (each weight is divided by the chosen factor) • sum: Σ_k w_ik • cosine: √(Σ_k w_ik²) • fourth: Σ_k w_ik⁴ • max: max_k w_ik
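Putting the three factors together, here is one plausible combination (augmented tf, inverse collection weight, cosine normalization). This illustrates the decomposition, not SMART's actual code:

```python
import math

def smart_style_weights(tfs, dfs, N):
    """tfs: term -> freq in one doc; dfs: term -> doc freq; N: collection size."""
    max_tf = max(tfs.values())
    # augmented term frequency  x  inverse collection weight
    w = {t: (0.5 + 0.5 * f / max_tf) * math.log(N / dfs[t])
         for t, f in tfs.items()}
    norm = math.sqrt(sum(x * x for x in w.values()))  # cosine normalization
    return {t: x / norm for t, x in w.items()}
```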
Problems with Vector Space • There is no real theoretical basis for the assumption of a term space • it is more for visualization than having any real basis • most similarity measures work about the same regardless of model • Terms are not really orthogonal dimensions • Terms are not independent of all other terms
Probabilistic Models • Rigorous formal models that attempt to predict the probability that a given document will be relevant to a given query • Rank retrieved documents according to this probability of relevance (the Probability Ranking Principle) • Rely on accurate estimates of probabilities
Probability Ranking Principle • If a reference retrieval system’s response to each request is a ranking of the documents in the collections in the order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data. Stephen E. Robertson, J. Documentation 1977
Model 1 – Maron and Kuhns • Concerned with estimating probabilities of relevance at the point of indexing: • If a patron came with a request using term ti, what is the probability that she/he would be satisfied with document Dj?
Bayes Formula • Bayesian statistical inference used in both models: P(A|B) = P(B|A) · P(A) / P(B)
Model 1 Bayes • A is the class of events of using the library • Di is the class of events of Document i being judged relevant • Ij is the class of queries consisting of the single term Ij • P(Di|A,Ij) = probability that if a query is submitted to the system then a relevant document is retrieved
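Applying Bayes' formula to these event classes presumably gives the following; this is a reconstruction, since the slide's own equation image is not preserved:

```latex
P(D_i \mid A, I_j) \;=\; \frac{P(I_j \mid A, D_i)\, P(D_i \mid A)}{P(I_j \mid A)}
```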
Model 2 – Robertson & Sparck Jones Given a term t and a query q:

                   Relevant     Not relevant      Total
  Term present        r            n - r            n
  Term absent       R - r      N - n - R + r      N - n
  Total               R            N - R            N

(rows: document indexing by t; columns: document relevance to q)
Robertson-Sparck Jones Weights • Retrospective formulation: w = log [ (r / (R − r)) / ((n − r) / (N − n − R + r)) ]
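A sketch of that retrospective weight computed from the table's counts (r, n, R, N as defined above; the common +0.5 smoothing against zero cells is omitted to match the retrospective form):

```python
import math

def rsj_weight(r, n, R, N):
    """Retrospective Robertson-Sparck Jones relevance weight for one term."""
    return math.log((r / (R - r)) / ((n - r) / (N - n - R + r)))
```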