CS533 Information Retrieval Dr. Michal Cutler Lecture #19 April 11, 2000
Clustering using existing clusters • Start with an initial set of clusters • Compute their centroids • Compare every item to the centroids • Move each item to the most similar cluster • Update the centroids • The process ends when every item is in its most similar cluster
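The reassignment loop above can be sketched as follows. This is a minimal illustration, not the lecture's exact algorithm: the function names, the cosine measure, and the guard that keeps clusters non-empty are assumptions.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    # Component-wise mean of a non-empty list of vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def reassign(items, clusters):
    """Repeatedly move each item (by index) to the cluster whose centroid
    is most similar; stop when a full pass makes no move.
    `clusters` is a list of lists of item indices."""
    while True:
        cents = [centroid([items[i] for i in c]) for c in clusters]
        moved = False
        for i in range(len(items)):
            cur = next(k for k, c in enumerate(clusters) if i in c)
            best = max(range(len(clusters)),
                       key=lambda k: cosine(items[i], cents[k]))
            if best != cur and len(clusters[cur]) > 1:  # keep clusters non-empty
                clusters[cur].remove(i)
                clusters[best].append(i)
                moved = True
        if not moved:
            return clusters
```

Centroids are recomputed once per pass, so a few passes may be needed before the assignment stabilizes.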
Example [diagram: cluster assignments before and after reassignment]
Heuristic clustering methods • No similarity matrix is generated • The resulting clusters depend on the order in which items are processed • Inferior clusters, but much faster • Can be used for incremental clustering
One-pass assignments • Item 1 is placed in a cluster by itself • Each subsequent item is compared against all existing clusters (initially only against item 1) • It is placed into an existing cluster if it is similar enough • Otherwise it starts a new cluster
Heuristic clustering methods • Use the similarity between an item and the centroids of existing clusters • When an item is added to a cluster, the cluster's centroid is updated
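A one-pass assignment with centroid updates can be sketched as below. The function name, the dictionary-based cluster representation, and the threshold parameter are illustrative assumptions.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def one_pass_cluster(items, threshold):
    """Single-pass clustering: each item joins the most similar existing
    cluster (compared against its centroid) if the similarity reaches the
    threshold; otherwise it starts a new cluster."""
    clusters = []  # each cluster: {"members": [...], "centroid": [...]}
    for item in items:
        best, best_sim = None, -1.0
        for c in clusters:
            s = cosine(item, c["centroid"])
            if s > best_sim:
                best, best_sim = c, s
        if best is not None and best_sim >= threshold:
            best["members"].append(item)
            n = len(best["members"])  # update the centroid incrementally
            best["centroid"] = [sum(col) / n for col in zip(*best["members"])]
        else:
            clusters.append({"members": [item], "centroid": list(item)})
    return clusters
```

Because each item is compared only against current centroids, the result depends on input order, as the slides note.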
Buckshot Clustering • Goal: reasonably good k clusters in O(kn) time (k a constant) • A random sample of about √(kn) documents is clustered into k clusters in O(kn) time • The remaining documents are added to the “best” of the k clusters in O(kn) time
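A rough sketch of the buckshot idea, under stated assumptions: the sample is clustered here with a simple greedy agglomerative merge (the original algorithm uses group-average agglomerative clustering), and the remaining documents go to the nearest centroid. Names and details are illustrative.

```python
import math
import random

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def buckshot(docs, k, seed=0):
    """Buckshot sketch: cluster a random sample of ~sqrt(k*n) documents
    with a quadratic merge loop, then assign every remaining document
    to the most similar resulting centroid."""
    rng = random.Random(seed)
    n = len(docs)
    m = min(n, max(k, int(math.sqrt(k * n))))
    sample = set(rng.sample(range(n), m))
    clusters = [[i] for i in sample]  # singleton clusters over the sample

    def cent(c):
        vs = [docs[i] for i in c]
        return [sum(col) / len(vs) for col in zip(*vs)]

    while len(clusters) > k:  # merge the two most similar clusters
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = cosine(cent(clusters[a]), cent(clusters[b]))
                if best is None or s > best[0]:
                    best = (s, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)

    cents = [cent(c) for c in clusters]
    for i in range(n):  # add the rest to the "best" cluster
        if i not in sample:
            j = max(range(k), key=lambda t: cosine(docs[i], cents[t]))
            clusters[j].append(i)
    return clusters
```

The expensive pairwise work is confined to the small sample, which is what keeps the overall cost near O(kn).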
Clustering with truncated document vectors • The most expensive step in incremental clustering is computing the cosine similarity between a new document and all clusters • Clustering can be done successfully with vectors that contain only a few terms • This happens naturally when latent semantic indexing is used • An alternative is to discard terms whose weights fall below a threshold
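The threshold alternative can be illustrated with a small helper for sparse term-weight vectors; the function name and parameters are hypothetical.

```python
def truncate(weights, min_weight=None, top_k=None):
    """Truncate a sparse document vector (term -> weight dict):
    drop terms whose weight is below `min_weight`, and/or keep only
    the `top_k` heaviest terms."""
    items = list(weights.items())
    if min_weight is not None:
        items = [(t, w) for t, w in items if w >= min_weight]
    if top_k is not None:
        items = sorted(items, key=lambda tw: tw[1], reverse=True)[:top_k]
    return dict(items)
```

Shorter vectors make every subsequent document/centroid comparison proportionally cheaper.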
Cluster based retrieval • Given a cluster hierarchy for a collection • A cluster is selected for retrieval when the query is similar to its centroid • The search can be done either top-down or bottom-up
Cluster Retrieval Issues • A cluster is selected for retrieval when: • The similarity between query and centroid is above a threshold • Who decides on the threshold? • What should it be? • Instead of a threshold, a user may limit the number of documents retrieved to n
Cluster Retrieval Issues • Should all the documents in a selected cluster be retrieved? • If yes, how will they be ranked? • Should each document in a selected cluster be compared to the query?
Cluster based retrieval • Advantages • Relevant documents that do not contain query terms may be retrieved • Retrieval can be fast if only centroids are compared to the query
Cluster based retrieval • Disadvantages • When a whole cluster is returned, precision may be low • Clusters may contain relevant documents even when their centroid is not similar to the query
Cluster based retrieval • Disadvantages • When clusters are large, too many documents may be retrieved • Comparing each document in a selected cluster to the query is time-consuming
Bottom up search • The query is compared to each of the low-level centroids (i.e., those clusters that contain documents directly) • First, the best n low-level clusters are selected • Then each document in these clusters is compared to the query, and the best n documents are chosen
Bottom up search [tree diagram: a cluster hierarchy over documents A–N; numbers in parentheses are query similarities] With n = 3, the 3 best low-level clusters are 8 (.8), 4 (.7) and 5 (.6); the 3 best documents within them are I (.9), L (.8) and A (.8)
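The bottom-up procedure can be sketched as follows. The data layout (dictionaries of precomputed similarities) and the sample values in the usage below are illustrative assumptions, not the slide's exact tree.

```python
def bottom_up(cluster_sim, doc_sim, cluster_docs, n):
    """cluster_sim: low-level cluster -> similarity of query to its centroid.
    cluster_docs: cluster -> list of documents it contains.
    doc_sim: document -> similarity to the query.
    Select the n best low-level clusters, then return the n best
    documents found inside them."""
    best_clusters = sorted(cluster_sim, key=cluster_sim.get, reverse=True)[:n]
    candidates = {d for c in best_clusters for d in cluster_docs[c]}
    return sorted(candidates, key=lambda d: doc_sim[d], reverse=True)[:n]
```

Only documents inside the selected clusters are ever compared to the query, which is the source of both the speed-up and the risk of missing relevant documents elsewhere.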
Top down search • Best-first search: clusters reached are put into a priority queue • Search until a cluster with at most n documents is reached • All documents in that cluster are retrieved • If more documents are needed, the best-first search continues
Top down search [tree diagram: root cluster 1 (.2) over the same hierarchy; numbers in parentheses are query similarities] Cluster 1 is too big; the queue holds 4 (.7), 2 (.5), 3 (0). Cluster 4 is too big; the queue holds 8 (.8), 2 (.5), 9 (.3), 3 (0). Documents I and J are retrieved, then A and B.
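The best-first descent can be sketched with a priority queue. The hierarchy encoding and the sample tree in the test are illustrative assumptions, not the slide's exact example.

```python
import heapq

def top_down(root, children, docs_in, sim, n):
    """Best-first search down a cluster hierarchy.
    children[c]: sub-clusters of cluster c.
    docs_in[c]: all documents under c.  sim[c]: query/centroid similarity.
    Expand the most similar queued cluster; when one holds at most n
    documents, retrieve it whole; continue if more documents are needed."""
    retrieved = []
    heap = [(-sim[root], root)]            # max-queue via negated similarity
    while heap and len(retrieved) < n:
        _, c = heapq.heappop(heap)
        if len(docs_in[c]) <= n:
            retrieved.extend(docs_in[c])   # small enough: retrieve the cluster
        else:
            for ch in children.get(c, []): # too big: queue its children
                heapq.heappush(heap, (-sim[ch], ch))
    return retrieved
```

As in the slide's walkthrough, oversized clusters are repeatedly expanded until a small, highly similar cluster surfaces at the head of the queue.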
Thesaurus • A general thesaurus • Domain specific thesauri • Usage in IR • Building a word-based thesaurus
A general thesaurus • Contains synonyms and antonyms • Distinguishes different word senses • Sometimes includes broader terms • Many domain-specific terms are not included (C++, OS/2, RAM, etc.)
A general thesaurus • Roget http://humanities.uchicago.edu/forms_unrest/ROGET.html • Word based • Provides related terms, and a broader term
Roget search for “car” • Vehicle (broader term) • car, auto, jalopy, clunker, lemon, flivver, coupe, sedan, two-door sedan, four-door sedan, luxury sedan; wheels [coll.], sports • car, roadster, gran turismo[It], jeep, four-wheel drive vehicle, electric • car, ...
A general thesaurus • http://dictionary.langenberg.com/ - Chicago thesaurus • WordNet - a lexical database for English • http://www.cogsci.princeton.edu/~wn • car - noun has 5 senses in WordNet • car - auto, automobile, machine, motorcar • car - railcar...
WordNet Evaluation • "Natural language processing is essential for dealing efficiently with the large quantities of text now available online: fact extraction and summarization, automated indexing and text categorization, and machine translation.”
WordNet Evaluation • “Another essential function is helping the user with query formulation through synonym relationships between words and hierarchical and other relationships between concepts. WordNet supports both of these functions and thus deserves careful study by the digital library community”
Domain specific thesauri • Keywords are usually phrases • Medical thesaurus • http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed. • Law based thesaurus, etc.
A thesaurus for IR • Contains relations between concepts such as broader, narrower, related, etc. • Usually concepts form hierarchies • Thesaurus can be on-line or off-line • Library of Congress, for example, provides a thesaurus to determine search subjects and keywords
A thesaurus for IR - example
computer aided instruction
see also: education
UF (used for): teaching machines
BT (broader term): educational computing
NT (narrower term):
TT (top term): computer application
RT (related terms): education, teaching
How it is used • Manual indexing - term selection • Automatic indexing - set of synonyms represented by one term • Query formulation - helps users select query terms (important in controlled vocabulary)
How it is used • Query expansion - suggests terms to user • Broaden or narrow a query depending on retrieval results • Automatic query expansion
Building a thesaurus • Manual • Automatic: word based or phrase based
Manual generation • Domain experts • Information experts • Keywords selected • Hierarchy built • Expensive, time consuming and subjective
Automatically built word-based thesauri • Salton • Pedersen • Crouch
Corpus-based word-based thesaurus (S) • (Salton 1971) • Main idea: When ti and tj often co-occur in the same documents, they are related. • The terms “trial”, “defendant”, “prosecution” and “judge” will tend to co-occur in the same documents
Corpus-based word-based thesaurus (S) • Uses the term/document weight matrix W • wi,k is the weight assigned to term i in document k • Computes a matrix T • The element in the ith row and jth column of T is the “relation” of term i to term j
Corpus-based word-based thesaurus (S) • T(ti, tj) = Σk wi,k · wj,k / Σk wi,k (a reconstruction of the slide's formula, consistent with the properties noted below) • N is the number of documents; the sums run over k = 1, ..., N
Corpus-based word-based thesaurus (S) • Note that the numerator is the same for both T(ti, tj) and T(tj, ti) • The denominator, however, uses the total weight of term i in T(ti, tj), and the total weight of term j in T(tj, ti) • 0 - when the terms never co-occur • 1 - when the co-occurrence vectors are equal
Corpus-based word-based thesaurus (S) • The values in the matrix are used to distinguish between, • broad, • narrow and • related terms
Corpus-based word-based thesaurus (S) • The system considers ti and tj related when both T(ti, tj) and T(tj, ti) are at least K • K is a similarity threshold determined experimentally
Corpus-based word-based thesaurus (S) • Broad terms occur in documents more often than narrower ones • ex. “house” vs. “cottage”; “language” vs. “French” • The system considers ti broader than tj when T(tj, ti) >= K and T(ti, tj) < K
Example (K = 1/4): t3 and t4 are related; t1 is broader than t2 [the weight matrix for this example appeared on the slide]
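The construction of T and the related/broader rules can be sketched as below. The formula used is the reconstruction above (shared numerator, normalized by term i's total weight), so treat it as an assumption; function names are hypothetical, and the weight matrix in the test is a fresh example, not the slide's.

```python
def term_term_similarity(W):
    """W[i][k]: weight of term i in document k.
    Returns T with T[i][j] = sum_k W[i][k]*W[j][k] / sum_k W[i][k]
    (assumed form: shared numerator, normalized by total weight of i)."""
    t = len(W)
    T = [[0.0] * t for _ in range(t)]
    for i in range(t):
        total_i = sum(W[i])
        for j in range(t):
            if i != j and total_i:
                T[i][j] = sum(wi * wj for wi, wj in zip(W[i], W[j])) / total_i
    return T

def classify(T, i, j, K):
    """The slide's rules: related if both T[i][j] and T[j][i] >= K;
    i broader than j if T[j][i] >= K and T[i][j] < K."""
    if T[i][j] >= K and T[j][i] >= K:
        return "related"
    if T[j][i] >= K and T[i][j] < K:
        return f"{i} broader than {j}"
    if T[i][j] >= K and T[j][i] < K:
        return f"{j} broader than {i}"
    return "unrelated"
```

With binary weights, a term occurring in many documents dilutes its own normalized score against a rarer co-occurring term, which is exactly how the broader/narrower asymmetry arises.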
Corpus-based word-based thesaurus (S) • In that case T(tj, ti) is large while T(ti, tj) is small: the terms co-occur often, but ti has a much larger normalizing factor (total weight) because it occurs in many more documents
Drawback (S) • Terms found related by this method may not be semantically related • Other term/term similarity functions have also been used (for example, the inner product) to evaluate how related two terms are
Full text collections • Co-occurrence may not be meaningful in large documents that cover many topics • Use co-occurrence within a document “window” instead of the whole document
Full text collections • The idea is that co-occurrences of terms should be in a small part of the text • A “window” may be a few paragraphs or a “chunk” of q terms
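Counting co-occurrences within chunks of q terms can be sketched as follows. Non-overlapping chunks are one simple reading of a “window”; a sliding window is an equally valid variant, and the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations

def window_cooccurrence(tokens, q):
    """Count co-occurrences of distinct term pairs within
    non-overlapping chunks of q consecutive tokens."""
    counts = Counter()
    for start in range(0, len(tokens), q):
        chunk = set(tokens[start:start + q])  # distinct terms in this chunk
        for a, b in combinations(sorted(chunk), 2):
            counts[(a, b)] += 1               # pair keys kept in sorted order
    return counts
```

Pairs that never share a chunk simply get no entry, so topic drift across a long document no longer inflates the counts.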
Co-occurrence-based Thesaurus (P) • (Schutze and Pedersen 1997) • Another idea: words with similar meanings co-occur with similar neighbors • "litigation" and "lawsuit" share neighbors such as "court", “judge", “witness” and "proceedings"
Co-occurrence-based Thesaurus (P) • Matrix A is computed for terms that occur 2,000–5,000 times • Ai,j = the number of times words i and j co-occur in the collection within windows of size k = 40 • These terms are clustered into 200 A-classes (average-based clustering)
Co-occurrence-based Thesaurus (P) • A matrix B (200 × 20,000) is generated for the 20,000 most frequent terms, based on their co-occurrence with the A-class clusters • Assume two A-classes gA1 = {t1, t2, t3} and gA2 = {t4, t5} • If term j co-occurs with t1 10 times, t2 5 times and t4 6 times, then B[1, j] = 15 and B[2, j] = 6 • The 20,000 terms are then clustered into 200 B-classes (buckshot)
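The construction of B from the A-classes can be sketched directly from the worked example; the dictionary-based representation and function name are assumptions.

```python
def class_cooccurrence(a_classes, term_cooc):
    """a_classes: list of term sets (the A-classes).
    term_cooc[j][t]: how often term j co-occurs with term t.
    Returns B where B[(g, j)] sums term j's co-occurrences with the
    members of A-class g, as in the slide's construction of matrix B."""
    B = {}
    for g, members in enumerate(a_classes):
        for j in term_cooc:
            B[(g, j)] = sum(term_cooc[j].get(t, 0) for t in members)
    return B
```

Collapsing individual terms into 200 class columns is what makes clustering the 20,000 frequent terms tractable.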
Co-occurrence-based Thesaurus (P) • Now matrix C is formed for all terms • The entry C[i, j] is the number of times term j co-occurs with B-class i • SVD is then applied to the 200 × t matrix • A document is represented by a vector that is the sum of the context vectors of its terms (columns in the SVD)