CS533 Information Retrieval Dr. Michal Cutler Lecture #19 April 11, 2000
Clustering using existing clusters • Start with an initial set of clusters • Compute their centroids • Compare every item to the centroids • Move each item to the most similar cluster • Update the centroids • The process ends when every item is in its most similar cluster
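The reassignment loop above can be sketched as follows. This is a minimal illustration, not the lecture's exact algorithm: the function names, the cosine measure, and the guard that keeps clusters non-empty are assumptions.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    # Component-wise mean of a non-empty list of vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def reassign(items, clusters):
    """Repeatedly move each item (by index) to the cluster whose centroid
    is most similar; stop when a full pass makes no move.
    `clusters` is a list of lists of item indices."""
    while True:
        cents = [centroid([items[i] for i in c]) for c in clusters]
        moved = False
        for i in range(len(items)):
            cur = next(k for k, c in enumerate(clusters) if i in c)
            best = max(range(len(clusters)),
                       key=lambda k: cosine(items[i], cents[k]))
            if best != cur and len(clusters[cur]) > 1:  # keep clusters non-empty
                clusters[cur].remove(i)
                clusters[best].append(i)
                moved = True
        if not moved:
            return clusters
```

Centroids are recomputed once per pass, so a few passes may be needed before the assignment stabilizes.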
Example [diagram: cluster assignments before and after reassignment]
Heuristic clustering methods • No similarity matrix is generated • The resulting clusters depend on the order in which items are processed • Inferior clusters, but much faster • Can be used for incremental clustering
One-pass assignments • Item 1 is placed in a cluster by itself • Each subsequent item is compared against all existing clusters (initially only against item 1) • It is placed into an existing cluster if it is similar enough • Otherwise it starts a new cluster
Heuristic clustering methods • Use the similarity between an item and the centroids of existing clusters • When an item is added to a cluster, the cluster's centroid is updated
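A one-pass assignment with centroid updates can be sketched as below. The function name, the dictionary-based cluster representation, and the threshold parameter are illustrative assumptions.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def one_pass_cluster(items, threshold):
    """Single-pass clustering: each item joins the most similar existing
    cluster (compared against its centroid) if the similarity reaches the
    threshold; otherwise it starts a new cluster."""
    clusters = []  # each cluster: {"members": [...], "centroid": [...]}
    for item in items:
        best, best_sim = None, -1.0
        for c in clusters:
            s = cosine(item, c["centroid"])
            if s > best_sim:
                best, best_sim = c, s
        if best is not None and best_sim >= threshold:
            best["members"].append(item)
            n = len(best["members"])  # update the centroid incrementally
            best["centroid"] = [sum(col) / n for col in zip(*best["members"])]
        else:
            clusters.append({"members": [item], "centroid": list(item)})
    return clusters
```

Because each item is compared only against current centroids, the result depends on input order, as the slides note.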
Buckshot Clustering • Goal: reasonably good k clusters in O(kn) time (k a constant) • A random sample of about √(kn) documents is clustered into k clusters in O(kn) time • The remaining documents are added to the “best” of the k clusters in O(kn) time
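A rough sketch of the buckshot idea, under stated assumptions: the sample is clustered here with a simple greedy agglomerative merge (the original algorithm uses group-average agglomerative clustering), and the remaining documents go to the nearest centroid. Names and details are illustrative.

```python
import math
import random

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def buckshot(docs, k, seed=0):
    """Buckshot sketch: cluster a random sample of ~sqrt(k*n) documents
    with a quadratic merge loop, then assign every remaining document
    to the most similar resulting centroid."""
    rng = random.Random(seed)
    n = len(docs)
    m = min(n, max(k, int(math.sqrt(k * n))))
    sample = set(rng.sample(range(n), m))
    clusters = [[i] for i in sample]  # singleton clusters over the sample

    def cent(c):
        vs = [docs[i] for i in c]
        return [sum(col) / len(vs) for col in zip(*vs)]

    while len(clusters) > k:  # merge the two most similar clusters
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = cosine(cent(clusters[a]), cent(clusters[b]))
                if best is None or s > best[0]:
                    best = (s, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)

    cents = [cent(c) for c in clusters]
    for i in range(n):  # add the rest to the "best" cluster
        if i not in sample:
            j = max(range(k), key=lambda t: cosine(docs[i], cents[t]))
            clusters[j].append(i)
    return clusters
```

The expensive pairwise work is confined to the small sample, which is what keeps the overall cost near O(kn).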
Clustering with truncated document vectors • The most expensive step in incremental clustering is computing the cosine similarity between a new document and all clusters • Clustering can be done successfully with vectors that contain only a few terms • This happens naturally when latent semantic indexing is used • An alternative is to discard terms whose weights fall below a threshold
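The threshold alternative can be illustrated with a small helper for sparse term-weight vectors; the function name and parameters are hypothetical.

```python
def truncate(weights, min_weight=None, top_k=None):
    """Truncate a sparse document vector (term -> weight dict):
    drop terms whose weight is below `min_weight`, and/or keep only
    the `top_k` heaviest terms."""
    items = list(weights.items())
    if min_weight is not None:
        items = [(t, w) for t, w in items if w >= min_weight]
    if top_k is not None:
        items = sorted(items, key=lambda tw: tw[1], reverse=True)[:top_k]
    return dict(items)
```

Shorter vectors make every subsequent document/centroid comparison proportionally cheaper.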
Cluster based retrieval • Given a cluster hierarchy for a collection • A cluster is selected for retrieval when the query is similar to its centroid • The search can be done either top-down or bottom-up
Cluster Retrieval Issues • A cluster is selected for retrieval when: • The similarity between query and centroid is above a threshold • Who decides on the threshold? • What should it be? • Instead of a threshold, a user may limit the number of documents retrieved to n
Cluster Retrieval Issues • Should all the documents in a selected cluster be retrieved? • If yes, how will they be ranked? • Should each document in a selected cluster be compared to the query?
Cluster based retrieval • Advantages • Relevant documents that do not contain query terms may be retrieved • Retrieval can be fast if only centroids are compared to the query
Cluster based retrieval • Disadvantages • When a whole cluster is returned, precision may be low • Clusters may contain relevant documents even when their centroid is not similar to the query
Cluster based retrieval • Disadvantages • When clusters are large, too many documents may be retrieved • Comparing each document in a selected cluster to the query is time-consuming
Bottom up search • The query is compared to each of the low-level centroids (i.e., those clusters that contain documents directly) • First, the best n low-level clusters are selected • Then each document in these clusters is compared to the query, and the best n documents are chosen
Bottom up search [tree diagram: a cluster hierarchy over documents A–N; numbers in parentheses are query similarities] With n = 3, the 3 best low-level clusters are 8 (.8), 4 (.7) and 5 (.6); the 3 best documents within them are I (.9), L (.8) and A (.8)
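The bottom-up procedure can be sketched as follows. The data layout (dictionaries of precomputed similarities) and the sample values in the usage below are illustrative assumptions, not the slide's exact tree.

```python
def bottom_up(cluster_sim, doc_sim, cluster_docs, n):
    """cluster_sim: low-level cluster -> similarity of query to its centroid.
    cluster_docs: cluster -> list of documents it contains.
    doc_sim: document -> similarity to the query.
    Select the n best low-level clusters, then return the n best
    documents found inside them."""
    best_clusters = sorted(cluster_sim, key=cluster_sim.get, reverse=True)[:n]
    candidates = {d for c in best_clusters for d in cluster_docs[c]}
    return sorted(candidates, key=lambda d: doc_sim[d], reverse=True)[:n]
```

Only documents inside the selected clusters are ever compared to the query, which is the source of both the speed-up and the risk of missing relevant documents elsewhere.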
Top down search • Best-first search: clusters reached are put into a priority queue • Search until a cluster with at most n documents is reached • All documents in that cluster are retrieved • If more documents are needed, the best-first search continues
Top down search [tree diagram: root cluster 1 (.2) over the same hierarchy; numbers in parentheses are query similarities] Cluster 1 is too big; the queue holds 4 (.7), 2 (.5), 3 (0). Cluster 4 is too big; the queue holds 8 (.8), 2 (.5), 9 (.3), 3 (0). Documents I and J are retrieved, then A and B.
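The best-first descent can be sketched with a priority queue. The hierarchy encoding and the sample tree in the test are illustrative assumptions, not the slide's exact example.

```python
import heapq

def top_down(root, children, docs_in, sim, n):
    """Best-first search down a cluster hierarchy.
    children[c]: sub-clusters of cluster c.
    docs_in[c]: all documents under c.  sim[c]: query/centroid similarity.
    Expand the most similar queued cluster; when one holds at most n
    documents, retrieve it whole; continue if more documents are needed."""
    retrieved = []
    heap = [(-sim[root], root)]            # max-queue via negated similarity
    while heap and len(retrieved) < n:
        _, c = heapq.heappop(heap)
        if len(docs_in[c]) <= n:
            retrieved.extend(docs_in[c])   # small enough: retrieve the cluster
        else:
            for ch in children.get(c, []): # too big: queue its children
                heapq.heappush(heap, (-sim[ch], ch))
    return retrieved
```

As in the slide's walkthrough, oversized clusters are repeatedly expanded until a small, highly similar cluster surfaces at the head of the queue.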
Thesaurus • A general thesaurus • Domain specific thesauri • Usage in IR • Building a word-based thesaurus
A general thesaurus • Contains synonyms and antonyms • Distinguishes different word senses • Sometimes includes broader terms • Many domain-specific terms are not included (C++, OS/2, RAM, etc.)
A general thesaurus • Roget http://humanities.uchicago.edu/forms_unrest/ROGET.html • Word based • Provides related terms, and a broader term
Roget search for “car” • Vehicle (broader term) • car, auto, jalopy, clunker, lemon, flivver, coupe, sedan, two-door sedan, four-door sedan, luxury sedan; wheels [coll.], sports • car, roadster, gran turismo[It], jeep, four-wheel drive vehicle, electric • car, ...
A general thesaurus • http://dictionary.langenberg.com/ - Chicago thesaurus • WordNet - a lexical database for English • http://www.cogsci.princeton.edu/~wn • car - noun has 5 senses in WordNet • car - auto, automobile, machine, motorcar • car - railcar...
WordNet Evaluation • "Natural language processing is essential for dealing efficiently with the large quantities of text now available online: fact extraction and summarization, automated indexing and text categorization, and machine translation.”
WordNet Evaluation • “Another essential function is helping the user with query formulation through synonym relationships between words and hierarchical and other relationships between concepts. WordNet supports both of these functions and thus deserves careful study by the digital library community”
Domain specific thesauri • Keywords are usually phrases • Medical thesaurus • http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed. • Law based thesaurus, etc.
A thesaurus for IR • Contains relations between concepts such as broader, narrower, related, etc. • Usually concepts form hierarchies • Thesaurus can be on-line or off-line • Library of Congress, for example, provides a thesaurus to determine search subjects and keywords
A thesaurus for IR - example
computer aided instruction
see also: education
UF (used for): teaching machines
BT (broader term): educational computing
NT (narrower term):
TT (top term): computer application
RT (related terms): education, teaching
How it is used • Manual indexing - term selection • Automatic indexing - set of synonyms represented by one term • Query formulation - helps users select query terms (important in controlled vocabulary)
How it is used • Query expansion - suggests terms to user • Broaden or narrow a query depending on retrieval results • Automatic query expansion
Building a thesaurus • Manual • Automatic: word based or phrase based
Manual generation • Domain experts • Information experts • Keywords selected • Hierarchy built • Expensive, time consuming and subjective
Automatically built word-based thesauri • Salton • Pedersen • Crouch
Corpus-based word-based thesaurus (S) • (Salton 1971) • Main idea: When ti and tj often co-occur in the same documents, they are related. • The terms “trial”, “defendant”, “prosecution” and “judge” will tend to co-occur in the same documents
Corpus-based word-based thesaurus (S) • Uses the term/document weight matrix W • wi,k is the weight assigned to term i in document k • Computes a matrix T • The element in the ith row and jth column of T is the “relation” of term i to term j
Corpus-based word-based thesaurus (S) • T(ti, tj) = Σk wi,k · wj,k / Σk wi,k (a reconstruction of the slide's formula, consistent with the properties noted below) • N is the number of documents; the sums run over k = 1, ..., N
Corpus-based word-based thesaurus (S) • Note that the numerator is the same for both T(ti, tj) and T(tj, ti) • The denominator, however, uses the total weight of term i in T(ti, tj), and the total weight of term j in T(tj, ti) • 0 - when the terms never co-occur • 1 - when the co-occurrence vectors are equal
Corpus-based word-based thesaurus (S) • The values in the matrix are used to distinguish between, • broad, • narrow and • related terms
Corpus-based word-based thesaurus (S) • The system considers ti and tj related when both T(ti, tj) and T(tj, ti) are at least K • K is a similarity threshold determined experimentally
Corpus-based word-based thesaurus (S) • Broad terms occur in documents more often than narrower ones • ex. “house” vs. “cottage”; “language” vs. “French” • The system considers ti broader than tj when T(tj, ti) >= K and T(ti, tj) < K
Example (K = 1/4): t3 and t4 are related; t1 is broader than t2 [the weight matrix for this example appeared on the slide]
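The construction of T and the related/broader rules can be sketched as below. The formula used is the reconstruction above (shared numerator, normalized by term i's total weight), so treat it as an assumption; function names are hypothetical, and the weight matrix in the test is a fresh example, not the slide's.

```python
def term_term_similarity(W):
    """W[i][k]: weight of term i in document k.
    Returns T with T[i][j] = sum_k W[i][k]*W[j][k] / sum_k W[i][k]
    (assumed form: shared numerator, normalized by total weight of i)."""
    t = len(W)
    T = [[0.0] * t for _ in range(t)]
    for i in range(t):
        total_i = sum(W[i])
        for j in range(t):
            if i != j and total_i:
                T[i][j] = sum(wi * wj for wi, wj in zip(W[i], W[j])) / total_i
    return T

def classify(T, i, j, K):
    """The slide's rules: related if both T[i][j] and T[j][i] >= K;
    i broader than j if T[j][i] >= K and T[i][j] < K."""
    if T[i][j] >= K and T[j][i] >= K:
        return "related"
    if T[j][i] >= K and T[i][j] < K:
        return f"{i} broader than {j}"
    if T[i][j] >= K and T[j][i] < K:
        return f"{j} broader than {i}"
    return "unrelated"
```

With binary weights, a term occurring in many documents dilutes its own normalized score against a rarer co-occurring term, which is exactly how the broader/narrower asymmetry arises.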
Corpus-based word-based thesaurus (S) • In that case T(tj, ti) is large while T(ti, tj) is small: the terms co-occur often, but ti has a much larger normalizing factor (total weight) because it occurs in many more documents
Drawback (S) • Terms found related by this method may not be semantically related • Other term/term similarity functions have also been used (for example, the inner product) to evaluate how related two terms are
Full text collections • Co-occurrence may not be meaningful in large documents that cover many topics • Use co-occurrence within a document “window” instead of the whole document
Full text collections • The idea is that co-occurrences of terms should be in a small part of the text • A “window” may be a few paragraphs or a “chunk” of q terms
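Counting co-occurrences within chunks of q terms can be sketched as follows. Non-overlapping chunks are one simple reading of a “window”; a sliding window is an equally valid variant, and the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations

def window_cooccurrence(tokens, q):
    """Count co-occurrences of distinct term pairs within
    non-overlapping chunks of q consecutive tokens."""
    counts = Counter()
    for start in range(0, len(tokens), q):
        chunk = set(tokens[start:start + q])  # distinct terms in this chunk
        for a, b in combinations(sorted(chunk), 2):
            counts[(a, b)] += 1               # pair keys kept in sorted order
    return counts
```

Pairs that never share a chunk simply get no entry, so topic drift across a long document no longer inflates the counts.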
Co-occurrence-based Thesaurus (P) • (Schutze and Pedersen 1997) • Another idea: words with similar meanings co-occur with similar neighbors • "litigation" and "lawsuit" share neighbors such as "court", “judge", “witness” and "proceedings"
Co-occurrence-based Thesaurus (P) • Matrix A is computed for terms that occur 2,000–5,000 times • Ai,j = the number of times words i and j co-occur in the collection within windows of size k = 40 • These terms are clustered into 200 A-classes (average-based clustering)
Co-occurrence-based Thesaurus (P) • A matrix B (200 × 20,000) is generated for the 20,000 most frequent terms, based on their co-occurrence with the A-class clusters • Assume two A-classes gA1 = {t1, t2, t3} and gA2 = {t4, t5} • If term j co-occurs with t1 10 times, t2 5 times and t4 6 times, then B[1, j] = 15 and B[2, j] = 6 • The 20,000 terms are then clustered into 200 B-classes (buckshot)
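The construction of B from the A-classes can be sketched directly from the worked example; the dictionary-based representation and function name are assumptions.

```python
def class_cooccurrence(a_classes, term_cooc):
    """a_classes: list of term sets (the A-classes).
    term_cooc[j][t]: how often term j co-occurs with term t.
    Returns B where B[(g, j)] sums term j's co-occurrences with the
    members of A-class g, as in the slide's construction of matrix B."""
    B = {}
    for g, members in enumerate(a_classes):
        for j in term_cooc:
            B[(g, j)] = sum(term_cooc[j].get(t, 0) for t in members)
    return B
```

Collapsing individual terms into 200 class columns is what makes clustering the 20,000 frequent terms tractable.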
Co-occurrence-based Thesaurus (P) • Now matrix C is formed for all terms • The entry C[i, j] is the number of times term j co-occurs with B-class i • SVD is then applied to the 200 × t matrix • A document is represented by a vector that is the sum of the context vectors of its terms (columns in the SVD)