Scatter/Gather: A Cluster-Based Approach to Large Document Collections • Alyssa Katz • LIS 551 • March 23, 2003
Introduction • Alternate uses for document clustering • Give document clustering a second chance!
Old Approach • Compare Document Clustering with Vector Space Models • Cluster searches are for the most part inferior to VS searches • Document clustering algorithms are SLOW • CONCLUSION: Document clustering should only be used to the extent of accelerating VS searches
New Approach • Document Clustering is not bad, just misunderstood • The REAL question is: How can clustering be effective in its own right? • THE ANSWER: The “Scatter/Gather Method”
Searching vs. Browsing • Searching: Specific information need; user has a good idea of keywords or search terms; faster, more pointed • Browsing: User wants more general info; is not familiar with the vocabulary, or doesn’t want to commit to a specific set of words; user will sift through info to find what he wants
Solution • Use clustering to browse a system the way one would browse a table of contents • Have a function where user can alternate between browsing and searching
Scatter/Gather • User is presented with short summaries of a small number of document groups • User selects one or more groups for further study • The selected groups are merged and re-clustered into new, smaller groups • This process continues until the individual document level is reached
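The browsing loop on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method's actual implementation: `summarize`, `round_robin_clusters`, `scatter_gather`, and the `choose_fn` callback are all hypothetical names, the summary is simply the most frequent words in a group, and the round-robin "clustering" is only a placeholder for a real algorithm.

```python
from collections import Counter


def summarize(group, top_n=3):
    """Hypothetical summary: the most frequent words in the group."""
    counts = Counter(word for doc in group for word in doc.lower().split())
    return [word for word, _ in counts.most_common(top_n)]


def round_robin_clusters(docs, k):
    """Placeholder for a real clustering algorithm: deals docs into k groups."""
    return [docs[i::k] for i in range(k)]


def scatter_gather(docs, k, cluster_fn, choose_fn):
    """Alternate scatter (cluster) and gather (merge chosen groups)."""
    current = list(docs)
    while len(current) > 1:
        # Scatter: split the current documents into groups, dropping empty ones.
        groups = [g for g in cluster_fn(current, k) if g]
        summaries = [summarize(g) for g in groups]
        # The user (here a callback) picks group indices from the summaries.
        chosen = choose_fn(summaries)
        # Gather: merge the chosen groups into the next working set.
        selected = [doc for i in chosen for doc in groups[i]]
        if not selected or len(selected) == len(current):
            break  # nothing chosen, or no further narrowing is possible
        current = selected
    return current


docs = [
    "kuwait oil invasion",
    "germany unification vote",
    "kuwait military deployment",
    "oil market prices",
]
# Simulated user: always selects groups whose summary mentions "kuwait".
pick_kuwait = lambda summaries: [i for i, s in enumerate(summaries) if "kuwait" in s]
result = scatter_gather(docs, 2, round_robin_clusters, pick_kuwait)
print(result)
```

In an interactive system `choose_fn` would display the summaries and wait for the user's selection; passing it in as a callback keeps the loop itself independent of any user interface.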
Example • 5000 articles in the NYT News Service • International News • Kuwait and Germany and Oil • Articles about the effect of the invasion on the oil market, U.S. military deployment in Kuwait • Document
Requirements • New Algorithms • One that can appropriately cluster large document collections • One that can sufficiently generate summaries of these document collections
Solution • Buckshot algorithm for the first requirement • Clusters a random sample of the documents rather than the whole collection • Fractionation for the second requirement
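As a rough sketch of the Buckshot idea: draw a random sample of about sqrt(k·n) documents, cluster only that sample into k groups, and then assign every document to the nearest resulting center. The code below assumes bag-of-words vectors and cosine similarity, and it merges clusters by centroid similarity as a simplification of the agglomerative step; the function names (`vectorize`, `agglomerate`, `buckshot`) and the `seed` parameter are hypothetical.

```python
import math
import random
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (an assumed, very simple representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Sum of the vectors; cosine ignores scale, so no averaging is needed."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def agglomerate(vectors, k):
    """Repeatedly merge the two most similar clusters until k remain."""
    clusters = [[v] for v in vectors]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters


def buckshot(docs, k, seed=0):
    """Cluster a sqrt(k*n) sample, then assign all docs to the nearest center."""
    rng = random.Random(seed)
    vectors = [vectorize(d) for d in docs]
    sample_size = min(len(docs), max(k, math.ceil(math.sqrt(k * len(docs)))))
    sample = rng.sample(vectors, sample_size)
    centers = [centroid(c) for c in agglomerate(sample, k)]
    groups = [[] for _ in centers]
    for doc, vec in zip(docs, vectors):
        nearest = max(range(len(centers)), key=lambda c: cosine(vec, centers[c]))
        groups[nearest].append(doc)
    return groups


docs = [
    "kuwait oil invasion", "oil market prices", "kuwait oil exports",
    "oil prices rise", "germany unification vote", "germany election vote",
    "berlin unification treaty", "germany treaty vote",
]
groups = buckshot(docs, 2)
print(groups)
```

Because the expensive pairwise merging runs only on the sample, the cost of that step grows with k·n rather than n², which is what makes the approach attractive for large collections. The quality of the result depends on the sample, so different seeds can give different groupings.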
Application to Scatter/Gather • Basically, clustering is done beforehand, and real time searches do not cluster from scratch • Real time searches just refine what already exists
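One way to read this slide is that the cluster hierarchy is built offline and a browsing session only walks down it. The sketch below assumes exactly that; the `ClusterNode` type, the `browse` function, and the tree contents (labels borrowed loosely from the earlier NYT example) are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterNode:
    """A precomputed cluster: a short summary plus its sub-clusters."""
    summary: str
    children: list = field(default_factory=list)


def browse(root, choices):
    """Follow the user's choices down the precomputed tree; no clustering here."""
    node = root
    for index in choices:
        if not node.children:
            break  # reached the individual document level
        node = node.children[index]
    return node


# Hypothetical hierarchy, assumed to have been built ahead of time.
root = ClusterNode("NYT News Service", [
    ClusterNode("International News", [
        ClusterNode("Kuwait, Germany, oil", [
            ClusterNode("Invasion and the oil market"),
            ClusterNode("U.S. military deployment in Kuwait"),
        ]),
    ]),
    ClusterNode("Other news"),
])

node = browse(root, [0, 0, 1])
print(node.summary)
```

Each step at browse time is a lookup in the stored tree, so response time does not depend on the size of the collection.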