Scatter/Gather : A Cluster Based Approach to Large Document Collections

Alyssa Katz, LIS 551, March 23, 2003

Introduction • Alternate uses for document clustering • Give document clustering a second chance!

Old Approach • Compare document clustering with vector space (VS) models • Cluster searches are, for the most part, inferior to VS searches • Document clustering algorithms are SLOW • CONCLUSION: Document clustering should be used only to accelerate VS searches

New Approach • Document Clustering is not bad, just misunderstood • The REAL question is: How can clustering be effective in its own right? • THE ANSWER: The “Scatter/Gather Method”

Searching vs. Browsing • Searching: specific information need; user has a good idea of keywords or search terms; faster, more pointed • Browsing: user wants more general info; is not familiar with the vocabulary, or doesn’t want to commit to a specific set of words; user will sift through info to find what he wants

Solution • Use clustering to browse a collection the way one would browse a table of contents • Let the user alternate between browsing and searching

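Scatter/Gather • User is presented with short summaries of a small number of document groups (scatter) • User selects one or more groups for further study • The selected groups are merged (gather) and re-clustered into new, smaller groups • The process repeats until the user reaches the individual document level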
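The loop on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `scatter_gather`, `cluster_fn`, `choose_fn`, and `chunk_cluster` are hypothetical names, and `chunk_cluster` is a trivial stand-in that slices the list into equal parts rather than a real clustering algorithm.

```python
def chunk_cluster(docs, k):
    """Stand-in for a real clustering step (assumption): split docs into k contiguous chunks."""
    size = -(-len(docs) // k)  # ceiling division
    return [docs[i:i + size] for i in range(0, len(docs), size)]


def scatter_gather(docs, cluster_fn, choose_fn, k=4, stop_at=2):
    """Alternate scatter (cluster) and gather (merge chosen groups) until few documents remain."""
    current = list(docs)
    while len(current) > stop_at:
        groups = [g for g in cluster_fn(current, k) if g]   # scatter
        chosen = choose_fn(groups)                          # indices the user picked
        gathered = [d for i in chosen for d in groups[i]]   # gather
        if not gathered or len(gathered) == len(current):
            break  # nothing selected, or no narrowing: stop instead of looping forever
        current = gathered
    return current


# A simulated user who always picks the first group shown.
remaining = scatter_gather(list(range(16)), chunk_cluster, lambda groups: [0], k=4)
```

In an interactive system `choose_fn` would show each group's summary and wait for the user's selection; here it is a plain function so the loop can be run without a user interface.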

Example (each step narrows the previous one) • 5000 articles in the NYT News Service • International News • Kuwait and Germany and Oil • Articles about the effect of the invasion on the oil market, U.S. military deployment in Kuwait • Document

Requirements • New algorithms • One that can cluster large document collections quickly enough for interactive use • One that can generate useful summaries of the resulting document groups
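As a rough illustration of the second requirement, a group summary could combine the group's most frequent terms with the documents closest to the group's centroid. This is a minimal sketch, not the method from the presentation: `summarize`, `n_terms`, and `n_titles` are hypothetical names, documents are plain strings, and no stop-word removal or term weighting is applied.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def summarize(docs, n_terms=3, n_titles=1):
    """Return (top terms, most central documents) for one group of documents."""
    vectors = [Counter(d.lower().split()) for d in docs]
    total = Counter()
    for v in vectors:
        total.update(v)  # summed term counts act as an unnormalized centroid
    top_terms = [t for t, _ in total.most_common(n_terms)]
    ranked = sorted(zip(docs, vectors), key=lambda p: cosine(p[1], total), reverse=True)
    typical = [d for d, _ in ranked[:n_titles]]
    return top_terms, typical


group = ["oil prices rise", "oil market oil", "kuwait oil invasion"]
terms, titles = summarize(group, n_terms=1)
```

A practical version would at least drop common function words before counting, since otherwise words like "the" dominate every summary.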

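Solution • Buckshot and Fractionation for the first requirement • Buckshot clusters a small random sample of the documents, then assigns the remaining documents to the resulting cluster centers • Fractionation is slower but more careful • Cluster digests (key terms and typical titles) for the second requirement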
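The idea behind Buckshot can be sketched as follows: cluster a random sample of about sqrt(k·n) documents with a slow but thorough method, then use the resulting centers to place every document. The sketch below makes several assumptions: documents are plain strings turned into word-count vectors, the agglomeration step merges the two clusters with the most similar centroids (a simplification of group-average clustering), and all function names are hypothetical.

```python
import math
import random
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Summed term counts; cosine ignores scale, so no division is needed."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def agglomerate(vectors, k):
    """Quadratic-time merging of the most similar pair of clusters until k remain."""
    clusters = [[v] for v in vectors]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return [centroid(c) for c in clusters]


def buckshot(docs, k, seed=0):
    """Find k centers from a sqrt(k*n) sample, then assign every document to its nearest center."""
    vectors = [Counter(d.lower().split()) for d in docs]
    n = len(vectors)
    sample_size = min(n, max(k, math.ceil(math.sqrt(k * n))))
    sample = random.Random(seed).sample(vectors, sample_size)
    centers = agglomerate(sample, k)
    groups = [[] for _ in centers]
    for doc, vec in zip(docs, vectors):
        nearest = max(range(len(centers)), key=lambda c: cosine(vec, centers[c]))
        groups[nearest].append(doc)
    return groups


docs = [
    "oil prices rise after invasion",
    "oil market reacts to kuwait invasion",
    "kuwait oil fields burn",
    "germany votes in national election",
    "german election results announced",
    "election campaign in germany ends",
]
groups = buckshot(docs, k=2)
```

Because the expensive quadratic step runs only on the sample, the overall cost stays roughly proportional to k·n. The result depends on which documents land in the sample, which is why a fixed `seed` is used here.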

Application to Scatter/Gather • The initial clustering is done ahead of time, so real-time searches do not cluster from scratch • Real-time searches just refine what already exists
