Dataware’s Document Clustering and Query-By-Example Toolkits • John Munson • Dataware Technologies • 1999 BRS User Group Conference
Document Clustering • Automatically creates clusters of similar documents • General benefit: provides an overview of the range of topics in a set • Multiple specific uses • Familiarization with database before searching • Familiarization with a result set after searching • Assistance in category definition for other uses • Category tree construction • FAQ construction
Dataware’s Clustering Toolkit • One API function • Source of documents is a BRS result set • which could be backref 0 for entire database • Can specify certain fields for analysis • Output indicates member documents for each cluster • Application can specify number and max/min size of clusters, etc. • US PTO (Patent and Trademark Office) plans to do category tree construction
How It Works • Extracts keywords from each document • using our keyword-generation library • which is also in the BRS 6.3 keyword-generation load filter • Repeats these steps: • Compare document and cluster pairs using the keyword lists • How many keywords do two lists share, and how similar are their weights? • Combine the most similar pair into one cluster • Stops when n clusters remain (n is configurable)
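The merge loop described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Dataware's implementation: the function names (`similarity`, `merge`, `cluster`), the min/max weighted-overlap measure, and the choice to sum weights when two lists are combined are all assumptions made for the example.

```python
from itertools import combinations


def similarity(a, b):
    """Weighted overlap of two keyword->weight dicts (an assumed measure).

    Shared keywords contribute min(weight) to the numerator; every keyword
    contributes max(weight) to the denominator. The score therefore rises
    with the number of shared keywords and with how close their weights are.
    """
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    den = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return num / den if den else 0.0


def merge(a, b):
    """Combine two keyword lists by summing weights (an assumption)."""
    out = dict(a)
    for k, w in b.items():
        out[k] = out.get(k, 0.0) + w
    return out


def cluster(docs, n):
    """Merge the most similar pair repeatedly until n clusters remain.

    docs maps a document id to its keyword->weight dict.
    Returns a list of (member_ids, keyword_weights) tuples.
    """
    clusters = [([doc_id], dict(kw)) for doc_id, kw in docs.items()]
    while len(clusters) > max(n, 1):
        # Compare every pair of current clusters and pick the closest one.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: similarity(clusters[p[0]][1], clusters[p[1]][1]),
        )
        merged = (clusters[i][0] + clusters[j][0],
                  merge(clusters[i][1], clusters[j][1]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

Every pass compares all remaining pairs, which is the cost the later "Speed Tricks" slides set out to reduce.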
How It Works • Output is a list of clusters, including: • a cluster quality score • Measures how cohesive the cluster is • a ranked list of keywords describing the cluster • a ranked list of member documents • Highest-ranked docs are the most “central”
Speed Tricks • Speed is a big issue in clustering • especially for interactive searching • Keyword extraction takes time • Pairwise comparisons don’t scale up well at all • Thus, we use a couple of speed tricks • One trick for database design • One trick inside the clustering function • Trick 1: Pre-generate keywords • Use the BRS 6.3 keyword generation load filter • The filter produces a keyword paragraph that looks like this...
Speed Tricks • ..Keywords: compartment (187.80). mass (156.56). methylhistidine (118.12). ... • At clustering time, we don’t need to do keyword analysis • Just retrieve keyword lists from engine • Cuts execution time in half
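An application that retrieves the pre-generated keyword paragraph needs to turn it back into a keyword list. The sketch below parses text shaped like the sample line on this slide. The syntax is inferred from that one sample only (a `..Keywords:` label, then `keyword (weight).` entries), and `parse_keyword_paragraph` is a hypothetical helper, not part of the BRS or Dataware API.

```python
import re

# One entry looks like "compartment (187.80)." in the slide's sample.
# Assumes keywords contain no parentheses or periods.
KEYWORD_RE = re.compile(r"\s*([^().]+?)\s*\((\d+(?:\.\d+)?)\)\.?")


def parse_keyword_paragraph(text):
    """Return a keyword->weight dict from a '..Keywords:' paragraph."""
    label = "..Keywords:"
    body = text.strip()
    if body.startswith(label):
        body = body[len(label):]
    return {m.group(1): float(m.group(2)) for m in KEYWORD_RE.finditer(body)}


sample = "..Keywords: compartment (187.80). mass (156.56). methylhistidine (118.12)."
print(parse_keyword_paragraph(sample))
# {'compartment': 187.8, 'mass': 156.56, 'methylhistidine': 118.12}
```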
Speed Tricks • Trick 2: Cluster a sample of the set (Cutting et al.) • Create the desired number of clusters from a small sample • Then compare the remaining documents only to those few clusters, not to all other documents • Saves a huge amount of execution time • Another trick for result-set clustering: • Cluster only the top-ranked 100 to 1000 docs • A final speed note: CPU speed helps a lot • Clustering is very processor-intensive • 2x CPU speed gives almost 2x clustering speed
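Trick 2 can be sketched as a two-phase routine: run the expensive pairwise merging on a small sample, then place each remaining document in its nearest cluster. This is an illustration under assumptions, not the toolkit's code: the similarity measure, the random sampling, the fixed cluster keyword lists during assignment, and all function names are hypothetical.

```python
import random


def similarity(a, b):
    """Assumed weighted-overlap measure between keyword->weight dicts."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    den = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return num / den if den else 0.0


def cluster_sample(docs, ids, n):
    """Pairwise agglomerative clustering, applied to the sample only."""
    clusters = [{"members": [i], "keywords": dict(docs[i])} for i in ids]
    while len(clusters) > max(n, 1):
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = similarity(clusters[x]["keywords"], clusters[y]["keywords"])
                if best is None or s > best[0]:
                    best = (s, x, y)
        _, x, y = best
        absorbed = clusters.pop(y)  # y > x, so index x stays valid
        clusters[x]["members"] += absorbed["members"]
        for k, w in absorbed["keywords"].items():
            clusters[x]["keywords"][k] = clusters[x]["keywords"].get(k, 0.0) + w
    return clusters


def cluster_with_sample(docs, n, sample_size, seed=0):
    """Build n clusters from a sample, then assign every other document."""
    ids = sorted(docs)
    sample = random.Random(seed).sample(ids, min(sample_size, len(ids)))
    clusters = cluster_sample(docs, sample, n)
    sampled = set(sample)
    for doc_id in ids:
        if doc_id in sampled:
            continue
        # Only n comparisons per document, instead of one per other document.
        nearest = max(clusters,
                      key=lambda c: similarity(c["keywords"], docs[doc_id]))
        nearest["members"].append(doc_id)
    return clusters
```

With a sample of s documents out of N, the pairwise phase works on s items and the assignment phase costs about n comparisons for each of the remaining N - s documents, which is where the saving comes from.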
Query-By-Example (QBE) • Allows an example passage or document to serve as a query • Useful when we already have some text or a document about our topic • “Find more like this” • No query formulation required • QBE analyzes the text, then constructs and executes a query
Dataware’s QBE Toolkit • One API function • Source of example text can be: • a text buffer (e.g. text selected with the mouse) • a BRS document (or documents) from a result set (e.g. selected from a title list; certain fields can be specified for analysis) • a word list with weights or occurrence counts • Output is a standard ranked document list
How It Works • Extracts keywords from the example text • using ... all together now ... our keyword-generation library, yet again • Keyword selection process likes words that: • occur frequently in the example text • are rare in the database as a whole • Getting database statistics can be done: • using field qualification - most accurate but slow • using no qualification - still good, much faster • not at all -- just use occurrence counts in example text -- fastest, but trickier
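The selection rule on this slide (frequent in the example text, rare in the database) resembles a term-frequency times inverse-document-frequency weighting. The sketch below uses that standard weighting as a stand-in; the slides do not give Dataware's actual formula. The stopword list, tokenizer, `select_keywords`, and its parameters are all assumptions. Passing no database statistics falls back to raw occurrence counts, the "not at all" option above.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = frozenset({"the", "a", "an", "of", "and", "in", "to", "is", "for"})


def select_keywords(text, doc_freq=None, total_docs=0, top_k=10):
    """Rank words from the example text as (word, weight) pairs.

    doc_freq maps a word to the number of database documents containing it.
    When doc_freq or total_docs is missing, only occurrence counts are used.
    """
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS]
    counts = Counter(words)
    weights = {}
    for word, tf in counts.items():
        if doc_freq and total_docs:
            df = doc_freq.get(word, 0)
            # Smoothed inverse document frequency: rare words score higher.
            idf = math.log((total_docs + 1) / (df + 1)) + 1.0
        else:
            idf = 1.0
        weights[word] = tf * idf
    # Highest weight first; ties broken alphabetically for stable output.
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
```

Without database statistics, a common word that happens to repeat in the example ranks as high as a distinctive one, which is one reason that option is "trickier".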
How It Works • Performs a ranked search using the keywords and their weights • Flexible fielding: • Analysis of example document(s) can use one set of BRS paragraphs • Search can use a different set • Speed trick: • Generate keyword field for database (load filter) • Field-level index it • Use it for QBE searches
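As a rough picture of the final step, the sketch below ranks documents by matching the weighted query keywords against each document's pre-generated keyword field. The real search is performed by the BRS engine; this in-memory stand-in, the dot-product score, and the name `ranked_search` are assumptions made for illustration.

```python
def ranked_search(query, keyword_field, limit=10):
    """Rank documents against weighted query keywords.

    query maps keyword -> weight (from the example text).
    keyword_field maps doc id -> keyword->weight dict (the indexed field).
    Returns (doc_id, score) pairs, best first; non-matching docs are omitted.
    """
    scores = {}
    for doc_id, doc_keywords in keyword_field.items():
        # Assumed score: sum of query weight times document weight.
        score = sum(w * doc_keywords.get(k, 0.0) for k, w in query.items())
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]
```

Because analysis and search are separate steps, the query keywords could come from one set of paragraphs while `keyword_field` is built from another, matching the flexible fielding described above.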