Dataware’s Document Categorization Toolkit
John Munson, Dataware Technologies
1999 BRS User Group Conference
Categorization: A Popular Topic These Days
• Automatically places documents into user-defined categories
• Multiple uses
  • Categorize new documents being added to a database
    • batched or via interactive category suggestion
  • Enable browsing of database contents by category
  • Filter incoming documents for “personal pages”
  • Route incoming documents to appropriate offices for action
Dataware’s Categorization Toolkit
• Consists of a small API
  • half-dozen functions
• Implemented for BRS and Dataware II engines
  • available via custom services
• Now being integrated into the KMS product
• Now in production at the US PTO (Patent and Trademark Office)
  • Handling 40,000+ categories
  • To route incoming patent applications to appropriate examiners
How It Works
• Uses these components:
  • Our keyword-generation library
    • which is also used in the 6.3 keyword-generation load filter
  • Term-weighted relevance ranking
  • Can optionally use our Natural Language Object Library (NLOL)
• Creates a “profile” for each category
  • A BRS saved search plus additional data (for example, term weights and a score threshold)
• Runs each profile against new documents
• Reports the n profiles that best matched each document
Profile Generation
• Category profiles can be created:
  • Manually, using keywords or arbitrary searches
  • Automatically, using example documents
    • “Define-by-example”
  • Using an automatic/manual combination
• Define-by-example is a big advantage
  • Allows novices to create profiles
  • Makes big problems solvable
    • such as PTO’s 40,000 categories, which would be impossible to profile manually
Define-By-Example
• Analyze example documents for each category, retrieved from a training database
• Use our keyword library to extract up to 128 keywords
  • Each keyword becomes a search term
  • Each gets a weight, to be used by the relevance-ranking function
• The keyword selection process favors words that:
  • are in many of the documents being analyzed
  • are rare in the database as a whole
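The selection criteria above (common in the examples, rare in the database) resemble a TF-IDF-style weighting. The following is a minimal sketch under that assumption, not the toolkit's actual keyword library: the function name `select_keywords`, the whitespace tokenization, and the exact formula are all hypothetical. Only the 128-keyword cap comes from the slide.

```python
import math
from collections import Counter


def select_keywords(example_docs, database_docs, max_keywords=128):
    """Weight each term by (share of example docs containing it) * (rarity in database).

    Assumed formula, for illustration only:
        weight = coverage * log((N + 1) / (df + 1))
    where N is the database size and df is the term's database document frequency.
    """
    example_sets = [set(doc.lower().split()) for doc in example_docs]
    database_sets = [set(doc.lower().split()) for doc in database_docs]

    example_df = Counter(term for terms in example_sets for term in terms)
    database_df = Counter(term for terms in database_sets for term in terms)
    n_db = len(database_sets)

    weights = {}
    for term, df in example_df.items():
        coverage = df / len(example_sets)            # high if in many examples
        rarity = math.log((n_db + 1) / (database_df[term] + 1))  # high if rare overall
        weights[term] = coverage * rarity

    # Highest weight first; ties broken alphabetically for a stable result.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:max_keywords])
```

In this sketch a word that appears in every database document gets a weight of zero, so it contributes nothing to ranking even if every example contains it.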
Define-By-Example (continued)
• After keyword selection, run the profile against the training database
  • Search on the keywords, rank the results
  • Analyze the resulting rank scores
• Determine a score threshold for the profile
  • The threshold determines whether each new document gets placed in the category
• The goal is a threshold that:
  • is lower than the example documents’ scores
  • is higher than other documents’ scores
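One simple way to meet both threshold goals is to place the threshold between the lowest example score and the highest non-example score. The sketch below assumes that approach. The function name `choose_threshold`, the midpoint rule, and the fallback for overlapping scores are hypothetical, since the slides do not say how the toolkit analyzes the scores.

```python
def choose_threshold(example_scores, other_scores):
    """Pick a score threshold that separates examples from other documents.

    Assumption: a document is accepted when its score >= threshold.
    """
    lowest_example = min(example_scores)
    highest_other = max(other_scores, default=0.0)

    if highest_other < lowest_example:
        # Clean separation: the midpoint is below every example score
        # and above every other score.
        return (lowest_example + highest_other) / 2

    # Scores overlap, so no threshold satisfies both goals. This sketch
    # keeps every example (favoring recall) and tolerates some false hits.
    return lowest_example
```

When the two score ranges overlap, some trade-off is unavoidable; the fallback here is only one possible policy.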
Categorizing New Documents
• Run all profiles against new documents
  • Could be an entire database
  • Could be all documents after a given date
  • Could be a batch of incoming newsfeed/web/etc. documents loaded into a temp database
  • Could be a single document at submission time
• Some documents are accepted by multiple profiles
  • Categorizer can compare those profiles and provide a ranked list for each document
  • Good for interactive category suggestion
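The flow above (score a document against every profile, keep the profiles whose threshold it meets, return a ranked list) can be sketched as follows. This is an illustration, not the toolkit's API: the `Profile` class, `score`, and `categorize` are hypothetical names, and the sum-of-matching-weights score is a stand-in for the engine's term-weighted relevance ranking.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    category: str
    weights: dict      # keyword -> weight
    threshold: float   # minimum score for acceptance


def score(profile, text):
    """Stand-in relevance score: sum of weights of profile keywords found in the text."""
    terms = set(text.lower().split())
    return sum(w for term, w in profile.weights.items() if term in terms)


def categorize(text, profiles, n=3):
    """Return up to n (category, score) pairs for profiles that accept the document."""
    accepted = []
    for profile in profiles:
        s = score(profile, text)
        if s >= profile.threshold:
            accepted.append((profile.category, s))
    # Best match first; ties broken by category name.
    accepted.sort(key=lambda pair: (-pair[1], pair[0]))
    return accepted[:n]


profiles = [
    Profile("valves", {"valve": 2.0, "seal": 1.0}, threshold=1.5),
    Profile("pumps", {"pump": 2.0, "valve": 0.5}, threshold=1.0),
    Profile("law", {"claim": 1.0}, threshold=0.5),
]
suggestions = categorize("A pump with a valve seal", profiles)
```

Here the document is accepted by both the "valves" and "pumps" profiles, and the ranked list is what an interactive application could present as category suggestions.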
Adaptive Categories
• Applications can accept user feedback and adapt to new examples
  • “Refine-by-example”
• Feedback can be active...
  • User acceptance/rejection of categorizer’s decisions
• …or could be passive
  • Titles/summaries of documents selected for viewing from among suggested documents could be used as positive feedback
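The slides do not describe how refine-by-example updates a profile. As one possible illustration, the sketch below nudges keyword weights up for accepted documents and down for rejected ones, loosely in the spirit of Rocchio relevance feedback. The function name `refine_weights`, the `rate` parameter, and the update rule are all assumptions.

```python
def refine_weights(weights, doc_terms, accepted, rate=0.1):
    """Return a new keyword -> weight mapping adjusted by one feedback event.

    accepted=True  (positive feedback): raise weights of the document's terms,
                   adding terms the profile did not have yet.
    accepted=False (negative feedback): lower weights of profile terms that
                   appear in the document, never going below zero.
    """
    updated = dict(weights)  # leave the caller's profile untouched
    delta = rate if accepted else -rate
    for term in doc_terms:
        if term in updated or accepted:
            updated[term] = max(0.0, updated.get(term, 0.0) + delta)
    return updated
```

Passive feedback, such as a user opening a suggested document, could be fed through the same function with `accepted=True`, perhaps with a smaller `rate` because the signal is weaker.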
Flexibility
• Categorizer can be configured to provide:
  • Higher recall (accept more documents into category, at a cost of more false hits)
  • Higher precision (few to no false hits, at a cost of missing some good documents)
  • A balance between the two
  • Total recall (all documents are accepted into some category, even if there’s no good match)
    • Useful for routing applications where at least one suggestion is required
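These settings can be pictured as adjustments to the acceptance threshold. The sketch below is a hypothetical model, not the toolkit's configuration interface: `accept_categories`, the `bias` multiplier, and the `total_recall` flag are invented names for illustration.

```python
def accept_categories(scores, thresholds, bias=1.0, total_recall=False):
    """Decide which categories accept a document.

    scores, thresholds: category -> float
    bias < 1 lowers every threshold (higher recall, more false hits);
    bias > 1 raises every threshold (higher precision, more misses).
    total_recall=True guarantees at least one category by falling back to
    the best-scoring one when nothing meets its threshold.
    """
    accepted = sorted(
        (c for c, s in scores.items() if s >= thresholds[c] * bias),
        key=lambda c: (-scores[c], c),
    )
    if not accepted and total_recall and scores:
        # No good match: still return the single best category.
        accepted = [min(scores, key=lambda c: (-scores[c], c))]
    return accepted
```

A routing application such as the patent example would likely enable the total-recall fallback, since every incoming document has to go to someone.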
Questions and Answers
• Here’s a question: How do you come up with a set of categories in the first place?
  • Come to Friday’s talk about clustering for an answer!