
Dataware’s Document Categorization Toolkit

John Munson, Dataware Technologies. 1999 BRS User Group Conference.


Presentation Transcript


  1. Dataware’s Document Categorization Toolkit
     John Munson
     Dataware Technologies
     1999 BRS User Group Conference

  2. Categorization: A Popular Topic These Days
     • Automatically places documents into user-defined categories
     • Multiple uses
       • Categorize new documents being added to a database
         • batched or via interactive category suggestion
       • Enables browsing of database contents by category
       • Filter incoming documents for “personal pages”
       • Route incoming documents to appropriate offices for action

  3. Dataware’s Categorization Toolkit
     • Consists of a small API
       • half-dozen functions
     • Implemented for BRS and Dataware II engines
       • available via custom services
     • Now being integrated into the KMS product
     • Now in production at the US PTO (Patent and Trademark Office)
       • Handling 40,000+ categories
       • To route incoming patent applications to appropriate examiners

  4. How It Works
     • Uses these components:
       • Our keyword-generation library
         • which is also in the 6.3 keyword-generation load filter
       • Term-weighted relevance ranking
       • Can optionally use our Natural Language Object Library (NLOL)
     • Creates a “profile” for each category
       • A BRS saved-search plus some other stuff
     • Runs each profile against new documents
     • Reports the n profiles that best matched each document
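
The flow on this slide (one profile per category, run every profile against a document, report the n best) can be sketched roughly as follows. This is a minimal illustration, not the toolkit's API: the `Profile` class, `score`, and `best_profiles` are hypothetical names, and the scoring rule (sum of the weights of the profile keywords found in the document) is an assumed stand-in for the term-weighted relevance ranking the slide mentions.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    """Hypothetical stand-in for a category profile (not the toolkit's type)."""
    category: str
    weights: dict[str, float]  # keyword -> weight


def score(profile: Profile, text: str) -> float:
    """Assumed ranking rule: add up the weights of profile keywords present."""
    tokens = set(text.lower().split())
    return sum(w for kw, w in profile.weights.items() if kw in tokens)


def best_profiles(profiles: list[Profile], text: str, n: int) -> list[tuple[str, float]]:
    """Return up to n (category, score) pairs with a non-zero score, best first."""
    scored = [(p.category, score(p, text)) for p in profiles]
    matched = [pair for pair in scored if pair[1] > 0]
    matched.sort(key=lambda pair: -pair[1])
    return matched[:n]


# Illustrative data only; the categories and weights are made up.
profiles = [
    Profile("optics", {"lens": 2.0, "laser": 1.5}),
    Profile("chemistry", {"polymer": 2.0, "catalyst": 1.0}),
    Profile("software", {"compiler": 2.0, "laser": 0.5}),
]
print(best_profiles(profiles, "A laser lens assembly", n=2))
```

Keeping the profile as plain data (keywords plus weights) mirrors the slide's description of a profile as a saved search with some extra information attached.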

  5. Profile Generation
     • Category profiles can be created:
       • Manually, using keywords or arbitrary searches
       • Automatically, using example documents
         • “Define-by-example”
       • Using an automatic/manual combination
     • Define-by-example is a big advantage
       • Allows novices to create profiles
       • Makes big problems solvable
         • such as PTO’s 40,000 categories -- impossible manually

  6. Define-By-Example
     • Analyze example documents for each category, retrieved from a training database
     • Use our keyword library to extract up to 128 keywords
       • Each keyword becomes a search term
       • Each gets a weight -- to be used by the relevance ranking function
     • Keyword selection process likes words that:
       • are in many of the documents being analyzed
       • are rare in the database as a whole
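
The two stated preferences (common among the examples, rare in the database) resemble a TF-IDF-style weighting. The slides do not give the actual formula, so the sketch below substitutes an assumed one: the fraction of example documents containing the term, multiplied by a smoothed log inverse document frequency over the whole database. `select_keywords` and `tokenize` are hypothetical names; only the 128-keyword cap comes from the slide.

```python
import math
from collections import Counter

MAX_KEYWORDS = 128  # cap stated on the slide


def tokenize(text: str) -> set[str]:
    """Naive whitespace tokenizer; a real keyword library would do far more."""
    return set(text.lower().split())


def select_keywords(examples: list[str], database: list[str],
                    max_keywords: int = MAX_KEYWORDS) -> dict[str, float]:
    """Weight terms that are common in the examples and rare in the database."""
    example_df = Counter()
    for doc in examples:
        example_df.update(tokenize(doc))
    database_df = Counter()
    for doc in database:
        database_df.update(tokenize(doc))

    weights = {}
    for term, df in example_df.items():
        coverage = df / len(examples)  # share of examples containing the term
        rarity = math.log((1 + len(database)) / (1 + database_df[term]))
        weight = coverage * rarity     # assumed formula, not the toolkit's
        if weight > 0:
            weights[term] = weight

    # Highest weight first; ties broken alphabetically for determinism.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:max_keywords])


# Illustrative data only.
database = ["the laser lens", "the polymer chain",
            "the compiler pass", "the laser cavity"]
examples = [database[0], database[3]]
print(select_keywords(examples, database, max_keywords=2))
```

With this data, a word such as "the" that appears in every database document gets a weight of zero and is dropped, while "laser" (in every example, in only half the database) ranks first.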

  7. Define-By-Example
     • After keyword selection, run the profile against the training database
       • Search on the keywords, rank the results
       • Analyze the resulting rank scores
     • Determine a score threshold for the profile
       • Threshold will determine whether each new document gets placed in the category
     • The goal is a threshold that:
       • is lower than the example documents’ scores
       • is higher than other documents’ scores
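
One simple way to meet both goals, when the scores separate cleanly, is to place the threshold midway between the lowest example score and the highest non-example score. The slides do not say how the toolkit picks its threshold or what it does when the two groups overlap, so the function below is an assumed sketch; `choose_threshold` is a hypothetical name, and the overlap fallback (pick the candidate with the fewest misclassified training documents) is one possible choice among many.

```python
def choose_threshold(example_scores: list[float], other_scores: list[float]) -> float:
    """Pick a score threshold; a document is accepted when score >= threshold."""
    lowest_example = min(example_scores)
    highest_other = max(other_scores, default=0.0)

    # Clean separation: any value strictly between the groups meets both goals.
    if highest_other < lowest_example:
        return (lowest_example + highest_other) / 2

    # Overlap: no threshold satisfies both goals, so minimise training errors.
    def errors(threshold: float) -> int:
        missed = sum(s < threshold for s in example_scores)
        false_hits = sum(s >= threshold for s in other_scores)
        return missed + false_hits

    candidates = sorted(set(example_scores) | set(other_scores))
    return min(candidates, key=errors)


print(choose_threshold([3.5, 4.0, 5.0], [0.5, 1.5]))  # separable case
print(choose_threshold([2.0, 4.0], [1.0, 3.0]))       # overlapping case
```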

  8. Categorizing New Documents
     • Run all profiles against new documents
       • Could be an entire database
       • Could be all documents after a given date
       • Could be a batch of incoming newsfeed/web/etc. documents loaded into a temp database
       • Could be a single document at submission time
     • Some documents are accepted by multiple profiles
       • Categorizer can compare those profiles and provide a ranked list for each document
       • Good for interactive category suggestion

  9. Adaptive Categories
     • Applications can accept user feedback and adapt to new examples
       • “Refine-by-example”
     • Feedback can be active...
       • User acceptance/rejection of categorizer’s decisions
     • …or could be passive
       • Titles/summaries of documents selected for viewing from among suggested documents could be used as positive feedback
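
The slides do not describe how refine-by-example updates a profile. As one possible illustration, the sketch below applies a simple additive update in the spirit of Rocchio relevance feedback: keywords found in an accepted document gain weight, keywords found in a rejected document lose weight, and keywords that fall to zero are dropped. The `refine` function and its `rate` parameter are hypothetical, and only keywords already in the profile are adjusted.

```python
def refine(weights: dict[str, float], doc_text: str,
           accepted: bool, rate: float = 0.1) -> dict[str, float]:
    """Return a new keyword->weight map nudged by one piece of feedback."""
    tokens = set(doc_text.lower().split())
    step = rate if accepted else -rate

    updated = {}
    for term, weight in weights.items():
        if term in tokens:
            weight = weight + step  # reinforce or penalise matched keywords
        if weight > 0:
            updated[term] = weight  # keywords at or below zero are dropped
    return updated


# Illustrative data only.
profile_weights = {"laser": 1.0, "lens": 0.05}
print(refine(profile_weights, "laser beam", accepted=True))
print(refine(profile_weights, "laser lens mount", accepted=False))
```

Active feedback (explicit accept/reject) maps directly onto the `accepted` flag; passive feedback, such as a user opening a suggested document, could be fed in as `accepted=True` with a smaller `rate`.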

  10. Flexibility
      • Categorizer can be configured to provide:
        • Higher recall (accept more documents into category, at a cost of more false hits)
        • Higher precision (few to no false hits, at a cost of missing some good documents)
        • A balance between the two
        • Total recall (all documents are accepted into some category, even if there’s no good match)
          • Useful for routing applications where at least one suggestion is required
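
The four configurations can be pictured as different acceptance rules applied to the same scores. In the sketch below, which is an assumption and not the toolkit's configuration interface, the recall setting scales each profile's threshold down, the precision setting scales it up, and the total-recall setting falls back to the single best-scoring category when nothing clears its threshold. The mode names, the `margin` parameter, and the `accept` function are all hypothetical.

```python
def accept(scored: list[tuple[str, float, float]], mode: str = "balanced",
           margin: float = 0.25) -> list[tuple[str, float]]:
    """scored holds (category, score, threshold) rows for one document.

    Returns the accepted (category, score) pairs, best score first.
    """
    factor = {
        "recall": 1 - margin,     # looser threshold: more hits, more false hits
        "balanced": 1.0,
        "precision": 1 + margin,  # stricter threshold: fewer false hits
        "total": 1.0,             # balanced, plus a guaranteed suggestion
    }[mode]

    accepted = [(cat, s) for cat, s, threshold in scored if s >= threshold * factor]

    # Total recall: every document must land somewhere, even on a weak match.
    if mode == "total" and not accepted and scored:
        best = max(scored, key=lambda row: row[1])
        accepted = [(best[0], best[1])]

    return sorted(accepted, key=lambda pair: -pair[1])


# Illustrative data only: one document scored against two profiles.
rows = [("optics", 2.2, 2.5), ("software", 0.5, 2.0)]
for mode in ("recall", "balanced", "precision", "total"):
    print(mode, accept(rows, mode))
```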

  11. Questions and Answers
      • Here’s a question: How do you come up with a set of categories in the first place?
      • Come to Friday’s talk about clustering for an answer!
