Dataware’s Document Categorization Toolkit

John Munson, Dataware Technologies
1999 BRS User Group Conference

Categorization: A Popular Topic These Days
• Automatically places documents into user-defined categories
• Multiple uses
  • Categorize new documents being added to a database
    • batched or via interactive category suggestion
  • Enables browsing of database contents by category
  • Filter incoming documents for “personal pages”
  • Route incoming documents to appropriate offices for action

Dataware’s Categorization Toolkit
• Consists of small API
  • half-dozen functions
• Implemented for BRS and Dataware II engines
  • available via custom services
• Now being integrated into KMS product
• Now in production at US PTO (Patent and Trademark Office)
  • Handling 40,000+ categories
  • To route incoming patent applications to appropriate examiners

How It Works
• Uses these components:
  • Our keyword-generation library
    • which is also in 6.3 keyword generation load filter
  • Term-weighted relevance ranking
  • Can optionally use our Natural Language Object Library (NLOL)
• Creates a “profile” for each category
  • A BRS saved-search plus some other stuff
• Runs each profile against new documents
• Reports the n profiles that best matched each document

Profile Generation
• Category profiles can be created:
  • Manually, using keywords or arbitrary searches
  • Automatically, using example documents
    • “Define-by-example”
  • Using an automatic/manual combination
• Define-by-example is a big advantage
  • Allows novices to create profiles
  • Makes big problems solvable
    • such as PTO’s 40,000 categories -- impossible manually

Define-By-Example
• Analyze example documents for each category, retrieved from a training database
• Use our keyword library to extract up to 128 keywords
  • Each keyword becomes a search term
  • Each gets a weight -- to be used by the relevance ranking function
• Keyword selection process likes words that:
  • are in many of the documents being analyzed
  • are rare in the database as a whole
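The keyword selection preference described on this slide can be sketched roughly as follows. This is not Dataware's keyword library: the function name `select_keywords`, the whitespace tokenizer, and the coverage-times-rarity weight are all assumptions chosen to illustrate "common in the examples, rare in the database". Only the limit of 128 keywords comes from the slide.

```python
import math
from collections import Counter


def select_keywords(example_docs, database_docs, max_keywords=128):
    """Return (term, weight) pairs, best first.

    Hypothetical scoring: a term's weight is the fraction of example
    documents containing it, multiplied by a log-scaled rarity in the
    whole database. The real toolkit's formula is not given in the talk.
    """
    n_examples = len(example_docs)
    n_database = len(database_docs)

    # Document frequency: in how many documents does each term appear?
    example_df = Counter()
    for doc in example_docs:
        example_df.update(set(doc.lower().split()))
    database_df = Counter()
    for doc in database_docs:
        database_df.update(set(doc.lower().split()))

    weights = {}
    for term, df in example_df.items():
        coverage = df / n_examples  # "in many of the documents being analyzed"
        rarity = math.log((n_database + 1) / (database_df[term] + 1))  # "rare in the database"
        weights[term] = coverage * rarity

    # Highest weight first; ties broken alphabetically for stable output.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(term, w) for term, w in ranked[:max_keywords] if w > 0]
```

Terms that occur in every database document get a weight of zero under this formula and are dropped, which matches the intuition that they cannot distinguish one category from another.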

Define-By-Example
• After keyword selection, run the profile against the training database
  • Search on the keywords, rank the results
  • Analyze the resulting rank scores
• Determine a score threshold for the profile
  • Threshold will determine whether each new document gets placed in the category
• The goal is a threshold that:
  • is lower than the example documents’ scores
  • is higher than other documents’ scores
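One way to picture the threshold goal is the sketch below. The talk does not say how the toolkit computes its threshold, so `choose_threshold` and both of its strategies (a midpoint when the two score groups separate cleanly, otherwise the candidate with the fewest errors) are assumptions. A document is treated as accepted when its score is greater than or equal to the threshold.

```python
def choose_threshold(example_scores, other_scores):
    """Pick a score threshold between the examples and everything else.

    Hypothetical strategy, not the toolkit's documented method.
    Assumes rank scores are non-negative.
    """
    lowest_example = min(example_scores)
    highest_other = max(other_scores, default=0.0)

    # Clean separation: any value in the gap works; take the midpoint.
    if highest_other < lowest_example:
        return (lowest_example + highest_other) / 2

    # Overlap: try every observed score as a threshold and count mistakes.
    def errors(threshold):
        missed = sum(s < threshold for s in example_scores)
        false_hits = sum(s >= threshold for s in other_scores)
        return missed + false_hits

    candidates = sorted(set(example_scores) | set(other_scores))
    # min() keeps the first (lowest) candidate on ties, which favors recall.
    return min(candidates, key=errors)
```

When the scores overlap, no threshold satisfies both goals at once, so some trade-off between missed examples and false hits is unavoidable; the tie-breaking rule here is just one possible choice.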

Categorizing New Documents
• Run all profiles against new documents
  • Could be an entire database
  • Could be all documents after a given date
  • Could be a batch of incoming newsfeed/web/etc. documents loaded into a temp database
  • Could be a single document at submission time
• Some documents are accepted by multiple profiles
  • Categorizer can compare those profiles and provide a ranked list for each document
  • Good for interactive category suggestion
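The run-all-profiles step might look something like this in outline. The `Profile` class, the `score` function (a plain sum of the weights of matching keywords, standing in for the engine's term-weighted relevance ranking), and ranking accepted profiles by raw score are all illustrative assumptions, not the toolkit's API.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    """Hypothetical in-memory profile: weighted keywords plus a threshold."""
    category: str
    keyword_weights: dict
    threshold: float


def score(profile, document):
    # Stand-in for relevance ranking: add up the weights of the
    # profile keywords that occur in the document.
    terms = set(document.lower().split())
    return sum(w for kw, w in profile.keyword_weights.items() if kw in terms)


def categorize(document, profiles, n=3):
    """Return up to n (category, score) pairs for profiles that accept the document."""
    accepted = []
    for profile in profiles:
        s = score(profile, document)
        if s >= profile.threshold:
            accepted.append((profile.category, s))
    # Best score first; ties broken by category name.
    accepted.sort(key=lambda pair: (-pair[1], pair[0]))
    return accepted[:n]
```

For example, with a "pumps" profile (`{"pump": 1.0, "valve": 0.5}`, threshold 1.0) and a "seals" profile (`{"seal": 1.0, "pump": 0.5}`, threshold 0.5), the document "A pump with a valve" is accepted by both, and "pumps" ranks first.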

Adaptive Categories
• Applications can accept user feedback and adapt to new examples
  • “Refine-by-example”
• Feedback can be active...
  • User acceptance/rejection of categorizer’s decisions
• …or could be passive
  • Titles/summaries of documents selected for viewing from among suggested documents could be used as positive feedback

Flexibility
• Categorizer can be configured to provide:
  • Higher recall (accept more documents into category, at a cost of more false hits)
  • Higher precision (few to no false hits, at a cost of missing some good documents)
  • A balance between the two
  • Total recall (all documents are accepted into some category, even if there’s no good match)
    • Useful for routing applications where at least one suggestion is required
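These configuration options could be modeled as in the sketch below. The talk does not describe the actual configuration mechanism, so the `assign` function, the `threshold_scale` knob, and the `total_recall` flag are hypothetical names used only to show how the trade-off might work.

```python
def assign(scores, thresholds, threshold_scale=1.0, total_recall=False):
    """Return the accepted categories for one document, best score first.

    scores and thresholds map category name -> number.
    threshold_scale < 1 lowers every threshold (higher recall);
    threshold_scale > 1 raises every threshold (higher precision).
    total_recall=True falls back to the single best category when
    nothing passes. All of these knobs are illustrative assumptions.
    """
    accepted = [
        category
        for category, s in scores.items()
        if s >= thresholds[category] * threshold_scale
    ]
    if not accepted and total_recall and scores:
        # Routing case: at least one suggestion is required.
        accepted = [max(scores, key=scores.get)]
    return sorted(accepted, key=lambda category: -scores[category])
```

With scores `{"pumps": 0.6, "seals": 0.3}` and both thresholds at 0.5, the default setting accepts only "pumps", a scale of 0.5 accepts both, and a scale of 2.0 accepts neither unless `total_recall` is on.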

Questions and Answers
• Here’s a question: How do you come up with a set of categories in the first place?
  • Come to Friday’s talk about clustering for an answer!
