S 41: Automatic Categorization: How It Works, Related Issues, and Impacts on Records Management

ARMA International Conference 2001
R. Kirk Lubbes, CRM, CDIA, MIT
President, Records Engineering, LLC
klubbes@recordsengineering.com
30 September 2001

Automatic Categorization Definition
Automatic categorization provides a potential means to automatically file records in pre-established file plans or taxonomies, making it easier to correlate information.

Why Automatic Categorization?
• Computers produce over 1,000,000,000 pages of output per day in the United States. Laid end to end, they would encircle the earth 20 times.
• 7,300,000 web pages are added to the Internet per day.
• 5,000,000,000 e-mail messages are sent per day.
INFORMATION OVERLOAD: Too much information, coming too fast to manage manually.

Understanding How Automatic Categorization Works
Requires a basic understanding of:
• Feature Extraction
• Clustering
• Data Visualization
• Summarization
• Training Sets

Structured vs. Unstructured Data
Structured Data: Fielded data, data that is generally stored in a relational database, e.g. metadata or indexing parameters.
Unstructured Data: Data not contained in fields, e.g. free-text documents, images, audio, video.
Computers were designed to process structured data. Computers do not process unstructured data effectively.

A Few (hopefully) Helpful Definitions
• A feature is a prominent, striking, or conspicuous characteristic.
• A feature set is a collection of features related to an object, e.g. a text document.
• A vector is a physical quantity that has magnitude and direction.
• A feature vector is a vector that represents a feature set: each dimension of the vector is a different feature, i.e., a different characteristic of the object (text document) it represents.

Feature Extraction
• Parsing the document to create a list of words or phrases
• Eliminating stop words
• Stemming (eliminating prefixes and suffixes)
• Selecting descriptive features
• Generating feature vectors
[Diagram: an unstructured document is converted into a feature vector (Feature 1, Feature 2, ..., Feature n-1, Feature n), a structured representation of the document.]
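
The feature-extraction steps above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any product named in this presentation: the stop-word list, the suffix list, the crude suffix-only stemmer, and the hand-picked vocabulary (standing in for "selecting descriptive features") are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to"}

# Assumed suffix list for a crude stemmer (suffix stripping only, no prefixes).
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_features(text):
    """Parse text into lowercase words, drop stop words, stem the rest."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def feature_vector(text, vocabulary):
    """Count how often each selected feature occurs in the text."""
    counts = Counter(extract_features(text))
    return [counts[feature] for feature in vocabulary]


# Hypothetical vocabulary: one vector dimension per selected feature.
vocabulary = ["record", "inventory", "nuclear"]
vector = feature_vector("The records inventory lists nuclear records.", vocabulary)
print(vector)  # [2, 1, 1]
```

The resulting list of counts is the structured representation of the document: the free text has been reduced to one number per feature.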

Feature Vectors: An Example from a Simple Corpus

Visualizing Feature Vectors
[Figure: feature vectors V1, V2, V3, V4, V5, and V5* plotted against three axes: Nuclear Energy, Records Control Plan, and Records Inventory.]

What Does This Have To Do with Records Management?
STICK WITH ME, I AM GETTING THERE

Clusters and Centroids
A cluster is a group of objects whose members are more similar to each other than to the members of any other group. When the endpoints of document feature vectors cluster together, this indicates that the documents are related to the same topic.
A centroid is the center point of a cluster. Each topic has an associated centroid. All documents whose feature-vector endpoints are near a given topic’s centroid are related to that topic.
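
As a rough illustration of centroids and nearness, the sketch below computes a centroid as the component-wise mean of a cluster's feature vectors and measures how far a new document lies from each centroid. The use of Euclidean distance is an assumption (the slides do not name a distance measure), and the vectors are made-up values.

```python
import math


def centroid(vectors):
    """Return the component-wise mean of equal-length feature vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def distance(a, b):
    """Euclidean distance between two feature-vector endpoints (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up feature vectors for two topics (illustrative values only).
records_docs = [[4, 1, 0], [6, 3, 0]]
nuclear_docs = [[0, 1, 5], [0, 1, 7]]

records_centroid = centroid(records_docs)  # [5.0, 2.0, 0.0]
nuclear_centroid = centroid(nuclear_docs)  # [0.0, 1.0, 6.0]

new_doc = [5, 2, 1]
print(distance(new_doc, records_centroid))  # 1.0
print(distance(new_doc, nuclear_centroid))  # about 7.14
```

The new document sits much closer to the records centroid than to the nuclear one, so it would be treated as related to the records topic.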

Radius and Training Set
The radius is the distance from a centroid within which a feature vector’s endpoint is associated with that centroid. The radius may be user defined or defined by the categorization software.

Training an Automatic Classification System
A training set is a collection of documents that are selected as representative of a subject heading in a file plan or a taxonomy. Training sets are used to calculate the subject heading’s associated centroid. If a document’s feature vector endpoint is near a given subject heading’s centroid, the document is assumed to be about that subject and is filed under its heading.
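
A minimal sketch of training and filing, under stated assumptions: each training set is already a list of feature vectors, "near" means Euclidean distance within a single user-defined radius, and a document outside every radius is left unfiled. The function names, headings, and values are hypothetical and not taken from any product in this presentation.

```python
import math


def centroid(vectors):
    """Return the component-wise mean of equal-length feature vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def train(training_sets):
    """Map each subject heading to the centroid of its training vectors."""
    return {heading: centroid(vectors) for heading, vectors in training_sets.items()}


def categorize(vector, centroids, radius):
    """Return the nearest heading within the radius, or None if none qualifies."""
    best_heading, best_distance = None, radius
    for heading, center in centroids.items():
        d = math.dist(vector, center)  # Euclidean distance (assumed metric)
        if d <= best_distance:
            best_heading, best_distance = heading, d
    return best_heading


# Made-up training vectors for two file-plan headings (illustrative only).
training_sets = {
    "Records Inventory": [[4, 1, 0], [6, 3, 0]],
    "Nuclear Energy": [[0, 1, 5], [0, 1, 7]],
}
centroids = train(training_sets)

print(categorize([5, 2, 1], centroids, radius=3.0))  # Records Inventory
print(categorize([3, 3, 3], centroids, radius=3.0))  # None
```

The second document falls outside the radius of both centroids, so it is not filed automatically; in practice such documents would need manual review or a larger radius.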

Data Visualization Example: ThemeScape by Cartia
• Shows what information is available in a collection of documents without reading the documents.
• Every document and/or web page is organized onto a topographical map based on the information it contains.
• Powerful search capabilities highlight relevant documents on the map.
• Requires no manual categorizations or document tagging.
• It reads all the documents in a collection and organizes them onto an information map.
Courtesy: Cartia, Inc.

ThemeScape Information Map
Courtesy: Cartia, Inc.

Legend for ThemeScape Information Map
Courtesy: Cartia, Inc.

Galaxies
Pacific Northwest National Laboratory, United States Department of Energy

Themeview
Pacific Northwest National Laboratory, United States Department of Energy

Summarization Example: Convera’s (Excalibur’s) RetrievalWare
Courtesy: Excalibur Technologies

Automatic Categorization Example: Autonomy’s Knowledge Server
• Automates the categorization of large volumes of both internal and external information (supports over 200 formats)
• Supports categorizer training through examples
• Automatically inserts hypertext links to related information
• Presents a unified view of disparate information sources and shows how information is related
Courtesy: Autonomy Systems Ltd.

Courtesy of Autonomy

Pre-established Taxonomy: PCDocs’ ETOC
• Enterprise Table of Contents (ETOC)
• Automatically categorizes new documents in existing taxonomies
• Allows browsing by subject category and searching in context
• Promotes Knowledge Discovery
• Result List Clustering (RLC)
• Groups documents in a result list to discover micro structures in documents
Courtesy: PCDocs/Fulcrum

Pre-established Taxonomy (cont): ETOC - Publish and Browse
Courtesy: PCDocs/Fulcrum

Records Management Applications
We Finally Made It! This is the Records Management Part

Automatic Categorization: Advantages
• Supports the filing of new electronic records into an existing file plan
• Suggests a “natural” organization for an existing electronic record corpus
• Identifies topics within an existing corpus and provides insight into the corpus content
• Identifies unknown associations between documents
• Distinguishes relevant information from non-relevant information

Automatic Categorization: Disadvantages
• Limited accuracy
• Does not work well on very short documents, very large documents, or documents without uniform contents
• Potentially misleading
• Can require a significant investment to set up and maintain
There is no magic.

Conclusions
• Automatic classification is an important tool for managing electronic records.
• Records managers need to understand both its power and its limitations to apply it operationally.
• Records managers must play a key role in its implementation by imparting their understanding of the organization’s information structure into the system.
• Text processing tools, such as automatic categorization, provide an opportunity for records managers to apply state-of-the-art technology to improve their records management programs.

Dilbert’s Salary Theorem
Knowledge is Power
Time is Money
Power = Work/Time
It follows then that: Money = Work/Knowledge
Thus, as Knowledge → 0, Money → Infinity
Conclusion: The less you know, the more you make, regardless of the amount of work done!
Thanks to PCDocs/Fulcrum
