280 likes | 489 Views
S 41: Automatic Categorization: How It Works, Related Issues, and Impacts on Records Management. ARMA International Conference 2001. R. Kirk Lubbes, CRM, CDIA, MIT President Records Engineering, LLC klubbes@recordsengineering.com. 30 September 2001. Automatic Categorization Definition.
E N D
S 41: Automatic Categorization: How It Works, Related Issues, and Impacts on Records Management ARMA International Conference 2001 R. Kirk Lubbes, CRM, CDIA, MIT President Records Engineering, LLC klubbes@recordsengineering.com 30 September 2001
Automatic Categorization Definition Automatic categorization provides the potential means to automatically file records in pre-established file plans or taxonomies, increasing the ease for correlating information
Why Automatic Categorization? • Computers produce over 1,000,000,000 pages of output per day in the United States. Laid end to end, it would encircle the earth 20 times. • 7,300,000 web pages are added to the Internet per day. • 5,000,000,000 e-mail messages are sent per day. INFORMATION OVERLOAD Too much information, coming too fast to manage manually
Understanding How Automatic Categorization Works • Requires a basic understanding of: • Feature Extraction • Clustering • Data Visualization • Summarization • Training Sets
Structured vs. Unstructured Data Structured Data: Fielded data, data that is generally stored in a relational database, e.g. metadata or indexing parameters Unstructured Data: Data not contained in fields, e.g. free text documents, images, audio, video Computers were designed to process structured data. Computers do not process unstructured data effectively
A Few (hopefully) Helpful Definitions A feature is a prominent, striking, or conspicuous characteristic A feature set is a collection of features related to an object, e.g. a text document A vector is a physical quantity that has magnitude and direction A feature vector is a vector which represents a feature set, i.e., where each dimension of the vector is a different feature, i.e., a different characteristic of the object (text document) which it represents
Feature 1 Feature 2 . . . Feature n-1 Feature n Feature Extraction Unstructured • Parsing the document to create a list of words or phrases • Eliminating stop words • Stemming (eliminating prefixes and suffixes) • Selecting descriptive features • Generating Feature Vectors Feature Vector (Structured Representation of Document) Structured
Visualizing Feature Vectors V5 V3 Nuclear Energy Axis V4 V1 V5* Records Control Plan Axis V2 Records Inventory Axis
What Does This Have To Do with Records Management? STICK WITH ME, I AM GETTING THERE
Clusters and Centroids A cluster is a group of objects whose members are more similar to each other than to the members of any other group. When document feature vectors endpoints are clustered, they indicate that the documents are related to the same topic. A centroid is the center point of a cluster. Each topic has an associated centroid. All documents whose feature-vector’s endpoints are near a given topic’s centroid are related to that topic
Radius and Training Set The Radius is the distance from the centroid which will be used to determine whether the given feature vector will be associated with the given centroid. The radius may be user defined or defined by the categorization software.
Training an Automatic Classification System A training set is a collection of documents that are selected as representative of a subject heading in a file plan or a taxonomy. Training sets are used to calculate the subject heading’s associated centroid. If a document’s feature vector endpoint is near a given subject heading’s centroid, the document is assumed to be about that subject and is filed under its heading
Data Visualization ExampleThemeScape by Cartia • Shows what information is available in a collection of documents without reading the documents. • Every document and/or web page is organized onto a topographical map based on the information it contains. • Powerful search capabilities highlight relevant documents on the map. • Requires no manual categorizations or document tagging. • It reads all the documents in a collection and organizes them onto an information map. Courtesy: Cartia, Inc.
ThemeScape Information Map Courtesy: Cartia Inc
Legend for ThemeScapeInformation Map Courtesy: Cartia, Inc.
GalaxiesPacific Northwest National LaboratoryUnited States Department of Energy
ThemeviewPacific Northwest National LaboratoryUnited States Department of Energy
Summarization ExampleConvera’s (Excalibur’s) RetrievalWare Courtesy: Excalibur Technologies
Automatic Categorization Example:Autonomy’s Knowledge Server Automates the categorization of large volumes of both internal and external information (support over 200 formats) Supports categorizer training through examples Automatically inserts hypertext links to related information Presents a unified view of disparate information sources and shows how information is related Courtesy: Autonomy Systems Ltd.
Pre-established Taxonomy:PCDocs’ ETOC • Enterprise Table of Contents (ETOC) • Automatically categorizes new documents in existing taxonomies • Allow browsing by subject category and searching in context • Promotes Knowledge Discovery • Result List Clustering (RLC) • Group documents in a result list to discover micro structures in documents Courtesy: PCDocs/Fulcrum
Pre-established Taxonomy (cont): ETOC - Publish and Browse Courtesy: PCDocs/Fulcrum
Records Management Applications WE Finally Made IT! This is the Records Management Part
Automatic Categorization: Advantages • Supports the filing of new electronic records into an existing file plan • Suggests a “natural” organization for an existing electronic record corpus • Identifies topics within an existing corpus and provides insight into the corpus content • Identifies unknown associations with documents • Identifies relevant information from non-relevant information
Automatic Categorization: Disadvantages • Limited accuracy • Does not work well on very short documents, very large documents, or documents without uniform contents • Potentially misleading • Can require a significant investment to set and maintain There is no Magic
Conclusions • Automatic classification is an important tool for managing electronic records. • Records managers need to understand both its power and its limitations to apply it operationally. • Records managers must play a key role in its implementation by imparting their understanding of the organization’s information structure into the system. • Text processing tools, such as automatic categorization, provide an opportunity for records managers to apply state-of-the-art technology to improve their records management programs.
Dilbert’s Salary Theorem Knowledge is Power Time is Money Power = Work/Time It follows then that: Money=Work/Knowledge Thus, As Knowledge è 0, Money è Infinity Conclusion: The less you know the more you make, regardless of the amount of work done! Thanks to PCDocs/Fulcum