1 / 28

S 41: Automatic Categorization: How It Works, Related Issues, and Impacts on Records Management

S 41: Automatic Categorization: How It Works, Related Issues, and Impacts on Records Management. ARMA International Conference 2001. R. Kirk Lubbes, CRM, CDIA, MIT President Records Engineering, LLC klubbes@recordsengineering.com. 30 September 2001. Automatic Categorization Definition.

libra
Download Presentation

S 41: Automatic Categorization: How It Works, Related Issues, and Impacts on Records Management

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. S 41: Automatic Categorization: How It Works, Related Issues, and Impacts on Records Management ARMA International Conference 2001 R. Kirk Lubbes, CRM, CDIA, MIT President Records Engineering, LLC klubbes@recordsengineering.com 30 September 2001

  2. Automatic Categorization Definition Automatic categorization provides the potential means to automatically file records in pre-established file plans or taxonomies, increasing the ease for correlating information

  3. Why Automatic Categorization? • Computers produce over 1,000,000,000 pages of output per day in the United States. Laid end to end, it would encircle the earth 20 times. • 7,300,000 web pages are added to the Internet per day. • 5,000,000,000 e-mail messages are sent per day. INFORMATION OVERLOAD Too much information, coming too fast to manage manually

  4. Understanding How Automatic Categorization Works • Requires a basic understanding of: • Feature Extraction • Clustering • Data Visualization • Summarization • Training Sets

  5. Structured vs. Unstructured Data Structured Data: Fielded data, data that is generally stored in a relational database, e.g. metadata or indexing parameters Unstructured Data: Data not contained in fields, e.g. free text documents, images, audio, video Computers were designed to process structured data. Computers do not process unstructured data effectively

  6. A Few (hopefully) Helpful Definitions A feature is a prominent, striking, or conspicuous characteristic A feature set is a collection of features related to an object, e.g. a text document A vector is a physical quantity that has magnitude and direction A feature vector is a vector which represents a feature set, i.e., where each dimension of the vector is a different feature, i.e., a different characteristic of the object (text document) which it represents

  7. Feature 1 Feature 2 . . . Feature n-1 Feature n Feature Extraction Unstructured • Parsing the document to create a list of words or phrases • Eliminating stop words • Stemming (eliminating prefixes and suffixes) • Selecting descriptive features • Generating Feature Vectors Feature Vector (Structured Representation of Document) Structured

  8. Feature Vectors: An Example from a Simple Corpus

  9. Visualizing Feature Vectors V5 V3 Nuclear Energy Axis V4 V1 V5* Records Control Plan Axis V2 Records Inventory Axis

  10. What Does This Have To Do with Records Management? STICK WITH ME, I AM GETTING THERE

  11. Clusters and Centroids A cluster is a group of objects whose members are more similar to each other than to the members of any other group. When document feature vectors endpoints are clustered, they indicate that the documents are related to the same topic. A centroid is the center point of a cluster. Each topic has an associated centroid. All documents whose feature-vector’s endpoints are near a given topic’s centroid are related to that topic

  12. Radius and Training Set The Radius is the distance from the centroid which will be used to determine whether the given feature vector will be associated with the given centroid. The radius may be user defined or defined by the categorization software.

  13. Training an Automatic Classification System A training set is a collection of documents that are selected as representative of a subject heading in a file plan or a taxonomy. Training sets are used to calculate the subject heading’s associated centroid. If a document’s feature vector endpoint is near a given subject heading’s centroid, the document is assumed to be about that subject and is filed under its heading

  14. Data Visualization ExampleThemeScape by Cartia • Shows what information is available in a collection of documents without reading the documents. • Every document and/or web page is organized onto a topographical map based on the information it contains. • Powerful search capabilities highlight relevant documents on the map. • Requires no manual categorizations or document tagging. • It reads all the documents in a collection and organizes them onto an information map. Courtesy: Cartia, Inc.

  15. ThemeScape Information Map Courtesy: Cartia Inc

  16. Legend for ThemeScapeInformation Map Courtesy: Cartia, Inc.

  17. GalaxiesPacific Northwest National LaboratoryUnited States Department of Energy

  18. ThemeviewPacific Northwest National LaboratoryUnited States Department of Energy

  19. Summarization ExampleConvera’s (Excalibur’s) RetrievalWare Courtesy: Excalibur Technologies

  20. Automatic Categorization Example:Autonomy’s Knowledge Server Automates the categorization of large volumes of both internal and external information (support over 200 formats) Supports categorizer training through examples Automatically inserts hypertext links to related information Presents a unified view of disparate information sources and shows how information is related Courtesy: Autonomy Systems Ltd.

  21. Courtesy of Autonomy

  22. Pre-established Taxonomy:PCDocs’ ETOC • Enterprise Table of Contents (ETOC) • Automatically categorizes new documents in existing taxonomies • Allow browsing by subject category and searching in context • Promotes Knowledge Discovery • Result List Clustering (RLC) • Group documents in a result list to discover micro structures in documents Courtesy: PCDocs/Fulcrum

  23. Pre-established Taxonomy (cont): ETOC - Publish and Browse Courtesy: PCDocs/Fulcrum

  24. Records Management Applications WE Finally Made IT! This is the Records Management Part

  25. Automatic Categorization: Advantages • Supports the filing of new electronic records into an existing file plan • Suggests a “natural” organization for an existing electronic record corpus • Identifies topics within an existing corpus and provides insight into the corpus content • Identifies unknown associations with documents • Identifies relevant information from non-relevant information

  26. Automatic Categorization: Disadvantages • Limited accuracy • Does not work well on very short documents, very large documents, or documents without uniform contents • Potentially misleading • Can require a significant investment to set and maintain There is no Magic

  27. Conclusions • Automatic classification is an important tool for managing electronic records. • Records managers need to understand both its power and its limitations to apply it operationally. • Records managers must play a key role in its implementation by imparting their understanding of the organization’s information structure into the system. • Text processing tools, such as automatic categorization, provide an opportunity for records managers to apply state-of-the-art technology to improve their records management programs.

  28. Dilbert’s Salary Theorem Knowledge is Power Time is Money Power = Work/Time It follows then that: Money=Work/Knowledge Thus, As Knowledge è 0, Money è Infinity Conclusion: The less you know the more you make, regardless of the amount of work done! Thanks to PCDocs/Fulcum

More Related