
Looking Under the Hood of An Automated Categorization Engine







  1. Looking Under the Hood of An Automated Categorization Engine Dr. Denise A. D. Bedford Goodyear Professor of Knowledge Management Kent State University Kent Ohio Attorneys & Algorithms AIIM ARMA February 27, 2013

  2. Goals of the Presentation • Tomaytoes or tomahtoes – level set the dialog in terms of what we all understand by autocategorization and its value proposition • Provide a conceptual architecture against which you can evaluate any product or solution for the automated categorization problem • Walk through some of the semantic analysis methods so you understand how the application makes decisions and suggestions • Provide some sample projects from start to finish to understand the human investments required • Consider how to validate your results

  3. TOMAYTOES OR TOMAHTOES

  4. Basic Assumption • The human approach to classification, cataloging, indexing and summarization has worked well, with one caveat – it is not an economically efficient approach • There are not enough people, enough time to perform these tasks to the level of granularity and the level of quality required to meet expectations for knowledge organization, access and discovery • As the need to organize increases, our demand for smart automated solutions also increases • The challenge is understanding what those solutions are, and how they work – too many of the solutions are “black box”

  5. Every semantic solution that works well involves the integration of human knowledge and is designed around the human thought process

  6. What do we mean by categorization? • This term is used to mean many different things – by software vendors, by web designers, by information and records professionals, and by domain experts • Let’s define it broadly today as any approach that helps us to organize something according to some framework • Today, we won’t limit our discussion just to classes and categories and the process of categorization (i.e., classification, discovery and creation of classification schemes, evaluation of schemes) • We’re going to talk about several different ways that we can use automation to produce values that we can use to classify

  7. Expanded View of Categorization • Organizing objects by: • Author or creator • Country • Subject or topic • Records class • Business function • Organizational unit • Language • Economic Sector • Audit Status • Characterizing objects by: • Emotional or attitudinal characteristics • Extensionality or intentionality of language • Degree of expressed aggression or anxiety • Positive or negative perspective

  8. Conceptual Architecture for AutoCategorization

  9. What do we mean by “auto”? • This is the $64,000,000 question and it is why you need to understand how the applications work • There are many flavors of “automated” categorization engines on the market today • Each product has a different set of components if you look under the hood • And, perhaps more importantly, each product takes a different approach to integrating human knowledge

  10. What do we mean by “semantic”? • No pun intended here, but this is a critical question when we’re talking about automating anything to do with human knowledge and information • Semantic is literally defined as about meaning and understanding • When we talk about semantic technologies we mean technologies that can understand something about the objects they’re working with and something about the process we’re asking them to perform • Not all of the auto-categorization engines you may encounter will be semantic • Let’s start out by talking about what makes an engine semantic

  11. Conceptual Architecture For An Automated Categorization Engine • Semantic Analysis Methods: Grammar-Based Concept Extraction, Rule-Based Concept Extraction, Frequency-Based Concept Extraction, Rule-Based Categorization, Statistical Categorization, Rule-Based Summarization • Outputs • Knowledge Bases: Country Profiles, Records Class Profiles, Content Type Profiles, Attitudinal Profiles, Subject Profiles, Business Function Profiles • NLP Foundation: Language Detection, Language Dictionaries, Grammatical Rules, Morphological Rules, POS Tagger

  12. The Semantics Part of the Engine • Semantic engines have components that provide a basis for a primitive level of understanding • Ability to detect the language in which something is expressed [LANGUAGE DETECTION] • Knowledge of all the concepts or ideas (beyond words, but words are a good starting point…) in that language [DICTIONARY OF THE LANGUAGE] • Understanding of the morphology and grammatical rules that govern the language [GRAMMATICAL RULES] • Ability to generate a semantic index of any object you’re going to process [PART-OF-SPEECH TAGGER] • If you don’t see these parts when you look under the hood of your engine, you’re not working with semantics

  13. What do we mean by NLP and POS?
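To make the NLP foundation concrete, here is a toy Python sketch of two of its components: language detection and part-of-speech (POS) tagging. The stop-word lists and tag lexicon are invented stand-ins, not any product's dictionaries; a real engine ships a full dictionary and trained tagger per language.

```python
# Toy sketch of an NLP foundation layer: language detection via stop-word
# overlap, plus a tiny dictionary-based POS tagger.
# All word lists below are illustrative assumptions, not real resources.

STOPWORDS = {
    "en": {"the", "and", "of", "is", "to", "in"},
    "fr": {"le", "la", "et", "de", "est", "dans"},
}

POS_LEXICON = {  # a real engine would ship a full per-language dictionary
    "engine": "NOUN", "categorizes": "VERB", "the": "DET",
    "documents": "NOUN", "quickly": "ADV",
}

def detect_language(text):
    tokens = set(text.lower().split())
    # Pick the language whose stop-word list overlaps the text most.
    return max(STOPWORDS, key=lambda lang: len(tokens & STOPWORDS[lang]))

def pos_tag(text):
    # Look each token up in the lexicon; unknown words fall back to NOUN,
    # a common default heuristic.
    return [(w, POS_LEXICON.get(w.lower(), "NOUN")) for w in text.split()]

print(detect_language("the engine categorizes the documents"))  # en
print(pos_tag("The engine categorizes the documents quickly"))
```

Real systems use far richer signals (character n-grams for language identification, statistical taggers trained on annotated corpora), but the division of labor – detect the language, then apply that language's dictionary and grammar – is the same.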

  14. The Knowledge Base • Generally, the categorization process involves a person with a lot of knowledge and experience (an invisible knowledge base) interpreting an object and aligning it with a framework or a set of rules (a visible knowledge base) • What is the equivalent knowledge base in an autocategorization engine that would lead to as good a result as the one a human produces? • What we often overlook in thinking about automated categorization engines is their ability to integrate and take advantage of this invisible or visible knowledge base • The other part of the semantic engine, then, is the knowledge base

  15. Knowledge Base Examples • Too often we accept what the tool developer gives us to work with as a knowledge base, without critical evaluation • Let’s say you want to categorize a set of documents to the countries that they are “about” – how do most engines do this? Is it simple discovery of the name of the country? Or is there a richer knowledge base that describes countries? • Let’s say you want to organize a set of documents by the people they refer to – how do most engines do this? How do they determine what is a person? And whether the person is an author, a participant, or a referenced author?

  16. Knowledge Base Examples • Let’s say you want to automatically classify your documents to your records classification scheme? What does the engine need to know to do this well? • You might also want to categorize your documents to business functions, by the economic sector they refer to, by the products or services they produce, or the sentiment expressed • All of these kinds of knowledge bases are now embedded in our human brains, or they’re designed in a way for human beings to consume and interpret – lots of tacit knowledge • We need to think about how to represent these so a machine can begin to understand and apply them
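As a toy illustration of the country example above, the sketch below contrasts matching on the country name alone with matching against a richer country profile. The profile entries (demonyms, capital cities) are invented for illustration; a real knowledge base would be far larger.

```python
# Sketch contrasting simple name matching with a richer country knowledge
# base. Profile contents are illustrative assumptions only.

COUNTRY_PROFILES = {
    "France": {"france", "french", "paris"},
    "Ghana": {"ghana", "ghanaian", "accra"},
}

def countries_by_name(text):
    # Naive approach: extract a country only if its name literally appears.
    words = set(text.lower().split())
    return [c for c in COUNTRY_PROFILES if c.lower() in words]

def countries_by_profile(text):
    # Knowledge-base approach: a document is "about" a country if any
    # concept from the country's profile appears.
    words = set(text.lower().split())
    return [c for c, concepts in COUNTRY_PROFILES.items() if words & concepts]

doc = "Exports from Accra rose sharply among Ghanaian producers"
print(countries_by_name(doc))     # [] - the country name never appears
print(countries_by_profile(doc))  # ['Ghana']
```

The difference in the two outputs is exactly the point of the slide: without the richer profile, a document that never names the country is missed entirely.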

  17. Semantic Analysis Methods • So, let’s say you’re fortunate enough to have an engine with a good foundation, and you’ve got some good knowledge bases to work with…. Now what? • The engine needs to have some semantic analysis capabilities to (1) interpret the document, and (2) apply the knowledge bases, to (3) produce the results that you’ve asked it for • This is the “thinking part” of the categorization process • Here’s another major differentiating factor – semantic analysis methods – what does your tool come with? • We’ll go into this in more detail in a minute…..

  18. Outputs – Not an Insignificant Issue • Assuming you have all these components and you’re generating what you need to organize your resources… • What can you do with the results? Search, browse, navigate, manage information, build a metadata store, send values to your records system? • What kind of outputs do you get? Can you get an XML record for each object processed, or are the values just embedded in an internal index? • How durable are these outputs? Can you export the values and use them in other applications? Or is everything tightly integrated into the product itself? If you stop using the product, what are you left with?

  19. Making Informed Choices • The engine we just described is the equivalent of a Mercedes or a Rolls Royce – it can take you most anywhere you want to go, will do so with high quality results, will be durable, and can be designed to meet your needs • Not everyone can afford a Mercedes or a Rolls Royce – but just because you can’t afford the best doesn’t mean you shouldn’t know what you’re buying • If you buy a Chevy or a Chrysler, you should know how far it can take you, what you can expect it to do, and how durable it will be • Make sure that your choice doesn’t produce unintended consequences

  20. Making Informed Choices • You also need to know how much it is going to cost you to keep the engine running • Semantic engines need fuel to run on, just like automobiles • The fuel in this case is that knowledge base layer • If you’re running a very efficient engine, you’re probably putting in high quality fuel and this is going to cost more • Machines don’t think without human direction, and machines don’t produce good results without human oversight and engineering • If you want good results, you need to set your expectations for investments

  21. Sample Semantic Analysis Methods

  22. Common Semantic Analysis Methods • Grammar Based Concept Extraction • Rule Based Concept Extraction • Rule Based Categorization • Dynamic Categorization or Statistical Clustering • Summarization

  23. Grammatical Concept Extraction • What is it? • A pattern matching approach which matches your specifications to the underlying grammatical entities • For example, you could define a grammar that describes a proper noun for people’s names, or for sentence fragments that look like titles • How does it work? • This is also a pattern matching program, but it uses computational linguistics knowledge of a language in order to identify the entities to extract – if you don’t have an underlying semantic engine, you can’t do this type of extraction • There is no authoritative list in this case – instead it uses parsers, part-of-speech tagging and grammatical code • The semantic engine’s dictionary determines how well the extraction works – if you don’t have a good dictionary you won’t get good results • There needs to be a distinct semantic engine for each language you’re working with
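A minimal sketch of the idea, under a simplifying assumption: here the "grammar" for people's names is approximated as a run of two or more capitalized words matched against raw text. A real engine would match patterns over part-of-speech tags from its parser rather than relying on casing, which produces false positives at sentence starts.

```python
import re

# Toy grammar-based extraction: define a pattern for a grammatical entity
# (a person's name, approximated as 2+ consecutive capitalized words) and
# extract every span that satisfies it. Casing is a stand-in for real
# proper-noun POS tags.

NAME_GRAMMAR = re.compile(r"\b(?:[A-Z][a-z]+ )+[A-Z][a-z]+\b")

def extract_names(text):
    return NAME_GRAMMAR.findall(text)

text = "The report by Denise Bedford cites work from Arundhati Roy on access."
print(extract_names(text))  # ['Denise Bedford', 'Arundhati Roy']
```

Note that, as the slide says, there is no authoritative list here: any span matching the grammar is extracted, which is why the quality of the underlying dictionary and tagger drives the quality of the results.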

  24. Grammatical Concept Extraction • How do we build it? • Model the type of grammatical entity we want to extract and use the grammar definitions to build a profile • Test the profile on a set of test content to see how it behaves • Refine the grammars • Deploy the profile • State of the Industry • It has taken decades to get the grammars for languages well defined • There are not too many of these tools available on the market today, but we are pushing to have more open source • SAS CC now has grammars and semantic engines for 30 different languages commercially available • IFC has been working with ClearForest • Let’s look at some examples of grammatical profiles – People’s Names, Noun Phrases, Verb Phrases, Book Titles

  25. Grammatical Concept Extraction – People Names (client testing mode)

  26. Rule Based Concept Extraction • How do we build it? • Create a comprehensive list of the names of the entities – most of the time these already exist, and there may be multiple copies • Review the list, study the patterns in the names, and prune the list • Apply regular expressions to simplify the patterns in the names • Build a Concept Profile • Run the concept profile against a test set of documents (not a training set because we build this from an authoritative list not through ‘discovery’) • Review the results and refine the profile • State of Industry • The industry is very advanced – this type of work has been under development and deployed for at least three decades now. It is a bit more reliable than grammatical extraction, but it takes more time to build. • Publicly available example: OpenCalais

  27. Rule Based Concept Extraction • What is it? • Rule based concept or entity extraction is a simple pattern recognition technique which looks for and extracts named entities • Entities can be anything – but you have to have a comprehensive list of the names of the entities you’re looking for • How does it work? • It is a simple pattern matching program which compares the list of entity names to what it finds in content • Regular expressions are used to match sets of strings that follow a pattern but contain some variation • List of entity names can be built from scratch or using existing sources – we try to use existing sources • A rule-based concept extractor would be fueled by a list such as Working Paper Series Names, edition or version statement, Publisher’s names, etc. • Generally, concept extraction works on a “match” or “no match” approach – it matches or it doesn’t • Your list of entity names has to be pretty good
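A minimal sketch of the match / no-match approach described above, using an invented authoritative list of organization names (a real deployment loads thousands of entries, often from existing sources).

```python
# Toy rule-based (list-driven) entity extraction: an authoritative list of
# names is compared directly against the content. The names here are
# illustrative stand-ins for a real curated list.

ORG_NAMES = ["World Bank", "United Nations", "Kent State University"]

def extract_orgs(text):
    # Match / no-match: an entity is extracted only if its exact name
    # string occurs in the content.
    return [name for name in ORG_NAMES if name in text]

doc = "She joined Kent State University after a fellowship at the World Bank."
print(extract_orgs(doc))  # ['World Bank', 'Kent State University']
```

The trade-off the slide describes is visible even in this toy: control and reliability come from the list, so the list must be comprehensive and well pruned or entities are silently missed.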

  28. Rule-Based Concept Extraction Examples • People Names • Loan # • Credit # • Report # • Contract # • Addresses • Phone number • Zip code • ISBN, ISSN • Organization Name (companies, NGOs, IGOs, governmental organizations, etc.) • Library of Congress Class Number • Document Object Identifier • URLs • SAIC Codes • Edition or version statement • Series Name • Publisher Name

  29. List of entities matches exact strings. This requires an exhaustive list – but gives us extensive control. (It would be difficult to distinguish by pattern between IGOs and other NGOs.) Classifier concept extraction allows us to look for exact string matches.

  30. Another list of entities matches exact strings. In this case, though, we’re making this into an ‘authority control list’ – we’re matching multiple strings to the one approved output. (In this case, the AACR2-approved edition statement.)
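The authority-control idea above can be sketched as a simple variant-to-approved-form lookup. The edition-statement variants and the approved form below are illustrative assumptions in the spirit of AACR2-style abbreviation, not quoted from the slide's actual list.

```python
# Toy authority control list: several variant strings all map to one
# approved output form. Entries are illustrative assumptions.

EDITION_AUTHORITY = {
    "2nd edition": "2nd ed.",
    "second edition": "2nd ed.",
    "2nd ed": "2nd ed.",
}

def normalize_edition(raw):
    # Normalize case and trailing periods before the lookup; strings not
    # on the list pass through unchanged.
    return EDITION_AUTHORITY.get(raw.strip().lower().rstrip("."), raw)

print(normalize_edition("Second Edition"))  # 2nd ed.
```

This is the key difference from plain exact-string matching: the list does double duty, both recognizing a variant and emitting the single approved form.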

  31. ISBN Concept Extraction Profile – Regular Expressions (RegEx) • Concept-based rules engine allows us to define patterns to capture other kinds of data • Use of concept extraction, regular expressions, and the rules engine to capture ISBNs • Regular expressions match sets of strings by pattern, so we don’t need to list every exact ISBN we’re looking for
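A sketch of the regular-expression approach for ISBN-13s, with an optional check-digit test to cut false positives. The pattern is a simplified illustration, not the actual profile shown on the slide; the checksum rule itself (digits weighted alternately 1 and 3, sum divisible by 10) is the standard ISBN-13 check.

```python
import re

# Toy regex-based concept extraction for ISBN-13s: the pattern captures the
# whole family of strings, so no exhaustive list is needed. The pattern is
# deliberately simplified for illustration.

ISBN13 = re.compile(r"\b97[89](?:[- ]?\d){10}\b")

def find_isbn13(text):
    return [m.group(0) for m in ISBN13.finditer(text)]

def valid_isbn13(isbn):
    # Standard ISBN-13 check digit: weighted digit sum (weights alternate
    # 1, 3) must be divisible by 10.
    digits = [int(d) for d in isbn if d.isdigit()]
    return sum(d * (1 if i % 2 == 0 else 3)
               for i, d in enumerate(digits)) % 10 == 0

hits = find_isbn13("See ISBN 978-0-306-40615-7 for details.")
print(hits)                   # ['978-0-306-40615-7']
print(valid_isbn13(hits[0]))  # True
```

Combining a pattern with a validation rule is exactly the division of labor the slide describes: the regular expression finds candidates, and the rules engine decides which candidates to keep.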

  32. Grammatical Concept Extraction Proper Noun Profile for People Names uses grammars to find and extract the names of people referenced in the document. <?xml version="1.0" encoding="UTF-8"?> <Proper_Noun_Concept> <Source><Source_Type>file</Source_Type> <Source_Name>W:/Concept Extraction/Media Monitoring Negative Training Set/ 001B950F2EE8D0B4452570B4003FF816.txt</Source_Name> </Source><Profile_Name>PEOPLE_ORG</Profile_Name> <keywords>Abdul Salam Syed, Aruna Roy, Arundhati Roy, ArvindKesarival, Bharat Dogra, Kwazulu Natal, MadhuBhaduri, </keywords><keyword_count>7</keyword_count> </Proper_Noun_Concept>

  33. Rule-Based Categorization • What is it? • Categorization is the process of grouping things based on characteristics • Categorization technologies classify documents into groups or collections of resources • An object is assigned to a category or schema class because it is ‘like’ the other resources in some way • Categories form part of a hierarchical structure when applied to a subject scheme such as a taxonomy • How does it work? • Automated categorization is an ‘inferencing’ task – meaning that we have to tell the tools what makes up a category and then how to decide whether something fits that category or not • We have to teach it to think like a human being • When I see – access to phone lines, analog cellular systems, answer bid rate, answer seizure rate – I know this should be categorized as ‘telecommunications’ • We use domain vocabularies to create the category descriptions

  34. Rule Based Categorization • How do we build it? • Build the hierarchy of categories • Build a training set of content category by category – from all kinds of content • Describe each category in terms of its ‘ontology’ – in our case this means the concepts that describe it (generally between 1,000 and 10,000 concepts) • The richer the definition, the better the categorization engine works • Test each category profile on the training set • Test the category profile on a larger set that is outside the domain • Insert the category profile into the profile for the larger hierarchy • State of the Industry • Only a handful of rule-based categorizers are on the market today • Most are dynamic clustering tools • AI community is becoming more interested in this work – Stanford’s AI lab, Stanford Research Institute • Let’s distinguish between categorization and clustering – a common point of confusion
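The build steps above can be sketched as follows, with tiny invented category profiles (real profiles run to thousands of concepts per category, as the slide notes) and an arbitrary match threshold standing in for the engine's scoring rules.

```python
# Toy rule-based categorization: each category profile is a set of
# domain-vocabulary concepts; a document is assigned to a category when
# enough profile concepts appear in it. Profiles and threshold are
# illustrative assumptions.

CATEGORY_PROFILES = {
    "telecommunications": {"phone lines", "analog cellular", "answer bid rate"},
    "agriculture": {"crop yield", "irrigation", "soil"},
}

def categorize(text, threshold=2):
    text = text.lower()
    # Score each category by how many of its concepts the document contains.
    scores = {cat: sum(c in text for c in concepts)
              for cat, concepts in CATEGORY_PROFILES.items()}
    return [cat for cat, score in scores.items() if score >= threshold]

doc = "Access to phone lines and analog cellular systems expanded this year."
print(categorize(doc))  # ['telecommunications']
```

This mirrors the 'telecommunications' example from the previous slide: the engine never "understands" telephony, it infers the category because enough concepts from the category's ontology co-occur in the document.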

  35. Clustering vs. Categorization (side-by-side comparison)

  36. Categorization Examples • Let’s look at some working examples by going to the SAS CC profiles • Topics • Countries • Regions • Sector • Theme • Disease Profiles • Other categorization profiles we’re also working on… • Business processes (characteristics of business processes) • Sentiment ratings (positive media statements, negative media statements, etc.) • Document types (by characteristics found in the documents) • Security classification (by characteristics found in the documents) • ICSID Tribunal Outcomes • Diseases

  37. Country Categorization Profile

  38. Subtopics Domain concepts or controlled vocabulary

  39. Topics Categorization Client Test

  40. Automatically Generated XML Metadata Metadata is generated using a categorization profile – reflects the way people organize.

  41. Automatically Generated Metadata

  42. Automatically Generated XML Metadata for Business Function attribute

  43. Clustering or Dynamic Categorization • What is it? • The use of statistical and data mining techniques to partition data into sets. Generally the partitioning is based on statistical co-occurrence of words, and their proximity to or distance from each other • How does it work? • Words that frequently occur close to one another are assigned to the same cluster • Clusters can be defined at the set or the concept level – usually the latter • Can work with a raw training set of text to discover and associate concepts or to suggest ‘buckets’ of concepts • A few tools can work with a refined list of concepts to be clustered against a text corpus

  44. Dynamic Categorization or Clustering • How do we build it? • Define the list of concepts • Create the training set • Load the concepts into the clustering engine • Generate the concept clusters • State of the Industry • Most of the commercial tools that call themselves ‘categorizers’ are actually clustering engines • Generally, clustering doesn’t work at a high domain level for large sets of text • Clustering engines can provide insights into the concepts in a domain when used on a small set of documents • All the engines are resource-intensive, though, and the outputs are transitory – clusters live only in the cluster index • If you change the text set, the clusters change
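The build steps above can be sketched as document-level co-occurrence clustering: controlled-vocabulary terms that appear together in the same training document are grouped into one cluster. The vocabulary and documents below are invented for illustration; real engines use statistical proximity measures rather than simple document membership.

```python
# Toy co-occurrence clustering: load a concept list, run it over a training
# set, and emit a cluster per document containing the co-occurring terms.
# Vocabulary and documents are illustrative assumptions.

VOCABULARY = ["habitat", "poaching", "wetlands", "migration", "tariffs"]

def cluster_by_cooccurrence(documents):
    clusters = []
    for doc in documents:
        text = doc.lower()
        found = sorted(t for t in VOCABULARY if t in text)
        if len(found) > 1:  # a cluster needs at least two co-occurring terms
            clusters.append(found)
    return clusters

docs = [
    "Poaching pressure and habitat loss threaten migration corridors.",
    "New tariffs announced on imported steel.",
]
print(cluster_by_cooccurrence(docs))  # [['habitat', 'migration', 'poaching']]
```

Re-running with a different document set yields different clusters, which is exactly the transitory behavior the slide warns about: the clusters live only in the cluster index.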

  45. Sample Concept Clusters This is from the clustering output for Wildlife Resources. ‘Clusters’ of words between line breaks are terms from the Wildlife Resources controlled vocabulary found co-occurring in the same training document. This highlights often subtle relationships.

  46. Summarization • What is it? • Rule-driven pattern matching and sentence extraction programs • It is important to distinguish summarization technologies from some information extraction technologies – many on the market extract ‘fragments’ of sentences, which is what Google does when it presents a search result to you • Will generate document surrogates, point-of-view summaries, HTML metatag Descriptions, and a ‘gist’ or ‘synopsis’ for search indexing • Results are sufficient for ‘gisting’ for HTML metatags, as surrogates for full-text document indexing, or as summaries to display in search results to give the user a sense of the content • How does it work? • Uses rules and conditions for selecting sentences • Enables us to define how many sentences to select • Allows us to specify the concepts to use to select sentences • Allows us to determine where in the sentence the concepts might occur • Allows us to exclude sentences from being selected • We can write multiple sets of rules for different kinds of content

  47. Summarization • How do we build it? • Analyze the content to be summarized to understand the type of speech and writing used – IRIS is different from Publications, which is different from News stories • Identify the key concepts that should trigger a sentence extraction • Identify where in the sentence these concepts are likely to occur • Identify the concepts that should be avoided • Convert concepts and conditions to a rule format • Load the rule file onto the summarization server • Test the rules against a test set of content and refine until ‘done’ • Launch the summarization engine and call the rule file • State of the Industry • Most tools are either readers or extractors. The reader method uses clustering & weighting to promote sentence fragments; the extractor method uses an internal format representation with word & sentence weighting • What has been missing from the extractors in most commercial products is the capability to specify the concepts and the rules. SAS CC is the only product we found that supports this.
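The rule types listed above (trigger concepts, excluded concepts, a sentence limit) can be sketched as a simple extractor. The trigger and exclusion lists below are invented placeholders; a real rule file would also encode where in the sentence each concept may occur.

```python
import re

# Toy rule-driven sentence extraction: keep sentences containing a trigger
# concept, drop sentences containing an excluded concept, cap the total.
# Concept lists are illustrative assumptions.

TRIGGERS = {"findings", "recommends", "results"}
EXCLUDE = {"copyright", "disclaimer"}

def summarize(text, max_sentences=2):
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    picked = [s for s in sentences
              if any(t in s.lower() for t in TRIGGERS)
              and not any(x in s.lower() for x in EXCLUDE)]
    return picked[:max_sentences]

text = ("Copyright notice applies to findings herein. "
        "The study recommends new controls. "
        "The annex lists staff. "
        "Results improved in the second year.")
print(summarize(text))
```

Note how the first sentence is skipped even though it contains a trigger concept: the exclusion rule wins, which is the behavior the "exclude sentences from being selected" bullet describes.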

  48. Summarization Rules

  49. Three Sample Projects
