Content Classification – Where’s My Stuff? IBM Confidential
Agenda
• Why Classify?
• How Does Classification Work?
• Content Classification Features
• Taxonomy Basics
• Starting a Classification Project
• Best Practices for Classification
• Real World Example
• Closing Thoughts
Why Classify?
• Content that is not properly classified is not accessible
  • 1 in 2 business leaders don’t have access to the information they need to do their jobs
• Quality of decision-making suffers when content is not accurate
  • 1 in 3 business leaders frequently make business decisions based on information they lack or don’t trust
• Companies face difficulty in deriving full visibility and insight into the breadth and depth of unstructured content
  • 77% of CEOs don’t have immediate information to make key business decisions
Sources: IBM 2010 CEO & CFO Studies; IBM 2010 Break Away With Business Analytics and Optimization Study
Why Classify?
• What if you walked into the Library of Congress and there was no classification system?
• What about the hardware store, the grocery store, the clothing store?
• Do you park your car in the living room and place your sofa in the garage?
Everything in our life is categorized and classified in some way
• You need to:
  • Find relevant content, quickly
  • Categorize content accurately and consistently
  • Gather meaning and understanding from the content
• You have:
  • Millions of pieces of content
  • Hundreds of repositories
  • Thousands of workers
Why Classify?
You have been storing content for many years, but…
• Can you find it when you need it?
• Can you produce it for audits and litigation?
• Can you gain insight from it?
How does your organization go from this… to this?
Why Classify? Accessibility, Usability, Compliance, Analytics
• Can you find relevant content, quickly?
  • “Search, Refine, Repeat” is no longer acceptable
  • Image Capture, Content Collection, Enterprise Search
• Is the right content available at the right time?
  • Business processes require timely access to content
  • Business Process Management, Case Management
• Are you complying with Legal and Business mandates?
  • Content has a compliance lifecycle that must be enforced
  • Content Collection, Enterprise Records, eDiscovery
• Are you uncovering business insight from your content?
  • Organized content produces better insight
  • Content Analytics
Why Classify?
• Automated Classification makes information accessible, leaving your workers to focus on important business tasks rather than searching, over and over, for relevant content
• Classification provides enhanced content usability by automating routing decisions based on the meaning of the text in your content
• Advanced Classification, combined with collection and records, enables your company to comply with business and legal mandates
• Classification augments Content Analytics by providing extended facet navigation and content clustering, delivering added analysis and insight
How does Classification work?
CLASSIFICATION AS A FACTORY WORKER
• Think of a worker at the end of an assembly line
• The task is to sort items coming down the line into the correct containers
• Four possible item types on the line:
  • Can
  • Box
  • Bottle
  • Jar
• How do you tell the factory worker which is which?
• Start with the item to the right as a ‘can’ reference model
  • 6.5” high
  • Red with blue & white lettering
  • 3.5” diameter
  • Opened with a tab
  • Contains liquid
How does Classification work?
Based on initial assumptions, which of these are “cans”?
• Based on the original reference model, which of these is a can?
  • 6.5” high
  • Red with blue & white lettering
  • 3.5” diameter
  • Opened with a tab
  • Contains liquid
• What are our identification parameters?
  • Shape?
  • Capacity/size?
  • Contents (liquid vs. solid)?
  • Method of opening?
  • Construction material?
How does Classification work?
• The analogy is very relevant to category definition and corpus selection
• Document classification involves the same problems
  • What is an “Accounting and Finance” document?
  • How can we differentiate it from a “Legal” document?
  • How about “Regulatory”?
• How do humans tell which is which?
  • Keywords
  • Phrases
  • Intent
• Some distinctions are clear…
  • Legal vs. Engineering
  • Personnel vs. Operations
  • Manufacturing vs. Advertising
• Others are not…
  • Legal vs. Regulatory
• Classification effort depends on your environment
How does Classification work? Context-Based Classification
Business Information (diagram: sample statements, each tagged with a category letter)
• Category ‘A’ Marketing
  • A: Intellectual Property is essential
  • A: Legal is changing the timeframe for contract approval
  • A: Legal is currently requiring full approval
  • A: The core market for this new product has been defined as such by IBM
• Category ‘B’ Engineering
  • B: Engineering drafts require approval
  • B: Engineering requires skilled software staff
  • B: Engineering requires clear requirements
• Category ‘C’ Strategy
  • C: Strategy should look out over 36 months
  • C: Strategy is important to the marketing team
• Statement to be classified
  • ?: The core market for this new product has been defined as such by IBM
How does Classification work?
• Content Classification combines multiple categorization methods to deliver automatic classification
  • Uses natural language processing and semantic analysis
  • Uses rules based on metadata or confidence score
  • Can be used in tandem or separately depending on requirements
Sample email:
To: Bob Smith <bsmith@hotmail.com>
From: Bill Roker <broker@financialadv.com>
Subject: Contract?
Bob, Hope you’re doing well. A quick note to see if the payment came through, as prescribed by the contract? It would be terrible to have the firm sued over such a simple financial matter. No one wants this project to be derailed.
Regards, Bill
Bill Roker 212-555-1234 Financial Advisors, Inc.
Sample rules:
• Does the email contain the phrase “contract”?
• Does the sender belong to the broker email group?
• Does the email have anything that matches the pattern “XXX-YY-ZZZZ”?
Natural Language Processing + Semantic Analysis + Targeted Rules = Comprehensive Content Classification
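The three targeted rules applied to the sample email can be sketched in a few lines of Python. This is only an illustration, not how Content Classification implements rules: the function name `check_email`, the `BROKER_GROUP_DOMAINS` set, and the reading of “XXX-YY-ZZZZ” as a three-two-four digit pattern are all assumptions made for the example.

```python
import re

# Hypothetical stand-in for "the broker email group"; the slide does not say
# how group membership is defined, so a set of sender domains is assumed.
BROKER_GROUP_DOMAINS = {"financialadv.com"}

# Assumed reading of "XXX-YY-ZZZZ": three digits, two digits, four digits.
ID_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def check_email(sender: str, body: str) -> dict:
    """Evaluate the three sample rules and report which ones fire."""
    domain = sender.rsplit("@", 1)[-1].lower()
    return {
        "mentions_contract": "contract" in body.lower(),
        "sender_in_broker_group": domain in BROKER_GROUP_DOMAINS,
        "matches_id_pattern": bool(ID_PATTERN.search(body)),
    }


sample_body = (
    "Bob, Hope you're doing well. A quick note to see if the payment came "
    "through, as prescribed by the contract? Regards, Bill "
    "Bill Roker 212-555-1234 Financial Advisors, Inc."
)
result = check_email("broker@financialadv.com", sample_body)
print(result)
```

The phone number in the signature (212-555-1234) has a three-three-four shape, so under this reading the pattern rule does not fire on the sample email, while the phrase rule and the sender rule do.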
Content Classification Features
• Automatic categorization of documents and emails
  • Analyzes the content of documents and emails in order to categorize them
  • Uses natural language processing and semantic analysis
  • Handles imperfect language (misspellings, abbreviations, poor grammar)
  • Assigns a confidence score to each category suggestion (0 – 100)
• Learns from examples or keywords
  • Creates a profile for each category by analyzing sample texts
  • Categories can also be defined by keywords
• Combines classification methods using text analysis and rules processing
  • Rules based on metadata can be defined in combination with classification based on confidence score
  • Language identification capability can be used in tandem with rules
Content Classification Features
• Learns in real time
  • Can adapt based on feedback from end users or administrators
  • Feedback is incorporated into analysis on the fly for immediate adaptation
• Classification Workbench configuration tool
  • Supports creation and maintenance of Knowledge Bases and Decision Plans
  • Facilitates classification tune-up and reporting
• Integrated with IBM ECM offerings
  • Application for bulk classification of content upon ingestion into a repository, and for bulk classification and reclassification of content already under management
  • Integrated with Datacap, Content Collector, Enterprise Records, Analytics, etc.
• Taxonomy Creation Assistance
  • Suggests new taxonomies for organizations that do not have them
  • Suggests new elements for existing taxonomies
Content Classification Features – Knowledge Base
• A knowledge base contains learned information that Classification needs to perform matching, training, and online learning
  • It is filled with relevant statistical and semantic information derived from sample texts
  • Statistical entities consist of words, number of occurrences, hints about the text, and distance between words
• A knowledge base is created and maintained through the Workbench application
  • Collect and organize sample content
  • Create, analyze, and learn
  • Assess performance, review reports
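To make the idea of a knowledge base built from sample texts concrete, here is a deliberately simplified sketch. It assumes a profile is nothing more than word counts per category and scores new text by cosine similarity scaled to 0 – 100. The product's actual statistical and semantic model is richer than this, and every name below (`build_knowledge_base`, `classify`, the sample categories) is hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_knowledge_base(samples):
    """Map each category to a word-count profile learned from its sample texts."""
    return {
        category: Counter(word for text in texts for word in tokenize(text))
        for category, texts in samples.items()
    }


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(knowledge_base, text):
    """Return (category, confidence 0-100) pairs, best match first."""
    doc = Counter(tokenize(text))
    scores = {
        category: round(100 * cosine(doc, profile))
        for category, profile in knowledge_base.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


samples = {
    "Legal": [
        "The contract requires legal approval",
        "Counsel reviewed the contract terms",
    ],
    "Engineering": [
        "Engineering drafts require approval",
        "Engineering requires clear requirements",
    ],
}
kb = build_knowledge_base(samples)
ranked = classify(kb, "Please review the contract")
print(ranked)
```

Even in this toy form, the quality of the result depends entirely on how well the sample texts represent each category.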
Content Classification Features – Decision Plan
• A Decision Plan is a collection of rules that you configure to determine how content is classified
• A Decision Plan is developed by configuring one or more rules based on content or metadata
  • Each rule consists of one trigger and one or more actions
  • Example: Trigger: “If Title contains ‘Contract’”, then Actions: “Assign to Contracts Category” and “Move to Contracts folder”
  • Rules can use strings, word distance, regular expressions, pattern extraction, Boolean expressions
  • Actions include set properties, invoke analysis, move to folder, declare record, custom actions, and more
• Decision Plans can be used with or without a Knowledge Base
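The “one trigger, one or more actions” structure of a rule can be sketched as follows. This is a hypothetical model for illustration only, not the Decision Plan format used by the Workbench: the `Rule` class, the action helpers, and the dictionary representation of a document are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Document = Dict[str, str]  # assumed representation: field name -> value


@dataclass
class Rule:
    """One trigger plus one or more actions, as described on the slide."""
    name: str
    trigger: Callable[[Document], bool]
    actions: List[Callable[[Document], None]]


def assign_category(category: str) -> Callable[[Document], None]:
    def action(doc: Document) -> None:
        doc["category"] = category
    return action


def move_to_folder(folder: str) -> Callable[[Document], None]:
    def action(doc: Document) -> None:
        doc["folder"] = folder
    return action


def run_decision_plan(rules: List[Rule], doc: Document) -> List[str]:
    """Apply every rule whose trigger matches; return the names of fired rules."""
    fired = []
    for rule in rules:
        if rule.trigger(doc):
            for action in rule.actions:
                action(doc)
            fired.append(rule.name)
    return fired


# The slide's example: if Title contains 'Contract', assign and move.
contracts_rule = Rule(
    name="contracts",
    trigger=lambda doc: "contract" in doc.get("title", "").lower(),
    actions=[assign_category("Contracts"), move_to_folder("Contracts")],
)

doc = {"title": "Master Service Contract 2011"}
fired = run_decision_plan([contracts_rule], doc)
print(fired, doc)
```

Keeping the trigger and the actions as separate callables mirrors the slide's point that the same plan can mix metadata rules with other checks, with or without a Knowledge Base behind them.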
Content Classification – Taxonomy Basics
Business Taxonomy (3-4 levels)
• Usually follows a line-of-business hierarchy
• Logical grouping of content for business, repository, or compliance purposes
• Generally “flattened” for better control and management
Taxonomy (7 levels)
• The science or technique of classification
• A classification into ordered categories
• The science dealing with the description, identification, naming, and classification of organisms
Content Classification – Taxonomy Basics: The Goldilocks Zone
• “Too Many Categories”: 1000 categories is probably too many
• “Too Few Categories”: 10 categories is probably too few
• “Just Right”: somewhere around 100 categories is probably just right
Content Classification – Taxonomy Basics
• Taxonomies are important, but…
  • They do not have to be complex or unwieldy
  • They need to be acceptable to different areas of the organization
    • Finance, Legal, HR, IT
• Your organization may have a formal, internal taxonomy
  • If so, start there, but it may have to be flattened
• Your organization may have a de facto taxonomy
  • ECM document classes, folders, file system structures, and departmental structures may be enough to start
• Publicly available or third-party taxonomies may be used
  • Again, they may have to be flattened
• How are humans classifying today?
  • Are workers filing paper in folders, drawers, cabinets?
  • Are workers putting content in ECM, file systems, folders?
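One way to picture “flattening” an existing taxonomy is to cap its depth and fold anything deeper into the nearest allowed ancestor. The sketch below assumes the taxonomy is a nested dictionary and that a fixed depth limit is the flattening policy; both are assumptions for illustration, and the category names are invented.

```python
def flatten_taxonomy(tree, max_depth, path=()):
    """Return category paths no deeper than max_depth.

    Branches deeper than max_depth are folded into their ancestor at
    max_depth, so their content would be filed under that ancestor.
    """
    paths = []
    for name, children in tree.items():
        current = path + (name,)
        if children and len(current) < max_depth:
            paths.extend(flatten_taxonomy(children, max_depth, current))
        else:
            paths.append("/".join(current))
    return paths


# Hypothetical formal taxonomy, four levels deep in places.
formal_taxonomy = {
    "Finance": {
        "Accounts Payable": {
            "Invoices": {"Domestic": {}, "International": {}},
        },
        "Payroll": {},
    },
    "Legal": {"Contracts": {}},
}

flat = flatten_taxonomy(formal_taxonomy, max_depth=3)
print(flat)
```

The resulting list of paths is also a quick way to count categories against the rough “around 100” guideline.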
Starting a Classification Project
• Approaches
  • Taxonomy Proposal through Content Clustering
  • Taxonomy Creation through “Seeded” Keywords
  • Taxonomy Creation through Manual Content Gathering
  • Knowledge Base Creation through Content Extraction
Starting a Classification Project
• Taxonomy Proposal through Content Clustering
  • We don’t know what we don’t know
  • Starting from a blank sheet
(Diagram labels: crawl, gather, cluster, create, categorize, evaluate; categories A, B, C, D)
Starting a Classification Project
• Taxonomy Creation through “Seeded” Keywords
  • We know what we don’t know
  • Starting from a blank sheet
(Diagram labels: keywords, keyword-seeded taxonomy, crawl, gather, keyword-based content set, Workbench, review, Knowledge Base creation, evaluate & tune)
Starting a Classification Project
• Taxonomy Creation through Manual Content Gathering
  • We know what we don’t know
  • Starting with known content
(Diagram labels: strawman taxonomy with categories A, B, C, D; manual content gathering; manually gathered content set; Knowledge Base creation; evaluate & tune)
Starting a Classification Project
• Knowledge Base Creation through Content Extraction
  • We know what we know
  • Starting with known content and taxonomy
(Diagram labels: established ECM repository; content extraction; extracted content set with categories A, B, C, D; Knowledge Base creation; evaluate & tune)
Best Practices for Classification (or All I Really Need to Know about Classification, I Learned in Kindergarten)
• Look
• Listen
• Learn
Best Practices for Classification (or All I Really Need to Know about Classification, I Learned in Kindergarten)
• Look
  • In order to properly classify, you need to know your content
    • Understand how your content is created and by whom
    • Understand how content is used in your business
    • Understand the meaning and purpose of content
  • Set realistic expectations
    • 100% automation with 100% accuracy is rare
    • Balance automation expectations with accuracy requirements
Example:
• This is a resume
• It is used by Human Resources and Hiring Managers
• It is a text document
• The purpose is to aid the hiring process
• The document may have compliance value
Best Practices for Classification (or All I Really Need to Know about Classification, I Learned in Kindergarten)
• Listen
  • All content owners and users have a stake in proper classification
  • Gather input and consider all aspects of content, users and organizations
  • Define categories based on business use
  • Categories should represent organizational content, not organizational structure
  • Classification taxonomies are less hierarchical and flatter than “standard” taxonomies
(Diagram: hierarchical vs. flat taxonomy)
Best Practices for Classification (or All I Really Need to Know about Classification, I Learned in Kindergarten)
• Learn
  • Training is iterative; the system improves and learns over time
  • Training sets must contain “high value” examples
  • Number of training documents varies by organization (~20 to ~50, rule of thumb)
  • Hundreds of documents are less useful than 20 well-selected documents
  • More is not better, it’s just more
  • Addition of new categories affects existing categories
  • Some categories may perform well immediately, others may require additional effort
  • Categories may “drift” over time (content intent, phrases, business changes, etc.)
  • Learning requires the active use of feedback capabilities
Classification systems have to learn… Remember what Grover taught us… “Three of these things belong together...”
Best Practices for Classification – Summary
• Categories
  • Should be content driven and represent organizational content, not the organization chart
• Taxonomies
  • Less hierarchical, generally flatter and less formal than “standard” taxonomies
• Training Sets
  • Training sets should be consistent with actual content and represent “high-value” content
  • Clear delineation of content between the various categories
• Ongoing monitoring and training
  • Training is iterative; similar to business process optimization, it improves over time
• Set realistic expectations with business users
  • Balance automation expectations with accuracy requirements
• Engage competent and experienced service providers to assist with the initial classification project
Real World Example: Image Capture and Classification
• Integration between Datacap Taskmaster and Content Classification brings the power of image capture and automated classification together
• Content Classification uses text analytics and statistical probability to add another recognition approach to Taskmaster’s vast array of methods
Real World Example: Image Capture and Classification
Classification Challenges
• What type of document is this?
  • to vary processing by type
• What pages contain the data I need?
  • to extract or key in the proper fields
• Do the documents contain the correct pages?
  • to ensure that the documents are “in good order” and not missing information
• What is the business meaning of this document?
  • to get the document to the right person or process with the right priority
Real World Example: Image Capture and Classification
The Separation Challenge
• Where does one document end and the next begin? (Diagram: a stack of pages with candidate break points marked “Here?”)
• Traditional Methods
  • Patch & Barcoded Separator Sheets
  • Barcode Labels and Documents
  • Manual Identification
  • Paper Sorting
• Shortcomings
  • Labor-intensive
  • Relies on a worker’s knowledge to correctly identify and sort the documents
  • Externally generated documents cannot be barcoded
Real World Example: Image Capture and Classification
Datacap Taskmaster & Classification for Separation & Page Identification
• Taskmaster examines each page using multiple methods
  • The fastest methods are done first: barcode, pattern match, and fingerprint
  • The slower methods that require OCR follow: text analytics and keywords
  • Rules examine the context to determine if any remaining pages can be identified based on the surrounding pages
  • Taskmaster calls Content Classification to help identify pages
  • Taskmaster separates and assembles the pages into documents
• Content Classification analyzes the text content
  • Statistical analysis of the text on a page compared to a knowledge base to find the closest match
  • Assigns a confidence score to each category suggestion (0 – 100)
  • Returns the classification results to Taskmaster
• Classification feedback loop improves future results by providing feedback to the classification engine
  • Exceptions and low-confidence results are reviewed and classified by users
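The “fast methods first, slower methods next, low confidence to a human” flow can be sketched as a simple cascade. This is not Taskmaster's rule language or API. The method functions are stand-ins, the page is modeled as a plain dictionary, and the review threshold of 70 is an assumed value chosen only for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple

Page = Dict[str, str]                  # assumed page representation
Match = Optional[Tuple[str, int]]      # (page type, confidence 0-100) or None
Method = Tuple[str, Callable[[Page], Match]]

REVIEW_THRESHOLD = 70  # assumed cut-off; not a documented product default


def by_barcode(page: Page) -> Match:
    """Fast method: a barcode, when present, identifies the page outright."""
    code = page.get("barcode")
    return (code, 100) if code else None


def by_text_analytics(page: Page) -> Match:
    """Slow method: stand-in for a call to the classification engine on OCR text."""
    text = page.get("ocr_text", "").lower()
    if "promissory note" in text:
        return ("Note", 90)
    if "loan" in text:
        return ("Loan Application", 55)
    return None


# Ordered fastest first, as on the slide.
METHODS: List[Method] = [
    ("barcode", by_barcode),
    ("text analytics", by_text_analytics),
]


def identify_page(page: Page, methods: List[Method] = METHODS,
                  threshold: int = REVIEW_THRESHOLD) -> dict:
    """Try each method in order; flag unidentified or low-confidence pages for review."""
    for name, method in methods:
        match = method(page)
        if match is not None:
            page_type, confidence = match
            return {"type": page_type, "confidence": confidence,
                    "method": name, "needs_review": confidence < threshold}
    return {"type": None, "confidence": 0, "method": None, "needs_review": True}


print(identify_page({"barcode": "W-2"}))
print(identify_page({"ocr_text": "Loan request form"}))
```

Pages flagged with `needs_review` are the ones a user would classify by hand, and those corrections are what the feedback loop would send back to the classification engine.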
Bank specializing in mortgage loan servicing
Slashing costs with IBM Production Imaging Edition and IBM Content Classification
The need
• Reduce paper document scanning and processing costs
• Reduce loan servicing customer service costs
• Processing volumes can exceed 100 million scanned pages per month
The solution
The company contracted with IBM partner Imagine Solutions to implement IBM Production Imaging Edition (PIE) and IBM Classification Module software
• PIE - Datacap Taskmaster scans and imports paper documents
• PIE - Datacap Taskmaster rules classify documents to the page level using barcodes, image fingerprint pattern matching, regular expressions, and text analytic classification
• IBM Classification Module classifies pages using text analytics
• Taskmaster extracts text and data fields using optical character recognition (OCR)
• Data collection, statistical reporting, and feedback loops improve accuracy and configuration tuning
• PIE - FileNet Content Manager securely stores the documents
• Acquisition and servicing processes are automated through web-based document access and PIE business process capabilities
Projected benefits
• Save millions of dollars of staff time by automating document classification, reducing data entry, and providing direct access to the loan documents with improved speed, accuracy, and granularity
• Save millions of dollars in per-page licensing fees associated with the competitively replaced Kofax KTM system
• Provide a platform that can be rapidly ramped up to handle high loads associated with portfolio acquisitions
The solution is targeted to reduce costs by automating the classifying, keying and filing of millions of pages of loan documentation per day.
Closing Thoughts: How can classification help my business?
• Improve teaching programs and student learning
  • Classify educational content through analysis of lesson plan text
• Automatically code medical bills
  • Interpret doctors’ notes and apply industry-standard codes (ICD-9, ICD-10)
• Reduce manual, human intervention
  • Automatically evaluate email service requests and establish responses
• Shorten process cycle time
  • Distinguish mortgage, auto, personal, and credit card loan applications
  • Route content to the appropriate worker or process step
• Automatically understand Personally Identifiable Information (PII) and Protected Health Information (PHI) in unstructured content
  • Take actions such as file, record, route, redact
Closing Thoughts
• Classification is a powerful solution to automate the categorization of text-based content
• Properly categorized content provides better accessibility, usability, compliance and analytics
• Many factors lead to high-quality classification; consider and understand all of them
• The keys to success are planning, preparation and persistence
  • Is there any project that does not require these?
• Automated classification allows you to cut costs associated with content capture, collection, archiving, retention, analysis and more
“Anything worth doing, is worth doing right.” – Hunter S. Thompson