Text Analytics Software Choosing the Right Fit. Tom Reamy Chief Knowledge Architect KAPS Group http://www.kapsgroup.com Text Analytics World October 20 New York. Agenda. Introduction – Text Analytics Basics Evaluation Process & Methodology Two Stages – Initial Filters & POC
Text Analytics Software: Choosing the Right Fit • Tom Reamy, Chief Knowledge Architect, KAPS Group • http://www.kapsgroup.com • Text Analytics World, October 20, New York
Agenda • Introduction – Text Analytics Basics • Evaluation Process & Methodology • Two Stages – Initial Filters & POC • Proof of Concept • Methodology • Results • Text Analytics and “Text Analytics” • Conclusions
KAPS Group: General • Knowledge Architecture Professional Services • Virtual Company: Network of consultants – 8-10 • Partners – SAS, SAP, FAST, Smart Logic, Concept Searching, etc. • Consulting, Strategy, Knowledge architecture audit • Services: • Taxonomy/Text Analytics development, consulting, customization • Evaluation of Enterprise Search, Text Analytics • Text Analytics Assessment, Fast Start • Technology Consulting – Search, CMS, Portals, etc. • Knowledge Management: Collaboration, Expertise, e-learning • Applied Theory – Faceted taxonomies, complexity theory, natural categories
Introduction to Text Analytics: Text Analytics Features • Noun Phrase Extraction • Catalogs with variants, rule based dynamic • Multiple types, custom classes – entities, concepts, events • Feeds facets • Summarization • Customizable rules, map to different content • Fact Extraction • Relationships of entities – people-organizations-activities • Ontologies – triples, RDF, etc. • Sentiment Analysis • Rules – Objects and phrases
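To make "catalogs with variants" concrete, here is a minimal sketch of catalog-based entity extraction. The catalog contents, the entity types, and the function name `extract_entities` are illustrative assumptions, not taken from any vendor product; real platforms add rule-based and statistical extraction on top of simple lookups like this.

```python
import re

# Hypothetical catalog: canonical entity -> (entity type, known variants).
CATALOG = {
    "KAPS Group": ("organization", ["KAPS Group", "KAPS"]),
    "New York": ("location", ["New York", "NYC"]),
}


def extract_entities(text):
    """Return sorted (canonical name, type) pairs whose variants occur in text."""
    found = set()
    for canonical, (entity_type, variants) in CATALOG.items():
        for variant in variants:
            # Whole-word, case-sensitive match, so "KAPS" does not hit "kapsule".
            if re.search(r"\b" + re.escape(variant) + r"\b", text):
                found.add((canonical, entity_type))
                break  # one matching variant is enough for this entity
    return sorted(found)


if __name__ == "__main__":
    print(extract_entities("KAPS presented in NYC."))
```

Each variant maps back to one canonical entry, which is what lets extracted entities feed facets consistently.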
Introduction to Text Analytics: Text Analytics Features • Auto-categorization • Training sets – Bayesian, Vector space • Terms – literal strings, stemming, dictionary of related terms • Rules – simple – position in text (Title, body, url) • Semantic Network – Predefined relationships, sets of rules • Boolean – Full search syntax – AND, OR, NOT • Advanced – DIST(#), ORDDIST#, PARAGRAPH, SENTENCE • This is the most difficult to develop • Build on a Taxonomy • Combine with Extraction • If any of list of entities and other words
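A rough sketch of how a Boolean categorization rule with a proximity operator might behave. The `dist` helper below is only an approximation of what an operator like DIST(#) does; its exact semantics differ by vendor, and the rule, the category, and all function names are assumptions made for illustration.

```python
import re


def tokens(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def has(words, term):
    """True if the literal term occurs among the tokens."""
    return term in words


def dist(words, a, b, max_gap):
    """True if a and b occur within max_gap token positions, in either order."""
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)


def is_billing_complaint(text):
    """Hypothetical rule: (DIST(5, bill, wrong) OR overcharged) AND NOT resolved."""
    w = tokens(text)
    return (dist(w, "bill", "wrong", 5) or has(w, "overcharged")) and not has(w, "resolved")


if __name__ == "__main__":
    print(is_billing_complaint("My bill was completely wrong again"))
    print(is_billing_complaint("The bill was wrong but it is resolved now"))
```

Even this toy rule shows why rule development is iterative: every operator and distance threshold has to be tuned against real content.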
Evaluation Process & Methodology: Overview • Start with Self Knowledge • Think Big, Start Small, Scale Fast • Eliminate the unfit • Filter One – Ask Experts – reputation, research – Gartner, etc. • Market strength of vendor, platforms, etc. • Feature scorecard – minimum, must have, filter to top 3 • Filter Two – Technology Filter – match to your overall scope and capabilities – Filter not a focus • Filter Three – In-Depth Demo – 3-6 vendors • Deep POC (2) – advanced, integration, semantics • Focus on working relationship with vendor.
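The feature scorecard step (must-have filter, then cut to the top 3) can be sketched as follows. The vendor names, feature names, and weights are placeholders invented for illustration; an actual scorecard would come out of the self-knowledge exercise.

```python
# Hypothetical must-have features: a vendor missing any of these is eliminated.
MUST_HAVE = {"auto_categorization", "entity_extraction"}

# Hypothetical weights for nice-to-have features.
WEIGHTS = {"sentiment_analysis": 3, "multilingual": 2, "api": 1}

# Placeholder vendors and their feature sets.
VENDORS = {
    "Vendor A": {"auto_categorization", "entity_extraction", "sentiment_analysis", "api"},
    "Vendor B": {"auto_categorization", "sentiment_analysis", "multilingual", "api"},
    "Vendor C": {"auto_categorization", "entity_extraction", "multilingual"},
    "Vendor D": {"auto_categorization", "entity_extraction", "api"},
    "Vendor E": {"auto_categorization", "entity_extraction", "sentiment_analysis",
                 "multilingual", "api"},
}


def shortlist(vendors, must_have, weights, top_n=3):
    """Drop vendors lacking a must-have feature, then rank the rest by weighted score."""
    scored = []
    for name, features in vendors.items():
        if not must_have <= features:  # subset test: every must-have is present
            continue
        score = sum(weight for feature, weight in weights.items() if feature in features)
        scored.append((score, name))
    # Highest score first; ties broken alphabetically for a stable result.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_n]]


if __name__ == "__main__":
    print(shortlist(VENDORS, MUST_HAVE, WEIGHTS))
```

The scores here only decide who gets eliminated. Consistent with the talk, they are a filter for choosing demo and POC candidates, not the final selection criterion.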
Design of the Text Analytics Selection Team: Traditional Candidates – IT, Business, Library • IT – Experience with software purchases, needs assessment, budget • Search/Categorization is unlike other software, needs a deeper look • Business – understand business, focus on business value • They can get executive sponsorship, support, and budget • But don’t understand information behavior, semantic focus • Library, KM – Understand information structure • Experts in search experience and categorization • But don’t understand business or technology
Design of the Text Analytics Selection Team • Interdisciplinary Team, headed by Information Professionals • Relative Contributions • IT – Set necessary conditions, support tests • Business – provide input into requirements, support project • Library – provide input into requirements, add understanding of search semantics and functionality • Much more likely to make a good decision • Create the foundation for implementation
Evaluating Taxonomy/Text Analytics Software: Start with Self Knowledge • Strategic and Business Context • Info Problems – what, how severe • Strategic Questions – why, what value from the text analytics, how are you going to use it • Platform or Applications? • Formal Process – KA audit – content, users, technology, business and information behaviors, applications – or an informal process for smaller organizations • Text Analytics Strategy/Model – forms, technology, people • Existing taxonomic resources, software • Need this foundation to evaluate and to develop
Varieties of Taxonomy/ Text Analytics Software • Taxonomy Management • Synaptica, SchemaLogic • Full Platform • SAS, SAP, Smart Logic, Linguamatics, Concept Searching, Expert System, IBM, GATE • Embedded – Search or Content Management • FAST, Autonomy, Endeca, Exalead, etc. • Nstein, Interwoven, Documentum, etc. • Specialty / Ontology (other semantic) • Sentiment Analysis – Lexalytics, Clarabridge, Lots of players • Ontology – extraction, plus ontology
Vendors of Taxonomy/Text Analytics Software • Attensity • Business Objects – Inxight • Clarabridge • ClearForest • Concept Searching • Data Harmony / Access Innovations • Expert System • GATE (Open Source) • IBM Infosphere • Lexalytics • MultiTes • Nstein • SAS • SchemaLogic • Smart Logic • Synaptica
Initial Evaluation – Factors Traditional Software Evaluation - Deeper • Basic & Advanced Capabilities • Lack of Essential Feature • No Sentiment Analysis, Limited language support • Customization vs. OOB • Strongest OOB – highest customization cost • Company experience, multiple products vs. platform • Ease of integration – API’s, Java • Internal and External Applications • Technical Issues, Development Environment • Total Cost of Ownership and support, initial price • POC Candidates – 1-4
Initial Evaluation – Factors Case Studies • Amdocs • Customer Support Notes – short, badly written, millions of documents • Total Cost, multiple languages, Integration with their application • Distributed expertise • Platform – resell full range of services, Sentiment Analysis • Twenty to Four to POC (Two) to SAS • GAO • Library of 200 page PDF formal documents, plus public web site • People – library staff – 3-4 taxonomists – centralized expertise • Enterprise search, general public • Twenty to POC with SAS
Phase II – Proof Of Concept – POC • Measurable Quality of results is the essential factor • 4 weeks POC – bake off / or short pilot • Real life scenarios, categorization with your content • 2 rounds of development, test, refine / Not OOB • Need SMEs as test evaluators – also to do an initial categorization of content • Majority of time is on auto-categorization • Need to balance uniformity of results with vendor unique capabilities – have to determine at POC time • Taxonomy Developers – expert consultants plus internal taxonomists
POC Design: Evaluation Criteria & Issues • Basic Test Design – categorize test set • Score – by file name, human testers • Categorization & Sentiment – Accuracy 80-90% • Effort Level per accuracy level • Quantify development time – main elements • Comparison of two vendors – how score? • Combination of scores and report • Quality of content & initial human categorization • Normalize among different test evaluators • Quality of taxonomists – experience with text analytics software and/or experience with content and information needs and behaviors • Quality of taxonomy – structure, overlapping categories
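The basic test design (categorize a test set, score by file name against the human categorization) can be sketched like this. The file names, categories, and vendor labels are made up for illustration, and the sketch assumes one category per file, which real taxonomies with overlapping categories may not satisfy.

```python
def accuracy(human, predicted):
    """Share of test files whose predicted category matches the human one.

    Files the vendor did not categorize count as misses.
    """
    if not human:
        raise ValueError("empty test set")
    hits = sum(1 for name, category in human.items() if predicted.get(name) == category)
    return hits / len(human)


# Hypothetical initial human categorization, keyed by file name.
HUMAN = {
    "doc1.txt": "billing",
    "doc2.txt": "network",
    "doc3.txt": "billing",
    "doc4.txt": "handset",
}

# Hypothetical output from two POC candidates on the same files.
VENDOR_X = {"doc1.txt": "billing", "doc2.txt": "network",
            "doc3.txt": "network", "doc4.txt": "handset"}
VENDOR_Y = {"doc1.txt": "billing", "doc2.txt": "billing", "doc3.txt": "network"}

if __name__ == "__main__":
    for label, result in (("Vendor X", VENDOR_X), ("Vendor Y", VENDOR_Y)):
        print(f"{label}: {accuracy(HUMAN, result):.0%}")
```

A single number like this is only one input to the comparison. The quality of the human categorization and the effort needed to reach each accuracy level still have to be weighed in the report.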
Text Analytics POC Outcomes: Evaluation Factors • Variety & Limits of Content • Twitter to large formal libraries • Quality of Categorization • Scores – Recall, Precision (harder) • Operators – NOT, DIST, START • Development Environment & Methodology • Toolkit or Integrated Product • Effort Level and Usability • Importance of relevancy – can be used for precision, applications • Combination of workbench, statistical modeling • Measures – scores, reports, discussions
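Recall and precision for a single category can be computed from the same kind of file-name-keyed results. This is a minimal sketch using the standard definitions; the data and the function name `precision_recall` are illustrative assumptions, and it again assumes one category per file.

```python
def precision_recall(human, predicted, category):
    """Return (precision, recall) for one category.

    precision = correct assignments / all files the software put in the category
    recall    = correct assignments / all files humans put in the category
    """
    true_pos = sum(1 for f, c in predicted.items()
                   if c == category and human.get(f) == category)
    false_pos = sum(1 for f, c in predicted.items()
                    if c == category and human.get(f) != category)
    false_neg = sum(1 for f, c in human.items()
                    if c == category and predicted.get(f) != category)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical human and software categorizations, keyed by file name.
HUMAN_LABELS = {"a.txt": "billing", "b.txt": "billing", "c.txt": "network", "d.txt": "billing"}
SOFTWARE_LABELS = {"a.txt": "billing", "b.txt": "network", "c.txt": "network", "d.txt": "network"}

if __name__ == "__main__":
    p, r = precision_recall(HUMAN_LABELS, SOFTWARE_LABELS, "billing")
    print(f"billing: precision={p:.2f} recall={r:.2f}")
```

In this toy data the software is cautious about "billing": everything it assigns is correct (high precision), but it misses most of the billing files (low recall).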
POC and Early Development: Risks and Issues • CTO Problem – This is not a regular software process • Semantics is messy not just complex • 30% accuracy isn’t 30% done – could be 90% • Variability of human categorization • Categorization is iterative, not “the program works” • Need realistic budget and flexible project plan • Anyone can do categorization • Librarians often overdo, SMEs often get lost (keywords) • Meta-language issues – understanding the results • Need to educate IT and business in their language
Text Analytics and “Text Analytics” – Text Mining • TA is pre-processing for text mining • TA adds huge dimensions of unstructured text • Now 85-90% of all content, Social Media • TA can improve the quality of text • Categorization, Disambiguated metadata extraction • Unstructured text into data - What are the possibilities? • New Kinds of Taxonomies – emotion, small smart modular • Information Overload – search, facets, auto-tagging, etc. • Behavior Prediction – individual actions (cancel or not?) • Customer & Business Intelligence – new relationships • Crowd sourcing – technical support • Expertise Analysis – documents, authors, communities
Conclusion • Start with self-knowledge – what will you use it for? • Current Environment – technology, information • Basic Features are only filters, not scores • Integration – need an integrated team (IT, Business, KA) • For evaluation and development • POC – your content, real world scenarios – not scores • Foundation for development, experience with software • Development is better, faster, cheaper • Categorization is essential, time consuming • Text Analytics opens up new worlds of applications
Questions? Tom Reamy • tomr@kapsgroup.com • KAPS Group • Knowledge Architecture Professional Services • http://www.kapsgroup.com