Text Analytics Workshop Evaluation of Software

Text Analytics Workshop: Evaluation of Software • Tom Reamy, Chief Knowledge Architect • KAPS Group, Knowledge Architecture Professional Services • http://www.kapsgroup.com

Agenda • Features, Varieties, Vendors • Enterprise Context • Start with Self-Knowledge • Text Analytics Team • Evaluation Process • Features and Capabilities – Filter • Proof of Concept / Pilot

Text Analytics Software – Features • Entity Extraction • Multiple types, custom classes – entities, concepts, events • Auto-categorization – Taxonomy Structure • Training sets – Bayesian, Vector space • Terms – literal strings, stemming, dictionary of related terms • Rules – simple – position in text (title, body, URL) • Boolean – full search syntax – AND, OR, NOT • Advanced – NEAR (#), PARAGRAPH, SENTENCE • Advanced Features • Facts / ontologies / Semantic Web – RDF + • Sentiment Analysis

Varieties of Taxonomy / Text Analytics Software • Taxonomy Management • Synaptica, SchemaLogic • Full Platform • SAP - Inxight, ClearForest, SAS - Teragram, Data Harmony, Concept Searching, IBM • Content Management • Nstein, Interwoven, Documentum, etc. • Embedded – Search • FAST, Autonomy, Endeca, Exalead, etc. • Specialty • Sentiment Analysis – Lexalytics

Vendors of Taxonomy / Text Analytics Software • Attensity • Business Objects – Inxight • Clarabridge • ClearForest • Data Harmony / Access Innovations • GATE (Open Source) • IBM Content Analyst • Lexalytics • MultiTes • Nstein • SAS - Teragram • SchemaLogic • Smart Logic • Synaptica • Wikionomy • Wordmap • Lots More

Evaluating Taxonomy/Text Analytics Software – Start with Self Knowledge • Strategic and Business Context • Info Problems – what, how severe • Strategic Questions – why, what value from the taxonomy/text analytics, how are you going to use it • Formal Process – KA audit – content, users, technology, business and information behaviors, applications – or an informal process for smaller organizations • Text Analytics Strategy/Model – forms, technology, people • Existing taxonomic resources, software • Need this foundation to evaluate and to develop

Evaluating Taxonomy/Text Analytics Software – Start with Self Knowledge • Do you need it – and what blend if so? • Taxonomy Management Only • Multiple taxonomies, languages, authors-editors • Technology Environment – ECM, Enterprise Search – where is it embedded? • Publishing Process – where and how is metadata being added – now and projected future • Can it utilize auto-categorization, entity extraction, summarization? • Is the current search adequate – can it utilize text analytics? • Applications – text mining, BI, CI, Alerts?

Design of the Text Analytics Selection Team • Traditional Candidates - IT • Experience with large software purchases • Search/Categorization is unlike other software • Experience with needs assessments • Need more – know what questions to ask, knowledge audit • Objective criteria • Looking where there is light? • Asking IT to select taxonomy software is like asking a construction company to select the design of your house. • They have the budget • OK, they can play.

Design of the Text Analytics Selection Team • Traditional Candidates - Business Owners • Understand the business • But don’t understand information behavior • Focus on business value, not technology • Focus on semantics is needed • They can get executive sponsorship, support, and budget. • OK, they can play

Design of the Text Analytics Selection Team • Traditional Candidates - Library • Understand information structure • But not how it is used in the business • Experts in search experience and categorization • Suitable for experts, not regular users • Experience with variety of search engines, taxonomy software, integration issues • OK, they can play

Design of the Text Analytics Selection Team • Interdisciplinary Team, headed by Information Professionals • Relative Contributions • IT – Set necessary conditions, support tests • Business – provide input into requirements, support project • Library – provide input into requirements, add understanding of search semantics and functionality • Much more likely to make a good decision • Create the foundation for implementation

Evaluating Text Analytics Software – Process • Start with Self Knowledge • Eliminate the unfit • Filter One – Ask Experts – reputation, research – Gartner, etc. • Market strength of vendor, platforms, etc. • Feature scorecard – minimum, must have, filter to top 3 • Filter Two – Technology Filter – match to your overall scope and capabilities – a filter, not a focus • Filter Three – Focus Group – one-day visit – 3-4 vendors • Deep pilot (2) / POC – advanced, integration, semantics • Focus on working relationship with vendor
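The feature scorecard step above can be sketched in a few lines. This is only an illustration: the vendor names, feature names, and scores are hypothetical, and the ranking rule (sum of scores) is an assumption, not something the deck prescribes. Vendors missing any must-have feature are eliminated, and the remainder is cut to a top three.

```python
from typing import Dict, List, Set


def shortlist(vendors: Dict[str, Dict[str, int]],
              must_have: Set[str],
              keep: int = 3) -> List[str]:
    """Drop vendors lacking any must-have feature, then rank the rest by total score."""
    fit = {
        name: scores
        for name, scores in vendors.items()
        if all(scores.get(feature, 0) > 0 for feature in must_have)
    }
    ranked = sorted(fit, key=lambda name: sum(fit[name].values()), reverse=True)
    return ranked[:keep]


# Hypothetical scores (0 = missing, 1-5 = quality); not real vendor data.
vendors = {
    "Vendor A": {"entity_extraction": 4, "auto_categorization": 5, "api": 3},
    "Vendor B": {"entity_extraction": 5, "auto_categorization": 0, "api": 5},
    "Vendor C": {"entity_extraction": 3, "auto_categorization": 3, "api": 2},
    "Vendor D": {"entity_extraction": 2, "auto_categorization": 4, "api": 4},
    "Vendor E": {"entity_extraction": 1, "auto_categorization": 2, "api": 1},
}

top = shortlist(vendors, must_have={"entity_extraction", "auto_categorization"})
print(top)
```

The result is a shortlist for the later filters and the POC, not a final ranking.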

Evaluating Text Analytics Software – Feature Checklist and Score: Basic Features, Admin • New, copy, rename, delete, merge • Branches not just nodes • Scope Notes • Spell check • Search – all parts and selected (only taxonomy nodes) • Names and Identifiers for terms and nodes • Check for duplicates • Versioning, multiple authors • Analytical reports – structure, application to documents

Evaluating Text Analytics Software – Feature Checklist and Score: Usability • Ease of use – copy, paste, rename, merge, etc. • User documentation, user manuals, on-line help, training and tutorials • Visualization • File structure, tree, hierarchy and alphabetical • Automatic Taxonomy/Node & Rule Generation • Nonsense for Taxonomy • Node – suggestions for sub-categories, rules • Variety of node relationships – child-parent, related

Evaluating Text Analytics Software – Feature Checklist and Score: Additional Features • Language support – international – if you have need for it • Scalability – size of taxonomy rarely important • More important for auto-categorization • Import-Export – XML and SKOS • Support standards – NISO, etc., mapping between taxonomies • API / SDK • Security, Access Rights, Roles • Advanced Features – future growth • Facts / ontologies / Semantic Web – RDF + • Sentiment Analysis

Evaluating Text Analytics Software – Advanced Features – Text Analytics as Platform • Entity Extraction • Multiple types, custom classes • Summarization • Customizable rules, map to different content • Auto-categorization • Training sets • Terms – literal strings, stemming, dictionary of related terms • Rules – simple – position in text (title, body, URL) • Advanced – saved search queries (full search syntax) • NEAR, SENTENCE, PARAGRAPH • Boolean – X NEAR Y and Not-Z
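To make the "X NEAR Y and Not-Z" rule type concrete, here is a minimal sketch of how such a categorization rule could be evaluated. It does not reproduce any vendor's rule syntax. The function names, the word-level tokenizer, and the default distance of 5 words are all assumptions for illustration.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def near(tokens: List[str], x: str, y: str, distance: int) -> bool:
    """True if some occurrence of x is within `distance` tokens of some y."""
    xs = [i for i, token in enumerate(tokens) if token == x]
    ys = [i for i, token in enumerate(tokens) if token == y]
    return any(abs(i - j) <= distance for i in xs for j in ys)


def rule_matches(text: str, x: str, y: str, not_z: str, distance: int = 5) -> bool:
    """Evaluate the rule: x NEAR y AND NOT z."""
    tokens = tokenize(text)
    return near(tokens, x, y, distance) and not_z not in tokens


hit = rule_matches("The merger agreement was signed today.",
                   "merger", "agreement", "rumor")
blocked = rule_matches("Rumor of a merger agreement spread.",
                       "merger", "agreement", "rumor")
too_far = rule_matches(
    "The merger was discussed at length by analysts before any formal agreement",
    "merger", "agreement", "rumor")
print(hit, blocked, too_far)
```

Commercial tools typically add stemming and SENTENCE or PARAGRAPH scopes on top of this kind of proximity test; those are left out here to keep the sketch short.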

Evaluating Taxonomy Software – POC • Quality of results is the essential factor • 6-week POC – bake-off / or short pilot • Real-life scenarios, categorization with your content • Preparation: • Preliminary analysis of content and users' information needs • Set up software in lab – relatively easy • Train taxonomist(s) on software(s) • Develop taxonomy if none available • Six-week POC – 3 rounds of development, test, refine / Not OOB • Need SMEs as test evaluators – also to do an initial categorization of content

Evaluating Taxonomy Software – POC • Majority of time is on auto-categorization • Need to balance uniformity of results with vendor unique capabilities – have to determine at POC time • Risks – getting software installed and working, getting the right content, initial categorization of content • Elements: • Content • Search terms / search scenarios • Training sets • Test sets of content • Taxonomy Developers – expert consultants plus internal taxonomists
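One possible way to put a number on "quality of results" for a test set is to compare the software's categories with the SMEs' initial categorization. The deck does not name a metric. Micro-averaged precision and recall are used here as an assumption, and the document ids and category names are hypothetical.

```python
from typing import Dict, Set, Tuple


def precision_recall(predicted: Dict[str, Set[str]],
                     expected: Dict[str, Set[str]]) -> Tuple[float, float]:
    """Micro-averaged precision and recall over all (document, category) pairs."""
    tp = fp = fn = 0
    for doc_id, gold in expected.items():
        guess = predicted.get(doc_id, set())
        tp += len(guess & gold)   # categories the software got right
        fp += len(guess - gold)   # categories assigned but not in the SME set
        fn += len(gold - guess)   # SME categories the software missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical test set: SME categorization vs. software output.
expected = {"doc1": {"Finance"}, "doc2": {"Legal", "HR"}, "doc3": {"Finance"}}
predicted = {"doc1": {"Finance"}, "doc2": {"Legal"}, "doc3": {"HR"}}

precision, recall = precision_recall(predicted, expected)
print(round(precision, 3), round(recall, 3))
```

Running the same test set through each vendor's software after every development round gives a comparable view of how results improve across the three rounds.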

Evaluating Taxonomy Software – POC Test Cases: • Auto-categorization to existing taxonomy – variety of content • Clustering – automatic node generation • Summarization • Entity extraction – build a number of catalogs – design which ones based on projected needs – example: privacy info (SS#, phone, etc.) • Entity examples – people, organizations, methods, etc. • Evaluate usability in action by taxonomists
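The privacy-info catalog mentioned above can be illustrated with a small pattern-based extractor. This is a sketch only: it assumes US-style formats for Social Security numbers and phone numbers, the catalog names are hypothetical, and real products ship far richer catalogs than two regular expressions.

```python
import re
from typing import Dict, List

# Hypothetical catalogs: one pattern per entity type (US formats assumed).
CATALOGS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
}


def extract(text: str) -> Dict[str, List[str]]:
    """Return every match for each catalog, keyed by catalog name."""
    return {name: pattern.findall(text) for name, pattern in CATALOGS.items()}


sample = "Contact Jane at 555-123-4567; her SSN on file is 123-45-6789."
found = extract(sample)
print(found)
```

Entity types such as people, organizations, and methods generally need dictionaries or trained models instead of fixed patterns, which is one reason to test several catalogs during the POC.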

Evaluating Taxonomy Software – POC – Issues • Quality of content • Quality of initial human categorization • Normalize among different test evaluators • Quality of taxonomists – experience with text analytics software and/or experience with content and information needs and behaviors • Quality of taxonomy • General issues – structure (too flat or too deep) • Overlapping categories • Differences in use – browse, index, categorize • IMPORTANT!!!
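Before normalizing among test evaluators, it helps to know how far apart they are. The deck does not specify a method. Cohen's kappa, a standard chance-corrected agreement measure for two raters, is swapped in here as one option, and the labels below are hypothetical.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two evaluators on the same documents."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty document set")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical categorizations of four documents by two SMEs.
sme_1 = ["Finance", "Finance", "Legal", "Legal"]
sme_2 = ["Finance", "Legal", "Legal", "Legal"]

kappa = cohen_kappa(sme_1, sme_2)
print(kappa)
```

A low value suggests the evaluators read the category definitions differently, which points back to scope notes and overlapping categories in the taxonomy itself.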

Conclusion • Start with self-knowledge – what will you use it for? • Current Environment – technology, information • Basic Features are only filters, not scores • Integration – need an integrated team (IT, Business, KA) • For evaluation and development • POC – your content, real world scenarios – not scores • Foundation for development, experience with software • Development is better, faster, cheaper • Categorization is essential and time consuming • Categorization – the essential issue is complexity of language • Entity Extraction – the essential issue is scale

Questions? • Tom Reamy • tomr@kapsgroup.com • KAPS Group, Knowledge Architecture Professional Services • http://www.kapsgroup.com
