SemTech Text Analytics Evaluation. Tom Reamy, Chief Knowledge Architect, KAPS Group, Knowledge Architecture Professional Services. http://www.kapsgroup.com
Agenda • Text Analytics Features, Varieties, Vendors • Evaluation Process • Start with Self-Knowledge • Text Analytics Team • Features and Capabilities – Filter • Proof of Concept/Pilot • Themes and Issues • Case Study • Conclusion
KAPS Group: General • Knowledge Architecture Professional Services • Virtual Company: Network of consultants – 8-10 • Partners – SAS, SAP, FAST, Smart Logic, Concept Searching, etc. • Consulting, Strategy, Knowledge architecture audit • Services: • Taxonomy/Text Analytics development, consulting, customization • Technology Consulting – Search, CMS, Portals, etc. • Evaluation of Enterprise Search, Text Analytics • Metadata standards and implementation • Knowledge Management: Collaboration, Expertise, e-learning • Applied Theory – Faceted taxonomies, complexity theory, natural categories
Introduction to Text Analytics: Text Analytics Features • Noun Phrase Extraction (Entity, Concept, Events, etc.) • Catalogs with variants, rule-based dynamic • Multiple types, custom classes – entities, concepts, events • Feeds facets • Summarization • Customizable rules, map to different content • Fact Extraction • Relationships of entities – people-organizations-activities • Ontologies – triples, RDF, etc. • Sentiment Analysis • Statistical, rules – full categorization set of operators
Introduction to Text Analytics: Text Analytics Features • Auto-categorization • Training sets – Bayesian, Vector space • Terms – literal strings, stemming, dictionary of related terms • Rules – simple – position in text (Title, body, URL) • Semantic Network – Predefined relationships, sets of rules • Boolean – Full search syntax – AND, OR, NOT • Advanced – NEAR (#), PARAGRAPH, SENTENCE • This is the most difficult to develop • Build on a Taxonomy • Combine with Extraction • If any of a list of entities and other words
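To make the Boolean and NEAR operators above concrete, here is a minimal Python sketch of a categorization rule. It is not any vendor's rule syntax: the function names, the example category, and the terms in it are all hypothetical, and real products layer stemming, SENTENCE/PARAGRAPH scoping, and relevancy weighting on top of this idea.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def contains(tokens, term):
    """True if the literal term appears among the tokens."""
    return term in tokens

def near(tokens, a, b, distance):
    """True if terms a and b occur within `distance` tokens of each other."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)

def billing_complaint_rule(text):
    """Hypothetical rule: ((bill NEAR/5 wrong) OR overcharged) AND NOT refund."""
    tokens = tokenize(text)
    hit = near(tokens, "bill", "wrong", 5) or contains(tokens, "overcharged")
    return hit and not contains(tokens, "refund")
```

A document is assigned to the category when the rule returns True. Most of the development effort goes into refining rules like this against real content, which is why this layer is the hardest to build.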
Varieties of Taxonomy/Text Analytics Software • Taxonomy Management • Synaptica, SchemaLogic • Full Platform • SAS-Teragram, SAP-Inxight, Smart Logic, Data Harmony, Concept Searching, Expert System, IBM, GATE • Content Management – embedded • Embedded – Search • FAST, Autonomy, Endeca, Exalead, etc. • Specialty • Sentiment Analysis, VOC – Lexalytics, Attensity / Reports • Ontology – extraction, plus ontology
Evaluating Taxonomy/Text Analytics Software: Start with Self Knowledge • Strategic and Business Context • Info Problems – what, how severe • Strategic Questions – why, what value from the taxonomy/text analytics, how are you going to use it • Formal Process – KA audit – content, users, technology, business and information behaviors, applications • Or informal for smaller organizations, application-specific initiatives • Text Analytics Strategy/Model – forms, technology, people • Existing taxonomic resources, software • Need this foundation to evaluate and to develop
Evaluating Taxonomy/Text Analytics Software: Start with Self Knowledge • Do you need it – and what blend if so? • Taxonomy Management Full Functionality • Multiple taxonomies, languages, authors-editors • Technology Environment – Text Mining, ECM, Enterprise Search • Where is it embedded, integration issues • Publishing Process – where and how is metadata being added – now and projected future • Can it utilize auto-categorization, entity extraction, summarization • Applications – text mining, BI, CI, Social Media, Mobile?
Design of the Text Analytics Selection Team • Traditional Candidates - IT • Experience with large software purchases • Search/Categorization is unlike other software • Experience with needs assessments • Need more – know what questions to ask, knowledge audit • Objective criteria • Looking where there is light? • Asking IT to select text analytics software is like asking a construction company to select the design of your house. • They have the budget • OK, they can play.
Design of the Text Analytics Selection Team • Traditional Candidates - Business Owners • Understand the business • But don’t understand information behavior • Focus on business value, not technology • Focus on semantics is needed • Asking business owners to select text analytics software is like asking a restaurant owner to do the cooking • They can get executive sponsorship, support, and budget. • OK, they can play
Design of the Text Analytics Selection Team • Traditional Candidates - Library • Understand information structure • But not how it is used in the business • Experts in search experience and categorization • Suitable for experts, not regular users • Asking librarians to select text analytics software is like asking an accountant to establish your financial strategy • Experience with variety of search engines, taxonomy software, integration issues • OK, they can play
Design of the Text Analytics Selection Team • Interdisciplinary Team, headed by Information Professionals • Relative Contributions • IT – Set necessary conditions, support tests • Business – provide input into requirements, support project • Library – provide input into requirements, add understanding of search semantics and functionality • Much more likely to make a good decision • Create the foundation for implementation
Evaluating Text Analytics Software – Process • Start with Self Knowledge • Eliminate the unfit • Filter One – Ask Experts – reputation, research – Gartner, etc. • Market strength of vendor, platforms, etc. • Filter Two – Feature scorecard – minimum, must have, filter • Filter Three – Technology Filter – match to your overall scope and capabilities – Filter not a focus • Filter Four – Focus Group one day visit – 3-4 vendors • Deep pilot (2) / POC – advanced, integration, semantics • Focus on working relationship with vendor.
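The feature scorecard in Filter Two works as a pass/fail gate rather than a ranking. A minimal Python sketch of that idea follows. The must-have list, the vendor names, and their feature sets are made up for illustration and do not describe any real product.

```python
# Hypothetical must-have features; a real list comes out of the self-knowledge step.
MUST_HAVE = {"auto-categorization", "entity extraction", "sentiment analysis"}

# Hypothetical vendors and the features each one offers.
VENDORS = {
    "Vendor A": {"sentiment analysis", "entity extraction"},
    "Vendor B": {"auto-categorization", "entity extraction",
                 "sentiment analysis", "summarization"},
    "Vendor C": {"auto-categorization", "entity extraction"},
    "Vendor D": {"auto-categorization", "entity extraction",
                 "sentiment analysis"},
}

def missing_features(features, must_have=MUST_HAVE):
    """Return the must-have features a vendor lacks, sorted for stable output."""
    return sorted(must_have - features)

def shortlist(vendors, must_have=MUST_HAVE):
    """Keep only vendors that cover every must-have feature (a filter, not a score)."""
    return sorted(name for name, features in vendors.items()
                  if must_have <= features)
```

Extra features such as summarization do not raise a vendor's standing here. They only matter later, in the demos and the POC.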
Initial Evaluation Example Outcomes • Filter One: • Company A, B – sentiment analysis focus, weak categorization • Company C – Lack of full suite of text analytics • Company D – business concerns, support • Open Source – license issues • Ontology Vendors – missing categorization capabilities • 4 Demos • Saw a variety of different approaches, but • Company X – lacking sentiment analysis, require 2 vendors • Company Y – lack of language support, development cost
Evaluating Taxonomy Software: POC – Approach • Quality of results is the essential factor • 6-week POC – bake-off / or short pilot • Real-life scenarios, categorization with your content • Preparation: • Preliminary analysis of content and users' information needs • Set up software in lab – relatively easy • Train taxonomist(s) on software(s) • Develop taxonomy if none available • Six-week POC – 3 rounds of development, test, refine / Not OOB • Need SMEs as test evaluators – also to do an initial categorization of content
Evaluating Taxonomy Software: POC – Initial Design • Majority of time is on auto-categorization • Need to balance uniformity of results with vendor unique capabilities – have to determine at POC time • Risks – getting software installed and working, getting the right content, initial categorization of content • Elements: • Content • Search terms / search scenarios • Training sets • Test sets of content • Development Team – expert consultants plus internal taxonomists, technical
Evaluating Taxonomy Software: POC – Range of Evaluations • Basic – Can this stuff work at all? • Auto-categorization to existing taxonomy – variety of content • Clustering – automatic node generation • Summarization • Entity extraction – build a number of catalogs – design which ones based on projected needs – example privacy info (SS#, phone, etc.) • Entity example – people, organization, methods, etc. • Evaluate usability in action by taxonomists • Integration – with ontologies • Output – XML, APIs
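As a rough illustration of a rule-based entity catalog for privacy information, the Python sketch below uses two regular expressions. The patterns are simplifying assumptions (US-style SSNs as ddd-dd-dddd, phone numbers as ddd-ddd-dddd with dashes or dots) and the names `CATALOG` and `extract_entities` are hypothetical. A production catalog would need many more variants plus disambiguation.

```python
import re

# Hypothetical catalog: entity label -> pattern. Real catalogs hold many variants.
CATALOG = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def extract_entities(text):
    """Return (label, matched_text, start_offset) tuples in document order."""
    found = []
    for label, pattern in CATALOG.items():
        for match in pattern.finditer(text):
            found.append((label, match.group(), match.start()))
    return sorted(found, key=lambda entity: entity[2])
```

Calling `extract_entities("Call 555-123-4567 about SSN 123-45-6789.")` returns one phone entity followed by one SSN entity. In a POC, output like this would be checked against content that SMEs have already marked up.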
Evaluating Text Analytics Software: POC – Issues • Quality of content – range of issues – spelling to size to ? • Quality of initial human categorization • Normalize among different test evaluators • Quality of taxonomists – experience with text analytics software and/or experience with content and information needs and behaviors • Quality of taxonomy • General issues – structure (too flat or too deep) • Overlapping categories • Differences in use – browse, index, categorize • Categorization – essential issue is complexity of language • Entity Extraction – essential issue is scale and disambiguation
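The slides do not say how to normalize among evaluators. One common way to measure how far two human evaluators agree before using their labels as a baseline is Cohen's kappa, sketched below in Python. Choosing kappa is an assumption made for illustration, not the presenter's stated method, and the category labels in the usage note are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two evaluators, corrected for chance agreement.

    labels_a and labels_b hold the category each evaluator assigned to the
    same documents, in the same order.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Share of documents on which the two evaluators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each evaluator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both evaluators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

For example, `cohens_kappa(["billing", "billing", "network", "network"], ["billing", "network", "network", "network"])` gives 0.5: the evaluators agree on 75% of documents, but half of that agreement would be expected by chance. Low agreement between humans sets a ceiling on the accuracy any software can be measured against.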
Evaluating Text Analytics Software: Risks • CIO/CTO Problem – This is not a regular software process • Language is messy, not just complex • 30% accuracy isn’t 30% done – could be 90% • Variability of human categorization / expression • Even professional writers – journalist examples • Categorization is iterative, not “the program works” • Need realistic budget and flexible project plan • Anyone can do categorization • Librarians often overdo, SMEs often get lost (keywords) • Meta-language issues – understanding the results • Need to educate IT and business in their language
Case Study: Telecom Service • Company History, Reputation • Full Platform – Categorization, Extraction, Sentiment • Integration – Java, API-SDK, Linux • Multiple languages • Scale – millions of docs a day • Total Cost of Ownership • Ease of Development – new • Vendor Relationship – OEM, etc. • Expert System • IBM • SAS • Smart Logic • Option – Multiple vendors – Sentiment & Platform
POC Design Discussion: Evaluation Criteria • Basic Test Design – categorize test set • Score – by file name, human testers • Categorization • Accuracy Level – 80-90% • Effort Level per accuracy level • Sentiment Analysis • Accuracy Level – 80-90% • Effort Level per accuracy level • Quantify development time – main elements • Comparison of two vendors – how score? • Combination of scores and report
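The basic test design above, scoring a categorized test set by file name against human testers, can be sketched in a few lines of Python. The file names, categories, and the 0.80 default target are illustrative assumptions. The slides give 80-90% as the accuracy level but do not prescribe a scoring script.

```python
def accuracy(human, predicted):
    """Share of test files whose predicted category matches the human label.

    Both arguments map file name -> category. A file the software left
    uncategorized counts as a miss.
    """
    if not human:
        raise ValueError("the human-labeled test set is empty")
    correct = sum(1 for name, category in human.items()
                  if predicted.get(name) == category)
    return correct / len(human)

def meets_target(score, target=0.80):
    """True if the score reaches the accuracy target (80% assumed by default)."""
    return score >= target

# Hypothetical test set: human labels and one vendor's output.
HUMAN = {"doc1.txt": "billing", "doc2.txt": "network", "doc3.txt": "billing",
         "doc4.txt": "device", "doc5.txt": "network"}
VENDOR_OUTPUT = {"doc1.txt": "billing", "doc2.txt": "network",
                 "doc3.txt": "device", "doc4.txt": "device",
                 "doc5.txt": "network"}
```

Running the same function over each vendor's output after every development round gives comparable numbers. Recording the hours spent per round alongside them covers the "effort level per accuracy level" criterion.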
Text Analytics POC Outcomes: Vendor Comparisons • Categorization Results – both good, edge to SAS on precision • Use of Relevancy to set thresholds • Development Environment • IBM as toolkit provides more flexibility but it also increases development effort • Methodology – IBM enforces good method, but takes more time • SAS can be used in exactly the same way • SAS has a much more complete set of operators – NOT, DIST, START
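To show how a relevancy threshold trades precision against recall, the Python sketch below assumes each candidate category assignment carries a relevancy score between 0 and 1 and a flag saying whether the human evaluators agreed with it. That data layout and the function name are assumptions for illustration and are not taken from either vendor's product.

```python
def precision_recall(results, threshold):
    """Precision and recall when only assignments scoring >= threshold are kept.

    `results` is a list of (relevancy_score, is_correct) pairs, one per
    candidate category assignment. Recall is measured against all correct
    candidates in the list.
    """
    kept = [is_correct for score, is_correct in results if score >= threshold]
    total_correct = sum(1 for _, is_correct in results if is_correct)
    kept_correct = sum(1 for is_correct in kept if is_correct)
    # By convention here, an empty selection or empty truth set scores 0.0.
    precision = kept_correct / len(kept) if kept else 0.0
    recall = kept_correct / total_correct if total_correct else 0.0
    return precision, recall
```

Raising the threshold usually keeps fewer but more reliable assignments, so precision tends to rise while recall falls. Sweeping the threshold over the POC test set is one way to choose a cut-off for each vendor.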
Text Analytics POC Outcomes: Vendor Comparisons – Functionality • Sentiment Analysis – SAS has workbench, IBM would require more development • SAS also has statistical modeling capabilities • Entity and Fact extraction – seems basically the same • SAS can use operators for improved disambiguation • Summarization – SAS has built-in • IBM could develop using categorization rules – but not clear that would be as effective without operators • Conclusion: Both can do the job, edge to SAS
Conclusion • Start with self-knowledge – what will you use it for? • Current Environment – technology, information • Basic Features are only filters, not scores • Integration – need an integrated team (IT, Business, KA) • For evaluation and development • POC – your content, real world scenarios – not scores • Foundation for development, experience with software • Development is better, faster, cheaper • Categorization is essential, time consuming • Next: Text Analytics + Semantic Web + Ontology • Integration of Data and Text Mining • Mutual Enrichment – smarter data, richer analytics
Questions? Tom Reamy, tomr@kapsgroup.com. KAPS Group, Knowledge Architecture Professional Services. http://www.kapsgroup.com