Applying Semantics to Search Text Analytics. Tom Reamy Chief Knowledge Architect KAPS Group http://www.kapsgroup.com Enterprise Search Summit New York. Agenda. Introduction – Search, Semantics, Text Analytics How do you mean? Getting (Re)Started with Text Analytics – 3 ½ steps
Applying Semantics to Search: Text Analytics Tom Reamy, Chief Knowledge Architect, KAPS Group http://www.kapsgroup.com Enterprise Search Summit New York
Agenda • Introduction – Search, Semantics, Text Analytics • How do you mean? • Getting (Re)Started with Text Analytics – 3 ½ steps • Preliminary: Strategic Vision • What is text analytics and what can it do? • Step 1: Self Knowledge – TA Audit • Step 2: Text Analytics Software Evaluation • Step 3: POC / Quick Start – Pilot to Development • Rest of your Life: Refinement, Feedback, Learning • Conclusions
KAPS Group: General • Knowledge Architecture Professional Services – Network of Consultants • Partners – SAS, SAP, IBM, FAST, Smart Logic, Concept Searching, Attensity, Clarabridge, Lexalytics • Strategy – IM & KM: Text Analytics, Social Media, Integration • Services: • Taxonomy/Text Analytics development, consulting, customization • Text Analytics Fast Start – Audit, Evaluation, Pilot • Social Media: Text-based applications – design & development • Clients: • Genentech, Novartis, Northwestern Mutual Life, Financial Times, Hyatt, Home Depot, Harvard Business Library, British Parliament, Battelle, Amdocs, FDA, GAO, etc. • Applied Theory – Faceted taxonomies, complexity theory, natural categories, emotion taxonomies • Presentations, Articles, White Papers – http://www.kapsgroup.com
Introduction: Search, Semantics, Text Analytics: What do you mean? • All Search is (should be) semantic • Humans search concepts, not chicken scratches • Is this semantics? • NLP, Concept Search, Semantic Web (ontologies) • Meaning in Text • Text Analytics – categorization • Extraction – noun phrase, facts-triples • Meaning from Search Results • A conversation, not a list of (poorly) ranked documents
What is Text Analytics? Text Analytics Features • Noun Phrase Extraction • Catalogs with variants, rule-based dynamic • Multiple types, custom classes – entities, concepts, events • Feeds facets • Summarization • Customizable rules, map to different content • Fact Extraction • Relationships of entities – people-organizations-activities • Ontologies – triples, RDF, etc. • Sentiment Analysis • Rules & statistical – Objects, products, companies, and phrases
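A minimal sketch of the "catalogs with variants" idea feeding facets, assuming a tiny hand-built catalog. The `CATALOG` contents and the `extract_entities` helper are hypothetical illustrations, not the API of any product named in this deck.

```python
import re
from collections import defaultdict

# Hypothetical catalog: canonical entity -> (entity type, known variants).
CATALOG = {
    "IBM": ("organization", ["IBM", "International Business Machines"]),
    "Food and Drug Administration": ("organization", ["FDA", "Food and Drug Administration"]),
    "Boston": ("place", ["Boston"]),
}

def extract_entities(text):
    """Return {entity_type: sorted canonical names found in text}."""
    facets = defaultdict(set)
    for canonical, (entity_type, variants) in CATALOG.items():
        for variant in variants:
            # Whole-word, case-insensitive match on each variant.
            if re.search(r"\b" + re.escape(variant) + r"\b", text, re.IGNORECASE):
                facets[entity_type].add(canonical)
                break  # one matching variant is enough for this entity
    return {etype: sorted(names) for etype, names in facets.items()}

doc = "The FDA met with International Business Machines in Boston."
facet_values = extract_entities(doc)
print(facet_values)
```

Each variant maps back to one canonical name, so "FDA" and "Food and Drug Administration" land in the same facet value.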
What is Text Analytics? Text Analytics Features • Auto-categorization • Training sets – Bayesian, Vector space • Terms – literal strings, stemming, dictionary of related terms • Rules – simple – position in text (Title, body, url) • Semantic Network – Predefined relationships, sets of rules • Boolean – Full search syntax – AND, OR, NOT • Advanced – DIST(#), ORDDIST#, PARAGRAPH, SENTENCE • This is the most difficult to develop • Build on a Taxonomy • Combine with Extraction • If any of list of entities and other words • Disambiguation - Ford
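A rough sketch of how a proximity rule in the spirit of DIST(#) might disambiguate "Ford". This is plain Python, not any vendor's rule syntax; the `dist` helper, the context word lists in `RULES`, and the default gap of 5 tokens are all assumptions made for illustration.

```python
import re

def tokenize(text):
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())

def dist(tokens, term, others, max_gap):
    """True if `term` occurs within `max_gap` tokens of any word in `others`."""
    positions = [i for i, tok in enumerate(tokens) if tok == term]
    other_positions = [i for i, tok in enumerate(tokens) if tok in others]
    return any(abs(p - q) <= max_gap for p in positions for q in other_positions)

# Hypothetical context words for two senses of "Ford".
RULES = {
    "Automotive": {"car", "cars", "truck", "trucks", "dealer", "motor"},
    "Politics": {"president", "gerald", "administration", "pardon"},
}

def categorize_ford(text, max_gap=5):
    """Return the categories whose context words appear near 'ford'."""
    tokens = tokenize(text)
    return [cat for cat, words in RULES.items() if dist(tokens, "ford", words, max_gap)]

print(categorize_ford("The new Ford truck went on sale."))
print(categorize_ford("President Gerald Ford signed the bill."))
```

A bare mention of "ford" with no nearby context word matches neither rule, which is the point of combining a term with a list of other words.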
Preliminary: Text Analytics Vision: What can Text Analytics Do? • Strategic Questions – why, what value from the text analytics, how are you going to use it • Platform or Applications? • What are the basic capabilities of Text Analytics? • What can Text Analytics do for Search? • After 10 years of failure – get search to work? • What can you do with smart search-based applications? • RM, PII, Social • ROI for effective search – difficulty of believing • Problems with metadata, taxonomy
Preliminary: Text Analytics Vision: Adding Structure to Unstructured Content • How do you bridge the gap – taxonomy to documents? • Tagging documents with taxonomy nodes is tough • And expensive – central or distributed • Library staff – experts in categorization, not subject matter • Too limited, narrow bottleneck • Often don’t understand business processes and business uses • Authors – Experts in the subject matter, terrible at categorization • Intra and Inter inconsistency, “intertwingleness” • Choosing tags from taxonomy – complex task • Folksonomy – almost as complex, wildly inconsistent • Resistance – not their job, cognitively difficult = non-compliance • Text Analytics is the answer(s)!
Preliminary: Text Analytics Vision: Adding Structure to Unstructured Content • Text Analytics and Taxonomy Together – Platform • Text Analytics provides the power to apply the taxonomy • And metadata of all kinds • Consistent in every dimension, powerful and economic • Hybrid Model • Publish Document -> Text Analytics analysis -> suggestions for categorization, entities, metadata -> present to author • Cognitive task is simple -> react to a suggestion instead of select from head or a complex taxonomy • Feedback – if author overrides -> suggestion for new category • Facets – Requires a lot of Metadata - Entity Extraction feeds facets • Hybrid – Automatic is really a spectrum – depends on context • Automatic – adding structure at search results
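The hybrid model above can be sketched as a small workflow: the engine suggests, the author reacts, and overrides are logged as candidate new categories. The keyword lookup in `suggest_categories` is only a stand-in for a real text analytics engine, and every name and category here is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class TaggingSession:
    suggestions: list
    accepted: list = field(default_factory=list)
    new_category_candidates: list = field(default_factory=list)

def suggest_categories(text, keyword_map):
    """Stand-in for the text analytics engine: simple keyword lookup."""
    lowered = text.lower()
    return sorted(cat for cat, kws in keyword_map.items()
                  if any(k in lowered for k in kws))

def review(session, author_choices):
    """Author reacts to suggestions; anything not suggested is logged as feedback."""
    for choice in author_choices:
        session.accepted.append(choice)
        if choice not in session.suggestions:
            session.new_category_candidates.append(choice)
    return session

# Hypothetical keyword map and document.
keyword_map = {"Benefits": ["pension", "insurance"], "Travel": ["flight", "hotel"]}
session = TaggingSession(suggest_categories("Hotel and flight booking policy", keyword_map))
review(session, ["Travel", "Expenses"])
print(session)
```

Here the author accepts the suggested "Travel" and adds "Expenses", which the engine did not propose, so "Expenses" ends up in the feedback list for the taxonomy team.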
Step 1: TA Information Audit: Start with Self Knowledge • Info Problems – what, how severe • Formal Process - KA audit – content, users, technology, business and information behaviors, applications - or informal for smaller organizations • Contextual interviews, content analysis, surveys, focus groups, ethnographic studies, Text Mining • Category modeling – Cognitive Science – how people think • Natural level categories mapped to communities, activities • Novices prefer higher levels • Balance of informativeness and distinctiveness • Text Analytics Strategy/Model – forms, technology, people
Step 1: TA Information Audit: Start with Self Knowledge • Ideas – Content and Content Structure • Map of Content – Tribal language silos • Structure – articulate and integrate • Taxonomic resources • People – Producers & Consumers • Communities, Users, Central Team • Activities – Business processes and procedures • Semantics, information needs and behaviors • Information Governance Policy • Technology • CMS, Search, portals, text analytics • Applications – BI, CI, Semantic Web, Text Mining
Step 2: TA Evaluation: Varieties of Taxonomy / Text Analytics Software • Taxonomy Management - extraction • Full Platform • SAS, SAP, Smart Logic, Concept Searching, Expert System, IBM, Linguamatics, GATE • Embedded – Search or Content Management • FAST, Autonomy, Endeca, Vivisimo, NLP, etc. • Interwoven, Documentum, etc. • Specialty / Ontology (other semantic) • Sentiment Analysis – Attensity, Lexalytics, Clarabridge, lots of others • Ontology – extraction, plus ontology
Step 2: Text Analytics Evaluation: A Different Kind of Software Evaluation • Traditional Software Evaluation - Start • Filter One – Ask Experts - reputation, research – Gartner, etc. • Market strength of vendor, platforms, etc. • Feature scorecard – minimum, must have, filter to top 6 • Filter Two – Technology Filter – match to your overall scope and capabilities – Filter not a focus • Filter Three – In-Depth Demo – 3-6 vendors • Reduce to 1-3 vendors • Vendors have different strengths in multiple environments • Millions of short, badly typed documents, Build application • Library 200 page PDF, enterprise & public search
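One way to picture the Filter One feature scorecard: drop any vendor missing a must-have feature, rank the rest by a weighted score, and keep the top 6. The vendor names, feature names, and weights below are placeholders invented for this sketch, not an assessment of any real product.

```python
def shortlist(vendors, must_have, weights, top_n=6):
    """Drop vendors missing any must-have feature, then rank by weighted score."""
    qualified = [v for v in vendors if must_have <= set(v["features"])]

    def score(vendor):
        return sum(weights.get(f, 0) for f in vendor["features"])

    # Highest score first; ties broken alphabetically by name.
    ranked = sorted(qualified, key=lambda v: (-score(v), v["name"]))
    return [(v["name"], score(v)) for v in ranked[:top_n]]

# Hypothetical scorecard inputs.
vendors = [
    {"name": "Vendor A", "features": ["categorization", "extraction", "sentiment"]},
    {"name": "Vendor B", "features": ["categorization", "sentiment"]},
    {"name": "Vendor C", "features": ["categorization", "extraction"]},
]
must_have = {"categorization", "extraction"}
weights = {"categorization": 3, "extraction": 2, "sentiment": 1}

top_vendors = shortlist(vendors, must_have, weights)
print(top_vendors)
```

The scorecard is only a filter, in line with the slide: it narrows the field for the in-depth demo and POC, where quality of results decides.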
Design of the Text Analytics Selection Team: Traditional Candidates – IT, Business, Library • IT - Experience with software purchases, needs assessment, budget • Search/Categorization is unlike other software, deeper look • Business – understand business, focus on business value • They can get executive sponsorship, support, and budget • But don’t understand information behavior, semantic focus • Library, KM - Understand information structure • Experts in search experience and categorization • But don’t understand business or technology
Design of the Text Analytics Selection Team • Interdisciplinary Team, headed by Information Professionals • Relative Contributions • IT – Set necessary conditions, support tests • Business – provide input into requirements, support project • Library – provide input into requirements, add understanding of search semantics and functionality • Much more likely to make a good decision • Create the foundation for implementation
Step 3: Proof of Concept / Pilot Project • 4-week POC – bake-off or short pilot • Real-life scenarios, categorization with your content • 2 rounds of development, test, refine / Not OOB • Need SMEs as test evaluators – also to do an initial categorization of content • Measurable quality of results is the essential factor • Majority of time is on auto-categorization • Need to balance uniformity of results with vendor unique capabilities – have to determine at POC time • Taxonomy Developers – expert consultants plus internal taxonomists
Step 3: Proof of Concept: POC Design: Evaluation Criteria & Issues • Basic Test Design – categorize test set • Score – by file name, human testers • Categorization & Sentiment – Accuracy 80-90% • Effort Level per accuracy level • Quantify development time – main elements • Comparison of two vendors – how to score? • Combination of scores and report • Quality of content & initial human categorization • Normalize among different test evaluators • Quality of taxonomists – experience with text analytics software and/or experience with content and information needs and behaviors • Quality of taxonomy – structure, overlapping categories
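A minimal sketch of scoring by file name against the human categorization, with a simple percent-agreement check between two evaluators as a first step toward normalizing them. File names, categories, and results are invented for illustration; a real POC would use a much larger test set and probably a chance-corrected agreement measure.

```python
def accuracy(predicted, gold):
    """Share of files where the predicted category matches the human one."""
    shared = predicted.keys() & gold.keys()
    if not shared:
        return 0.0
    return sum(predicted[name] == gold[name] for name in shared) / len(shared)

def agreement(evaluator_a, evaluator_b):
    """Percent agreement between two human evaluators on the same files."""
    return accuracy(evaluator_a, evaluator_b)

# Hypothetical test set: file name -> category.
gold = {"doc1.txt": "Finance", "doc2.txt": "HR", "doc3.txt": "Legal", "doc4.txt": "Finance"}
vendor_x = {"doc1.txt": "Finance", "doc2.txt": "HR", "doc3.txt": "Legal", "doc4.txt": "HR"}
vendor_y = {"doc1.txt": "Finance", "doc2.txt": "Legal", "doc3.txt": "Legal", "doc4.txt": "HR"}
evaluator_2 = {"doc1.txt": "Finance", "doc2.txt": "HR", "doc3.txt": "Finance", "doc4.txt": "Finance"}

print("Vendor X accuracy:", accuracy(vendor_x, gold))
print("Vendor Y accuracy:", accuracy(vendor_y, gold))
print("Evaluator agreement:", agreement(gold, evaluator_2))
```

If the human evaluators themselves agree on only some of the files, the vendor scores have to be read against that ceiling rather than against 100%.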
Step 3: Proof of Concept: POC and Early Development: Risks and Issues • CTO Problem – This is not a regular software process • Semantics is messy, not just complex • 30% accuracy isn’t 30% done – could be 90% • Variability of human categorization • Categorization is iterative, not “the program works” • Need realistic budget and flexible project plan • Anyone can do categorization • Librarians often overdo, SMEs often get lost (keywords) • Meta-language issues – understanding the results • Need to educate IT and business in their language
Step 3: Proof of Concept / Quick Start: Outcomes • POC – understand how text analytics can work in your environment • Learn the software – internal resources trained by doing • Learn the language – syntax (Advanced Boolean) • Learn categorization and extraction • Good categorization rules • Balance of general and specific • Balance of recall and precision • Develop or refine taxonomies for categorization • POC – can be the Quick Start or the Start of the Quick Start
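The balance of recall and precision can be made concrete with the standard definitions: precision is the share of documents a rule assigns that really belong in the category, and recall is the share of documents belonging in the category that the rule finds. The document IDs and the two rules below are made up for this sketch.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one category rule."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical: four documents truly belong in the category.
relevant = {"d1", "d2", "d3", "d4"}
broad_rule = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]  # general terms
narrow_rule = ["d1", "d2"]                                      # specific terms

print("broad :", precision_recall(broad_rule, relevant))
print("narrow:", precision_recall(narrow_rule, relevant))
```

The broad rule finds everything but half of what it tags is wrong; the narrow rule is always right but misses half the category. Refining rules during the POC means moving both numbers up together.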
Development, Implementation: Quick Start – First Application: Search and TA • Simple Subject Taxonomy structure • Easy to develop and maintain • Combined with categorization capabilities • Added power and intelligence • Combined with people tagging, refining tags • Combined with Faceted Metadata • Dynamic selection of simple categories • Allow multiple user perspectives • Can’t predict all the ways people think • Monkey, Banana, Panda • Combined with ontologies and semantic data • Multiple applications – Text mining to Search • Combine search and browse
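A small sketch of dynamic selection over faceted metadata: documents carry simple category and entity values, and users combine facets in whatever order matches how they think. The documents, facet names, and values are hypothetical.

```python
from collections import Counter

def filter_docs(docs, selections):
    """Keep documents whose metadata matches every selected facet value."""
    return [d for d in docs if all(d.get(f) == v for f, v in selections.items())]

def facet_counts(docs, facet):
    """Count the values of one facet across the current result set."""
    return Counter(d[facet] for d in docs if facet in d)

# Hypothetical tagged documents.
docs = [
    {"title": "Q3 travel policy", "topic": "Travel", "type": "Policy", "year": 2011},
    {"title": "Hotel rates memo", "topic": "Travel", "type": "Memo", "year": 2012},
    {"title": "Pension overview", "topic": "Benefits", "type": "Policy", "year": 2012},
]

travel = filter_docs(docs, {"topic": "Travel"})
print(facet_counts(travel, "type"))
print([d["title"] for d in filter_docs(docs, {"type": "Policy", "year": 2012})])
```

One user can start from topic and narrow by type, another can start from type and year; both reach their documents without the taxonomy having to predict the path.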
3. Roles and Responsibilities Sample roles matrix:
3. Roles and Responsibilities Common Roles and SharePoint Permissions:
Rest of Your Life: Maintenance, Refinement, Application, Learning • This is easy – if you did the TA Audit and POC/Quick Start • Content – new content – calls for flexible, new methods • People – Have a trained team and extended team • Technology – integrate into variety of applications – SBA • Processes, workflow – how semi-automate, part of normal • Maintenance – Refinement – in world of rapid change • Mechanisms for feedback, learning – of text analysts and software • Future Directions - Advanced Applications • Embedded Applications, Semantic Web + Unstructured Content • Integration of Enterprise and External - Social Media • Expertise Analysis, Behavior Prediction (Predictive Analytics) • Voice of the Customer, Big Data • Turning unstructured content into data – new worlds
Conclusion • Text Analytics can fulfill the promise of taxonomy and metadata • Economic and consistent structure for unstructured content • Search and Text Analytics • Search that works – finally! • Platform for Search-Based Applications • Text Analytics is a different kind of software / solution • Infrastructure – Hybrid CM to Search and feedback • How to Get Started with Text Analytics • Strategic Vision of Text Analytics • Three steps – TA Audit, TA evaluation, POC/Quick Start • Text Analytics opens up new worlds of applications
Questions? Tom Reamy tomr@kapsgroup.com KAPS Group Knowledge Architecture Professional Services http://www.kapsgroup.com www.TextAnalyticsWorld.com Oct 3-4, Boston