Semantic Infrastructure Workshop Development. Tom Reamy Chief Knowledge Architect KAPS Group Knowledge Architecture Professional Services http://www.kapsgroup.com. Agenda. Text Analytics – Foundation Features and Capabilities Evaluation of Text Analytics Start with Self-Knowledge
Semantic Infrastructure Workshop Development Tom Reamy, Chief Knowledge Architect, KAPS Group Knowledge Architecture Professional Services http://www.kapsgroup.com
Agenda • Text Analytics – Foundation • Features and Capabilities • Evaluation of Text Analytics • Start with Self-Knowledge • Features and Capabilities – Filter, Proof of Concept / Pilot • Text Analytics Development • Progressive Refinement • Categorization, Extraction, Sentiment • Case Studies • Best Practices
Semantic Infrastructure - Foundation Text Analytics Features • Noun Phrase Extraction • Catalogs with variants, rule based dynamic • Multiple types, custom classes – entities, concepts, events • Feeds facets • Summarization • Customizable rules, map to different content • Fact Extraction • Relationships of entities – people-organizations-activities • Ontologies – triples, RDF, etc. • Sentiment Analysis • Rules – Objects and phrases – positive and negative
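To make "catalogs with variants" concrete, here is a minimal Python sketch of catalog-based entity extraction. The catalog contents and the function name `extract_entities` are assumptions made for illustration; they are not taken from any vendor's product, and a real catalog would also need rule-based disambiguation.

```python
import re

# Hypothetical catalog: canonical name -> (entity type, known surface variants)
CATALOG = {
    "Environmental Protection Agency": (
        "organization", ["EPA", "Environmental Protection Agency"]),
    "International Business Machines": (
        "organization", ["IBM", "International Business Machines"]),
}

def extract_entities(text, catalog=CATALOG):
    """Return sorted (canonical name, type, count) for each catalog entity found."""
    results = []
    for canonical, (etype, variants) in catalog.items():
        # Try longer variants first so full names win over abbreviations
        ordered = sorted(variants, key=len, reverse=True)
        pattern = r"\b(?:%s)\b" % "|".join(re.escape(v) for v in ordered)
        count = len(re.findall(pattern, text))
        if count:
            results.append((canonical, etype, count))
    return sorted(results)
```

Because every variant maps back to one canonical name, the counts can feed a facet (for example, an "Organization" facet) without splitting "EPA" and its full name into separate values.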
Semantic Infrastructure - Foundation Text Analytics Features • Auto-categorization • Training sets – Bayesian, Vector space • Terms – literal strings, stemming, dictionary of related terms • Rules – simple – position in text (Title, body, url) • Semantic Network – Predefined relationships, sets of rules • Boolean – Full search syntax – AND, OR, NOT • Advanced – NEAR (#), PARAGRAPH, SENTENCE • This is the most difficult to develop • Build on a Taxonomy • Combine with Extraction • If any of list of entities and other words
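A Boolean categorization rule with a NEAR operator can be sketched in a few lines of Python. This is an illustration under stated assumptions: the rule, the category ("billing complaint"), and the helper names are hypothetical, and commercial tools express such rules in their own rule syntax rather than in code like this.

```python
import re

def tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def near(toks, a, b, distance):
    """True if terms a and b occur within `distance` tokens of each other."""
    pos_a = [i for i, t in enumerate(toks) if t == a]
    pos_b = [i for i, t in enumerate(toks) if t == b]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)

def billing_complaint(text):
    """Hypothetical rule: ((bill OR invoice) NEAR/5 wrong) AND NOT test."""
    toks = tokens(text)
    hit = near(toks, "bill", "wrong", 5) or near(toks, "invoice", "wrong", 5)
    return hit and "test" not in toks
```

The proximity constraint is what separates this from a plain keyword match: a document that mentions "bill" and "wrong" far apart is not assigned to the category.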
Semantic Infrastructure - Foundation Vendors of Taxonomy/ Text Analytics Software • Attensity • SAP - Business Objects – Inxight • Clarabridge • ClearForest • Concept Searching • Data Harmony / Access Innovations • Expert Systems • GATE (Open Source) • IBM Content Analyst • Lexalytics • Multi-Tes • Nstein • SAS - Teragram • SchemaLogic • Smart Logic • Synaptica • Ontology Vendors
Semantic Infrastructure - Foundation Varieties of Taxonomy/ Text Analytics Software • Taxonomy Management • Synaptica, SchemaLogic • Full Platform • SAP-Inxight, Clear Forest, SAS- Teragram, Data Harmony, Concept Searching, IBM • Content Management • Nstein, Interwoven, Documentum, etc. • Embedded – Search • FAST, Autonomy, Endeca, Exalead, etc. • Specialty • Sentiment Analysis - Lexalytics
Evaluating Taxonomy/Text Analytics Software Start with Self Knowledge • Strategic and Business Context • Info Problems – what, how severe • Strategic Questions – why, what value from the taxonomy/text analytics, how are you going to use it • Formal Process - KA audit – content, users, technology, business and information behaviors, applications - Or informal for smaller organization, • Text Analytics Strategy/Model – forms, technology, people • Existing taxonomic resources, software • Need this foundation to evaluate and to develop
Evaluating Taxonomy/Text Analytics Software Start with Self Knowledge • Do you need it – and what blend if so? • Taxonomy Management Stand alone • Multiple taxonomies, languages, authors-editors • Technology Environment – ECM, Enterprise Search – where is it embedded • Publishing Process – where and how is metadata being added – now and projected future • Can it utilize auto-categorization, entity extraction, summarization • Is the current search adequate – can it utilize text analytics? • Applications – text mining, BI, CI, Alerts?
Semantic Infrastructure - Foundation Design of the Text Analytics Selection Team • Interdisciplinary Team, led by Information Professionals • IT – software experience, budget, support tests • Business – understand business and requirements • Library – understand information structure, understanding of search semantics and functionality • Much more likely to make a good decision • This is not a traditional IT software evaluation – semantics • Create the foundation for implementation
Semantic Infrastructure - Foundation Evaluating Text Analytics Software – Process • Start with Self Knowledge • Eliminate the unfit • Filter One - Ask Experts - reputation, research – Gartner, etc. • Market strength of vendor, platforms, etc. • Feature scorecard – minimum, must have, filter to top 3-4 • Filter Two – Technology Filter – match to your overall scope and capabilities – Filter not a focus • Filter Three – Focus Group one day visit – 3-4 vendors • Deep pilot (2) / POC – advanced, integration, semantics • Two Questions – who is better, can it be done, for how much • Focus on working relationship with vendor.
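The feature scorecard in Filter One can be sketched as a two-step calculation: eliminate any vendor missing a must-have feature, then rank the remainder by weighted score. The vendor names, feature ratings, and weights below are placeholders invented for the example, not results from any evaluation.

```python
def shortlist(vendors, must_have, weights, top_n=3):
    """Drop vendors missing a must-have feature, then rank by weighted score."""
    qualified = {
        name: feats for name, feats in vendors.items()
        if all(feats.get(f, 0) > 0 for f in must_have)
    }
    scored = {
        name: sum(weights[f] * feats.get(f, 0) for f in weights)
        for name, feats in qualified.items()
    }
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Hypothetical ratings on a 0-3 scale (0 = feature absent)
VENDORS = {
    "Vendor A": {"categorization": 3, "extraction": 2, "sentiment": 0},
    "Vendor B": {"categorization": 2, "extraction": 3, "sentiment": 2},
    "Vendor C": {"categorization": 3, "extraction": 3, "sentiment": 1},
}
WEIGHTS = {"categorization": 3, "extraction": 2, "sentiment": 1}
```

Calling `shortlist(VENDORS, ["categorization", "sentiment"], WEIGHTS)` removes Vendor A before any scoring happens, which mirrors the "eliminate the unfit" step: a high total cannot compensate for a missing must-have.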
Semantic Infrastructure - Foundation Evaluating Taxonomy Software - POC • Quality of results is the essential factor • 6 weeks POC – bake off / or short pilot • Real life scenarios, categorization with your content • Preparation: • Preliminary analysis of content and users information needs • Set up software in lab – relatively easy • Train taxonomist(s) on software(s) • Develop taxonomy if none available • Six week POC – 3 rounds of development, test, refine / Not OOB • Need SME’s as test evaluators – also to do an initial categorization of content
Semantic Infrastructure - Foundation Evaluating Taxonomy Software - POC • Scenarios – categorization, extraction, summarization, etc. • Majority of time is on auto-categorization • Need to balance uniformity of results with vendor unique capabilities – have to determine at POC time • Elements: • Content • Search terms / search scenarios • Training and test sets • Taxonomy Developers – expert consultants plus internal taxonomists • Evaluate usability in action by taxonomists
Semantic Infrastructure - Foundation Evaluating Taxonomy Software – POC Issues • Quality of content • Quality of initial human categorization • Normalize among different test evaluators • Quality of taxonomists – experience with text analytics software and/or experience with content and information needs and behaviors • Quality of taxonomy • General issues – structure (too flat or too deep) • Overlapping categories • Differences in use – browse, index, categorize • Categorization essential issue is complexity of language • Entity Extraction essential issue is scale, disambiguation
Semantic Infrastructure - Foundation Case Study: Telecom Service • Company History, Reputation • Full Platform – Categorization, Extraction, Sentiment • Integration – Java, API-SDK, Linux • Multiple languages • Scale – millions of docs a day • Total Cost of Ownership • Ease of Development - new • Vendor Relationship – OEM, etc. • Expert Systems • IBM • SAS - Teragram • Smart Logic • Option – Multiple vendors – Sentiment & Platform • IBM and SAS – finalists
Semantic Infrastructure - Foundation POC Design Discussion: Evaluation Criteria • Basic Test Design – categorize test set • Score – by file name, human testers • Categorization – Call Motivation • Accuracy Level – 80-90% • Effort Level per accuracy level • Sentiment Analysis • Accuracy Level – 80-90% • Effort Level per accuracy level • Quantify development time – main elements • Comparison of two vendors – how score? • Combination of scores and report
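Scoring a categorized test set by file name against human labels reduces to a simple accuracy calculation. The sketch below assumes one category per file and uses made-up file names and labels; the 0.80 default comes from the lower end of the 80-90% target on the slide.

```python
def accuracy(human, predicted):
    """Fraction of test files whose predicted category matches the human label."""
    if not human:
        raise ValueError("empty test set")
    hits = sum(1 for name, label in human.items()
               if predicted.get(name) == label)
    return hits / len(human)

def meets_target(acc, target=0.80):
    """True if accuracy reaches the target level."""
    return acc >= target

# Hypothetical test set: file name -> category assigned by human testers
HUMAN = {"call_001.txt": "billing", "call_002.txt": "outage",
         "call_003.txt": "billing", "call_004.txt": "upgrade",
         "call_005.txt": "outage"}
# Hypothetical software output for the same files
PREDICTED = {"call_001.txt": "billing", "call_002.txt": "outage",
             "call_003.txt": "upgrade", "call_004.txt": "upgrade",
             "call_005.txt": "outage"}
```

Recording this figure after each development round, together with the hours spent, gives the "effort level per accuracy level" comparison between vendors.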
Text Analytics POC Outcomes Vendor Comparisons • Categorization Results – both good, edge to SAS on precision • Use of Relevancy to set thresholds • Development Environment • IBM as toolkit provides more flexibility but it also increases development effort • Methodology – IBM enforces good method, but takes more time • SAS can be used in exactly the same way • SAS has a much more complete set of operators – NOT, DIST, START
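Using relevancy scores to set a threshold is a trade between precision and recall. The following Python sketch shows the calculation for one category; the document ids and scores are invented for illustration and do not reproduce any vendor's relevancy model.

```python
def precision_recall(scored, relevant, threshold):
    """Precision and recall for one category at a given relevancy threshold.

    scored:   doc id -> relevancy score produced by the categorizer
    relevant: set of doc ids that human testers assigned to the category
    """
    assigned = {doc for doc, score in scored.items() if score >= threshold}
    true_pos = len(assigned & relevant)
    precision = true_pos / len(assigned) if assigned else 0.0
    recall = true_pos / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical scores for four documents; humans marked d1 and d3 as relevant
SCORED = {"d1": 0.9, "d2": 0.7, "d3": 0.4, "d4": 0.2}
RELEVANT = {"d1", "d3"}
```

With this data, a threshold of 0.8 assigns only d1 (perfect precision, half the recall), while 0.3 also picks up d3 at the cost of admitting d2. Which end to favor depends on the application.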
Text Analytics POC Outcomes Vendor Comparisons - Functionality • Sentiment Analysis – SAS has workbench, IBM would require more development • SAS also has statistical modeling capabilities • Entity and Fact extraction – seems basically the same • SAS can use operators for improved disambiguation • Summarization – SAS has built-in • IBM could develop using categorization rules – but not clear that would be as effective without operators • Conclusion: Both can do the job, edge to SAS • Now the fun begins - development
Text Analytics Development: Foundation • Articulated Information Management Strategy (K Map) • Content and Structures and Metadata • Search, ECM, applications - and how used in Enterprise • Community information needs and Text Analytics Team • POC establishes the preliminary foundation • Need to expand and deepen • Content – full range, basis for rules-training • Additional SME’s – content selection, refinement • Taxonomy – starting point for categorization / suitable? • Databases – starting point for entity catalogs
Text Analytics Development Enterprise Environment – Case Studies • A Tale of Two Taxonomies • It was the best of times, it was the worst of times • Basic Approach • Initial meetings – project planning • High level K map – content, people, technology • Contextual and Information Interviews • Content Analysis • Draft Taxonomy – validation interviews, refine • Integration and Governance Plans
Text Analytics Development Enterprise Environment – Case One – Taxonomy, 7 facets • Taxonomy of Subjects / Disciplines: • Science > Marine Science > Marine microbiology > Marine toxins • Facets: • Organization > Division > Group • Clients > Federal > EPA • Instruments > Environmental Testing > Ocean Analysis > Vehicle • Facilities > Division > Location > Building X • Methods > Social > Population Study • Materials > Compounds > Chemicals • Content Type – Knowledge Asset > Proposals
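The "A > B > C" paths on this slide map naturally onto a nested structure. As a rough Python sketch (the function names are assumptions, and a taxonomy management tool would store much more per term, such as synonyms and scope notes):

```python
def add_path(tree, path):
    """Insert a 'Broad > Narrower > Narrowest' path into a nested dict."""
    node = tree
    for term in (part.strip() for part in path.split(">")):
        node = node.setdefault(term, {})
    return tree

def broader_paths(path):
    """List every broader path above the given term, from the top down."""
    terms = [part.strip() for part in path.split(">")]
    return [" > ".join(terms[:i]) for i in range(1, len(terms))]

# Paths taken from the slide above
PATHS = [
    "Science > Marine Science > Marine microbiology > Marine toxins",
    "Clients > Federal > EPA",
    "Materials > Compounds > Chemicals",
]

def build(paths):
    """Build one tree holding the subject taxonomy and the facets."""
    tree = {}
    for p in paths:
        add_path(tree, p)
    return tree
```

Keeping each facet as its own top-level branch lets a document be tagged independently per facet (for example, a subject plus a client) instead of forcing everything into one deep hierarchy.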
Text Analytics Development Enterprise Environment – Case One – Taxonomy, 7 facets • Project Owner – KM department – included RM, business process • Involvement of library - critical • Realistic budget, flexible project plan • Successful interviews – build on context • Overall information strategy – where taxonomy fits • Good Draft taxonomy and extended refinement • Software, process, team – train library staff • Good selection and number of facets • Final plans and hand off to client
Text Analytics Development Enterprise Environment – Case Two – Taxonomy, 4 facets • Taxonomy of Subjects / Disciplines: • Geology > Petrology • Facets: • Organization > Division > Group • Process > Drill a Well > File Test Plan • Assets > Platforms > Platform A • Content Type > Communication > Presentations
Text Analytics Development Enterprise Environment – Case Two – Taxonomy, 4 facets • Environment Issues • Value of taxonomy understood, but not the complexity and scope • Under budget, under staffed • Location – not KM – tied to RM and software • Solution looking for the right problem • Importance of an internal library staff • Difficulty of merging internal expertise and taxonomy
Text Analytics Development Enterprise Environment – Case Two – Taxonomy, 4 facets • Project Issues • Project mind set – not infrastructure • Wrong kind of project management • Special needs of a taxonomy project • Importance of integration – with team, company • Project plan more important than results • Rushing to meet deadlines doesn’t work with semantics as well as software
Text Analytics Development Enterprise Environment – Case Two – Taxonomy, 4 facets • Research Issues • Not enough research – and wrong people • Interference of non-taxonomy – communication • Misunderstanding of research – wanted tinker toy connections • Interview 1 implies conclusion A • Design Issues • Not enough facets • Wrong set of facets – business not information • Ill-defined facets – too complex internal structure
Text Analytics Development Conclusion: Risk Factors • Political-Cultural-Semantic Environment • Not simple resistance - more subtle • – re-interpretation of specific conclusions and sequence of conclusions / Relative importance of specific recommendations • Understanding project scope • Access to content and people • Enthusiastic access • Importance of a unified project team • Working communication as well as weekly meetings
Text Analytics Development Case Study 2 – POC – Telecom Client • Demo of SAS – Teragram / Enterprise Content Categorization
Text Analytics Development Best Practices - Principles • Importance of ongoing maintenance and refinement • Need dedicated taxonomy team working with SME’s • Work with application developers to incorporate text analytics into new applications • Importance of metrics and feedback • Software and social • Questions: • What are important subjects (and changes) • What information do they need? • How is their information related to other silos?
Text Analytics Development Best Practices - Principles • Process • Realistic Budget – not a nice to have add on • Flexible Project plan - semantics are complex and messy • Time estimates are difficult, objective success measures are too • Transition from development to maintenance is fluid • Resources • Interdisciplinary Team is essential • Importance of communication – languages • Merging internal and external expertise
Text Analytics Development Best Practices - Principles • Categorization taxonomy structure • Tradeoff of depth and complexity of rules • Multiple avenues – facets, terms, rules, etc. • No right balance • Recall-precision balance is application specific • Training sets are starting points, rules rule • Need for custom development • Technology • Basic integration – XML • Advanced – combine unstructured and structured in new ways
Text Analytics Development Best Practices – Risk Factors • Value understood, but not the complexity and scope • Project mindset – software project and then done • Not enough research on user information needs, behaviors • Talking to the right people and asking the right questions • Getting beyond “All of the Above” surveys • Not enough resources, wrong resources • Enthusiastic access to content and people • Bad design – starting with the wrong type of taxonomy • Categorization is not library science • More like cognitive anthropology
Semantic Infrastructure Development Conclusion • Text Analytics is the Foundation for Semantic infrastructure • Evaluation of Text Analytics – different than IT software • POC – essential, foundation of development • Difference of taxonomy and categorization • Concepts vs. text in documents • Enterprise Context – strategic, self-knowledge • Infrastructure resource, not a project • Interdisciplinary Team and applications • Integration with other initiatives and technologies • Text Mining, Data Mining, Sentiment & beyond, Everything!
Questions? Tom Reamy tomr@kapsgroup.com KAPS Group Knowledge Architecture Professional Services http://www.kapsgroup.com