Text Analytics Workshop Applications. Tom Reamy, Chief Knowledge Architect, KAPS Group, Knowledge Architecture Professional Services, http://www.kapsgroup.com
Agenda • Text Analytics Applications • Integration with Search – Faceted Navigation • Integration with ECM • Metadata • Auto-categorization • Platform for Information Applications • Enterprise – internal and external • Commercial • Structure for Social
Text Analytics and Search - Elements • Facet – orthogonal dimension of metadata • Entity / Noun Phrase – metadata value of a facet • Entity extraction – feeds facets, signatures, ontologies • Taxonomy and categorization rules • Auto-categorization – aboutness, subject facets • People – tagging, evaluating tags, fine-tuning rules and taxonomy
Essentials of Facets • Facets are not categories • Categories are what a document is about – limited number • Entities are contained within a document – any number • Facets are orthogonal – mutually exclusive – dimensions • An event is not a person is not a document is not a place. • Facets – variety – of units, of structure • Numerical range (price), Location – big to small • Alphabetical, Hierarchical – taxonomic • Facets are designed to be used in combination • Wine where color = red, price = excessive, location = California, and sentiment = snotty
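The wine example can be sketched in code as facets used in combination. This is a minimal illustration, not the presenter's implementation: the `Wine` record, the `facet_filter` helper, the sample data, and the reading of "excessive" as a price of 100 or more are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Wine:
    # Hypothetical record: each field is one orthogonal facet.
    name: str
    color: str
    price: float
    location: str


def facet_filter(items, **selected):
    """Keep items that match every selected facet.

    A facet value may be a literal (exact match) or a predicate,
    which is how a numerical range facet such as price is expressed.
    """
    def matches(item):
        for facet, wanted in selected.items():
            value = getattr(item, facet)
            ok = wanted(value) if callable(wanted) else value == wanted
            if not ok:
                return False
        return True

    return [item for item in items if matches(item)]


# Made-up sample data for illustration only.
wines = [
    Wine("A", "red", 120.0, "California"),
    Wine("B", "red", 15.0, "California"),
    Wine("C", "white", 150.0, "California"),
    Wine("D", "red", 200.0, "Oregon"),
]

# "Excessive" is assumed here to mean a price of 100 or more.
hits = facet_filter(
    wines,
    color="red",
    location="California",
    price=lambda p: p >= 100,
)
```

Each facet narrows the result independently of the others, which is what lets users apply them in any order.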
Advantages of Faceted Navigation • More intuitive – easy to guess what is behind each door • Simplicity of internal organization • 20 questions – we know and use • Dynamic selection of categories • Allow multiple perspectives • Ability to Handle Compound Subjects • Systematic Advantages – fewer elements • 4 facets of 10 nodes each = the coverage of a 10,000-node taxonomy • Flexible – can be combined with other navigation elements
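The "fewer elements" arithmetic can be checked with a short sketch. It assumes four facets of ten values each, with placeholder facet and value names; the point is that 40 maintained nodes yield 10 × 10 × 10 × 10 = 10,000 distinct combinations.

```python
from itertools import product

# Placeholder facets: 4 facets with 10 values each (names are illustrative).
facets = {
    f"facet_{i}": [f"value_{i}_{j}" for j in range(10)]
    for i in range(1, 5)
}

# Nodes someone has to define and maintain: 4 * 10.
nodes_to_maintain = sum(len(values) for values in facets.values())

# Distinct compound subjects a user can express: 10 ** 4.
combinations = sum(1 for _ in product(*facets.values()))
```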
Developing Facets: Tools and Techniques. Software Tools – Entity Extraction • Dictionaries – variety of entities, coverage, specialty • Cost of update – service or in-house • Inxight – 50+ predefined entity types • Nstein – 800,000 people, 700,000 locations, 400,000 organizations • Rules • Capitalization, text – Mr., Inc. • Advanced – proximity and frequency of actions, associations • Need people to continually refine the rules • Entities and Categorization • Total number and pattern of entities = a type of aboutness of the document – Bar Code, Fingerprint • SAS – integration of entities (concepts) and categorization
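The simple rule type mentioned above (capitalization plus text cues such as "Mr." and "Inc.") might look like the following sketch. The two regular expressions and the `extract_entities` helper are hypothetical and far cruder than what commercial tools such as those named above do; they only illustrate the idea.

```python
import re

# Hypothetical rule: a title cue followed by one or more capitalized words.
PERSON = re.compile(
    r"\b(?:Mr|Mrs|Ms|Dr)\.\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)"
)

# Hypothetical rule: capitalized words followed by a corporate suffix cue.
ORGANIZATION = re.compile(
    r"\b([A-Z][A-Za-z]+(?:\s+[A-Z][A-Za-z]+)*)\s+(?:Inc|Corp|Ltd)\."
)


def extract_entities(text):
    """Return candidate entities grouped by the facet they would feed."""
    return {
        "person": PERSON.findall(text),
        "organization": ORGANIZATION.findall(text),
    }


sample = "Mr. Henry Ford founded Ford Motor Company Inc. in Detroit."
entities = extract_entities(sample)
```

Rules this simple miss any name without a cue word, which is one reason the slide stresses that people must keep refining the rules.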
Three Environments • E-Commerce • Catalogs, small uniform collections of entities • Uniform behavior – buy this • Enterprise • More content, more types of content • Enterprise Tools – Search, ECM • Publishing Process – tagging, metadata standards • Internet • Wildly different amount and type of content, no taggers • General Purpose – Flickr, Yahoo • Vertical Portal – selected content, no taggers
Enterprise Environment – When and how to add metadata • Enterprise Content – different world than eCommerce • More Content, more kinds, more unstructured • Not a catalog to start – less metadata and structured content • Complexity – not just content but variety of users and activities • Combination of human and automatic metadata – ECM • Software-aided – suggestions, entities, ontologies • Enterprise – Question of Balance / strategy • More facets = more findability (up to a point) • Fewer facets = lower cost to tag documents • Issues • Not enough facets • Wrong set of facets – business not information • Ill-defined facets – too complex internal structure
Facets and Taxonomies: Enterprise Environment – Taxonomy, 7 facets • Taxonomy of Subjects / Disciplines: • Science > Marine Science > Marine microbiology > Marine toxins • Facets: • Organization > Division > Group • Clients > Federal > EPA • Instruments > Environmental Testing > Ocean Analysis > Vehicle • Facilities > Division > Location > Building X • Methods > Social > Population Study • Materials > Compounds > Chemicals • Content Type – Knowledge Asset > Proposals
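Hierarchical paths such as "Clients > Federal > EPA" are commonly handled by rolling counts up to every ancestor, so a document tagged at a leaf also appears under its parents. The sketch below assumes the " > " separator used on the slide; `parse_path`, `rollup_counts`, and the sample tags are illustrative names and data, not part of the original deck.

```python
from collections import Counter


def parse_path(path):
    """Split 'A > B > C' into the tuple ('A', 'B', 'C')."""
    return tuple(part.strip() for part in path.split(">"))


def rollup_counts(tagged_paths):
    """Count documents at every level of each hierarchical facet.

    A document tagged 'Clients > Federal > EPA' is counted under
    'Clients', 'Clients > Federal', and 'Clients > Federal > EPA'.
    """
    counts = Counter()
    for path in tagged_paths:
        nodes = parse_path(path)
        for depth in range(1, len(nodes) + 1):
            counts[nodes[:depth]] += 1
    return counts


# Made-up tags, one per document, reusing paths from the slide.
tags = [
    "Clients > Federal > EPA",
    "Clients > Federal",
    "Science > Marine Science > Marine microbiology > Marine toxins",
]
counts = rollup_counts(tags)
```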
External Environment – Text Mining, Vertical Portals • Internet Content • Scale – impacts design and technology – speed of indexing • Limited control – ranges from an association of publishers, to selection of content, to none • Major subtypes – different rules – metadata and results • Complex queries and alerts • Terrorism taxonomy + geography + people + organizations • Text Mining • General or specific content and facets and categories • Dedicated tools or component of Portal – internal or external • Vertical Portal • Relatively homogeneous content and users • General range of questions
Internet Design • Subject Matter taxonomy – Business Topics • Finance > Currency > Exchange Rates • Facets • Location > Western World > United States • People – Alphabetical and/or Topical - Organization • Organization > Corporation > Car Manufacturing > Ford • Date – Absolute or range (1-1-01 to 1-1-08, last 30 days) • Publisher – Alphabetical and/or Topical – Organization • Content Type – list – newspapers, financial reports, etc.
Integrated Facet Application: Design Issues – General • What is the right combination of elements? • Faceted navigation, metadata, browse, search, categorized search results, file plan • What is the right balance of elements? • Dominant dimension or equal facets • Browse topics and filter by facet • When to combine search, topics, and facets? • Search first and then filter by topics / facet • Browse/facet front end with a search box
Integrated Facet Application: Design Issues – General • Homogeneity of Audience and Content • Model of the Domain – broad • How many facets do you need? • More facets and let users decide • Allow for customization – can’t define a single set • User Analysis – tasks, labeling, communities • Issue – the labels people use to describe their business differ from the labels they use to find information • Match the structure to domain and task • Users can understand different structures
Automatic Facets – Special Issues • Scale requires more automated solutions • More sophisticated rules • Rules to find and populate existing metadata • Variety of types of existing metadata – Publisher, title, date • Multiple implementation Standards – Last Name, First / First Name, Last • Issue of disambiguation: • Same person, different name – Henry Ford, Mr. Ford, Henry X. Ford • Same word, different entity – Ford and Ford • Number of entities and thresholds per results set / document • Usability, audience needs • Relevance Ranking – number of entities, rank of facets
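The "same person, different name" and "Last Name, First / First Name, Last" issues can be illustrated with a small normalization sketch. The title list and the matching rule (same last name, and first names equal when both are present) are assumptions chosen for the example; real disambiguation needs context, and this sketch does nothing for the "Ford and Ford" case of one word naming different entities.

```python
# Hypothetical list of title tokens to ignore.
TITLES = {"mr", "mrs", "ms", "dr"}


def normalize_name(raw):
    """Return (first, last) in lowercase; first is None if absent."""
    raw = raw.strip()
    if "," in raw:
        # "Last, First" becomes "First Last".
        last, first = [part.strip() for part in raw.split(",", 1)]
        raw = f"{first} {last}"
    tokens = [token.strip(".").lower() for token in raw.split()]
    tokens = [token for token in tokens if token and token not in TITLES]
    if not tokens:
        return (None, None)
    first = tokens[0] if len(tokens) > 1 else None
    return (first, tokens[-1])


def same_person(a, b):
    """Heuristic: last names match and first names do not conflict."""
    first_a, last_a = normalize_name(a)
    first_b, last_b = normalize_name(b)
    if last_a is None or last_a != last_b:
        return False
    return first_a is None or first_b is None or first_a == first_b
```

Under these assumptions "Henry Ford", "Mr. Ford", "Henry X. Ford", and "Ford, Henry" all resolve to the same candidate, while "Gerald Ford" does not match "Henry Ford".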
Putting it all together – Infrastructure Solution • Facets, Taxonomies, Software, People • Combine formal power with ability to support multiple user perspectives • Facet System – interdependent, map of domain • Entity extraction – feeds facets, signatures, ontologies • Taxonomy & Auto-categorization – aboutness, subject • People – tagging, evaluating tags, fine-tuning rules and taxonomy • The future is the combination of simple facets with rich taxonomies with complex semantics / ontologies
Putting it all together – Infrastructure Solution • Integration with ECM • Central Team • Metadata – Create dictionaries of entities • Develop text analytics catalogs • Publishing Process • Software suggests entities, categorization • Author's task is simple – answer yes or no rather than think of keywords • Enterprise Search • Integrate at metadata level – build advanced presentation and refine results • Integrate into relevance
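The publishing workflow above (software suggests, author answers yes or no) can be sketched as two small steps. Everything here is hypothetical: the keyword rules, the `min_hits` threshold, and the function names are stand-ins for whatever categorization rules a real text analytics tool would supply.

```python
import re

# Hypothetical categorization rules: category name -> trigger terms.
RULES = {
    "Marine Science": {"marine", "ocean", "toxins"},
    "Finance": {"currency", "exchange", "rates"},
}


def suggest_categories(text, rules=RULES, min_hits=2):
    """Suggest categories whose trigger terms appear at least min_hits times.

    Counts distinct matching terms, strongest suggestion first.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    scored = [(len(words & terms), category) for category, terms in rules.items()]
    return [
        category
        for hits, category in sorted(scored, reverse=True)
        if hits >= min_hits
    ]


def confirm(suggestions, decisions):
    """Keep only the suggestions the author answered 'yes' to."""
    return [category for category in suggestions if decisions.get(category, False)]


draft = "Ocean analysis of marine toxins near the exchange."
suggested = suggest_categories(draft)
accepted = confirm(suggested, {"Marine Science": True})
```

The author never has to invent a keyword; the only input is a yes or no per suggestion.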
Text Analytics Platform – Multiple Applications • Platform for Information Applications • Content Aggregation • Duplicate Documents – save millions! • Text Mining – BI, CI – sentiment analysis • Social – Hybrid folksonomy / taxonomy / auto-metadata • Social – expertise, categorize tweets and blogs, reputation • Ontology – travel assistant – SIRI • Use your Imagination!
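Duplicate-document detection, one of the applications listed above, is often approached with a fingerprint for exact copies and a shingle overlap score for near copies. The sketch below uses that general technique with Python's standard library; it is not taken from the presentation, and the shingle size of three words is an arbitrary choice.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse punctuation and whitespace."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def fingerprint(text):
    """Hash of the normalized text; equal hashes flag exact duplicates."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def shingles(text, k=3):
    """Set of overlapping k-word sequences from the normalized text."""
    words = normalize(text).split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def jaccard(a, b):
    """Share of shingles two texts have in common (0.0 to 1.0)."""
    set_a, set_b = shingles(a), shingles(b)
    return len(set_a & set_b) / len(set_a | set_b)


doc_a = "The quarterly report covers marine toxins in coastal waters"
doc_b = "The quarterly report covers marine toxins in coastal areas"
similarity = jaccard(doc_a, doc_b)
```

A threshold on the similarity score (chosen per collection) then decides which pairs are treated as near duplicates.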
Text Analytics Platform – Multiple Applications • SIRI – Travel Assistant
Questions? Tom Reamy, tomr@kapsgroup.com, KAPS Group, Knowledge Architecture Professional Services, http://www.kapsgroup.com