Automatic Facets: Faceted Navigation and Entity Extraction. Tom Reamy, Chief Knowledge Architect, KAPS Group (Knowledge Architecture Professional Services), http://www.kapsgroup.com
Agenda • Introduction: Elements • Facets, Taxonomies, Software, People • 3 Environments • E-Commerce, Enterprise, Internet • Design Issues – Facets and Entities • Conclusion – Integrated Solution
KAPS Group: General • Knowledge Architecture Professional Services • Virtual Company: Network of consultants – 12-15 • Partners – Inxight, FAST, etc. • Consulting, Strategy, Knowledge architecture audit • Taxonomies: Enterprise, Marketing, Insurance, etc. • Services: • Taxonomy development, consulting, customization • Technology Consulting – Search, CMS, Portals, etc. • Metadata standards and implementation • Knowledge Management: Collaboration, Expertise, e-learning • Applied Theory – Faceted taxonomies, complexity theory, natural categories
Elements • Facet – orthogonal dimension of metadata • Entity / Noun Phrase – metadata value of a facet • Entity extraction – feeds facets, signatures, ontologies • Taxonomy and categorization rules • Auto-categorization – aboutness, subject facets • People – tagging, evaluating tags, fine-tuning rules and taxonomy
Essentials of Facets • Facets are not categories • Categories are what a document is about – limited number • Entities are contained within a document – any number • Facets are orthogonal – mutually exclusive – dimensions • An event is not a person is not a document is not a place. • Facets – variety – of units, of structure • Numerical range (price), Location – big to small • Alphabetical, Hierarchical – taxonomic • Facets are designed to be used in combination • Wine where color = red, price = excessive, location = California, and sentiment = snotty
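The wine example above can be sketched in code. This is a minimal illustration, not anything from the talk: the catalogue, the `select` helper, and the price range standing in for "excessive" are all hypothetical.

```python
# Hypothetical catalogue: each record carries one value per facet.
WINES = [
    {"name": "Wine A", "color": "red", "price": 85.0, "location": "California"},
    {"name": "Wine B", "color": "white", "price": 12.0, "location": "California"},
    {"name": "Wine C", "color": "red", "price": 15.0, "location": "Oregon"},
    {"name": "Wine D", "color": "red", "price": 120.0, "location": "California"},
]


def select(records, **facets):
    """Keep the records that satisfy every facet constraint.

    A constraint is either an exact value or a (low, high) tuple,
    which is treated as an inclusive numeric range.
    """
    def matches(record, facet, wanted):
        value = record.get(facet)
        if isinstance(wanted, tuple):
            low, high = wanted
            return value is not None and low <= value <= high
        return value == wanted

    return [r for r in records
            if all(matches(r, f, w) for f, w in facets.items())]


# Facets combine freely: each one narrows the set independently of the others.
result = select(WINES, color="red", location="California", price=(50.0, 1000.0))
names = [r["name"] for r in result]
print(names)
```

Because each facet is an independent dimension, a new constraint is just one more keyword argument, with no change to the other facets.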
Advantages of Faceted Navigation • More intuitive – easy to guess what is behind each door • Simplicity of internal organization • 20 questions – we know and use • Dynamic selection of categories • Allow multiple perspectives • Ability to Handle Compound Subjects • Systematic Advantages – fewer elements • 4 facets of 10 nodes each (40 nodes) = 10,000-node taxonomy • Flexible – can be combined with other navigation elements
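The "fewer elements" arithmetic can be checked directly. In this small sketch the facet and value names are placeholders. It counts the nodes that must be maintained and the compound combinations they can express.

```python
from itertools import product

# Placeholder facets: 4 independent facets with 10 values each.
facets = {
    f"facet_{i}": [f"value_{i}_{j}" for j in range(10)]
    for i in range(4)
}

# Nodes someone has to define and maintain: 10 + 10 + 10 + 10.
nodes_to_maintain = sum(len(values) for values in facets.values())

# Compound subjects expressible by picking one value per facet: 10 ** 4.
combinations = sum(1 for _ in product(*facets.values()))

print(nodes_to_maintain, combinations)
```

A single pre-coordinated taxonomy would need one node per combination to cover the same ground.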
Essentials of Taxonomies: Internal Organization • Formal Taxonomy – parent-child relationship • Is-A-Kind-Of – Animal > Mammal > Zebra • Partonomy – Is-A-Part-Of – US > California > Oakland • Browse Classification – cluster of related concepts • Food and Dining – Catering – Restaurants • Taxonomies deal with complex, not compound • Conceptual relationships – category membership • Contextual relationships – Computers & Software • Taxonomies deal with semantics & documents • Multiple meanings and purposes • Essential attributes of documents are not single value
Developing Facets: Tools and Techniques – Software Tools • Text Analytics – Taxonomy management, entity extraction, categorization, sentiment • Search – Integrated features, at index, Internet sources • CM – Enterprise environment, taggers and policy • Programmable Rules • Business and Subject matter expertise • Auto-populate variety of metadata – author, title, date, etc. • Relevance – best bets to weights and classes of documents • People – refine, monitor – it’s not automatic
Developing Facets: Tools and Techniques – Software Tools – Auto-categorization • Auto-categorization • Training sets – Bayesian, Vector Machine • Terms – literal strings, stemming, dictionary of related terms • Rules – simple – position in text (Title, body, url) • Advanced – saved search queries (full search syntax) • NEAR, SENTENCE, PARAGRAPH • Boolean – X NEAR Y and Not-Z • Advanced Features • Facts / ontologies / Semantic Web – RDF + • Sentiment Analysis – positive, negative, neutral
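A rule of the form "X NEAR Y and Not-Z" can be sketched as follows. The word-window definition of NEAR, the `window` default, and the sample documents are assumptions for illustration. Commercial categorization tools define proximity operators in their own ways.

```python
import re


def tokens(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def near(words, x, y, window=5):
    """True if some occurrence of x is within `window` word positions of some y."""
    xs = [i for i, w in enumerate(words) if w == x]
    ys = [i for i, w in enumerate(words) if w == y]
    return any(abs(i - j) <= window for i in xs for j in ys)


def rule_matches(text, x, y, z, window=5):
    """Evaluate the rule: x NEAR y AND NOT z."""
    words = tokens(text)
    return near(words, x, y, window) and z not in words


doc1 = "The marine toxins study sampled coastal water."
doc2 = ("Marine biology notes. Much later, with no connection at all, "
        "someone mentions household toxins.")
doc3 = "Marine toxins in the aquarium trade."

for doc in (doc1, doc2, doc3):
    print(rule_matches(doc, "marine", "toxins", "aquarium", window=3))
```

Only the first document satisfies the rule: in the second the two terms are too far apart, and the third contains the excluded term.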
Developing Facets: Tools and Techniques – Software Tools – Entity Extraction • Dictionaries – variety of entities, coverage, specialty • Cost of update – service or in-house • Inxight – 50+ predefined entity types • Nstein – 800,000 people, 700,000 locations, 400,000 organizations • Rules • Capitalization, text – Mr., Inc. • Advanced – proximity and frequency of actions, associations • Need people to continually refine the rules • Entities and Categorization • Total number and pattern of entities = a type of aboutness of the document – Bar Code, Fingerprint
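The simple rule style mentioned above (capitalization plus cue text such as "Mr." or "Inc.") can be sketched with two regular expressions. The patterns, the cue lists, and the sample sentence are illustrative assumptions. They are not the rules used by Inxight, Nstein, or any other product.

```python
import re

# A title cue followed by one or more capitalized words suggests a person.
PERSON = re.compile(
    r"\b(?:Mr|Mrs|Ms|Dr)\.\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)"
)

# Capitalized words followed by a corporate suffix suggest an organization.
ORGANIZATION = re.compile(
    r"\b([A-Z][A-Za-z]+(?:\s+[A-Z][A-Za-z]+)*),?\s+(?:Inc|Corp|Ltd)\."
)


def extract_entities(text):
    """Return the candidate entities found by each rule, keyed by facet."""
    return {
        "person": PERSON.findall(text),
        "organization": ORGANIZATION.findall(text),
    }


sample = "Yesterday Mr. Henry Ford met executives from Acme Widgets Inc. in Detroit."
entities = extract_entities(sample)
print(entities)
```

"Detroit" is capitalized but carries no cue, so these rules miss it. Gaps like this are why dictionaries and continual human refinement of the rules are needed.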
Elements: People • Programmers, Librarians, Taxonomists, Metadata specialists • Integrate, design, develop rules, monitor activity & quality • Authors, Subject Matter Experts • Input into design (important facets), rules, activity meaning • Users – Web 2.0 • Feedback – quality and usability • Suggestions – missing terms, bad categorization & entities • Tag clouds & folksonomy – for social networking features, not for information retrieval
Three Environments • E-Commerce • Catalogs, small uniform collections of entities • Uniform behavior – buy this • Enterprise • More content, more types of content • Enterprise Tools – Search, ECM • Publishing Process – tagging, metadata standards • Internet • Wildly different amount and type of content, no taggers • General Purpose – Flickr, Yahoo • Vertical Portal – selected content, no taggers
Enterprise Environment – When and how to add metadata • Enterprise Content – different world than eCommerce • More Content, more kinds, more unstructured • Not a catalog to start – less metadata and structured content • Complexity – not just content but variety of users and activities • Combination of human and automatic metadata – ECM • Software aided – suggestions, entities, ontologies • Enterprise – Question of Balance / strategy • More facets = more findability (up to a point) • Fewer facets = lower cost to tag documents • Issues • Not enough facets • Wrong set of facets – business not information • Ill-defined facets – too complex internal structure
Facets and Taxonomies: Enterprise Environment – Case One – Taxonomy, 7 facets • Taxonomy of Subjects / Disciplines: • Science > Marine Science > Marine microbiology > Marine toxins • Facets: • Organization > Division > Group • Clients > Federal > EPA • Instruments > Environmental Testing > Ocean Analysis > Vehicle • Facilities > Division > Location > Building X • Methods > Social > Population Study • Materials > Compounds > Chemicals • Content Type – Knowledge Asset > Proposals
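Hierarchical facet values like the ones in this case can be handled as paths, so that selecting a parent node also retrieves content tagged with its descendants. In the sketch below the document IDs and their tags are hypothetical. Only the path notation comes from the slide.

```python
def parse_path(path):
    """Split 'A > B > C' into a tuple of node labels."""
    return tuple(part.strip() for part in path.split(">"))


def is_under(tagged, selected):
    """True if the tagged path equals the selected node or lies beneath it."""
    return tagged[:len(selected)] == selected


# Hypothetical documents, each tagged with one hierarchical facet value.
docs = {
    "proposal-1": parse_path("Clients > Federal > EPA"),
    "report-7": parse_path("Clients > Federal"),
    "memo-3": parse_path("Facilities > Division > Location > Building X"),
}

selected = parse_path("Clients > Federal")
hits = sorted(doc for doc, path in docs.items() if is_under(path, selected))
print(hits)
```

Storing the full path rather than only the leaf label keeps nodes with the same label in different branches distinct.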
External Environment – Text Mining, Vertical Portals • Internet Content • Scale – impacts design and technology – speed of indexing • Limited control – ranges from an association of publishers, to selection of content, to none • Major subtypes – different rules – metadata and results • Complex queries and alerts • Terrorism taxonomy + geography + people + organizations • Text Mining • General or specific content and facets and categories • Dedicated tools or component of Portal – internal or external • Vertical Portal • Relatively homogeneous content and users • General range of questions
Internet Design • Subject Matter taxonomy – Business Topics • Finance > Currency > Exchange Rates • Facets • Location > Western World > United States • People – Alphabetical and/or Topical - Organization • Organization > Corporation > Car Manufacturing > Ford • Date – Absolute or range (1-1-01 to 1-1-08, last 30 days) • Publisher – Alphabetical and/or Topical – Organization • Content Type – list – newspapers, financial reports, etc.
Integrated Facet Application: Design Issues – General • What is the right combination of elements? • Faceted navigation, metadata, browse, search, categorized search results, file plan • What is the right balance of elements? • Dominant dimension or equal facets • Browse topics and filter by facet • When to combine search, topics, and facets? • Search first and then filter by topics / facet • Browse/facet front end with a search box
Integrated Facet Application: Design Issues – General • Homogeneity of Audience and Content • Model of the Domain – broad • How many facets do you need? • More facets and let users decide • Allow for customization – can’t define a single set • User Analysis – tasks, labeling, communities • Issue – the labels people use to describe their business vs. the labels they use to find information • Match the structure to domain and task • Users can understand different structures
Automatic Facets – Special Issues • Scale requires more automated solutions • More sophisticated rules • Rules to find and populate existing metadata • Variety of types of existing metadata – Publisher, title, date • Multiple implementation Standards – Last Name, First / First Name, Last • Issue of disambiguation: • Same person, different name – Henry Ford, Mr. Ford, Henry X. Ford • Same word, different entity – Ford and Ford • Number of entities and thresholds per results set / document • Usability, audience needs • Relevance Ranking – number of entities, rank of facets
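The "same person, different name" problem can be sketched as a normalization step. The title list and the `(first, last)` key are assumptions for illustration. A production system would need far more evidence than name shape.

```python
TITLES = {"mr", "mrs", "ms", "dr"}


def normalize(name):
    """Reduce a name to (first, last); first is '' when only a surname is given."""
    name = name.strip()
    if "," in name:
        # "Last, First" form: reorder to "First Last" before splitting.
        last, first = [part.strip() for part in name.split(",", 1)]
        name = f"{first} {last}"
    parts = [part.strip(".") for part in name.split()]
    parts = [part for part in parts if part and part.lower() not in TITLES]
    if not parts:
        return ("", "")
    first = parts[0] if len(parts) > 1 else ""
    return (first, parts[-1])


for variant in ("Henry Ford", "Ford, Henry", "Henry X. Ford", "Mr. Ford"):
    print(variant, "->", normalize(variant))
```

The first three variants collapse to the same key. "Mr. Ford" yields only a surname, which is where the "same word, different entity" problem begins: deciding whether a bare "Ford" is the person or the company needs context from the surrounding text, and this sketch does not attempt that.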
Putting it all together – Infrastructure Solution • Facets, Taxonomies, Software, People • Combine formal power with ability to support multiple user perspectives • Facet System – interdependent, map of domain • Entity extraction – feeds facets, signatures, ontologies • Taxonomy & Auto-categorization – aboutness, subject • People – tagging, evaluating tags, fine-tuning rules and taxonomy • The future is the combination of simple facets with rich taxonomies with complex semantics / ontologies
Questions? Tom Reamy, tomr@kapsgroup.com, KAPS Group (Knowledge Architecture Professional Services), http://www.kapsgroup.com