Knowledge Retrieval Taxonomies & Auto-Categorization

Tom Reamy, Knowledge Architect, Intranet Consultant. Topics: Taxonomy: What, Why, How?; Taxonomy and Auto-Categorization; Approaches and Companies; Applied Taxonomies: Content Management, Search; Future Directions.


Presentation Transcript


  1. Knowledge Retrieval Taxonomies & Auto-Categorization • Tom Reamy • Knowledge Architect • Intranet Consultant

  2. Knowledge Retrieval • Taxonomy: What, Why, How? • Taxonomy and Auto-Categorization • Approaches and Companies • Applied Taxonomies: • Content Management, Search • Future Directions • Information Retrieval to Knowledge Retrieval

  3. Taxonomy: What • What is a Taxonomy? • Organization: Hierarchical, web, etc. • Card Catalog, Yahoo • Creates a context within which facts are related • Find, Identify, Describe information, relations, context.

  4. Taxonomy: What • Is this a Taxonomy? • Things that begin with the letter A • Things that have 4 legs • Things that are used to write with • Fantasy Animals • Large Orange Objects • Objects used by non-humans for undisclosed purposes. • Jorge Luis Borges

  5. Taxonomy: What • What makes a good taxonomy? • The Library of Congress catalog? • No. Not unless your intranet contains as much information as the LC. • An understandable organization of content that enables people to find information and which supports knowledge discovery.

  6. Taxonomy: Why • Search Stinks • Professionals spend more time looking for information than using it. • Solution: Browse and Search • Need a Taxonomy • It ain’t easy, so why do it?

  7. Taxonomy: Why • Cost of poor Search and Content Management • If it’s not organized, you can’t find it. • If you can’t find it, you can’t use it. • If you can’t find it, you waste a lot of time. • If you can’t find it, you could lose an account. • If you can’t find it, you could look stupid. • If you can’t find it, it doesn’t exist.

  8. Taxonomy: Why • How does a Taxonomy improve Search and Content Management? • Browse and Search works better than Search • ecommerce - 56% of all searches fail = lost income • Intranet - lost time, lost business, lost ideas • Improved Publishing Model: By category, not department • Rich semantic web of concepts, not an unstructured collection of documents

  9. Taxonomy: Why • How does Content Management improve Taxonomies? • CM supports intelligent distributed categorization: • Work Flow: Central and local • Multiple roles: IA, SME, author, editor • CM supports automatic meta data and categorization

  10. Taxonomy: How • Old Answer: Manual • Hire a bunch of librarians and IAs • Costly, difficult to maintain • New Answer: • Cyborg: Manual and Automatic Categorization • Integrate Content Management and Taxonomy • Integrate central IAs and local authors

  11. Automatic vs. Humanatic • Humans are better, but not as consistent • General bin, understandable mistakes • Bring outside contexts to the document • Purpose, similar documents, common sense • Computers are faster and cheaper. • Faster yes, Cheaper ? • Cost of poorer quality categorization • Intranet: 20,000 users taking 60 seconds longer = $20,000 a week
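The $20,000-a-week figure on this slide can be reproduced with simple arithmetic. The sketch below is one way to get there: the 20,000 users and 60 seconds come from the slide, while the $60 per hour loaded labor rate and the "one slower lookup per user per week" reading are assumptions chosen because they match the slide's total.

```python
def weekly_cost_of_slow_search(users, extra_seconds_per_user, hourly_rate):
    """Cost of time lost when each user spends extra_seconds_per_user more per week."""
    # Multiply first, divide last, so the result stays exact for round inputs.
    return users * extra_seconds_per_user * hourly_rate / 3600


# users and extra_seconds_per_user are from the slide;
# hourly_rate=60.0 is an ASSUMED loaded cost per employee hour.
cost = weekly_cost_of_slow_search(users=20_000, extra_seconds_per_user=60, hourly_rate=60.0)
print(f"${cost:,.0f} per week")  # prints "$20,000 per week"
```

If users lose a minute more than once a week, or the hourly rate differs, the cost scales linearly with either input.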

  12. News Feeds - Corporate Intranets • News Feeds and Content providers • uniform content, size and structure • professional writers • Simple or standard vocabulary • Corporate intranet • Wildly varied content • Mix of good, bad, and ugly writers • Tower of Babel: Acronyms, special meanings

  13. Auto-Categorization: the How • Automatic Methods • Catalog by Example • Training Sets (5-500) • Bag of Words or language and concepts • Statistical Clustering • Set of Documents & Taxonomy Level • Semi-Automatic: Rules
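To make "Catalog by Example" with a "Bag of Words" concrete, here is a minimal sketch. It is not any vendor's algorithm: it simply sums word counts from a few example documents per category and assigns a new document to the category whose word profile is closest by cosine similarity. The category names and example texts are made up for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphabetic words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count bags."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def train(examples):
    """examples maps category -> list of example texts; returns category -> summed bag."""
    profiles = {}
    for category, texts in examples.items():
        profile = Counter()
        for text in texts:
            profile.update(bag_of_words(text))
        profiles[category] = profile
    return profiles


def categorize(text, profiles):
    """Return the best-matching category and the score for every category."""
    bag = bag_of_words(text)
    scores = {category: cosine(bag, profile) for category, profile in profiles.items()}
    return max(scores, key=scores.get), scores


# Hypothetical training set: two tiny examples per category.
training_set = {
    "HR / Benefits": [
        "health insurance enrollment and benefits plan",
        "vacation policy and employee benefits",
    ],
    "Training": [
        "online course schedule for new trainers",
        "training class registration and course catalog",
    ],
}
profiles = train(training_set)
category, scores = categorize("How do I enroll in the health insurance plan?", profiles)
print(category)  # prints "HR / Benefits"
```

A real training set would sit in the 5 to 500 documents per category range the slide mentions. Note that this sketch does not match "enroll" to "enrollment", which is the kind of gap the "language and concepts" approaches aim to close.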

  14. Auto-Categorization: the How • Next Generation • Support Vector Machines • Machine Learning • World Knowledge • Incremental Improvement • From 75% to 85% • Critical Issue: Integration

  15. Categorization Explosion • Autonomy • Semio • Verity • Inxight • Topical Net • Mohomine • Simile • H5Technologies • YellowBrix • GammaSite • MetaTagger • Applied Semantics • Sageware • SmartLogik • Quiver • Stratify • Vivisimo • Other - Tacit

  16. Auto-Categorization: Features • The Categorization Algorithm • SVM – Vector space is an improvement • Higher Accuracy • Fewer documents for training set • White Box – customize recall & precision • Categorize multiple file types & sizes • Clustering – Taxonomy Builder
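"Customize recall & precision" usually comes down to choosing how confident the categorizer must be before it assigns a category. The sketch below assumes a categorizer that emits a score per document for one category. The scores and relevance labels are invented, and the function is an illustration of the trade-off, not a feature of any product listed earlier.

```python
def precision_recall(scored, threshold):
    """scored: list of (score, is_relevant) pairs for a single category."""
    assigned = [relevant for score, relevant in scored if score >= threshold]
    true_positives = sum(assigned)
    total_relevant = sum(relevant for _, relevant in scored)
    # By convention here, an empty assignment counts as perfectly precise.
    precision = true_positives / len(assigned) if assigned else 1.0
    recall = true_positives / total_relevant if total_relevant else 1.0
    return precision, recall


# Hypothetical categorizer output: (score, document truly belongs to the category).
scored = [(0.9, True), (0.8, True), (0.6, False), (0.5, True), (0.3, False), (0.2, False)]

strict = precision_recall(scored, threshold=0.7)  # fewer assignments, all of them correct
loose = precision_recall(scored, threshold=0.4)   # every relevant document found, one wrong
print(strict, loose)
```

Raising the threshold favors precision (fewer wrong assignments); lowering it favors recall (fewer missed documents). Which to prefer depends on whether a miscategorized or a missing document costs users more.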

  17. Auto-Categorization: Features • Support Distributed Activities • Distributed work flow: authors, subject matter experts, information architects • Provisional categorization, keywords, meta data • Automatic summarization • Ease of Use, Integration with CM and Search • Integrate with Rules, Meta Data • Content to Context

  18. Auto-Categorization: Features • Platform for Knowledge Retrieval • World Knowledge • Pre-Built Categories • Rich Semantic Net (WordNet+) • Entity Extraction • Integration • Specialized Audiences & Vocabularies • Content, Expertise, Communities, Activities

  19. The Answer is Cyborg • Automatic Categorization is Not. • Professional Services: Initial Taxonomy • Cyborg: Human and Automatic Integration • Distributed Work Flow • Cyborg Integration with Content Management, Search
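One way to picture the "Cyborg" work flow is as a routing rule: the software proposes a provisional category, confident proposals are accepted automatically, and the rest go to a human (author, SME, or information architect) for review. The sketch below is a hypothetical illustration; the `Suggestion` type, the `route` function, and the 0.8 cut-off are assumptions, not part of the presentation.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    doc_id: str
    category: str
    confidence: float  # categorizer's score in [0, 1]


def route(suggestions, auto_accept=0.8):
    """Split provisional categorizations into auto-accepted and human-review lists."""
    accepted, review_queue = [], []
    for suggestion in suggestions:
        if suggestion.confidence >= auto_accept:
            accepted.append(suggestion)
        else:
            review_queue.append(suggestion)
    return accepted, review_queue


# Hypothetical output from an automatic categorizer.
suggestions = [
    Suggestion("doc-1", "HR / Benefits", 0.95),
    Suggestion("doc-2", "Products", 0.60),
]
accepted, review_queue = route(suggestions)
print([s.doc_id for s in accepted], [s.doc_id for s in review_queue])
```

Lowering `auto_accept` shifts work from people to the software at the cost of more uncorrected mistakes, which is the trade-off the "Automatic vs. Humanatic" slide describes.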

  20. Content Management and Taxonomy • Taxonomic Publishing Model • Publish by Category, not web site • Web site the wrong unit of organization • 10 pages to 10,000 pages • 10 users to 20,000 users • 1 activity to 100’s of activities

  21. Content Management and Taxonomy • Content Re-Organization • Support Browse by Topic, Type, Task • Rich Web of Related Content • Product information • Basic Info + background contexts • Legal / Policy contexts • Technical Contexts • Customer / Task contexts

  22. Content Management and Taxonomy • Content Re-Organization: Next Steps • Document can be wrong unit of organization • Information / Learning objects • XML based objects: reuse, combine (relations and contexts) in more flexible and sophisticated ways.

  23. Content Management: Re-organize Authoring • Streamline Authoring • Minimize IT / Web Developer Bottleneck • Integrated Work Flow & Categorization • Central: Librarian and/or Information Architects • Distributed: content owners, authors, SMEs • Distributed Categorization, Meta Data

  24. Applied Taxonomy: Search • Intranet Environments • Case Studies: • Meta Data • Browse / Search Model

  25. Intranet Environments • Global, Distributed • Variety of Documents, People, Activities • 100’s independent Web Sites • Documents, Databases, Applications © 2001 Charles Schwab & Co., Inc., member NYSE/SIPC. All rights reserved. (0401-6450)

  26. Meta Data: Dublin Core+ • Title • Description • Keywords • Creator • Publisher • ContentType • Audience • SectionName • Language • Contributor • Contributor.Technical • Date.Created • Date.Review • Format • Identifier • Rights

  27. Controlled Vocabularies • ContentType: Application, Calendar, Form, FAQ, Mission, Reference, Training • Audience • Function: Project Manager, Trainer • Enterprise: Retail, Technology • Role: Admin Assistant, Officer
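The meta data fields and controlled vocabularies on the last two slides suggest a simple check at authoring time: reject a record whose ContentType or Audience is not a term from the vocabulary. The sketch below assumes records are plain dictionaries. The vocabulary terms come from the slide, but the choice of required fields, and the reading of Function, Enterprise, and Role as group headings rather than terms, are assumptions.

```python
# Terms taken from the Controlled Vocabularies slide.
CONTROLLED = {
    "ContentType": {"Application", "Calendar", "Form", "FAQ", "Mission", "Reference", "Training"},
    # ASSUMPTION: Function / Enterprise / Role are headings, so only their members are terms.
    "Audience": {"Project Manager", "Trainer", "Retail", "Technology", "Admin Assistant", "Officer"},
}

# ASSUMPTION: which fields are mandatory is not stated in the slides.
REQUIRED = ("Title", "Creator", "ContentType")


def validate(record):
    """Return a list of problems with a meta data record; an empty list means it passes."""
    errors = []
    for field in REQUIRED:
        if not record.get(field):
            errors.append(f"missing {field}")
    for field, allowed in CONTROLLED.items():
        value = record.get(field)
        if value is not None and value not in allowed:
            errors.append(f"{field}: {value!r} not in controlled vocabulary")
    return errors


good = {"Title": "Expense Report", "Creator": "Finance", "ContentType": "Form", "Audience": "Officer"}
bad = {"Title": "Expense memo", "ContentType": "Memo"}
print(validate(good))
print(validate(bad))
```

Running the same check in the content management work flow, before a document is published, is one way to keep distributed authors consistent with the central vocabulary.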

  28. First Generation Browse Taxonomy • News • Education & Training • HR / Benefits • Employee Services & Programs • Departments • Communities • Tools, Forms, Calendars • How To / FAQ’s • Products • Reference & Resources

  29. Future Directions • Extending Taxonomies • Richer World Knowledge • Smarter Learning • Additional Content: Databases, Word Docs on network drive, Email • Integration of external content

  30. Future Directions • Integration: Creation to Retrieval • Collaborative Filtering and Categorization • Integration throughout the Enterprise • People, Communities, Expertise • Contextualizing content • Related topics and related contexts • Categories for Stories
