Knowledge Retrieval Taxonomies & Auto-Categorization. Tom Reamy Knowledge Architect Intranet Consultant. Knowledge Retrieval. Taxonomy: What, Why, How? Taxonomy and Auto-Categorization Approaches and Companies Applied Taxonomies: Content Management, Search Future Directions
Knowledge Retrieval Taxonomies & Auto-Categorization Tom Reamy Knowledge Architect Intranet Consultant
Knowledge Retrieval • Taxonomy: What, Why, How? • Taxonomy and Auto-Categorization • Approaches and Companies • Applied Taxonomies: • Content Management, Search • Future Directions • Information Retrieval to Knowledge Retrieval
Taxonomy: What • What is a Taxonomy? • Organization: Hierarchical, web, etc. • Card Catalog, Yahoo • Creates a context within which facts are related • Find, Identify, Describe information, relations, context.
Taxonomy: What • Is this a Taxonomy? • Things that begin with the letter A • Things that have 4 legs • Things that are used to write with • Fantasy Animals • Large Orange Objects • Objects used by non-humans for undisclosed purposes. • Jorge Luis Borges
Taxonomy: What • What makes a good taxonomy? • The Library of Congress catalog? • No. Not unless your intranet contains as much information as the LC. • An understandable organization of content that enables people to find information and which supports knowledge discovery.
Taxonomy: Why • Search Stinks • Professionals spend more time looking for information than using it. • Solution: Browse and Search • Need a Taxonomy • It ain’t easy, so why do it?
Taxonomy: Why • Cost of poor Search and Content Management • If it’s not organized, you can’t find it. • If you can’t find it, you can’t use it. • If you can’t find it, you waste a lot of time. • If you can’t find it, you could lose an account. • If you can’t find it, you could look stupid. • If you can’t find it, it doesn’t exist.
Taxonomy: Why • How does a Taxonomy improve Search and Content Management? • Browse and Search works better than Search • ecommerce - 56% of all searches fail = lost income • Intranet - lost time, lost business, lost ideas • Improved Publishing Model: By category, not department • Rich semantic web of concepts, not an unstructured collection of documents
Taxonomy: Why • How does Content Management improve Taxonomies? • CM supports intelligent distributed categorization: • Work Flow: Central and local • Multiple roles: IA, SME, author, editor • CM supports automatic meta data and categorization
Taxonomy: How • Old Answer: Manual • hire a bunch of librarians and IA’s • Costly, difficult to maintain • New Answer: • Cyborg: Manual and Automatic Categorization • Integrate Content Management and Taxonomy • Integrate central IA’s and local authors
Automatic vs. Humanatic • Humans are better, but not as consistent • General bin, understandable mistakes • Bring outside contexts to the document • Purpose, similar documents, common sense • Computers are faster and cheaper. • Faster, yes. Cheaper? • Cost of poorer quality categorization • Intranet: 20,000 users taking 60 seconds longer = $20,000 a week
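The $20,000-a-week figure can be reproduced with a quick calculation. This is only a sketch: the slide gives the 20,000 users and the 60 seconds, while the one delayed lookup per user per week and the $60 hourly labor cost are assumptions chosen because they make the numbers agree.

```python
def weekly_cost_of_slow_search(users, extra_seconds, lookups_per_week, hourly_cost):
    """Return the weekly dollar cost of the extra time spent finding information."""
    # Multiply first and divide last so the result stays exact for round inputs.
    return users * extra_seconds * lookups_per_week * hourly_cost / 3600

# users and extra_seconds come from the slide; the other two values are assumptions.
cost = weekly_cost_of_slow_search(
    users=20_000, extra_seconds=60, lookups_per_week=1, hourly_cost=60.0
)
print(f"${cost:,.0f} a week")  # → $20,000 a week
```

If each user loses a minute on several lookups a week rather than one, the cost scales linearly with `lookups_per_week`.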
News Feeds - Corporate Intranets • News Feeds and Content providers • uniform content, size and structure • professional writers • Simple or standard vocabulary • Corporate intranet • Wildly varied content • Mix of good, bad, and ugly writers • Tower of Babel: Acronyms, special meanings
Auto-Categorization: the How • Automatic Methods • Catalog by Example • Training Sets (5-500) • Bag of Words or language and concepts • Statistical Clustering • Set of Documents & Taxonomy Level • Semi-Automatic: Rules
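A minimal sketch of "catalog by example" with a bag-of-words model may help make the idea concrete. It is not any vendor's algorithm: it simply sums word counts over the example documents in each category and assigns a new document to the category whose profile is most similar by cosine similarity. The categories, example texts, and function names are all hypothetical.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text and count word occurrences."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train(examples):
    """Sum the bags of words of the example documents in each category."""
    profiles = {}
    for category, text in examples:
        profiles.setdefault(category, Counter()).update(bag_of_words(text))
    return profiles

def categorize(text, profiles):
    """Return (category, score) pairs, best match first."""
    bag = bag_of_words(text)
    scores = [(cat, cosine(bag, profile)) for cat, profile in profiles.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical training set, far smaller than the 5-500 range on the slide.
examples = [
    ("HR / Benefits", "enroll in the health plan and dental benefits during open enrollment"),
    ("HR / Benefits", "vacation policy and retirement plan benefits for employees"),
    ("Education & Training", "register for the training course and online class schedule"),
    ("Education & Training", "new manager training workshop and course catalog"),
]
profiles = train(examples)
ranked = categorize("How do I change my dental plan benefits?", profiles)
print(ranked[0][0])  # → HR / Benefits
```

A pure word-count model like this knows nothing about language or concepts, which is the limitation the "Bag of Words or language and concepts" bullet points at.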
Auto-Categorization: the How • Next Generation • Support Vector Machines • Machine Learning • World Knowledge • Incremental Improvement • From 75% to 85% • Critical Issue: Integration
Categorization Explosion • Autonomy • Semio • Verity • Inxight • Topical Net • Mohomine • Simile • H5Technologies • YellowBrix • GammaSite • MetaTagger • Applied Semantics • Sageware • SmartLogik • Quiver • Stratify • Vivisimo • Other - Tacit
Auto-Categorization: Features • The Categorization Algorithm • SVM – Vector space is an improvement • Higher Accuracy • Fewer documents for training set • White Box – customize recall & precision • Categorize multiple file types & sizes • Clustering – Taxonomy Builder
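To illustrate what "customize recall & precision" means in practice, here is a small sketch that assumes the categorizer exposes a per-document score for a category and lets you choose the acceptance threshold. The scores and labels below are invented for illustration.

```python
def precision_recall(scored_docs, threshold):
    """Compute precision and recall for one category at a score threshold.

    scored_docs is a list of (score, truly_in_category) pairs.
    """
    assigned = [relevant for score, relevant in scored_docs if score >= threshold]
    true_positives = sum(assigned)
    total_relevant = sum(relevant for _, relevant in scored_docs)
    # Convention used here: with nothing assigned (or nothing relevant), report 1.0.
    precision = true_positives / len(assigned) if assigned else 1.0
    recall = true_positives / total_relevant if total_relevant else 1.0
    return precision, recall

# Hypothetical scores from a categorizer, with the human-judged answer.
scored = [(0.95, True), (0.90, True), (0.70, False),
          (0.60, True), (0.40, False), (0.30, True)]

for threshold in (0.8, 0.5, 0.2):
    p, r = precision_recall(scored, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

Raising the threshold files fewer documents but files them more accurately; lowering it catches more of the relevant documents at the cost of more misfiled ones.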
Auto-Categorization: Features • Support Distributed Activities • Distributed work flow: authors, subject matter experts, information architects • Provisional categorization, keywords, meta data • Automatic summarization • Ease of Use, Integration with CM and Search • Integrate with Rules, Meta Data • Content to Context
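One way to picture provisional categorization inside a distributed work flow is sketched below. It assumes some automatic categorizer has already produced (category, score) suggestions; the status names, the 0.6 cutoff, and the reviewer roles are hypothetical choices, not features of any product named in this deck.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    title: str
    suggestions: list  # (category, score) pairs from any automatic categorizer
    category: str = ""
    status: str = "uncategorized"
    history: list = field(default_factory=list)

def propose(doc, auto_accept=0.6):
    """Attach a provisional category, or route the document to a human."""
    if not doc.suggestions:
        doc.status = "needs_sme_review"
        return doc
    best_category, best_score = max(doc.suggestions, key=lambda s: s[1])
    if best_score >= auto_accept:
        doc.category, doc.status = best_category, "provisional"
    else:
        doc.status = "needs_sme_review"
    doc.history.append(("auto", best_category, best_score))
    return doc

def review(doc, reviewer_role, category):
    """A human confirms or overrides the category and the change is logged."""
    doc.category, doc.status = category, "approved"
    doc.history.append((reviewer_role, category, None))
    return doc

memo = propose(Document("401(k) changes", [("HR / Benefits", 0.82), ("News", 0.40)]))
odd = propose(Document("Q3 misc notes", [("News", 0.35), ("Departments", 0.30)]))
print(memo.status, odd.status)  # → provisional needs_sme_review
review(odd, "information architect", "Departments")
```

The history list keeps both the automatic suggestion and the human decision, so reviewers can see why a document landed where it did.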
Auto-Categorization: Features • Platform for Knowledge Retrieval • World Knowledge • Pre-Built Categories • Rich Semantic Net (WordNet+) • Entity Extraction • Integration • Specialized Audiences & Vocabularies • Content, Expertise, Communities, Activities
The Answer is Cyborg • Automatic Categorization is Not. • Professional Services: Initial Taxonomy • Cyborg: Human and Automatic Integration • Distributed Work Flow • Cyborg Integration with Content Management, Search
Content Management and Taxonomy • Taxonomic Publishing Model • Publish by Category, not web site • Web site the wrong unit of organization • 10 pages to 10,000 pages • 10 users to 20,000 users • 1 activity to 100’s of activities
Content Management and Taxonomy • Content Re-Organization • Support Browse by Topic, Type, Task • Rich Web of Related Content • Product information • Basic Info + background contexts • Legal / Policy contexts • Technical Contexts • Customer / Task contexts
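Browse by Topic, Type, and Task can be sketched as a small faceted index: each document carries values for the three facets, and browsing intersects the sets of documents filed under the chosen values. The documents, facet values, and function names below are hypothetical.

```python
from collections import defaultdict

def build_browse_index(documents, facets=("topic", "type", "task")):
    """Map each (facet, value) pair to the titles filed under it."""
    index = defaultdict(set)
    for doc in documents:
        for facet in facets:
            for value in doc.get(facet, []):
                index[(facet, value)].add(doc["title"])
    return index

def browse(index, **selected):
    """Return titles matching every selected facet value, e.g. topic='Products'."""
    matches = [index.get((facet, value), set()) for facet, value in selected.items()]
    return set.intersection(*matches) if matches else set()

documents = [
    {"title": "Fund fact sheet", "topic": ["Products"],
     "type": ["Reference"], "task": ["Answer a customer question"]},
    {"title": "Fund disclosure policy", "topic": ["Products", "Legal / Policy"],
     "type": ["Reference"], "task": ["Check compliance"]},
    {"title": "Selling funds course", "topic": ["Products"],
     "type": ["Training"], "task": ["Answer a customer question"]},
]
index = build_browse_index(documents)
print(sorted(browse(index, topic="Products", type="Reference")))
# → ['Fund disclosure policy', 'Fund fact sheet']
```

Because one document can sit under several topics, the same product page shows up in its legal, technical, and customer contexts without being copied.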
Content Management and Taxonomy • Content Re-Organization: Next Steps • Document can be wrong unit of organization • Information / Learning objects • XML based objects: reuse, combine (relations and contexts) in more flexible and sophisticated ways.
Content Management: Re-organize Authoring • Streamline Authoring • Minimize IT / Web Developer Bottleneck • Integrated Work Flow & Categorization • Central: Librarian and/or Information Architects • Distributed: content owners, authors, SME’s • Distributed Categorization, Meta Data
Applied Taxonomy: Search • Intranet Environments • Case Studies: • Meta Data • Browse / Search Model
Intranet Environments • Global, Distributed • Variety of Documents, People, Activities • 100’s independent Web Sites • Documents, Databases, Applications © 2001 Charles Schwab & Co., Inc., member NYSE/SIPC. All rights reserved. (0401-6450)
Meta Data: Dublin Core+ • Title • Description • Keywords • Creator • Publisher • ContentType • Audience • SectionName • Language • Contributor • Contributor.Technical • Date.Created • Date.Review • Format • Identifier • Rights
Controlled Vocabularies • ContentType: Application, Calendar, Form, FAQ, Mission, Reference, Training • Audience: Function (Project Manager, Trainer), Enterprise (Retail, Technology), Role (Admin Assistant, Officer)
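A controlled vocabulary earns its keep when meta data is checked against it at authoring time. The sketch below validates a record using the ContentType values listed on this slide; which fields are required is an assumption, and the function name is hypothetical.

```python
# ContentType values as listed on the slide.
CONTENT_TYPES = {"Application", "Calendar", "Form", "FAQ",
                 "Mission", "Reference", "Training"}

# Assumption: the deck does not say which Dublin Core+ fields are mandatory.
REQUIRED_FIELDS = ("Title", "Description", "ContentType")

def validate_metadata(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = [f"missing {name}" for name in REQUIRED_FIELDS if not record.get(name)]
    content_type = record.get("ContentType")
    if content_type and content_type not in CONTENT_TYPES:
        problems.append(f"ContentType {content_type!r} is not in the controlled vocabulary")
    return problems

good = {"Title": "Expense report", "Description": "Form for travel expenses",
        "ContentType": "Form"}
bad = {"Title": "Expense report", "ContentType": "Spreadsheet"}

print(validate_metadata(good))  # → []
print(validate_metadata(bad))
```

Rejecting free-text values such as "Spreadsheet" at entry keeps authors from quietly growing a second, inconsistent vocabulary alongside the official one.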
First Generation Browse Taxonomy • News • Education & Training • HR / Benefits • Employee Services & Programs • Departments • Communities • Tools, Forms, Calendars • How To / FAQ’s • Products • Reference & Resources
Future Directions • Extending Taxonomies • Richer World Knowledge • Smarter Learning • Additional Content: Databases, Word Docs on network drive, Email • Integration of external content
Future Directions • Integration: Creation to Retrieval • Collaborative Filtering and Categorization • Integration throughout the Enterprise • People, Communities, Expertise • Contextualizing content • Related topics and related contexts • Categories for Stories