
Evidence from Metadata

Explore the use of controlled vocabularies and metadata in information retrieval systems to improve search accuracy and efficiency. Discuss challenges and solutions in concept retrieval.



Presentation Transcript


  1. Evidence from Metadata LBSC 796/CMSC 828o Session 6 – March 1, 2004 Douglas W. Oard

  2. Agenda • Questions • Controlled vocabulary retrieval • Generating metadata • Metadata standards • Putting the pieces together

  3. Supporting the Search Process — [IR system diagram: source selection and query formulation produce a query; search against the index returns a ranked list; selection and examination of documents lead to delivery; in parallel, document acquisition builds the collection, which indexing turns into the index]

  4. Problems with “Free Text” Search • Homonymy • Terms may have many unrelated meanings • Polysemy (related meanings) is less of a problem • Synonymy • Many ways of saying (nearly) the same thing • Anaphora • Alternate ways of referring to the same thing

  5. Behavior Helps, But not Enough • Privacy limits access to observations • Queries based on behavior are hard to craft • Explicit queries are rarely used • Query by example requires behavior history • “Cold start” problem limits applicability

  6. A “Solution:” Concept Retrieval • Develop a concept inventory • Uniquely identify concepts using “descriptors” • Concept labels form a “controlled vocabulary” • Organize concepts using a “thesaurus” • Assign concept descriptors to documents • Known as “indexing” • Craft queries using the controlled vocabulary

  7. Two Ways of Searching — [diagram: the author writes the document using terms to convey meaning; a free-text searcher constructs a query from terms that may appear in documents, and query terms are matched against document terms (content-based query-document matching); an indexer chooses appropriate concept descriptors, a controlled vocabulary searcher constructs a query from the available concept descriptors, and query descriptors are matched against document descriptors (metadata-based query-document matching); each path yields a retrieval status value]

  8. Boolean Search Example
      Document 1: “The quick brown fox jumped over the lazy dog’s back.” [Canine] [Fox]
      Document 2: “Now is the time for all good men to come to the aid of their party.” [Political action] [Volunteerism]

      Descriptor          Doc 1   Doc 2
      Canine                1       0
      Fox                   1       0
      Political action      0       1
      Volunteerism          0       1

      • Canine AND Fox → Doc 1
      • Canine AND Political action → empty
      • Canine OR Political action → Doc 1, Doc 2
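
A minimal sketch (not from the slides) of the Boolean matching above, with descriptors assigned by hand exactly as in the example; Python is used purely for illustration:

      # Inverted index from concept descriptors to the documents that carry them.
      index = {
          "Canine":           {"Doc 1"},
          "Fox":              {"Doc 1"},
          "Political action": {"Doc 2"},
          "Volunteerism":     {"Doc 2"},
      }

      def AND(a, b): return index.get(a, set()) & index.get(b, set())
      def OR(a, b):  return index.get(a, set()) | index.get(b, set())

      print(AND("Canine", "Fox"))                      # {'Doc 1'}
      print(AND("Canine", "Political action"))         # set() -- empty
      print(sorted(OR("Canine", "Political action")))  # ['Doc 1', 'Doc 2']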

  9. Applications • When implied concepts must be captured • Political action, volunteerism, … • When terminology selection is impractical • Searching foreign language materials • When no words are present • Photos w/o captions, videos w/o transcripts, … • When user needs are easily anticipated • Weather reports, yellow pages, …

  10. Yahoo

  11. Text Categorization • Goal: fully automatic descriptor assignment • Machine learning approach • Assign descriptors manually for a “training set” • Design a learning algorithm to find and use patterns • Bayesian classifier, neural network, genetic algorithm, … • Present new documents • System assigns descriptors like those in the training set
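
A hedged sketch of this machine learning approach using a Bayesian classifier (one of the learners named on the slide), assuming the scikit-learn library is available; the two training documents and their descriptors are invented for illustration:

      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.naive_bayes import MultinomialNB

      # Manually indexed "training set": each text plus its assigned descriptor.
      train_texts = [
          "volunteers canvassed voters before the election",
          "the shelter rescued a stray dog and a red fox",
      ]
      train_descriptors = ["Political action", "Canine"]

      vectorizer = CountVectorizer()
      X = vectorizer.fit_transform(train_texts)
      classifier = MultinomialNB().fit(X, train_descriptors)

      # A new document gets the descriptor whose training examples it most resembles.
      new_doc = ["volunteers knocked on doors for the campaign"]
      print(classifier.predict(vectorizer.transform(new_doc)))  # ['Political action']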

  12. Supervised Learning — [diagram: labelled training examples, each a vector of values v1 … vN (or w1 … wN) over features f1 … fN with a class label Cv or Cw, are fed to a learner that produces a classifier; the classifier then assigns a class Cx to a new example x1 … xN]

  13. Example: kNN Classifier
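
The original figure for this slide is not in the transcript; the sketch below (with invented term vectors) shows the idea of a k-nearest-neighbor classifier: a new document receives the descriptor most common among its k most similar training documents:

      from collections import Counter
      from math import sqrt

      def cosine(a, b):
          # Cosine similarity between two sparse term-count vectors (dicts).
          shared = set(a) & set(b)
          num = sum(a[t] * b[t] for t in shared)
          den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
          return num / den if den else 0.0

      training = [
          ({"fox": 2, "dog": 1},      "Canine"),
          ({"dog": 2, "leash": 1},    "Canine"),
          ({"party": 2, "voters": 1}, "Political action"),
      ]

      def knn_classify(doc, k=2):
          neighbors = sorted(training, key=lambda ex: cosine(doc, ex[0]), reverse=True)[:k]
          return Counter(label for _, label in neighbors).most_common(1)[0][0]

      print(knn_classify({"fox": 1, "leash": 1}))  # 'Canine'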

  14. Machine Assisted Indexing • Goal: automatically suggest descriptors • Better consistency with lower cost • Approach: rule-based expert system • Design thesaurus by hand in the usual way • Design an expert system to process text • String matching, proximity operators, … • Write rules for each thesaurus/collection/language • Try it out and fine-tune the rules by hand

  15. Machine Assisted Indexing Example (Access Innovations system):
      //TEXT: science
      IF (all caps)
          USE research policy
          USE community program
      ENDIF
      IF (near "Technology" AND with "Development")
          USE community development
          USE development aid
      ENDIF
      near: within 250 words; with: in the same sentence
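
A sketch of how proximity rules like these could be evaluated; the NEAR (within 250 words) and WITH (same sentence) operators follow the slide's definitions, while the Python form and the sample text are illustrative rather than the vendor's actual system:

      import re

      def near(text, a, b, window=250):
          # True if terms a and b occur within `window` words of each other.
          words = re.findall(r"\w+", text.lower())
          pos_a = [i for i, w in enumerate(words) if w == a.lower()]
          pos_b = [i for i, w in enumerate(words) if w == b.lower()]
          return any(abs(i - j) <= window for i in pos_a for j in pos_b)

      def with_same_sentence(text, a, b):
          # True if terms a and b occur in the same sentence.
          return any(a.lower() in s.lower() and b.lower() in s.lower()
                     for s in re.split(r"[.!?]", text))

      text = "Science policy drives technology development in rural regions."
      descriptors = set()
      if near(text, "science", "Technology") and with_same_sentence(text, "science", "Development"):
          descriptors.update({"community development", "development aid"})
      print(sorted(descriptors))  # ['community development', 'development aid']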

  16. Thesaurus Design • Thesaurus must match the document collection • Literary warrant • Thesaurus must match the information needs • User-centered indexing • Thesaurus can help to guide the searcher • Broader term (“is-a”), narrower term, used for, …
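
A small sketch (terms invented) of how these thesaurus relationships can guide a searcher: "used for" entries redirect to the preferred descriptor, and narrower terms can expand a query downward:

      thesaurus = {
          "Canine": {"narrower": ["Dog", "Fox"], "broader": [],         "used_for": ["Canid"]},
          "Dog":    {"narrower": [],             "broader": ["Canine"], "used_for": ["Domestic dog"]},
          "Fox":    {"narrower": [],             "broader": ["Canine"], "used_for": []},
      }

      def preferred(term):
          # Resolve a non-preferred entry term to its descriptor.
          for descriptor, rel in thesaurus.items():
              if term == descriptor or term in rel["used_for"]:
                  return descriptor
          return None

      def expand(descriptor):
          # The descriptor plus all of its narrower terms, recursively.
          terms = [descriptor]
          for nt in thesaurus[descriptor]["narrower"]:
              terms.extend(expand(nt))
          return terms

      print(preferred("Canid"))  # 'Canine'
      print(expand("Canine"))    # ['Canine', 'Dog', 'Fox']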

  17. Challenges • Changing concept inventories • Literary warrant and user needs are hard to predict • Accurate concept indexing is expensive • Machines are inaccurate, humans are inconsistent • Users and indexers may think differently • Diverse user populations add to the complexity • Using thesauri effectively requires training • Meta-knowledge and thesaurus-specific expertise

  18. Named Entity Tagging • Machine learning techniques can find: • Location • Extent • Type • Two types of features are useful • Orthography • e.g., Paired or non-initial capitalization • Trigger words • e.g., Mr., Professor, said, …
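
A sketch of the two feature types named on the slide, computed for a candidate token span; the trigger list and example sentence are invented, and no learner is shown:

      TRIGGERS = {"mr", "mrs", "dr", "professor", "said"}

      def entity_features(tokens, start, end):
          span = tokens[start:end]
          before = tokens[max(0, start - 2):start]
          return {
              "all_capitalized": all(t[:1].isupper() for t in span),   # paired capitalization
              "non_initial_cap": start > 0 and tokens[start][:1].isupper(),
              "trigger_nearby":  any(t.lower().strip(".") in TRIGGERS for t in before),
              "length":          end - start,
          }

      tokens = "Yesterday Professor Rose Bush planted roses".split()
      print(entity_features(tokens, 2, 4))
      # {'all_capitalized': True, 'non_initial_cap': True, 'trigger_nearby': True, 'length': 2}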

  19. Normalization • Variant forms of names (“name authority”) • Pseudonyms, partial names, citation styles • Acronyms and abbreviations • Organizations, political entities, projects, … • Co-reference resolution • References to roles or objects rather than names • Anaphoric pronouns for an antecedent name
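
A minimal sketch of name-authority normalization, mapping pseudonyms, partial names, and acronyms to one canonical form; the mappings here are illustrative:

      AUTHORITY = {
          "mark twain":     "Clemens, Samuel (Mark Twain)",
          "samuel clemens": "Clemens, Samuel (Mark Twain)",
          "twain, m":       "Clemens, Samuel (Mark Twain)",
          "un":             "United Nations",
          "united nations": "United Nations",
      }

      def normalize(name):
          key = name.lower().replace(".", "").strip()
          return AUTHORITY.get(key, name)  # fall back to the surface form

      print(normalize("Mark Twain"))  # Clemens, Samuel (Mark Twain)
      print(normalize("U.N."))        # United Nations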

  20. Example: Bibliographic References

  21. Types of Metadata Standards • What can we describe? • Dublin Core • How can we convey it? • Resource Description Framework (RDF) • What can we say? • LCSH, MeSH, … • What does it mean? • Semantic Web

  22. Dublin Core • Goals: • Easily understood, implemented and used • Broadly applicable to many applications • Approach: • Intersect several standards (e.g., MARC) • Suggest only “best practices” for element content • Implementation: • 16 optional and repeatable “elements” • Refined using a growing set of “qualifiers” • “Best practice” suggestions for content standards

  23. Dublin Core Elements
      Content:        Title, Subject, Description, Type, Audience, Coverage, Related resource, Rights
      Instantiation:  Date, Format, Language, Identifier
      Responsibility: Creator, Contributor, Source, Publisher

  24. Resource Description Framework • XML schema for describing resources • Can integrate multiple metadata standards • Dublin Core, P3P, PICS, vCARD, … • Dublin Core provides an XML “namespace” • DC Elements are XML “properties” • DC Refinements are RDF “subproperties” • Values are XML “content”

  25. A Rose By Any Other Name …
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
        <rdf:Description rdf:about="http://media.example.com/audio/guide.ra">
          <dc:creator>Rose Bush</dc:creator>
          <dc:title>A Guide to Growing Roses</dc:title>
          <dc:description>Describes process for planting and nurturing different kinds of rose bushes.</dc:description>
          <dc:date>2001-01-20</dc:date>
        </rdf:Description>
      </rdf:RDF>
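
A sketch of reading the record above with the third-party rdflib package (an assumption; any RDF/XML parser would do), pulling out two of the Dublin Core elements:

      from rdflib import Graph
      from rdflib.namespace import DC

      record = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                           xmlns:dc="http://purl.org/dc/elements/1.1/">
        <rdf:Description rdf:about="http://media.example.com/audio/guide.ra">
          <dc:creator>Rose Bush</dc:creator>
          <dc:title>A Guide to Growing Roses</dc:title>
        </rdf:Description>
      </rdf:RDF>"""

      g = Graph()
      g.parse(data=record, format="xml")
      for resource, _, title in g.triples((None, DC.title, None)):
          print(resource, "->", title)   # ...guide.ra -> A Guide to Growing Roses
      for _, _, creator in g.triples((None, DC.creator, None)):
          print("creator:", creator)     # creator: Rose Bush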

  26. Semantic Web • RDF provides the schema for interchange • Ontologies support automated inference • Similar to thesauri supporting human reasoning • Ontology mapping permits distributed creation • This is where the magic happens 

  27. Adversarial IR • Search is user-controlled suppression • Everything is known to the search system • Goal: avoid showing things the user doesn’t want • Other stakeholders have different goals • Authors risk little by wasting your time • Marketers hope for serendipitous interest • Metadata from trusted sources is more reliable

  28. Index Spam • Goal: Manipulate rankings of an IR system • Multiple strategies: • Create bogus user-assigned metadata • Add invisible text (font in background color, …) • Alter your text to include desired query terms • “Link exchanges” create links to your page

  29. Putting It All Together

  30. Before You Go! On a sheet of paper, please briefly answer the following question (no names): What was the muddiest point in today’s lecture?
