LBSC 670. Information Organization. Today. Guest Speaker – Jeremy York – HathiTrust Classification Thoughts and CV Overview & History Related concepts Examples A note on MARC specifications. Classification concepts. Aboutness, specificity, granularity
LBSC 670 Information Organization
Today • Guest Speaker – Jeremy York – HathiTrust • Classification Thoughts and CV • Overview & History • Related concepts • Examples • A note on MARC specifications
Classification concepts • Aboutness, specificity, granularity • “Words have power” - classification systems exist within a socio-political context • Classification methods • Manual/automatic, Pre/Post coordinate, Hierarchical/faceted, formal/social
CV overview • What are controlled vocabularies? • Types • Basic concepts • How are CVs created and maintained? • Metadata standards • Example Systems • When does a CV turn into a KO? • Term Lists, Thesauri, Taxonomies, Ontologies
Controlled Vocabularies “organized lists of words and phrases, or notation systems, that are used to initially tag content, and then to find it through navigation or search.” (Warner via Leise, Fast) “the primary purpose of vocabulary control is to achieve consistency in the description of content objects and to facilitate retrieval” (ANSI Z39.19)
Knowledge Organization • “tools that present the organized interpretation of knowledge structures” (Hjørland) • “classification schemes that organize materials at a general level…, subject headings that provide more detailed access, and authority files that control variant versions of key information” (Hodge)
Uses of controlled vocabulary • Define scope, content, and context of a body of knowledge • Support discovery - Navigation, search, browsing • Map information objects to user terminology • Enforce term consistency and relationships
A good CV. . . • Removes ambiguity • Defines relationships between things • Contextualizes information A+
CV Concepts (http://bit.ly/lbsc_670_cv) • Content Analysis • Ambiguity • Synonymy • Exhaustivity • Specificity • Co-extensivity • Aboutness • Semantic structure • Warrant (User, Literary, Organization) • Form Analysis • Linguistics • Grammar • Semiotics • Single / Multiple terms • Indexing & Retrieval • Pre vs. Post Coordinate • Recall vs. Precision • Natural language processing (NLP)
Content Analysis • Ambiguity • Each term should relate to a single concept • Synonymy • Each concept should be identified by a single entry • Specificity • Using the most specific word or phrase expressing the subject • Exhaustivity • The extent to which the entire document is indexed (Summarization, depth) • Co-extensivity • “Assign as many terms as needed to bring out the main theme, and according to guidelines sub-themes.” (p. 29, Lancaster) • “nothing more, nothing less” • Semantic Structure • Terms can be related with equivalence, hierarchical, or associative relationships (Use, See, NT, BT, RT)
Content Analysis (2) • Aboutness = Subject/topic? • Wilson (1968) • Author intent, topicality, relationship to other resources, textual analysis • Fairthorne (1969) • Intentional aboutness (author), extensional aboutness (document) • Maron (1977) • objective about (document), subjective about (user), and retrieval about (information retrieval) • Hjørland (2001) • “Closely related to theories of meaning, interpretation, and epistemology”
Content Analysis (3) • Wilson’s criteria for evaluating aboutness (1968) • Identify author’s purpose (intent) • Weigh the predominant topics, elements (topical analysis) • Group/count a document’s use of concepts and references (bibliometrics) • Identify essential elements (text analysis)
Content Analysis (4) • Literary Warrant • “The inclusion of a vocabulary term in a controlled vocabulary based on its appearance in one or more content items. For example, a medical text may use the term “oncology.” Based on literary warrant, that term would be included in the controlled vocabulary even though the general public uses the term “cancer.” (Glosso-Thesaurus) • User Warrant • “The inclusion of a vocabulary term in a controlled vocabulary based on use by users. Such terms can be identified through search log analysis or free listing.” (Glosso-Thesaurus) • Organizational Warrant • “Justification for the...selection of a preferred term due to the characteristics and context of the organization using the resource” (ANSI Z39.19)
Form Analysis • Linguistics • Syntax/Form (grammar) • Morphology (internal word structure) • Semantics (meaning) • Pragmatics, discourse analysis (word/phrase use) • Semiotics • study of signs/symbols • Lexical structure • Document layout, markup, tags (think DOM)
Indexing & Retrieval • Pre/Post-Coordinate • Pre-coordinate: organization prior to retrieval • Post-coordinate: organization at the point of retrieval • Recall / Precision • Recall: number of retrieved relevant docs / total number of relevant docs in collection • Precision: number of retrieved relevant docs / total number of docs retrieved • Natural language processing • Uses semantics and syntax to automatically distill ‘aboutness’
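The pre/post-coordinate distinction can be illustrated with a small sketch. It assumes a toy index held in Python dictionaries; the document IDs, terms, and the compound heading are made up for illustration and do not come from any real system. Pre-coordination stores the combined heading at indexing time, while post-coordination combines single-concept terms when the search is run.

```python
# Hypothetical post-coordinate index: single-concept terms mapped to tagged documents.
post_coordinate_index = {
    "thesauri": {"doc1", "doc2", "doc4"},
    "evaluation": {"doc2", "doc3", "doc4"},
    "medicine": {"doc4", "doc5"},
}

# Hypothetical pre-coordinate index: the indexer built the compound heading in advance.
pre_coordinate_index = {
    "Thesauri -- Evaluation": {"doc2", "doc4"},
}


def pre_coordinate_search(index, heading):
    """Look up a heading that was combined before retrieval."""
    return index.get(heading, set())


def post_coordinate_search(index, *terms):
    """Combine single terms at the point of retrieval (Boolean AND)."""
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


print(sorted(pre_coordinate_search(pre_coordinate_index, "Thesauri -- Evaluation")))
print(sorted(post_coordinate_search(post_coordinate_index, "thesauri", "evaluation")))
```

Both searches reach the same two documents here, but only the post-coordinate index can answer a combination nobody anticipated, such as "thesauri" AND "medicine".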
Recall & Precision • A collection of 100 documents • Searches • “Vocabularies” • Recall 100/100 = 1 • Precision 100/100 = 1 • “Facet” • Recall 20/100 = .2 • Precision 20/28 = .71 • “OWL” • Recall 1/100 = .01 • Precision 1/1 = 1 Recall = # of relevant docs retrieved / total # of relevant docs in collection Precision = # of relevant docs retrieved / total # of docs retrieved
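A minimal sketch of the two measures, using the standard definitions (recall: relevant documents retrieved over all relevant documents; precision: relevant documents retrieved over all documents retrieved). The collection below is hypothetical: 100 document IDs, of which 25 are assumed relevant to the query, and a search that returns 28 documents, 20 of them relevant.

```python
def recall(retrieved, relevant):
    """Share of the relevant documents that the search actually found."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved, relevant):
    """Share of the retrieved documents that are actually relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


# Hypothetical data: documents are numbered 0-99; IDs 0-24 are relevant.
relevant_docs = set(range(25))
# The search returns 20 relevant documents (0-19) plus 8 irrelevant ones (50-57).
retrieved_docs = set(range(20)) | set(range(50, 58))

print(f"recall    = {recall(retrieved_docs, relevant_docs):.2f}")     # 20/25
print(f"precision = {precision(retrieved_docs, relevant_docs):.2f}")  # 20/28
```

Note that the size of the whole collection never enters either formula; only the relevant set and the retrieved set matter.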
Types of Controlled Vocabularies • Term Lists • Glossaries, Dictionaries, Gazetteers, Folksonomies • Synonym rings • Z39.19 example • Oracle Text • Taxonomies • Website navigation scheme • Thesauri / Ontologies • Authority files, subject thesauri, topic maps
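As a rough illustration of the synonym rings listed above, the sketch below expands a search term to every member of its ring before matching. The rings themselves are invented for the example and are not taken from Z39.19 or Oracle Text; a ring has no preferred term, it simply treats its members as equivalent for retrieval.

```python
# Hypothetical synonym rings: every member is treated as equivalent when searching.
synonym_rings = [
    {"cancer", "neoplasm", "tumor"},
    {"aging", "ageing"},
]


def expand_query(term, rings):
    """Return the term plus all members of any ring that contains it."""
    term = term.lower()
    expanded = {term}
    for ring in rings:
        if term in ring:
            expanded |= ring
    return expanded


print(sorted(expand_query("Ageing", synonym_rings)))
print(sorted(expand_query("owl", synonym_rings)))
```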
Thesauri & taxonomy examples • List of vocabularies • http://www.slais.ubc.ca/resources/indexing/database1.htm • Taxonomy warehouse • Two Examples • Health & Ageing Thesaurus • Thesaurus of Geographic names
CV Structures • Organization structures • Hierarchical systems • Term Lists / Enumerative systems • Hierarchies • Trees • Facets / Associative relationships • Folksonomies
Hierarchies • Features • Inclusiveness • “Is-a” relationship • Inheritance • Transitivity • Systematic • Mutually exclusive • Necessary and sufficient From http://bit.ly/lbsc_670_cv
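Two of the hierarchy features named on this slide, transitivity and inheritance, can be sketched with a small "is-a" chain. The terms and properties below are made up for illustration, and the sketch assumes each term has at most one broader term (a strict tree).

```python
# Hypothetical "is-a" hierarchy: each term points to its single broader term.
broader = {"poodle": "dog", "dog": "mammal", "mammal": "animal"}

# Hypothetical properties attached at different levels of the hierarchy.
properties = {
    "animal": {"alive"},
    "mammal": {"warm-blooded"},
    "dog": {"barks"},
}


def ancestors(term, broader):
    """Walk up the hierarchy and return every broader term, nearest first."""
    chain = []
    seen = {term}
    while term in broader:
        term = broader[term]
        if term in seen:
            raise ValueError("cycle in hierarchy")
        seen.add(term)
        chain.append(term)
    return chain


def is_a(narrow, broad, broader):
    """Transitivity: a poodle is a dog, a dog is a mammal, so a poodle is a mammal."""
    return broad in ancestors(narrow, broader)


def inherited_properties(term, broader, properties):
    """Inheritance: a term carries its own properties plus those of all ancestors."""
    result = set(properties.get(term, set()))
    for ancestor in ancestors(term, broader):
        result |= properties.get(ancestor, set())
    return result


print(ancestors("poodle", broader))
print(is_a("poodle", "animal", broader))
print(sorted(inherited_properties("poodle", broader, properties)))
```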
Relationships • Equivalence ( Term Lists) • “use”, “see”, “isVersionOf”, “isFormatOf” • Hierarchical (Thesauri, Taxonomies) • Generic – “is a” • Partitive – “is part of”, “has part”, “has conceptual part”, “member of” • Instance – • Associative (Facets, Ontologies) • “isReferencedBy”, “isRequiredBy”, “hasDerivative”
Faceted vocabularies Multi-dimensional, multi-relationship driven, Subject, Object, Predicate From http://bit.ly/lbsc_670_cv
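One way to picture the subject, predicate, object structure is as a list of triples that can be filtered along several dimensions at once. This is only a sketch: the predicates (`hasTopic`, `hasFormat`) and document IDs are hypothetical, and a real system would more likely use an RDF store than a Python list.

```python
# Hypothetical (subject, predicate, object) triples describing three documents.
triples = [
    ("doc1", "hasTopic", "thesauri"),
    ("doc1", "hasFormat", "video"),
    ("doc2", "hasTopic", "thesauri"),
    ("doc2", "hasFormat", "text"),
    ("doc3", "hasTopic", "ontologies"),
    ("doc3", "hasFormat", "text"),
]


def faceted_search(triples, **facets):
    """Return the subjects that match every requested predicate=value facet."""
    subjects = {s for s, _, _ in triples}
    for predicate, value in facets.items():
        subjects &= {s for s, p, o in triples if p == predicate and o == value}
    return subjects


print(sorted(faceted_search(triples, hasFormat="text")))
print(sorted(faceted_search(triples, hasTopic="thesauri", hasFormat="text")))
```

Each facet narrows the result independently, which is what makes the description multi-dimensional rather than a single path through a hierarchy.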
Folksonomy • Features • Single level description • Open vocabulary list • User supplied/harvested tags http://trendistic.indextank.com/
Term List Examples • Authority files – Maps to preferred terms • Library of Congress • Encoded Archival Context • Union List of Artist Names • Glossaries/Dictionaries –Words & definitions, sometimes topic focused • Glosso-Thesaurus • Folksonomies – • Contextualization, Trend discovery, Personal Information • Synonym rings – Used for back-end equivalence in searching • Princeton Wordnet
Choosing a framework • Use questions • Who is your user, what are their needs? • What systems are your users familiar with? • Will this system be internal/external? • Content questions • How extensive, defined is the information? • Is your subject matter static or fluid? • What organizational framework best describes your content? • System Questions • What access are you trying to provide? • What external pressures exist? • What external entities/theories will interact with this system?
Thesauri Definitions • “Guide to use of terms, showing relationships between them, for the purpose of providing standardized, controlled vocabulary for information storage and retrieval”(Monash) • “A list of words showing similarities, differences, dependencies, and other relationships to each other”(USG)
Creating a CV (1) • Design methods • Re-use existing, start with content & desired use ideas • Committee / community approach • Top-down • Concept driven • Bottom-up • Document driven • Empirical approach • Deductive approach • Select terms, create relationships, perform term control • Inductive approach • Establish CV at outset, build hierarchies on as needed basis
Creating a CV (2) • Top-down (deductive) • Identify audience • Identify all topics, concepts, uses, and context of the domain • Sort topics identified into an appropriate organization scheme (enumerative, hierarchical, faceted) • Solidify structure and clean up gaps & redundancies • Assign documents to categories, test retrieval • Bottom-up (inductive) • Identify audience • Survey documents for topics/concepts • Build system on the fly – let content drive structure and limits of system • Identify gaps & redundancies in system • Test retrieval
Creating a CV (3) • Think about scope, use, content, maintenance • Gather Terms • Based on existing systems, content • Based on user needs/expectations • Investigate issues of specificity, exhaustivity, granularity • Build hierarchies, relationships • Broader/narrower terms, Related terms, Use/Use for, see/see also • Establish Rules • Implement • Evaluate • Maintain http://www.boxesandarrows.com/view/creating_a_controlled_vocabulary
Evaluating a CV • Goals • Determine if the CV solves retrieval needs of user/system • Determine if CV matches user’s content model/term expectations • Methods • Expert evaluation of CV • User based card sorting compared to actual CV • Identification of non-included documents • Analysis of use of system - HCI
CV Maintenance • Primary responsibility • Editor, board, committee • New terms • Is it really new or a different view? • What is the proper form & placement? • Modified terms • Include a change log • Use a “USE” reference to point to new term • Deleted terms • Unused / Overused terms • May want to keep for historical retrieval purposes • Modification history • Use modification notes, date/time stamps
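The maintenance practices on this slide (change log, USE references, keeping old terms, date/time stamps) can be sketched as a small class. Everything here is a hypothetical design for illustration, including the class name, the field names, and the example terms; it is not how any particular vocabulary tool works.

```python
from datetime import datetime, timezone


class Vocabulary:
    """Toy controlled vocabulary with a timestamped change log (hypothetical design)."""

    def __init__(self):
        # term -> {"status": "active" or "deprecated", "use": replacement term or None}
        self.terms = {}
        self.change_log = []

    def _log(self, action, term, note):
        self.change_log.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "term": term,
            "note": note,
        })

    def add_term(self, term, note=""):
        if term in self.terms:
            raise ValueError(f"{term!r} already exists; is it really new?")
        self.terms[term] = {"status": "active", "use": None}
        self._log("add", term, note)

    def replace_term(self, old, new, note=""):
        """Deprecate `old` but keep it, with a USE reference pointing to `new`."""
        if old not in self.terms:
            raise KeyError(old)
        if new not in self.terms:
            self.add_term(new, note=f"replacement for {old!r}")
        self.terms[old] = {"status": "deprecated", "use": new}
        self._log("modify", old, note or f"USE {new}")

    def resolve(self, term):
        """Follow USE references until an active term is reached."""
        seen = set()
        while self.terms[term]["use"] is not None:
            if term in seen:
                raise ValueError("circular USE references")
            seen.add(term)
            term = self.terms[term]["use"]
        return term


vocab = Vocabulary()
vocab.add_term("Aging", note="initial load")
vocab.replace_term("Aging", "Ageing", note="spelling change (hypothetical)")
print(vocab.resolve("Aging"))
for entry in vocab.change_log:
    print(entry["timestamp"], entry["action"], entry["term"], entry["note"])
```

Because the old term is deprecated rather than deleted, records indexed under it remain retrievable, and the log records when and why the change was made.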
Case study - MeSH • http://www.nlm.nih.gov/bsd/disted/video/
Thesauri Concepts • Preferred terms • Non-preferred terms • Semantic relations between terms • How to apply terms (guidelines, rules) • Scope notes • Adding terms (How to produce terms that are not listed explicitly in the thesaurus)
Common thesaural identifiers • SN Scope Note • Instruction, e.g. don’t invert phrases • USE Use (another term in preference to this one) • UF Used For (the reciprocal of USE) • BT Broader Term • NT Narrower Term • RT Related Term
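To show how these identifiers fit together in a single thesaurus record, here is a small sketch that stores and prints one entry. The `ThesaurusEntry` class, the scope note text, and the example terms are all invented for illustration; real thesauri such as MeSH use their own record formats.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    """One thesaurus record using the common identifiers (hypothetical structure)."""
    term: str
    scope_note: str = ""                            # SN
    use: str = ""                                   # USE (set only on non-preferred terms)
    used_for: list = field(default_factory=list)    # UF
    broader: list = field(default_factory=list)     # BT
    narrower: list = field(default_factory=list)    # NT
    related: list = field(default_factory=list)     # RT

    def render(self):
        """Format the entry in the conventional indented display."""
        lines = [self.term]
        if self.scope_note:
            lines.append(f"  SN {self.scope_note}")
        if self.use:
            lines.append(f"  USE {self.use}")
        for label, values in (
            ("UF", self.used_for),
            ("BT", self.broader),
            ("NT", self.narrower),
            ("RT", self.related),
        ):
            for value in values:
                lines.append(f"  {label} {value}")
        return "\n".join(lines)


preferred = ThesaurusEntry(
    term="Neoplasms",
    scope_note="Made-up note for this example.",
    used_for=["Cancer"],
    broader=["Diseases"],
    narrower=["Carcinoma"],
    related=["Oncology"],
)
non_preferred = ThesaurusEntry(term="Cancer", use="Neoplasms")

print(preferred.render())
print(non_preferred.render())
```

The USE line on the non-preferred entry and the UF line on the preferred entry are reciprocal: each one implies the other.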
Thesauri Guides • National Information Standards Organization. (2005). Guidelines for the construction, format, and management of monolingual thesauri. ANSI/NISO Z39.19-2005. Bethesda, MD: NISO Press. • http://www.niso.org/standards/resources/Z39-19-2005.pdf?CFID=5559601&CFTOKEN=31747314 • Aitchison, Jean & Gilchrist, Alan. Thesaurus Construction: A Practical Guide. 3rd ed. London: Aslib, 1997. • Willpower Information Management Consultants • http://www.willpower.demon.co.uk/thesprin.htm
Thesaurus Exploration • http://www.getty.edu/research/tools/vocabularies/tgn/ • Protégé introduction and tour • What is protégé? • What is it used for? • How will we use it this semester?
When is a CV an Ontology? • “The study of being or existence” • “A specification of a conceptualization” (Gruber) • “An ontology formally defines a common set of terms that are used to describe and represent a domain.” (OWL)
Webster’s Dictionary • Webster’s Third New International Dictionary defines ontology as: • A science or study of being, specifically a branch of metaphysics* relating to the nature and relations of being. • A theory concerning the kinds of entities and specifically the kinds of abstract entities that are to be admitted to a language system. *Metaphysics: nature of being or existence.
Next Week • Work time for Protégé • Exploration of ontologies