INLS 520 Information Organization INLS 520 – Fall 2007 Erik Mitchell
Review • Last week • Types of categorization & classification structures • Classification • Definitions • Look at Library classification systems for Dewey & Library of Congress
Today • Controlled vocabularies • Types • Basic concepts • Related technologies • Metadata standards • Example Systems • Knowledge organization systems • Term Lists, Thesauri, Taxonomies, Ontologies
Concepts & definitions • Controlled Vocabularies • “organized lists of words and phrases, or notation systems, that are used to initially tag content, and then to find it through navigation or search.” (Warner via Leise, Fast) • “the primary purpose of vocabulary control is to achieve consistency in the description of content objects and to facilitate retrieval” (ANSI Z39.19) • Knowledge organization systems • “tools that present the organized interpretation of knowledge structures” (Hjørland) • “classification schemes that organize materials at a general level…, subject headings that provide more detailed access, and authority files that control variant versions of key information” (Hodge) • “It depends upon what the meaning of the word 'is' is.” (Clinton)
Uses of controlled vocabulary (1) • Define scope, content, and context of information • Navigation, breadcrumbs • Map to user terminology • Enhance browsing, searching • Term consistency and relationships
Functions of a CV • Removes ambiguity • Synonyms, homonyms, polysemes • Defines relationships • Equivalence, hierarchical, associative (BT, NT, RT, CR), reciprocity • Provides context • Category, scope, qualifiers, modifiers, scope notes
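The reciprocity of BT/NT/RT relationships can be sketched in a few lines. This is an illustrative toy structure, not a standard implementation: the `Thesaurus` class, its method names, and the sample terms are all assumptions made for the example.

```python
from collections import defaultdict

# Each relationship type and its reciprocal: BT <-> NT, RT <-> RT.
RECIPROCAL = {"BT": "NT", "NT": "BT", "RT": "RT"}


class Thesaurus:
    """Toy thesaurus (hypothetical) that stores links with enforced reciprocity."""

    def __init__(self):
        # term -> relationship type -> set of related terms
        self.links = defaultdict(lambda: defaultdict(set))

    def relate(self, term, rel, other):
        """Record `term rel other` and the reciprocal link on `other`."""
        if rel not in RECIPROCAL:
            raise ValueError(f"unknown relationship type: {rel}")
        self.links[term][rel].add(other)
        self.links[other][RECIPROCAL[rel]].add(term)

    def related(self, term, rel):
        """Return the terms linked to `term` by `rel`, sorted for stable output."""
        return sorted(self.links[term][rel])


thesaurus = Thesaurus()
thesaurus.relate("Dogs", "BT", "Mammals")             # Dogs has broader term Mammals
thesaurus.relate("Dogs", "RT", "Veterinary medicine")  # associative link

print(thesaurus.related("Mammals", "NT"))              # narrower terms of Mammals
print(thesaurus.related("Veterinary medicine", "RT"))  # reciprocal related term
```

Storing the reciprocal at write time means a lookup from either side of a relationship never disagrees with the other side.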
Types of Controlled Vocabularies • Term Lists • Glossaries, Dictionaries, Gazetteers, Folksonomies • Synonym rings • Z39.19 example • Oracle Text • Taxonomies • Website navigation scheme • Thesauri / Ontologies • Authority files, subject thesauri, topic maps
A conceptual map http://www.taxotips.com/
CV Concepts • Content Analysis • Ambiguity • Synonymy • Exhaustivity • Specificity • Co-extensivity • Aboutness • Semantic structure • Warrant (User, Literary, Organization) • Form Analysis • Linguistics • Grammar • Semiotics • Single / Multiple terms • Indexing & Retrieval • Pre vs. Post Coordinate • Recall vs. Precision • Natural language processing (NLP)
Content Analysis (1) • Ambiguity • Each term should relate to a single concept • Synonymy • Each concept should be identified by a single entry • Specificity • Using the most specific word or phrase expressing the subject • Exhaustivity • The extent to which the entire document is indexed (Summarization, depth) • Co-extensivity • “Assign as many terms as needed to bring out the main theme, and according to guidelines sub-themes.” (p. 29, Lancaster) • “nothing more, nothing less” • Semantic Structure • Terms can be related with equivalence, hierarchical, or associative relationships (Use, See, NT, BT, RT)
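The ambiguity and synonymy rules can be illustrated with a small entry vocabulary that maps variant terms to one preferred term. This is a rough sketch only: the `USE` table, the function names, and the sample terms are hypothetical and not taken from any real vocabulary.

```python
# Hypothetical entry vocabulary: every variant (entry term) maps to exactly
# one preferred term, so each concept ends up with a single entry.
USE = {
    "cancer": "Neoplasms",
    "tumors": "Neoplasms",
    "tumours": "Neoplasms",
    "neoplasms": "Neoplasms",
    # Qualifiers in parentheses separate homographs into distinct concepts.
    "mercury (planet)": "Mercury (Planet)",
    "mercury (element)": "Mercury (Element)",
}


def preferred_term(entry):
    """Return the preferred term for an entry term, or None if uncontrolled."""
    key = " ".join(entry.lower().split())  # normalize case and whitespace
    return USE.get(key)


def index_terms(candidates):
    """Split candidate terms into preferred terms and unmatched entries."""
    preferred, unmatched = set(), []
    for entry in candidates:
        term = preferred_term(entry)
        if term is None:
            unmatched.append(entry)
        else:
            preferred.add(term)
    return sorted(preferred), unmatched


terms, unmatched = index_terms(["Cancer", "tumours", "Mercury", "mercury (planet)"])
print(terms)      # preferred terms assigned to the document
print(unmatched)  # ambiguous or unknown entries left for an indexer to resolve
```

The unqualified entry "Mercury" is deliberately left unmatched: without a qualifier it could mean either concept, which is the ambiguity the vocabulary is meant to remove.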
Content Analysis (2) • Aboutness = Subject/topic? • Wilson (1968) • Author intent, topicality, relationship to other resources, textual analysis • Fairthorne (1969) • Intentional aboutness (author), extensional aboutness (document) • Maron (1977) • objective about (document), subjective about (user), and retrieval about (information retrieval) • Hjørland (2001) • “Closely related to theories of meaning, interpretation, and epistemology”
Content Analysis (3) • Wilson’s criteria for evaluating aboutness (1968) • Identify author’s purpose (intent) • Weigh the predominant topics, elements (topical analysis) • Group/count a document’s use of concepts and references (bibliometrics) • Identify essential elements (text analysis)
Content Analysis (4) • Literary Warrant • “The inclusion of a vocabulary term in a controlled vocabulary based on its appearance in one or more content items. For example, a medical text may use the term ‘oncology.’ Based on literary warrant, that term would be included in the controlled vocabulary even though the general public uses the term ‘cancer.’” (Glosso-Thesaurus) • User Warrant • “The inclusion of a vocabulary term in a controlled vocabulary based on use by users. Such terms can be identified through search log analysis or free listing.” (Glosso-Thesaurus) • Organizational Warrant • “Justification for the...selection of a preferred term due to the characteristics and context of the organization using the resource” (ANSI Z39.19)
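User warrant through search log analysis might look something like the following sketch. It assumes the log is simply a list of query strings and that a fixed repeat count marks a query as worth reviewing; the function name, the `min_count` threshold, and the sample data are illustrative assumptions.

```python
from collections import Counter


def candidate_terms(search_log, vocabulary, min_count=2):
    """Return (query, count) pairs for repeated queries missing from the CV.

    Queries are normalized for case and whitespace before counting.
    """
    known = {term.lower() for term in vocabulary}
    counts = Counter(" ".join(query.lower().split()) for query in search_log)
    return [
        (query, count)
        for query, count in counts.most_common()
        if query not in known and count >= min_count
    ]


# Hypothetical search log and existing vocabulary.
log = ["cancer", "Cancer", "oncology", "heart attack",
       "heart attack", "heart  attack", "tumor"]
vocabulary = ["Oncology", "Myocardial infarction"]

candidates = candidate_terms(log, vocabulary)
print(candidates)  # user terms that may deserve entry terms or USE references
```

Terms surfaced this way are candidates only; an editor still decides whether each becomes a preferred term or an entry term pointing at an existing one.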
Form Analysis • Linguistics • Syntax/Form (grammar) • Morphology (internal word structure) • Semantics (meaning) • Pragmatics, discourse analysis (word/phrase use) • Semiotics • study of signs/symbols • Lexical structure • Document layout, markup, tags (think DOM)
Indexing & Retrieval • Pre/Post-Coordinate • Pre-coordinate: terms are combined at indexing time, prior to retrieval • Post-coordinate: terms are combined at the point of retrieval • Recall / Precision • Recall: number of relevant docs retrieved / total number of relevant docs in collection • Precision: number of relevant docs retrieved / total number of docs retrieved • Natural language processing • Uses semantics and syntax to automatically distill ‘aboutness’
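The difference between pre- and post-coordination can be sketched with two toy indexes. Everything here is hypothetical: the document IDs, the terms, and the `--` heading separator are assumptions chosen for illustration.

```python
# Post-coordinate index: documents are tagged with single-concept terms,
# and the searcher combines them at retrieval time.
POST_INDEX = {
    "libraries": {1, 2, 3},
    "automation": {2, 3, 4},
    "history": {3, 5},
}

# Pre-coordinate index: the indexer combines concepts into one heading
# before any search happens.
PRE_INDEX = {
    "Libraries": {1},
    "Libraries--Automation": {2},
    "Libraries--Automation--History": {3},
    "Automation": {4},
    "History": {5},
}


def post_coordinate_search(*terms):
    """AND together the postings of each single-concept term."""
    postings = [POST_INDEX.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def pre_coordinate_search(heading):
    """Look up one heading exactly as the indexer built it."""
    return PRE_INDEX.get(heading, set())


print(post_coordinate_search("libraries", "automation"))
print(pre_coordinate_search("Libraries--Automation"))
print(pre_coordinate_search("Automation--Libraries"))  # wrong order finds nothing
```

The post-coordinate search accepts terms in any order, while the pre-coordinate lookup depends on the searcher knowing the exact heading the indexer constructed.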
Recall & Precision • A collection of 100 documents (the recall figures below take 100 as the number of relevant documents) • Searches • “Vocabularies” • Recall 100/100 = 1 • Precision 100/100 = 1 • “Facet” • Recall 20/100 = .2 • Precision 20/28 = .71 • “OWL” • Recall 1/100 = .01 • Precision 1/1 = 1 • Recall = # of relevant docs retrieved / total # of relevant docs in collection • Precision = # of relevant docs retrieved / total # of docs retrieved
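A minimal sketch of the two measures using set arithmetic, with recall taken as relevant retrieved over all relevant and precision as relevant retrieved over all retrieved. The document IDs are hypothetical and chosen to mirror the “Facet” search: 28 documents retrieved, 20 of them relevant, against 100 relevant documents.

```python
def recall(retrieved, relevant):
    """Share of the relevant documents that the search returned."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved, relevant):
    """Share of the returned documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


# Hypothetical IDs: 0-99 are the relevant documents; 100-107 are not relevant.
relevant = set(range(100))
retrieved = set(range(20)) | set(range(100, 108))  # 20 relevant + 8 non-relevant

r = recall(retrieved, relevant)
p = precision(retrieved, relevant)
print(f"recall={r:.2f} precision={p:.2f}")
```

Both functions return 0.0 for an empty denominator, which is a convention chosen here to avoid a division error rather than a standard definition.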
Term List Examples • Authority files – Map to preferred terms • Library of Congress • Encoded Archival Context • Union List of Artist Names • Glossaries/Dictionaries – Words & definitions, sometimes topic focused • Glosso-Thesaurus • Folksonomies – Contextualization, trend discovery, personal information • Synonym rings – Used for back-end equivalence in searching • Princeton WordNet
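Back-end equivalence with a synonym ring could be sketched as simple query expansion. The rings and the `expand_query` helper below are made up for illustration and do not reflect how any particular search engine implements this.

```python
# Hypothetical synonym rings: all members are treated as equivalent,
# with no preferred term among them.
SYNONYM_RINGS = [
    {"car", "automobile", "auto"},
    {"film", "movie", "motion picture"},
]


def expand_query(term):
    """Return every member of the ring containing `term`, or the term alone."""
    normalized = " ".join(term.lower().split())
    for ring in SYNONYM_RINGS:
        if normalized in ring:
            return sorted(ring)
    return [normalized]


print(expand_query("Movie"))    # the search runs against all ring members
print(expand_query("bicycle"))  # no ring, so the query is unchanged
```

Unlike an authority file, the ring has no preferred form: the expansion happens behind the scenes and the user never sees a "USE" reference.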
Thesauri & taxonomy examples • List of vocabularies • http://www.slais.ubc.ca/resources/indexing/database1.htm • Taxonomy warehouse • Two Examples • Health & Ageing Thesaurus • Thesaurus of Geographic Names
Interoperable system example • NCBI Entrez • 35 databases using interoperable controlled vocabulary systems to provide rich meta-searching • Cross-database discovery – search for “heart attack” • Cross-database linking – search for aconitase, follow the “other links” tab
Vocabulary and Classification systems - exercise • Break into groups, discuss & list • Goal • Structure • Issues • Benefits • Resources • Kwasnik, Boxes & Arrows • Organization structures • Term Lists / Enumerative systems • Hierarchies • Trees • Paradigms • Facets / Associative relationships • Folksonomies
Choosing a framework • Use questions • Who is your user, what are their needs? • What systems are your users familiar with? • Will this system be internal/external? • Content questions • How extensive, defined is the information? • Is your subject matter static or fluid? • What organizational framework best describes your content? • System Questions • What access are you trying to provide? • What external pressures exist? • What external entities/theories will interact with this system?
Interoperability issues • Similarity of subject matter in domains • Multiple CVs accepted in a domain • Specificity/granularity of content indexing • Use of synonyms, warrant • Intended use, purpose of system
Creating a CV (1) • Design methods • Re-use existing, start with content & desired use ideas • Committee / community approach • Top-down • Concept driven • Bottom-up • Document driven • Empirical approach • Deductive approach • Select terms, create relationships, perform term control • Inductive approach • Establish CV at outset, build hierarchies on an as-needed basis
Creating a CV (2) • Top-Down • Identify audience • Identify all topics, concepts, uses, and context of the domain • Sort topics identified into an appropriate organization scheme (enumerative, hierarchical, faceted) • Solidify structure and clean up gaps & redundancies • Assign documents to categories, test retrieval • Bottom-up • Identify audience • Survey documents for topics/concepts • Build system on the fly – let content drive structure and limits of system • Identify gaps & redundancies in system • Test retrieval
Creating a CV (3) • Think about scope, use, content, maintenance • Gather Terms • Based on existing systems, content • Based on user needs/expectations • Investigate issues of specificity, exhaustivity, granularity • Build hierarchies, relationships • Broader/narrower terms, Related terms, Use/Use for, see/see also • Establish Rules • Implement • Evaluate • Maintain http://www.boxesandarrows.com/view/creating_a_controlled_vocabulary
Evaluating a CV • Goals • Determine if the CV solves retrieval needs of user/system • Determine if CV matches user’s content model/term expectations • Methods • Expert evaluation of CV • User-based card sorting compared to actual CV • Identification of non-included documents • Analysis of use of system - HCI
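One possible way to compare a user card sort against the actual CV is pairwise agreement: for every pair of terms, check whether the user and the CV both group them together or both keep them apart. This measure is a choice made for this sketch, not one prescribed here, and the terms and category labels are hypothetical.

```python
from itertools import combinations


def pair_agreement(cv_categories, user_categories):
    """Fraction of term pairs grouped the same way by the CV and the user.

    Both arguments map term -> category label. Labels themselves need not
    match; only whether two terms share a category is compared.
    """
    terms = sorted(set(cv_categories) & set(user_categories))
    pairs = list(combinations(terms, 2))
    if not pairs:
        return 0.0
    agreeing = sum(
        (cv_categories[a] == cv_categories[b])
        == (user_categories[a] == user_categories[b])
        for a, b in pairs
    )
    return agreeing / len(pairs)


# Hypothetical CV assignment and one participant's card sort.
cv = {"Dogs": "Mammals", "Cats": "Mammals", "Eagles": "Birds", "Penguins": "Birds"}
user = {"Dogs": "Land", "Cats": "Land", "Eagles": "Air", "Penguins": "Land"}

score = pair_agreement(cv, user)
print(score)
```

A low score suggests the CV's grouping differs from the participant's content model; averaging across participants would give a rough overall picture.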
CV Maintenance • Primary responsibility • Editor, board, committee • New terms • Is it really new or a different view? • What is the proper form & placement? • Modified terms • Include a change log • Use a “USE” reference to point to new term • Deleted terms • Unused / Overused terms • May want to keep for historical retrieval purposes • Modification history • Use modification notes, date/time stamps
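The maintenance practices above (change log, “USE” references, keeping old terms, date/time stamps) could be modeled roughly as follows. The `Term` and `Vocabulary` classes and the sample terms are hypothetical, intended only to show how the pieces fit together.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass
class Term:
    label: str
    status: str = "active"      # "active" or "deprecated"
    use: Optional[str] = None   # USE reference to the replacement term
    history: List[Tuple[str, str]] = field(default_factory=list)

    def log(self, note):
        """Append a modification note with a UTC timestamp."""
        stamp = datetime.now(timezone.utc).isoformat()
        self.history.append((stamp, note))


class Vocabulary:
    def __init__(self):
        self.terms = {}

    def add(self, label, note="term added"):
        term = Term(label)
        term.log(note)
        self.terms[label] = term
        return term

    def replace(self, old, new, note):
        """Deprecate `old`, keep it for historical retrieval, point it at `new`."""
        if new not in self.terms:
            self.add(new, note=f"added as replacement for {old!r}")
        old_term = self.terms[old]
        old_term.status = "deprecated"
        old_term.use = new
        old_term.log(note)

    def resolve(self, label):
        """Follow USE references until a term without one is reached."""
        seen = set()
        while self.terms[label].use is not None:
            if label in seen:
                raise ValueError(f"circular USE reference at {label!r}")
            seen.add(label)
            label = self.terms[label].use
        return label


cv = Vocabulary()
cv.add("Senior citizens")
cv.replace("Senior citizens", "Older adults", note="preferred form changed")
print(cv.resolve("Senior citizens"))
for stamp, note in cv.terms["Senior citizens"].history:
    print(stamp, note)
```

The deprecated term is never deleted, so documents indexed under the old form remain retrievable and the history shows when and why the change was made.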
Class exercise • Protégé overview • Orientation • Object types (Classes, Slots, Instances) • Relationships (hierarchies, associative) • Replication of the Glosso-Thesaurus • Visit the Boxes & Arrows Glosso-Thesaurus • Look at the data there and come up with a structure in Protégé that allows replication of the thesaurus • Some issues to consider are: • Do you want terms to be classes or instances? • What is the easiest way to show the relationships (broader term, narrower term, etc)? • Do you need to allow multiple relationships for a given type (BT, RT, etc)? • If you have multiple classes, at what level should you create the slots?
Next Week • More on Knowledge organization systems • Taxonomies, Ontologies • More work with Protégé