Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2003. Lecture 07: Controlled Vocabularies. SIMS 202: Information Organization and Retrieval. Some slides in this lecture were developed by Prof. Marti Hearst.
Lecture Contents • Phone Project • Review • Metadata Systems • Dublin Core • Controlled Vocabularies • Name Authority Files • Other Types of Controlled Vocabularies • Faceted vs. Hierarchic Organization of Vocabularies • Discussion Questions
Assignments • Assignment 2: Due • Assignment 3: Photo Capture and Annotation • Assigned Sept 18 • Due Sept 23
Phone Project Consent Forms • Collection of Data for the Phone Project • Informed Consent and Release Form • Informed Consent to Release Academic Information • You must sign these forms to receive a phone and participate in the Phone Project • Signing these consent forms is not a condition of your participation in this course, nor will it be used as a basis for grading your performance therein
Collection of Data for the Phone Project • Call logging • All phone calls made from the phones provided to you will be logged. The phone conversations themselves are not going to be recorded, but a record will be made of which numbers were called, when, and for how long. • Approximate location logging • Your approximate location may be logged whenever the phone is used either for phone calls or to take, upload, annotate or retrieve photos. • Data correlation • The information from call logging and approximate location logging may be correlated with various other sources of information (e.g., raw location data may be correlated with map data to try to determine in which buildings the phone was used). • Sublicensing of data collected • Garage Cinema Research may sublicense portions of the collected data to other parties. This may include images of you or provided by you, as well as metadata about you or provided by you. • Privacy protections • Garage Cinema Research will not release your name, email address, or the complete phone numbers of the parties you called, except for their area codes and except for calls made between two Phone Project phones.
Informed Consent and Release Form • License to content • License to the content contributed by you to the system, including but not limited to images, annotations, and annotation frameworks, as well as any data that will be collected in accordance with the privacy protecting measures. • Identifying information and pseudonyms • Use of your name and email address by the system, understanding that they are not going to be released to third parties. Your name will be replaced with a pseudonym if the data is released to third parties. • Personal data collection • Applications built in the system will benefit from the use of personal information; however, you are not required to provide the system with any personal information about yourself or other people beyond the data that is being collected automatically. • Right of inspection/correction/deletion of photos • You have the right to inspect photos of you or information about you submitted by you and/or other users of the system and to have them corrected or removed.
Consent to Release Academic Information • Agreement to post work on IS202 web site • You agree to have your Phone Project course work posted, including your name, on the IS202 web site, which is accessible to the general public. • Understanding of course enrollment and authorship disclosure • You understand that this will publicly reveal that you are a student at the University of California at Berkeley, that you are taking this course, and that you are an author of this work. • Indefinite time period of posting • You understand that your name may be posted on this web site indefinitely, starting in September 2003. • Optional email address posting • The posting of student email addresses on the IS202 web site Phone Project group pages, while kindly requested, is not required.
Metadata • Structures and languages for the description of information resources and their elements (components or features) • “Metadata is information on the organization of the data, the various data domains, and the relationship between them” (Baeza-Yates p. 142)
Metadata • Often two main types of metadata are distinguished: • Descriptive metadata • Describes the information/data object and its properties • May use a variety of descriptive formats and rules • Topical metadata • Describes the topic or “aboutness” of an information/data object • May include a variety of vocabularies for describing subjects, topics, categories, etc.
Metadata Systems and Standards • Naming and ID systems – URLs, ISBNs • Bibliographic description – MARC, Dublin Core, TEI, etc. • Music – SMDL • Images and objects – CIMI, VRA core categories • Numeric data – DDI, SDSM • Geospatial data – FGDC • Collections – EAD
Dublin Core • Simple metadata for describing internet resources • For “Document-Like Objects” • 15 Elements (in base DC)
Dublin Core Elements • Title • Creator • Subject • Description • Publisher • Other Contributors • Date • Resource Type • Format • Resource Identifier • Source • Language • Relation • Coverage • Rights Management
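As a rough illustration of the element set above, a Dublin Core description can be sketched as a mapping from element names to values. The element names come from the slide; the record values are drawn from this lecture's title slide, and `DC_ELEMENTS`, `missing_elements`, and `unknown_elements` are hypothetical names chosen for this sketch, not part of any Dublin Core toolkit.

```python
# The 15 base Dublin Core elements, in the order listed on the slide.
DC_ELEMENTS = (
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Other Contributors", "Date", "Resource Type", "Format",
    "Resource Identifier", "Source", "Language", "Relation",
    "Coverage", "Rights Management",
)

# A partial record describing this lecture (illustrative values only).
record = {
    "Title": "Lecture 07: Controlled Vocabularies",
    "Creator": ["Ray Larson", "Marc Davis"],
    "Other Contributors": ["Marti Hearst"],
    "Date": "2003",
    "Language": "en",
}


def missing_elements(rec):
    """Return the base elements the record leaves unfilled, in slide order."""
    return [name for name in DC_ELEMENTS if name not in rec]


def unknown_elements(rec):
    """Return any keys in the record that are not base Dublin Core elements."""
    return sorted(set(rec) - set(DC_ELEMENTS))
```

Leaving elements empty is acceptable in this sketch; the helper only reports which of the base elements a record has not filled in.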
Controlled Vocabularies • Vocabulary control is the attempt to provide a standardized and consistent set of terms (such as subject headings, names, classifications, etc.) with the intent of aiding the searcher in finding information • That is, it is an attempt to provide a consistent set of descriptions for use in (or as) metadata
Controlled Vocabularies • Names and name authorities • Gazetteers (geographic names) • Code lists (e.g., LC language codes) • Subject heading lists • Classification schemes • Thesauri
Control of Names • Cutter’s (1876) objectives of bibliographic description • To enable a person to find a document of which • The author, or • The title, or • The subject is known • To show what a library has • By a given author • On a given subject (and related subjects) • In a given kind (or form) of literature. • First serves access • Second serves collocation
Problems with Names • How many names should be associated with a document? • Which of these should be the “main entry?” • What form should each of the names take? • What references should be made from other possible forms of names that haven’t been used?
The Problem • Proliferation of the forms of names • Different names for the same person • Different people with the same names • Examples • from Books in Print (semi-controlled but not consistent) • ERIC author index (not controlled)
Goethe …etc…
Rules for Description • AACR II and other sets of descriptive cataloging rules provide guidelines for: • Determining the number of name entries • Choosing a main entry • Deciding on the form of name to be used • Deciding when to make references
Authority Control • Authority control is concerned with the creation and maintenance of a set of terms that have been chosen as the standard representatives (also known as established forms) based on some set of rules • If you have rules, why do you need to keep track of all of the headings? Can’t you just infer the headings from the rules?
Conditions of Authorship? • Single person or single corporate entity • Unknown or anonymous authors • Fictitiously ascribed works • Shared responsibility • Collections or editorially assembled works • Works of mixed responsibility (e.g., translations) • Related works
Added Entries • Personal names • Collaborators • Editors, compilers, writers • Translators (in some cases) • Illustrators (in some cases) • Other persons associated with the work (such as the honoree in a festschrift) • Corporate names • Any prominently named corporate body that has involvement in the work beyond publication, distribution, etc.
Choice of Name • AACR II says that the predominant form of the name used in a particular author’s writings should be chosen as the form of name • References should be made from the other forms of the name
Form of the Name • When names appear in multiple forms, one form needs to be chosen • Criteria for choice are: • Fullness (e.g., full names vs. initials only) • Language of the name • Spelling (choose predominant form) • Entry element: • John Smith or Smith, John? • Mao Zedong or Zedong, Mao? (Mao Tse Tung?)
Name Authority Files • Different names for the same person
ID:NAFL8057230 ST:p EL:n STH:a MS:c UIP:a TD:19910821174242 KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:05-14-80 RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD: VST:d 08-21-91 Other Versions: earlier
040 DLC$cDLC$dDLC$dOCoLC
053 PR6005.R517
100 10 Creasey, John
400 10 Cooke, M. E.
400 10 Cooke, Margaret,$d1908-1973
400 10 Cooper, Henry St. John,$d1908-1973
400 00 Credo,$d1908-1973
400 10 Fecamps, Elise
400 10 Gill, Patrick,$d1908-1973
400 10 Hope, Brian,$d1908-1973
400 10 Hughes, Colin,$d1908-1973
400 10 Marsden, James
400 10 Matheson, Rodney
400 10 Ranger, Ken
400 20 St. John, Henry,$d1908-1973
400 10 Wilde, Jimmy
500 10 $wnnnc$aAshe, Gordon,$d1908-1973
Name Authority Files
ID:NAFO9114111 ST:p EL:n STH:a MS:n UIP:a TD:19910817053048 KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:06-03-91 RFE:a CSC:c SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD: VST:d 08-19-91
040 OCoLC$cOCoLC
100 10 Marric, J. J.,$d1908-1973
500 10 $wnnnc$aCreasey, John
663 Works by this author are entered under the name used in the item. For a listing of other names used by this author, search also under$bCreasey, John
670 OCLC 13441825: His Gideon's day, 1955$b(hdg.: Creasey, John; usage: J.J. Marric)
670 LC data base, 6/10/91$b(hdg.: Creasey, John; usage: J.J. Marric)
670 Pseuds. and nicknames dict., c1987$b(Creasey, John, 1908-1973; British author; pseud.: Marric, J. J.)
Name Authority Files • Different people writing with the same name
ID:NAFL8166762 ST:p EL:n STH:a MS:c UIP:a TD:19910604053124 KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:08-20-81 RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD: VST:d 06-06-91 Other Versions: earlier
040 DLC$cDLC$dDLC$dOCoLC
100 10 Butler, William Vivian,$d1927-
400 10 Butler, W. V.$q(William Vivian),$d1927-
400 10 Marric, J. J.,$d1927-
670 His The durable desperadoes, 1973.
670 His The young detective's handbook, c1981:$bt.p. (W.V. Butler)
670 His Gideon's way, 1986:$bCIP t.p. (William Vivian Butler writing as J.J. Marric)
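The role of the 100, 400, and 500 fields in the records above can be sketched as a small lookup structure: the 100 field holds the established heading, 400 fields are variant forms that refer the searcher to it, and 500 fields point to related established headings. This is a minimal sketch, not a MARC parser; the `AuthorityRecord` class and the `build_index` and `resolve` functions are hypothetical names, and only a few of the variants from the Creasey record are included.

```python
from dataclasses import dataclass, field


@dataclass
class AuthorityRecord:
    heading: str                                   # 100: established form
    see_from: list = field(default_factory=list)   # 400: variant forms not used as headings
    see_also: list = field(default_factory=list)   # 500: related established headings


creasey = AuthorityRecord(
    heading="Creasey, John",
    see_from=["Cooke, M. E.", "Fecamps, Elise", "Marsden, James"],
    see_also=["Ashe, Gordon, 1908-1973"],
)
marric = AuthorityRecord(
    heading="Marric, J. J., 1908-1973",
    see_also=["Creasey, John"],
)


def build_index(records):
    """Map every established heading and every variant form to its record."""
    index = {}
    for rec in records:
        index[rec.heading] = rec
        for variant in rec.see_from:
            index[variant] = rec
    return index


def resolve(index, name):
    """Return (established heading, see-also headings), or None if the name is unknown."""
    rec = index.get(name)
    if rec is None:
        return None
    return rec.heading, rec.see_also
```

A search on a variant such as "Fecamps, Elise" is redirected to the established heading, while a search on "Marric, J. J., 1908-1973" stays on that heading and offers "Creasey, John" as a related name. Keying the index by name alone would not separate two different people who share a name; the dates in the heading do that work in the records above.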
The Haunting of Lauran Paine • 1. Paine, Lauran. ALSO KNOWN AS: Batchelor, Reg. Beck, Harry. Bedford, Kenneth. Bosworth, Frank. Bovee, Ruth. Cassidy, Claude. Custer, Clint. Dana, Amber. Dana, Richard. Davis, Audrey. Drexler, J. F. Duchesne, Antoinette. Fisher, Margot. Fleck, Betty. Frost, Joni. Gordon, Angela. Gorman, Beth. Hayden, Jay. Houston, Will. Howard, Troy. Ingersol, Jared. … Kelly, Ray. Ketchum, Jack. Liggett, Hunter. Lucas, J. K. Lyon, Buck. Morgan, Arlene. Morgan, Valerie. O'Connor, Clint. St. George, Arthur. Sharp, Helen. Thorn, Barbara. Archer, Dennis. Clark, Badger. Carrel, Mark. Thompson, Russ. Andrews, A. A. Benton, Will. Bradford, Will. Bradley, Concho. Brennan, Will. Carter, Nevada. Allen, Clay. Almonte, Rosa. Armour, John. Cassady, Claude. Glendenning, Donn. Kelley, Ray. Kilgore, John. Martin, Tom. Slaughter, Jim. Standish, Buck. …
Structure of an IR System (adapted from Soergel, p. 19) • Search line: Interest profiles & queries → Formulating query in terms of descriptors → Storage of profiles → Store 1: Profiles/search requests • Storage line: Documents & data → Indexing (descriptive and subject) → Storage of documents → Store 2: Document representations • Rules of the game = Rules for subject indexing + Thesaurus (which consists of lead-in vocabulary and indexing language) • Store 1 and Store 2 → Comparison/matching → Potentially relevant documents
Uses of Controlled Vocabularies • Library subject headings, classification, and authority files • Commercial journal indexing services and databases • Yahoo, and other web classification schemes • Online and manual systems within organizations • SunSolve • MacArthur
Types of Indexing Languages • Uncontrolled keyword indexing • Indexing languages (controlled, but not structured) • Thesauri (controlled and structured) • Classification systems (controlled, structured, and coded) • Faceted thesauri and classification systems
Indexing Languages • An index is a systematic guide designed to indicate topics or features of documents in order to facilitate retrieval of documents or parts of documents • An indexing language is the set of terms used in an index to represent topics or features of documents, together with the rules for combining or using those terms
Indexing Languages • Library of Congress Subject Headings • Yellow pages topics • Wilson indexes (“Readers’ Guide”)
Thesauri • A thesaurus is a collection of selected vocabulary (preferred terms or descriptors) with links among • Synonymous • Equivalent • Broader • Narrower, and • Other related terms • National and international standards for thesauri (More next time)
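The links listed above can be sketched as a small data structure: lead-in (synonymous or equivalent) terms point to a preferred term, and preferred terms are connected by broader, narrower, and related links. The `Thesaurus` class and its method names are hypothetical, and the sample relationships are made up for illustration; they are not taken from the ERIC thesaurus or from any thesaurus standard.

```python
from collections import defaultdict


class Thesaurus:
    def __init__(self):
        self.use = {}                       # lead-in term (lowercased) -> preferred term
        self.broader = defaultdict(set)     # term -> broader terms
        self.narrower = defaultdict(set)    # term -> narrower terms
        self.related = defaultdict(set)     # term -> other related terms

    def add_synonym(self, entry_term, preferred):
        """Record that entry_term is not used; searchers are sent to preferred."""
        self.use[entry_term.lower()] = preferred

    def add_broader(self, term, broader_term):
        """Broader and narrower are inverse links, so both sides are stored."""
        self.broader[term].add(broader_term)
        self.narrower[broader_term].add(term)

    def add_related(self, a, b):
        """Related-term links are symmetric."""
        self.related[a].add(b)
        self.related[b].add(a)

    def preferred(self, term):
        """Map a lead-in term to its descriptor; other terms pass through unchanged."""
        return self.use.get(term.lower(), term)


# Illustrative entries only (assumed relationships, not a real thesaurus).
t = Thesaurus()
t.add_synonym("Sportspeople", "Athletes")
t.add_broader("Athletic Coaches", "Athletics")
t.add_related("Athletes", "Sports Psychology")
```

Storing the inverse link whenever a broader term is added keeps the structure consistent, so a searcher can move up or down the hierarchy from any descriptor.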
Classification Systems • A classification system is an indexing language often based on a broad ordering of topical areas • Thesauri and classification systems both use this broad ordering and maintain a structure of broader, narrower, and related topics • Classification schemes commonly use a coded notation for representing a topic and its place in relation to other terms
Classification Systems (Cont.) • Examples: • The Library of Congress Classification System • The Dewey Decimal Classification System • The ACM Computing Reviews Categories • The American Mathematical Society Classification System
Using Controlled Vocabulary • Start with the text of the document • Attempt to “control” or regularize: • The concepts expressed within • mutually exclusive • exhaustive • The language used to express those concepts • limit the normal linguistic variations • regulate word order and structure of phrases • reduce the number of synonyms or near-synonyms • Also, provide cross-references between concepts and their expression (These slides follow Bates 88) Slide author: Marti Hearst
Classification Schemes • Classify possible concepts. • Goals: • Completely distinct conceptual categories (mutually exclusive) • Complete coverage of conceptual categories (exhaustive) Slide author: Marti Hearst
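The two goals above amount to saying that the categories should partition the set of items: no item in two categories, and no item left out. A minimal sketch of such a check follows; the `check_scheme` function and the recipe categories are hypothetical examples invented for this illustration.

```python
def check_scheme(categories, items):
    """Check a scheme against the two goals.

    categories: dict mapping category name -> set of items assigned to it
    items: set of all items the scheme is supposed to cover
    """
    seen = set()
    overlaps = set()
    for members in categories.values():
        overlaps |= seen & members      # items already placed in another category
        seen |= members
    uncovered = set(items) - seen       # items no category accounts for
    return {
        "mutually_exclusive": not overlaps,
        "exhaustive": not uncovered,
        "overlaps": overlaps,
        "uncovered": uncovered,
    }


# Hypothetical recipe scheme that fails both goals.
recipes = {"apple pie", "beef stew", "lentil soup", "green salad"}
scheme = {
    "Desserts": {"apple pie"},
    "Soups and stews": {"beef stew", "lentil soup"},
    "Meat dishes": {"beef stew"},
}
result = check_scheme(scheme, recipes)
```

Here "beef stew" falls under two categories and "green salad" under none, so the sample scheme is neither mutually exclusive nor exhaustive.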
Assigning Headings vs. Descriptors • Subject headings • Assign one (or a few) complex heading(s) to the document • Descriptors • Mix and match • How would we describe recipes using each technique? Slide author: Marti Hearst
Subject Heading vs. Descriptors • Wilsonline: Athletes; Athletes -- Health & hygiene; Athletes -- Nutrition; Athletes -- Physical Exams; …; Athletics; Athletics -- Administration; Athletics -- Equipment -- Catalogs; …; Sports -- Accidents and Injuries; Sports -- Accidents and Injuries -- Prevention • ERIC: Athletes; Athletic Coaches; Athletic Equipment; Athletic Fields; Athletics; …; Sports Psychology; Sportsmanship Slide author: Marti Hearst
Subject Headings vs. Descriptors • Subject headings • Describe the contents of an entire document • Designed to be looked up in an alphabetical index • Look up document under its heading • Few (1-5) headings per document • AKA: Precoordination • Descriptors • Describe one concept within a document • Designed to be used in Boolean searching • Combine to describe the desired document • Many (5-25) descriptors per document • AKA: Postcoordination Slide author: Marti Hearst
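The contrast between precoordination and postcoordination can be sketched in a few lines: with subject headings, concepts are combined at indexing time and the searcher looks up one complete heading; with descriptors, single concepts are assigned at indexing time and combined at search time with Boolean operators. The headings below are taken from the Wilsonline sample; the document identifiers, descriptor assignments, and function names are hypothetical.

```python
# Precoordination: each document is filed under a few complete headings.
heading_index = {
    "Athletes -- Nutrition": {"doc1"},
    "Sports -- Accidents and Injuries -- Prevention": {"doc2"},
}

# Postcoordination: each document carries many single-concept descriptors.
descriptors = {
    "doc1": {"Athletes", "Nutrition", "Diet"},
    "doc2": {"Athletes", "Sports", "Injuries", "Prevention"},
    "doc3": {"Athletic Equipment", "Sports"},
}


def lookup_heading(heading):
    """Look up documents under one exact, precoordinated heading."""
    return heading_index.get(heading, set())


def boolean_and(*terms):
    """Return documents whose descriptors include every requested term."""
    wanted = set(terms)
    return {doc for doc, assigned in descriptors.items() if wanted <= assigned}
```

In this sketch `lookup_heading` only succeeds when the searcher knows the exact heading string, whereas `boolean_and("Athletes", "Injuries")` builds the combination on demand from whichever descriptors the searcher chooses.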
Hierarchical Classification • Each category is successively broken down into smaller and smaller subdivisions • No item occurs in more than one subdivision • Each level divided out by a “character of division” (also known as a feature) • Example: • Distinguish “Literature” based on: • Language • Genre • Time Period Slide author: Marti Hearst
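The "Literature" example above can be sketched as a tree built by applying one character of division per level, in a fixed order. Because each work has exactly one value for each feature, it lands in exactly one subdivision. The `DIVISIONS` tuple, the `classify` and `build_tree` functions, and the sample works are assumptions made for this sketch.

```python
# Characters of division, applied in a fixed order (one per level).
DIVISIONS = ("language", "genre", "period")

works = {
    "Faust": {"language": "German", "genre": "Drama", "period": "19th century"},
    "Pride and Prejudice": {"language": "English", "genre": "Novel", "period": "19th century"},
    "Great Expectations": {"language": "English", "genre": "Novel", "period": "19th century"},
}


def classify(item):
    """Return the single path of subdivisions that an item falls under."""
    return tuple(item[feature] for feature in DIVISIONS)


def build_tree(items):
    """Build a nested dict: language -> genre -> period -> list of titles."""
    tree = {}
    for title, item in items.items():
        node = tree
        for value in classify(item):
            node = node.setdefault(value, {})
        node.setdefault("_items", []).append(title)
    return tree


tree = build_tree(works)
```

Changing the order of `DIVISIONS` produces a different hierarchy over the same works, since the first character of division determines the top-level grouping.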