Taxonomies and Indexing: A Technical Strategy Diane Vizine-Goetz Office of Research OCLC Online Computer Library Center, Inc.
Context • Techniques and approaches developed by & for libraries and other institutions responsible for preserving the human record • Broad scope • Long tradition of information organization
Why organize information? • For • Search and retrieval • Use • Preservation & disposition
Why Organize Information by Subject? • Find information on a particular subject • Only relevant information (precision) • All relevant information (recall) • Find related information
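Precision and recall can be made concrete with a small calculation. The sketch below uses the usual set-based definitions (precision = relevant retrieved / all retrieved; recall = relevant retrieved / all relevant). The document IDs are hypothetical and serve only as an illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single search result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant items that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical document IDs, for illustration only.
retrieved = ["d1", "d2", "d3", "d4"]   # what the search returned
relevant = ["d2", "d4", "d5"]          # what the user actually needed
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Two of the four retrieved items are relevant (precision 0.50), and two of the three relevant items were found (recall about 0.67).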
How? • Subject analysis • Conceptual analysis--Determining what an information object is “about” • Translate concepts into knowledge organization (KO) scheme • e.g., Subject indexes • Thesauri • Classification scheme • Automated, Semi-automated, Human/Intellectual
Automated Concept Identification • Automated Indexing • Ranges from simply identifying words in a document, to • Sophisticated analyses that identify key names, words, and phrases • WordSmith Project http://orc.rsch.oclc.org:5061/ • Automated Classification • Automated assignment of documents to categories or classes
Political News Concepts Extracted by WordSmith • fair housing • fair housing act • family planning • family planning programmes • family planning programs • family planning services • federal government • federal government deficit • federal reserve • federal reserve bank • federal reserve board • federal reserve chairman alan greenspan • federal reserve system
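The simple end of automated indexing, identifying recurring words and phrases in text, might look like the sketch below. It is not the WordSmith method, which the slides do not describe; it only counts stopword-free word sequences. The stopword list, the sample text, and the frequency threshold of 2 are all assumptions made for illustration.

```python
import re
from collections import Counter

# A deliberately tiny stopword list (assumption; real systems use larger ones).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "on", "for", "by", "is", "that"}

def candidate_phrases(text, max_len=3):
    """Count 2..max_len word sequences that contain no stopword."""
    counts = Counter()
    # Split on punctuation so a phrase never spans a clause boundary.
    for segment in re.split(r"[^\w\s]+", text.lower()):
        words = segment.split()
        for n in range(2, max_len + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                if not any(w in STOPWORDS for w in gram):
                    counts[" ".join(gram)] += 1
    return counts

# Made-up sample text, for illustration only.
sample = ("The Federal Reserve Board met today. "
          "Analysts said the Federal Reserve Board may act, "
          "and the Federal Reserve chairman agreed.")
phrases = candidate_phrases(sample)
frequent = sorted(p for p, c in phrases.items() if c >= 2)
print(frequent)
```

Raw counts like these also surface fragments such as "reserve board", which is one reason the more sophisticated analyses mentioned above add name and phrase recognition on top of simple counting.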
Advantages of automatic concept identification • Inexpensive • Suitable for indexing/categorizing large quantities of text • Can identify popular and emerging concepts and terminology
Why use knowledge organization schemes? • Knowledge organization schemes such as subject heading lists, thesauri, & classification schemes are specialized languages designed for retrieving information • Goal--to reduce ambiguities that cause precision & recall failures
Free text vs. controlled subject retrieval language • WordSmith (free text): family planning; family planning programmes; family planning programs; family planning services • Library of Congress Subject Headings (LCSH): Birth control clinics (19860211) • UF Family planning services; Planned parenthood services • BT Clinics
MeSH Heading vs. LCSH • MeSH: Family Planning • Note: Programs or services designed to assist the family in controlling reproduction by either improving or diminishing fertility. • Entry Term: Birth Control; Planned Parenthood; Basal Body Temperature Method; Birth Limiting; Births Averted; Family Planning Surveys; ... • LCSH: Birth control (19880919) • UF Family planning; Planned parenthood; Population control; Pregnancy--Prevention • BT Hygiene, Sexual; Sexual ethics • RT Contraception; Family size • NT Abortion; Birth Intervals; Childlessness; ...
Characteristics of subject retrieval languages • Terminology is often domain specific • Medicine > MeSH; Engineering > INSPEC; Agriculture > Agrovoc • Control vocabulary (synonyms & homonyms) • Express relationships between terms
Ei Thesaurus™ • Bank protection • UF Coastal engineering--Bank protection; Inland waterways--Bank protection • SN Protection of river banks and lake shores. For seacoasts, use SHORE PROTECTION • DT January 1993 • BT Protection • RT Banks (bodies of water); Coastal engineering; Environmental engineering; Erosion; Inland waterways; River control; Shore protection; Slope protection; Soil conservation • MC 407.2; 407.3 • OC 914.1 • Within a domain, terms are context independent
Controlled Vocabulary • Preferred way of expressing a concept • e.g., Popular vs. technical • Heart attack vs. Myocardial infarction • Non-used vocabulary often included • Synonyms • Current/Outdated terms > Disabled/Handicapped • Lexical variants • Phrase/Inverted forms > Bilingual education/Education, Bilingual • Quasi-Synonyms • Synonyms/Antonyms > Literacy/Illiteracy
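Vocabulary control amounts to routing every entry term to one preferred heading. A minimal sketch of that lookup follows, using the UF ("used for") references from the two LCSH records shown earlier. The table layout and the function names `normalize` and `to_preferred` are assumptions for illustration, not how LCSH is actually stored or searched.

```python
# Entry (non-preferred) term -> preferred heading, taken from the LCSH
# "Birth control clinics" and "Birth control" records in the slides.
USE_FOR = {
    "family planning services": "Birth control clinics",
    "planned parenthood services": "Birth control clinics",
    "family planning": "Birth control",
    "planned parenthood": "Birth control",
    "population control": "Birth control",
}
PREFERRED = set(USE_FOR.values())

def normalize(term):
    """Lowercase and collapse whitespace so trivial variants match."""
    return " ".join(term.lower().split())

def to_preferred(term):
    """Return the preferred heading for a term, or None if it is not in the vocabulary."""
    key = normalize(term)
    if key in USE_FOR:
        return USE_FOR[key]
    for heading in PREFERRED:
        if normalize(heading) == key:
            return heading  # the term is already a preferred heading
    return None

print(to_preferred("Family Planning Services"))
print(to_preferred("birth control"))
print(to_preferred("road rage"))
```

A term outside the vocabulary returns `None`, which is the point where a searcher using a controlled scheme has to fall back on free text.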
Relationships • Equivalence • Synonymous terms • Hierarchy • Generic relationship (kind) • Whole-part relationship • Instance relationship (example) • Association
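Hierarchical and associative relationships can be held as labeled links between terms. The sketch below loads the BT/NT/RT references from the LCSH "Birth control" record shown earlier and stores each link with its reciprocal (BT pairs with NT, and RT is symmetric). The data structure and the names `build_thesaurus` and `expand_query` are illustrative assumptions.

```python
from collections import defaultdict

# (term, relation, target) triples from the LCSH "Birth control" record.
LINKS = [
    ("Birth control", "BT", "Hygiene, Sexual"),
    ("Birth control", "BT", "Sexual ethics"),
    ("Birth control", "RT", "Contraception"),
    ("Birth control", "RT", "Family size"),
    ("Birth control", "NT", "Abortion"),
    ("Birth control", "NT", "Birth Intervals"),
    ("Birth control", "NT", "Childlessness"),
]

# A broader-term link implies a narrower-term link in the other direction;
# a related-term link holds in both directions.
RECIPROCAL = {"BT": "NT", "NT": "BT", "RT": "RT"}

def build_thesaurus(links):
    """Store every link together with its reciprocal."""
    thesaurus = defaultdict(lambda: defaultdict(set))
    for term, rel, target in links:
        thesaurus[term][rel].add(target)
        thesaurus[target][RECIPROCAL[rel]].add(term)
    return thesaurus

def expand_query(thesaurus, term):
    """Return the term plus its narrower terms, one way to widen a search."""
    return {term} | thesaurus[term]["NT"]

thesaurus = build_thesaurus(LINKS)
print(sorted(expand_query(thesaurus, "Birth control")))
print(thesaurus["Abortion"]["BT"])
```

Storing reciprocals means a user who starts at "Abortion" can browse up to "Birth control" even though only the downward link was entered.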
Classification / Categorization System • A systematic arrangement of knowledge into useful categories • General schemes & special schemes • DDC, LCC, UDC & AGRIS, MSC • Present a generalized view of knowledge at varying levels of depth • May be enumerative or synthetic
Some Advantages of Traditional Schemes • Meaningful notation • Well-developed hierarchies • Well-defined categories • Rich network of relationships
Meaningful Notation (DDC) • 005.1 Programming • 005.1 Programmation • 005.1 Программирование • 005.1 Programación
DDC Notation Indicates Hierarchy • 600 Technology • 630 Agriculture • 633 Field and plantation crops • 633.1 Cereals • 633.11 Wheat • 633.12 Buckwheat • 633.13 Oats
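Because the notation carries the hierarchy, the broader classes of a number can be derived from the number itself. The sketch below assumes that hierarchy is expressed purely by the length and trailing zeros of the notation, which holds for the example above but is not guaranteed for every DDC number. The function name `ddc_ancestors` is hypothetical.

```python
def ddc_ancestors(notation):
    """Return the broader classes implied by a DDC-style number, nearest first."""
    main, _, decimals = notation.partition(".")
    if len(main) != 3 or not (main + decimals).isdigit():
        raise ValueError(f"not a DDC-style number: {notation!r}")
    ancestors = []
    # Drop decimal digits one at a time: 633.11 -> 633.1
    for i in range(len(decimals) - 1, 0, -1):
        ancestors.append(f"{main}.{decimals[:i]}")
    if decimals:
        ancestors.append(main)  # 633.1 -> 633
    # Zero out main-number digits from the right: 633 -> 630 -> 600
    digits = main
    for pos in (2, 1):
        if digits[pos] != "0":
            digits = digits[:pos] + "0" * (3 - pos)
            ancestors.append(digits)
    return ancestors

print(ddc_ancestors("633.11"))  # Wheat -> Cereals -> Field and plantation crops -> Agriculture -> Technology
```

A retrieval system can use such a chain to broaden a search step by step, or to group results under progressively more general classes.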
Hierarchies & Categories • Hierarchical from general to specific • Categories have superordinate, coordinate, subordinate relationships in hierarchy • Subcategories must be mutually exclusive
Hierarchies & Categories • Top > Recreation > Automotive > Driving > Road Rage • Social Problems > Public Safety > Traffic Hazards > Highways > Road Rage
Hierarchies, Categories, Relationships • 500 Science • 510 Mathematics • 512 Algebra, number theory • 512.3 Fields • Class here field theory, Galois theory • Class linear algebra in 512.5; class number theory in 512.7
Advantages of Category Schemes • Facilitate retrieval based on concepts not simply keywords • Provide context for search terms (disambiguates) • Facilitate browsing & search refinement
Advantages & Disadvantages of Formal KO Schemes + • Bring like items together • Provide context & show relationships • Support browsing • May accommodate multilingual usage - • Reactive to emerging topics • Terminology may not match users • Not practical to apply to everything
Advantages & Disadvantages of Free Text + • Latest terminology • Application not an issue - • User must supply synonyms and relationships • Limited browsing • Little multilingual support
Other Solutions • Combine approaches • Map among KO schemes • Map free text terms to KO schemes • Produce supplemental browsable indexes from free text
Resources • ANSI/NISO Z39.19-1993 (Revision of ANSI Z39.19-1980) Guidelines for the Construction, Format, and Management of Monolingual Thesauri <http://www.niso.org/stantech.html#z3919> • Controlled vocabularies, thesauri and classification systems available in the WWW. DC Subject <http://www.lub.lu.se/metadata/subject-help.html> • The Intellectual Foundation of Information Organization by Elaine Svenonius. MIT Press; ISBN: 0262194333 • List of Web Subject Resources <http://www.loc.gov/catdir/pcc/saco/resources.html> • The Organization of Information (Library and Information Science Text Series) by Arlene G. Taylor. Libraries Unlimited; ISBN: 1563084988 • Resources for Indexers <http://www.asindexing.org/asires.shtml>