Taxonomies: Insuring compatibility and crosswalks Marjorie M. K. Hlava Access Innovations / Data Harmony mhlava@accessinn.com
Background • "Underlying the information architecture for web sites and search are taxonomies. The standards for thesauri, taxonomies, ontologies, the Semantic Web, and topic maps are converging. • Where do they differ and where are they the same? • This one-hour talk will cover the ISO, ANSI/NISO, and W3C terminology and controlled vocabulary standards, as well as the differences between the new standards and the previous editions. • Finally, it will talk about the crosswalks and registries underway between these development communities."
What we will cover today • Background • Overview of standards • Specifics on 3 things • NISO Z39.19 • BSI 8723 • IFLA • Thoughts on a registry
Why are taxonomies hot? • Search doesn’t work • Without tagged data • Websites need them to display information • To tag navigation back to content
What’s happening to the business? • Carpetbaggers • Differences of opinion • Want to build on existing taxonomies • Need for standards • Need for crosswalks • Need for international communication • Need for general registries of taxonomies
The Problem – KEEPING UP • Many players we know and don’t know • Between controlled vocabulary standards • ISO 2788 and 5964 • BSI 8723 • Groups developing guidelines and standards • W3C with SKOS and OWL • Governments worldwide developing and mandating taxonomies • Communities • increasing reuse • mapping and interoperability between controlled vocabularies
Traditional Standards • ISO • TC 46 • SC 9 • ANSI • NISO • Z39.19 • BSI • BS 8723 • W3C • OWL • SKOS • US Government • Office of Management and Budget • European Union
Thesaurus related • NISO Z39.19 2006 www.niso.org • BSI (BS 8723) the next revised ISO • ISO 2788 - Monolingual (1986) • ISO 5964 - Multilingual (1985) www.iso.ch/iso/en/ISOOnline.frontpage • ISO 5127, Information and documentation Vocabulary • OWL from W3C • SKOS the W3C thesaurus standard
Thesaurus and Indexing Standards – ANSI/NISO • ANSI/NISO Z39.19 - 2003 Guidelines for the Construction, Format, and Management of Monolingual Thesauri • NISO Z39.19-200x Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies • NISO TR02-1997 Guidelines for Indexes and Related Information Retrieval Devices, by James D. Anderson
The standards • NISO Z39.19 2006 www.niso.org • BSI (BS 8723) - the next revised ISO • ISO 2788 - Monolingual (1986) • ISO 5964 - Multilingual (1985) www.iso.ch/iso/en/ISOOnline.frontpage • ISO 5127 - Information and documentation Vocabulary • OWL from W3C • SKOS - the W3C thesaurus standard
Z39.19 - What’s new? • The old standard • Coverage: documents • Types of vocabularies: thesauri • Single BT • Post-coordinated • Printed formats • Monolingual vocabularies • The revised standard • Coverage: content objects • Types of vocabularies: lists, synonym rings, taxonomy • Pre-coordinated • Web format • Multilingual vocabularies (general) • Polyhierarchical • Interoperability • Facet analysis
British Standards - BS 8723 • Structured vocabularies for information retrieval – Guide • Part 1: General • Part 2: Thesauri • Part 3: Vocabularies other than thesauri • Part 4: Interoperability between vocabularies • Part 5: Interoperability with applications
ISO TC 37 Scope of ISO TC 37: Standardization of principles, methods and applications relating to terminology and other language resources. • TC 37/SC 1 - Principles and methods • TC 37/SC 2 - Terminography and lexicography • TC 37/SC 3 - Computer applications for terminology • TC 37/SC 4 - Language resource management
Other ISO standards: Concept-oriented terminology • ISO 704:2000 Terminology work - Principles and methods • ISO 860:1996 Terminology work - Harmonization of concepts and terms • ISO 1087-1:2000 Terminology work - Vocabulary - Part 1: Theory and application • ISO 1087-2:2000 Terminology work - Vocabulary - Part 2: Computer applications • ISO 10241:1992 Preparation and layout of international terminology standards
Sample ISO - Data Categories • ISO 12200:1999 Computer applications in terminology - Machine-readable terminology interchange format (MARTIF) - Negotiated interchange • ISO 12616:2002 Translation-oriented terminography • ISO/TR 12618:1994 Computer aids in terminology - Creation and use of terminological databases and text corpora • ISO 12620:1999 Computer applications in terminology - Data categories • used to create glossaries
ISO Thesaurus and Indexing Standards • ISO 2788:1986 Documentation - Guidelines for the establishment and development of monolingual thesauri • ISO 5964:1985 Documentation - Guidelines for the establishment and development of multilingual thesauri • ISO 5963:1985 Documentation - Methods for examining documents, determining their subjects, and selecting indexing terms • ISO 999:1996 Information and documentation - Guidelines for the content, organization and presentation of indexes
ISO TC 46/SC 9 • Information and Documentation - Identification and Description • TC 46 is ISO's Technical Committee (TC) for information and documentation standards. • SC 9 is the TC 46 Subcommittee (SC) that develops and maintains ISO standards on the identification and description of information resources.
ANSI/NISO Thesaurus and Indexing Standards • ANSI/NISO Z39.19 - 2005 Guidelines for the Construction, Format, and Management of Monolingual Thesauri • NISO Z39.19-200x Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies • NISO TR02-1997 Guidelines for Indexes and Related Information Retrieval Devices, by James D. Anderson
Reports to use • Report on the Workshop on Electronic Thesauri, November 4-5, 1999 http://www.niso.org/news/events_workshops/thes99rprt.html • Final Report to the ALCTS/CCS Subject Analysis Committee: Subcommittee on Subject Relationships/Reference Structures, June 1997 http://archive.ala.org/alcts/organization/ccs/sac/rpt97rev.html
Other links • http://esw.w3.org/topic/SkosDev/ThesaurusLinks/XmlFormats • MARC-21 XMLSchema. • Zthes Z39.50 profile for thesaurus navigation (2001). • TML thesaurus markup language (1999). • ADL Thesaurus Protocol XML formats (2002). • MeSH XML format (2001). • GEMET XML format (2003). • APAIS XML thesaurus format, an extension of Zthes (2000). • Open University thesaurus schemas (2002). • Soergel XML thesaurus specification (2001).
W3C • OWL – Web Ontology Language • RDF – Resource Description Framework • Topic Maps • SKOS - Simple Knowledge Organization System • Which community to serve? • Build on the current standard • Might make this link next
Other things to watch • Other W3C and ISO areas • Support groups • Blogs • Communities of Practice • SIMILE • Web 2.0 activities • WSDL – Web Services Description Language
Other Relevant ISO & W3C Standards For translation, terminology and applied linguists go to: http://appling.kent.edu/ResourcePages/LTStandards/Chart/standards.chart.htm#Ontology • Markup Languages • Metadata Resources • Character Coding • Access Protocols and Interoperability • Content Creation, Manipulation, and Maintenance • Authoring Standards • Text and Content Markup • Translation Standards • Terminology and Lexicography Standards • ISO TC 37 Standards • Terminology Interchange Standards • Controlled Language Standards • Taxonomy and Ontology Standards • Corpus Management Standards • Locale-Related Standards
SIMILE • Semantic Interoperability of Metadata and Information in unLike Environments • Forming a data reference for open source taxonomies
Revised Standards for Controlled Vocabularies • U.S. Standard (NISO Z39.19 - 2005) • British Standard (BS 8723 - 2005) • IFLA Guidelines - 2005
U.S. Standard for Controlled Vocabularies – NISO Z39.19 NISO Z39.19-200x Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies Some of the slides are based on Emily Fayen's June 2004 SLA presentation, Margie Hlava’s talk at the 2005 Data Harmony User Group meeting, and Marcia Zeng's presentation at the NKOS meeting in Denver
A little bit of history… • ANSI/NISO Z39.19, Guidelines for the Construction, Format, and Management of Monolingual Thesauri – 1993 • The most frequently requested NISO standard • In spite of its age, the standard is still relevant • 1999: NISO Workshop on Electronic Thesauri http://www.niso.org/news/events_workshop/thes99rpt.html • 2002: NISO initiates revision of Z39.19 • 2004: 1993 edition reaffirmed • 2005: new standard published
Scope • Expand beyond thesaurus • Make more user-friendly • Explain important concepts • Explain principles of vocabulary control • Include electronic information environment • Include additional user search methods: • Browse • Navigate • Keyword searching • Expand beyond A & I services • Include Web applications
The Team: • Vivian Bliss – Microsoft • Carol Brent – ProQuest • John Dickert – DTIC • Lynn El-Hoshy – Library of Congress • Marjorie Hlava – Access Innovations • Stephen Hearn – ALA • Sabine Kuhn – Chemical Abstracts Service • Pat Kuhr – H.W. Wilson Company • Diane McKerlie – DMA Consulting • Peter Morville -- Semantic Studios • Stuart Nelson – National Library of Medicine • Allan Savage – National Library of Medicine • Diane Vizine-Goetz – OCLC • Marcia Lei Zeng – Special Libraries Association
Z39.19 Chapters • Introduction • Scope • Referenced Standards • Definitions, Abbreviations, and Acronyms • Controlled Vocabularies – Purpose, Concepts, Principles, and Structure • Term Choice, Scope, and Form • Compound Terms • Relationships • Displaying Controlled Vocabularies • Interoperability • Construction, Testing, Maintenance, and Management Systems
Z39.19 - What’s new? • The old standard • Coverage: documents • Types of vocabularies: thesauri • Single BT • Post-coordinated • Printed formats • Monolingual vocabularies • The revised standard • Coverage: content objects • Types of vocabularies: lists, synonym rings, taxonomy • Pre-coordinated • Web format • Multilingual vocabularies (general) • Polyhierarchical • Interoperability • Facet analysis
Principles of Controlled Vocabularies • There are four important principles of vocabulary control that guide their design and development: • eliminating ambiguity • controlling synonyms • establishing relationships among terms where appropriate • testing and validation of terms
Lists A list is a simple group of terms Example: Alabama Alaska Arkansas California Colorado . . . . Frequently used in Web site pick lists and pull down menus
Synonym Rings A synonym ring is a list of synonyms or near synonyms that are used interchangeably for retrieval purposes
Synonym Rings -- Examples -- • Synonym rings are usually found as sets of lists that allow users to access all content containing any of the terms, e.g., cholesterol: Cholesterol, Blood Cholesterol, Serum Cholesterol, Good Cholesterol, Bad Cholesterol, LDL . . . • Frequently used in systems where the content is not indexed or the indexing vocabulary is not controlled
An example from International SEMATECH; a search for Silicon would look like this: Your search was submitted as “SILICON” or “SI”
Synonym rings are used: • To expand queries for content objects • any one of these terms retrieves content containing any of the terms in the cluster • With unstructured natural language formats • the interface draws together similar terms • With search engines • To help control the diversity of the language
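The query expansion described on these slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ring contents are copied from the cholesterol and silicon examples, while the names `SYNONYM_RINGS`, `expand_query`, and `to_boolean_query` are hypothetical and do not come from any product or standard.

```python
# Hypothetical synonym rings built from the slide examples.
# Each ring is a flat set: no term is "preferred" over another.
SYNONYM_RINGS = [
    {"cholesterol", "blood cholesterol", "serum cholesterol",
     "good cholesterol", "bad cholesterol", "ldl"},
    {"silicon", "si"},
]


def expand_query(term, rings=SYNONYM_RINGS):
    """Return the searched term followed by the other members of its ring."""
    key = term.strip().lower()
    for ring in rings:
        if key in ring:
            # Searched term first, remaining ring members in alphabetical order.
            return [key] + sorted(ring - {key})
    # A term outside every ring is searched as-is.
    return [key]


def to_boolean_query(term):
    """Render the expansion as an OR query, in the style of the SEMATECH example."""
    return " or ".join(f'"{t.upper()}"' for t in expand_query(term))


print(to_boolean_query("Silicon"))
```

Because a ring has no preferred term, searching on any member gives the same set of terms, which is what distinguishes it from the equivalence relationship in a thesaurus.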
Taxonomies A taxonomy is a set of preferred terms, all connected by a hierarchy or polyhierarchy Example: Chemistry Organic chemistry Polymer chemistry Nylon Frequently used in web navigation systems
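A minimal Python sketch of the chemistry example may help show what polyhierarchy means in practice: each term stores a list of parents rather than a single one. The second parent, "Materials science", is a hypothetical addition made only to demonstrate the idea, and the function names are illustrative. The sketch assumes the hierarchy contains no cycles.

```python
# Child term -> list of parent terms. Allowing more than one parent
# is what makes the structure a polyhierarchy.
# "Materials science" is a hypothetical second parent, not from the slide.
PARENTS = {
    "Organic chemistry": ["Chemistry"],
    "Polymer chemistry": ["Organic chemistry", "Materials science"],
    "Nylon": ["Polymer chemistry"],
}


def ancestors(term, parents=PARENTS):
    """Collect every broader term reachable from `term`, nearest first."""
    found = []
    queue = list(parents.get(term, []))
    while queue:
        current = queue.pop(0)
        if current not in found:
            found.append(current)
            queue.extend(parents.get(current, []))
    return found


def breadcrumbs(term, parents=PARENTS):
    """Return every root-to-term path, e.g. for web navigation trails."""
    ups = parents.get(term, [])
    if not ups:
        return [[term]]
    return [path + [term] for up in ups for path in breadcrumbs(up, parents)]


for path in breadcrumbs("Nylon"):
    print(" > ".join(path))
```

With a single parent per term there would be exactly one navigation trail; the extra parent produces two, which is the practical consequence of polyhierarchy for web navigation.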
Thesauri A thesaurus is a controlled vocabulary with multiple types of relationships Example: Rice UF paddy BT Cereals BT Plant products NT Brown rice RT Rice straw
Thesauri (cont.) Relationship types: • Equivalence (Use/Used For) – indicates the preferred term in a synonym relationship • Hierarchy – indicates broader and narrower terms • Associative – indicates related terms; almost unlimited types of relationships may be used • The thesaurus is the most complex format for controlled vocabularies and is widely used.
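The three relationship types can be modeled as a small record per term. The Python sketch below is an assumption-laden illustration, not a format defined by Z39.19: the `Term` class, the `THESAURUS` sample (a deliberately incomplete version of the Rice example), and both helper functions are hypothetical names. It shows how entry terms resolve to a preferred term and how missing reciprocal links (BT without NT, RT without RT) can be detected.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One preferred term and its relationships (hypothetical structure)."""
    label: str
    used_for: list = field(default_factory=list)  # UF: non-preferred entry terms
    broader: list = field(default_factory=list)   # BT
    narrower: list = field(default_factory=list)  # NT
    related: list = field(default_factory=list)   # RT


# Deliberately incomplete sample: "Plant products" and "Rice straw"
# have no records of their own, so their reciprocal links are missing.
THESAURUS = {
    t.label: t
    for t in [
        Term("Rice", used_for=["paddy"], broader=["Cereals", "Plant products"],
             narrower=["Brown rice"], related=["Rice straw"]),
        Term("Cereals", narrower=["Rice"]),
        Term("Brown rice", broader=["Rice"]),
    ]
}


def preferred_term(entry, thesaurus=THESAURUS):
    """Resolve an entry term to its preferred term (equivalence relationship)."""
    for term in thesaurus.values():
        if entry == term.label or entry in term.used_for:
            return term.label
    return None


def missing_reciprocals(thesaurus=THESAURUS):
    """List (term, relation, target) triples whose inverse link is absent."""
    inverse = {"broader": "narrower", "narrower": "broader", "related": "related"}
    problems = []
    for term in thesaurus.values():
        for relation, back in inverse.items():
            for target in getattr(term, relation):
                other = thesaurus.get(target)
                if other is None or term.label not in getattr(other, back):
                    problems.append((term.label, relation, target))
    return problems


print(preferred_term("paddy"))
print(missing_reciprocals())
```

A check like `missing_reciprocals` is one way to support the "testing and validation of terms" principle, since hierarchical and associative links are expected to hold in both directions.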
Interoperability • One of the most important issues from the 1999 workshop • Question: How to • compare indexes • perform searches • merge databases that have been developed using different controlled vocabularies?
Interoperability (CONT.) • Factors Affecting Interoperability • Multilingual Controlled Vocabularies • Searching • Indexing • Merging Databases • Merging Controlled Vocabularies • Achieving Interoperability • Storage and Maintenance of Relationships among Terms in Multiple Controlled Vocabularies
II. The British Standard BS 8723: Structured Vocabularies for Information Retrieval – Guide Slides based on the presentation by Stella G Dextre Clarke, Alan Gilchrist, and Leonard Will at ISKO 2004, London
Existing BSI/ISO thesaurus standards • ISO 2788-1986 Guidelines for the establishment and development of monolingual thesauri = BS 5723:1987 • ISO 5964-1985 Guidelines for the establishment and development of multilingual thesauri = BS 6723:1985
What needs updating? • Printed versus electronic application • Guidance on management software • Interoperability: • Mapping between thesauri and other types of vocabulary • Formats/protocols for data exchange with downstream applications • Applicability to end-user applications, not just those for information professionals
Outline of new standard BS 8723: Structured vocabularies for information retrieval – Guide • Part 1 - Definitions, symbols and abbreviations • Part 2 - Thesauri • Part 3 - Vocabularies other than thesauri • Part 4 - Interoperability between vocabularies • Part 5 - Interoperation between vocabularies and other components of information storage and retrieval systems
Part 3 chapters • Classification schemes • Subject heading lists • Taxonomies • Ontologies • Semantic nets (?) • Search thesauri
Issues for Part 3 • How much guidance is needed on how to build other sorts of vocabulary? • Should we describe the idiosyncrasies of existing schemes, even where we judge there is a ‘better’ way? • Pick out the characteristics of different vocabulary types that govern when and how you can map them. • But some of the observable characteristics might not be what we’d recommend.
Part 4: Interoperability between vocabularies • Huge demand for accessing information • indexed with another language and/or vocabulary. • ‘Mapping’. The Semantic Web is just one application. • Includes multilingual thesauri • special case of mapping between vocabularies. • Applies where • more than one language or vocabulary is in use, • access to all resources is through one vocabulary
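The idea of reaching resources indexed with another vocabulary through a single vocabulary can be sketched as a crosswalk table. Everything in this Python example is hypothetical: the target terms ("Oryza sativa", "Grain crops"), the match-type labels, and the function names are assumptions for illustration and are not taken from BS 8723.

```python
# Hypothetical crosswalk from the searcher's vocabulary to the vocabulary
# used to index another collection. Each source term maps to zero or more
# (target term, match type) pairs; the match-type labels are illustrative.
CROSSWALK = {
    "Rice": [("Oryza sativa", "exact")],
    "Cereals": [("Grain crops", "exact")],
    "Brown rice": [("Oryza sativa", "broader")],
}


def translate(term, crosswalk=CROSSWALK, allowed=("exact",)):
    """Return target terms for `term`, restricted to the allowed match types."""
    return [target for target, match in crosswalk.get(term, []) if match in allowed]


def unmapped(terms, crosswalk=CROSSWALK):
    """Report source terms with no mapping at all, i.e. gaps in the crosswalk."""
    return [t for t in terms if t not in crosswalk]


# Exact matches only: precise, but "Brown rice" finds nothing.
print(translate("Brown rice"))
# Accepting broader matches trades precision for recall.
print(translate("Brown rice", allowed=("exact", "broader")))
print(unmapped(["Rice", "Rice straw"]))
```

Recording the match type alongside each mapping lets an application decide how loose a translation it is willing to accept, and the list of unmapped terms shows where the crosswalk still needs editorial work.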