1 / 52

Implementing a Government-wide Semantic Solution to Thesauri

Kenneth B. Sall , Science Applications International Corporation (SAIC) and Ronald P. Reck , RRecktek LLC November 15, 2005 XML 2005 , Atlanta, GA. Implementing a Government-wide Semantic Solution to Thesauri. Agenda. Problem Goals and Requirements Early Thesaurus Attempts

nash
Download Presentation

Implementing a Government-wide Semantic Solution to Thesauri

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Kenneth B. Sall, Science Applications International Corporation (SAIC) and Ronald P. Reck, RRecktek LLC November 15, 2005 XML 2005, Atlanta, GA Implementing a Government-wide Semantic Solution to Thesauri

  2. Agenda • Problem • Goals and Requirements • Early Thesaurus Attempts • Thesauri Standards and Specifications • Basic Thesaurus Terminology • SKOS (Simple Knowledge Organisation System) • Our SKOS Element Subset and Extensions • SKOSaurus Pilot (updated since September 2005 paper) • Next Steps

  3. Problem Statement • Government agencies need common vocabulary of (technical) terminology. • Communication and data sharing is greatly enhanced when the semantics are clear. • Various government groups approach this in different ways -- Microsoft Word, Excel, HTML, databases, and wiki pages: bulleted lists, tables, spreadsheets, acronym lists, etc. • Need to focus on a common formats and standards that enable reuse and harmonization across Communities of Interest (COIs).

  4. Goals and Requirements • Allow grouping terms in one COI or sharing across COIs. • Should benefit from ISO standards for thesauri. • Enable term authors to use familiar tools (e.g., Excel). • XML-based (RDF) solution with few required elements but many optional and/or repeatable elements. • Multiple definitions of the same term must be permitted, with either same or different subject/context. • Should support semantic relationships between terms (synonyms, related-to, broader-than, narrower-than); search thesaurus. • [Many more in paper.]

  5. Early Thesaurus Attempts See http://kensall.com/gov/glossary/#older

  6. XSLT-Generated Search Links AcronymFinder; WikiPedia; Clusty; Clusty Gov [.gov and .mil]; Google Uncle Sam [.gov and .mil]; Google Define; Google; Merriam-Webster; W3C; W3Schools; Webopedia; WhatIs; WordNet; ZVON.

  7. Thesauri Standards and Specifications • ISO 2788:1986 – Documentation - Guidelines for the establishment and development of monolingual thesauri • Developing a Thesaurus (mono-lingual) • ISO 5964:1986 – multi-lingual version • ISO 1087:2000 - Vocabulary of Terminology • ISO 704:2000 - Principles and Methods • ANSI/NISO Z39.19-2003 - Construction, Format, and Management • ISO 15836:2003 - The Dublin Core metadata element set • [Many more listed in paper.]

  8. Basic Thesaurus Terminology [ISO 2788:1986] • Thesaurus – list of concepts in a particular domain of knowledge together with explicit relationships • Concept - unit of thought that exists in the mind as an abstract entity, independent of the term(s) that identify it • Concept Scheme - set of concepts, optionally including statements about semantic relationships between those concepts. • Thesauri, classification schemes, subject heading lists, taxonomies, terminologies, glossaries and other types of controlled vocabularies

  9. Basic Thesaurus Terminology [ISO 2788:1986] • USE (or SEE) – preferred label for this concept • UF = Use For – alternate label, may be a synonym but less preferred [e.g., birds USE FOR Aves] • SN = Scope Note - to clarify or constrain the meaning; sometimes contains the concept’s definition • BT = Broader Than – another concept more general than this concept • NT = Narrower Than – more specialized than this concept • RT = Related To – concept that is similar in some way

  10. Thesaurus Concept Example Source: GAO Thesaurus, Feb. 2005

  11. SKOS (Simple Knowledge Organisation System) • Leverages ISO 2788 (and ISO 5964) • Semantic Web Best Practices and Deployment Working Group – W3C • SKOS Working Drafts (W3C) and Related Efforts • SKOS Core Guide • SKOS Core Vocabulary Specification • Quick Guide to Publishing a Thesaurus on the Semantic Web • SKOS Mapping • SKOS Extensions • SKOS API • Development Wiki

  12. SKOS • “SKOS Core is a model for expressing the structure and content of concept schemes (thesauri, classification schemes, subject heading lists, taxonomies, terminologies, glossaries and other types of controlled vocabulary).” • “The SKOS Core Vocabulary is an application of the Resource Description Framework (RDF), that can be used to express a concept scheme as an RDF graph. Using RDF allows data to be linked to and/or merged with other RDF data by semantic web applications.” • It uses RDFS Classes and RDF Properties to describe Concepts and Concept Schemes. Source: SKOS Core Guide, May 2005

  13. SKOS Vocabulary Implemented 9 of the 26 SKOS Properties

  14. Our SKOS Element Subset and Extensions, 1 • skos:Concept – used for Containment • skos:prefLabel, skos:altLabel – used for Lexical Labeling • skos:related, skos:narrower, skos:broader – used for Semantic Relationships • skos:scopeNote, skos:definition, skos:example – used for Documentation • skos:subject – used for Indexing (future)

  15. Our SKOS Element Subset and Extensions, 2 • skos:Concept – contains all statements about properties for one concept • skos:prefLabel – USE; preferred handle for this concept; designator. [In SKOS, no two concepts in the same concept scheme may have same prefLabel.] • skos:altLabel – UF; alternate handle; spelling variants; can be used for abbreviations or acronyms (but we don’t) • skos:related, skos:narrower, skos:broader – associated with, more specific, or more general than this concept

  16. Our SKOS Element Subset and Extensions, 3 • skos:scopeNote – constrains meaning; ISO 2788 allows definitions to appear here (but we don’t) • skos:definition – statement or formal explanation of the meaning of a concept • skos:example – used in a sentence • skos:subject – topic; can be a skos:broader

  17. Our SKOS Element Subset and Extensions, 4 • Pilot Extensions (non-SKOS) • ABBREVIATON_OR_ACRONYM – common government need (could define as rdfs:subPropertyOfskos:altLabel) • SOURCE - official document names and URLs are preferred, but specific names of people or agencies are acceptable; (probably could define as rdfs:subPropertyOfskos:note) • COI – essentially a skos:Collection (with a potential skos:ConceptScheme)

  18. Back to SKOSaurus Bird Example as RDF Graph

  19. Illustrative Statements • An alternate label (skos:altLabel) for "bird" is "Aves". • The concepts with the preferred label "vertebrate" and "animal" are broader than the concept with the preferred label "bird". • There are four specializations of birds listed ("robin", "hawk", "sparrow" and "eagle"), each indicated as skos:narrower than "bird". • The concepts "lizard" and "reptile" are skos:related to the "bird" concept in some way. • Among various concepts which might have the skos:prefLabel of "bird", the one illustrated is constrained to ornithology, according to skos:scopeNote. This distinguishes the concept from "bird", such as in the informal term for a (young) woman.

  20. OWL Statements About SKOS • skos:broader owl:inverseOf skos:narrower • skos:narrower owl:inverseOf skos:broader and • skos:broader is an owl:TransitiveProperty • skos:narrower is an owl:TransitiveProperty • RDF/OWL version of SKOS Core

  21. SKOSaurus Pilot • Proof of concept • Many simplifying assumptions • Fabricated data (except for DTIC) • About 100 man hours • Ron Reck and Ken Sall

  22. SKOSaurus Pilot: Environment • The host operating system is Microsoft Windows XP with Service Pack 2. • Dell Latitude D800 (1.69GHz) with 1G of RAM. • The Windows XP host runs VMware 5.0 build 13124 to emulate a machine onto which the Solaris X86 operating system version 10 is installed. • This is referred to as the guest operating system which runs the SKOSaurus system, consisting of: • Perl version 5.8.7 and various Perl modules • Java version 1.4.2.08 • Kowari server 1.1.0 Pre2 • XSLT stylesheets

  23. Main Use Cases (for Pilot) • Concept Entry via Web Form • File Upload of Excel Spreadsheet (as CSV) • File Upload of SKOS (or RDF) • Query of Concept Data Store

  24. Example Spreadsheet: birds

  25. Spreadsheet Conventions (Pilot) • One row per concept, sparse or densely populated. • New row for different definition or homonym (e.g., bird). [SKOS conflict: duplicate prefLabels.] • The heading row should not be removed or modified. • Column order is invariant. • Since several elements are repeatable, use semi-colon to indicate iteration. • A limitation in our pilot parser requires the author to use the pipe symbol ("|") instead of a comma within a cell. • Any number of rows can be included, but there must be no blank rows or separator rows. • File > Save As Comma Separated Values (*.csv).

  26. SKOSaurus: Home

  27. SKOSaurus: Manage COIs

  28. SKOSaurus: Upload CSV or SKOS

  29. SKOSaurus: Upload Feedback Generated SKOS files Datastore for COI

  30. SKOSaurus: Generated SKOS Excerpt, 1 of 2

  31. SKOSaurus: Generated SKOS Excerpt, 2 of 2

  32. SKOSaurus: Web Form

  33. SKOSaurus: Kowari Model Dump: Query

  34. SKOSaurus: Kowari Model Dump: Result

  35. SKOSaurus: Intuitive Search

  36. SKOSaurus: Intuitive Search: bird Result Note: 2 altLabels, really 2 different concepts.

  37. SKOSaurus: Intuitive Search: animal, reptile Jump to Graph

  38. SKOSaurus: Intuitive Search: eagle

  39. SKOSaurus: Intuitive Search: bald eagle

  40. SKOSaurus: Intuitive Search: dame

  41. SKOSaurus: DTIC Data: 73,500 lines Source: Defense Technical Information Center

  42. SKOSaurus: DTIC Search: ANATOMY

  43. SKOSaurus: DTIC Search: ANATOMY

  44. SKOSaurus: DTIC Search: BIOLOGY

  45. SKOSaurus: DTIC Search: BIOPHYSICS, PHYSICS

  46. Design Issue: Bootstrapping • Unique numeric URI • Pro: designed to avoid collisions (duplicate prefLabels) • Con: not intuitive and not transparent for humans • URI based on concept’s skos:prefLabel • Pro: transparent; less query intensive since we can map from prefLabel directly to URI • Con: easy to have collisions; need rules for dealing with: • Capitalization • White space • Singular vs. plural forms • Non-alphabetic characters vs. acceptable URI characters

  47. Design Issue: URI Format • <skos:Concept rdf:about="http://skos.rrecktek.com/drm/bird#concept"> • <skos:Concept rdf:about="http://skos.rrecktek.com/drm/concept#bird"> • <skos:Concept rdf:about="http://skos.rrecktek.com/drm/bird/concept"> • <skos:Concept rdf:about="http://skos.rrecktek.com/drm/concept/bird">

  48. Next Steps – 1 of 3 • Normalize Definitions - A submission-and-approval process can be added which requires each Community of Interest (COI) to designate an owner who can promote entries from the candidate status to the approved status. • Edits of Existing Terms • Add Query across COIs. • Add skos:historyNote - to record CSV file upload date, contributor; update with modifications. • If a database can output data in RDF or SKOS format, the SKOSaurus system could permit an upload or SOAP entry of the data without modification. • Migrate from an existing representation (database) to SKOS or RDF (e.g., Oracle 10G Release 2 has exciting promise for supporting RDF).

  49. Next Steps – 2 of 3 • Consider other conversion processes (e.g., Microsoft Word tables). • Integrate Google define:, WikiPedia, and other generated search links. • Create XSLT and XSL-FO stylesheets for printing, probably with an interface to select a subset of concepts to print. • Consider a central or federal data store for all concepts for multiple COIs and agencies with system management roles at COI and agency level, as well as across the entire data store. • Enhance the web form so that repeatable fields (narrower, broader, related, altLabel, etc.) can be entered more than once.

  50. Next Steps – 3 of 3 • Establish clear authoring conventions • Case convention (UpperCamelCase, Title Case, lowercase, all caps?) • Pluralization (use singular form, but what about irregular cases: men/man, geese/goose, etc.) • Compound terms (e.g., Data Architecture, Data Class) • Placement of acronym/abbreviation (separate element) • Placement of source (separate element) • Citation method (URIs, bibliographical, free form?) Source could contain child elements for each possible format.

More Related