The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance of Normalization For the Data Search. SiCOP Conference 6 February 2007. Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center.
The Value of Controlled Vocabularies Beyond the DRM 2.0: The Importance of Normalization For the Data Search SiCOP Conference 6 February 2007 Lola M. Olsen Global Change Master Directory NASA’s Goddard Space Flight Center
Presentation Guide • The Global Change Master Directory (GCMD) and the DRM 2.0 • Vision and Mission • Data Context, Data Descriptions, and Data Sharing • Content and Usage • Value (most appreciated aspects expressed from users) • Design Evolution • MD2 - MD9.7 • The Data Reference Model 3.0
The GCMD Vision and Mission • Strategic Vision: • To serve as a trusted source for Earth (and space) science metadata and related services. • To contribute to scientific discovery. • Mission: • To provide for the creation of “unique”, high quality data, services, and ancillary descriptions of data. • To design enabling authoring tools (with ability to tag entries) and robust scientific search software. [The ability to “tag” data at the time of writing is key. Tagging is more difficult when this ability is not integrated through tools and is less likely to be accurate and normalized. Best to gather information when the “getting is good.”] • To assist the global community in the discovery of the scientific resources within the directory.
Does the GCMD Follow the DRM 2.0?
Data Context • Facilitates discovery of data through an approach to the categorization of data according to taxonomies. • Enables the definition of authoritative data assets within the COI (using unique identifiers). • Provides linkages to data described, thereby managing the ‘info glut’, through: • Open API links, such as OPeNDAP (an open source framework that simplifies aspects of science data networking). • Related_URL “controlled keyword” links to data. • New “use” metadata associated with detailed variables within the data sets. • ~ 20 Petabytes of data represented through the GCMD.
Ancillary Keywords Coming Soon: Orbit Types, Spectral/Frequency Domain, Launch Sites and 4 Level Taxonomy for Models.
Data Description: “How do we understand what data are available?” • Provides a means to uniformly describe data, thereby supporting its discovery, harmonization, categorization, sharing, and rapid coordination/communication. • GCMD uses the DIF “standard”. There are many advantages. • Descriptions must be identified UNIQUELY.
The Evolving DIF
1995 - Major steps in evolution through modification to a multilevel Earth science hierarchy: Category > Topic > Term > Variable > Detailed Variable. Two important trends were emerging that would affect evolution:
• FGDC and concept of “metadata” for geospatial and other data initiated.
• Web taking shape.
1997 - DIF evolves from 23 to 34 fields.
• Compatible with mandated FGDC and Dublin Core.
• Era of metadata initiated. Other “standards” emerging: ANZLIC.
• Web expanding: search interfaces abound; GCMD ready for this revolution.
1999 - DIF evolves to 35 fields in MD7 [3 added; 2 deleted].
• DIF creation date and revision history added.
• New field for paleoclimate data: paleo-temporal coverage.
• Personnel subfields modified.
• FGDC mandated, but DIF compatible with all required fields, serving users with added benefits of unique ID. Conversion tools available: FGDC <=> DIF.
2002 - DIF acquires new sibling: the SERF, allowing cross linkages between services & data.
• Redesign of query language; XML syntax; separation of presentation from business/application logic, with unexpected gifts: SOA architecture; querying multiple data sources for spatial, temporal, RDF and RDBMS databases, full-text; Struts facilitated creation of customized portals.
• LDA experiment.
2004 - MD9: ISO 19115 compliance; DIF evolves to 36 fields.
• 3 new fields added: new address; 2 data resolution subfields.
**International Interoperability Forum functions at the international level through CEOS.
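The five-level Earth science hierarchy named on this slide (Category > Topic > Term > Variable > Detailed Variable) can be pictured as a path of up to five segments. The sketch below is illustrative only: the `>` separator, the `ScienceKeyword` class, the `parse_keyword` helper, and the sample keyword are assumptions made for the example, not the DIF's actual encoding.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Level names taken from the hierarchy on the slide.
LEVELS = ("category", "topic", "term", "variable", "detailed_variable")


@dataclass(frozen=True)
class ScienceKeyword:
    """Hypothetical container for one hierarchical keyword."""
    category: str
    topic: Optional[str] = None
    term: Optional[str] = None
    variable: Optional[str] = None
    detailed_variable: Optional[str] = None

    def path(self) -> Tuple[str, ...]:
        """Return the filled-in levels, from Category downward."""
        values = (getattr(self, level) for level in LEVELS)
        return tuple(v for v in values if v is not None)


def parse_keyword(text: str, sep: str = ">") -> ScienceKeyword:
    """Normalize case and spacing, then split into at most five levels."""
    parts = [" ".join(p.split()).upper() for p in text.split(sep)]
    if any(not p for p in parts):
        raise ValueError(f"empty level in keyword: {text!r}")
    if len(parts) > len(LEVELS):
        raise ValueError(f"more than {len(LEVELS)} levels: {text!r}")
    return ScienceKeyword(*parts)


# The third level here is a made-up sample value.
kw = parse_keyword("Earth Science > Atmosphere >  Clouds")
```

Normalizing case and whitespace at parse time means two authors who type the same keyword slightly differently still produce the same path, which is the point of a controlled hierarchy.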
Data Sharing • Supports access to data, enabled by capabilities provided by both the Data Context and Data Description standardization areas, through: • 1. Ad hoc requests (such as a query of a data asset): an Open API supports ad hoc requests. Example: OPeNDAP. • 2. Exchange of data (such as those that consist of fixed, recurring transactions among parties). Examples: GeoConnections (Canada); OAI with NCAR and NOAA; data centers that use docBUILDER tools to submit metadata descriptions.
Evolution: Project Development Drivers • Maintenance reduction • Improving the Discovery of and Access to Data and Services • Ease of use, such as web site navigation. • Accuracy of Results • Content Requirements • Quality control • Integration with metadata authoring tools that allow real-time updating by data set holders/producers. • Integrated keyword and free-text search, with both as “refinements”. • Bidirectional linkages between data sets and data set services. • Providing virtual subsets of the directory • Standards: ISO 19115/19139; OpenGIS; XML; RDF. • NASA needs • Science User Working Group Recommendations; user and partner requests. • Evolving coding languages and databases [e.g., C to Perl to Java].
MD History (1994 - 2004)
[Timeline figure: MD software versions MD2, MD4, MD5, MD6, MD7, MD8 OPS, and MD8 plotted by year (94 to 04), with DIF and SERF counts (labels 10,000; 5,000; 2,000; 500; 200). Dates on the figure: 10/94, 04/96, 10/96, 04/97, 04/98, 10/99, 05/00, 08/00, 12/00, 06/01, 09/01, 11/01, 07/02, 08/03, 5/04.]
Features:
• Switched code base from Perl to Java
• XML syntax for metadata
• OPS for managing metadata
• Services Prototype Launched
• First time the coordinators were able to load their own DIFs/SERFs
• docBUILDER tools
• Upgraded Isite free-text
• JAVA Applet for geospatial search and "Advanced" search interface
• First web page to use Science keyword Topics to search
• Parent/Child display
• Related_URL field added
• HCIL and Matrix interface
• Science keyword hierarchy
• FGDC Compatibility
• Isite free-text search
• X-Windows client
• JAM client
• First use of Oracle
• DIFmacs Authoring Tool
• Switched code base from C to Perl
• Conference Calendar
• Personnel "Role" field
• Paleo_Temporal Coverage
• 1st request in FGDC
• DIFbuilder tools
• DIFmorph for translating between FGDC and DIF
• PC-based DIF Writing Tool
• Transitioned space science DIFs to NSSDC
• First web client distributed
• DIFWEB tools
MD History (MD8 and beyond)
[Timeline figure, years 03 to 07. Release dates: MD8 08/03; MD9.1-9.2 06/04; MD9.3 02/05; MD9.4 03/05; MD9.5 02/06; MD9.6 07/06. An additional date, 08/05, and the count labels 15,000; 17,000; 500; 1,000; 1,200 (DIFs and SERFs) also appear on the figure.]
Features:
MD9.1-9.2
• Location and data center hierarchy
• Increased number of characters for fields
• Spatial and temporal resolution range keywords
• docBUILDER tool personalized templates
MD9.4
• Relative Temporal coverage added to accommodate data pools
• Added two level hierarchy for Related_URL (e.g. support Get Data)
• Lucene Search engine
• Search term highlighted in records
• Refinement search by keywords or full text search
• User Comment form
• New Home Page
• Portals
• DTD for DIF and SERF
• Open API
• Spatial search with Google map
• Refinement option by data resolution for NASA portals
• Support foreign characters record display
• Subscription service for science keywords
• docBUILDER tools available for public
MD9.3
• Struts
• Compatible with ISO 19115 metadata standard
• Geographic coverage map added to record
Following the “Hype” or Listening Too Intently to the “Wilderness Request”. The Hype: Distributed Systems >Check if application is appropriate for needs. >Determine its ROI. >Offered LDA, as “Local Database Agents” - not “Latent Dirichlet Allocation”. >Be vigilant for change. >Know when to cut losses. >Scope the future. The Out-In-The-Wilderness Request: Example: Offline Authoring Tool >Check longevity to assure usefulness when development complete. >Determine the ROI in advance. >Know when to cut your losses.
Finding Ourselves Liebhold (May 2005) O’Reilly Network “The Web doesn’t have a single, comprehensive clearinghouse where you can find all of the data and domains of knowledge covering all geographies …. Instead there are hundreds of …” “Very few geospatial information scientists are working on the challenge beyond the GCMD (Global Change Master Directory), whose database holds more than 15,000 [actually this number is 17,300 +] descriptions of data sets and services covering all aspects of earth and environmental sciences.”
Internal View of GCMD Value (2007) • Controlled Keywords (& definitions) to reference and retrieve a record or sets of records. • Authoring Tools with Update Capability. {Heavy use of controlled keywords.} • Keyword & Full Text Search to Data and Services with ordered “Result Set”. (No need to build a client to query, although the option to do so is available through Open API.) • Customized Portals - virtual subsets of the directory, created through use of controlled keywords. • The “Get Data” feature, which takes the user directly to the data. • Unique data set and services entries. • Easy compliance to related standards through XML. • Results available through Google. • Well-designed home page, with access to full set of services provided.
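One way to picture why the controlled keywords listed above matter for search and for portals: free-text tags that differ only in case or spacing retrieve nothing unless they are normalized to a single controlled term. The following is a minimal sketch under stated assumptions; the tiny vocabulary, the record layout, and the `normalize` and `portal` functions are hypothetical and do not come from the GCMD software.

```python
# Hypothetical controlled vocabulary of instrument keywords.
CONTROLLED = frozenset({"AVHRR", "MODIS"})


def normalize(tag, vocabulary=CONTROLLED):
    """Map a free-text tag to its controlled form, or None if unknown."""
    candidate = " ".join(tag.split()).upper()
    return candidate if candidate in vocabulary else None


def portal(records, keyword, vocabulary=CONTROLLED):
    """Return entry IDs of records tagged with the controlled keyword.

    This models a portal as a virtual subset of the directory:
    nothing is copied, the subset is just a keyword filter.
    """
    selected = []
    for record in records:
        tags = {normalize(t, vocabulary) for t in record["instruments"]}
        if keyword in tags:
            selected.append(record["entry_id"])
    return selected


# Made-up records whose authors typed the instrument name differently.
records = [
    {"entry_id": "A", "instruments": [" avhrr "]},
    {"entry_id": "B", "instruments": ["Modis"]},
    {"entry_id": "C", "instruments": ["unknown sensor"]},
]
avhrr_portal = portal(records, "AVHRR")
```

Tags that do not normalize to a controlled term are dropped here for brevity; an authoring tool would more plausibly reject them at the time of writing, when the author is still available to correct them.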
MD Software Version 9.7 • Support for 2 additional levels of Science Keyword hierarchy. • Improved Features for docBUILDER Authoring Tools • Support for writing Platform and Instrument descriptions & new keywords • Support for “GET DATA” tab. • “Text Only” display for 508 compliance (in docBUILDER). • Improved multimedia sample. • Improved spatial coverage selection. • Ability to change entry identifier. • Reference Guide for use of international characters and symbols. • RSS Feed, in addition to Keyword Subscription Service, to signal new directory entries. • Upgrades to Java 1.5 and Tomcat 5. • Location Keywords & “Chronostratigraphic Units” recreated as true taxonomies.
MD Software Version 9.7 • Keyword Functionality Upgrade. • Functionality “abstracted” to use a SKOS data model for navigating arbitrary taxonomies. • Integrated SKOS query into query language. • Backed by Berkeley DB XML for querying. • Example: [skos:Parameters=‘EARTH SCIENCE|ATMOSPHERE’] AND [skos:Instrument=‘AVHRR’] • New Platform/Instrument Display Reflects Taxonomic Changes. • Support for loading, extracting, querying. • Support for navigating through new taxonomies. • Support for full text search. • Support for creating these descriptions in docBUILDER.
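The example query above combines two bracketed keyword conditions with AND, and the `|` in the Parameters value appears to separate hierarchy levels. The following is only a sketch of how such a query could be evaluated in memory. It is not the GCMD query engine (which the slide says is backed by Berkeley DB XML), and the regular expression, the record layout, and the prefix-matching rule are all assumptions.

```python
import re

# One bracketed condition: [skos:Field='VALUE'], straight or curly quotes.
CONDITION = re.compile(r"\[skos:(\w+)=['‘](.*?)['’]\]")


def parse_query(query):
    """Split an AND-only query into (field, path) pairs."""
    conditions = []
    for clause in query.split(" AND "):
        match = CONDITION.fullmatch(clause.strip())
        if match is None:
            raise ValueError(f"unrecognized clause: {clause!r}")
        field, value = match.groups()
        conditions.append((field, tuple(value.split("|"))))
    return conditions


def matches(record, conditions):
    """True if every condition is a prefix of one of the record's paths."""
    for field, wanted in conditions:
        paths = record.get(field, [])
        if not any(tuple(p[:len(wanted)]) == wanted for p in paths):
            return False
    return True


query = "[skos:Parameters='EARTH SCIENCE|ATMOSPHERE'] AND [skos:Instrument='AVHRR']"
conditions = parse_query(query)

# Made-up record: each field holds a list of keyword paths.
record = {
    "Parameters": [("EARTH SCIENCE", "ATMOSPHERE", "CLOUDS")],
    "Instrument": [("AVHRR",)],
}
found = matches(record, conditions)
```

Prefix matching is the design choice that makes a taxonomy useful for search: a query at the ATMOSPHERE level also retrieves records tagged with any narrower keyword beneath it.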
Page 1 of SERF
Page 2 of SERF
Page 1 of DIF
Page 2 of DIF
MD Software Version 9.8 • docBUILDER Enhancements • Option for public vs private view. • Automated reminders to metadata authors. • Initial testbed for multilingual capabilities using SKOS. • Variable Keyword extensions for “use” metadata. • Client to ECHO for metadata sharing using web services.
The Data Reference Model 3.0, Web 3.0 & SOAs [Figure 3-1: DRM Standardization Areas. Diagram labels: Data & Information & Knowledge Repository; dynamic; static; Data Resource Awareness Agent; Language; Logic.]