Adaptable and Incremental Metadata Capture in e-Science

Scott Jensen, Data to Insight Center, Indiana University. University of Chicago – March 2, 2012.


  1. Adaptable and Incremental Metadata Capture in e-Science • Scott Jensen, Data to Insight Center, Indiana University • University of Chicago – March 2, 2012

  2. What is Metadata? • Data About Data • “structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage any other resource” (National Information Standards Organization) • Alternatively, metadata answers the who, what, when, and why questions about a dataset (ISO 19115 standard) • Where (spatial metadata) • How (configuration)

  3. Why Does Metadata Matter? • Data Reuse • “Metadata is key to being able to share results” (U.K. e-Science Core Programme) • “A significant need exists in many disciplines for long-term, distributed, and stable data and metadata repositories” (NSF Blue-Ribbon Advisory Panel on Cyberinfrastructure) • “Preservation of digital data is arguably a ‘grand challenge’ of the information age” (Francine Berman) • Trusting and Understanding Data • The ability to understand and evaluate the quality of data is key to reuse after discovery. If researchers have too much uncertainty about the data, they will not use it. (Ann Zimmermann) • Data that is Costly and Irreplaceable • Can other data be regenerated? • Data Management Plans

  4. Metadata Capture • Historically done at the end of the data lifecycle • Research is completed • Data and results tarred up as a dataset • Metadata at the dataset level • Inserts are full metadata documents • Metadata often captured at the collection level • Generalized and not specific to each data product • Collection-level metadata for discovery (e.g., WCS) • Detailed metadata stored as an object • Data search is coarse • Based on keywords or text search • Spatial bounding box and temporal range • Not specific to a data product, details not searchable • Sometimes just browse capabilities

  5. How Much Metadata to Capture? • Cost / Benefit Trade-offs • Spectrum from more structure to less structure: Structured Metadata Schemata (FGDC, EML) • Core Metadata • Flat Schemata (unqualified DC) • Name / Value Pairs • More structure: Richer Metadata to Search Over • Less structure: Lower Barriers to Entry

  6. Research Problem • Early Capture of Ephemeral Metadata • Incremental, not at the end of the lifecycle • Incremental capture must be efficient • Deluge, Tsunami, Bonanza • Requires automation • Detailed metadata for discovery • Scalability • Variable and Dynamic Data • Must accommodate new metadata • Accommodate different domains and schemata

  7. Research Focus • Identified the concept-based character of scientific metadata schemas that differentiates them as a class from other XML schemas. • Capture metadata incrementally and efficiently early in the scientific process • Capture detailed metadata without full update • Reconstruct metadata on-the-fly after incremental capture • Automated metadata extraction from data objects • Incremental capture must be efficient and scalable • Architecture must generalize across schemas and domains • Detailed metadata must be discoverable • Extensible without schema modifications

  8. Metadata Schemas - a Bag of Concepts • Spatial schemas (FGDC, ISO 19115): Identification, Constraints, Data Quality, Spatial, Reference System, Distribution, Metadata Extension, and more … • Astronomy: Identity, Curation, Content, Coverage, Spatial, Temporal, Data Quality • Ecology (EML): General, Geographic, Temporal, Taxonomic, Methods, Data table metadata • DDI (version 2.0): Description, Study description, Physical file description, Logical description (variables), other

  9. Concepts have Complex Structure • Schemata are often composed of complex concepts (compound elements) • “Compound elements represent higher-level concepts that cannot be represented by an individual data element” • Increased structure → Increased reusability • Flat schema → difficulty harvesting • Harvesting Dublin Core led to incomplete and inconsistent data - California Digital Libraries • Similar issues at the National Science Digital Library made it difficult to build services on harvested Dublin Core. • Performance bottleneck when converting XML to name/value pairs

  10. Concepts & Incremental Metadata Capture • As an experiment runs, adding a concept does not require editing the existing metadata. • Ephemeral metadata such as workflow notifications can be captured and added to a detailed metadata document. • Metadata can be harvested from files and added as queryable metadata at different levels of the hierarchy.

  11. Partitioning a Schema on Concepts • Global ordering of concept elements and higher levels • Concept requirements: recursion is within a concept; elements whose cardinality can exceed one are concepts or are contained in concepts • Beneficial when CRUD operations are at the concept level or higher • Incremental ingest – no need to modify existing concepts • Efficient reconstruction based on concept-sized fragments
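The reconstruction idea on this slide can be illustrated with a minimal Python sketch. It assumes each concept is kept as a verbatim XML fragment tagged with its position in the global ordering; the names `ConceptFragment`, `reconstruct`, and the `metadata` wrapper element are hypothetical and not taken from XMC Cat. It also flattens everything to one level, whereas the slide's global ordering covers higher levels of the hierarchy as well.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass(frozen=True)
class ConceptFragment:
    """One concept stored verbatim (hypothetical record layout)."""
    object_id: str
    global_order: int  # position of the concept in the schema's global ordering
    xml: str           # the concept-sized XML fragment


def reconstruct(fragments, root_tag="metadata"):
    """Rebuild a document by appending fragments in global order."""
    root = ET.Element(root_tag)  # wrapper element name is an assumption
    for frag in sorted(fragments, key=lambda f: f.global_order):
        root.append(ET.fromstring(frag.xml))
    return ET.tostring(root, encoding="unicode")


# Concepts arrive incrementally and out of order; nothing already stored is edited.
store = []
store.append(ConceptFragment("obj-1", 5, "<descript><purpose>research</purpose></descript>"))
store.append(ConceptFragment("obj-1", 2, "<citation><title>ADAS 10km</title></citation>"))

doc = reconstruct(store)
```

Because each insert only appends a fragment, an ingest never has to read or rewrite the concepts that are already stored; ordering is resolved only when the document is rebuilt.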

  12. Shredding XML Concepts • Metadata documents are “shredded” into concepts and then concepts are shredded into elements using XSLT. • Once CLOBs are stored, metadata cannot be lost. • CLOBs are indexed on object ID and their global ordering. • Shredded metadata is only a search index, allowing for strong typing – even if types do not match XML. • Diagram labels: Source Metadata Document, Concept, ID, Name, Global Order, Source, CLOB, Shredded Concept, Sub-concept *, Concept Element *, Typed Value, Fast Response, Detailed Search
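As a rough illustration of shredding a concept into searchable elements, here is a minimal Python sketch. The slide states that the actual shredding is done with XSLT; this sketch swaps in `xml.etree.ElementTree` purely for readability. The namespace URIs come from the LEAD example later in the deck, while `shred`, `local_name`, and the row layout are hypothetical.

```python
import xml.etree.ElementTree as ET

FGDC = "http://schemas.leadproject.org/2007/01/lms/fgdc"
LEAD = "http://schemas.leadproject.org/2007/01/lms/lead"

CITATION = f"""<lead:citation xmlns:lead="{LEAD}" xmlns:fgdc="{FGDC}">
  <fgdc:title>LEAD CONUS ADAS Catalog/CONUS ADAS 10km</fgdc:title>
  <fgdc:pubinfo>
    <fgdc:pubplace>unknown</fgdc:pubplace>
    <fgdc:publish>IU/GEOG</fgdc:publish>
  </fgdc:pubinfo>
</lead:citation>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def shred(concept_xml):
    """Return the untouched CLOB plus (concept, sub-concept path, element, value) rows."""
    root = ET.fromstring(concept_xml)
    concept = local_name(root.tag)
    rows = []

    def walk(node, path):
        for child in node:
            name = local_name(child.tag)
            if len(child):  # has children, so treat it as a sub-concept
                walk(child, path + (name,))
            else:           # leaf, so treat it as an element with a value
                rows.append((concept, "/".join(path), name, (child.text or "").strip()))

    walk(root, ())
    return concept_xml, rows


clob, rows = shred(CITATION)
```

The CLOB is returned unchanged, which mirrors the slide's point that the shredded rows are only a search index and the stored concept remains the authoritative copy.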

  13. Ingest & Search Using Incremental Capture • Diagram: the XMC Cat database holds Concept CLOBs and Shredded Concepts • Ingest labels: Determine Schema, Validate new concept, Shred • Search labels: Build Query (search based on concepts), query shredded metadata for matching objects, object IDs, query for CLOBs based on IDs, Build Result
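The two-step search on this slide (find matching object IDs in the shredded metadata, then fetch CLOBs by ID) might be sketched as follows with Python's built-in `sqlite3`. The table names, column names, and the `ingest` and `search` helpers are assumptions made for illustration, not the XMC Cat schema, and the validation step is omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept_clob (object_id TEXT, global_order INTEGER, clob TEXT);
CREATE TABLE shredded (object_id TEXT, concept TEXT, element TEXT, value TEXT);
""")


def ingest(object_id, global_order, concept, clob, elements):
    """Store the concept CLOB, then its shredded elements (no validation here)."""
    conn.execute("INSERT INTO concept_clob VALUES (?, ?, ?)",
                 (object_id, global_order, clob))
    conn.executemany(
        "INSERT INTO shredded VALUES (?, ?, ?, ?)",
        [(object_id, concept, name, value) for name, value in elements.items()],
    )


def search(concept, element, value):
    """Step 1: match object IDs on shredded rows. Step 2: fetch CLOBs by ID."""
    ids = [row[0] for row in conn.execute(
        "SELECT DISTINCT object_id FROM shredded "
        "WHERE concept = ? AND element = ? AND value = ?",
        (concept, element, value))]
    results = {}
    for oid in ids:
        clobs = conn.execute(
            "SELECT clob FROM concept_clob WHERE object_id = ? ORDER BY global_order",
            (oid,))
        results[oid] = [row[0] for row in clobs]
    return results


ingest("obj-1", 5, "descript", "<descript>B</descript>", {"purpose": "research"})
ingest("obj-1", 2, "citation", "<citation>A</citation>", {"title": "ADAS 10km"})
ingest("obj-2", 2, "citation", "<citation>C</citation>", {"title": "Other"})

hits = search("citation", "title", "ADAS 10km")
```

The result for a matching object contains all of its concept CLOBs in global order, so the caller can assemble the full metadata document without touching the shredded rows again.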

  14. Exploded Datasets (Describing data in a broader context) • Incremental Capture During a Workflow or Experiment • Not a tarball at the end of a project • Automated capture during an experiment • Data objects are generated throughout a workflow • Experiment data hierarchies vary by domain • Provides scientists access to incremental metadata

  15. Automated Metadata Capture

  16. Domain Schema → Generalized Architecture

  17. Adaptable Metadata Store • Shares characteristics of clinical genomics databases and relational RDF stores such as Jena. • Definition of concepts is based on schema structure. • Dynamic concepts can be defined based on metadata content instead of structure. • Every concept is stored as a CLOB. • Concepts can optionally be parsed into concepts, sub-concepts, and elements.

  18. A Generic Structure for Searching Domain Concepts • Complex Domain-Specific Concepts are mapped to Generalized Concepts, Sub-Concepts and Elements • Metadata Schema : Concept + • Concept : Sub-Concept *, Atomic Element * • Sub-Concept : Sub-Concept *, Atomic Element * • Atomic Element : date | time | timestamp | integer | float | spatial | string
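The grammar on this slide maps naturally onto a small recursive data structure. The following Python sketch is one possible reading; the class names, the `count_elements` helper, and the choice to leave spatial values as plain strings are assumptions for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date, datetime, time
from typing import Union

# Atomic Element : date | time | timestamp | integer | float | spatial | string
# (spatial values are left as plain strings in this sketch)
AtomicValue = Union[date, time, datetime, int, float, str]


@dataclass
class AtomicElement:
    name: str
    value: AtomicValue


@dataclass
class SubConcept:
    """Sub-Concept : Sub-Concept *, Atomic Element *"""
    name: str
    sub_concepts: list[SubConcept] = field(default_factory=list)
    elements: list[AtomicElement] = field(default_factory=list)


@dataclass
class Concept(SubConcept):
    """Concept : Sub-Concept *, Atomic Element * (same shape as a sub-concept)."""


@dataclass
class MetadataSchema:
    """Metadata Schema : Concept + (at least one concept)."""
    concepts: list[Concept]

    def __post_init__(self):
        if not self.concepts:
            raise ValueError("a metadata schema needs at least one concept")


def count_elements(node: SubConcept) -> int:
    """Count atomic elements in a concept, including nested sub-concepts."""
    return len(node.elements) + sum(count_elements(s) for s in node.sub_concepts)


citation = Concept(
    "citation",
    sub_concepts=[SubConcept("pubInfo", elements=[
        AtomicElement("pubPlace", "unknown"),
        AtomicElement("publisher", "IU/GEOG"),
    ])],
    elements=[AtomicElement("title", "LEAD CONUS ADAS Catalog/CONUS ADAS 10km")],
)
schema = MetadataSchema([citation])
```

Because every domain concept reduces to the same three node kinds, code that searches or indexes the structure does not need to know which domain schema the concept came from.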

  19. Shredding Domain Metadata • Annotated on the slide: Citation Concept, Description Concept, 2nd Theme Keyword Concept
      <lead:LEADresource xmlns:lead="http://schemas.leadproject.org/2007/01/lms/lead"
                         xmlns:le="http://schemas.leadproject.org/2007/01/lms/leadelements"
                         xmlns:fgdc="http://schemas.leadproject.org/2007/01/lms/fgdc">
        <le:resourceID>urn:uuid:97afbef7-58c8-4143-9b05-f0b9d82d27ef</le:resourceID>
        <lead:data>
          <lead:idinfo>
            <lead:citation>
              <fgdc:origin>/C=US/O=National Center for Supercomputing Applications/CN=Anne Wilson</fgdc:origin>
              <fgdc:pubdate>Unknown</fgdc:pubdate>
              <fgdc:title>LEAD CONUS ADAS Catalog/CONUS ADAS 10km</fgdc:title>
              <fgdc:pubinfo>
                <fgdc:pubplace>unknown</fgdc:pubplace>
                <fgdc:publish>IU/GEOG</fgdc:publish>
              </fgdc:pubinfo>
            </lead:citation>
            <lead:descript>
              <fgdc:abstract>Real-time meteorological data assimilations with CONUS coverage at 10km resolution produced hourly by CAPS at OU. The List of contents provides the OPeNDAP URLs for the files within the collection. They have a form: http://lead.unidata.ucar.edu/cgi-bin/nph-dods/test-data/ADAS/OU/ad{date}.nc where {date} has the form: YYYYMMDDHH and indicates the hour for which the data assimilation is valid. </fgdc:abstract>
              <fgdc:purpose>Scientific research and education</fgdc:purpose>
            </lead:descript>
            . . .
            <lead:keywords>
              <fgdc:theme>
                <fgdc:themekt>DatasetTypes.lead.org</fgdc:themekt>
                <fgdc:themekey>ADAS</fgdc:themekey>
              </fgdc:theme>
              <fgdc:theme>
                <fgdc:themekt>CF-1.0</fgdc:themekt>
                <fgdc:themekey>projection_x_coordinate</fgdc:themekey>
                <fgdc:themekey>projection_y_coordinate</fgdc:themekey>
                <fgdc:themekey>height</fgdc:themekey>
                <fgdc:themekey>geopotential_height</fgdc:themekey>

  20. Shredded Citation Metadata • All Shredded Metadata Conforms to the Same Schema • Annotated on the slide: CLOB for Citation Concept, pubInfo Sub-concept
      <objectClobProperty myPos="5" (namespaces omitted here) >
        <objectClob>
          <lead:citation xmlns:lead="http://schemas.leadproject.org/2007/01/lms/lead"
                         xmlns="http://schemas.leadproject.org/2007/01/lms/lead"
                         xmlns:fgdc="http://schemas.leadproject.org/2007/01/lms/fgdc"
                         xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
            <fgdc:origin>/C=US/O=National Center for Supercomputing Applications/CN=Anne Wilson</fgdc:origin>
            <fgdc:pubdate>Unknown</fgdc:pubdate>
            <fgdc:title>LEAD CONUS ADAS Catalog/CONUS ADAS 10km</fgdc:title>
            <fgdc:pubinfo>
              <fgdc:pubplace>unknown</fgdc:pubplace>
              <fgdc:publish>IU/GEOG</fgdc:publish>
            </fgdc:pubinfo>
          </lead:citation>
        </objectClob>
        <objectProperty myName="citation" mySource="LEAD">
          <objectProperty myName="pubInfo" mySource="LEAD">
            <objectElement myName="pubPlace" mySource="LEAD" myVal="unknown"/>
            <objectElement myName="publisher" mySource="LEAD" myVal="IU/GEOG"/>
          </objectProperty>
          <objectElement myName="originator" mySource="LEAD" myVal="/C=US/O=National Center for Supercomputing Applications/CN=Anne Wilson"/>
          <objectElement myName="pubDate" mySource="LEAD" myVal="Unknown"/>
          <objectElement myName="pubDateTime" mySource="LEAD" myVal="Unknown"/>
          <objectElement myName="title" mySource="LEAD" myVal="LEAD CONUS ADAS Catalog/CONUS ADAS 10km"/>
        </objectProperty>
      </objectClobProperty>

  21. Dynamic Concepts Based on Content • CLOB parsed out and saved based on global order (schema structure) • Concept defined based on “entity” label and source • Sub-concept and elements defined based on “attribute” label and source • New domain concepts without schema changes • Concept CLOBs are always saved based on global order – even if the concept is not defined. • To be queryable, new concepts and elements must be defined, but no schema change is required.
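A minimal Python sketch of the behaviour described on this slide: the CLOB is always saved, and content becomes queryable only once a concept has been defined for its label and source. The registry layout, the function names, and the sample labels (`model`, `ARPS`, `gridSpacing`) are hypothetical.

```python
registry = {}  # (entity label, source) -> set of attribute labels defined as elements
clobs = []     # (object_id, global_order, clob): always written
index = []     # (object_id, concept label, element label, value): the search index


def define_concept(label, source, attributes):
    """Make a dynamic concept queryable; no table or schema change is involved."""
    registry[(label, source)] = set(attributes)


def ingest(object_id, global_order, label, source, clob, attributes):
    """Save the CLOB unconditionally; index attributes only for defined concepts."""
    clobs.append((object_id, global_order, clob))
    known = registry.get((label, source))
    if known is None:
        return 0  # stored, but not yet searchable
    rows = [(object_id, label, name, value)
            for name, value in attributes.items() if name in known]
    index.extend(rows)
    return len(rows)


# Before the concept is defined, the CLOB is kept but nothing is indexed.
before = ingest("obj-1", 12, "model", "ARPS", "<detailed>...</detailed>",
                {"gridSpacing": "10km"})

# After defining it, later ingests of the same label and source are indexed.
define_concept("model", "ARPS", ["gridSpacing"])
after = ingest("obj-2", 12, "model", "ARPS", "<detailed>...</detailed>",
               {"gridSpacing": "4km", "comment": "not defined as an element"})
```

In this sketch, defining a concept only adds an entry to a lookup table, which is one way to read the slide's claim that new domain concepts need no schema changes.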

  22. XMC Cat Builder: Concepts

  23. Deployed in Diverse Domains • Linked Environments for Atmospheric Discovery (LEAD) • NSF funded science gateway • Metadata describing 500TB of data, intermediate results, and workflow output • Data objects each described by up to 2,202 elements • Individual workspaces of up to 15,000 objects • One Degree Imager (ODI) WIYN Consortium • Component in the data subsystem • Data-driven workflows • SEAD Project • Sustainability science • Provide search capability over archived use metadata

  24. Comparing to a Native XML Database • Chart: Concurrent Insert/Query Execution Time in Milliseconds • Except for queries based on object IDs, XMC Cat at 8X the base workload performs better than Berkeley XML at 1/10th of the base workload. • XMC Cat experiment inserts include validation that is not reflected in the Berkeley results; without validation, XMC Cat at 8X the workload takes 2,477 ms. • Insert and query workloads are multiples of the projected LEAD workload, based on the LEAD technical report and the insert/query ratios of the TPC-E benchmark. • Scott Jensen, Devarshi Ghoshal, and Beth Plale, Evaluation of Two XML Storage Approaches for Scientific Metadata, Indiana University CS Technical Report TR698, October 2011.

  25. Performance Compared to Inlining Scott Jensen and Beth Plale, Using Characteristics of Computational Science Schemas for Workflow Metadata Management, In Proceedings of the 2008 IEEE Congress on Services, IEEE 2008 Second International Workshop on Scientific Workflows (SWF 2008), Hawaii, July 2008.

  26. Eventual Consistency • Browse versus Search Metadata • Scott Jensen and Beth Plale, Trading Consistency for Scalability in Scientific Metadata, In Proceedings of the 2010 IEEE International Conference on e-Science, Brisbane, Australia, December 2010.

  27. Bounds on Eventual Consistency • EC_t = W_t + T_t + R_t + S_t + I_t • The times above are averages for fetching a batch of 100 concepts (T_t and R_t) and then processing each concept (S_t and I_t). • Total wait time is dominated by W_t. • If the distributed shredders keep pace with the ingest rate, the frequency of the shredders fetching determines W_t.
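The bound on this slide is a plain sum of five time components, which a short Python sketch can make concrete. Everything numeric here is made up for illustration, and the `expected_wait` helper rests on an added assumption that is not in the slides: shredders poll at a fixed interval and concepts arrive uniformly within it, so the average wait is half the interval.

```python
def expected_wait(poll_interval_s: float) -> float:
    """Average wait before the next fetch, assuming uniform arrivals
    within a fixed polling interval (an assumption of this sketch)."""
    return poll_interval_s / 2.0


def ec_time(w: float, t: float, r: float, s: float, i: float) -> float:
    """Eventual-consistency time as the sum of the five components on the slide."""
    return w + t + r + s + i


# Illustrative values only (seconds); none of these are measurements from the talk.
w = expected_wait(10.0)
total = ec_time(w, t=0.25, r=0.25, s=0.5, i=1.0)
wait_share = w / total
```

With these illustrative values the wait term accounts for most of the total, which is the situation the slide describes when it says the bound is dominated by the wait time.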

  28. Evaluation of Eventual Consistency • Strict consistency takes 42% longer at 6X the base workload • Eventual consistency scales higher • Strict consistency scaled to 8X the projected workload • Mostly due to deferred shredding • Using two eventually consistent shredders on a separate server

  29. Domain-Adaptable Metadata Search • Metadata search criteria are often limited to keywords or text, a spatial bounding box, and temporal bounds. • If rich metadata is captured as a BLOB, it is available as use metadata, but not as discovery metadata. • Instead, use domain concepts and dynamic concepts to define search criteria. • With a generic architecture for shredded metadata, search criteria can include any shredded domain metadata.
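One way to picture search criteria defined over arbitrary shredded concepts, with typed comparisons, is the Python sketch below. The `Criterion` class, the operator table, and the sample concepts and elements (`grid`, `spacing_km`, `temporal`, `start`) are hypothetical; the in-memory dictionary stands in for the shredded metadata tables.

```python
import operator
from dataclasses import dataclass
from datetime import date

OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt,
       "<=": operator.le, ">=": operator.ge}


@dataclass(frozen=True)
class Criterion:
    concept: str
    element: str
    op: str
    value: object  # typed value: float, date, str, ...


def matches(row, criterion):
    """A shredded row matches when names, value type, and comparison all agree."""
    concept, element, value = row
    return (concept == criterion.concept
            and element == criterion.element
            and type(value) is type(criterion.value)
            and OPS[criterion.op](value, criterion.value))


def search(index, criteria):
    """Return IDs of objects for which every criterion matches some shredded row."""
    return sorted(oid for oid, rows in index.items()
                  if all(any(matches(row, c) for row in rows) for c in criteria))


# Hypothetical shredded index: object ID -> (concept, element, typed value) rows.
index = {
    "obj-1": [("grid", "spacing_km", 10.0), ("temporal", "start", date(2011, 10, 1))],
    "obj-2": [("grid", "spacing_km", 25.0)],
}

fine_grids = search(index, [Criterion("grid", "spacing_km", "<=", 10.0)])
recent_fine = search(index, [
    Criterion("grid", "spacing_km", "<=", 10.0),
    Criterion("temporal", "start", ">=", date(2011, 1, 1)),
])
wrong_type = search(index, [Criterion("grid", "spacing_km", "=", "10")])
```

Because the comparison runs on typed values rather than text, a numeric criterion such as a grid spacing bound behaves as a range test instead of a string match.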

  30. Dynamic Search Definition

  31. Search Adjusts to Domain Concepts When the target is selected: all concepts are listed as search options – grouped by their categories When a concept is selected, all of its sub-concepts and elements are listed as options

  32. Strongly Typed Search Criteria

  33. Current Work • Diagram labels: Simulation Forecast, Census Data, Sensor Data, Ecological Data, Satellite Data • Handle hierarchies based on multiple schemas • Experiments bringing together data from multiple sources described by different standards. • Data described by different metadata standards can be combined in a single dataset. • Metadata can be queried based on different schemas. • Faceted search • Added to XMC Cat web service. • Can alternate between facets and details. • Unified criteria for multiple schemas.

  34. Thank You! Scott Jensen scjensen@cs.indiana.edu Thanks also to: - The NSF-funded Linked Environments for Atmospheric Discovery (LEAD) project - Data to Insight Center
