Emergent Semantics: Towards Self-Organizing Scientific Metadata
Bill Howe, David Maier
Oregon Health and Science University
“The file ‘anim-sal_estuary_7.gif’ is a data product derived from the output of the ELCIRC simulation program run for the period January 8-15, 2002. The image shows salinity (practical salinity units) in the estuary region of the domain. It’s actually an animation, where each frame is a horizontal slice 7 meters below the mean sea level. There are 96 frames, each representing 15 minutes.”

program = ELCIRC
simStart = 1/8/02
simEnd = 1/15/02
region = estuary
variable = salinity
timesteps = 96
plottype = animation
Environmental Observation and Forecasting System
• Daily forecasts and 1000s of ad hoc hindcasts
• One simulation involves ~20k files:
  • inputs, parameters, outputs, derived data products
• This scale mandates:
  • query access rather than simple filesystem browsing
  • automation everywhere
Tasks
• Collect metadata.
• Organize collected metadata.
• Publish organized metadata for querying.
Challenges
• Metadata is scattered
  • in file paths
  • within file headers
  • in “nearby” files
• Metadata requirements change frequently
  • new simulation codes
  • new data product types
  • new users, internal and external

[Figure: the path …/anim-sal_estuary_7.gif annotated with Type = “Animation”, Variable = “Salinity”, Region = “Estuary”, Depth = “7”]
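
As an illustration of metadata hidden in file paths, here is a minimal sketch that pulls descriptors out of a file name. The naming convention `<plottype>-<variable>_<region>_<depth>.<extension>` is an assumption inferred from the single example on the slide, and `descriptors_from_path` is a hypothetical helper, not part of the authors' system.

```python
import re
from pathlib import PurePosixPath

# Assumed convention, inferred from "anim-sal_estuary_7.gif" only:
#   <plottype>-<variable>_<region>_<depth>.<extension>
FILENAME_PATTERN = re.compile(
    r"^(?P<plottype>[a-z]+)-(?P<variable>[a-z]+)"
    r"_(?P<region>[a-z]+)_(?P<depth>\d+)\.\w+$"
)


def descriptors_from_path(path):
    """Return the descriptors encoded in a file name, or {} if it does not match."""
    match = FILENAME_PATTERN.match(PurePosixPath(path).name)
    return match.groupdict() if match else {}


print(descriptors_from_path("anim-sal_estuary_7.gif"))
# {'plottype': 'anim', 'variable': 'sal', 'region': 'estuary', 'depth': '7'}
```

The extracted values are still abbreviations ("anim", "sal"); mapping them to full names such as "Animation" or "Salinity" would need a lookup table that the slides do not show.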
“Obvious” Solution
• Data Managers work with Domain Experts
  • design a relational schema, load data, test, repeat
• But:
  • Large up-front cost to DB design
  • Slow return on investment
  • Use cases unknown
  • Significant change is anticipated
  • DB languages/APIs not necessarily within scientists’ skill set

[Figure labels: file, data product, region]
Alternative Solution: Steps 1-3
• Harvest metadata via simple collection scripts written by the domain experts
• Use RDF as a schema-independent metadata representation
• Use RDBMS technology for storage and management

[Figure: filesystem, 1. collection scripts, 2. rdf, 3. db]
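
Steps 1-3 can be sketched end to end in a few lines. This is a minimal illustration, not the authors' implementation: the `statements` table with `subject`, `property`, and `object` columns and the `property:` prefix follow the query shown later in the deck, while SQLite, the `collect` and `load` helpers, and the hard-coded file description are assumptions made for the example.

```python
import sqlite3


def collect(files):
    """Step 1-2: yield (subject, property, object) triples for each described file."""
    for name, descriptors in files.items():
        for prop, value in descriptors.items():
            yield (name, f"property:{prop}", str(value))


def load(triples):
    """Step 3: store the triples in a generic three-column table (in memory here)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE statements (subject TEXT, property TEXT, object TEXT)")
    db.executemany("INSERT INTO statements VALUES (?, ?, ?)", triples)
    return db


# Hypothetical output of a collection script for one file.
files = {
    "anim-sal_estuary_7.gif": {
        "program": "ELCIRC",
        "region": "estuary",
        "variable": "salinity",
        "plottype": "animation",
        "timesteps": 96,
    }
}

db = load(collect(files))
count = db.execute("SELECT COUNT(*) FROM statements").fetchone()[0]
print(count)  # 5
```

Adding a new descriptor only means yielding one more triple; the table definition does not change.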
A Narrower Interface

[Figure: with a rich schema, the interface to the filesystem involves SQL statements, database APIs, load strategies, and data formats/models; with a generic schema, it involves only collection scripts and RDF triples]
Generic RDF Schema
Is Generic RDF Good Enough?

“Find files with region, plottype, and variable descriptors”

    SELECT r.subject AS file,
           r.object  AS region,
           p.object  AS plottype,
           v.object  AS variable
    FROM statements r, statements p, statements v
    WHERE r.subject = p.subject
      AND p.subject = v.subject
      AND r.property = 'property:region'
      AND p.property = 'property:plottype'
      AND v.property = 'property:variable'

A three-way self-join!
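
To see the query at work, the following sketch runs it against a tiny in-memory SQLite table. The rows are made-up sample data and SQLite is an assumption chosen for convenience; only the table layout and the query itself come from the slide.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE statements (subject TEXT, property TEXT, object TEXT)")
db.executemany(
    "INSERT INTO statements VALUES (?, ?, ?)",
    [
        # Hypothetical sample triples.
        ("animsal.gif", "property:region", "estuary"),
        ("animsal.gif", "property:plottype", "animation"),
        ("animsal.gif", "property:variable", "salinity"),
        ("isofar.gif", "property:region", "far"),
        ("isofar.gif", "property:plottype", "isoline"),  # no variable descriptor
    ],
)

QUERY = """
SELECT r.subject AS file, r.object AS region,
       p.object AS plottype, v.object AS variable
FROM statements r, statements p, statements v
WHERE r.subject = p.subject
  AND p.subject = v.subject
  AND r.property = 'property:region'
  AND p.property = 'property:plottype'
  AND v.property = 'property:variable'
"""

rows = db.execute(QUERY).fetchall()
print(rows)  # [('animsal.gif', 'estuary', 'animation', 'salinity')]
```

Each additional descriptor in the question adds another alias of `statements` to the join, which is the cost the slide points at. Files lacking any one of the requested descriptors (here `isofar.gif`) drop out of the result.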
Decomposed Data
• So we can query the RDF directly, but…
• …no grouping structures to aid query formulation and processing.
• Automatically infer groupings from the RDF data, observing that related files often share signatures.
• Let users impose groupings using a web interface (like views)

    ...
    <isofar.gif, type, isoline>,
    <isofar.gif, region, far>,
    <animsal.gif, timesteps, 10>,
    <animsal.gif, var, salt>,
    ...

[Figure: filesystem and db; inferred groupings labeled plot and animation]
Alternative Solution: Steps 4-6
• Partition descriptors into equivalence classes based on file signatures
• Expose signatures via the web to facilitate browsing and querying
• Recompute signature extents as new metadata is integrated

[Figure: db, 4. partition data, 5. publish to the web (website), 6. query and browse via profiles]
Signature: the set of properties defined for a particular file
Signatures
• A file’s signature is just the set of properties used to describe it.
• If signatures were fixed, we might derive a relational schema from them. Instead, we need to respond to changes.

[Figure: 4. partition data: from the db, find signatures, then compute signature extents]
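
Finding signatures and computing their extents can be sketched as two small grouping passes over the triples. The function names are hypothetical, and the third file (`isonear.gif`) is invented so that one signature has more than one member; the other triples are the ones shown on the "Decomposed Data" slide.

```python
from collections import defaultdict


def find_signatures(triples):
    """Map each file to its signature: the set of properties used to describe it."""
    props = defaultdict(set)
    for subject, prop, _value in triples:
        props[subject].add(prop)
    return {subject: frozenset(p) for subject, p in props.items()}


def signature_extents(signatures):
    """Map each signature to its extent: the set of files that share it."""
    extents = defaultdict(set)
    for subject, signature in signatures.items():
        extents[signature].add(subject)
    return dict(extents)


triples = [
    ("isofar.gif", "type", "isoline"),
    ("isofar.gif", "region", "far"),
    ("animsal.gif", "timesteps", "10"),
    ("animsal.gif", "var", "salt"),
    ("isonear.gif", "type", "isoline"),  # hypothetical extra file
    ("isonear.gif", "region", "near"),   # hypothetical extra file
]

extents = signature_extents(find_signatures(triples))
for signature, files in extents.items():
    print(sorted(signature), sorted(files))
```

Because the extents are derived from the data, rerunning these two functions after a reload is all it takes to pick up new or changed signatures.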
Example: Consolidate Files with Similar Signatures
(DM = data manager, DE = domain expert)
• Modify schema (DM)
• Transfer tuples from A to B (DM)
• Modify collection programs
  • Modify extraction routines (DE)
  • Modify internal organization (DE)
• Modify SQL statements (DM)
Alternative
• Change two lines in a collection script (DE)

    Assert(fileA, "animation", "")
    Assert(fileA, "plottype", "animation")
    Assert(fileB, "plottype", "animation")

• Reload data (Automatic)
• Recompute signatures (Automatic)
• Republish data (Automatic)
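
The effect of such a script change on the derived signatures can be sketched as follows. This rests on an assumption about the slide: that `Assert(fileA, "animation", "")` is the old assertion and `Assert(fileA, "plottype", "animation")` its replacement. `assert_triple` is a hypothetical stand-in for the slide's `Assert`, and the set-based store is illustrative only.

```python
def assert_triple(store, subject, prop, value):
    """Hypothetical stand-in for the slide's Assert: record one triple."""
    store.add((subject, prop, value))


def signature_extents(store):
    """Group files by signature (the set of properties describing each file)."""
    props = {}
    for subject, prop, _value in store:
        props.setdefault(subject, set()).add(prop)
    extents = {}
    for subject, signature in props.items():
        extents.setdefault(frozenset(signature), set()).add(subject)
    return extents


# Before the change (assumed): the two files are described differently.
before = set()
assert_triple(before, "fileA", "animation", "")
assert_triple(before, "fileB", "plottype", "animation")

# After the change: both files use the same property.
after = set()
assert_triple(after, "fileA", "plottype", "animation")
assert_triple(after, "fileB", "plottype", "animation")

print(len(signature_extents(before)))  # 2
print(len(signature_extents(after)))   # 1
```

After the edit, recomputing signatures places both files in a single extent without any schema change or tuple transfer.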
Benefits
• Narrow interface between data creators and data managers
• Metadata exploitable prior to finalizing a thorough schema
• Derived schema can adapt to changing requirements automatically
• Profiles constitute emergent semantics: meaning is assigned after data is collected.