Metadata challenges: providing stronger assessments of data quality

Dr Lex Comber ajc36@le.ac.uk

Acknowledgements • The ideas in this presentation are the result of an ongoing collaboration with Mark Gahegan • This is a work in progress...

Aims • To expand on current notions of metadata for spatial data • To explore metadata objectives, content and roles • To consider possible metadata developments • To propose an agenda for evolving metadata Statement • “Data quality can only be determined in light of its intended use: quality is not absolute but is relative to its use” • Data is frequently (mostly?) used for purposes other than its original use • 3rd party data; more users; greater access, e.g. SDIs, INSPIRE, GRID etc • Users need to understand the uncertainties when they use the data • A dataset will have different ‘quality’ for different users (and uses)

Outline • Introduction • Spatial data variability: semantics, measurement & abstraction • Examples • Context • Users, prototypes & semiotic triangles • Standards • A research agenda for more nuanced metadata

Introduction: spatial data variability • Many different ways of conceptualising the world • Grounded in semantics and meaning • Different meanings and understandings • Sometimes called an ‘ontology’ • Geographic representation • Real world infinitely complex • Representation involves • Abstraction, Aggregation, Simplification etc • Examples

Example: UN FRA Grainger, A (2007). The influence of end-users on the temporal consistency of an international statistical process: the case of tropical Forest Statistics. Journal of Official Statistics, 23(4): 553-592

Spatial characterization can change

Example: sea level • Differences in sea level (cm) • Fact: A bridge collapsed! • Where: Laufenburg on the river Rhine • Why: The already completed bridge on the Swiss side has a difference in altitude (level) of 0.54 metres compared to the German counterpart • How: The two neighbouring countries use different measuring methods • Source: http://www.laufenburg.ch

Example: what is a forest?

Example: what is a forest? From Comber, A.J., Fisher, P.F., Wadsworth, R.A., (2005). What is land cover? Environment and Planning B: Planning and Design, 32:199-209 [Figure: scatter plot of forest definitions by Tree Height (m) against Canopy Cover (%), with points labelled by country or organisation, e.g. Zimbabwe, Sudan, United Nations FRA 2000, UNESCO, SADC, Estonia] • Does not include species, area, strip width • Data source: http://home.comcast.net/~gyde/DEFpaper.htm
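To make the consequence of differing definitions concrete, here is a minimal sketch in which one observed stand of trees is tested against several forest definitions. The definition names and threshold values are hypothetical, chosen only for illustration; they are not the values plotted in the figure, nor those of any country or organisation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForestDefinition:
    """A forest definition reduced to two thresholds (hypothetical values)."""
    name: str
    min_tree_height_m: float
    min_canopy_cover_pct: float

    def is_forest(self, tree_height_m: float, canopy_cover_pct: float) -> bool:
        # A stand counts as forest only if it meets both thresholds.
        return (tree_height_m >= self.min_tree_height_m
                and canopy_cover_pct >= self.min_canopy_cover_pct)


# Hypothetical definitions, NOT the values shown in the figure.
DEFINITIONS = [
    ForestDefinition("Definition A", min_tree_height_m=2.0, min_canopy_cover_pct=10.0),
    ForestDefinition("Definition B", min_tree_height_m=5.0, min_canopy_cover_pct=30.0),
    ForestDefinition("Definition C", min_tree_height_m=7.0, min_canopy_cover_pct=20.0),
]


def classify(tree_height_m: float, canopy_cover_pct: float) -> dict[str, bool]:
    """Return each definition's verdict for the same observed stand."""
    return {d.name: d.is_forest(tree_height_m, canopy_cover_pct) for d in DEFINITIONS}


if __name__ == "__main__":
    # One stand on the ground: 6 m trees with 25 % canopy cover.
    for name, verdict in classify(6.0, 25.0).items():
        print(f"{name}: {'forest' if verdict else 'not forest'}")
```

The same stand is ‘forest’ under one definition and ‘not forest’ under the others, so a user who only sees the label ‘forest’ cannot tell which thresholds produced it.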

Introduction: spatial data variability • Much variation in the representation of the world • Choices about representation vary depending on • Commissioning, scientific & policy context (who paid for it?) • Observer (what did you see?) • Institution (why do you see it that way?) • Measurement (how did you record it?) • So… almost everything in Geography is a matter of interpretation • The same processes may be recorded (represented) in different ways → Variation in representation & concepts

Context • Now: many more users of spatial data • Obtaining data is easy and quick • Web, INSPIRE, SDIs (click through download) • No gatekeeper, no negotiation • Users may assume that data about ‘forest’ or ‘height above sea level’ etc matches their concept, their understanding • Prototypes in cognitive science

Context • Semiotic triangle • Real world • GI conceptualisations • User prototypes • GI is interpreted from personal & group conceptualizations of the world • Geographical data are mapped into those conceptualizations • Then provided to users [Figure: semiotic triangle; labels: ‘real world’, measurement, User, GI]

Context • How does the user • Understand the data-to-real world link? • Avoid mis-matches with their Prototype, Conceptual model, Analytical objectives, or Existing data? • Determine data quality? • Ensure robust analysis? • Users might expect metadata to support their activity… • The meta-descriptions of metadata in standards support that view… [Figure: semiotic triangle; labels: ‘real world’, analysis, measurement, User, GI, metadata]

Context • Geo-spatial data quality and metadata standards: • Positional Accuracy, Attribute Accuracy, Lineage, Logical consistency, Completeness • In many early standards: DCDSTF, 1988; FGDC, 1998; ANZLIC, 2001; ISO, 2003; OGC; INSPIRE • Distilled into the Dublin Core • Dublin Core Metadata Element Set identifies 15 components • Contributor, Coverage, Creator, Date, Description, Format, Identifier, Language, Publisher, Relation, Rights, Source, Subject, Title, Type • Relate to mainstream information sources • Books, web pages • Based on IP, cataloguing, retrieval & discovery • How to document information
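As a rough illustration of the gap between discovery metadata and quality metadata, the sketch below audits a metadata record against the 15 Dublin Core elements listed above. The function name `audit_record`, the example record and its values are hypothetical, and the element names are simply lower-cased for matching.

```python
# The 15 Dublin Core elements, as listed in the slide above.
DUBLIN_CORE_ELEMENTS = frozenset({
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
})


def audit_record(record: dict[str, str]) -> dict[str, list[str]]:
    """Sort a record's keys into Dublin Core elements, other keys, and gaps."""
    keys = {key.lower() for key in record}
    return {
        "dublin_core": sorted(keys & DUBLIN_CORE_ELEMENTS),
        "outside_dublin_core": sorted(keys - DUBLIN_CORE_ELEMENTS),
        "missing": sorted(DUBLIN_CORE_ELEMENTS - keys),
    }


if __name__ == "__main__":
    # Hypothetical record; the values are invented for illustration.
    record = {
        "Title": "Example land cover map",
        "Creator": "Example mapping agency",
        "Date": "2001",
        "positional_accuracy": "25 m",
        "lineage": "Classified from satellite imagery",
    }
    report = audit_record(record)
    print("Dublin Core elements present:", report["dublin_core"])
    print("Fields outside Dublin Core:  ", report["outside_dublin_core"])
    print("Dublin Core elements missing:", len(report["missing"]))
```

In this sketch the quality-related fields (positional accuracy, lineage) fall outside the element set, which reflects the point that Dublin Core targets cataloguing and discovery.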

Context • Metadata objectives: • “Data about data or a service. Metadata is the documentation of data. In human-readable form, it has primarily been used as information to enable the manager or user to understand, compare and interchange the content of the described data set” (ISO, 2003a) • BUT standards reflect the process of data production • Lineage from methods and data sources • Accuracy, Consistency and Completeness from assessment of results • Little focus on use • Little focus on assessments of data quality

Context • Currently metadata does NOT close the semiotic loop • In part this is the nature of standards... ... in theory provide a common language ... But their specification (content) is always a compromise and lags behind research & practice • E.g. a recent book on spatial data standards took 10 years from inception to being published.

Research Agenda • Can users make sense of the metadata provided? • Does it meet their needs? • Are the various MD fields relevant in this new context? • Are there important omissions? • Are there opportunities for further richness provided by recent innovations in information science? • Will data producers be able to keep up with metadata production at ever-increasing data rates? • In short: does metadata need to be re-envisioned for these new technologies and use-cases?

Research Agenda to support user evaluations of data quality 1. Metadata for what purpose, what roles? • Currently based on Archive, Discovery, Citations and Browsing. • Is this complete? • What about data quality assessments? Semantics? 2. Metadata for what kinds of resources (not just data)? • Just datasets? Too shortsighted? What about: • Methods? Workflows? Research Questions? Researchers? • There are syntactic and semantic issues for each of the above: e.g. Methods can be described by syntactic signatures but that does not describe what they do to the user… 3. Actionable metadata? • Today’s information systems are poor consumers of metadata… • Do the tools we use make effective use of metadata? • E.g. the GIS community has spent much time and effort on uncertainty metadata, even though the systems cannot analyze and propagate uncertainty during analysis
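One way to picture ‘actionable’ metadata is a tool that reads a documented positional accuracy and carries it through an analysis. The sketch below does this for a simple distance measurement. It rests on assumptions that are not in the slides: the accuracy is a per-axis standard deviation in metres, errors at the two points are independent, and first-order error propagation is adequate (the distance is large relative to the errors). The function name is hypothetical.

```python
import math


def distance_with_uncertainty(
    p1: tuple[float, float],
    p2: tuple[float, float],
    sigma1_m: float,
    sigma2_m: float,
) -> tuple[float, float]:
    """Return (distance, standard deviation of distance) in metres.

    sigma1_m and sigma2_m are assumed to be per-axis positional standard
    deviations taken from each dataset's metadata.
    """
    dx = p2[0] - p1[0]
    dy = p2[1] - p1[1]
    distance = math.hypot(dx, dy)
    # First-order propagation for d = sqrt(dx^2 + dy^2): the partial
    # derivatives are dx/d and dy/d, whose squares sum to 1, so
    # var(d) = sigma1^2 + sigma2^2 for independent, isotropic errors.
    sigma_distance = math.sqrt(sigma1_m ** 2 + sigma2_m ** 2)
    return distance, sigma_distance


if __name__ == "__main__":
    # Two points from datasets with documented accuracies of 3 m and 4 m.
    d, sigma_d = distance_with_uncertainty((0.0, 0.0), (300.0, 400.0), 3.0, 4.0)
    print(f"distance = {d:.1f} m, standard deviation = {sigma_d:.1f} m")
```

The point of the sketch is the design, not the formula: the accuracy values travel with the data and change the reported result, rather than sitting unused in a metadata record.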

Research Agenda to support user evaluations of data quality 4. Does the role of Standards need to change? • Many metadata standards, and for a variety of purposes. • Re-invented by different disciplines/groups • Who gets to make the standards? • Should standards cover all metadata needs for science communities? 5. Cost and time for creating metadata standards? • How long does it take (examples from EU and ISO)? • What is the typical cost? • What does the metadata standards development process look like? • Do communities always accept them? 6. The burden of metadata production? • Often an ‘unfunded mandate’ • Documenting standards ignored to various degrees • Are metadata standards failing? (e.g. NSDI) • Are we sure we are collecting information that is useful?

Research Agenda to support user evaluations of data quality 7. Conveying understanding: Capturing and representing domain semantics? • There are many realities…each user of some resource brings a different understanding and potentially different metadata needs • Representing data semantics: (i) for users, (ii) for foreign systems • Using meta-models, where some domain semantics are first defined, then used to construct information schemas (e.g. NADM: North American Data Model for Geological Mapping) • Using ontologies for knowledge domains and tasks (e.g. NASA’s SWEET ontology of Earth processes and regions) 8. Mining situational metadata from use-cases (provenance)? • User ranking and feedback: • What works? What is missing? What is known? What is unknown? • Use-case logging: monitor use via a web portal / library, warehouse… • Use counts by web domains: differentiate user communities • Use-case mining and analysis • Discover significant usage patterns, use these to infer relevance, e.g. recommender systems, • Genesis, derivation, workflows • By exposing, analyzing and documenting the means by which the dataset was produced
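The ‘use counts by web domains’ idea in item 8 could be prototyped along the following lines. This is a sketch under stated assumptions: the log format (dataset identifier, requesting host), the suffix-to-community table and the host names are all hypothetical, and a real portal would need a far more careful mapping from hosts to user communities.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from host suffix to user community.
COMMUNITY_BY_SUFFIX = {
    ".ac.uk": "academic",
    ".edu": "academic",
    ".gov.uk": "government",
    ".gov": "government",
}


def community_for(host: str) -> str:
    """Assign a requesting host to a user community by its domain suffix."""
    host = host.lower().rstrip(".")
    for suffix, community in COMMUNITY_BY_SUFFIX.items():
        if host.endswith(suffix):
            return community
    return "other"


def use_counts(log: list[tuple[str, str]]) -> dict[str, dict[str, int]]:
    """Count downloads per dataset, broken down by user community."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for dataset_id, host in log:
        counts[dataset_id][community_for(host)] += 1
    return {dataset_id: dict(c) for dataset_id, c in counts.items()}


if __name__ == "__main__":
    # Hypothetical download log: (dataset identifier, requesting host).
    log = [
        ("landcover", "geog.example.ac.uk"),
        ("landcover", "maps.example.gov.uk"),
        ("landcover", "lab.example.edu"),
        ("dem", "example.com"),
    ]
    for dataset_id, by_community in use_counts(log).items():
        print(dataset_id, by_community)
```

Counts of this kind could then be attached to the dataset's metadata record as situational metadata, showing which communities actually use it.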

Research Agenda to support user evaluations of data quality 9. Mining semantic metadata from resources and schemas? • Ontology mining • inferred from schema (metadata) - mappings built from exposed data schema • inferred from data in some cases - schema and data to construct ontology 10. Evolving metadata? • the way we describe the world keeps changing • …and we learn more about how things are used • The way we think about metadata now has evolved considerably over the last 20 years • we should expect that to continue. • Metadata schemas need to be designed for expansion and replacement as science evolves. • Meta-models help a lot, but are they flexible enough? Will emergent use patterns lead to new insights?

Final remarks • Assertion 1: Current attempts to gather and utilize metadata for data quality assessments are failing... • Assertion 2: The burden of tagging existing and future data with user-relevant metadata to do this is overwhelming • We cannot realistically expect data producers to carry this burden alone • Many different approaches to metadata creation are open to us • Some are new, facilitated by ‘grid’ and web service ‘brokered’ access to e-resources • We need to try some of these on a large scale. • These research ideas are intended to augment the ongoing work of INSPIRE / ESDIN, etc (not a critique) • The stakes are high: our success in sharing data – of which data quality assessments are a key part – will have big repercussions for research and policy for years to come
