Metadata: why and how for social science. Louise Corti. Online Resources Day, 15 November 2005, London
What Do Social Researchers Want? • Discover available datasets (globally, not just in their own country) and related research literature • Understand in detail the origin, methodology and structure of datasets (social sciences datasets are modest in size but big in complexity) • Compare and Link data from different sources • Model the social phenomena underlying the data • Publish their findings with all the supporting evidence (no ‘iceberg’ publishing) and Reproduce published results • Connect to other experts and Share informal comments and advice • Enforce confidentiality and intellectual property rights while maintaining accuracy and access to data sources • … and more
How? • through rich and systematic description, through a language that humans and computers can both understand • using commonly agreed or mappable vocabularies and standards • which must be flexible and adaptable • metadata
What are metadata? Metadata are structured data which describe the characteristics of an object or resource. They share many characteristics with the cataloguing that takes place in libraries, museums and archives. The term "meta" derives from the Greek word denoting a nature of a higher order or more fundamental kind. A metadata record typically consists of a number of pre-defined elements representing specific attributes of a resource, and each element can have one or more values.
Metadata schema. Element name: Value • Title: Web UKDA Catalogue • Creator: Louise Corti • Publisher: UK Data Archive • Identifier: http://www.data-archive.ac.uk/ • Format: text/html • Relation: Data Archive Web site. Each metadata schema will usually have the following characteristics: • a limited number of elements • the name of each element • the meaning of each element
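The element/value record on this slide can be sketched as a small data structure. This is only an illustration: the `SCHEMA` meanings are paraphrased assumptions and `validate` is a hypothetical helper, not part of any standard.

```python
# Hypothetical mini-schema: a limited set of element names, each with a meaning.
# The meanings are paraphrased for illustration, not quoted from a standard.
SCHEMA = {
    "Title": "name given to the resource",
    "Creator": "person or body primarily responsible for the content",
    "Publisher": "body making the resource available",
    "Identifier": "unambiguous reference to the resource, e.g. a URL",
    "Format": "media type of the resource",
    "Relation": "a related resource",
}

# The record from the slide. Each element can have one or more values,
# so every value is stored as a list.
record = {
    "Title": ["Web UKDA Catalogue"],
    "Creator": ["Louise Corti"],
    "Publisher": ["UK Data Archive"],
    "Identifier": ["http://www.data-archive.ac.uk/"],
    "Format": ["text/html"],
    "Relation": ["Data Archive Web site"],
}


def validate(record, schema):
    """Return the element names used in `record` that `schema` does not define."""
    return sorted(name for name in record if name not in schema)
```

Calling `validate(record, SCHEMA)` returns an empty list for the slide's record, since every element it uses is defined in the schema.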
International standards for metadata schema • ensure that every element of information pertaining to the lifecycle of an object (collection) can be captured: creation, appraisal, accessioning, conservation, preservation, availability and access • must be dynamic and open to amendment • aim for consistent, appropriate and self-explanatory description • facilitate the retrieval and exchange of information • enable the sharing of authority data • enable the integration of descriptions from different locations into a unified information system
Common metadata schemas • Dublin Core: the minimum number of elements required to facilitate the discovery of document-like objects in a networked environment (e.g. the Internet). Currently 15: Content: Title, Subject, Description, Source, Language, Relation, Coverage; Intellectual Property: Author/Creator, Publisher, Contributor, Rights; Electronic/Physical Manifestation: Date, Type, Format, Identifier • ISAD(G): General International Standard Archival Description • e-GIF: e-Government Interoperability Framework • OAIS: Open Archival Information System Reference Model • OAI: Open Archives Initiative Protocol for Metadata Harvesting
No shortage of statistical metadata standards • The Common Warehouse Metamodel (CWM) from OMG: data warehousing and business intelligence • ISO 11179: data elements in a metadata repository • SDMX: multidimensional data and time-series • IQML, AskXML and Triple-S: questionnaire data • The Data Documentation Initiative (DDI): a general metadata standard for statistical data (micro as well as aggregated) • and many other related standards. e-Social Science requires more than simple “data” metadata: • Thesauri, Classifications
Encoding schemes • HTML (HyperText Markup Language, used in Web pages; version 3.2 or 4.0) • SGML (Standard Generalised Markup Language) • XML (eXtensible Markup Language) • RDF (Resource Description Framework) • MARC (MAchine Readable Cataloging) • MIME (Multipurpose Internet Mail Extensions) • Z39.50 (protocol for distributed information retrieval) • LDAP (Lightweight Directory Access Protocol)
Example of deploying metadata for a simple web resource • embedded in the Web page by its creator, using META tags in the HTML coding of the page • as a separate document (e.g. XML) linked to the web resource it describes • in a database linked to the web resource; the records may either have been created directly within the database or extracted from another source, such as Web pages • but what about complex social science data?
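The first option, embedding metadata in a page with META tags, can be sketched as follows. This is a minimal illustration, assuming the common convention of prefixing Dublin Core element names with `DC.`; `meta_tags` is a hypothetical helper written for this example.

```python
from html import escape


def meta_tags(record, prefix="DC."):
    """Render a metadata record as HTML <meta> tags, one tag per value.

    `record` maps element names to lists of values. The "DC." prefix is an
    assumed naming convention; adjust it to whatever scheme the page uses.
    """
    tags = []
    for element, values in record.items():
        for value in values:
            name = escape(prefix + element.lower(), quote=True)
            content = escape(value, quote=True)  # escape quotes and ampersands
            tags.append(f'<meta name="{name}" content="{content}">')
    return tags


if __name__ == "__main__":
    example = {
        "Title": ["Web UKDA Catalogue"],
        "Creator": ["Louise Corti"],
        "Format": ["text/html"],
    }
    print("\n".join(meta_tags(example)))
```

The resulting tags would be placed inside the page's `<head>` element. The same record could equally be serialised as a separate XML document or stored as a database row, which are the other two options on the slide.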
Stepping back: The Standard Study Description • devised in the 1970s to describe academically created sociological/political science datasets • recommended key bibliographic elements • informally ‘adopted’ by CESSDA in the 1980s • often adapted to suit local needs
The Standard Study Description: recommended elements • subject category • title • depositor • principal investigator • abstract and main topics • kind of data • dimensions of dataset • universe sampled • sampling procedures • method of data collection • dates of coverage, fieldwork and deposit • availability and access conditions • references to reports and related datasets. Controlled vocabulary • adopted for some elements, e.g. sampling, kind of data • subject and geographical keywords from a broad social science thesaurus (HASSET)
The first step towards interoperability • driven by the need to search across European Data Archive holdings • development of a core element set for the Integrated Data Catalogue (IDC) • catalogue records marked with standard tags for inclusion in WAIS (Wide Area Information Servers) indexes • enabled multi-site searching via the WAIS protocol • simplistic: excluded links to additional metadata, documentation, thesaurus help, and browsing
The DDI is widely adopted by social science data archives all over the world, which provide many of the datasets used by social scientists for secondary analysis • initiated and organised by the Inter-University Consortium for Political and Social Research (USA) in 1995 to create a metadata standard for the social science community • members come from social science data archives and libraries in the USA, Canada and Europe and from major producers of statistical data • first in SGML, then in XML • DDI 1.0 published in 2000. Currently at version 2. Version 3 is being designed and is scheduled for 2006
The Structure of a DDI Codebook • Document Description: description of the codebook document itself (author, sources, etc.) • Study Description: information about the entire study or data collection (content, collection methods, processing, sources, access conditions, etc.) • File Description: description of each single file of the data collection (formats, dimensions, processing information, etc.) • Data Description: description of each single variable in a datafile (format, variable and value labels, definitions, question texts, imputations, etc.) • Other Study-related Materials: references to reports and publications and other machine-readable documentation
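The five-part structure can be sketched as an empty XML skeleton. This is a rough illustration using Python's standard library; the element names (`codeBook`, `docDscr`, `stdyDscr`, `fileDscr`, `dataDscr`, `otherMat`) are given as they appear in DDI version 2 codebooks to the best of my recollection and should be checked against the published specification before use.

```python
import xml.etree.ElementTree as ET

# (element name, section title from the slide). The element names are assumed
# to match the DDI 2 codebook; verify against the official DTD or schema.
SECTIONS = [
    ("docDscr", "Document Description"),
    ("stdyDscr", "Study Description"),
    ("fileDscr", "File Description"),
    ("dataDscr", "Data Description"),
    ("otherMat", "Other Study-related Materials"),
]


def codebook_skeleton():
    """Build an empty codebook with one child element per top-level section."""
    root = ET.Element("codeBook")
    for tag, title in SECTIONS:
        section = ET.SubElement(root, tag)
        section.append(ET.Comment(f" {title} "))  # placeholder for real content
    return root


if __name__ == "__main__":
    print(ET.tostring(codebook_skeleton(), encoding="unicode"))
```

A real codebook would fill each section with further elements, for example one variable description per variable inside the data description, and may repeat the file description once per data file.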
Data description - variables
000001 1 1 44 123 9 5 4 5
000002 1 3 47 003 1 3 3 3
000003 2 5 43 155 1 1 2 3
000004 1 3 36 012 2 5 5 5
000005 9 4 24 207 9 1 4 5
[Figure: the raw data records above, annotated with the variable labels Country, Occupation, CaseNumber, Sex, Age, QuestionResponses]
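Raw records like these cannot be interpreted without a variable-level description. The sketch below shows how such a description turns each record into labelled values. It assumes the records are whitespace-delimited, and the column order in `FIELDS` is a guess made for illustration only: the slide text does not preserve which label belongs to which column.

```python
# The five records shown on the slide.
RAW = """\
000001 1 1 44 123 9 5 4 5
000002 1 3 47 003 1 3 3 3
000003 2 5 43 155 1 1 2 3
000004 1 3 36 012 2 5 5 5
000005 9 4 24 207 9 1 4 5
"""

# ASSUMPTION: this column order is hypothetical. In practice it would come from
# the data description section of the codebook, not from guesswork.
FIELDS = ["CaseNumber", "Sex", "Country", "Age", "Occupation"]


def parse(line):
    """Split one record into named variables plus trailing question responses."""
    parts = line.split()
    row = dict(zip(FIELDS, parts[: len(FIELDS)]))
    # Any remaining columns are treated as the question responses.
    row["QuestionResponses"] = parts[len(FIELDS):]
    return row


# Values are kept as strings so that codes such as "003" keep their leading zeros.
rows = [parse(line) for line in RAW.splitlines() if line.strip()]
```

With a machine-readable description of each variable, including value labels and missing-value codes, the same approach lets software load and label a data file without manual intervention.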
Understanding Statistical Metadata. Different approaches to understanding: • What is it for? Statistical metadata has no value in itself; it is just a means to an end. Its progress should be measured by the extent to which it facilitates social research • What is it like? Is there anything familiar we can relate it to? A form of communication might be a good choice
Benefits • interoperability: homogeneous, exchangeable documents • richer content: a comprehensive set of elements providing the potential data analyst with broader knowledge • single document, multiple purposes: repurposed for different needs and applications (preservation, discovery, and dissemination) • on-line subsetting and analysis: a standard uniform structure and content for variables ensures easy import into on-line analysis systems • precision in searching: field-specific searches across documents are enabled • and more … • human-readable and computer-actionable • essential foundation for e-Science and the Grid
EU Madiera Portal • Search • Multilingual Browsing • Meta(data) Browsing
Summary - the DDI • The DDI can serve as the foundation for content, distribution, use and preservation of data collections in the social and behavioural sciences, across institutions, countries, and disciplines • cooperation from both data producers and statistical software manufacturers, so that the DDI specification can readily become the basis for the entire research process, from generation of a data collection instrument to production of research articles • serves the social science community well with a specification that produces quality metadata with multiple purposes: it fully documents the details of datasets, is user-friendly and accessible, integrates into the infrastructure of the Web, and supports automatic generation of statistical software system files • the widespread adoption of the DDI will vastly improve access to a range of varied datasets. Expanded use will greatly enhance comparative research; the ability to harmonize datasets over time and geography will lead to significant improvement in our understanding of societies
The future. Statistical metadata is here, and it is already changing the way people locate and make sense of data, but it does not yet support most use cases of interest to social scientists. What we will need to move forward is: • Grammar: a standard semantic infrastructure (e.g. as provided by the Semantic Web), offering semantic extensibility and the ability to integrate (merge and override) descriptions from different sources • a large Vocabulary, built by integrating different flavours of metadata: unique identifiers for data and research literature; statistical data metadata (full life cycle); ontologies, thesauri and classifications (and mappings among them); statistical processing metadata; “secondary metadata” (annotations, quality assessment, links to research literature); expert metadata (FOAF)
Future developments: • Progress in metadata and technical standardisation • Latent knowledge capture and extraction. Not Even Half Way There ... [Diagram labels: Annotations, Comparable variables, Unified Authentication, Integrated Data Catalogue, Nesstar – Data Web, Grid, Mappings, References Extraction, Cooperative Markup, ELSST, DDI Standard, RDF, Semantic Web, USI, TEI for QD]
Qualitative data and the DDI • in October 2001 ESDS Qualidata formally adopted the DDI to describe data • in 2000, began to explore standards for archiving and web representation of qualitative data • expertise from the text processing/arts and humanities communities: TEI • ESDS Qualidata Online shows the basic potential of what can be achieved by a common standard • need to catch up with the statistical community! • working model that will be presented today