1. Data Documentation Initiative: A global phenomenon – coming soon to ABS! Wendy Thomas
Chair, DDI Technical Implementation Committee
1 September 2009
2. Acknowledgments Slides provided for use by:
Wendy Thomas
Pascal Heus
Mary Vardigan
Peter Granda
Nancy McGovern
Jeremy Iverson
Dan Smith
3. What is Metadata? Common definition: Data about Data
Provides descriptive information about an object or concept
Properties, characteristics (in XML: elements and attributes)
It does not alter the content or nature of the object
It can be carried around without having to share the underlying object: catalogs, cars, libraries, etc.
It is usually public domain (important for sensitive data)
4. Managing data and metadata is challenging! Many actors & communities with different needs and perspectives
Users: want open access to high quality and well documented data. Need discovery tools.
Public sector, private sector, academics
Producers: prepare the data and need to comply with privacy laws
Data Archives: need to interface with both communities
Policy Makers: need data to measure results and impact and to plan ahead
Sponsors: want to support the most relevant data collection
Public and Media: want access to simple, easy to understand statistics
Solving information management issues is what ICT & XML are for
5. Summary of ABS Metadata Management Principles Life-cycle focus
Data supported by accessible metadata
Metadata available and useable in context of client’s need
Registration authority for metadata elements
Clear identification, ownership, approval status of metadata elements
Describe metadata flow
Reuse metadata
Capture at source
Capture derivable metadata automatically
Ensure cost/benefit of metadata
Variations from standards tightly documented
Make metadata active to the greatest possible extent
6. NISO: A FRAMEWORK OF GUIDANCE FOR BUILDING GOOD DIGITAL COLLECTIONS Nice community document worth adopting
7. Some major XML metadata specifications for data content management Statistical Data and Metadata Exchange (SDMX)
Macrodata, time series, indicators, registries
http://www.sdmx.org
Data Documentation Initiative (DDI)
Microdata (surveys, studies), aggregate, administrative data
http://www.ddialliance.org
ISO/IEC 11179
Semantic modeling, concepts, registries
http://metadata-standards.org/11179/
ISO 19115
Geography
http://www.isotc211.org/
Dublin Core
General resources (documentation, images, multimedia)
http://www.dublincore.org
This is a set of specifications for socio-economic data
When it comes to implementation, these are complemented with commonly used ICT specifications such as the XML family of recommendations, SOAP, OASIS WS-* security specifications, SVG, etc.
8. Metadata provides support for: Survey and data collection preparation
Data collection
Data processing
Analysis
Data discovery and access
Replication
Repurposing (secondary data use or data products)
9. Metadata Metadata is essential information for research and reuse of data
The further data gets from its source, the greater the importance of the metadata
Content is critical
Structure is becoming increasingly important in a networked world
10. Why Standards? Standards provide structure for:
Accurate transfer of content between systems
Increased automation of ingest, reducing costs
Interoperability between systems and software
Structural base for discovery and comparison
11. Example: Dublin Core
Print card catalogs: static, stationary
Standalone databases: proprietary structure, little cross-site searching
WorldCat and Google: standardized content, cross-site searching
12. Interacting Standards for Data Dublin Core
ISO/IEC 11179
ISO 19115 – Geography
Statistical Packages
METS
PREMIS
SDMX
DDI
Dublin Core:
Citation structure
Coverage
Temporal
Topical
Spatial
Location specific information
13. Interacting Standards for Data Dublin Core
ISO/IEC 11179
ISO 19115 – Geography
Statistical Packages
METS
PREMIS
SDMX
DDI
ISO/IEC 11179:
Structure and content of a data element as the building block of information
Supports registry functions
Provides
Object
Property
Representation
14. Interacting Standards for Data Dublin Core
ISO/IEC 11179
ISO 19115 – Geography
Statistical Packages
METS
PREMIS
SDMX
DDI
ISO 19115 – Geography:
i.e., ANZLIC and US FGDC
Focus is on describing spatial objects and their attributes
15. Interacting Standards for Data Dublin Core
ISO/IEC 11179
ISO 19115 – Geography
Statistical Packages
METS
PREMIS
SDMX
DDI
Statistical Packages:
Proprietary standards
Content is generally limited to:
Variable name
Variable label
Data type and structure
Category labels
Translation tools used to transport content
16. Interacting Standards for Data Dublin Core
ISO/IEC 11179
ISO 19115 – Geography
Statistical Packages
METS
PREMIS
SDMX
DDI
METS:
Digital Library Federation
Consistent outer wrapper for digital objects of all types
Contains a profile providing the structural information for the contained object
17. Interacting Standards for Data Dublin Core
ISO/IEC 11179
ISO 19115 – Geography
Statistical Packages
METS
PREMIS
SDMX
DDI
PREMIS:
Preservation information for digital objects
18. Interacting Standards for Data Dublin Core
ISO/IEC 11179
ISO 19115 – Geography
Statistical Packages
METS
PREMIS
SDMX
DDI
SDMX:
Developed for statistical tables
Supports well structured, well defined data, particularly time-series data
Contains both metadata and data
Supports transfer of data between systems
19. Interacting Standards for Data Dublin Core
ISO/IEC 11179
ISO 19115 – Geography
Statistical Packages
METS
PREMIS
SDMX
DDI
DDI:
Version 3.0 covers the life cycle of data and metadata
Data collection
Processing
Management
Reuse or repurposing
Support for registries
Grouping & Comparison
20. Metadata Coverage Dublin Core
ISO/IEC 11179
ISO 19115
Statistical Packages
METS
PREMIS
SDMX
DDI
[Packaging]
Citation
Geographic Coverage
Temporal Coverage
Topical Coverage
Structure information
Physical storage description
Variable (name, label, categories, format)
Source information
Methodology
Detailed description of data
Processing
Relationships
Life-cycle events
Management information
21. The Data Documentation Initiative (DDI) International XML based specification
Started in 1995, now driven by DDI Alliance (30+ members)
Became XML specification in 2000 (v1.0)
Current version is 2.1 with focus on archiving (codebook)
New Version 3.0 (2008)
Focus on entire survey “Life Cycle”
Provide comprehensive metadata on the entire survey process and usage
Aligned with other metadata standards (DC, MARC, ISO/IEC 11179, SDMX, …)
Include machine actionable elements to facilitate processing, discovery and analysis
22. Intent of DDI Design Facilitate point-of-origin capture of metadata
Reuse of metadata to support:
Consistency and accuracy of metadata content
Provide internal and external implicit comparisons
Support external registries of concepts, questions, variables, etc.
Metadata driven processing
Provide clear paths of interaction with other major standards
23. Basic Structures DDI 3 uses a model similar to SDMX in terms of the following:
Identifiable, Versionable, and Maintainable objects
The use of multiple schemas to describe different process sub-sections in the life-cycle
Use of schemes to facilitate reuse of common materials
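As a rough illustration of the three object levels named above, the sketch below layers them as Python classes. It assumes that an identifiable object carries an ID, a versionable object adds a version, and a maintainable object adds a maintaining agency. The class layout, the sample values, and the `reference()` string format are hypothetical and are not the DDI XML or URN syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identifiable:
    """Smallest unit: can be pointed at by ID (assumed reading of the term)."""
    id: str


@dataclass(frozen=True)
class Versionable(Identifiable):
    """An identifiable object whose content can change over time."""
    version: str


@dataclass(frozen=True)
class Maintainable(Versionable):
    """A versionable object that is owned and published by an agency."""
    agency: str

    def reference(self) -> str:
        # Hypothetical reference string, NOT the real DDI URN format.
        return f"{self.agency}:{self.id}:{self.version}"


# Example values are made up for illustration.
scheme = Maintainable(id="cs_1", version="1.0", agency="example.agency")
print(scheme.reference())
```

Each level only adds what it needs, so a single category can stay lightweight while the scheme that contains it carries the agency and version used for management.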
24. DDI: Full content coverage for survey and administrative data Conceptual coverage
Methodology
Data Collection
Processing – cleaning, paradata
Recoding and derivations
Variable and tabular content
Internal relationships
Physical storage
Data management
25. Plus: Relationships between studies Comparison by design
Study series can inherit from earlier metadata
Capture changes only
Data integration
Mapping of codes between source and target
Capture comparison information
Comparison of abstract content models
Publication of reusable materials (code schemes, concept schemes, geographic structure, etc.)
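The "mapping of codes between source and target" bullet can be pictured with a small sketch. The code lists and the `harmonize` helper below are hypothetical; in DDI the mapping would be recorded as metadata, and this only shows what applying such a mapping to data might look like.

```python
# Hypothetical code lists for the same concept in two studies.
SOURCE_TO_TARGET = {
    "1": "M",   # source "1" (male)   -> target "M"
    "2": "F",   # source "2" (female) -> target "F"
    "9": None,  # source "9" (not stated) has no target equivalent
}


def harmonize(values, mapping):
    """Translate source codes to target codes; unmapped codes become None."""
    return [mapping.get(value) for value in values]


source_data = ["1", "2", "2", "9", "1"]
print(harmonize(source_data, SOURCE_TO_TARGET))
```

Keeping the mapping as data rather than as hard-coded logic is what lets the same comparison information be documented, published, and reused.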
26. Current Areas of DDI Development Controlled vocabularies to improve machine actionability
Data collection methodology and process expansion for more depth and detail
Qualitative data
Increased comparison coverage
Tools
27. DDI 3.0 Metadata Life Cycle Data and metadata creation is not a static process: it evolves dynamically over time and involves many agencies and individuals
DDI 2.x is about archiving, DDI 3.0 focuses on the entire “life cycle”
3.0 emphasizes metadata reuse to minimize redundancy and discrepancies, support comparison, and drive the data and metadata creation process
Supports multilingual, grouping, geography, and registries
3.0 is extensible
28. When to capture metadata? Metadata must be captured at the time the event occurs! (not after the fact)
Documenting after the fact leads to considerable loss of information
This is true for producers and researchers
The first figure outlines the various stages of a survey production process
The graph on the right illustrates the amount of metadata that is typically recovered if the knowledge capture occurs after the fact (blue line) versus the amount of metadata actually generated throughout the process (red line)
29. Reuse DDI is designed around schemes (lists of items) for commonly reused information within a study such as categories, code schemes, concepts, universe, etc.
Items are “used” in multiple locations in a DDI document by referencing the item in the list
Enter once, use in multiple locations
Items can be versioned for management over time without having to change content in multiple locations
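The "enter once, use in multiple locations" idea can be sketched in a few lines. This is a minimal illustration, assuming a scheme is a list of items keyed by ID and that variables store references rather than copies. The `Category` class, IDs, and variable names are hypothetical and do not reproduce DDI XML.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Category:
    """One reusable item in a scheme (hypothetical structure)."""
    id: str
    version: str
    label: str


# A scheme is a list of items, entered once and keyed by ID.
category_scheme = {
    c.id: c
    for c in [
        Category("cat_yes", "1.0", "Yes"),
        Category("cat_no", "1.0", "No"),
    ]
}

# Variables hold references (IDs), not copies of the labels.
variables = {
    "EMPLOYED": ["cat_yes", "cat_no"],
    "STUDENT": ["cat_yes", "cat_no"],
}


def labels_for(variable: str) -> list[str]:
    """Resolve a variable's category references against the scheme."""
    return [category_scheme[ref].label for ref in variables[variable]]


# A new version of one item is entered in one place only;
# every variable that references it sees the change.
category_scheme["cat_yes"] = Category("cat_yes", "1.1", "Yes (confirmed)")
print(labels_for("EMPLOYED"))
print(labels_for("STUDENT"))
```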
30. Comparison and Registries Information in DDI schemes can be published in external registries and used by multiple studies
Provides implicit comparison both within a study and between studies
Supports organizational consistency through the use of agreed content managed in registries
Referencing structured lists provides further context to individual items used in a study
31. Metadata driven processing Capturing metadata upstream can provide over 90% of the building blocks needed to generate the remainder of the metadata
DDI supports embedding command code that drives data capture and data processing during and after collection, and supports post-collection recoding, derivations, and harmonization maps
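A small sketch of what metadata-driven processing could look like: the cleaning rules live in the variable metadata, and one generic routine applies them. The metadata layout, variable names, and the `clean` function are assumptions made for illustration and are not a DDI structure or an existing tool.

```python
# Hypothetical variable-level metadata captured upstream.
variable_metadata = {
    "AGE": {"type": int, "valid_range": (0, 120), "missing": {999}},
    "SEX": {"type": str, "valid_codes": {"M", "F"}, "missing": {"X"}},
}


def clean(record, metadata):
    """Apply the rules in the metadata to one raw record."""
    cleaned = {}
    for name, rules in metadata.items():
        value = rules["type"](record[name])
        if value in rules["missing"]:
            cleaned[name] = None  # declared missing value
            continue
        if "valid_range" in rules:
            low, high = rules["valid_range"]
            if not low <= value <= high:
                raise ValueError(f"{name}={value!r} outside {low}..{high}")
        if "valid_codes" in rules and value not in rules["valid_codes"]:
            raise ValueError(f"{name}={value!r} is not a valid code")
        cleaned[name] = value
    return cleaned


print(clean({"AGE": "34", "SEX": "F"}, variable_metadata))
print(clean({"AGE": "999", "SEX": "M"}, variable_metadata))
```

Adding a variable or changing a valid range then means editing metadata only; the processing code stays the same.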
32. Questions to Variables
33. Working with other standards There is no single standard that does it all
DDI was specifically designed to support easy interaction with:
Dublin Core – mapping of citation elements and embedding native Dublin Core
ISO/IEC 11179 – working with an editor of the standard to reflect data element model and ISO/IEC 11179-5 naming conventions for items intended for registries
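The Dublin Core mapping can be illustrated with a short sketch. The target names (`title`, `creator`, `publisher`, `date`, `identifier`) are elements of the Dublin Core Metadata Element Set, but the source field names on the left of the mapping, the `record` wrapper element, and the sample citation are hypothetical; the actual DDI-to-Dublin-Core mapping is defined by the specification, not by this code.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Hypothetical citation field names mapped to Dublin Core element names.
CITATION_TO_DC = {
    "title": "title",
    "producer": "creator",
    "distributor": "publisher",
    "date": "date",
    "id": "identifier",
}


def to_dublin_core(citation):
    """Build a simple Dublin Core record from a citation dictionary."""
    root = ET.Element("record")  # wrapper name chosen for illustration
    for field, dc_name in CITATION_TO_DC.items():
        if field in citation:
            element = ET.SubElement(root, f"{{{DC_NS}}}{dc_name}")
            element.text = citation[field]
    return root


# Made-up citation used only to exercise the mapping.
record = to_dublin_core(
    {"title": "Example Household Survey", "producer": "Example Statistics Office"}
)
print(ET.tostring(record, encoding="unicode"))
```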
34. Standards continued SDMX – DDI NCubes were revised to incorporate the ability to attach attributes to any area of a cube and map cleanly into and out of SDMX cubes. SDMX has added means of attaching fragments of DDI which provide source and processing information that can be indexed and delivered through SDMX tools.
ISO 19115 (ANZLIC) – Geographic elements in DDI are structured to reflect basic discovery elements used by geographic search engines and provide the detailed geographic structure information needed by GIS systems to incorporate the data accurately
35. DDI does not replace good content DDI structures metadata to leverage content
Collection and processing
Discovery and access
Analysis and repurposing
Registries
Comparison
DDI is not a software application
Supports and informs software applications
DDI is a neutral archival structure
Preserving content and relationships
36. Value Supports consistent use of concepts, questions, variables, etc. throughout the organization
Supports implicit comparison through reuse of content
Supports explicit comparison by mapping content between studies and to standard content
Retention of explicit relationships between data collection and the resulting data files
Early capture of a broad range of metadata at point of creation
37. Value - continued Interoperability
Flexibility in data storage
Reuse of element structures
Strong data typing
Improved data mining between and across systems
Improved access to detailed metadata
38. DDI User Base Archives and data libraries worldwide
Catalogs
Data delivery
Documentation delivery from data systems
Research Institutes/Services Data Centers
Documentation for data
Data search and analysis systems
Data management systems
International Organizations and National Statistical Agencies
Data collection and management
39. Archives and Data Libraries (examples) Catalogs
ICPSR Data Catalog and Social Science Variable Database
CESSDA Data Portal
The Dataverse Network (former Virtual Data Collection)
Data delivery
California Digital Library “Counting California”
National Historical Geographic Information System
Documentation delivery
Survey Documentation and Analysis (SDA)
Data Liberation Initiative Metadata Collection
40. Research Institutes/Service Data Centers (examples) Documentation for data
German Microcensus (GESIS)
Institute for the Study of Labor (IZA)
US General Social Survey (NORC)
Data search and analysis systems
Nesstar
Canadian Research Data Centres (RDC’s)
Data management systems
Questionnaire Development Documentation System (University of Konstanz/GESIS)
41. Current DDI Products at ICPSR Most existing products currently in DDI 2.1 with new additions moving to DDI 3
DDI-XML variable-level codebooks output as PDF files for downloading by users
DDI-XML metadata records created initially by data depositors and edited by ICPSR staff to augment content and include additional fields
Increasing use of DDI for special projects: Social Science Variables Database, various harmonization and data processing tasks
42. Potential Use of DDI 3 at ICPSR Information collected from data producers in pre-collection phase – Concept
Metadata output from CAI applications – Data Collection
Processor’s dashboard – Metadata Processing
Metadata mining: New faceted search tool to facilitate discovery through more precise searching – Data Discovery
Relational database for comparison and harmonization across studies – Repurposing
43. Potential Use of DDI 3 at ICPSR - 2 Use of DDI in combination with other metadata standards, e.g., Dublin Core, MARC, PREMIS
Beginning to bring FEDORA “object-centered” implementation concepts into data processing and data preservation strategies
Processor’s dashboard – Data Processing
Relationships of study object to file object
DDI 3 as “wrapper” for all ICPSR metadata?
46. SSVD – The Public Search First batch of variable-level description files uploaded into SSVD:
Approx. 3,500 DDI files (one file per dataset), representing
Approx. 1,300 ICPSR studies (approx. 18.5 percent of total ICPSR holdings, excluding US Census; approx. 30 percent of holdings with data and setups)
Over 1,000,000 individual variable descriptions; 23,000,000 categories
47. SSVD – The Public Search New database finalized Fall 2008
Built to match DDI 3.0 data model
Both DDI 2.x and DDI 3.0 compliant
Designed to accept both DDI 2.x and 3.0 input and produce output in both versions
ICPSR version currently uploads DDI 2.1 and generates DDI 3.0 individual variable descriptions.
DDI 3 AS EXPORT FORMAT
48. SSVD – The Public Search Moving forward… Transition to automated DDI upload
DDI uploaded at the time of study publication
First quality check performed by study processing staff
Acceptable DDI immediately released for public view
Problematic DDI suppressed from public view for further review, and upgrade as appropriate
51. IPUMS at MPC Did not use DDI because DDI 2 cannot handle translation tables
Currently in the process of mapping DDI 3 codebook output from IPUMS database
Importing DDI 2 files from Microdata Toolkit into processing, validation, and harmonization system
52. NHGIS Contains historical aggregate data from population, housing, agricultural, and economic censuses as well as BEA data from 1790 to 2000
Runs from DDI 2 nCube descriptions
Searches variables, identifies related nCube tables, determines geographic availability
Generates data subsets with geographic links to objects in NHGIS shape files, and shape files
53. Future Plans Funding has been obtained to improve search and extraction system
Current limitations of the system reflect limitations of DDI 2
Moving to DDI 3 will support:
broader cross survey searching
identification of common dimensions between NCubes over time
support harmonization instructions as well as common transformations such as calculation of medians
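Calculating a median from aggregate data usually means estimating it from counts in value ranges. The sketch below uses the common linear-interpolation estimate for grouped data; it is an assumption about the kind of transformation meant, not the NHGIS method, and the bins and counts are made up.

```python
def grouped_median(bins):
    """Estimate the median from (lower, upper, count) bins sorted by lower bound.

    Uses linear interpolation within the bin that contains the midpoint
    of the cumulative distribution.
    """
    total = sum(count for _, _, count in bins)
    half = total / 2
    cumulative = 0
    for lower, upper, count in bins:
        if count and cumulative + count >= half:
            return lower + (half - cumulative) / count * (upper - lower)
        cumulative += count
    raise ValueError("no observations")


# Made-up age distribution: 5 people aged 0-10, 10 aged 10-20, 5 aged 20-30.
print(grouped_median([(0, 10, 5), (10, 20, 10), (20, 30, 5)]))
```

With the transformation described in metadata, the same routine could be applied to any table whose dimension carries ordered ranges.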
54. International Organizations and National Statistical Agencies International Household Survey Network (IHSN)
Major international organizations involved
Coordination of activities
Adopted DDI 1/2.x as standard
Developed the Microdata Management Toolkit and related tools / guidelines
http://www.surveynetwork.org
Accelerated Data Program (ADP)
World Bank / Paris 21
Implement IHSN activities in developing countries
Task 1. Documentation and dissemination of existing survey microdata.
Has introduced DDI in national statistical agencies in over 50 countries
http://www.surveynetwork.org/adp
55. INDEPTH/DSS Example 38 Demographic Surveillance Sites in 19 countries spanning Africa, South Asia, Central America, and Oceania
Diverse yet similar health research portfolios
Data management goals:
Standardize and harmonize data collection tools
Cross-site comparability of information
Sharing data effectively and efficiently
56. Reasons for choosing DDI “It will be ideal to describe our data for the purposes of the Data Repository”
“It has really powerful features that will enable us to standardise several facets of our work.”
“I originally underestimated the usefulness DDI will have as a means to harmonised data collection between sites.”
Ability to expand comparison and harmonization with additional groups such as AIDS research team
57. Statistical Agencies BLS considering publication of category and coding standards supported by BLS such as NAICS, SOC etc
Statistics Canada considering publishing concept schemes in DDI 3 for use by the research community
DDI is becoming more widely used for survey and census collection in developing countries (primarily Africa)
58. MQDS Version 1 Extracted metadata from Blaise data model as XML tagged data
Provided user interface for selection of
Blaise files
Instrument questions and sections
Types of metadata to extract
Languages to display
Style sheet for generation of instrument documentation or codebook
59. Using MQDS V1 XML: Codebook in Five Languages
60. MQDS Version 1 Limitations
XML not DDI-compliant
DDI Version 2 did not have XML tags for all metadata provided by Blaise
Did not provide easy means of adding XML tags without becoming noncompliant
XML files for complex surveys can be very large (text files)
Entire files had to be processed in computer memory
Limited ability to fully automate documentation
61. DDI Version 3 Included extensions proposed by DDI working group on instrument design
62. MQDS Version 3 Joint SRC and ICPSR venture
Goals:
Address version 2 limitations
Process Blaise instrument of any size
Exploit new elements and validate to the recently released DDI version 3 standard
Move from processing XML metadata in memory to streaming metadata to a relational database
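The move from in-memory processing to streaming into a relational database can be sketched with the Python standard library. This is only an illustration of the general technique: the XML layout, table name, and `stream_questions` function are hypothetical and do not reflect the actual MQDS or DDI structures.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, simplified instrument export; not the MQDS or DDI layout.
SAMPLE_XML = b"""<instrument>
  <question name="Q1"><text>What is your age?</text></question>
  <question name="Q2"><text>Are you employed?</text></question>
</instrument>"""


def stream_questions(source, connection):
    """Read questions one at a time and store them in a database table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS question (name TEXT PRIMARY KEY, text TEXT)"
    )
    count = 0
    # iterparse yields each element as soon as its closing tag is read,
    # so the whole file never has to be parsed into memory first.
    for _event, element in ET.iterparse(source, events=("end",)):
        if element.tag == "question":
            connection.execute(
                "INSERT INTO question (name, text) VALUES (?, ?)",
                (element.get("name"), element.findtext("text")),
            )
            count += 1
            element.clear()  # drop the finished element's content
    connection.commit()
    return count


connection = sqlite3.connect(":memory:")
loaded = stream_questions(io.BytesIO(SAMPLE_XML), connection)
rows = connection.execute("SELECT name, text FROM question ORDER BY name").fetchall()
print(loaded, rows)
```

Once the metadata is in tables, import, export, and transformation steps can work on queries instead of on one large text file.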
63. MQDS Version 3 Relational Database: Import, Export, Transform
64. MQDS Version 3 Relational database
DDI compliant standardized tables
Flexibility for SRC and ICPSR to add extensions that meet their specific organizational needs
Allows
Automated documentation of any Blaise survey instrument
Importing and documenting data produced by other software
Lower cost development of other tools that facilitate editing and disseminating data
67. Colectica Feature Overview
68. Current Focus: Data Collection
69. Survey Design: Diagram Visually design survey instruments
Drag items from the toolbox
70. Survey Design: Item Editor Edit item details using friendly input forms
71. Multilingual Support All text fields can be represented in multiple languages
72. Concept Repository Use built-in or custom concept banks to describe survey items
Useful for comparability
73. Question Repository Share questions across studies
Drag previously-used questions or sequences onto new instruments
74. Import Existing CAI Code Import from:
Blaise®
CASES
CSPro
Support for additional languages can be added
75. Generate CAI Source Code Currently support CASES
Blaise® and CSPro coming soon
Support for additional CAI systems can be added
76. Generate Publishable Documentation Generate codebooks and diagrams
Output to HTML and PDF
77. Also: Study Concept & Design Basic support for Study Concept & Design documentation
78. Generate DDI 3.1
79. Additional Information Beta available now
Web: http://www.colectica.com/
Email: contact@algenta.com
80. Thank you DDI Alliance
http://www.ddialliance.org
Wendy Thomas
wlt@pop.umn.edu