The Data Documentation Initiative (DDI) Peter Granda Peter Joftis Inter-university Consortium for Political and Social Research CESSDA Expert Seminar, Tampere, Finland Sept, 1-2, 2000
What is the DDI? • The Data Documentation Initiative (DDI) is an international effort to develop a specification for the content and structure of the metadata describing empirical research in the social and behavioral sciences.
Who developed the DDI? There are contributors from university and government data producers, libraries, data archives, and researchers. They are stakeholders in the social science research process in North America and Europe. There was funding from the U.S. National Science Foundation but most of the effort was contributed by the participants and their institutions.
Why was the DDI Developed? • To replace existing documentation specifications: • Insufficient structure and meaning of the documentation. • many idiosyncratic formats (e.g., OSIRIS) • wanted to encourage standardization to enhance interchangeability • wanted a format that is resistant to corruption • To take advantage of new technologies to enhance the usefulness of data documentation
Issues during DDI Development • Diverse communities (librarians, archivists, producers, developers) had different emphases: document-centered versus data-centered issues. • Simplicity versus completeness. • Normative versus suggestive models.
What was the path taken? • Because of the initial focus on documentation, a markup language was chosen. • Standard Generalized Markup Language (SGML) was proposed. • Extensible Markup Language (XML) was eventually chosen.
Document-Centered versus Data-Centered Approaches • Document-centered when expecting: • Content is unstructured and unconstrained. • Not fixed in length, possibly quite large. • Data-centered when expecting: • Information is structured into fields. • The content of the fields is constrained. • Information is fixed length or the maximum length can be set.
Archival holdings are a Hybrid • ICPSR holds much of its metadata in a relational database. • Citation, bibliographic and abstract information was in a text-oriented database. • Documentation is in a variety of formats but is held as files on our server.
Why XML? • Designed for use on WWW which is key to discovery and dissemination. • Hardware & Software independent. • non-proprietary • interoperable across a wide range of sites • Extensible: allows writing of specialized vocabularies such as the DDI. • Software needs to understand XML but does not need modification to support tags relevant to social science data • Separates content from format, which simplifies multiple uses of same source document.
Why XML? (2) • Markup is plain text. • human readable • easier to preserve than non-text formats • XML specification is publicly published. • Markup makes it easier for software to locate information, creating tremendous possibilities. • XML documents are flexible and modular, an approach that could support possibilities we have not even anticipated.
Document Type Definitions (DTDs) • DTDs are part of the XML specification. • DTDs provide a set of rules for determining if a particular document is valid. • A DTD describes the structure and syntax of an XML document.
Document Type Definition (DTD) • Allows validation and limits the content: what elements are allowed, required, disallowed, defaulted. • DTD makes a specialized vocabulary such as DDI sharable by publishing and enforcing a set of rules.
What is wrong with DTDs? • DTDs have a simple content model. • They allow only sequences or lists of elements or IDs. • Elements can occur only 0, 1, or many times. • Character strings are the principal datatype; there are no dates, numbers, or the other datatypes found in most programming/scripting languages.
What is wrong with DTDs? (2) • There can be only one DTD per document and it is essentially static. • DTDs are not fully object oriented: they lack inheritance, which is the ability to describe new elements in terms of existing elements.
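The point about character strings being the principal datatype can be illustrated with a minimal Python sketch. It assumes the standard-library `xml.etree.ElementTree` parser and reuses the `location` element from the sample codebook markup in this presentation; the consistency check itself is only an illustration of work the application has to do, not part of the DDI.

```python
import xml.etree.ElementTree as ET

# Fragment taken from the sample markup in this presentation.
fragment = (
    '<location StartPos="49" EndPos="49" width="1" '
    'RecSegNo="1" source="producer" />'
)

loc = ET.fromstring(fragment)

# A DTD can only describe these attribute values as character data,
# so the parser hands them back as plain strings.
raw_start = loc.get("StartPos")

# Numeric conversion and sanity checks are left to the application.
start, end, width = (int(loc.get(k)) for k in ("StartPos", "EndPos", "width"))
consistent = (end - start + 1) == width

print(type(raw_start).__name__, start, end, width, consistent)
```

A schema language with numeric datatypes could move the integer check into validation; with a DTD it stays in application code like the above.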
Alternatives to DTDs Exist • HyTime Hub Document Architectures • XML Schema • Resource Description Framework (RDF)
HyTime Hub Documents • Hub documents would keep the DTD approach. • The hub document would be a collection of links rather than actual content • Hierarchical and other non-rectangular data collections would be modeled by the links.
XML Schema • Provides for a more detailed model of content. • Can specify number of occurrences. • Can use additional datatypes: string, numeric, date/time, etc. • Can have structures and user-defined datatypes.
Other advantages of XML Schemas • Schemas are dynamic and more than one can apply to a single document. An application could subset based on user input, for example the requester's country. • Schemas support the Object Model more closely than DTDs. • Tools exist for converting DTDs to Schemas.
Resource Description Framework • This is really Jostein’s area of expertise. • RDF depends on precise metadata vocabularies which can take considerable work to set up. • RDF’s strength is the way it provides information to intelligent software agents.
Where are we now? • Organized in May, 1995 • First DTD, April, 1996 • Revised DTD, Summer, 1998 • Beta testing, March-October, 1999 • Revised DTD, Winter, 1999 • Released Version 1, March 2000 • Plan minor revisions Winter, 2000 • Plans for version 2, 2001?
And Here is What it Looks Like • The Outline of the structure • Tag library explaining the elements • The Document Type Definition itself • Sample marked up codebooks
Current DDI Committee Agenda • Approval of minor revisions for version 1.1: errata, note fields and other non-invalidating changes. • Address structural shortcomings of version 1 leading to version 2 • Encourage the development of applications using DDI
Plans for Version 2 • Handling of the routing structure of CATI/CAPI instruments. • Add ways to handle aggregated, tabular and time series data. • Extend specification to handle more than rectangular and simple hierarchies. • Add the ability to mark up “families” of studies across time, geography or topic.
What can you do with DDI? • Transform an XML marked-up document to create: • HTML for web pages • Formatted text for printing or other display purposes • Syntax files for statistical packages • Importable text files for loading databases or library catalogs
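A minimal Python sketch of such a transformation, assuming the standard-library `xml.etree.ElementTree` parser and the `var` fragment from the sample markup in this presentation. The helper names `var_to_html` and `var_to_column_spec`, the HTML layout, and the "name start-end" column format are hypothetical choices for illustration; a real statistical package would need its own syntax wrapped around the column specification.

```python
import xml.etree.ElementTree as ET
from html import escape

# Variable description based on the sample markup in this presentation.
DDI_FRAGMENT = """
<var name="Q19" source="producer" wgt="not-wgt" intrvl="discrete">
  <location StartPos="49" EndPos="49" width="1" RecSegNo="1" source="producer" />
  <qstn ID="Q19" source="producer">
    <qstnLit source="producer">In general, is your opinion of the Democrat party favorable or not favorable?</qstnLit>
  </qstn>
</var>
"""


def var_to_html(var):
    """Render one variable as an HTML definition-list entry (hypothetical layout)."""
    name = var.get("name")
    text = var.findtext("qstn/qstnLit", default="").strip()
    return f"<dt>{escape(name)}</dt><dd>{escape(text)}</dd>"


def var_to_column_spec(var):
    """Render the column location as 'name start-end' (hypothetical format)."""
    loc = var.find("location")
    return f"{var.get('name')} {loc.get('StartPos')}-{loc.get('EndPos')}"


var = ET.fromstring(DDI_FRAGMENT)
html_row = var_to_html(var)
column_spec = var_to_column_spec(var)

print(html_row)
print(column_spec)
```

Both outputs come from the same source element, which is the practical meaning of separating content from format.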
What can you do with DDI? (pt. 2) • Enhance searches using tags to identify relevant information; search by topic across sources. • Create “families” of comparable collections across data sources related by geography, time, or subject. • Integrate heterogeneous, distributed data sources (one user interface).
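A tag-aware search across sources might look like the following Python sketch. It assumes the standard-library `xml.etree.ElementTree` parser; the two codebooks, the identifiers `study_a`, `study_b`, `V7`, `V8`, and the simplified `codeBook` nesting are made up for illustration, with only the `var`, `qstn`, and `qstnLit` elements taken from the sample markup in this presentation.

```python
import xml.etree.ElementTree as ET

# Two tiny, hypothetical codebooks standing in for separate data sources.
SOURCES = {
    "study_a": (
        '<codeBook><var name="Q19"><qstn><qstnLit>In general, is your opinion '
        "of the Democrat party favorable or not favorable?</qstnLit></qstn></var>"
        "</codeBook>"
    ),
    "study_b": (
        '<codeBook><var name="V7"><qstn><qstnLit>How many people live in this '
        "household?</qstnLit></qstn></var>"
        '<var name="V8"><qstn><qstnLit>Which party did you vote for?</qstnLit>'
        "</qstn></var></codeBook>"
    ),
}


def search_questions(sources, keyword):
    """Return (source id, variable name) pairs whose question text mentions keyword."""
    needle = keyword.lower()
    hits = []
    for source_id, xml_text in sources.items():
        root = ET.fromstring(xml_text)
        for var in root.iter("var"):
            # Only question text is searched, not notes or other elements.
            for lit in var.iter("qstnLit"):
                if needle in (lit.text or "").lower():
                    hits.append((source_id, var.get("name")))
    return hits


hits = search_questions(SOURCES, "party")
print(hits)
```

Because the search is restricted to `qstnLit`, a keyword that appears only in a note or a label would not produce a hit, which is the kind of precision plain full-text search cannot offer.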
Who is using DDI? (pt. 1) • Networked Social Science Tools and Resources, European Union (NESSTAR) (www.nesstar.org) • Survey Documentation and Analysis, Computer-assisted Survey Methods Program, UC Berkeley (SDA) (csa.berkeley.edu) • Federal Electronic Research and Review Extraction Tool (FERRET) (ferret.bls.census.gov)
Who is Using DDI? (pt. 2) • Virtual Data Center Project, Harvard-MIT Data Center (VDC) (thedata.org) • Sociology Workbench, EdCenter on Computational Science and Engineering, San Diego State University (SWB) (edcenter.sdsu.edu)
Who is Using DDI? (pt. 3) • GRETA U.S. Census (aggregate) data access system, University of Minnesota (http://www.socsci.umn.edu/PDAS/system_design.html) • USA Counties, University of California, Berkeley (http://ucdata.berkeley.edu/) • Statistical Data Collection, University of Virginia Library (http://quincy.lib.virginia.edu/cgi-local/dlbin/loader.pl?id=2717)
ICPSR Experience with DDI • Turn over to Peter Granda
Closing thoughts “Hardest thing about metadata is producing it.” • Data producers must see the value of mark-up in enhanced applications to justify cost. • Data producers will need tools that produce markup as part of the production process or cost will be too high. • Archives will find that enhancing producer-supplied mark-up is one of their value-added tasks.
Further Information • DDI Website http://www.icpsr.umich.edu/DDI/ • Electronic mail contacts: ddi@icpsr.umich.edu
Are there Questions or Comments?
Internal Linkage Mechanisms • The DDI DTD contains a method for arbitrarily linking elements. • There are attributes for linking to information on related publications or access conditions. • Links exist connecting summary data and methodology information across parts of the document.
Link Reference for Note
<var name="Q19" source="producer" wgt="not-wgt" intrvl="discrete">
  <location StartPos="49" EndPos="49" width="1" RecSegNo="1" source="producer" />
  <qstn ID="Q19" source="producer">
    <qstnLit source="producer">In general, is your opinion of the Democrat party favorable or not favorable?</qstnLit>
    <Link refs="NQ19" source="producer" />
  </qstn>
  <notes ID="NQ19" source="producer">Asked only on August 18</notes>
</var>
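How an application might follow such an internal link can be sketched in Python, assuming the standard-library `xml.etree.ElementTree` parser and the `var` example from these slides. The helper name `resolve_links` is hypothetical, and treating `refs` as a whitespace-separated list of IDs is an assumption, not something stated in the presentation.

```python
import xml.etree.ElementTree as ET

# The variable description from the "Link Reference for Note" slides.
VAR_XML = """
<var name="Q19" source="producer" wgt="not-wgt" intrvl="discrete">
  <location StartPos="49" EndPos="49" width="1" RecSegNo="1" source="producer" />
  <qstn ID="Q19" source="producer">
    <qstnLit source="producer">In general, is your opinion of the Democrat party favorable or not favorable?</qstnLit>
    <Link refs="NQ19" source="producer" />
  </qstn>
  <notes ID="NQ19" source="producer">Asked only on August 18</notes>
</var>
"""


def resolve_links(root):
    """Return (linking element ID, referenced ID, target element or None) triples."""
    # Index every element that carries an ID attribute.
    by_id = {el.get("ID"): el for el in root.iter() if el.get("ID")}
    resolved = []
    for parent in root.iter():
        for link in parent.findall("Link"):
            # Assumption: refs may hold several whitespace-separated IDs.
            for ref in link.get("refs", "").split():
                resolved.append((parent.get("ID"), ref, by_id.get(ref)))
    return resolved


links = resolve_links(ET.fromstring(VAR_XML))
for source_id, ref, target in links:
    note = target.text if target is not None else "(unresolved)"
    print(f"{source_id} -> {ref}: {note}")
```

An unresolved reference comes back with `None` as its target, so a checking tool could report dangling links instead of failing.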
External Linkage Mechanism • The DDI DTD contains an external linking mechanism permitting links from elements in the DTD to items outside the document using URIs. For example, • vocabURI attribute added to specify location for full controlled vocabulary. • URI attribute added to Location element to provide a URN or URL for the address from which the data may be downloaded. • URIs may also be used as citations/references to external materials (journal articles or chapters in articles or documents).