This presentation discusses the importance of data citation in social sciences, including the need for data availability and verification, the role of metadata, concerns about confidentiality, and the growing movement to require data behind findings to be publicly available. It also explores the challenges of versioning, granularity, and replication, and the efforts to create durable linkages between data and publications.
Data Citation for the Social Sciences • Mary Vardigan, ICPSR • CODATA Conference on Data Attribution and Citation • August 22-23, 2011
Today’s Presentation • Norms in the social sciences and implications for data citation • Summary of major citation issues for social science
Knowledge claims • Social science advances through knowledge claims published in the literature • Need to verify and extend claims; secondary analysis encouraged • It follows that data need to be available for reuse and cited
Data sharing • Strong tradition of data sharing, both formal and informal • Active social science data archives around the world • Some PIs distribute data on Web sites • Pienta, Alter, and Lyle found 88.5% of data generated not publicly archived (since 1985)
Metadata • Metadata play important role – Documentation necessary to understand the data • Questionnaires, user guides, methodology descriptions, record layouts also provided • Heterogeneous in format – most unstructured • Data Documentation Initiative (DDI) seeks to provide a structured metadata standard
Granularity and versioning • “Studies” may be single datasets or aggregations • Also a need to cite data subsets that support the findings in publications • Data are sometimes updated and need to be versioned
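One way to make a cited subset and its version checkable is to record a fingerprint of exactly the variables used. The sketch below is only an illustration under stated assumptions: the function name `subset_fingerprint`, the toy records, and the choice of SHA-256 over a canonical JSON encoding are all hypothetical and do not represent any archive's actual practice.

```python
import hashlib
import json


def subset_fingerprint(rows, variables):
    """Return a SHA-256 hex digest of the chosen variables across all rows.

    Hypothetical helper: the canonical form (sorted variable names,
    compact JSON) is an assumption made for this illustration.
    """
    ordered = sorted(variables)
    # Keep only the cited variables, in a fixed order, so the digest is reproducible.
    canonical = [[row[name] for name in ordered] for row in rows]
    payload = json.dumps(canonical, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Toy data: version 2 corrects one income value and leaves age untouched.
rows_v1 = [{"id": 1, "age": 34, "income": 52000},
           {"id": 2, "age": 51, "income": 61000}]
rows_v2 = [{"id": 1, "age": 34, "income": 52000},
           {"id": 2, "age": 51, "income": 61500}]

fp_age_v1 = subset_fingerprint(rows_v1, ["id", "age"])
fp_age_v2 = subset_fingerprint(rows_v2, ["id", "age"])
fp_income_v1 = subset_fingerprint(rows_v1, ["id", "income"])
fp_income_v2 = subset_fingerprint(rows_v2, ["id", "income"])

print("age subset unchanged across versions:", fp_age_v1 == fp_age_v2)
print("income subset changed across versions:", fp_income_v1 != fp_income_v2)
```

A finding that relies only on the age variables would cite the same fingerprint under either version, while one that relies on income would need to name the version it used.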
Content and formats • Mostly quantitative data and some qualitative • Boundaries blurring between social science and other domains • Survey data supplemented by biomarker data • Survey data merged with administrative records • Trend toward complex collections • Social media data • Video, audio data
Confidentiality concerns • Survey respondents promised anonymity, a critical pledge to uphold • Legal agreements required for restricted data use • New mechanisms to analyze restricted data online emerging – virtual enclaves and virtual datasets • Often a public-use version and restricted versions coexist
Replication • Most claims cannot be replicated from the information in publications • Replication archives – ICPSR, Dataverse, etc. • What is required is a chain of evidence and a record of decisions – deep citation and provenance • Need both production transparency (record of decisions in transforming data) and analytic transparency (how conclusions are drawn)
Some tradition of citation • Citation standard for machine-readable files created in 1979 • Citations available from data providers – Census Bureau and ICPSR since late 1980s • Journals just beginning to cite data • Persistent identifiers: DOIs or handles
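As a rough illustration of how a data citation might combine a version with a persistent identifier, here is a minimal sketch. The field names, the element order, and the sample values (including the placeholder DOI) are assumptions made for this example, not a standard prescribed by ICPSR, the Census Bureau, or any journal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCitation:
    """Hypothetical container for the elements of a data citation."""
    authors: str
    year: int
    title: str
    version: str
    distributor: str
    identifier: str  # persistent identifier such as a DOI or handle

    def format(self) -> str:
        # The element order is an assumption; real style guides differ.
        return (f"{self.authors}. {self.year}. {self.title}. {self.version}. "
                f"{self.distributor} [distributor]. {self.identifier}")


# All values below are invented placeholders, not a real dataset.
citation = DataCitation(
    authors="Doe, Jane",
    year=2011,
    title="Example Household Survey",
    version="Version 2",
    distributor="Example Data Archive",
    identifier="doi:10.1234/EXAMPLE.V2",
)
print(citation.format())
```

Keeping the version and identifier as separate fields means an updated release can receive a new citation without retyping the rest.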
Journal practices • Historically little effort to standardize or verify data references in publications • Growing movement to require data behind findings to be publicly available • AER: Will publish only if “data used in the analysis are clearly and precisely documented and readily available for replication.”
Influencing journals • Data-PASS campaign to influence journals sponsored by professional associations • Wrote to major professional associations demonstrating inconsistencies in citing data • Success with American Sociological Review, which changed submission criteria
Linking data and publications • ICPSR has done this since the beginning in 1962 • Now a Bibliography of 60K citations to publications with two-way linking to data • Vendors like Thomson Reuters now interested in these linkages
Summary – Citation issues for social science • Versioning – Data can be dynamic • Unit/Granularity – What is optimal? • Importance of metadata – How to create durable link? • Replication – Cite subsets and replication/workflow files containing scripts?
Thank you… • Mary Vardigan • vardigan@umich.edu