This research paper explores the representation and accessibility of research data in economics and scientific practices. It discusses the importance of data repeatability and transparency, as well as the challenges and benefits of data sharing. The paper also highlights the needs and preferences of researchers in managing and accessing economic research data.
Statistical Research Data on the Semantic Web
Daniel Bahls, Leibniz Information Centre for Economics (ZBW)
SWIB 2012, Cologne, Germany
Outline • Introduction • Research data in economics and scientific practices • Thoughts on data representation • Repeatability of research results • Outlook • Data access and retrieval • Proxies and empirical models
MaWiFo Project Management of Economic Research Data
„What researchers want“ • Tools and services must be in tune with researchers’ workflows, which are often discipline-specific • They must be easy to use • “Cafeteria model”: researchers can pick and choose from a set of tools and services • Benefits must be clearly visible – not in three years’ time, but now Source: Feijen (2011)
Research Data as Bibliographic Artefacts • Re-use: Data sharing gives more opportunities for research • Citation: Data acquisition and assignment of Persistent Identifiers • Transparency/Reproducibility: Fundamental criteria for good scientific practice
Research data in economics and scientific practices • Target Group: Researchers in Economics • Community Building for Knowledge Exchange: Economists – Data Librarians – Computer Scientists • Interviews on: Data Management, Sharing, Sources, Publishing, Processing
Interviews with Researchers in Economics • Sources: data agencies, statistical offices, own surveys & studies, trusted institutes and researchers • Data Management: local file system, backup server, DVD, external HD, ... • Processing: SPSS, Stata, Matlab, programming languages, ...; execution times of seconds, minutes, hours; high-performance computing • Sharing: on request (?), with trusted colleagues, within teams • Publishing: zip files, practiced sometimes; not included in review process
Particular Findings • Research is driven by the availability of data (to some extent) • Some research is based on external data, some on self-conducted studies • Combining and merging of data sets: on average, an estimated 66% of the data comes from external sources
Particular Findings • Data usage rights – e.g. Thomson Reuters Datastream • Data protection: on-site access, virtual access; sample data to understand the structure; analysis scripts; aggregation • Is protection maintained? What about copies passed to third parties?
Thoughts on Data Representation • Often, the legal situation does not allow publishing the entire data set as it was used • At stake: transparency, re-use, data review, repeatability, curation
Interim Conclusion • A model based on copying is insufficient • We suggest fine-grained referencing • single data items must be referenceable (merging, curation) • highly distributable (distributed data sources) • extensible (heterogeneous long tail data, curation) • LOD-based approach
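The fine-grained referencing idea can be sketched in code. The following is a minimal illustration (not from the talk, and the URI scheme and `example.org` namespace are hypothetical): it mints a deterministic URI for a single data item from the dataset identifier and the item’s dimension values, so the item can be cited without exposing the observed value.

```python
import hashlib

# Sketch (hypothetical URI scheme, not from the talk): mint a stable,
# deterministic URI for a single data item from the dataset id and the
# item's dimension values. The URI never depends on the observed value,
# so a protected value can be withheld while the item stays citable.

def item_uri(dataset: str, dimensions: dict) -> str:
    # Canonical ordering makes the URI independent of dict insertion order.
    key = "|".join(f"{k}={v}" for k, v in sorted(dimensions.items()))
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return f"http://example.org/data/{dataset}/item/{digest}"

uri = item_uri("statswales-003311",
               {"region": "Cardiff", "gender": "Female", "time": "2005-7"})
```

Because the URI is a pure function of dataset id and dimension values, any party holding the same metadata can re-derive it, which is what makes the references highly distributable.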
Reference model (diagram): a DataSet of type “external dataset” contains Data Items; a UserDataSet includes data from it (includesData) together with Data Items from the researcher’s own survey.
RDF Representation for Statistical Data • Vocabulary: RDF Data Cube • Source used for our example – StatsWales: Life Expectancy, Dataset 003311
Model (diagram): DataSet – Dimension – Item – DimValue – dataProperty. Example: item X has rdf:value 83.7 and dimension values time = 2005-7, region = Cardiff, gender = Female.
RDF Representation for Statistical Data • Using the semantic model, data can be referenced at a very detailed level – without the data itself needing to be public • Single information items, such as the value itself, can be omitted (marked as protected), yet the data remains referenceable • Challenge: stable URIs are required for every single data item
RDF Data Cube Vocabulary (QB) source:http://publishing-statistical-data.googlecode.com/svn/trunk/specs/src/main/html/cube.html
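As a rough sketch of how one observation from the StatsWales example could be serialized, the helper below emits a single qb:Observation in Turtle. `qb:dataSet` and `sdmx-measure:obsValue` are real Data Cube / SDMX-RDF terms; the `ex:` dimension properties and all URIs are hypothetical, and prefix declarations are omitted for brevity. Passing `value=None` omits the protected value while keeping the observation referenceable, matching the slide above.

```python
# Sketch: serializing one statistical observation as a qb:Observation.
# qb:dataSet and sdmx-measure:obsValue are real RDF Data Cube / SDMX-RDF
# terms; the "ex:" dimension properties and all URIs are hypothetical.

def observation_turtle(obs_uri, dataset_uri, dims, value=None):
    """Emit Turtle for a single observation; value=None omits the
    (possibly access-restricted) measured value, keeping the
    observation itself referenceable."""
    lines = [f"<{obs_uri}> a qb:Observation ;",
             f"    qb:dataSet <{dataset_uri}> ;"]
    for prop, val in sorted(dims.items()):
        lines.append(f'    {prop} "{val}" ;')
    if value is not None:
        lines.append(f"    sdmx-measure:obsValue {value} ;")
    lines[-1] = lines[-1][:-2] + " ."  # close the final statement
    return "\n".join(lines)

# The StatsWales life-expectancy example from the slides:
public = observation_turtle(
    "http://example.org/obs/x",
    "http://example.org/dataset/statswales-003311",
    {"ex:refPeriod": "2005-7", "ex:region": "Cardiff", "ex:gender": "Female"},
    value=83.7)
protected = observation_turtle(
    "http://example.org/obs/x",
    "http://example.org/dataset/statswales-003311",
    {"ex:refPeriod": "2005-7", "ex:region": "Cardiff", "ex:gender": "Female"})
```

In the `protected` variant the observation keeps its URI and dimension statements, so it can still be merged, curated, and cited.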
Repeatability of research results • Aggregation and data cleaning raise questions: missing values, seasonal adjustment, purchasing power adjustment, plausibility tests, basket analyses, ... • Interesting read: McCullough, B. D. (2007). Got Replicability? The Journal of Money, Credit and Banking Archive. Econ Journal Watch, 4, 326-337
Repeatability of research results • scripts (“do-files”) • working copies of data • change parameters so that the effect can be shown clearly • no overall build process
A build script for empirical analyses • Maven-like, ANT-like
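A Maven/Ant-style build for an empirical analysis could, in the simplest case, look like the sketch below (step names and the pipeline are hypothetical, not from the talk): each step declares the steps it depends on, and a tiny runner executes them in dependency order exactly once, giving the “overall build process” the interviews found missing.

```python
# Minimal Make/Maven-style runner for an empirical analysis (sketch).
# Each step maps to (dependencies, action); the runner resolves the
# dependency chain recursively and runs every step exactly once.

def run(steps, target, done=None):
    """Run `target` and its transitive dependencies in order."""
    done = set() if done is None else done
    if target in done:
        return
    deps, action = steps[target]
    for dep in deps:
        run(steps, dep, done)
    action()
    done.add(target)

log = []
steps = {
    "fetch":   ([],         lambda: log.append("fetch raw data")),
    "clean":   (["fetch"],  lambda: log.append("impute missing values")),
    "adjust":  (["clean"],  lambda: log.append("seasonal adjustment")),
    "analyse": (["adjust"], lambda: log.append("run regression")),
}
run(steps, "analyse")
```

In a real deployment the actions would invoke the researcher’s do-files or scripts, and the declared dependencies would reference the external data sets of the reference model.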
Reference model with build script (diagram): a UserDataSet includes Data Items from external DataSets and from the researcher’s own survey, plus a buildScript – no gaps from data sources to results, which creates trust and an incentive to share.
Communication & Architecture (diagram): a client resolves a DOI via the digital library to the reference model, then authenticates against the archives (A–D) and requests the data.
Open Challenges (practical) • Researchers in economics would love to re-use data from others, yet hesitate to share their own • Competitive advantage: “We put too much effort into data production, so we want to be the ones to publish on it.” / “The code discloses too much of our know-how.” • Incentives needed: data citation; trust in research results (no gaps from data sources to results)
Open Challenges (technical) • Precise referencing: a unique URI for every data item / table cell? What about curation and data versioning? • Maven-like build scripts: how to specify entire system environments and software modules? • Vocabulary extensions: specific data needs specific description – where do the necessary rdf:Properties come from?
Summing up • Reference model for exact reconstruction of research data sets • Build scripts and dependency management for repeatability • Transparency of data sources and processes • “Executable paper”: learning from others, data reviews, ... • Rerun analyses – with curated values, with the latest data