This research paper explores the representation and accessibility of research data in economics and scientific practices. It discusses the importance of data repeatability and transparency, as well as the challenges and benefits of data sharing. The paper also highlights the needs and preferences of researchers in managing and accessing economic research data.
Statistical Research Data on the Semantic Web
Daniel Bahls, Leibniz Information Centre for Economics (ZBW)
SWIB 2012, Cologne, Germany
Outline • Introduction • Research data in economics and scientific practices • Thoughts on data representation • Repeatability of research results • Outlook • Data access and retrieval • Proxies and empirical models
MaWiFo Project Management of Economic Research Data
„What researchers want“ • Tools and services must be in tune with researchers’ workflows, which are often discipline-specific • They must be easy to use • “Cafeteria model”: researchers can pick and choose from a set of tools and services • Benefits must be clearly visible – not in three years’ time, but now Source: Feijen (2011)
Research Data as Bibliographic Artefacts • Re-use: Data sharing gives more opportunities for research • Citation: Data acquisition and assignment of Persistent Identifiers • Transparency/Reproducibility: Fundamental criteria for good scientific practice
Research data in economics and scientific practices • Target Group: Researchers in Economics • Community Building for Knowledge Exchange: Economists – Data Librarians – Computer Scientists • Interviews on: Data Management, Sharing, Sources, Publishing, Processing
Interviews with Researchers in Economics • Sources: data agencies, statistical offices, own surveys & studies, trusted institutes and researchers • Data Management: local file system, backup server, DVD, external HD, ... • Processing: SPSS, Stata, Matlab, programming languages, ...; execution times of seconds, minutes, hours; high-performance computing • Sharing: on request (?), with trusted colleagues, within teams • Publishing: zip files, practiced sometimes; not included in review process
Particular Findings • Research is driven by the availability of data (to some extent) • Some research is based on external data, some on self-conducted studies • Combining and merging of data sets: on average, an estimated 66% of the data comes from external sources
Particular Findings • Data usage rights – e.g. Thomson Reuters Datastream • Data protection: on-site access, virtual access; sample data to understand the structure; analysis scripts; aggregation • Is protection maintained? What about copies passed to third parties?
Thoughts on Data Representation • Often, the legal situation does not allow publishing the entire data set as it was used • At stake: transparency, re-use, data review, repeatability, curation
Interim Conclusion • A model based on copying is insufficient • We suggest fine-grained referencing • single data items must be referenceable (merging, curation) • highly distributable (distributed data sources) • extensible (heterogeneous long tail data, curation) • LOD-based approach
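The fine-grained referencing idea can be sketched in code. The following is a minimal illustration (not from the talk, and the URI scheme and `example.org` namespace are hypothetical): it mints a deterministic URI for a single data item from the dataset identifier and the item’s dimension values, so the item can be cited without exposing the observed value.

```python
import hashlib

# Sketch (hypothetical URI scheme, not from the talk): mint a stable,
# deterministic URI for a single data item from the dataset id and the
# item's dimension values. The URI never depends on the observed value,
# so a protected value can be withheld while the item stays citable.

def item_uri(dataset: str, dimensions: dict) -> str:
    # Canonical ordering makes the URI independent of dict insertion order.
    key = "|".join(f"{k}={v}" for k, v in sorted(dimensions.items()))
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return f"http://example.org/data/{dataset}/item/{digest}"

uri = item_uri("statswales-003311",
               {"region": "Cardiff", "gender": "Female", "time": "2005-7"})
```

Because the URI is a pure function of dataset id and dimension values, any party holding the same metadata can re-derive it, which is what makes the references highly distributable.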
Reference model (diagram): a DataSet of type “external dataset” contains Data Items; a UserDataSet includes data from it (includesData) together with Data Items from the researcher’s own survey.
RDF Representation for Statistical Data • Vocabulary: RDF Data Cube • Source used for our example – StatsWales: Life Expectancy, Dataset 003311
Model (diagram): DataSet – Dimension – Item – DimValue – dataProperty. Example: item X has rdf:value 83.7 and dimension values time = 2005-7, region = Cardiff, gender = Female.
RDF Representation for Statistical Data • Using the semantic model, data can be referenced at a very detailed level – without the data itself needing to be public • Single information items, such as the value itself, can be omitted (marked as protected), yet the data remains referenceable • Challenge: stable URIs are required for every single data item
RDF Data Cube Vocabulary (QB) source:http://publishing-statistical-data.googlecode.com/svn/trunk/specs/src/main/html/cube.html
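As a rough sketch of how one observation from the StatsWales example could be serialized, the helper below emits a single qb:Observation in Turtle. `qb:dataSet` and `sdmx-measure:obsValue` are real Data Cube / SDMX-RDF terms; the `ex:` dimension properties and all URIs are hypothetical, and prefix declarations are omitted for brevity. Passing `value=None` omits the protected value while keeping the observation referenceable, matching the slide above.

```python
# Sketch: serializing one statistical observation as a qb:Observation.
# qb:dataSet and sdmx-measure:obsValue are real RDF Data Cube / SDMX-RDF
# terms; the "ex:" dimension properties and all URIs are hypothetical.

def observation_turtle(obs_uri, dataset_uri, dims, value=None):
    """Emit Turtle for a single observation; value=None omits the
    (possibly access-restricted) measured value, keeping the
    observation itself referenceable."""
    lines = [f"<{obs_uri}> a qb:Observation ;",
             f"    qb:dataSet <{dataset_uri}> ;"]
    for prop, val in sorted(dims.items()):
        lines.append(f'    {prop} "{val}" ;')
    if value is not None:
        lines.append(f"    sdmx-measure:obsValue {value} ;")
    lines[-1] = lines[-1][:-2] + " ."  # close the final statement
    return "\n".join(lines)

# The StatsWales life-expectancy example from the slides:
public = observation_turtle(
    "http://example.org/obs/x",
    "http://example.org/dataset/statswales-003311",
    {"ex:refPeriod": "2005-7", "ex:region": "Cardiff", "ex:gender": "Female"},
    value=83.7)
protected = observation_turtle(
    "http://example.org/obs/x",
    "http://example.org/dataset/statswales-003311",
    {"ex:refPeriod": "2005-7", "ex:region": "Cardiff", "ex:gender": "Female"})
```

In the `protected` variant the observation keeps its URI and dimension statements, so it can still be merged, curated, and cited.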
Repeatability of research results • Aggregation and data cleaning raise questions: missing values, seasonal adjustment, purchasing power adjustment, plausibility tests, basket analyses, ... • Interesting read: McCullough, B. D. (2007). Got Replicability? The Journal of Money, Credit and Banking Archive. Econ Journal Watch, 4, 326-337
Repeatability of research results • scripts (“do-files”) • working copies of data • change parameters so that the effect can be shown clearly • no overall build process
A build script for empirical analyses • Maven-like, ANT-like
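A Maven/Ant-style build for an empirical analysis could, in the simplest case, look like the sketch below (step names and the pipeline are hypothetical, not from the talk): each step declares the steps it depends on, and a tiny runner executes them in dependency order exactly once, giving the “overall build process” the interviews found missing.

```python
# Minimal Make/Maven-style runner for an empirical analysis (sketch).
# Each step maps to (dependencies, action); the runner resolves the
# dependency chain recursively and runs every step exactly once.

def run(steps, target, done=None):
    """Run `target` and its transitive dependencies in order."""
    done = set() if done is None else done
    if target in done:
        return
    deps, action = steps[target]
    for dep in deps:
        run(steps, dep, done)
    action()
    done.add(target)

log = []
steps = {
    "fetch":   ([],         lambda: log.append("fetch raw data")),
    "clean":   (["fetch"],  lambda: log.append("impute missing values")),
    "adjust":  (["clean"],  lambda: log.append("seasonal adjustment")),
    "analyse": (["adjust"], lambda: log.append("run regression")),
}
run(steps, "analyse")
```

In a real deployment the actions would invoke the researcher’s do-files or scripts, and the declared dependencies would reference the external data sets of the reference model.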
Reference model with build script (diagram): a UserDataSet includes Data Items from external DataSets and from the researcher’s own survey, plus a buildScript – no gaps from data sources to results, which creates trust and an incentive to share.
Communication & Architecture (diagram): a client resolves a DOI via the digital library to the reference model, then authenticates against the archives (A–D) and requests the data.
Open Challenges (practical) • Researchers in economics would love to re-use data from others, yet hesitate to share their own • Competitive advantage: “We put too much effort into data production, so we want to be the ones to publish on it.” / “The code discloses too much of our know-how.” • Incentives needed: data citation; trust in research results (no gaps from data sources to results)
Open Challenges (technical) • Precise referencing: a unique URI for every data item / table cell? What about curation and data versioning? • Maven-like build scripts: how to specify entire system environments and software modules? • Vocabulary extensions: specific data needs specific description – where do the necessary rdf:Properties come from?
Summing up • Reference model for exact reconstruction of research data sets • Build scripts and dependency management for repeatability • Transparency of data sources and processes • “Executable paper”: learning from others, data reviews, ... • Rerun analyses – with curated values, with the latest data