
Statistical Research Data on the Semantic Web

This presentation explores the representation and accessibility of research data in economics and its role in good scientific practice. It discusses the importance of repeatability and transparency of research results, as well as the challenges and benefits of data sharing, and highlights the needs and preferences of researchers in managing and accessing economic research data.


Presentation Transcript


  1. Statistical Research Data on the Semantic Web Daniel Bahls, Leibniz Information Centre for Economics (ZBW), SWIB 2012, Cologne, Germany

  2. Outline • Introduction • Research data in economics and scientific practices • Thoughts on data representation • Repeatability of research results • Outlook: data access and retrieval; proxies and empirical models

  3. MaWiFo Project: Management of Economic Research Data

  4. “What researchers want” • Tools and services must be in tune with researchers’ workflows, which are often discipline-specific • They must be easy to use • “Cafeteria model”: researchers can pick and choose from a set of tools and services • Benefits must be clearly visible – not in three years’ time, but now • Source: Feijen (2011)

  5. Research Data as Bibliographic Artefacts • Re-use: data sharing gives more opportunities for research • Citation: data acquisition and assignment of Persistent Identifiers • Transparency and Reproducibility: fundamental criteria for good scientific practice

  6. Research data in economics and scientific practices • Target Group: Researchers in Economics • Community Building for Knowledge Exchange: Economists – Data Librarians – Computer Scientists • Interviews on data management, sharing, sources, publishing, and processing

  7. What does Research Data look like in Economics?

  8. Interviews with Researchers in Economics (overview diagram) • Sources: data agencies, statistical offices, trusted institutes and researchers, own surveys & studies • Data Management: local file system, backup server, DVD, external HD, ... • Processing: SPSS, Stata, Matlab, programming languages, high-performance computing; execution times of seconds, minutes, or hours • Sharing: on request (?), with trusted colleagues, within teams; practiced sometimes • Publishing: zip files; data not included in the review process

  9. Particular Findings • Research is driven by the availability of data (to some extent) • Some research is based on external data, some on self-conducted studies • Combining and merging of data sets: on average, an estimated 66% of the data comes from external sources

  10. Particular Findings • Data Usage Rights – e.g. Thomson Reuters Datastream • Data Protection: on-site access, virtual access; sample data to understand the structure; analysis scripts; aggregation; is protection maintained? copy to third party?

  11. Thoughts on Data Representation • Often, the legal situation does not allow publishing the entire data set as it was used • How, then, to support transparency, re-use, data review, repeatability, and curation?

  12. Interim Conclusion • A model based on copying is insufficient • We suggest fine-grained referencing • single data items must be referenceable (merging, curation) • highly distributable (distributed data sources) • extensible (heterogeneous long tail data, curation) • LOD-based approach

  13. DataSet model (diagram): a UserDataSet includesData Data Items of an external DataSet (type: external dataset) as well as Data Items from the researcher’s own survey (see the sketch below).
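The includesData relation from the diagram can be made concrete with a few RDF triples. The following is a minimal sketch in Python with rdflib; the namespaces, item URIs and the maw:UserDataSet / maw:includesData terms are illustrative assumptions, not the project's actual vocabulary:

from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF

MAW = Namespace("http://example.org/mawifo#")          # hypothetical project vocabulary
EXT = Namespace("http://stats.example.org/item/")      # items of an external data set
OWN = Namespace("http://example.org/my-survey/item/")  # items from the researcher's own survey

g = Graph()
my_dataset = URIRef("http://example.org/my-study/dataset-1")

g.add((my_dataset, RDF.type, MAW.UserDataSet))
# Reference single data items of an external data set instead of copying them ...
g.add((my_dataset, MAW.includesData, EXT["003311/obs-42"]))
g.add((my_dataset, MAW.includesData, EXT["003311/obs-43"]))
# ... and combine them with data items from the researcher's own survey.
g.add((my_dataset, MAW.includesData, OWN["respondent-17/q3"]))

print(g.serialize(format="turtle"))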

  14. RDF Representation for Statistical Data • Vocabulary used for our example: the Data Cube vocabulary • Source: StatsWales, Life Expectancy, Dataset 003311

  15. DataSet / Dimension / Item / DimValue / dataProperty model (diagram) • Example observation: time = 2005-7, region = Cardiff, gender = Female, rdf:value = 83.7

  16. RDF Representation for Statistical Data • Using the semantic model, data can be referenced at a very detailed level, without the data itself having to be public • Single information items, such as the value itself (here the protected 83.7), can be omitted, yet the data is still referenceable • Challenge: stable URIs are required for every single data item (see the sketch below)
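As an illustration of slides 15 and 16, the following Python/rdflib sketch builds the example observation with the RDF Data Cube vocabulary and then publishes it without the protected value. The StatsWales-style URIs and the property names (time, region, gender) are invented for the example; rdf:value is used for the measure as shown on the slide, although Data Cube deployments commonly use sdmx-measure:obsValue:

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

QB = Namespace("http://purl.org/linked-data/cube#")
EX = Namespace("http://statswales.example.org/life-expectancy/")  # hypothetical base URI

g = Graph()
ds = EX["dataset-003311"]
obs = EX["dataset-003311/obs/2005-7/Cardiff/Female"]  # a stable URI per data item

g.add((ds, RDF.type, QB.DataSet))
g.add((obs, RDF.type, QB.Observation))
g.add((obs, QB.dataSet, ds))

# The dimension values identify the cell ...
g.add((obs, EX.time, Literal("2005-7")))
g.add((obs, EX.region, Literal("Cardiff")))
g.add((obs, EX.gender, Literal("Female")))
# ... and the measure carries the observed value.
g.add((obs, RDF.value, Literal(83.7)))

# Data protection: publish the graph without the protected value. The observation
# URI and its dimension triples remain, so the data item is still referenceable.
public = Graph()
for s, p, o in g:
    if p != RDF.value:
        public.add((s, p, o))

print(public.serialize(format="turtle"))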

  17. SCOVO (Statistical Core Vocabulary)

  18. RDF Data Cube Vocabulary (QB) • Source: http://publishing-statistical-data.googlecode.com/svn/trunk/specs/src/main/html/cube.html
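To give a flavour of the vocabulary referenced above, here is a small Python/rdflib sketch of a qb:DataStructureDefinition declaring the three dimensions and the one measure of the life-expectancy example; the URIs under example.org and the property names are placeholders:

from rdflib import BNode, Graph, Namespace
from rdflib.namespace import RDF

QB = Namespace("http://purl.org/linked-data/cube#")
EX = Namespace("http://example.org/life-expectancy#")  # placeholder namespace

g = Graph()
dsd = EX.dsd
g.add((dsd, RDF.type, QB.DataStructureDefinition))

# One qb:component (a component specification) per dimension ...
for dimension in (EX.time, EX.region, EX.gender):
    component = BNode()
    g.add((dsd, QB.component, component))
    g.add((component, QB.dimension, dimension))

# ... and one component carrying the measure.
measure_component = BNode()
g.add((dsd, QB.component, measure_component))
g.add((measure_component, QB.measure, EX.lifeExpectancy))

print(g.serialize(format="turtle"))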

  19. Repeatability of research results • Aggregation and data cleaning: missing values, seasonal adjustment, purchasing power adjustment, plausibility tests, basket analyses, ... • Interesting read: McCullough, B. D. (2007). Got Replicability? The Journal of Money, Credit and Banking Archive. Econ Journal Watch, 4, 326-337

  20. Repeatability of research results • scripts (“do-files”) • working copies of data • parameters are changed so that the effect can be shown clearly • no overall build process

  21. A build script for empirical analyses • Maven-like, ANT-like
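The slides name Maven and ANT only as analogies and do not prescribe a format, so the following is merely one possible sketch of such a build script in Python: it declares data dependencies by identifier, fetches them, and runs the analysis steps in a fixed order. The file names, the download URL and the Stata invocation are hypothetical:

import pathlib
import subprocess
import urllib.request

DATA_DEPENDENCIES = {
    # local target file : identifier / URL of the source data set (hypothetical)
    "raw/lifeexp.csv": "http://statswales.example.org/download/003311.csv",
}

ANALYSIS_STEPS = [
    ["stata", "-b", "do", "clean_data.do"],    # missing values, adjustments, ...
    ["stata", "-b", "do", "run_analysis.do"],  # the actual estimation
]

def resolve_dependencies() -> None:
    """Fetch every declared data dependency, analogous to Maven dependency resolution."""
    for target, source in DATA_DEPENDENCIES.items():
        pathlib.Path(target).parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(source, target)

def build() -> None:
    """Run all analysis steps in order and abort on the first failure."""
    for step in ANALYSIS_STEPS:
        subprocess.run(step, check=True)

if __name__ == "__main__":
    resolve_dependencies()
    build()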

  22. DataSet model with a buildScript (diagram): as on slide 13, a UserDataSet includesData Data Items of an external DataSet and Data Items from the researcher’s own survey, now complemented by a buildScript; annotated with “No gaps”, “Trust”, “Incentive”.

  23. Communication & Architecture (diagram): a Client, a Digital Library providing a Reference Model via DOI, and Archives A–D; the client authenticates against the archives and requests data from them (see the sketch below).
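One way the diagram's communication pattern could look in code is sketched below, assuming the reference model is a list of data-item URIs grouped by the archive that holds them. The endpoints, the bearer-token authentication and the JSON layout are invented for illustration; the slide itself only names the components:

import json
import urllib.request

def fetch_reference_model(doi: str) -> dict:
    """Resolve a DOI at the digital library to a reference model (hypothetical endpoint)."""
    with urllib.request.urlopen(f"https://library.example.org/reference-model/{doi}") as resp:
        return json.load(resp)

def request_items(archive_url: str, item_uris: list[str], token: str) -> bytes:
    """Authenticate against one archive and request the referenced data items."""
    req = urllib.request.Request(
        f"{archive_url}/items",
        data=json.dumps({"items": item_uris}).encode(),
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def reconstruct_dataset(doi: str, token: str) -> dict:
    """Rebuild a user data set by pulling every referenced item from the archive that holds it."""
    model = fetch_reference_model(doi)  # e.g. {"items_by_archive": {"https://archive-a.example.org": [...]}}
    return {
        archive: request_items(archive, uris, token)
        for archive, uris in model["items_by_archive"].items()
    }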

  24. Open Challenges (practical) • Researchers in economics would love to re-use data from others • Researchers in economics hesitate to share their data • Competitive advantage: “We put too much effort into data production, so we want to be the ones to publish on it.” / “The code discloses too much of our know-how.” • Incentives needed: data citation; trust in research results (no gaps from data sources to results)

  25. Open Challenges (technical) • Precise referencing: a unique URI for every data item / table cell? How about curation and data versioning? • Maven-like build scripts: how to specify entire system environments and software modules? • Vocabulary extensions: specific data needs specific description; where do the necessary rdf:Properties come from? (One possible URI scheme is sketched below.)
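As one possible (not the project's) answer to the first challenge, a cell URI could be minted deterministically from the data-set identifier and the sorted dimension values, with a version segment added on curation so that old references keep pointing at the value they were made against. The URI scheme below is purely hypothetical:

from urllib.parse import quote

BASE = "http://data.example.org/"  # hypothetical URI authority

def cell_uri(dataset_id: str, dimensions: dict[str, str], version: int = 1) -> str:
    """Mint a stable, versioned URI for a single data item (table cell)."""
    # Sort the dimensions so that the same cell always yields the same URI.
    dims = "/".join(f"{key}={quote(value)}" for key, value in sorted(dimensions.items()))
    return f"{BASE}{dataset_id}/{dims}/v{version}"

# The original observation ...
print(cell_uri("003311", {"time": "2005-7", "region": "Cardiff", "gender": "Female"}))
# ... and the same cell after a curation step: a new version, while the old URI stays valid.
print(cell_uri("003311", {"time": "2005-7", "region": "Cardiff", "gender": "Female"}, version=2))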

  26. Summing up • Reference model for exact reconstruction of research data sets • Build scripts and dependency management for repeatability • Transparency of data sources and processes • “Executable paper”, learning from others, data reviews, ... • Rerun the analysis with curated values or with the latest data

  27. Thank you
