
Data Science: Acquisition, Curation, and Metadata Management

This presentation discusses the importance of data acquisition, curation, and metadata management in the field of data science.


Presentation Transcript


  1. Example of Data Science, and Data and Information Acquisition (Curation) and Metadata/Provenance Management • Thomas Hughes • Data Science – ITEC/CSCI/ERTH 4350/6350 • Week 2, September 8, 2015

  2. Admin info (keep/print this slide) • Class: ITEC/CSCI/ERTH-4350/6350 • Hours: 9am-11:50am Tuesday • Location: Lally 102 • Instructor: Thomas Hughes • Instructor contact: hughet2@rpi.edu, 518.276.2315 (do not leave a msg) • Contact hours: Monday** 3:00-4:00pm (or by appt) • Contact location: **JEC 5018 • TA: Abinash Koirala; koiraa@rpi.edu • Web site: http://tw.rpi.edu/web/courses/DataScience/2015 • Schedule, lectures, syllabus, reading, assignments, etc.

  3. Review from last week • Data • Information • Knowledge • Metadata/ documentation • Data life-cycle

  4. Reading Assignments • Changing Science: Chris Anderson • Rise of the Data Scientist • Where to draw the line • What is Data Science? • An example of Data Science • If you have never heard of Data Science • BRDI activities • Data policy • Self-directed study (answers to the quiz) • Fourth Paradigm, Digital Humanities

  5. Rise of the Data Scientist • http://flowingdata.com/2009/06/04/rise-of-the-data-scientist/ • http://drewconway.com/zia/2013/3/27/where-to-draw-the-line-on-data-science • http://radar.oreilly.com/2010/06/what-is-data-science.html • http://www.wired.com/magazine/2010/06/ff_sergeys_search/all/1 • https://blogs.msdn.microsoft.com/escience/2009/10/16/the-fourth-paradigm-data-intensive-scientific-discovery-book-released/ • And other reading for this week…

  6. Metaphor • Anatomy is the study of the structure of and relationships between body parts • Physiology is the study of the function of body parts and the body as a whole.

  7. Overused Venn diagram of the intersection of skills needed for Data Science (Drew Conway) • [Diagram labels: Anatomy, Physiology, ?, Missing Anatomy]

  8. Data Science • Anatomy (as an individual) • Data Life Cycle – Acquisition, Curation and Preservation • Data Management and Products • Forms of Analysis, Errors and Uncertainty • Technical tools and standards

  9. Data Science • Physiology (in a group) • Definition of Science: Hypotheses, Guiding Questions • Finding and Integrating Datasets • Presenting Analyses and Viz. • Presenting Conclusions

  10. Shifting the Burden from the User to the Provider • Fox CI and X-informatics - CSIG 2008, Aug 11

  11. Reminder • Science data (and information) challenges are being identified as increasingly common • Data (and information) science now accompanies theory, observation/experiment and simulation as a means of doing science • Scientists and technologists are not well prepared to cope with 21st century data management and use of tools • Making data available is now a responsibility, not a privilege

  12. Skills needed • Database or data structures? • Literacy with computers and applications that can handle data • Ability to access internet and retrieve/ acquire data • Presentation of assignments • Working alone and in groups

  13. What is expected • Attend class, complete assignments • Participate (esp. reading discussion) • Ask questions* • Work both individually and in a group • Work constructively in group and class sessions

  14. And now, a more detailed example.

  15. Why do we care about the Sun? • The Sun’s radiation is the single largest external input to the Earth’s atmosphere and thus the Earth system. • Also, it varies – in time and wavelength • Also, for a long time - Solar Energetic Particles and the near-Earth environment (and more recently the effect on clouds) • Observations commenced ~1940s, with a resurgence in the late 1970s • Two quantities of scientific interest • Total Solar Irradiance - TSI in W m⁻² (adjusted to 1 AU) • Solar Spectral Irradiance - SSI in W m⁻² m⁻¹ or W m⁻² nm⁻¹ • Measure, model, understand -> construct, predict

  16. Solar radiation as a function of altitude

  17. Summary of Results • First comprehensive ‘database’ of: • Empirical models of the thermodynamic structure of the solar atmosphere suitable for different solar magnetic activity levels • First comprehensive (70 component) synthetic spectral irradiance database in absolute units • 10 disk angles, 7 models, far ultraviolet to far infrared, multi-resolution • ~724 GB • Strong validation in ultraviolet, visible, lines, infrared • Correct center-to-limb prediction for red-band irradiances • Found 30-45% network contribution to Ly-α irradiance • Several comparisons led to improvements in the atomic parameters • Led to choice of PICARD (new satellite) filter wavelengths

  18. Which brings us to DATA SCIENCE • Drum roll….. • Some dirty secrets • And some … universal truths…

  19. Needs (this is our mantra) Scientists should be able to access a global, distributed knowledge base of scientific data that: • appears to be integrated • appears to be locally available But… data is obtained by multiple means (models and instruments), using various protocols, in differing vocabularies, using (sometimes unstated) assumptions, with inconsistent (or non-existent) meta-data. It may be inconsistent, incomplete, evolving, and distributed. And created in a manner to facilitate its generation NOT its use. And… there exist(ed) significant levels of semantic heterogeneity, large-scale data, complex data types, legacy systems, inflexible and unsustainable implementation technology

  20. Back to the TSI time series…

  21. One composite, one assumption

  22. Another composite, different assumption

  23. Data pipelines: we have problems • Data is coming in faster, in greater volumes and more forms, outstripping our ability to perform adequate quality control • Data is being used in new ways, and we frequently do not have sufficient information on what happened to the data along the processing stages to determine if it is suitable for a use we did not envision • We often fail to capture, represent and propagate manually generated information that needs to go with the data flows • Each time we develop a new instrument, we develop a new data ingest procedure and collect different metadata and organize it differently. It is then hard to use with previous projects • The task of event determination and feature classification is onerous and we don't do it until after we get the data • And now much of the data is on the Internet/Web (good or bad?)

  24. 20080602 Fox VSTO et al.

  25. Yes, it all was/is about Provenance • Origin or source from which something comes, intention for use, who/what it was generated for, manner of manufacture, history of subsequent owners, sense of place and time of manufacture, production or discovery, documented in detail sufficient to allow reproducibility • More on that later in the class…
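These elements can be captured even in a very simple record. A minimal sketch in Python, where the field names and sample values are illustrative (a production system would use a standard vocabulary such as W3C PROV, which appears in this week's reading):

```python
from datetime import datetime, timezone

def make_provenance(source, generated_by, intended_use):
    """Build a minimal provenance record for a dataset.

    Field names are illustrative, not the W3C PROV vocabulary.
    """
    return {
        "source": source,              # origin from which the data comes
        "generated_by": generated_by,  # who/what generated it
        "intended_use": intended_use,  # intention for use
        "created_at": datetime.now(timezone.utc).isoformat(),
        "history": [],                 # subsequent processing/ownership steps
    }

rec = make_provenance("hypothetical TSI radiometer", "level-2 pipeline v1",
                      "irradiance time series")
rec["history"].append("degradation correction applied")
```

Appending every processing step to `history` is what later makes reproducibility (and the "suitable for an unforeseen use?" question from the pipelines slide) answerable.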

  26. Focus for the rest of this module • Preparing for data collection • Managing data • Data and metadata formats • Data life-cycle : acquisition • Modes of collecting • Examples • Information as data • Bias, provenance • Curation • Assignment 1

  27. Data-Information-Knowledge Ecosystem • [Diagram labels: Producers, Consumers, Experience, Data, Information, Knowledge, Creation, Gathering, Presentation, Organization, Integration, Conversation, Context]

  28. MIT DDI Alliance Life Cycle

  29. 20080602 Fox VSTO et al.

  30. Modes of collecting data, information • Observation • Measurement • Generation • Driven by • Questions • Research idea • Exploration

  31. Considerations Information has content, context and structure. The notion of “unstructured” really means information that is “unmanaged” with conventional technologies (i.e., metadata, markup and databases). Databases, metadata and markup provide control for managing digital information, but they are not convenient. This is why less than 20% of the available digital records are managed with these technologies (i.e., conventional technologies are not scalable). Search engines are extremely convenient, but provide limited control for managing digital information (i.e., long lists of ranked results conceal relationships within and between digital records). The search engine problem is that accessing more information does not equal more knowledge. We already have effectively infinite and instantaneous access to digital information. The challenge is no longer access, but being able to objectively integrate information based on user-defined criteria independent of scale to discover knowledge.

  32. Data Management reading • http://libraries.mit.edu/guides/subjects/data-management/cycle.html • http://esipfed.org/DataManagement • http://wiki.esipfed.org/index.php/Data_Management_Workshop • http://lisa.dachev.com/ESDC/ • Moore et al., Data Management Systems for Scientific Applications, IFIP Conference Proceedings; Vol. 188, pp. 273 – 284 (2000) • Data Management and Workflows http://www.isi.edu/~annc/papers/wses2008.pdf • Metadata and Provenance Management http://arxiv.org/abs/1005.2643 • Provenance Management in Astronomy http://arxiv.org/abs/1005.3358 • Web Data Provenance for QA http://www.slideshare.net/olafhartig/using-web-data-provenance-for-quality-assessment • W3C PROV

  33. Management • Creation of logical collections • The primary goal of a Data Management system is to abstract the physical data into logical collections. The resulting view of the data is a uniform homogeneous library collection. • Physical data handling • This layer maps between the physical to the logical data views. Here you find items like data replication, backup, caching, etc.
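One way to sketch the logical-collection abstraction, with an invented catalogue for illustration: consumers address data by logical name only, and the resolution step is where replication, backup and caching live.

```python
# Hypothetical catalogue: logical names -> physical replicas.
catalog = {
    "solar/tsi_composite": ["/archive/a/tsi.nc", "/mirror/b/tsi.nc"],
}

def resolve(logical_name):
    """Return the first registered physical replica for a logical name.

    A real system would also check availability and pick the cheapest
    replica; this sketch just hides physical paths from the user.
    """
    replicas = catalog.get(logical_name, [])
    if not replicas:
        raise KeyError(f"no replicas registered for {logical_name!r}")
    return replicas[0]

path = resolve("solar/tsi_composite")
```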

  34. Management • Interoperability support • Normally the data does not reside in the same place, or various data collections (like catalogues) should be put together in the same logical collection. • Security support • Data access authorization and change verification. This is the basis of trusting your data. • Data ownership • Define who is responsible for data quality and meaning

  35. Management • Metadata collection, management and access. • Metadata are data about data. • Persistence • Definition of data lifetime. Deployment of mechanisms to counteract technology obsolescence. • Knowledge and information discovery • Ability to identify useful relations and information inside the data collection.

  36. Management • Data dissemination and publication • Mechanisms to make interested parties aware of changes and additions to the collections.

  37. Logical Collections • Identifying naming conventions and organization • Aligning cataloguing and naming to facilitate search, access, use • Provision of contextual information • Related to metadata – why?

  38. Physical Data Handling • Where does the data come from, and who provides it? • How is it transferred into a physical form? • Backup, archiving, and caching... • Data formats • Naming conventions

  39. Interoperability Support • Bit/byte and platform/wire neutral encodings • Programming or application interface access • Data structure and vocabulary (metadata) conventions and standards • Definitions of interoperability? • Smallest number of things to agree on so that you do not need to agree on anything else
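To make "byte- and platform-neutral" concrete: if the byte order is fixed in the format itself, any platform decodes the same bytes to the same value. A minimal sketch (the measurement value is made up):

```python
import struct

# Pack a measurement as big-endian ("network order") IEEE 754 so the
# byte layout is part of the format, not a property of the platform.
value = 1361.5                         # illustrative irradiance in W/m^2
wire = struct.pack(">d", value)        # 8 bytes, big-endian double
decoded = struct.unpack(">d", wire)[0]
assert decoded == value and len(wire) == 8
```

The `>` in the format string is the "smallest thing to agree on" here: once sender and receiver share it, no further agreement about their hardware is needed.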

  40. Security • What mechanisms exist for securing data? • Who performs this task? • Change and versioning (yes, the data may change), who does this, how? • Who has access? • How are access methods controlled, audited? • Who and what – authentication and authorization? • Encryption and data integrity
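Data integrity in particular is commonly enforced with cryptographic checksums: store a digest at ingest, recompute it at transfer or audit time, and compare. A minimal sketch (the sample record bytes are invented):

```python
import hashlib

def fixity(data: bytes) -> str:
    """SHA-256 digest stored alongside the data as a fixity check."""
    return hashlib.sha256(data).hexdigest()

original = b"2015-09-08,1361.08\n"     # illustrative data record
stored = fixity(original)              # recorded at ingest

# Later audit: any change to the bytes changes the digest.
assert fixity(original) == stored
assert fixity(b"2015-09-08,1361.09\n") != stored
```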

  41. Data Ownership • Rights and policies – definition and enforcement • Limitations on access and use • Requirements for acknowledgement and use • Who defines and ensures quality, and how? • To whom may ownership migrate? • How to address replication? • How to address revised/derivative products?

  42. Metadata • Know what conventions, standards, best practices exist • Use them – can be hard, use tools • Understand costs of incomplete and inconsistent metadata • Understand the line between metadata and data and when it is blurred • Know where and how to manage metadata and where to store it (and where not to) • Metadata CAN be added later in many cases
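The cost of incomplete metadata can be caught mechanically at ingest. A toy completeness check, borrowing a few Dublin Core element names for illustration (real validation would use a schema tool):

```python
# Required elements borrowed from Dublin Core for illustration.
REQUIRED = {"title", "creator", "date"}

def missing_fields(metadata: dict) -> set:
    """Return the required elements absent from a metadata record."""
    return REQUIRED - metadata.keys()

record = {"title": "TSI composite", "creator": "example author"}
assert missing_fields(record) == {"date"}

record["date"] = "2015-09-08"          # metadata CAN be added later
assert missing_fields(record) == set()
```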

  43. Persistence • Where will you put your data so that someone else (e.g. one of your class members) can access it? • What happens after the class, the semester, after you graduate? • What other factors are there to consider?

  44. Discovery • If you choose (see ownership and security), how does someone find your data? • How would you provide discovery of collections, versus files, versus ‘bits’? • How to enable the narrowest/broadest discovery?

  45. Dissemination • Who should do this? • How and what needs to be put in place? • How to advertise? • How to inform about updates? • How to track use, significance?

  46. Data Formats - preview • ASCII, UTF-8, ISO 8859-1 • Self-describing formats • Table-driven • Markup languages and other web-based • Database • Graphs • Unstructured • Discussion… because this is part of your assignment
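To make "self-describing" concrete before the discussion: here is the same observation in a bare table versus a format where the names travel with the values (numbers invented):

```python
import csv
import io
import json

# CSV without a header: compact, but the meaning lives outside the file.
buf = io.StringIO()
csv.writer(buf).writerow([2015, 1361.08])

# JSON: field names (and, by convention, units) travel with the data.
record = json.dumps({"year": 2015, "tsi_w_per_m2": 1361.08})
decoded = json.loads(record)
assert decoded["tsi_w_per_m2"] == 1361.08
```

A reader handed only the CSV bytes must guess what the columns mean; a reader handed the JSON record can at least recover the field names without external documentation.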

  47. Metadata formats • ASCII, UTF-8, ISO 8859-1 • Table-driven • Markup languages and other web-based • Database, graphs, … • Unstructured • Look familiar? Yes, same as data • Next week we’ll look at things like • Dublin Core (dc.x) • Encoding/wrapper standards - METS • ISO in general, e.g. ISO/IEC 11179 • Geospatial, ISO 19115-2, FGDC • Time, ISO 8601, xsd:datetime
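Of these, ISO 8601 is easy to demonstrate directly: it underlies xsd:datetime and gives unambiguous, lexically sortable timestamps.

```python
from datetime import datetime, timezone

# An ISO 8601 timestamp: date, time, and an explicit UTC offset.
t = datetime(2015, 9, 8, 9, 0, tzinfo=timezone.utc)
stamp = t.isoformat()
assert stamp == "2015-09-08T09:00:00+00:00"

# Round-trip: the string parses back to the same instant.
assert datetime.fromisoformat(stamp) == t
```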
