
Using DCO Data (Infrastructure, Management, Analysis, Visualization, …)

Using DCO Data (Infrastructure, Management, Analysis, Visualization, …). Data Science. Peter Fox @taswegian, pfox@cs.rpi.edu (Marshall Ma) and the Data Science Team, Tetherless World Constellation, Rensselaer Polytechnic Institute. DCO Summer School, July 14, 2014, Big Sky, MT.


Presentation Transcript


  1. Using DCO Data (Infrastructure, Management, Analysis, Visualization, …) Data Science Peter Fox @taswegian, pfox@cs.rpi.edu (Marshall Ma) and the Data Science Team Tetherless World Constellation Rensselaer Polytechnic Institute DCO Summer School, July 14, 2014. Big Sky, MT https://deepcarbon.net/group/dco-summer-school-2014

  2. Deep Carbon Observatory Global community of ‘Carbon’ scientists (~1000 from ~40 countries) contributing to a Deep Earth Computer (data legacy) comprising: • Global Earth Mineral Laboratory • Global Census of Deep Fluids • Global Volcano Gas Emissions • Global Census of Deep Microbial Life • Global State of High Pressure and Temperature Carbon and Related Materials • Global Inventory of Diamonds with Inclusions • …

  3. Data Science is … • Doing science with someone else’s data … • across datasets • with models • multi-dimensional, multi-scale, multi-mode • complex data-types • needing new analytic and visual approaches • Especially in multiple “dimensions” (functional) • E.g. Detection/attribution methods/algorithms • Visual exploration

  4. You may see many diagrams like

  5. Value and units? Physical quantity versus measured as quantity Reference frame? Reference units? Value and units?
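
The questions on this slide can be made concrete by storing every number together with its quantity, units, and reference. The following is a minimal sketch only; the `Measurement` class, its field names, and the example values are assumptions for illustration and are not part of any DCO tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """A value that carries its own context (hypothetical structure)."""
    quantity: str   # the physical quantity, e.g. "depth"
    value: float    # the measured number
    units: str      # the units the number is expressed in
    reference: str  # the reference frame or datum the number is relative to

    def describe(self) -> str:
        return f"{self.quantity} = {self.value} {self.units} (reference: {self.reference})"


# Made-up example value, for illustration only
depth = Measurement("depth", 1250.0, "m", "below sea floor")
print(depth.describe())
```

A bare `1250.0` answers none of the slide's questions; the record above answers all of them wherever the value travels.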

  6. Use case: How DCO Finds Out About Data • Importing tool • A data repository • Repository staff/data librarian • Transformed data ready for import • Internet • A data manager transforming data • Data • Spreadsheet • Diagram • Digital Map • Report • A scientist bringing new data (Fleischer, 2011)

  7. Data-Information-Knowledge “Ecosystem” Producers Consumers Experience Data Information Knowledge Creation Gathering Presentation Organization Integration Conversation Context

  8. Producers Consumers Quality Control Quality Assessment Fitness for Purpose Fitness for Use Trustor Trustee

  9. Spreadsheets • E.g. Excel – import data

  10. Documentation?

  11. Census of Deep Life • Substantial metadata – how to visualize THIS?

  12. Bias • To incline to one side; to give a particular direction to; to influence; to prejudice; to prepossess. [1913 Webster] • A partiality that prevents objective consideration of an issue or situation [syn: prejudice, preconception] • For acquisition – sampling bias is your enemy • Cognitive bias is (due to) YOU!

  13. Provenance* • Origin or source from which something comes, intention for use, who/what generated for, manner of manufacture, history of subsequent owners, sense of place and time of manufacture, production or discovery, documented in detail sufficient to allow reproducibility • Internal • External

  14. How you find DCO data…? • http://deepcarbon.net/dco_datasets • Will soon be a window into community-based sources • http://metpetdb.rpi.edu • http://earthchem.org/ • http://www.earthchem.org/petdb • http://vamps.mbl.edu/portals/deep_carbon/cdl.php • …

  15. Browser

  16. All information is linked and traceable!

  17. E.g. Deep Life (CoDL) New tools: R (statistics, visualization, modeling), D3.js (visualization) NOT just of the data, but of all types of information, knowledge! IPython Notebooks?

  18. When You Use Data – Science 2.0 • Versioning, subsetting and converting to a format you are familiar with are very common but mysterious • Take notes – document – provenance • Software – what did you use and how? • Derived products – what did you create, how, why, etc. • Use the metadata every chance you get, e.g. filenames! • Place them in a Web-accessible folder, consider getting an identifier • Use social media, blogs, etc. to discuss it.
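
One way to act on "take notes – document – provenance" is to write a small sidecar file next to each derived product. This is a sketch under stated assumptions: the function name `write_provenance`, the `.prov.json` suffix, and the record fields are hypothetical choices, not a DCO convention.

```python
import hashlib
import json
import platform
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_provenance(data_path, source, steps):
    """Write '<data file>.prov.json' beside the data file and return its path."""
    data_path = Path(data_path)
    record = {
        "file": data_path.name,
        # Checksum ties the notes to one exact version of the file
        "sha256": hashlib.sha256(data_path.read_bytes()).hexdigest(),
        "source": source,                      # where the data came from
        "steps": list(steps),                  # what was done to it, in order
        "software": {"python": platform.python_version()},
        "recorded_utc": datetime.now(timezone.utc).isoformat(),
    }
    sidecar = data_path.with_name(data_path.name + ".prov.json")
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar


# Demo on a throwaway file; the file name, source, and steps are made up.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "subset_2014-07-14.csv"
    data.write_text("sample_id,depth_m\nA1,120\n")
    sidecar = write_provenance(
        data,
        source="downloaded from a community repository (hypothetical)",
        steps=["subset to depth < 500 m", "converted to CSV"],
    )
    print(sidecar.read_text())
```

Keeping the notes in a file beside the data means they move with it when the folder is shared or made Web-accessible.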

  19. 4 R’s … Goble and others

  20. Exercise 1 • Search for and access a dataset that you are not familiar with: • Can you read it? • Can you make sense of it? • Can you assess quality, uncertainty? • Any sources of bias? • What would you need to do to make it useful?
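
For a tabular dataset, a first pass at "can you read it?" and "can you assess quality?" might look like the sketch below. It assumes the data is CSV with a header row; the `MISSING` markers and the sample rows are made up for illustration, and the real markers should come from the dataset's documentation.

```python
import csv
import io

# Assumed missing-value markers; check the dataset's documentation for the real ones
MISSING = {"", "NA", "NaN", "null", "-999"}


def summarize(csv_text):
    """Return the row count and, per column, the number of missing values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {name: 0 for name in reader.fieldnames}
    rows = 0
    for row in reader:
        rows += 1
        for name in missing:
            value = row[name]
            # Short rows yield None for absent fields; count those as missing too
            if value is None or value.strip() in MISSING:
                missing[name] += 1
    return rows, missing


# Made-up sample, for illustration only
sample = """sample_id,depth_m,temp_C
A1,120,4.5
A2,,5.1
A3,300,NA
"""
rows, missing = summarize(sample)
print(rows, missing)
```

A summary like this does not answer the questions about uncertainty or bias, but it shows quickly where the documentation needs to be consulted.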

  21. When You Generate Data – Science 2.0 • How the data was generated, why, for what, when and in what format • Take notes – document – provenance • Software – what did you use and how? • Derived products – what did you create, how, why, etc. • Use the metadata every chance you get, e.g. filenames! • Place them in a Web-accessible folder, consider getting an identifier • Use social media, blogs, etc. to discuss it.

  22. Make it visible to DCO (can be private) https://deepcarbon.net/dco/dco-open-access-and-data-policies https://deepcarbon.net/page/submit-community-data You get an identifier! DCO-ID, can be cited, rewarded and much more… Share…

  23. DCO checklist: what people have to do (courtesy UC3) • Your data management plan • Funding agency requirements • Domain Scientist Curation Services & Tools • Creating your data • Data manager • Data Scientist • Organizing your data • Repository staff • Managing your data • Sharing your data • Domain scientists often also take up these two roles, which however is neither efficient nor effective (i.e., the 80-20 rule).

  24. DCO checklist: a service & tool perspective + • Your data management plan • AP Sloan requirements + Use cases, info. model • Object Modeling e.g., NSF New Proposal and Award Policies and Procedures Guide (effective January 14, 2013) • Creating your data Schema.org, etc. • Characterization Services • Identity Services DCO-ID (Handle+DOI) • Organizing your data • CKAN, community Storage Services • Managing your data • Ingest Services • CKAN, community • Faceted search and Drupal etc. • Discovery Service • Sharing your data • Access Services • Linked Data, community

  25. Exercise 2 • Begin with a recent dataset that you generated or were involved in generating • Can someone else read it? • Can someone make sense of it? • Have you asserted quality, uncertainty? • Have you described known sources of bias? • What else would you now do to make it more useful?

  26. Further reading • Data Science course at RPI: http://tw.rpi.edu/web/Courses/DataScience/2013 • Fourth Paradigm: http://research.microsoft.com/en-us/collaboration/fourthparadigm/ • Data Management Planning tools: • http://tw.rpi.edu/web/project/DCO-DS/WorkingGroups/DMP • http://www.iedadata.org/compliance/plan • https://dmp.cdlib.org/

  27. Breakout Session Today • Exercises 1 and 2 • Discussion

  28. Friday • Marshall (Xiaogang) Ma will round out the data discussion • DCO goal for data: in the interim, • help you become data scientists (as well as your specialty) • Then, in time… • you can drop “data” because you will handle data as easily as you do field work, use instruments, etc…
