Using DCO Data (Infrastructure, Management, Analysis, Visualization, …) Data Science Peter Fox @taswegian, pfox@cs.rpi.edu (Marshall Ma) and the Data Science Team Tetherless World Constellation Rensselaer Polytechnic Institute DCO Summer School, July 14, 2014. Big Sky, MT https://deepcarbon.net/group/dco-summer-school-2014
Deep Carbon Observatory Global community of ‘Carbon’ scientists (~1000 from ~40 countries) contributing to a Deep Earth Computer (data legacy) comprising: • Global Earth Mineral Laboratory • Global Census of Deep Fluids • Global Volcano Gas Emissions • Global Census of Deep Microbial Life • Global State of High Pressure and Temperature Carbon and Related Materials • Global Inventory of Diamonds with Inclusions • …
Data Science is … • Doing science with someone else’s data … • across datasets • with models • multi-dimensional, multi-scale, multi-mode • complex data-types • needing new analytic and visual approaches • Especially in multiple “dimensions” (functional) • E.g. detection/attribution methods/algorithms • Visual exploration
Value and units? Physical quantity versus measured as quantity Reference frame? Reference units? Value and units?
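One way to keep a value and its units together is to store them as a pair and refuse arithmetic across mismatched units. The sketch below is illustrative only: the `Quantity` class, the `gpa_to_kbar` helper, and the pressure example are assumptions for this note, not part of any DCO tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    """A measured value that always travels with its unit (hypothetical helper)."""
    value: float
    unit: str

    def __add__(self, other: "Quantity") -> "Quantity":
        # Refuse to combine values whose units differ instead of guessing.
        if self.unit != other.unit:
            raise ValueError(f"unit mismatch: {self.unit} vs {other.unit}")
        return Quantity(self.value + other.value, self.unit)


def gpa_to_kbar(q: Quantity) -> Quantity:
    """Convert a pressure from GPa to kbar (1 GPa = 10 kbar)."""
    if q.unit != "GPa":
        raise ValueError(f"expected GPa, got {q.unit}")
    return Quantity(q.value * 10.0, "kbar")
```

Adding `Quantity(1.0, "GPa")` to `Quantity(1.0, "kbar")` raises an error rather than silently returning 2.0, which is the failure the slide's questions are meant to prevent.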
Use case: How DCO Finds Out About Data • Importing tool • A data repository • Repository staff / data librarian • Transformed data ready for import • Internet • A data manager transforming data • Data • Spreadsheet • Diagram • Digital Map • Report • A scientist bringing new data (Fleischer, 2011)
Data-Information-Knowledge “Ecosystem” Producers Consumers Experience Data Information Knowledge Creation Gathering Presentation Organization Integration Conversation Context
Producers Consumers Quality Control Quality Assessment Fitness for Purpose Fitness for Use Trustor Trustee
Spreadsheets • E.g. Excel – import data
Census of Deep Life • Substantial metadata – how to visualize THIS?
Bias • To incline to one side; to give a particular direction to; to influence; to prejudice; to prepossess. [1913 Webster] • A partiality that prevents objective consideration of an issue or situation [syn: prejudice, preconception] • For acquisition – sampling bias is your enemy • Cognitive bias is (due to) YOU!
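A small worked example can make sampling bias concrete. The sketch below uses made-up numbers: a hypothetical population of 100 values and an assumed instrument detection limit that hides everything below 50. Neither comes from a DCO dataset.

```python
from statistics import mean

# Hypothetical "true" population: 100 samples with values 1..100.
population = list(range(1, 101))

# Assumed detection limit: the instrument only records values >= 50.
DETECTION_LIMIT = 50
observed = [v for v in population if v >= DETECTION_LIMIT]

population_mean = mean(population)  # 50.5
observed_mean = mean(observed)      # 75

print(f"true mean: {population_mean}, observed mean: {observed_mean}")
```

The observed mean overstates the true mean only because of how the samples were acquired, not because of any error in the individual measurements.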
Provenance* • Origin or source from which something comes, intention for use, who/what generated for, manner of manufacture, history of subsequent owners, sense of place and time of manufacture, production or discovery, documented in detail sufficient to allow reproducibility • Internal • External
How you find DCO data…? • http://deepcarbon.net/dco_datasets • Will soon be a window into community-based sources • http://metpetdb.rpi.edu • http://earthchem.org/ • http://www.earthchem.org/petdb • http://vamps.mbl.edu/portals/deep_carbon/cdl.php • …
E.g. Deep Life (CoDL) New tools: R (statistics, visualization, modeling), D3.js (visualization) NOT just of the data, but of all types of information, knowledge! IPython Notebooks?
When You Use Data – Science 2.0 • Versioning, subsetting, and converting to a format you are familiar with are very common but mysterious • Take notes – document – provenance • Software – what did you use and how? • Derived products – what did you create, how, why, etc. • Use the metadata every chance you get, e.g. filenames! • Place them in a Web-accessible folder, consider getting an identifier • Use social media, blogs, etc. to discuss it.
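The note-taking advice above can be partly automated. The following is a minimal sketch, assuming a JSON "sidecar" file stored next to each downloaded dataset; the function names, the field names, and the `.prov.json` suffix are hypothetical choices, not a DCO convention.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone
from pathlib import Path


def provenance_record(data_path, source_url, notes):
    """Build a small provenance record for a file you downloaded or derived."""
    data = Path(data_path).read_bytes()
    return {
        "file": Path(data_path).name,
        # The checksum lets you check later that the file has not changed.
        "sha256": hashlib.sha256(data).hexdigest(),
        "source_url": source_url,
        "retrieved_utc": datetime.now(timezone.utc).isoformat(),
        # Software used: only the interpreter version is recorded here.
        "python_version": platform.python_version(),
        "notes": notes,
    }


def write_sidecar(data_path, record):
    """Write the record next to the data as <name>.prov.json (assumed suffix)."""
    sidecar = Path(str(data_path) + ".prov.json")
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar
```

A call such as `write_sidecar(path, provenance_record(path, url, "subset to 2010-2013, converted to CSV"))` keeps the what, where, and how alongside the data itself.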
Exercise 1 • Search for and access a dataset that you are not familiar with: • Can you read it? • Can you make sense of it? • Can you assess quality, uncertainty? • Any sources of bias? • What would you need to do to make it useful?
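For the "can you read it?" and "can you assess quality?" questions, a first pass over an unfamiliar CSV file might look like the sketch below. It uses only the Python standard library; the function name `summarize_csv` and the list of missing-value markers (including the `-999` sentinel) are assumptions that should be checked against the dataset's own documentation.

```python
import csv
import io


def summarize_csv(text, missing=("", "NA", "NaN", "-999")):
    """Count rows and, per column, missing and non-numeric entries."""
    reader = csv.DictReader(io.StringIO(text))
    columns = reader.fieldnames or []
    summary = {name: {"missing": 0, "non_numeric": 0} for name in columns}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name in columns:
            raw = row.get(name)
            # Short rows yield None; treat that like an empty cell.
            if raw is None or raw.strip() in missing:
                summary[name]["missing"] += 1
                continue
            try:
                float(raw)
            except ValueError:
                summary[name]["non_numeric"] += 1
    return n_rows, summary


sample = "depth_m,temp_C,site\n10,4.2,A\n20,NA,B\n30,-999,\n"
n_rows, summary = summarize_csv(sample)
print(n_rows, summary)
```

A column with many missing entries, or a numeric column with unexpected text, is an early hint about quality and about what cleaning would be needed to make the data useful.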
When You Generate Data – Science 2.0 • Record how the data were generated, why, for what, when, and in what format • Take notes – document – provenance • Software – what did you use and how? • Derived products – what did you create, how, why, etc. • Use the metadata every chance you get, e.g. filenames! • Place them in a Web-accessible folder, consider getting an identifier • Use social media, blogs, etc. to discuss it.
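Putting metadata into filenames works best when the pattern is fixed and can be parsed back. The sketch below assumes one possible pattern, `project_site_variable_date_vN.ext`; the pattern, the function names, and the example values are hypothetical, not a DCO standard.

```python
import re
from datetime import date

# Each part may contain only letters, digits, and hyphens, so "_" stays a safe separator.
_PART = re.compile(r"[A-Za-z0-9-]+")


def build_filename(project, site, variable, collected, version, ext="csv"):
    """Compose a self-describing filename from a few metadata fields."""
    parts = [project, site, variable, collected.isoformat(), f"v{version}"]
    for part in parts:
        if not _PART.fullmatch(part):
            raise ValueError(f"invalid filename part: {part!r}")
    return "_".join(parts) + "." + ext


def parse_filename(name):
    """Recover the metadata fields from a filename made by build_filename."""
    stem, ext = name.rsplit(".", 1)
    project, site, variable, collected, version = stem.split("_")
    return {
        "project": project,
        "site": site,
        "variable": variable,
        "collected": date.fromisoformat(collected),
        "version": int(version[1:]),
        "ext": ext,
    }


name = build_filename("dco", "bigsky", "co2flux", date(2014, 7, 14), 2)
print(name)  # dco_bigsky_co2flux_2014-07-14_v2.csv
```

Because the name round-trips through `parse_filename`, a folder listing alone tells a later user what each file contains and which version it is.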
Make it visible to DCO (can be private) • https://deepcarbon.net/dco/dco-open-access-and-data-policies • https://deepcarbon.net/page/submit-community-data • You get an identifier! DCO-ID, can be cited, rewarded and much more… Share…
DCO checklist: what people have to do (courtesy UC3) • Your data management plan • Funding agency requirements • Domain Scientist Curation Services & Tools • Creating your data • Data manager • Data Scientist • Organizing your data Data Science • Repository staff • Managing your data • Sharing your data • Domain scientists often also take up these two roles, which is neither efficient nor effective (i.e., the 80-20 rule).
DCO checklist: a service & tool perspective + • Your data management plan • AP Sloan requirements+ Use cases, info. model • Object Modeling e.g., NSF New Proposal and Award Policies and Procedures Guide (effective January 14, 2013) • Creating your data Schema.org, etc. • Characterization Services • Identity Services DCO-ID (Handle+DOI) • Organizing your data • CKAN, community Storage Services • Managing your data • Ingest Services • CKAN, community • Faceted search and Drupal etc. • Discovery Service • Sharing your data • Access Services • Linked Data, community
Exercise 2 • Begin with a recent dataset that you generated or were involved in generating • Can someone else read it? • Can someone make sense of it? • Have you asserted quality, uncertainty? • Have you described known sources of bias? • What else would you now do to make it more useful?
Further reading • Data Science course at RPI: http://tw.rpi.edu/web/Courses/DataScience/2013 • Fourth Paradigm: http://research.microsoft.com/en-us/collaboration/fourthparadigm/ • Data Management Planning tools: • http://tw.rpi.edu/web/project/DCO-DS/WorkingGroups/DMP • http://www.iedadata.org/compliance/plan • https://dmp.cdlib.org/
Breakout Session Today • Exercises 1 and 2 • Discussion
Friday • Marshall (Xiaogang) Ma will round out the data discussion • DCO goal for data: in the interim, • help you become data scientists (as well as your specialty) • Then, in time… • you can drop “data” because you will handle data as easily as you do field work, use instruments, etc…