
Managing and Curating Data


zenobia

Presentation Transcript


  1. Managing and Curating Data Chapter 8

  2. Introduction • Data organization • Data management • Data curation • Raw data is required to repeat a scientific study • Any data supported by public funds is legally required to be available for other scientists and the public

  3. Step 1: Managing Raw Data • Various sources of data • Data loggers • Handwritten notes • This data must be transferred to an organized format, checked and analyzed

  4. Spreadsheets • Row: single observation • Column: single measured or observed variable • Enter data as soon as possible: mistakes are detected early, memory of field details doesn’t last long, a second copy of the data exists, and analysis is timely • Proofread and check the entered data • [Figure: example spreadsheet, “2006 Garden Yield”]

  5. Step 2: Metadata: Data about data • “Must have” metadata: • Name and contact info of collector • Location of data collection • Name of study • Source of funding • Description of the organization of the data file • Methods used to collect the data • Types of experimental units • Description of abbreviations • Explicit description of data in columns and rows • May be created before data collection in some cases • Very important to assemble promptly because the details are easily forgotten
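
The "must have" list above can be turned into a simple completeness check. This is a minimal sketch, not a metadata standard: the field names, the `missing_metadata` helper, and the draft record are all hypothetical, and using "2006 Garden Yield" as the study name is an assumption based on the spreadsheet title on the previous slide.

```python
# Hypothetical field names that mirror the "must have" metadata list.
REQUIRED_FIELDS = [
    "collector_name", "collector_contact", "location", "study_name",
    "funding_source", "file_organization", "collection_methods",
    "experimental_units", "abbreviations", "column_descriptions",
]

def missing_metadata(record):
    """Return the required fields that are absent or empty in `record`."""
    return [f for f in REQUIRED_FIELDS if not str(record.get(f, "")).strip()]

# Illustrative, incomplete draft record (values are placeholders).
draft = {
    "study_name": "2006 Garden Yield",
    "column_descriptions": {"plot": "plot identifier", "biomass": "vegetable biomass"},
    "abbreviations": {"NA": "not available"},
}

todo = missing_metadata(draft)
print(todo)  # fields still to be filled in before the data file is archived
```

Running a check like this when the data file is created makes it harder to forget a field later.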

  6. Step 3: Checking the Data • Outliers: values of measurements or observations that are outside the range of the bulk of the data • A common working rule: values above the upper decile (90th percentile) or below the lower decile (10th percentile) • Outliers increase the variance in the data, which increases the chance of a Type II error
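
The decile rule can be sketched with Python's standard `statistics` module (Python 3.8+). The function name is hypothetical, and the data are the vegetable biomass values used later in the stem-and-leaf slide. By construction this rule flags roughly a fifth of any data set, so treat it as a way to pick values for a closer look, not as a test for errors.

```python
import statistics

def flag_decile_outliers(values):
    """Return values below the 10th or above the 90th percentile."""
    # quantiles(n=10) returns the nine cut points between deciles;
    # "inclusive" treats the data as the whole population of interest.
    cuts = statistics.quantiles(values, n=10, method="inclusive")
    low, high = cuts[0], cuts[-1]
    return [v for v in values if v < low or v > high]

biomass = [7, 15, 35, 36, 37, 23, 27, 21, 42, 55]
flagged = flag_decile_outliers(biomass)
print(flagged)  # [7, 55]
```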

  7. How to deal with outliers • Do not delete them; this could be considered fraud • Delete a value only if it is an error or the data are no longer valid • Think about them • They may suggest interesting hypotheses • A large body of science is devoted to outliers • What type of distribution do your data have?

  8. Errors and Missing Data • Errors are often outliers and can be identified • Sources: mistyping (e.g. misplaced decimal points), instrument error, field entry • Checking data can reduce errors • Never leave blank cells in spreadsheets; enter a zero only for a true measured zero, and NA (not available) for a missing value
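
A short sketch of why blank cells are a problem: a blank is ambiguous, while `NA` and `0` are explicit. The CSV text, the column names, and the `load_biomass` helper below are invented for illustration only.

```python
import csv
import io

# Invented example file: row C has a blank cell, row B an explicit NA.
RAW = """plot,biomass
A,7
B,NA
C,
D,0
"""

def load_biomass(text):
    """Parse the biomass column; return (values, line numbers of blank cells)."""
    values, blank_lines = [], []
    # start=2 because line 1 of the file is the header row
    for line_no, row in enumerate(csv.DictReader(io.StringIO(text)), start=2):
        cell = row["biomass"].strip()
        if cell == "":
            blank_lines.append(line_no)  # ambiguous: missing, or never entered?
            values.append(None)
        elif cell == "NA":
            values.append(None)          # explicitly recorded as not available
        else:
            values.append(float(cell))   # includes true zeros
    return values, blank_lines

values, blank_lines = load_biomass(RAW)
print(values)       # [7.0, None, None, 0.0]
print(blank_lines)  # [4]
```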

  9. Detecting Outliers and Errors • Three techniques • Calculating column statistics • Checking ranges and precision of column values • Graphical exploratory data analysis

  10. Detecting Outliers and Errors cont. • Column stats: • Mean, median, standard deviation, variance • Logical functions to check your columns • Range checking your data
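
Column statistics and a range check can be sketched in a few lines of standard-library Python. The helper names are hypothetical, and the plausible range of 0 to 50 is an assumption chosen only to show a flagged value; a real range should come from knowledge of the measured variable.

```python
import statistics

def column_summary(values):
    """Basic column statistics for one numeric column."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),        # sample standard deviation
        "variance": statistics.variance(values),  # sample variance
    }

def out_of_range(values, lower, upper):
    """Return (row_index, value) pairs that fall outside [lower, upper]."""
    return [(i, v) for i, v in enumerate(values) if not (lower <= v <= upper)]

biomass = [7, 15, 35, 36, 37, 23, 27, 21, 42, 55]
summary = column_summary(biomass)
suspects = out_of_range(biomass, lower=0, upper=50)  # assumed plausible range
print(summary["mean"], summary["median"])
print(suspects)  # [(9, 55)]
```

A value flagged here is a candidate for checking against the field notes, not automatically an error.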

  11. Graphical Exploratory Data Analysis • Box plots (univariate) • Stem-and-leaf plots (univariate) • Scatterplots (bivariate or multivariate)

  12. Stem-and-leaf plots • Example: Vegetable biomass: 7, 15, 35, 36, 37, 23, 27, 21, 42, 55

      Stem | Leaves
      0    | 7
      1    | 5
      2    | 1, 3, 7
      3    | 5, 6, 7
      4    | 2
      5    | 5
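
A stem-and-leaf plot for the biomass example can be generated with a short function. This is a minimal sketch that assumes non-negative integers, with the tens digit as the stem and the units digit as the leaf; the function name is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return stem-and-leaf lines for non-negative integers (stem = tens)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        stems[stem].append(leaf)
    lines = []
    # Include empty stems so gaps in the distribution stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = ",".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

biomass = [7, 15, 35, 36, 37, 23, 27, 21, 42, 55]
plot = stem_and_leaf(biomass)
print("\n".join(plot))
```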

  13. Scatter plots • Use to see how traits relate to one another

  14. Creating an Audit Trail • Examining data for outliers and errors is a QA/QC for research • Document how you perform QA/QC in your metadata • Your audit trail allows others to reanalyze and recreate your results • May be required for legal documentation
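
One way to keep an audit trail is to record every correction as a log entry and never overwrite the raw data. The sketch below is an illustration under assumptions: the `correct_value` helper, the log fields, and the example correction (a hypothetical misplaced decimal point) are invented, not taken from the chapter.

```python
from datetime import date

def correct_value(values, index, new_value, reason, log):
    """Return a corrected copy of `values` and append an entry to `log`."""
    log.append({
        "date": date.today().isoformat(),
        "row": index,
        "old": values[index],
        "new": new_value,
        "reason": reason,
    })
    corrected = list(values)  # the raw data list is left untouched
    corrected[index] = new_value
    return corrected

raw = [7, 15, 35, 36, 37, 23, 27, 21, 42, 55]
audit_log = []
# Hypothetical correction: suppose the field notes show 5.5, not 55.
clean = correct_value(raw, 9, 5.5, "decimal point mistyped; see field notes", audit_log)
print(audit_log)
```

Saving the log alongside the metadata lets someone else replay each change from the raw file to the analyzed file.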
