Managing and Curating Data Chapter 8
Introduction • Data organization • Data management • Data curation • Raw data is required to repeat a scientific study • Any data supported by public funds is legally required to be available for other scientists and the public
Step 1: Managing Raw Data • Various sources of data • Data loggers • Handwritten notes • This data must be transferred to an organized format, checked and analyzed
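Spreadsheets • Row: single observation • Column: single measured or observed variable • Enter data ASAP! • Mistakes are detected early • Memory of the fieldwork doesn’t last long • Entry creates 2 copies of the data • Allows timely analysis • Proofread the data • Check it against the original records • [Figure: example spreadsheet, “2006 Garden Yield”]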
Metadata: Data about data • “Must have” metadata: • Name and contact info of collector • Location of data collection • Name of study • Source of funding • Description of the organization of the data file • Methods used to collect the data • Types of experimental units • Description of abbreviations • Explicit description of data in columns and rows • May be created before data collection in some cases • Very important to assemble promptly because the details are easily forgotten
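The “must have” list above can be checked mechanically before a data file is archived. The following is a minimal sketch, not a standard metadata schema: the field names in `REQUIRED_FIELDS`, the helper `missing_metadata`, and the draft record are all hypothetical names chosen for illustration.

```python
# Hypothetical field names mirroring the "must have" list on the slide.
REQUIRED_FIELDS = [
    "collector_name",
    "collector_contact",
    "location",
    "study_name",
    "funding_source",
    "file_organization",
    "collection_methods",
    "experimental_units",
    "abbreviations",
    "column_row_descriptions",
]


def missing_metadata(record):
    """Return the required fields that are absent or empty in `record`."""
    return [
        field
        for field in REQUIRED_FIELDS
        if not str(record.get(field, "")).strip()
    ]


# Placeholder draft record (illustrative values only).
draft = {"collector_name": "A. Collector", "study_name": "2006 Garden Yield"}
todo = missing_metadata(draft)
print("Still to document:", todo)
```

Running a check like this while the study is still fresh makes it harder for a field to be forgotten.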
Step 3: Checking the Data • Outliers: values of measurements or observations that are outside the range of the bulk of the data • A common working rule: values above the upper decile (90th percentile) or below the lower decile (10th percentile) • Outliers increase the variance in data and increase the chance of a Type II error
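As a rough illustration of the decile rule, the sketch below flags values below the 10th or above the 90th percentile using Python's standard `statistics.quantiles`. The helper name `flag_beyond_deciles` is hypothetical, the choice of the "inclusive" interpolation method is an assumption, and the numbers are the vegetable biomass values used later in these slides.

```python
import statistics


def flag_beyond_deciles(values):
    """Return the values below the 10th or above the 90th percentile."""
    # n=10 yields the nine cut points that separate the ten deciles.
    cuts = statistics.quantiles(values, n=10, method="inclusive")
    low, high = cuts[0], cuts[-1]
    return [v for v in values if v < low or v > high]


biomass = [7, 15, 35, 36, 37, 23, 27, 21, 42, 55]
flagged = flag_beyond_deciles(biomass)
print(flagged)  # [7, 55]
```

By construction this rule flags roughly the most extreme fifth of any data set, so the flagged values are candidates for a closer look, not values to delete.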
How to deal with outliers • Do not delete them just because they are unusual; this could be considered fraud • Delete a value only if it is a confirmed error or the data are no longer valid • Think about them • They may suggest interesting hypotheses • A large body of science is devoted to outliers • What type of distribution does your data have?
Errors and Missing Data • Errors are often outliers and can be identified • Sources: mistyping (e.g., misplaced decimal points), instrument errors, field entry errors • Checking data can reduce errors • Never leave blank cells in spreadsheets; enter a zero only for a true measured zero, and NA (not available) for a missing value
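Blank cells are easy to miss by eye, so a small script can list them. This is a sketch under stated assumptions: the data are in CSV form with a header row, the helper `find_blank_cells` is a hypothetical name, and the four-row table is made up for illustration (plot B has a blank cell, plot C is explicitly NA, plot D is a true zero).

```python
import csv
import io

# Made-up example table; in practice this would be an opened data file.
raw = io.StringIO("plot,biomass\nA,7\nB,\nC,NA\nD,0\n")


def find_blank_cells(handle):
    """Return (line_number, column_name) for every empty cell."""
    blanks = []
    reader = csv.DictReader(handle)
    # Line 1 is the header, so the first data row is line 2.
    for line_number, row in enumerate(reader, start=2):
        for column, value in row.items():
            if value is None or value.strip() == "":
                blanks.append((line_number, column))
    return blanks


blank_cells = find_blank_cells(raw)
print(blank_cells)  # [(3, 'biomass')]
```

Only the truly blank cell is reported; the explicit NA and the zero are left alone because each already states what the observer meant.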
Detecting Outliers and Errors • Three techniques • Calculating column statistics • Checking ranges and precision of column values • Graphical exploratory data analysis
Detecting Outliers and Errors cont. • Column stats: • Mean, median, standard deviation, variance • Logical functions to check your columns • Range checking your data
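The column statistics and range check listed above can be sketched with Python's standard `statistics` module. The helper names `column_summary` and `out_of_range`, the allowed range of 0 to 100, and the extra value 420 (a hypothetical decimal-point typo for 42.0) are assumptions made for illustration.

```python
import statistics


def column_summary(values):
    """Basic statistics for one spreadsheet column."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),        # sample standard deviation
        "variance": statistics.variance(values),  # sample variance
        "min": min(values),
        "max": max(values),
    }


def out_of_range(values, lower, upper):
    """Return (index, value) pairs that fall outside [lower, upper]."""
    return [(i, v) for i, v in enumerate(values) if not lower <= v <= upper]


biomass = [7, 15, 35, 36, 37, 23, 27, 21, 42, 55]
summary = column_summary(biomass)

# Append a hypothetical typo (420 instead of 42.0) and range-check the column.
suspect = out_of_range(biomass + [420], lower=0, upper=100)
print(summary["median"], suspect)  # 31.0 [(10, 420)]
```

A mean that sits far from the median, or a maximum far outside the plausible range, is often the first hint of a data-entry error.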
Graphical Exploratory Data Analysis • Box plots (univariate) • Stem-and-leaf plots (univariate) • Scatterplots (bivariate or multivariate)
Stem-and-leaf plots • Example: Vegetable biomass: 7, 15, 35, 36, 37, 23, 27, 21, 42, 55
0 | 7
1 | 5
2 | 1,3,7
3 | 5,6,7
4 | 2
5 | 5
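A stem-and-leaf plot like the one above can be built by splitting each value into its tens digit (stem) and units digit (leaf). This is a minimal sketch that assumes non-negative whole numbers; the function name `stem_and_leaf` is a hypothetical choice.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the plot as a list of text lines, one per stem."""
    stems = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # tens digit(s) and units digit
        stems[stem].append(leaf)
    lines = []
    # Include empty stems so gaps in the distribution stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = ",".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}")
    return lines


biomass = [7, 15, 35, 36, 37, 23, 27, 21, 42, 55]
plot_lines = stem_and_leaf(biomass)
print("\n".join(plot_lines))
```

Keeping empty stems in the output is a deliberate choice: an isolated value separated from the rest by blank stems stands out as a possible outlier.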
Scatter plots • Use to see how traits relate to one another
Creating an Audit Trail • Examining data for outliers and errors is a form of quality assurance and quality control (QA/QC) for research • Document how you perform QA/QC in your metadata • Your audit trail allows others to reanalyze and recreate your results • May be required for legal documentation
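One possible way to keep an audit trail is to record each QA/QC action as it happens and store the record alongside the metadata. The sketch below is only an illustration: the `log_action` helper, the entry fields, and the two logged actions are hypothetical, and JSON is just one convenient storage format.

```python
import json
from datetime import datetime, timezone

audit_trail = []


def log_action(action, detail):
    """Append one time-stamped QA/QC step to the audit trail."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "detail": detail,
    }
    audit_trail.append(entry)
    return entry


# Hypothetical entries for illustration.
log_action("range_check", "biomass expected between 0 and 100; 1 value flagged")
log_action("correction", "biomass 420 changed to 42.0 after checking field notes")

# Serialize so the trail can be saved next to the data file.
audit_text = json.dumps(audit_trail, indent=2)
print(audit_text)
```

Because every change is logged with its reason, someone else can later trace how the raw data became the analyzed data.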