Learn statistical concepts, data types, and data table organization for research on drought resistance in Trembling Aspen. Understand how to ensure samples represent the population accurately and manage data effectively. Data types, data tables, and file management strategies are covered for efficient research.
Lecture 3. Statistical Vocabulary & data management Zihaohan Sang Sept 10, 2019
Week 2 • Basic statistical vocabulary + data management • Exploratory graphics
Here is the distribution of lodgepole pine. Do these samples (from stars) represent the population?
Take-home message: • Make sure the samples can fully represent the population you want to study. • To reduce uncertainty caused by random chance, the more general (broader) the sample, the better.
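One way to read the second point is that larger random samples are less affected by chance. The sketch below illustrates this with a simulated population; the population values, the sample sizes, and the helper `sample_means()` are all hypothetical and not from the lecture.

```r
set.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 tree heights in arbitrary units.
population <- rnorm(10000, mean = 50, sd = 10)

# Draw many random samples of size n and record the mean of each one.
sample_means <- function(n, reps = 1000) {
  replicate(reps, mean(sample(population, n)))
}

spread_small <- sd(sample_means(5))    # spread of means from small samples
spread_large <- sd(sample_means(100))  # spread of means from larger samples

# The means of the larger samples vary less from draw to draw.
c(small = spread_small, large = spread_large)
```

Note that this only addresses random chance. A large sample taken from one corner of the species range would still not represent the whole population.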
Data types in R
• Numeric, discrete: integer (1, 5, 100)
• Numeric, continuous: numbers with decimals (1.1, 5.0, 100.3)
• Categorical, nominal: character or factor (species, locations)
• Categorical, ordinal: ordered factor (‘Good’, ‘Med’, ‘Poor’), with levels Poor < Med < Good
• Logical: TRUE/FALSE
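Notes: • Use as.factor() or as.numeric() to force a variable into the type you want. • In R versions before 4.0.0, read.csv() automatically reads character columns as factors, with levels ordered alphabetically; from R 4.0.0 on, the default is stringsAsFactors = FALSE, so these columns stay character unless you ask for factors. • If even one entry in a column contains letters, R classifies the whole column as character (or factor).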
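The types and conversions above can be tried out in a few lines of R. This is a minimal sketch with made-up values; none of the variable names or numbers come from the lecture data.

```r
# Hypothetical example values, not data from the lecture.
count   <- c(1L, 5L, 100L)                        # discrete: integer
height  <- c(1.1, 5.0, 100.3)                     # continuous: numeric
species <- as.factor(c("pine", "aspen", "pine"))  # nominal: factor
quality <- factor(c("Good", "Poor", "Med"),
                  levels = c("Poor", "Med", "Good"),
                  ordered = TRUE)                 # ordinal: ordered factor
alive   <- c(TRUE, FALSE, TRUE)                   # logical

levels(species)          # factor levels are sorted alphabetically by default
quality[1] > quality[2]  # comparison works because the levels are ordered

# One non-numeric entry makes the whole column character.
mixed <- c("1.5", "2.0", "n/a")
class(mixed)
mixed_num <- suppressWarnings(as.numeric(mixed))  # "n/a" becomes NA

# Pitfall: as.numeric() on a factor returns the internal level codes.
f <- factor(c("10", "20", "10"))
codes  <- as.numeric(f)                # 1 2 1
values <- as.numeric(as.character(f))  # 10 20 10

# Request factors explicitly so the result does not depend on the R version.
dat <- read.csv(text = "species,height\npine,1.1\naspen,5.0",
                stringsAsFactors = TRUE)
str(dat)
```

Converting a factor through as.character() first is the usual way to recover the numbers that were originally typed in.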
Golden rules for data tables 1. A row represents a unit • All measurements of a unit should normally be in the same row. • Different units must be in different rows. • It is important to think about what your units are.
Golden rules for data tables 2. If in doubt, add more rows • If possible, use categorical (character) variables to indicate the independent effects (treatments, environments). • Repeated measurements are normally added as rows, with two independent variables “Time” and “Individual”. • It is always easy to convert a long table to a wide table (e.g. with an Excel pivot table), but not vice versa.
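The long-to-wide conversion can also be done in R rather than Excel. The sketch below uses base R's reshape() on a small made-up table; the column names "Individual", "Time", and "Height" and all values are hypothetical.

```r
# Hypothetical long table: one row per individual per measurement time.
long <- data.frame(
  Individual = rep(c("T01", "T02"), each = 2),
  Time       = rep(c(1, 2), times = 2),
  Height     = c(10.2, 12.5, 9.8, 11.1)
)

# Spread the repeated measurements into one column per time point.
wide <- reshape(long,
                idvar     = "Individual",  # one output row per individual
                timevar   = "Time",        # values become column suffixes
                direction = "wide")

wide  # columns: Individual, Height.1, Height.2
```

Keeping the long table as the stored version and deriving the wide one on demand follows the rule above: the long form is the one that is easy to convert.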
Golden rules for data tables 3. Use strong IDs
Golden rules for data tables 4. Modify your raw data entries with R scripts • It is easy to change something and re-run the analysis (e.g. with or without outliers). • Hunting down and fixing errors is efficient, because the script leaves a perfect trail of what you did. • You save yourself from repetitive tasks (which are likely to introduce errors).
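As a minimal sketch of this idea, the script below leaves the raw table untouched and applies the outlier decision in code, so the analysis can be re-run either way by flipping one switch. The data values, the threshold `max_height`, and the flag `exclude_outliers` are hypothetical.

```r
# Hypothetical raw data; in practice this would come from read.csv().
raw <- data.frame(
  id     = c("T01", "T02", "T03", "T04"),
  height = c(10.2, 9.8, 11.1, 101.0)  # the last value looks like a typo
)

exclude_outliers <- TRUE  # flip to FALSE to re-run with all rows
max_height       <- 50    # assumed plausibility limit, for illustration only

# The raw table is never edited; the cleaned copy is derived from it.
dat <- if (exclude_outliers) raw[raw$height <= max_height, ] else raw

mean(dat$height)  # same analysis step, with or without outliers
```

Because the exclusion rule is written down in the script, anyone reading it later can see exactly which rows were dropped and why.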
Golden Rules - File Management • Keep all files you need for a particular analysis in one folder (.RData-shortcut, data.xls, data.csv, script.r, script.sas, documentation.txt). • Create new folders for new tasks or analyses (numbered and descriptive folder names are useful). • Use many folders but a shallow folder hierarchy (2-4 subdirectories deep). • Zip previous folders (analysis steps) for backup.
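A folder layout along these lines can be created from R itself. This is only a sketch: the project name and folder names are hypothetical, and the example writes into a temporary directory so it does not touch real files.

```r
# Hypothetical project root inside R's temporary directory.
project <- file.path(tempdir(), "aspen_project")

# Numbered, descriptive folder names, one per task or analysis step.
folders <- c("01_data_cleaning", "02_exploratory_graphics")

for (f in folders) {
  # recursive = TRUE also creates the project root if it is missing.
  dir.create(file.path(project, f), recursive = TRUE, showWarnings = FALSE)
}

list.files(project)  # shows the two numbered folders
```

For a real project, the temporary directory would be replaced by a path of your own choosing.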