GRAPHS & STATS
• More on scatterplots
• Exporting data
• Overview of statistics
• T-tests
20 September 2014, Sherubtse Training
HtWt Data
What kinds of interesting questions can we ask? What graphs would we make to answer them?
• Is there a difference in height between UWICE & SFS personnel? Does it differ for males vs. females?
• Is there a difference in weight between UWICE & SFS personnel? Does it differ for males vs. females?
• Is there a relationship between height and weight for UWICE personnel? How about for SFS personnel?
• Is there a relationship between height and weight for males? How about for females?
Just for fun... add a column of calculated data (BMI) and then summarize the data by SEX and INSTITUTE
HtWt$BMI <- [equation]
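One possible way to fill in this exercise, as a sketch only: it assumes height is stored in cm and weight in kg (the column names cm, kg, sex and institute used elsewhere in these slides) and uses the standard BMI definition, weight in kg divided by height in metres squared. The data values are invented for illustration.

```r
# Invented example data; the real HtWt data frame is not reproduced here.
HtWt <- data.frame(
  institute = c("UWICE", "UWICE", "SFS", "SFS"),
  sex       = c("M", "F", "M", "F"),
  cm        = c(170, 160, 180, 150),
  kg        = c(65, 50, 81, 45)
)

# BMI = weight in kg divided by the square of height in metres
HtWt$BMI <- HtWt$kg / (HtWt$cm / 100)^2

# Mean BMI for each combination of sex and institute
bmi.summary <- aggregate(BMI ~ sex + institute, data = HtWt, FUN = mean)
print(bmi.summary)
```

aggregate() is only one option here; tapply() would give the same means in a table layout.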
Create this scatterplot of heights vs. weights for just UWICE personnel
Alternative (formula) format for plot(): plot(kg ~ cm, data=UWICE)
Add a regression line to the UWICE scatterplot and determine if the linear relationship is significant
• Fit the model with lm() and store it (e.g., as lm.UWICE)
• Inspect it with summary(): significant or not?
• Add the regression line: abline(lm.UWICE, col="red")
• One way to add the p-value as text (click on the plot to place it): text(locator(1), "p = 0.0016", cex=1.5, col="red")
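The regression steps on this slide can be sketched as follows. The heights and weights are invented stand-ins for the UWICE subset, so the p-value will not match the one on the slide. The label is placed at fixed coordinates rather than with locator(), which needs an interactive graphics window.

```r
# Invented heights (cm) and weights (kg) standing in for the UWICE subset
UWICE <- data.frame(
  cm = c(150, 155, 160, 165, 170, 175, 180),
  kg = c(48, 52, 55, 61, 63, 70, 74)
)

plot(kg ~ cm, data = UWICE)

# Fit the linear model and look at the slope, R-squared and p-value
lm.UWICE <- lm(kg ~ cm, data = UWICE)
print(summary(lm.UWICE))

# Add the fitted line to the existing plot
abline(lm.UWICE, col = "red")

# Extract the p-value for the slope instead of typing it by hand
p.slope <- summary(lm.UWICE)$coefficients["cm", "Pr(>|t|)"]

# Fixed coordinates: a non-interactive alternative to locator(1)
text(x = 158, y = 72, labels = paste("p =", signif(p.slope, 2)),
     cex = 1.5, col = "red")
```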
Add a thick dashed blue line at x = 170 to indicate which UWICE staff can receive special travel privileges (HINT: use ?par to figure out the plot arguments for setting line type & line width)
What are the sexes of the 3 tall persons who will get special privileges? Use identify() to label them, then click on the 3 points: identify(UWICE$cm, UWICE$kg, labels=UWICE$sex, n=3)
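A minimal sketch of this exercise, using invented data with a sex column. The line type and width are set with lty and lwd (both documented under ?par). Because identify() needs mouse clicks, the sketch labels the tall persons with text() instead; that substitution is an assumption made so the example runs as a script.

```r
# Invented data standing in for the UWICE subset
UWICE <- data.frame(
  cm  = c(152, 158, 163, 168, 172, 176, 181),
  kg  = c(47, 53, 56, 60, 66, 71, 78),
  sex = c("F", "F", "M", "F", "M", "F", "M")
)

plot(kg ~ cm, data = UWICE)

# lty = 2 gives a dashed line; lwd = 3 makes it thick
abline(v = 170, lty = 2, lwd = 3, col = "blue")

# Non-interactive alternative to identify(): label everyone taller than 170 cm
tall <- UWICE[UWICE$cm > 170, ]
text(tall$cm, tall$kg, labels = as.character(tall$sex), pos = 1)
```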
Do the same with the SFS scatterplot... (Is the relationship significant?)
METHOD #1
You can put all the data in a single graph with different institutions represented by different colors
1) Figure out the lower & upper limits for x- and y-axes
2) Plot the first (e.g., UWICE) data and regression line, setting the xlim and ylim arguments
3) Add the additional data, using points()
4) Add the lines (abline()), using colors matching each institution's points
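Method #1 might look like the sketch below. The two data frames are invented, and the colors and symbols (purple circles for UWICE, blue squares for SFS) are one choice that matches the legend used later in these slides.

```r
# Invented data for the two institutions
UWICE <- data.frame(cm = c(150, 158, 165, 172, 180), kg = c(48, 54, 60, 66, 75))
SFS   <- data.frame(cm = c(155, 162, 170, 178, 186), kg = c(55, 58, 67, 72, 83))

# 1) Axis limits that cover both data sets
x.lim <- range(c(UWICE$cm, SFS$cm))
y.lim <- range(c(UWICE$kg, SFS$kg))

# 2) Plot the first data set and its regression line
plot(kg ~ cm, data = UWICE, xlim = x.lim, ylim = y.lim,
     col = "purple", pch = 16)
abline(lm(kg ~ cm, data = UWICE), col = "purple")

# 3) Add the second data set to the same plot
points(SFS$cm, SFS$kg, col = "blue", pch = 15)

# 4) Add its regression line in the matching color
abline(lm(kg ~ cm, data = SFS), col = "blue")
```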
Add a Legend
'legend' is not an argument in plot(), so we add it as a separate line of code:
legend(x="topleft", legend=c("UWICE","SFS"), fill=c("purple","blue"), inset=.02, bty="n")
What does each of these arguments mean?
METHOD #2
You can put all the data in a single graph with different institutions represented by different colors
1) Figure out the lower & upper limits for x- and y-axes
2) Plot the full data, setting the xlim and ylim arguments and 2 colors & pch values*
3) Add the lines (abline()), using colors matching each institution's points
4) Add the legend
* col=c("blue","purple")[HtWt$institute], pch=c(16,15)[HtWt$institute] (this works when HtWt$institute is a factor: each point takes the color and symbol at the position of its factor level)
How would you change the legend boxes to match the points in the scatterplot (UWICE = circle, SFS = square)?
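One possible answer, combined with Method #2, is sketched below: giving legend() the pch and col arguments instead of fill draws the plotting symbols rather than filled boxes. The data are invented, and the factor levels are set explicitly (an assumption) so that the first color and symbol belong to UWICE.

```r
# Invented combined data; levels are fixed so that UWICE is level 1, SFS level 2
HtWt <- data.frame(
  institute = factor(rep(c("UWICE", "SFS"), each = 4),
                     levels = c("UWICE", "SFS")),
  cm = c(150, 160, 170, 180, 155, 165, 175, 185),
  kg = c(48, 56, 64, 75, 54, 62, 70, 82)
)

cols <- c("purple", "blue")  # in factor-level order: UWICE, SFS
pchs <- c(16, 15)            # 16 = filled circle, 15 = filled square

# Indexing by the factor picks the color and symbol for each row
plot(kg ~ cm, data = HtWt,
     col = cols[HtWt$institute], pch = pchs[HtWt$institute])

# One regression line per institution, in the matching color
for (i in seq_along(levels(HtWt$institute))) {
  inst <- levels(HtWt$institute)[i]
  abline(lm(kg ~ cm, data = HtWt[HtWt$institute == inst, ]), col = cols[i])
}

# pch and col (instead of fill) make the legend show the point symbols
legend("topleft", legend = levels(HtWt$institute),
       col = cols, pch = pchs, inset = 0.02, bty = "n")
```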
Exporting Data
To transfer a matrix or data frame via the clipboard (Windows): write.table(HtsWts, "clipboard", sep="\t"), then paste in Excel
...to a tab-delimited text file: write.table(HtsWts, "c:/mydata.txt", sep="\t")
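A small round-trip sketch of the text-file export, writing to a temporary file so it runs on any system. The data frame contents are invented; row.names = FALSE and quote = FALSE are optional additions (not on the slide) that keep the file tidy when it is opened in a spreadsheet.

```r
# Invented data frame to export
HtsWts <- data.frame(
  sex = c("M", "F", "M"),
  cm  = c(170, 160, 180),
  kg  = c(65, 50, 81)
)

# Write a tab-delimited file to a temporary location
out.file <- tempfile(fileext = ".txt")
write.table(HtsWts, out.file, sep = "\t", row.names = FALSE, quote = FALSE)

# Read it back to confirm the export worked
check <- read.delim(out.file)
print(check)
```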
DESCRIPTIVE STATISTICS: What is the mean height of our sample of persons? (Sample Statistic)
INFERENTIAL STATISTICS: Is the mean height of our sample a good measure of true population height? (Population Parameter, e.g., Height)
DESCRIPTIVE STATISTICS
• Mean: our best estimate of the true population mean
• Standard Deviation: our best estimate of the true population variability
• N
• Standard Error
• Confidence Interval
INFERENTIAL STATISTICS: How good is our estimate (from the sample) of the true population mean?
• Mean: our best estimate of the true population mean
• Standard Deviation: our best estimate of the true population variability
• N
• Standard Error
• Confidence Interval
Descriptive Statistics
Summarize the data we have collected:
• mean, median, mode
• range, variance, standard deviation, interquartile range
• graphical summaries of the data (e.g., histogram, boxplot)
Why do we need it?
• It’s difficult to just look at raw data and understand what they mean
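These summaries map onto base R functions roughly as sketched below, using invented heights. Note that base R's mode() reports the storage type of an object, not the statistical mode, so the most frequent value is found here with table(); that helper line is one common workaround, not a built-in.

```r
# Invented sample of heights (cm)
heights <- c(150, 155, 160, 160, 165, 170, 175, 185)

mean(heights)
median(heights)

# Statistical mode: the most frequent value (first one if there are ties)
mode.value <- as.numeric(names(which.max(table(heights))))
mode.value

range(heights)  # minimum and maximum
var(heights)    # variance
sd(heights)     # standard deviation
IQR(heights)    # interquartile range

# Graphical summaries
hist(heights)
boxplot(heights)
```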
Inferential Statistics
• Use a sample of data to make conclusions and predictions about the population we sampled from
• Often used to determine if there are differences between populations or if a ‘treatment’ affected a population
Why do we need it?
• We often don’t have the time or money to collect data from the entire population we are interested in
For inferential statistics, conclusions are only reliable if we sampled properly!
Truth + Chance = Sample Statistic We use the sample data to make our best prediction about the population (the data we don't have), and then quantify the chance that we’re wrong (standard errors & confidence intervals) But no matter how fancy the statistics or how pretty the graphs, conclusions are only reliable if we sampled properly!
When are Data Non-Normal? When multiple processes or populations are combined in a single data set... (Figure: heights of children aged 5 - 12)
When are Data Non-Normal? When the population has many values close to zero or some other natural limit...
When are Data Non-Normal? When some extreme values skew the population... (here, also bounded by zero) (Figure labels: COMPANY EXECUTIVES, THE SUPER RICH)
When are Data Non-Normal? When the data follow a process that naturally generates non-normal distributions
• EXPONENTIAL DISTRIBUTION: Population growth
• POISSON DISTRIBUTION: Counts of rare events, e.g., accidents (lower bound of zero)
• BINOMIAL DISTRIBUTION: Proportion (%) data
What Can We Do With Non-Normal Data? • Check the data for errors; then • Transform data to approximate a normal distribution; OR • Apply nonparametric statistics
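The last two options can be sketched as follows, with invented numbers. A log transformation is used as one example of a transformation (which one is appropriate depends on the data), and the Wilcoxon rank-sum test, wilcox.test(), is shown as one nonparametric alternative to the two-sample t-test.

```r
# Invented right-skewed data
x <- c(1, 2, 2, 3, 3, 4, 5, 8, 15, 40)

# Log-transform; a mean far above the median is a sign of right skew
log.x <- log(x)
skew.raw <- mean(x) - median(x)
skew.log <- mean(log.x) - median(log.x)
c(raw = skew.raw, logged = skew.log)

hist(x)
hist(log.x)

# Nonparametric comparison of two invented groups (no normality assumption)
a <- c(1.1, 2.3, 2.9, 3.6, 4.8)
b <- c(5.2, 6.7, 7.1, 9.4, 30.5)
wt <- wilcox.test(a, b)
print(wt)
```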
What is the Standard Error?
• Standard deviation of the sample means
• Tells us if the sample mean is a good estimate of the true population mean
• Used to calculate the 95% confidence interval
Standard Error (SE): sd / sqrt(n)
What is the 95% Confidence Interval?
• If we sample from the same population many times, 95% of the samples will have confidence intervals that include the true population parameter
• The true population parameter (e.g., mean) is likely to be within the 95%CI of a sample (if the samples are unbiased)
• A large (wide) 95%CI tells us that our sample mean is not a very reliable estimate of the true mean
• With large 95%CI's, it is hard to know from the samples whether or not two populations are truly different
95% Confidence interval: mean ± 1.96 × SE (for large samples; for small samples, replace 1.96 with the t critical value, qt(0.975, n-1))
Are plant heights significantly different between control & fertilized treatments?
Sample statistics: control mean = 17.2, s = 2.1; fertilized mean = 18.9, s = 2.2
• N=30: Control 17.2 (95%CI 16.4 – 18.0), Fertilized 18.9 (95%CI 18.1 – 19.7): SIGNIFICANTLY DIFFERENT (the intervals do not overlap)
• N=5: Control 17.2 (95%CI 14.6 – 19.8), Fertilized 18.9 (95%CI 16.2 – 21.6): NOT SIGNIFICANTLY DIFFERENT (the intervals overlap)
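The intervals quoted on this slide can be reproduced from the means, standard deviations and sample sizes alone. The sketch below uses SE = sd / sqrt(n) and the t critical value qt(0.975, n - 1) as the multiplier (close to 1.96 for large n, noticeably bigger for n = 5). The helper name ci95 is made up for this example.

```r
# Hypothetical helper: 95% confidence interval from summary statistics
ci95 <- function(mean, sd, n) {
  se <- sd / sqrt(n)               # standard error
  half <- qt(0.975, n - 1) * se    # half-width of the interval
  c(mean - half, mean + half)
}

# N = 30 per treatment
round(ci95(17.2, 2.1, 30), 1)  # control
round(ci95(18.9, 2.2, 30), 1)  # fertilized

# N = 5 per treatment: same means and sd, much wider intervals
round(ci95(17.2, 2.1, 5), 1)   # control
round(ci95(18.9, 2.2, 5), 1)   # fertilized
```

With the same means and standard deviations, only the sample size changes between the two cases, which is why the intervals overlap at N = 5 but not at N = 30.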
Exploratory Data Analysis
Before jumping into statistical tests and p-values, LOOK AT YOUR DATA in spreadsheets and graphs to identify:
• data errors/outliers
• if your data meet assumptions of parametric statistical tests
• interesting patterns
Before you do any statistical tests, you should already have an idea what the results will be
Errors in Data Collection / Entry • Decimal in wrong place • Same category spelled many ways • Data collected in different measurement units • Forgot to collect some data • Numbers typed incorrectly when transferred from paper (sloppy handwriting, etc.)
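A few quick checks can catch several of these errors. The sketch below uses an invented data frame containing a category spelled three ways, a height apparently entered in metres instead of cm, and a missing value; the cleaning step (trim spaces, convert to upper case) is one possible fix, not a general rule.

```r
# Invented data with typical entry errors
field <- data.frame(
  institute = c("UWICE", "uwice", "UWICE ", "SFS"),
  cm        = c(170, 1.65, 160, NA)
)

# Same category spelled many ways shows up as extra rows in the table
table(field$institute, useNA = "ifany")

# One possible fix: strip spaces and standardize the case
clean <- toupper(trimws(field$institute))
table(clean)

# Wrong units or misplaced decimals show up as odd minimum/maximum values
summary(field$cm)

# Count missing values
n.missing <- sum(is.na(field$cm))
n.missing
```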
OUTLIER: A data point that is much smaller or much larger than other data in the sample
Why do we care? A few outliers can change the sample mean, increase the variance of sample data, and change the p-value of a parametric statistical test
How do we find potential outliers in our data?
• Look at data ranges, histograms & boxplots
• For correlation & regression analyses, look at scatterplots
For correlations/regressions, outliers may fall within the normal range of data... but plotting the scatterplot reveals outliers (Figure annotation: p = 0.06)
A single outlier can change the regression equation and the significance of the relationship (Figure annotations: p = 0.06, p = 0.001)
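The effect can be illustrated with invented data in which one point lies inside the range of both x and y, so a boxplot does not flag it, yet it sits far from the line. The numbers are made up and do not reproduce the p-values shown on the slides.

```r
# Invented data: the last point (x = 10, y = 2.0) is the outlier
d <- data.frame(
  x = 1:10,
  y = c(2.1, 2.9, 4.2, 4.8, 6.1, 7.0, 7.9, 9.2, 9.8, 2.0)
)

# Looking at y alone, no value is flagged as an outlier
boxplot.stats(d$y)$out

# The scatterplot makes the unusual point obvious
plot(y ~ x, data = d)

# Fit the regression with and without the outlier
fit.all  <- lm(y ~ x, data = d)
fit.trim <- lm(y ~ x, data = d[-10, ])
abline(fit.all, col = "red")
abline(fit.trim, col = "blue", lty = 2)

# Compare the p-values for the slope
p.all  <- summary(fit.all)$coefficients["x", "Pr(>|t|)"]
p.trim <- summary(fit.trim)$coefficients["x", "Pr(>|t|)"]
c(with.outlier = p.all, without.outlier = p.trim)
```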
WHAT SHOULD I DO WITH OUTLIERS?
• Are data entered correctly? If NO, remove outliers before analyzing data. If YES, continue.
• Are data from the population of inference? If NO, remove outliers before analyzing data. If YES, treat them as true outliers.
WHAT SHOULD I DO WITH TRUE OUTLIERS?*
• Transform data (if appropriate)
• Use nonparametric statistics
• Analyze data with and without outliers. Do study conclusions change?
  NO: Keep outliers & report results
  YES: Report & discuss both results, or remove outliers & report results, but discuss your justification for removing the outliers
* ALWAYS KEEP GOOD RECORDS OF YOUR DATA EXPLORATION ACTIVITIES AND ANY CHANGES YOU MAKE TO THE ORIGINAL DATA!
OUTLIERS It is wrong to remove outliers from analyses just because they don't fit with the other data! Outliers can tell us interesting information about a population—conduct more research to understand what causes these unusual data.
Do data come from a normally distributed population? • Sample data are assumed to represent the distribution of the population. Non-normal data are not 'wrong', they just represent processes that naturally generate other types of distributions. • With small sample sizes, it can be difficult to tell if data come from a normally distributed population. Consider what you know about the underlying process.
Evaluating Normality
Understand which processes generate non-normal data, then...
• Visual assessment:
  histograms
  normal Q-Q plots
• Normality tests:
  Shapiro-Wilk (shapiro.test())
  Anderson-Darling (ad.test() from pkg ‘nortest’)
  Pearson chi-square (pearson.test() from pkg ‘nortest’)
  Kolmogorov-Smirnov, Lilliefors version (lillie.test() from pkg ‘nortest’)
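A base-R sketch of the visual checks and the Shapiro-Wilk test, using two invented samples (one roughly symmetric, one strongly right-skewed). The tests from the ‘nortest’ package are left out so that the example runs without installing anything.

```r
# Invented samples
symmetric <- c(158, 161, 163, 165, 166, 168, 170, 172, 175, 178)
skewed    <- c(1, 2, 2, 3, 3, 4, 5, 8, 15, 40)

# Visual assessment: histogram and normal Q-Q plot
hist(skewed)
qqnorm(skewed)
qqline(skewed)  # points far from this line suggest non-normality

# Shapiro-Wilk test: a small p-value suggests the data are not normal
sw.symmetric <- shapiro.test(symmetric)
sw.skewed    <- shapiro.test(skewed)
c(symmetric = sw.symmetric$p.value, skewed = sw.skewed$p.value)
```

With only 10 values per sample the test has little power, which is the small-sample caution raised on the previous slide.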
Exploratory Data Analysis
Before jumping into statistical tests and p-values, LOOK AT YOUR DATA in spreadsheets and graphs to identify:
• data errors/outliers (for scatterplots, graph it!)
• if your data meet assumptions of parametric statistical tests
• interesting patterns
Before you do any statistical tests, you should already have an idea what the results will be
T-Test
For determining if population means are different
• Two-sample t-test: Compare the means of two independent groups (don’t need same sample sizes), e.g., is the mean height of Bhutanese college men different from that of USA college men?
• Paired t-test: Compare the means of paired groups, e.g., is the mean weight of USA college men different before and after a 3-month exercise & diet program?
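For the paired case, a minimal sketch with invented before/after weights for the same five men. In t.test(), paired = TRUE tells R that the two vectors are matched person by person; the result is the same as a one-sample t-test on the differences.

```r
# Invented weights (kg) of the same 5 men before and after the program
before <- c(80, 85, 78, 90, 88)
after  <- c(78, 82, 77, 86, 85)

# Paired t-test: the order of the values must match person by person
tt.paired <- t.test(before, after, paired = TRUE)
print(tt.paired)

# Equivalent formulation: one-sample t-test on the differences
tt.diff <- t.test(before - after)
```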
Is the height of shrubs different in grazed and ungrazed areas? (Figure: five ungrazed (U) and five grazed (G) plots)
TWO SAMPLE T-TEST
STEP ONE: Look at the data! (Make the point plot)
STEP TWO: Do the t-test: t.test(grazed, ungrazed)
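The two steps can be sketched as follows with invented shrub heights for five grazed and five ungrazed plots. stripchart() is used here as one way to make a point plot. By default t.test() runs the Welch version, which does not assume equal variances in the two groups.

```r
# Invented shrub heights for 5 grazed and 5 ungrazed plots
grazed   <- c(42, 38, 45, 40, 36)
ungrazed <- c(55, 60, 52, 58, 63)

# STEP ONE: look at the data with a point plot
stripchart(list(grazed = grazed, ungrazed = ungrazed),
           vertical = TRUE, pch = 16, ylab = "Shrub height")

# STEP TWO: two-sample t-test (Welch version by default)
tt <- t.test(grazed, ungrazed)
print(tt)
```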