
GRAPHS & STATS

GRAPHS & STATS. More on scatterplots Exporting data Overview of statistics T-tests. 20 September 2014 Sherubtse Training. HtWt Data. What kinds of interesting questions can we ask? What graphs would we make to answer them?.

burt

Presentation Transcript


  1. GRAPHS & STATS More on scatterplots Exporting data Overview of statistics T-tests 20 September 2014 Sherubtse Training

  2. HtWt Data: What kinds of interesting questions can we ask? What graphs would we make to answer them? • Is there a difference in height between UWICE & SFS personnel? Does it differ for males vs. females? • Is there a difference in weight between UWICE & SFS personnel? Does it differ for males vs. females? • Is there a relationship between height and weight for UWICE personnel? How about for SFS personnel? • Is there a relationship between height and weight for males? How about for females?

  3. Just for fun... add a column of calculated data (BMI) and then summarize the data by SEX and INSTITUTE: HtWt$BMI <- [equation]
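
The slide leaves the BMI equation as an exercise. One possible sketch uses the standard definition of BMI (weight in kg divided by height in metres squared) and aggregate() for the summary. The data frame below is a made-up stand-in for HtWt; the column names (institute, sex, cm, kg) are taken from other slides, but the values are purely illustrative.

```r
# Hypothetical stand-in for the HtWt data (values are made up for illustration)
HtWt <- data.frame(
  institute = rep(c("UWICE", "SFS"), each = 5),
  sex       = c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F"),
  cm        = c(172, 158, 180, 162, 174, 155, 175, 160, 170, 165),
  kg        = c(70, 52, 82, 58, 71, 50, 74, 55, 69, 60)
)

# Standard BMI definition: kg / m^2 (height is stored in cm, so divide by 100)
HtWt$BMI <- HtWt$kg / (HtWt$cm / 100)^2

# Mean BMI for every combination of sex and institute
bmi.summary <- aggregate(BMI ~ sex + institute, data = HtWt, FUN = mean)
print(bmi.summary)
```

Swapping FUN = mean for sd or length gives the spread or the group sizes in the same layout.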

  4. Create this scatterplot of heights vs. weights for just UWICE personnel. Alternative format for plot(): plot(data=UWICE, kg~cm)
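
A minimal sketch of this step, assuming the UWICE data frame is made by subsetting HtWt on its institute column. The data values and axis labels are made up for illustration.

```r
# Hypothetical stand-in for the HtWt data (values are made up for illustration)
HtWt <- data.frame(
  institute = rep(c("UWICE", "SFS"), each = 5),
  sex       = c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F"),
  cm        = c(172, 158, 180, 162, 174, 155, 175, 160, 170, 165),
  kg        = c(70, 52, 82, 58, 71, 50, 74, 55, 69, 60)
)

# Keep only the UWICE rows
UWICE <- subset(HtWt, institute == "UWICE")

# Formula interface: y ~ x, so weight (kg) goes on the y-axis, height (cm) on the x-axis
plot(kg ~ cm, data = UWICE,
     xlab = "Height (cm)", ylab = "Weight (kg)", main = "UWICE personnel")
```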

  5. Add a regression line to the UWICE scatterplot and determine if the linear relationship is significant. Fit the model with lm(), then inspect it with summary(lm): significant or not? Add the regression line: abline(lm.UWICE,col="red"). One way to add the p-value as text: (text(locator(1), "p = 0.0016", cex=1.5, col="red"))
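
The steps on this slide might look like the sketch below. The UWICE values are made up, so the p-value will not match the slide's 0.0016. Because locator() needs a mouse click, the sketch places the text at fixed coordinates instead; that substitution is an assumption made so the code also runs non-interactively.

```r
# Hypothetical UWICE data (values are made up for illustration)
UWICE <- data.frame(
  sex = c("M", "F", "M", "F", "M"),
  cm  = c(172, 158, 180, 162, 174),
  kg  = c(70, 52, 82, 58, 71)
)

plot(kg ~ cm, data = UWICE)

# Fit the linear model and look at the coefficient table
lm.UWICE <- lm(kg ~ cm, data = UWICE)
print(summary(lm.UWICE))

# Pull the p-value for the slope out of the summary table
p.value <- summary(lm.UWICE)$coefficients["cm", "Pr(>|t|)"]

# Add the fitted line and the p-value to the existing plot
abline(lm.UWICE, col = "red")
text(x = min(UWICE$cm), y = max(UWICE$kg),
     labels = paste("p =", signif(p.value, 2)),
     pos = 4, cex = 1.5, col = "red")
```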

  6. Add a thick dashed blue line at x = 170 to indicate which UWICE staff can receive special travel privileges (HINT: use ?par to figure out the plot arguments for setting line type & line width). What are the sexes of the 3 tall persons who will get special privileges? Use identify() to label them: identify(UWICE$cm, UWICE$kg, labels=UWICE$sex, n=3)
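
One way to draw the line is sketched below, using the lty (line type) and lwd (line width) settings described in ?par. The data are made up. identify() only works in an interactive session, so it is guarded here, and a non-interactive lookup of the same rows is shown alongside it.

```r
# Hypothetical UWICE data (values are made up for illustration)
UWICE <- data.frame(
  sex = c("M", "F", "M", "F", "M"),
  cm  = c(172, 158, 180, 162, 174),
  kg  = c(70, 52, 82, 58, 71)
)

plot(kg ~ cm, data = UWICE)

# Vertical line at x = 170: lty = 2 is dashed, lwd = 3 makes it thick
abline(v = 170, lty = 2, lwd = 3, col = "blue")

# Non-interactive check: which rows are taller than 170 cm?
tall <- UWICE[UWICE$cm > 170, ]
print(tall$sex)

# Interactive labelling: click on 3 points to label them with their sex
if (interactive()) {
  identify(UWICE$cm, UWICE$kg, labels = as.character(UWICE$sex), n = 3)
}
```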

  7. Do the same with the SFS scatterplot...(is the relationship significant?)

  8. You can put all the data in a single graph with different institutions represented by different colors. METHOD #1: 1) Figure out the lower & upper limits for x- and y-axes 2) Plot the first (e.g., UWICE) data and regression line, setting xlim() and ylim() 3) Add the additional data, using points() 4) Add the lines (abline()), using colors matching each institution's points
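
Method #1 could be sketched as follows. The data are made up, and the color and symbol choices (UWICE purple circles, SFS blue squares) are assumptions based on the legend and symbol descriptions on later slides.

```r
# Hypothetical stand-in for the HtWt data (values are made up for illustration)
HtWt <- data.frame(
  institute = rep(c("UWICE", "SFS"), each = 5),
  cm        = c(172, 158, 180, 162, 174, 155, 175, 160, 170, 165),
  kg        = c(70, 52, 82, 58, 71, 50, 74, 55, 69, 60)
)
UWICE <- subset(HtWt, institute == "UWICE")
SFS   <- subset(HtWt, institute == "SFS")

# 1) Axis limits that cover BOTH institutions
x.limits <- range(HtWt$cm)
y.limits <- range(HtWt$kg)

# 2) Plot the first data set and its regression line
plot(kg ~ cm, data = UWICE, xlim = x.limits, ylim = y.limits,
     col = "purple", pch = 16)
abline(lm(kg ~ cm, data = UWICE), col = "purple")

# 3) Add the second data set to the same plot
points(SFS$cm, SFS$kg, col = "blue", pch = 15)

# 4) Add its regression line in the matching color
abline(lm(kg ~ cm, data = SFS), col = "blue")
```

Setting the limits from the full data first matters because points() cannot enlarge a plot that has already been drawn.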

  9. Add a Legend: 'legend' is not an argument in plot(), so we add it as a separate line of code: legend(x="topleft", legend=c("UWICE","SFS"), fill=c("purple","blue"), inset=.02, bty="n"). What does each of these arguments mean?

  10. You can put all the data in a single graph with different institutions represented by different colors. METHOD #2: 1) Figure out the lower & upper limits for x- and y-axes 2) Plot the full data, setting xlim() and ylim() and 2 colors & pch values* 3) Add the lines (abline()), using colors matching each institution's points 4) Add the legend. * col=c("blue","purple")[HtWt$institute], pch=c(16,15)[HtWt$institute]
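
A sketch of Method #2 follows. Indexing a vector of colors by HtWt$institute only works when institute is a factor, and the color each institution gets depends on the order of the factor levels. To make that explicit, this sketch sets the levels by hand (UWICE first), which is an assumption and the reason the color vector here is ordered purple, blue. The data are made up.

```r
# Hypothetical stand-in for the HtWt data (values are made up for illustration)
HtWt <- data.frame(
  institute = rep(c("UWICE", "SFS"), each = 5),
  cm        = c(172, 158, 180, 162, 174, 155, 175, 160, 170, 165),
  kg        = c(70, 52, 82, 58, 71, 50, 74, 55, 69, 60)
)

# Make institute a factor with a known level order: 1 = UWICE, 2 = SFS
HtWt$institute <- factor(HtWt$institute, levels = c("UWICE", "SFS"))

# A factor used as an index picks element 1 for UWICE rows, element 2 for SFS rows
point.cols <- c("purple", "blue")[HtWt$institute]
point.pch  <- c(16, 15)[HtWt$institute]

# Plot everything at once
plot(kg ~ cm, data = HtWt,
     xlim = range(HtWt$cm), ylim = range(HtWt$kg),
     col = point.cols, pch = point.pch)

# One regression line per institution, in the matching color
abline(lm(kg ~ cm, data = HtWt, subset = institute == "UWICE"), col = "purple")
abline(lm(kg ~ cm, data = HtWt, subset = institute == "SFS"),   col = "blue")

# Legend as given on the legend slide
legend(x = "topleft", legend = c("UWICE", "SFS"),
       fill = c("purple", "blue"), inset = .02, bty = "n")
```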

  11. How would you change the legend boxes to match the points in the scatterplot (UWICE = circle, SFS = square)?
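
One possible answer is to replace fill with pch and col, so the legend draws plotting symbols instead of filled boxes. The comments also describe what each legend() argument does. The data are made up, and the color and symbol assignments are assumptions consistent with the slide (UWICE = circle, SFS = square).

```r
# Hypothetical data (values are made up for illustration)
cm        <- c(172, 158, 180, 155, 175, 160)
kg        <- c(70, 52, 82, 50, 74, 55)
institute <- factor(rep(c("UWICE", "SFS"), each = 3), levels = c("UWICE", "SFS"))

plot(cm, kg,
     col = c("purple", "blue")[institute],
     pch = c(16, 15)[institute])

lg <- legend(
  x      = "topleft",             # position keyword (could also be x/y coordinates)
  legend = c("UWICE", "SFS"),     # the text labels, in order
  pch    = c(16, 15),             # symbols: 16 = filled circle, 15 = filled square
  col    = c("purple", "blue"),   # symbol colors, matching the points
  inset  = .02,                   # gap from the plot edge, as a fraction of the plot region
  bty    = "n"                    # box type "n" = no box around the legend
)
```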

  12. Exporting Data. To transfer a matrix or data frame via clipboard: write.table(HtsWts,"clipboard",sep="\t") then in excel, paste. ...to a tab-delimited text file: write.table(HtsWts, "c:/mydata.txt", sep="\t")
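
A sketch of the file export that runs on any platform, writing to a temporary file rather than c:/mydata.txt and reading the file back as a check. The "clipboard" target is generally a Windows feature, so it is left out here. The data, the row.names = FALSE and quote = FALSE settings, and the temporary file are assumptions for illustration.

```r
# Hypothetical data frame (values are made up for illustration)
HtsWts <- data.frame(
  sex = c("M", "F", "M", "F"),
  cm  = c(172, 158, 180, 162),
  kg  = c(70, 52, 82, 58)
)

# Write a tab-delimited text file; tempfile() stands in for a real path
out.file <- tempfile(fileext = ".txt")
write.table(HtsWts, out.file, sep = "\t", row.names = FALSE, quote = FALSE)

# Read it back to confirm the export worked
check <- read.delim(out.file)
print(check)
```

Without row.names = FALSE, write.table() adds a column of row names that has no header, which can shift the column headings when the file is opened in a spreadsheet.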

  13. Intro to Statistics

  14. DESCRIPTIVE STATISTICS (Sample Statistic): What is the mean height of our sample of persons? INFERENTIAL STATISTICS (Population Parameter, e.g., Height): Is the mean height of our sample a good measure of true population height?

  15. DESCRIPTIVE STATISTICS: N; Mean (our best estimate of the true population mean); Standard Deviation (our best estimate of the true population variability); Standard Error; Confidence Interval

  16. N; Mean (our best estimate of the true population mean); Standard Deviation (our best estimate of the true population variability); Standard Error; Confidence Interval. INFERENTIAL STATISTICS: How good is our estimate (from the sample) of the true population mean?

  17. Descriptive Statistics: Summarize the data we have collected: • mean, median, mode • range, variance, standard deviation, interquartile range • graphical summaries of the data (e.g., histogram, boxplot) Why do we need it? It’s difficult to just look at raw data and understand what they mean
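
The summaries listed on this slide map onto base R functions roughly as sketched below. The heights are made up. R has no built-in function for the statistical mode (mode() reports the storage type), so the sketch finds it with table(), which is one common workaround.

```r
# Hypothetical heights in cm (values are made up for illustration)
heights <- c(172, 158, 180, 162, 174, 155, 175, 160, 170, 165, 170)

# Measures of centre
height.mean   <- mean(heights)
height.median <- median(heights)
# Most frequent value: count each value, take the one with the highest count
height.mode   <- as.numeric(names(which.max(table(heights))))

# Measures of spread
height.range <- range(heights)   # smallest and largest value
height.var   <- var(heights)     # variance
height.sd    <- sd(heights)      # standard deviation = sqrt(variance)
height.iqr   <- IQR(heights)     # interquartile range

print(c(mean = height.mean, median = height.median, mode = height.mode,
        sd = height.sd, IQR = height.iqr))

# Graphical summaries
hist(heights)
boxplot(heights)
```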

  18. Inferential Statistics: Use a sample of data to make conclusions and predictions about the population we sampled from. Often used to determine if there are differences between populations or if a ‘treatment’ affected a population. Why do we need it? We often don’t have the time or money to collect data from the entire population we are interested in. For inferential statistics, conclusions are only reliable if we sampled properly!

  19. Truth + Chance = Sample Statistic We use the sample data to make our best prediction about the population (the data we don't have), and then quantify the chance that we’re wrong (standard errors & confidence intervals) But no matter how fancy the statistics or how pretty the graphs, conclusions are only reliable if we sampled properly!

  20. What is a Normal Distribution?

  21. Does ‘Normal’ Exist?

  22. Does ‘Non-Normal’ Exist?

  23. When are Data Non-Normal? When multiple processes or populations are combined in a single data set... Heights of children aged 5 - 12

  24. When are Data Non-Normal? When the population has many values close to zero or some other natural limit...

  25. When are Data Non-Normal? When some extreme values skew the population... (here, also bounded by zero) COMPANY EXECUTIVES THE SUPER RICH

  26. When are Data Non-Normal? When the data follow a process that naturally generates non-normal distributions. EXPONENTIAL DISTRIBUTION: Population growth. POISSON DISTRIBUTION: Counts of rare events, e.g., accidents (lower bound of zero). BINOMIAL DISTRIBUTION: Proportion (%) data

  27. What Can We Do With Non-Normal Data? • Check the data for errors; then • Transform data to approximate a normal distribution; OR • Apply nonparametric statistics
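
The last two options might look like the sketch below, which compares two simulated right-skewed samples. The log transform and the Wilcoxon rank-sum test (wilcox.test) are used here as examples of a transformation and a nonparametric test; the slide does not name specific ones, so both choices are assumptions, as are the simulated data.

```r
set.seed(1)

# Simulated right-skewed data (log-normal), bounded below by zero
group.a <- rlnorm(30, meanlog = 1.0, sdlog = 1)
group.b <- rlnorm(30, meanlog = 1.5, sdlog = 1)

# Option 1: transform toward normality, then use a parametric test
log.test <- t.test(log(group.a), log(group.b))

# Option 2: leave the data as they are and use a rank-based (nonparametric) test
rank.test <- wilcox.test(group.a, group.b)

print(log.test$p.value)
print(rank.test$p.value)
```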

  28. N; Mean (our best estimate of the true population mean); Standard Deviation (our best estimate of the true population variability); Standard Error; Confidence Interval. INFERENTIAL STATISTICS: How good is our estimate (from the sample) of the true population mean?

  29. What is the Standard Error? • Standard deviation of the sample means • Tells us if the sample mean is a good estimate of the true population mean • Used to calculate the 95% confidence interval Standard Error (SE): sd / sqrt(n)
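
The formula on this slide can be wrapped in a small helper, sketched below. The function name standard.error and the heights are hypothetical.

```r
# Standard error of the mean: sd / sqrt(n)
standard.error <- function(x) {
  sd(x) / sqrt(length(x))
}

# Hypothetical heights in cm (values are made up for illustration)
heights <- c(172, 158, 180, 162, 174, 155, 175, 160, 170, 165)

se <- standard.error(heights)
print(se)
```

Because n sits under a square root, a sample four times larger (with the same standard deviation) halves the standard error.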

  30. What is the 95% Confidence Interval? • If we sample from the same population many times, 95% of the samples will have confidence intervals that include the true population parameter • The true population parameter (e.g., mean) is likely to be within the 95%CI of a sample (if the samples are unbiased). • A large 95%CI tells us that our sample mean is not a very reliable estimate of the true mean. • With large 95%CIs, it is hard to know from the samples whether or not two populations are truly different. 95% Confidence interval: mean ± 1.96 × SE
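
The first bullet can be checked by simulation, as in the sketch below: draw many samples from a population with a known mean, build a 95%CI from each, and count how often the interval contains the true mean. The population values (mean 165, sd 8), the sample size, and the number of repeats are arbitrary assumptions. The 1.96 multiplier is a large-sample approximation, so the coverage comes out close to, rather than exactly, 95%.

```r
set.seed(42)

true.mean <- 165    # assumed population mean
true.sd   <- 8      # assumed population standard deviation
n         <- 50     # size of each sample
n.samples <- 2000   # number of repeated samples

# For each sample: does mean +/- 1.96 * SE contain the true mean?
covered <- replicate(n.samples, {
  x     <- rnorm(n, mean = true.mean, sd = true.sd)
  se    <- sd(x) / sqrt(n)
  lower <- mean(x) - 1.96 * se
  upper <- mean(x) + 1.96 * se
  lower <= true.mean && true.mean <= upper
})

# Proportion of intervals that captured the true mean
coverage <- mean(covered)
print(coverage)
```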

  31. Are plant heights significantly different between control & fertilized treatments? Control: mean = 17.2, s = 2.1. Fertilized: mean = 18.9, s = 2.2. With N=30: Control 17.2 (95%CI 16.4 – 18.0), Fertilized 18.9 (95%CI 18.1 – 19.7): SIGNIFICANTLY DIFFERENT. With N=5: Control 17.2 (95%CI 14.6 – 19.8), Fertilized 18.9 (95%CI 16.2 – 21.6): NOT SIGNIFICANTLY DIFFERENT
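
The intervals on this slide can be recomputed from the means, standard deviations, and sample sizes it gives. The slide does not say how the intervals were calculated; the sketch below assumes the t-distribution multiplier, qt(0.975, n - 1), in place of 1.96, which reproduces all four intervals to one decimal place. The helper names ci95 and overlaps are hypothetical.

```r
# 95% CI for a mean from summary statistics, using the t-distribution
ci95 <- function(m, s, n) {
  half.width <- qt(0.975, df = n - 1) * s / sqrt(n)
  c(lower = m - half.width, upper = m + half.width)
}

# Do two intervals overlap?
overlaps <- function(a, b) {
  a[["lower"]] <= b[["upper"]] && b[["lower"]] <= a[["upper"]]
}

control.30    <- ci95(17.2, 2.1, 30)
fertilized.30 <- ci95(18.9, 2.2, 30)
control.5     <- ci95(17.2, 2.1, 5)
fertilized.5  <- ci95(18.9, 2.2, 5)

print(round(control.30, 1))      # 16.4 18.0
print(round(fertilized.30, 1))   # 18.1 19.7
print(round(control.5, 1))       # 14.6 19.8
print(round(fertilized.5, 1))    # 16.2 21.6

print(overlaps(control.30, fertilized.30))   # FALSE: intervals are separate
print(overlaps(control.5, fertilized.5))     # TRUE: intervals overlap
```

Comparing intervals by eye is a conservative rule of thumb: separate intervals indicate a difference, but overlapping intervals do not by themselves rule one out, so a t-test is still the formal check.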

  32. Exploratory Data Analysis: Before jumping into statistical tests and p-values, LOOK AT YOUR DATA in spreadsheets and graphs to identify: • data errors/outliers • if your data meet assumptions of parametric statistical tests • interesting patterns. Before you do any statistical tests, you should already have an idea what the results will be

  33. Errors in Data Collection / Entry • Decimal in wrong place • Same category spelled many ways • Data collected in different measurement units • Forgot to collect some data • Numbers typed incorrectly when transferred from paper (sloppy handwriting, etc.)

  34. OUTLIER: A data point that is much smaller or much larger than other data in the sample. Why do we care? A few outliers can change the sample mean, increase the variance of sample data, and change the p-value of a parametric statistical test. How do we find potential outliers in our data? • Look at data ranges, histograms & boxplots • For correlation & regression analyses, look at scatterplots
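
A sketch of the range and boxplot checks, using made-up weights in which one value (710) imitates a misplaced decimal point. boxplot.stats() returns the points that a boxplot would draw beyond its whiskers; using it as the outlier flag is an assumption, one of several possible rules.

```r
# Hypothetical weights in kg; 710 imitates "71.0" typed with the decimal in the wrong place
weights <- c(52, 55, 58, 60, 66, 69, 70, 71, 74, 82, 710)

# The range alone already looks suspicious
print(range(weights))

# Graphical checks
hist(weights)
boxplot(weights)

# Values beyond the boxplot whiskers (more than 1.5 x the box length from the box)
flagged <- boxplot.stats(weights)$out
print(flagged)   # 710
```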

  35. For correlations/regressions, outliers may fall within the normal range of data... but plotting the scatterplot reveals outliers (p = 0.06)

  36. A single outlier can change the regression equation and the significance of the relationship (p = 0.06 vs. p = 0.001)
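
The effect can be demonstrated by fitting the same regression with and without one point, as sketched below. The data are made up and are not the data behind the slide's p-values; here the extra point lies far from the rest, which makes the effect easy to see. The helper name slope.p is hypothetical.

```r
# Hypothetical data with only a weak relationship between x and y
x <- 1:10
y <- c(5, 3, 6, 4, 7, 5, 8, 4, 6, 5)

# The same data plus one extreme point
x.out <- c(x, 20)
y.out <- c(y, 20)

fit.without <- lm(y ~ x)
fit.with    <- lm(y.out ~ x.out)

# Helper: p-value of the slope (second row of the coefficient table)
slope.p <- function(fit) summary(fit)$coefficients[2, "Pr(>|t|)"]

p.without <- slope.p(fit.without)
p.with    <- slope.p(fit.with)
print(c(without = p.without, with = p.with))

# Plot both fits to see how one point pulls the line
plot(x.out, y.out)
abline(fit.without, col = "blue")
abline(fit.with, col = "red")
```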

  37. WHAT SHOULD I DO WITH OUTLIERS? Are data entered correctly? Are data from the population of inference? If NO to either: remove outliers before analyzing data. If YES to both: they are true outliers.

  38. WHAT SHOULD I DO WITH TRUE OUTLIERS?* Options: transform data (if appropriate); use nonparametric statistics; analyze data with and without outliers. Do study conclusions change? NO: keep outliers & report results. YES: report & discuss both results, or remove outliers & report results, but discuss your justification for removing the outliers. * ALWAYS KEEP GOOD RECORDS OF YOUR DATA EXPLORATION ACTIVITIES AND ANY CHANGES YOU MAKE TO THE ORIGINAL DATA!

  39. OUTLIERS It is wrong to remove outliers from analyses just because they don't fit with the other data! Outliers can tell us interesting information about a population—conduct more research to understand what causes these unusual data.

  40. Exploratory Data Analysis: Before jumping into statistical tests and p-values, LOOK AT YOUR DATA in spreadsheets and graphs to identify: • data errors/outliers • if your data meet assumptions of parametric statistical tests • interesting patterns. Before you do any statistical tests, you should already have an idea what the results will be

  41. Do data come from a normally distributed population? • Sample data are assumed to represent the distribution of the population. Non-normal data are not 'wrong', they just represent processes that naturally generate other types of distributions. • With small sample sizes, it can be difficult to tell if data come from a normally distributed population. Consider what you know about the underlying process.

  42. Evaluating Normality Understand which processes generate non-normal data, then... • Visual assessment: • histograms • normal Q-Q plots • Normality tests: • Shapiro-Wilk (shapiro.test()) • Anderson-Darling (from pkg ‘nortest’) • Pearson chi-square (from pkg ‘nortest’) • Kolmogorov-Smirnov (from pkg ‘nortest’)
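
The visual checks and the Shapiro-Wilk test are all in base R and might be used as sketched below, on one simulated normal sample and one simulated skewed sample. The simulated data are assumptions for illustration; the tests from the 'nortest' package are left out so the sketch runs without extra packages.

```r
set.seed(7)

# Simulated samples: one normal, one strongly right-skewed (exponential)
normal.sample <- rnorm(200, mean = 165, sd = 8)
skewed.sample <- rexp(200, rate = 1)

# Visual assessment: histogram and normal Q-Q plot
hist(skewed.sample)
qqnorm(skewed.sample)   # points bend away from a straight line when data are non-normal
qqline(skewed.sample)   # reference line for comparison

# Shapiro-Wilk test: a small p-value is evidence AGAINST normality
sw.normal <- shapiro.test(normal.sample)
sw.skewed <- shapiro.test(skewed.sample)
print(sw.normal$p.value)
print(sw.skewed$p.value)
```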

  43. Exploratory Data Analysis: Before jumping into statistical tests and p-values, LOOK AT YOUR DATA in spreadsheets and graphs to identify: • data errors/outliers (for scatterplots, graph it!) • if your data meet assumptions of parametric statistical tests • interesting patterns. Before you do any statistical tests, you should already have an idea what the results will be

  44. T-Test

  45. T-Test: For determining if population means are different. Two-sample t-test: Compare the means of two independent groups (don’t need same sample sizes), e.g., is the mean height of Bhutanese college men different from that of USA college men? Paired t-test: Compare the means of paired groups, e.g., is the mean weight of USA college men different before and after a 3-month exercise & diet program?
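
A paired t-test in R uses the same t.test() function with paired = TRUE, as sketched below with made-up before/after weights for eight people. The sketch also shows that the paired test is the same as a one-sample t-test on the individual differences, which is why the two measurements must be listed in the same person order.

```r
# Hypothetical weights (kg) of the same 8 people before and after a program
before <- c(82, 75, 90, 68, 77, 85, 73, 80)
after  <- c(79, 74, 86, 67, 75, 81, 72, 78)

# Paired t-test: each 'before' value is matched with the 'after' value in the same position
paired.result <- t.test(before, after, paired = TRUE)
print(paired.result)

# Equivalent: one-sample t-test asking whether the mean difference is zero
diff.result <- t.test(before - after)
print(diff.result$p.value)
```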

  46. Is the height of shrubs different in grazed (G) and ungrazed (U) areas?

  47. TWO SAMPLE T-TEST. STEP ONE: Look at the data! (Make the point plot) STEP TWO: Do the t-test: t.test(grazed, ungrazed)
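
The two steps might look like the sketch below. The shrub heights are made up, and stripchart() is used for the point plot, which is an assumption since the slide does not name a plotting function.

```r
# Hypothetical shrub heights in cm (values are made up for illustration)
grazed   <- c(42, 38, 45, 40, 36)
ungrazed <- c(55, 60, 52, 58, 63)

# STEP ONE: look at the data, one column of points per group
stripchart(list(grazed = grazed, ungrazed = ungrazed),
           vertical = TRUE, method = "jitter", pch = 16,
           ylab = "Shrub height (cm)")

# STEP TWO: two-sample t-test (by default R does not assume equal variances)
result <- t.test(grazed, ungrazed)
print(result)
```

If the points for the two groups barely overlap in the plot, as here, a small p-value from the test should come as no surprise.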
