
Statistics Primer

Statistics Primer. Xiayu (Stacy) Huang Bioinformatics Shared Resource Email: bsr_help@sanfordburnham.org Sanford | Burnham Medical Research Institute. Outline. Overview of basic statistics Introduction Descriptive statistics Inferential statistics


Presentation Transcript


  1. Statistics Primer Xiayu (Stacy) Huang Bioinformatics Shared Resource Email: bsr_help@sanfordburnham.org Sanford | Burnham Medical Research Institute

  2. Outline • Overview of basic statistics • Introduction • Descriptive statistics • Inferential statistics • Most common statistical tests and their applications • T test • Power analysis using t test

  3. What is statistics? • On the American Statistical Association (ASA) website, statistics is defined as the science of collection, analysis, interpretation and presentation of data • Using statistics to make decisions can be a double-edged sword • In the 1980s, Marriott conducted an extensive survey of potential customers on their attitudes about current hotel offerings. After analyzing the data, the company launched Courtyard by Marriott, which has been a huge success • Coca-Cola performed a major consumer study in 1985 and, based on the results, decided to reformulate Coke, its flagship drink. After a huge public outcry, Coca-Cola had to backtrack and bring the original formulation back to market

  4. History of statistics (17th-18th century, 19th century, 20th century) • Jakob Bernoulli: Bernoulli number, Bernoulli trial, Bernoulli process • Thomas Bayes: Bayes theorem • Carl Friedrich Gauss: Gaussian distribution • Karl Pearson: Pearson correlation, chi-square distribution • Ronald Aylmer Fisher: ANOVA, maximum likelihood • William Gosset: Student’s t

  5. Why is statistics important to biologists? • Designing experiments: how many replicates for my microarray experiment? No replicates = no statistics? • Analyzing biological data and understanding analysis results: identifying outliers, normalization/transformation, statistical tests, DEGs, etc. • Preparing manuscripts and grant applications

  6. Study scheme • Study hypothesis • Design study • Conduct study and collect data • Data analysis: choose statistical test; summarize data using descriptive statistics; hypothesis testing using inferential statistics (compute test statistic, compute p-value, compare p-value and α) • Make conclusions

  7. Branches of statistics • Descriptive statistics (summary statistics): summarize data graphically or numerically; lead to hypothesis generation • Inferential statistics: distinguish true differences from random variation; allow hypothesis testing

  8. Types of data

  9. Descriptive statistics: central tendency • Mean: average • Median: middle value of sorted data • Mode: most frequently observed value • Example (age data): Mean = (24 + 27 + … + 24)/13 = 24.8 • Median • Mode is 24 with a frequency of 3
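
The three measures above can be computed with Python's standard `statistics` module. This is a minimal sketch; the `ages` list is hypothetical illustration data, not the 13-value data set shown on the slide.

```python
import statistics

# Hypothetical ages for illustration only (not the slide's data set).
ages = [22, 23, 24, 24, 24, 25, 26, 27, 29]

mean_age = statistics.mean(ages)      # arithmetic average
median_age = statistics.median(ages)  # middle value of the sorted data
mode_age = statistics.mode(ages)      # most frequently observed value

print(f"mean={mean_age:.1f}, median={median_age}, mode={mode_age}")
```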

  10. Descriptive statistics: dispersion • Range • Sample variance (s²) / standard deviation (s) • Values more than two standard deviations from the mean can be considered “outliers” (> mean + 2s = 24.8 + 2 × 2.2 = 29.2 or < mean − 2s = 24.8 − 2 × 2.2 = 20.4) • Standard error of the mean (SEM) • Example (age data): Range = highest value − lowest value = 29 − 22 = 7
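
A short sketch of the dispersion measures listed above, again using only the standard library. The `ages` values are hypothetical, and SEM is computed as s divided by the square root of n, the usual definition.

```python
import math
import statistics

# Hypothetical ages for illustration only (not the slide's data set).
ages = [22, 23, 24, 24, 24, 25, 26, 27, 29]

value_range = max(ages) - min(ages)   # highest value - lowest value
s2 = statistics.variance(ages)        # sample variance (n - 1 denominator)
s = statistics.stdev(ages)            # sample standard deviation
sem = s / math.sqrt(len(ages))        # standard error of the mean

# Two-standard-deviation rule of thumb for flagging possible outliers.
mean_age = statistics.mean(ages)
lower, upper = mean_age - 2 * s, mean_age + 2 * s
flagged = [x for x in ages if x < lower or x > upper]

print(f"range={value_range}, s2={s2:.2f}, s={s:.2f}, SEM={sem:.2f}")
print(f"outlier limits: {lower:.1f} to {upper:.1f}, flagged: {flagged}")
```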

  11. Descriptive statistics: data distribution • Histogram (x: bin, y: frequency) • Graphical representation showing the distribution of data • Summary graph showing how many data points fall in various ranges • Frequency table: histogram / frequency distribution • Percentage table: histogram / probability distribution
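
The frequency and percentage tables behind a histogram can be sketched as below. The data and the bin width of 2 are assumptions chosen for illustration; the text bars stand in for a plotted histogram.

```python
from collections import Counter

# Hypothetical ages for illustration only (not the slide's data set).
ages = [22, 23, 24, 24, 24, 25, 26, 27, 29]
bin_width = 2  # assumed bin width

# Map each value to the lower edge of its bin, then count per bin.
counts = Counter((x // bin_width) * bin_width for x in ages)
n = len(ages)

for start in sorted(counts):
    freq = counts[start]
    pct = 100 * freq / n
    print(f"{start}-{start + bin_width - 1}: {freq:2d} ({pct:4.1f}%) " + "#" * freq)
```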

  12. Descriptive statistics: data distribution • Different data distributions • Approximately normal distribution, e.g. height of people, length of dogs • Right-skewed distribution, e.g. FC of microarray data • Left-skewed distribution, e.g. distribution of age at retirement

  13. Normal (or Gaussian) distribution mean=median=mode • Bell-shaped curve • Symmetrical about mean • Mean, median and mode are equal • ~68% data points fall within 1 sd of mean • ~95% data points fall within 2 sd of mean • ~99.7% data points fall within 3 sd of mean
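
The 68/95/99.7 percentages above can be checked numerically from the standard normal cumulative distribution function. A minimal sketch using `statistics.NormalDist` from the Python standard library:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Fraction of values within k standard deviations of the mean.
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, frac in within.items():
    print(f"within {k} sd: {100 * frac:.1f}%")
```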

  14. Installing GraphPad Prism • You can install Prism on Institute-supplied computers as well as home and personal computers: http://graphpad.com/paasl/index.cfm?sitecode=burnhm • Serial numbers: for either the Macintosh or the Windows version, contact IT (support@sanfordburnham.org) to get the serial number

  15. Calculating descriptive statistics in excel

  16. Calculating descriptive statistics in prism

  17. Calculating descriptive statistics in prism

  18. Graphically displaying descriptive statistics • Histogram • Mean error bar plot • Line plot with or without error bars

  19. Graphically displaying descriptive statistics in Prism Histogram and frequency distribution Mean error bar plot

  20. Graphically displaying descriptive statistics in Prism Group line plot Group line plot without error bar Group line plot with error bar

  21. Choosing the right measures of descriptive statistics • Normal distribution: mean and standard deviation • Skewed distribution: transform data to a normal distribution

  22. Outline • Overview of basic statistics • Brief introduction • Descriptive statistics • Inferential statistics • Most common statistical tests and their applications • T test • Power analysis using t test

  23. Inferential statistics • Parametric: interval or ratio measurements; continuous variables; usually assumes data are normally distributed • Nonparametric: ordinal or nominal measurements; discrete variables; makes no assumption about how data are distributed

  24. Inferential statistics: hypothesis • Null hypothesis (H0), e.g. new drug effect = old drug effect; tumor growth of MT = tumor growth of WT • Alternative hypothesis (HA): is the opposite of the null hypothesis; is generally the hypothesis that the researcher believes to be true, e.g. new drug effect ≠ or > old drug effect; tumor growth of MT ≠ or < tumor growth of WT

  25. Inferential statistics: one and two sided tests • Hypothesis tests can be one or two sided (tailed) • One sided tests are directional: H0: new drug effect ≤ old drug effect; HA: new drug effect > old drug effect • Two sided tests are not directional: H0: new drug effect = old drug effect; HA: new drug effect ≠ old drug effect

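  26. Inferential statistics: type I and type II errors • 2 × 2 table of “Actual situation” (no difference, H0; difference, HA) against “Measured” (no difference; difference) • Type I error (false positive): a difference is measured when there is actually none • Type II error (false negative): no difference is measured when there actually is one • Example: FOB screening (bowel cancer), “Actual situation” (−, +) against “Measured” (−, +), with counts 1830, 200, 2000, 30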

  27. Inferential statistics: type I and type II errors • Control type I and type II errors • Inverse relationship between type I and type II errors • Make a choice about which error to control • e.g. controlling type I error (FP) is more important for microarray data than type II error (FN) • e.g. controlling type II error (FN) is more important for a cancer screening test than type I error (FP) • How to choose type I and type II error rates for a statistical test? • Common choices (α = 5%, β = 20%) • Exploratory study (α = 10%, β = 10%) • Confirmatory study (α = 1%, β = 10%)

  28. Inferential statistics: P-value • The probability of observing a difference at least as large as the one measured, assuming the null hypothesis is true • Computed from the test statistic • The P-value is not itself the false positive rate; the false positive (type I error) rate is set by the cutoff α • A P-value below the cutoff (α) is referred to as “statistically significant”

  29. Inferential statistics: power • Power (1 − β, aka true positive rate (TP)): probability of detecting a scientifically significant difference when it does exist • Power depends on: sample size (n); standard deviation (s); size of the difference you want to detect (δ), where δ relative to s gives the effect size; false positive rate (α)
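
How power moves with n, s, δ and α can be explored with a rough calculation. The sketch below assumes a two-sided, two-sample comparison with equal group sizes and uses a normal approximation rather than the exact t-based calculation; `approx_power` is a hypothetical helper name, and the far rejection tail is ignored.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n_per_group, delta, s, alpha=0.05):
    """Approximate power of a two-sided, two-sample test (normal approximation)."""
    z = NormalDist()
    d = delta / s                        # effect size: difference in units of s
    z_crit = z.inv_cdf(1 - alpha / 2)    # two-sided critical value
    return z.cdf(d * sqrt(n_per_group / 2) - z_crit)


# Larger samples or larger effects give higher power.
for n in (10, 20, 64, 100):
    print(f"n per group={n:3d}, power≈{approx_power(n, delta=0.5, s=1.0):.2f}")
```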

  30. Study scheme • Study hypothesis • Design study • Conduct study and collect data • Data analysis: choose statistical test; calculate and display descriptive statistics; hypothesis testing using inferential statistics (compute test statistic, compute p-value, compare p-value and α) • Make conclusions

  31. How to choose an appropriate statistical test? • Type of data • Quantitative • Qualitative • Type of research question • Association • Correlation • Comparison • Data structure • Independent • Paired • Matched

  32. Statistical test decision making tree For qualitative or non-numerical data For quantitative or numerical data

  33. Statistical test decision making tree Relationship between variables Two sample comparison Multiple sample comparison

  34. Outline • Overview of basic statistics • Brief introduction • Descriptive statistics • Inferential statistics • Most common statistical tests and their applications • T test • Power analysis using t test

  35. Student’s t test Guinness employee William Sealy Gosset published the 'Student's t-test' in 1908

  36. Types of t test • One sample t test: tests whether a sample mean differs significantly from a given known mean • Unpaired t test: tests whether two independent sample means differ significantly • Paired t test: tests whether two dependent sample means differ significantly (e.g. means before and after treatment for the same set of patients)

  37. Application of t test in biology • Microarray experiment: WT vs. MT • Proteomics experiment: WT vs. MT • Biological reps and technical reps • You need at least two replicates in each condition to do a t test; otherwise the t test is invalid and you won’t have statistics

  38. Two sample unpaired t test • Assumptions: data are approximately normally distributed; the samples have been independently and randomly selected; the groups being compared have similar variances • Hypothesis (two sided or one sided) • Test statistic, defined in terms of: sample means, population means, sample standard deviations, sample sizes, pooled sample variance
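
The formula itself did not survive in this transcript. Assuming the slide showed the standard pooled-variance (equal-variance) form, which matches the quantities listed above, the test statistic can be written as follows. The symbols are the conventional ones and are an assumption here: sample means x̄, population means μ, sample standard deviations s, sample sizes n and pooled variance s_p².

```latex
t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}
         {s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}
```

Under the null hypothesis μ1 − μ2 = 0, and t is compared against a t distribution with n1 + n2 − 2 degrees of freedom.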

  39. Sample data • 1st question to be answered: will the two treatments have different effects on patients’ remission time from cancer?

  40. Summarizing sample data using descriptive statistics

  41. Hypothesis testing of sample data using inferential statistics Step1: Choosing an appropriate statistical test Step2: Performing statistical test in software Step3: Making conclusions

  42. Statistical test decision making tree

  43. Two sample t test in Prism-normality check

  44. Two sample t test in Prism

  45. Two sample t test in excel
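
For readers without Prism or Excel at hand, the unpaired, equal-variance two sample t test can also be sketched in plain Python. Everything here is illustrative: the remission times are made-up numbers, not the presentation's sample data, and `pooled_t_test` and `two_sided_p` are hypothetical helper names. The p-value is obtained by numerically integrating the t density, a simple stand-in for what a statistics package computes.

```python
import math
import statistics


def pooled_t_test(x, y):
    """Return (t, df) for an unpaired t test assuming equal variances."""
    n1, n2 = len(x), len(y)
    df = n1 + n2 - 2
    # Pooled sample variance: weighted average of the two sample variances.
    sp2 = ((n1 - 1) * statistics.variance(x) + (n2 - 1) * statistics.variance(y)) / df
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, df


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on the t density (steps must be even)."""
    c = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(df * math.pi)

    def pdf(u):
        return c * (1 + u * u / df) ** (-(df + 1) / 2)

    b = abs(t)
    h = b / steps
    total = pdf(0.0) + pdf(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3                 # probability mass between 0 and |t|
    return max(0.0, 1 - 2 * area)


# Hypothetical remission times (weeks) for two treatments.
treatment_a = [10, 12, 9, 14, 11, 13]
treatment_b = [15, 17, 14, 18, 16, 13]

t, df = pooled_t_test(treatment_a, treatment_b)
p = two_sided_p(t, df)
alpha = 0.05
print(f"t={t:.3f}, df={df}, p={p:.4f}, significant at alpha={alpha}: {p < alpha}")
```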

  46. Power analysis using two sample t test • 2nd question to be answered: how many patients do we need in order to detect a significant difference between two treatments? • [Diagram: sample size N in relation to α, β, effect size δ/s, the test used, and K:1 allocation (efficiency, imbalance)]
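
A rough answer to the sample size question can be sketched with the usual normal-approximation formula, n per group ≈ 2 × ((z for 1 − α/2) + (z for 1 − β))² / (δ/s)². This assumes a two-sided test with equal group sizes (1:1 allocation); `n_per_group` is a hypothetical helper name. An exact t-based calculation, such as the one G*Power performs, typically gives a slightly larger n.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, s, alpha=0.05, beta=0.20):
    """Approximate patients needed per group (two-sided test, 1:1 allocation)."""
    z = NormalDist()
    d = delta / s                        # effect size
    z_alpha = z.inv_cdf(1 - alpha / 2)   # controls the false positive rate
    z_beta = z.inv_cdf(1 - beta)         # controls the false negative rate
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)


# Smaller effect sizes require many more patients.
for d in (0.2, 0.5, 0.8, 1.0):
    print(f"effect size {d}: about {n_per_group(delta=d, s=1.0)} per group")
```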

  47. Power analysis of t test in G*power

  48. Power analysis of t test in G*power

  49. Basic statistics tools • Statistics software and packages: 1. Excel and add-ins: EZAnalyze, Analysis ToolPak 2. Prism (supported by our institute) 3. SPSS, Statistica (commercial) 4. SAS (commercial) and R 5. G*Power • Basic statistics books: 1. Intro Stats, SDSU, 2nd edition, Deveaux, Velleman, Bock 2. Choosing and Using Statistics: A Biologist's Guide 3. Introduction to Statistics for Biology 4. Biostatistical Analysis, fifth edition, Jerrold H. Zar • Statistics videos: 1. http://www.microbiologybytes.com/maths/videos 2. http://www.youtube.com: descriptive statistics, basic statistics, install 2007 Excel data analysis add-ins…

  50. Next..... • My presentation will be posted on website: http://bsrweb.burnham.org/ • I am located in building 10, Office 2405, ext 3916 • Feel free to come or call or send e-mail to ask questions (xyhuang@sanfordburnham.org) • Group email: bsr_help@sanfordburnham.org
