
Biostat 200 Introduction to Biostatistics

Biostat 200 Introduction to Biostatistics . Lecture 1. Course instructors. Judy Hahn, M.A., Ph.D. Judy.hahn@ucsf.edu (415) 206-4435 TAs Michelle Odden, Ph.D., M.S. Megumi Okumura, M.D. Maya Vijayaraghavan, M.D. Robin Wallace. M.D. The details. Lectures: Tuesdays 10:30-12:30


Presentation Transcript


  1. Biostat 200 Introduction to Biostatistics

  2. Lecture 1

  3. Course instructors • Judy Hahn, M.A., Ph.D. • Judy.hahn@ucsf.edu • (415) 206-4435 • TAs: Michelle Odden, Ph.D., M.S.; Megumi Okumura, M.D.; Maya Vijayaraghavan, M.D.; Robin Wallace, M.D.

  4. The details • Lectures: Tuesdays 10:30-12:30 • Labs: Thursday 10:30-12 • Lab 1: Room CB 6702 • Lab 2: Room CB 6704 • Office hrs: Thursday 12-1 Room CB 5715 • Course credits: 3

  5. The details • Readings • Required readings will be from Principles of Biostatistics by M. Pagano and K. Gauvreau. Duxbury. 2nd edition. • Please read the assigned chapters before lecture, and review them after lecture

  6. The details • Assignments will be posted on Thursdays and are due 1.5 weeks later, on Sunday at 5 p.m. • Data collection (Assignment 1 only) • Data analysis and interpretation • Exercises in the book • Reading and interpretation of scientific publications • You must attend Lab 1 to receive Assignment 1

  7. The details • Grading: • Homework (75%) • 5 assignments • Varying in length; each homework problem is worth a fixed number of points (usually 10) toward the final homework score • Final exam (25%) • LATE ASSIGNMENTS WILL NOT BE ACCEPTED!!!

  8. Assignments • Send to your TAs • Lab 1: Megan Okumura, Robin Wallace ticr.biostat200.1@gmail.com • Lab 2: Michelle Odden, Maya Vijayaraghavan ticr.biostat200.2@gmail.com

  9. What I do and why

  10. Course goals • Familiarity with basic biostatistics terms and nomenclature • Ability to summarize data and do basic statistical analyses using STATA • Ability to understand basic statistical analyses in published journals • Understanding of key concepts, including statistical hypothesis testing and critical quantitative thinking • Foundation for more advanced analyses

  11. Today’s topics • Variables: numerical versus categorical • Tables (frequencies) • Graphs (histograms, box plots, scatter plots, line graphs) • Required reading: Pagano Chapter 2

  12. Types of data • Data are made up of a set of variables • Categorical variables: any variable that is not numerical (values have no numerical meaning) (e.g. gender, race, drug, disease status) • Nominal variables • Ordinal variables Pagano and Gauvreau, Chapter 2

  13. Types of data • Categorical variables • Nominal variables: • The data are unordered (e.g. RACE: 1=Caucasian, 2=Asian American, 3=African American) • Binary (dichotomous) variables are a subset of these with only two categories (e.g. GENDER: 1=male, 2=female) • Ordinal variables: • The data are ordered (e.g. AGE: 1=10-19 years, 2=20-29 years, 3=30-39 years; likelihood of participating in a vaccine trial) Pagano and Gauvreau, Chapter 2

  14. Types of data • Numerical (quantitative) variables: naturally measured as numbers for which meaningful arithmetic operations make sense (e.g. height, weight, age, salary, viral load, CD4 cell counts) • Discrete variables: can be counted (e.g. number of children in household: 0, 1, 2, 3, etc.) • Continuous variables: can take any value within a given range (e.g. weight: 2974.5 g, 3012.6 g) Pagano and Gauvreau, Chapter 2

  15. Types of data • Manipulation of variables • Continuous variables can be discretized • E.g., age can be rounded to whole numbers • Continuous or discrete variables can be categorized • E.g., age categories • Categorical variables can be re-categorized • E.g., lumping from 5 categories down to 2 Pagano and Gauvreau, Chapter 2
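
The three manipulations on this slide can be sketched in Stata roughly as follows. This is a minimal illustration on made-up data: the variable names (age, educ5, age_whole, agecat, educ2), the values, and the cutpoints are assumptions, not part of the course data set.

```stata
* Hypothetical toy data: continuous age and a 5-level categorical variable
clear
input float age byte educ5
23.7 1
35.2 2
41.9 3
58.4 4
19.1 5
end

* Discretize: round continuous age to whole years
generate age_whole = round(age)

* Categorize: 10-year age bands; egen cut() codes each band by its lower bound
egen agecat = cut(age), at(10 20 30 40 50 60)

* Re-categorize: lump 5 categories down to 2
recode educ5 (1 2 = 0) (3 4 5 = 1), generate(educ2)

list
```

Values outside the cutpoints given in at() would be set to missing, so the cutpoints should span the observed range.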

  16. Frequency tables • Categorical variables are summarized by • Frequency counts – how many are in each category • Relative frequency or percent (a number from 0 to 100) • Or proportion (a number from 0 to 1) Pagano and Gauvreau, Chapter 2
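
As a small sketch of a frequency table in Stata, the tabulate command reports counts alongside percents. The data below are invented for illustration; the race codes follow the example on the earlier slide, and the label name racelbl is an assumption.

```stata
* Hypothetical toy data using the race codes from the nominal-variable example
clear
input byte race
1
1
2
3
1
end
label define racelbl 1 "Caucasian" 2 "Asian American" 3 "African American"
label values race racelbl

* One-way frequency table: Freq., Percent, and Cum. columns
tabulate race

* Proportion (a number from 0 to 1) in category 1, computed by hand
quietly count if race == 1
display r(N) / _N
```

The percent column is simply the proportion multiplied by 100.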

  17. Frequency tables • Continuous variables can be categorized in meaningful ways • Choice of cutpoints • Even intervals • Meaningful cutpoints related to a health outcome or decision • Equal percentage of the data falling into each category Pagano and Gauvreau, Chapter 2
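
The three cutpoint choices might look like this in Stata. This is a sketch under assumptions: the CD4 values are made up, the 350-cell threshold is borrowed from the histogram slide, and the new variable names (cd4_even, cd4_lt350, cd4_q) are hypothetical.

```stata
* Hypothetical CD4 cell counts
clear
input cd4count
40
120
200
349
350
500
800
1200
end

* Even intervals: 350-cell bands, each coded by its lower bound
egen cd4_even = cut(cd4count), at(0 350 700 1050 1400)

* Meaningful cutpoint: indicator for CD4 < 350 cells/mm3
generate byte cd4_lt350 = cd4count < 350 if !missing(cd4count)

* Equal percentages: quartile groups coded 1 to 4
xtile cd4_q = cd4count, nquantiles(4)

tabulate cd4_lt350
```

Which scheme is appropriate depends on the question: clinically meaningful cutpoints aid interpretation, while quantile groups guarantee similar numbers in each category.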

  18. Frequency tables Pagano and Gauvreau, Chapter 2

  19. Bar charts • General graph for categorical variables • Graphical equivalent of a frequency table • The x-axis does not have to be numerical Pagano and Gauvreau, Chapter 2

  20. Histograms • Bar chart for numerical data • The number of bins and the bin width affect the appearance of the plot and may affect its interpretation • Stata command: histogram cd4count, fcolor(blue) lcolor(black) width(50) name(cd4_by50) title(CD4 among new HIV positives at Mulago) xtitle(CD4 cell count) percent Pagano and Gauvreau, Chapter 2

  21. Histograms • This histogram has less detail, but its first bar gives us the % of persons with CD4 <350 cells/mm³ • Stata command: histogram cd4count, fcolor(blue) lcolor(black) width(350) name(cd4_by350) title(CD4 among new HIV positives at Mulago) xtitle(CD4 cell count) percent Pagano and Gauvreau, Chapter 2

  22. What does this graph tell us?

  23. Box plots • Middle line = median (50th percentile) • Middle box = 25th to 75th percentiles (interquartile range, IQR) • Bottom whisker: lowest data point at or above (25th percentile − 1.5×IQR) • Top whisker: highest data point at or below (75th percentile + 1.5×IQR) Pagano and Gauvreau, Chapter 2
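
To make the whisker rule concrete, here is a rough Stata sketch that computes the whisker endpoints by hand. The data, the variable name x, and the scalar names are all made up for illustration; it follows the definition on the slide rather than reproducing what any particular graphing routine does internally.

```stata
* Hypothetical data with one extreme value
clear
input x
1
2
3
4
5
6
7
8
9
50
end

* Quartiles and IQR from summarize, detail
quietly summarize x, detail
scalar q1  = r(p25)
scalar q3  = r(p75)
scalar iqr = q3 - q1

* Fences at 1.5*IQR beyond the box
scalar lofence = q1 - 1.5*iqr
scalar hifence = q3 + 1.5*iqr

* Whiskers end at the most extreme data points inside the fences
quietly summarize x if x >= lofence & x <= hifence
display "whiskers run from " r(min) " to " r(max)
```

In this toy example the value 50 lies beyond the upper fence, so the top whisker stops at 9 and 50 would be drawn as a separate outlying point.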

  24. Box plots • Stata command: graph box cd4count, box(1, fcolor(blue) lcolor(black) fintensity(inten100)) title(CD4 count among new HIV positives at Mulago) Pagano and Gauvreau, Chapter 2

  25. Box plots by another variable • We can divide up our graphs by another variable • What type of variable is gender?

  26. Histograms by another variable

  27. Numerical variable summaries • Mode: the value (or range of values) that occurs most frequently • Sometimes there is more than one mode, e.g. a bimodal distribution (the two modes do not have to be the same height) • The mode only makes sense when the values are discrete, rounded off, or binned Pagano and Gauvreau, Chapter 3

  28. Scatter plots Pagano and Gauvreau, Chapter 2

  29. The importance of good graphs http://niemann.blogs.nytimes.com/2009/09/14/good-night-and-tough-luck/

  30. Numerical variable summaries • Measures of central tendency: where is the center of the data? • Median: the 50th percentile, i.e. the middle value of the sorted data • If n is odd: the median is the ((n+1)/2)th observation in sorted order (e.g. if n=31 then the median is the 16th observation) • If n is even: the median is the average of the two middle observations (e.g. if n=30 then the median is the average of the 15th and 16th observations) • Median CD4 cell count in previous data set = 234.5 Pagano and Gauvreau, Chapter 3
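
The odd and even cases of the median rule can be checked by hand in Stata. This sketch uses five made-up values in a hypothetical variable x; it is meant only to illustrate the rule, not to replace summarize.

```stata
* Hypothetical data, entered unsorted
clear
input x
7
1
5
3
9
end
sort x

* n = 5 is odd: the median is sorted observation (n+1)/2 = 3
display x[(_N + 1)/2]

* Drop the largest value so n = 4 is even: average the two middle observations
drop in 5
display (x[_N/2] + x[_N/2 + 1]) / 2

* Cross-check against Stata's own 50th percentile
quietly summarize x, detail
display r(p50)
```

With these values the odd-n median is 5, and after dropping the largest value the even-n median is 4, which matches r(p50).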

  31. Numerical variable summaries • Range • Minimum to maximum or difference (e.g. age range 15-58 or range=43) • CD4 cell count range: (0-1368) • Interquartile range (IQR) • 25th and 75th percentiles (e.g. IQR for age: 23-36) or difference (e.g. 13) • Less sensitive to extreme values • CD4 cell count IQR: (92-422) Pagano and Gauvreau, Chapter 3

  32. Numerical variable summaries • Measures of central tendency – where is the center of the data? • Mean – arithmetic average • Means are sensitive to very large or small values • Mean CD4 cell count: 296.9 • Mean age: 32.5 Pagano and Gauvreau, Chapter 3

  33. Interpreting the formula • ∑ is the symbol for the sum of the elements immediately to the right of the symbol • These elements are indexed (i.e. subscripted) with the letter i (the index letter could be any letter, though i is commonly used) • The elements are lined up in a list: the first one is denoted x_1, the second x_2, the third x_3, and the last x_n • n is the number of elements in the list Pagano and Gauvreau, Chapter 3
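
The formula this slide interprets did not survive in the transcript. Assuming it is the arithmetic mean introduced on the preceding slide, it would read as follows in the notation described above:

```latex
\[
\bar{x} \;=\; \frac{1}{n}\sum_{i=1}^{n} x_i \;=\; \frac{x_1 + x_2 + \cdots + x_n}{n}
\]
```

Here the sum runs over all n elements of the list, from x_1 to x_n, and dividing by n gives the average.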

  34. Numerical variable summaries • Sample variance • Amount of spread around the mean, calculated in a sample as the sum of squared deviations from the mean divided by n − 1 • Sample standard deviation (SD) is the square root of the variance • The standard deviation has the same units as the mean • SD of CD4 cell count = 255.4 • SD of Age = 11.2 Pagano and Gauvreau, Chapter 3
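
The variance formula itself is missing from the transcript. Assuming the slide showed the usual sample variance with the n − 1 divisor, it can be written in the same summation notation as the mean:

```latex
\[
s^2 \;=\; \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2,
\qquad
s \;=\; \sqrt{s^2}
\]
```

Because each deviation is squared, the variance is in squared units; taking the square root returns the standard deviation s to the units of the original measurements.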

  35. Numerical variable summaries • Coefficient of variation • For the same relative spread around a mean, the variance will be larger for a larger mean • Can use to compare variability across measurements that are on a different scale (e.g. IQ and head circumference) • CV for CD4 cell count: 86.0% • CV for age: 34.5% Pagano and Gauvreau, Chapter 3
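
The slide does not spell out the definition, but the quoted percentages are consistent with the usual one, CV = 100 × SD / mean. A quick check in Stata using the means and SDs reported on the earlier slides:

```stata
* CV = 100 * SD / mean, using the summary values quoted on the slides

* CD4 cell count: SD 255.4, mean 296.9
display %4.1f 100 * 255.4 / 296.9

* Age: SD 11.2, mean 32.5
display %4.1f 100 * 11.2 / 32.5
```

These reproduce the 86.0% and 34.5% on the slide. With the raw data loaded, tabstat with the cv statistic should give the same quantity, reported as SD/mean rather than as a percent.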

  36. Pocket/wallet change • Histogram, box plot • Mode, median, 25th percentile, 75th percentile • Mean, SD • Differ by gender?

  37. For next time • Read Pagano and Gauvreau • Chapters 1-3 (Review of today’s material) • Chapter 6
