Medical Biometry I (Biostatistics 511)
Instructor: Jim Hughes
Cartoons and images in these notes are from:
Gonick L. Cartoon Guide to Statistics. HarperPerennial, New York, 1993.
Fisher L and van Belle G. Biostatistics: A Methodology for the Health Sciences. Wiley, New York, 1993.
Biostat 511

Typical Public Health, Medical or Biological Questions About Populations
• Does formula feeding increase the chance of survival of infants born to HIV positive mothers, compared to breastfeeding, in a developing country?
• How do we estimate the concentration of antibody based on reactivity of serial dilutions?
• Are there trends in mortality and homicide rates by urban setting, age, gender, or race?
• How do we model survival following heart bypass surgery? Are there patient characteristics that predict survival? How does the 1-year, 2-year, and 5-year survival of bypass patients compare to that of individuals treated medically for heart disease?
• How do attitudes toward enrollment in an HIV vaccine study vary by geography, age, or education?
• How does physician experience influence survival of patients with HIV?

Biostatistics 511
• Introduction to the basic concepts of statistics as applied to problems in public health or medicine
• Definitions:
1. Data - numerical facts, measurements, or observations obtained from an investigation aimed at answering a question.
2. Statistics - the science and art of obtaining reliable results and conclusions from data that are subject to variation.
3. Biostatistics - the application of statistics to the biological sciences, medicine and public health.

Role of Statistics in Public Health and Medicine
Science | Statistics
1. Idea or question | 1. Mathematical model / hypothesis
2. Collect data / make observations | 2. Study design
3. Describe data / observations | 3. Descriptive statistics
4. Assess the strength of evidence for / against the hypothesis | 4. Inferential statistics

Descriptive Statistics and Exploratory Data Analysis - Univariate
• Types of data
  • Categorical
  • Continuous
• Numerical Summaries
  1. Location - mean, median, mode
  2. Spread - range, variance, standard deviation, IQR
  3. Shape - skewness
• Graphical Summaries
  1. Barplot
  2. Stem and Leaf plot
  3. Histogram
  4. Boxplot
• Mathematical Summaries
  1. Density curves

Descriptive Statistics (Exploratory)
• “Exploratory data analysis is detective work - numerical detective work”
• “Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone - the first step”
• organization, summarization, and presentation of data
• “Show me the data!”
• Tools:
  • tables
  • graphs
  • numerical summaries
Source: John Tukey, Exploratory Data Analysis, Addison-Wesley, 1977

Inferential Statistics (Confirmatory)
• Generalization of conclusions:
  • sample → population
• Assess strength of evidence
• Make comparisons
• Make predictions
• Tools:
  • Modeling
  • Estimation and Confidence Intervals
  • Hypothesis Testing

Example: Effect of seat belt use on accident fatality
But, suppose... How does this affect your inference?

Types of Data
In statistics we deal with data - measurements or observations on individuals (or, more generally, on the “units of observation”).
• Categorical (qualitative)
  1) Nominal scale - no natural order
     - gender, marital status, race
  2) Ordinal scale - ordered categories
     - severity scale, good/better/best
• Numerical (quantitative)
  1) Discrete - (few) integer values
     - number of children in a family
  2) Continuous - measure to arbitrary precision
     - blood pressure, weight
Why bother? PROPER DISPLAYS, PROPER ANALYSIS

Categorical data
For categorical data we usually summarize with counts. A simple visual summary is the bar graph.
[Figure: bar graph, N = 74]
Notes:
• vertical axis can be count or percent
• in the above example, counts do not add to 74 because individuals can have multiple risk factors
• tabular presentation may be more parsimonious for such data

Continuous Data
Consider the 11 ages:
21, 32, 34, 34, 42, 44, 46, 48, 52, 56, 64
Age is a quantitative variable, so a barplot doesn’t make sense. Here we are more interested in characteristics of the distribution of ages: Where is the center of the age distribution (e.g. the average)? How much does age vary? Are there some values far from the bulk of the data? We would like some visual tools to help us answer these questions.

Stem and Leaf Diagram
We could group the data and tally the frequencies:
20: X
30: XXX
40: XXXX
50: XX
60: X
But why “hide” the details? Instead, we’ll use the 10’s place as stems and the units as leaves:
Stem | age
2* | 1
3* | 244
4* | 2468
5* | 26
6* | 4
The stemplot or stem and leaf plot is a quick, informative summary for small datasets.

Stem and Leaf Diagram, construction
• All but the last digit form the stem.
• Stems are stacked vertically from the smallest to the largest.
• The leaf is the last digit in a value and is placed next to the appropriate stem (leaves are ordered outward from smallest to largest).
• Shows macro information - general shape, spread, range.
• Shows micro information - all values shown.
• Fast and easy to construct.

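The construction rules above can be sketched in a few lines of Python. This is only an illustration, assuming non-negative integer data; the function name `stem_and_leaf` is hypothetical and not part of the course materials.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the rows of a stem and leaf display for non-negative integers.

    The stem is every digit except the last; the leaf is the last digit.
    """
    leaves = defaultdict(list)
    for v in sorted(values):          # sorting orders the leaves on each stem
        leaves[v // 10].append(v % 10)
    rows = []
    # Walk every stem from smallest to largest, keeping empty stems
    # so that gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(d) for d in leaves.get(stem, []))
        rows.append(f"{stem}* | {row}")
    return rows


# The 11 ages from the Continuous Data slide.
ages = [21, 32, 34, 34, 42, 44, 46, 48, 52, 56, 64]
for line in stem_and_leaf(ages):
    print(line)
```

Applied to the 11 ages, this gives the same five stems as the hand-drawn display.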
Back-to-back Stem and Leaf
To compare two sets of data, use a back-to-back stem and leaf diagram.
[Fig 1. Back-to-back stem and leaf plot of systolic blood pressure after 12 weeks treatment with daily calcium supplement or placebo; stems run from 9* to 13*, with Placebo and Calcium leaves on opposite sides]
(Unfortunately, you can’t do this in Stata)

Methods for Grouped Data
The stem and leaf effectively groups continuous data into intervals. Let’s extend this idea. The following terms are useful for grouped data:
• frequency - the number of times the value occurs in the data.
• cumulative frequency - the number of observations that are equal to or smaller than the value.
• relative frequency - the % of the time that the value occurs (frequency/N).
• cumulative relative frequency - the % of the sample that is equal to or smaller than the value (cumulative frequency/N).

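As a rough illustration of these four quantities, the sketch below builds a frequency table in Python. The function name `frequency_table` is hypothetical. The data are the 11 ages from the Continuous Data slide, grouped by decade as in the stem and leaf example. Relative frequencies are returned as proportions rather than percentages.

```python
from collections import Counter


def frequency_table(values):
    """Return rows of (value, freq, cum_freq, rel_freq, cum_rel_freq)."""
    n = len(values)
    counts = Counter(values)
    rows = []
    cumulative = 0
    for value in sorted(counts):
        cumulative += counts[value]          # observations <= this value
        rows.append((value, counts[value], cumulative,
                     counts[value] / n, cumulative / n))
    return rows


ages = [21, 32, 34, 34, 42, 44, 46, 48, 52, 56, 64]
decades = [10 * (a // 10) for a in ages]     # 21 -> 20, 34 -> 30, ...
table = frequency_table(decades)
for value, freq, cum, rel, cum_rel in table:
    print(f"{value}s: freq={freq} cum={cum} rel={rel:.3f} cum_rel={cum_rel:.3f}")
```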
Example - Birthweights
Sample of 100 birthweights in ounces. Complete the following table ...

Histogram
• Similar to a barplot, but used for continuous data.
• Divide the data into intervals.
• A rectangle is constructed over each interval, with the base spanning the interval end-points and the height chosen so that the area of the rectangle is proportional to the frequency (if the width is one unit for all intervals, then height = frequency).
• Shape can be sensitive to the number and choice of intervals (rule of thumb: number of bins is the smaller of $\sqrt{n}$ or $10\log_{10} n$).
• Histograms are more effective for moderate to large datasets.
Note: A histogram is a special type of bar graph where variable interval widths are permitted.

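The "area proportional to frequency" rule matters most when interval widths differ. The sketch below is a minimal illustration, not the course's own method: `histogram_heights` is a hypothetical helper, and the interval end-points are chosen arbitrarily so that the last interval is twice as wide as the others. Heights are scaled so that each rectangle's area equals the relative frequency.

```python
def histogram_heights(values, edges):
    """Heights for intervals [edges[i], edges[i+1]) such that
    height * width = relative frequency of the interval."""
    n = len(values)
    heights = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        count = sum(1 for v in values if lo <= v < hi)
        heights.append(count / n / (hi - lo))
    return heights


ages = [21, 32, 34, 34, 42, 44, 46, 48, 52, 56, 64]
edges = [20, 30, 40, 50, 70]            # assumed end-points; last bin is wider
heights = histogram_heights(ages, edges)

widths = [hi - lo for lo, hi in zip(edges[:-1], edges[1:])]
total_area = sum(h * w for h, w in zip(heights, widths))
print(heights, total_area)
```

The intervals 30-39 and 50-69 both contain 3 ages, but the wider interval gets half the height, so the two rectangles have equal area.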
Example - Birthweights
[Figures: two histograms of the birthweight data, one labeled “Right” and one labeled “Wrong”]
Note: You can determine relative frequency and cumulative relative frequency from a histogram.

Characteristics of Distributions
• Shape
  • number of modes (peaks)
  • symmetry
• Center
  • where is the center?
• Spread
  • how much variation?
  • outliers?
• Other features
  • boundaries
  • digit preference

Examples

Notation
Suppose we have N measurements of a particular variable. We will denote these N measurements as:
$X_1, X_2, X_3, \ldots, X_N$
where $X_1$ is the first measurement, $X_2$ is the second, etc.
Sometimes it is useful to order the measurements. We denote the ordered measurements as:
$X_{(1)}, X_{(2)}, X_{(3)}, \ldots, X_{(N)}$
where $X_{(1)}$ is the smallest value and $X_{(N)}$ is the largest.

Arithmetic Mean
The arithmetic mean is the most common measure of the central location of a sample. We use $\bar{X}$ to refer to the mean and define it as:
$$\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$$
The symbol $\Sigma$ is shorthand for “sum” over a specified range. For example:
$$\sum_{i=1}^{4} X_i = X_1 + X_2 + X_3 + X_4$$

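Some Properties of the Arithmetic Mean
Often we wish to transform variables. Linear changes to variables (i.e. $Y = aX + b$) impact the mean in a predictable way:
(1) Adding (or subtracting) a constant to all values: $Y_i = X_i + c \Rightarrow \bar{Y} = \bar{X} + c$
(2) Multiplication (or division) by a constant: $Y_i = cX_i \Rightarrow \bar{Y} = c\bar{X}$
Does this nice behavior happen for any change? NO! Nonlinear changes do not pass through the mean in this way (for example, in general $\overline{X^2} \neq \bar{X}^2$).
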
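For readers who want the reasoning behind the linear rule, here is a short derivation. It uses only the definition of the arithmetic mean as the sum of the values divided by N, with $Y_i = aX_i + b$ as in the slide. The counterexample for a nonlinear change uses an arbitrary two-point sample chosen purely for illustration.

```latex
\begin{align*}
\bar{Y} &= \frac{1}{N}\sum_{i=1}^{N} Y_i
         = \frac{1}{N}\sum_{i=1}^{N} (aX_i + b) \\
        &= a \cdot \frac{1}{N}\sum_{i=1}^{N} X_i + \frac{1}{N}\,(N b)
         = a\bar{X} + b .
\end{align*}
% A nonlinear change need not behave this way. Take X = (1, 3):
% the mean of the squares is (1 + 9)/2 = 5,
% but the square of the mean is 2^2 = 4.
```

Setting a = 1 gives the "add a constant" rule, and setting b = 0 gives the "multiply by a constant" rule.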
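Median
Another measure of central tendency is the median - the “middle one”. Half the values are below the median and half are above. Given the ordered sample, $X_{(i)}$, the median is:
N odd: $\text{median} = X_{\left(\frac{N+1}{2}\right)}$
N even: $\text{median} = \frac{1}{2}\left(X_{\left(\frac{N}{2}\right)} + X_{\left(\frac{N}{2}+1\right)}\right)$
Mode
The mode is the most frequently occurring value in the sample.
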
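Example: Central Location
Suppose the ages in years of the first 10 subjects enrolled in your study are:
34, 24, 56, 52, 21, 44, 64, 34, 42, 46
Then the mean age of this group is:
$$\bar{X} = \frac{34 + 24 + \cdots + 46}{10} = \frac{417}{10} = 41.7 \text{ years}$$
To find the median, first order the data:
21, 24, 34, 34, 42, 44, 46, 52, 56, 64
$$\text{median} = \frac{X_{(5)} + X_{(6)}}{2} = \frac{42 + 44}{2} = 43 \text{ years}$$
The mode is 34 years.
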
Suppose the next patient enrolls and their age is 97 years. How do the mean and median change?
$$\bar{X} = \frac{34 + 24 + \cdots + 46 + 97}{11} = \frac{514}{11} \approx 46.7 \text{ years}$$
To get the median, order the data:
21, 24, 34, 34, 42, 44, 46, 52, 56, 64, 97
$$\text{median} = X_{(6)} = 44 \text{ years}$$
If the age were recorded incorrectly as 977, instead of 97, what would the new median be? What would the new mean be?

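The effect of one extreme value on the mean and median can be checked numerically. The sketch below uses Python's standard `statistics` module on the ages from this example; the variable names are illustrative only.

```python
from statistics import mean, median, mode

ages10 = [34, 24, 56, 52, 21, 44, 64, 34, 42, 46]
with_97 = ages10 + [97]      # the eleventh patient
with_977 = ages10 + [977]    # the same patient, age mis-recorded

for label, sample in [("first 10", ages10),
                      ("with 97", with_97),
                      ("with 977", with_977)]:
    # The mean is pulled toward the extreme value; the median barely moves.
    print(f"{label:>9}: mean={mean(sample):.1f} median={median(sample)}")

print("mode of first 10:", mode(ages10))
```

The median is the same whether the eleventh age is 97 or 977, while the mean changes substantially.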
Comparison of Mean and Median
• Mean is sensitive to a few very large (or small) values - “outliers”
• Median is “resistant” to outliers
• Mean is attractive mathematically
• 50% of sample is above the median, 50% of sample is below the median.

Variation is important!

Measures of Spread: Range
The range is the difference between the largest and smallest observations:
$$R = X_{(N)} - X_{(1)}$$
Alternatively, the range may be denoted as the pair of observations:
$$\left(X_{(1)}, X_{(N)}\right)$$
The latter form is useful for data quality control.
Disadvantage: the sample range tends to increase with increasing sample size.
In the ages example, for the first 10 subjects, the range is $64 - 21 = 43$ years, or $(21, 64)$.

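Measures of Spread: Variance
Consider the following two samples:
20, 23, 34, 26, 30, 22, 40, 38, 37
30, 29, 30, 31, 32, 30, 28, 30, 30
These samples have the same mean and median (both equal 30), but the second is much less variable. The average “distance” from the center is quite small in the second. We use the variance to describe this feature:
$$s^2 = \frac{1}{N-1}\sum_{i=1}^{N} \left(X_i - \bar{X}\right)^2$$
The standard deviation is simply the square root of the variance:
$$s = \sqrt{s^2}$$

For the first sample, we obtain:
$$\bar{X} = 30, \quad s^2 = \frac{458}{8} = 57.25, \quad s \approx 7.57$$
For the second sample, we obtain:
$$\bar{X} = 30, \quad s^2 = \frac{10}{8} = 1.25, \quad s \approx 1.12$$
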
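The calculation for the two samples can be reproduced step by step. This sketch assumes the usual sample variance with an N - 1 divisor (the divisor is an assumption here, since the formula image did not survive in these notes). The function name `sample_variance` is hypothetical.

```python
from math import sqrt


def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by N - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


sample1 = [20, 23, 34, 26, 30, 22, 40, 38, 37]
sample2 = [30, 29, 30, 31, 32, 30, 28, 30, 30]

var1, var2 = sample_variance(sample1), sample_variance(sample2)
sd1, sd2 = sqrt(var1), sqrt(var2)
print(f"sample 1: variance={var1:.2f} sd={sd1:.2f}")
print(f"sample 2: variance={var2:.2f} sd={sd2:.2f}")
```

Both samples have mean 30, yet the first has a far larger variance than the second.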
Properties of the variance/standard deviation
• Variance and standard deviation are ALWAYS greater than or equal to zero.
• Linear changes are a little trickier than they were for the mean:
(1) Add/subtract a constant: $Y_i = X_i + c \Rightarrow s_Y^2 = s_X^2$
(2) Multiply/divide by a constant: $Y_i = cX_i \Rightarrow s_Y^2 = c^2 s_X^2$
So what happens to the standard deviation?

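One way to explore the question above is to try it numerically. The sketch below uses Python's standard `statistics` module (which uses the N - 1 divisor) and an arbitrarily chosen negative constant. It suggests that shifting leaves the spread unchanged, while scaling by c multiplies the variance by c squared and the standard deviation by the absolute value of c.

```python
from statistics import stdev, variance

x = [20, 23, 34, 26, 30, 22, 40, 38, 37]
c = -2                                    # arbitrary constant for illustration

shifted = [xi + c for xi in x]            # Y = X + c
scaled = [c * xi for xi in x]             # Y = c * X

print("variance:", variance(x), variance(shifted), variance(scaled))
print("std dev: ", stdev(x), stdev(shifted), stdev(scaled))
```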
Measures of Spread: Quantiles and Percentiles
The median was the sample value that had 50% of the data below (or above) it. More generally, we define the pth percentile as the value which has p% of the sample values less than or equal to it.
Let $k = pN/100$.
(1) If k is an integer, the pth percentile is the average of $X_{(k)}$ and $X_{(k+1)}$.
(2) If k is not an integer, the pth percentile is $X_{([k]+1)}$. Here, [k] is the largest integer smaller than k (i.e. truncate the decimal).
Quartiles are the 25th, 50th and 75th percentiles. The interquartile range is $Q_{.75} - Q_{.25}$ and is another useful measure of spread. The middle 50% of the data is found between $Q_{.25}$ and $Q_{.75}$.

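The two-case percentile rule translates directly into code. The sketch below is one possible reading of it, restricted to 0 < p < 100. The function name `percentile` is hypothetical. Note that statistical software often uses slightly different percentile conventions, so results may not match other tools exactly.

```python
import math


def percentile(xs, p):
    """pth percentile (0 < p < 100) using the rule k = p * N / 100."""
    ordered = sorted(xs)
    k = p * len(ordered) / 100
    if k == int(k):
        k = int(k)
        # k is an integer: average X(k) and X(k+1); lists are 0-based.
        return (ordered[k - 1] + ordered[k]) / 2
    # k is not an integer: take X([k]+1), i.e. 0-based index floor(k).
    return ordered[math.floor(k)]


ages = [21, 32, 34, 34, 42, 44, 46, 48, 52, 56, 64]
q25, q50, q75 = (percentile(ages, p) for p in (25, 50, 75))
iqr = q75 - q25
print(q25, q50, q75, iqr)
```

For the 11 ages, k is never an integer, so each quartile is a single ordered observation. With the 10-subject sample from the central location example, k = 5 for the median and the rule averages the 5th and 6th values.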
Boxplot
A graphical display of the quartiles of a dataset, as well as the range. Extremely large or small values are also identified.
[Figure: boxplot; axis label “Drug”]

Boxplot, construction
1. Order the data.
2. Compute the median and draw a line at this value.
3. Compute the hinges, $Q_{.25}$ and $Q_{.75}$.
4. Draw lines at the hinges (quartiles) and enclose in a box.
5. Compute the IQR = $Q_{.75} - Q_{.25}$.
6. Compute the fences:
   upper fence = $Q_{.75}$ + 1.5*IQR
   lower fence = $Q_{.25}$ - 1.5*IQR
   Observations beyond the fences are called outliers.
7. Draw a line (whisker) from $Q_{.75}$ to the largest non-outlying value.
8. Draw a line from $Q_{.25}$ to the smallest non-outlying value.
9. Mark points outside of the fences (outliers).

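The numerical part of these steps (everything except the drawing) can be sketched as follows. This is an illustration under stated assumptions: the quartiles are computed with the percentile rule from the Quantiles and Percentiles slide, which may differ slightly from the hinges used by some software. The names `quantile` and `boxplot_stats` are hypothetical.

```python
import math


def quantile(ordered, p):
    """pth percentile of already-sorted data, using k = p * N / 100."""
    k = p * len(ordered) / 100
    if k == int(k):
        return (ordered[int(k) - 1] + ordered[int(k)]) / 2
    return ordered[math.floor(k)]


def boxplot_stats(xs):
    """Median, quartiles, fences, whisker ends and outliers."""
    ordered = sorted(xs)                                  # step 1
    q25, med, q75 = (quantile(ordered, p) for p in (25, 50, 75))
    iqr = q75 - q25                                       # step 5
    lower_fence = q25 - 1.5 * iqr                         # step 6
    upper_fence = q75 + 1.5 * iqr
    inside = [x for x in ordered if lower_fence <= x <= upper_fence]
    return {
        "median": med, "q25": q25, "q75": q75, "iqr": iqr,
        "fences": (lower_fence, upper_fence),
        "whiskers": (inside[0], inside[-1]),              # steps 7 and 8
        "outliers": [x for x in ordered
                     if x < lower_fence or x > upper_fence],  # step 9
    }


# The 11 ages including the 97-year-old patient.
ages = [21, 24, 34, 34, 42, 44, 46, 52, 56, 64, 97]
stats = boxplot_stats(ages)
print(stats)
```

Here the 97-year-old falls above the upper fence, so the upper whisker stops at 64 and 97 is marked separately.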
Skewness
Both histograms and boxplots can show us that a distribution is skewed. Skewness refers to the symmetry or lack of symmetry in the shape of the distribution. Neither the mean nor the variance tells us about symmetry. Skewness is based on the average of $(X_i - \bar{X})^3$.
1. Skew = 0; “symmetric”; median = mean
2. Skew > 0; “positive” or “right” skewed; median < mean
3. Skew < 0; “negative” or “left” skewed; median > mean

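A small sketch of a skewness calculation follows. The notes only say that skewness is based on the average cubed deviation. The standardization used here (dividing by the average squared deviation raised to the power 3/2) is one common convention and is an assumption, not necessarily the formula used in the course.

```python
from statistics import mean, median


def skewness(xs):
    """Average cubed deviation, standardized by the 3/2 power of the
    average squared deviation (an assumed, common convention)."""
    n = len(xs)
    xbar = sum(xs) / n
    m2 = sum((x - xbar) ** 2 for x in xs) / n
    m3 = sum((x - xbar) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_skewed = [21, 24, 34, 34, 42, 44, 46, 52, 56, 64, 97]

print(skewness(symmetric))                      # symmetric sample
print(skewness(right_skewed))                   # long right tail
print(median(right_skewed), mean(right_skewed)) # median below mean
```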
Density Curves
We have seen how continuous data can be summarized with a histogram. Although histograms are summaries of the data, they still involve keeping track of a lot of numbers (i.e. the height and location of each bar). Is there a way to summarize the entire distribution of our data with just a few numbers? YES! We can use a type of mathematical model known as a density curve.

Density Curves

Density Curves
We saw previously that we can use a histogram to determine the relative frequency (= proportion = probability) of obtaining observations in a particular interval. If a particular density curve provides a good fit to our data then we can use the density curve to approximate these probabilities. In particular, the probability of obtaining an observation in a particular interval is given by the area under the density curve.
Note: For continuous data, it does not make sense to talk about the probability of an individual value (i.e. P(X = 6) = 0).

Relative frequency of scores less than 6 from histogram = 0.303
Probability of scores less than 6 from density curve = 0.293

Probability density function
1. A function, typically denoted f(x), that gives probabilities based on the area under the curve.
2. f(x) ≥ 0
3. Total area under the function f(x) is 1.0.
Cumulative distribution function
The cumulative distribution function, F(t), tells us the total probability at or below some value t.
F(t) = P(X ≤ t)
This is analogous to the cumulative relative frequency.

Normal Distribution
• A common model for continuous data
• Bell-shaped curve
  • takes values between $-\infty$ and $+\infty$
  • symmetric about mean
  • mean = median = mode
• Examples
  • birthweights
  • blood pressure
  • CD4 cell counts (perhaps transformed)

Normal Distribution
Specifying the mean and variance of a normal distribution completely determines the probability distribution function and, therefore, all probabilities (just 2 numbers!). The normal probability density function is:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
where $\pi \approx 3.14$ (a constant).
Notice that the normal distribution has two parameters:
$\mu$ = the mean of X
$\sigma$ = the standard deviation of X
We write $X \sim N(\mu, \sigma^2)$. The standard normal distribution is a special case where $\mu = 0$ and $\sigma = 1$.

For a standard normal distribution ...
In general,
~68% of data within $1\sigma$ of $\mu$
~95% of data within $2\sigma$ of $\mu$
~99.7% of data within $3\sigma$ of $\mu$

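These percentages can be checked numerically. The sketch below writes out the standard formula for the normal density and obtains the cumulative distribution function from the error function `math.erf` in Python's standard library. The function names are illustrative, not part of the course materials.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density f(x) with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def normal_cdf(t, mu=0.0, sigma=1.0):
    """F(t) = P(X <= t), computed from the error function."""
    return 0.5 * (1 + math.erf((t - mu) / (sigma * math.sqrt(2))))


def prob_within(k, mu=0.0, sigma=1.0):
    """Probability of falling within k standard deviations of the mean."""
    return (normal_cdf(mu + k * sigma, mu, sigma)
            - normal_cdf(mu - k * sigma, mu, sigma))


for k in (1, 2, 3):
    print(f"within {k} sd: {prob_within(k):.4f}")
```

The result does not depend on the particular mean and standard deviation, which is why the rule holds for any normal distribution and not only the standard normal.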
Summary
• Types of data
  • Categorical
  • Continuous
• Numerical Summaries
  1. Location - mean, median, mode
  2. Spread - range, variance, standard deviation, IQR
  3. Shape - skewness
• Graphical Summaries
  1. Barplot
  2. Stem and Leaf plot
  3. Histogram
  4. Boxplot
• Mathematical Summaries
  1. Density curves