Introduction to Data & Statistics

Introduction to Data & Statistics Module 10 Sept. 3, 2014

Agenda • Stats Lecture 1) Univariate analysis (looking at one variable) … central tendencies and variability (dispersion) 2) Bivariate analysis (comparing two variables) … correlation, t-test, chi-square association 3) Additional context for stats assignment in group project • Applications in SPSS (handout) • Discussion

Where do we start? Univariate Analyses • Need to make sure all our variables (e.g., scores on a scale, income figures, gender, ethnicity) are behaving appropriately for statistical testing • Each must have some variability (e.g., if all participants are women, there is no variability, so we cannot compare outcomes by gender) • Need to check how much variability each variable has and what its typical values are • For example, a typical value may be its average or mean value • These analyses are called univariate analyses • Univariate analysis involves the examination across cases of one variable at a time.

Summarizing Univariate Distributions • Any set of measurements that summarizes a variable should have two important properties: 1. The Central Tendency (or typical value): mode, median, mean 2. The Spread (variability or dispersion) about that value: range, variance, standard deviation (That is, how does each of the data values differ from the mean or median value?)


Example of central tendency and variation • Assume mean = 5.0 • Each point varies around the mean • This variation contributes to the overall standard deviation (SD) • More on standard deviations later…

Measures of Central Tendency • An estimate of the center of a distribution of values; how much our data are similar • The means to determine what is most typical, common, and routine • Central tendency is usually summarized with one of three statistics: 1) Mode 2) Median 3) Mean

Measures of Central Tendency 1. The Mode • The mode is the most frequent value in a distribution (mnemonic: mode = most). It is the least often used of the three statistics because it can easily give a misleading impression. • If two values tie for most frequent, the distribution is called bimodal. • Can be used for all four levels of measurement (for nominal data, it is simply the most common response, e.g., whether female or male is the more common category in a study) • May not be effective in describing what is typical in the distribution of a variable

Measures of Central Tendency 1. The Mode: Example • What is the most frequent value? 28, 31, 38, 39, 42, 42, 42, 42, 43, 47, 51, 51, 54, 55, 56, 56, 58, 59, 59, 59 (this listing of the data set is called an array)
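The mode question above can be checked with a short script. This is a minimal sketch in Python (the course itself uses SPSS); the function name `modes` is an illustrative choice, not something from the handout. It returns every value tied for the highest count, so a bimodal set yields two values.

```python
from collections import Counter

# The 20 values from the mode example slide.
values = [28, 31, 38, 39, 42, 42, 42, 42, 43, 47,
          51, 51, 54, 55, 56, 56, 58, 59, 59, 59]

def modes(data):
    """Return every value tied for the highest frequency, in sorted order."""
    counts = Counter(data)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(modes(values))        # the single most frequent value in the array
print(modes([1, 1, 2, 2]))  # two values tie, so this small set is bimodal
```

Returning a list rather than a single number keeps ties visible instead of silently picking one of them.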

Where is the mode in each of these distributions?

Measures of Central Tendency 2. The Median • The median is the point that divides the distribution in half; the midpoint of a set of numbers • To find the median value of a data set, arrange the data in order from smallest to largest • Must be used for at least ordinal level of measurement – why? • Unlike the mode, the median does not always coincide with an actual value in the set (unless the set has an odd number of values)

Measures of Central Tendency 2. The Median: Example • 2, 2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 20 • 19 points, so the 10th one is the median: Median = 9 • If the number of points is even, average the two values around the middle (n = 18): 2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 20 • (9 + 10) / 2 = 9.5 = Median
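The odd and even cases above can be sketched in a few lines of Python. The function below is an illustrative hand-rolled version (Python's built-in `statistics.median` does the same job); the variable names are assumptions for this example.

```python
def median(data):
    """Middle value of the sorted data; average the two middle values if n is even."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                      # odd n: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: mean of the two middle values

odd_set = [2, 2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 20]
even_set = [2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 20]

print(median(odd_set))   # n = 19, the 10th value
print(median(even_set))  # n = 18, average of the 9th and 10th values
```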

Measures of Central Tendency 3. The Mean • The mean, or statistical average, takes into account the values of each case in the distribution • It is the sum of all of the values divided by the total number of values. • Must be interval or ratio level measurements (e.g., weight, age, miles driven). • Should not be computed for ordinal level – why? • The mean can promote accuracy or distortion depending on whether the distribution is symmetrical or skewed.

Measures of Central Tendency 3. The Mean: Example • 2, 2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 20 • Mean = SUM of all values / N • ANSWER: 2+2+3+3+4+5+5+7+8+9+10+11+11+14+14+15+16+18+20 = 177; Total N = 19; Mean = 177 / 19 = 9.32
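The same arithmetic, written out as a small Python sketch (variable names are illustrative):

```python
# The 19 values from the mean example slide.
values = [2, 2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 20]

total = sum(values)   # SUM of all values
n = len(values)       # total N
mean = total / n      # mean = SUM / N

print(total, n, round(mean, 2))
```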

What is the Normal Distribution? It looks like a bell with one “hump” in the middle, centered around the population mean, and the number of cases (data) tapering off to both sides of the mean; the symmetrical distribution of scores around the mean


Normal Distribution (aka Bell Curve): where are the mean, median, and mode? In a perfect normal distribution, the mean, median, and mode are equal!

Means and variances are the best measures for symmetric or normal distributions • Describe by using: arithmetic MEAN and VARIANCE (standard deviation) • Secondarily: Range, Mode (most common value), Skew (left or right), Kurtosis (thickness of tails)

Normal Distribution - Skewness • Skewness is used in describing non-normal (asymmetric) distributions. • In a normal curve, the right and left halves of the curve are mirror images of each other. • If this is not the case, the curve is said to be skewed, either positively (to the right) or negatively (to the left). • If the scores tend to be concentrated toward the high end of the score scale, the curve is negatively skewed. • If they are concentrated toward the low end of the score scale, the curve is positively skewed. • Skewness is measured from -3.0 to +3.0 • A skew score of 0 = symmetrical distribution
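The slide does not say which skewness statistic it has in mind. One coefficient that is bounded by -3 and +3 is Pearson's second (median) skewness coefficient, 3 × (mean − median) / SD, so the sketch below assumes that one; SPSS and other packages may report a different, moment-based statistic. The three data sets are made up for illustration.

```python
import statistics

def pearson_median_skew(data):
    """Pearson's second skewness coefficient: 3 * (mean - median) / SD."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    sd = statistics.pstdev(data)  # population SD (divide by N)
    return 3 * (mean - median) / sd

# Hypothetical score sets, not from the lecture.
symmetric = [1, 2, 3, 4, 5]
low_heavy = [1, 2, 2, 3, 3, 3, 4, 10]    # scores bunched at the low end
high_heavy = [1, 7, 8, 8, 8, 9, 9, 10]   # scores bunched at the high end

for name, data in [("symmetric", symmetric),
                   ("low_heavy", low_heavy),
                   ("high_heavy", high_heavy)]:
    print(name, round(pearson_median_skew(data), 2))
```

Scores bunched at the low end with a long tail to the right give a positive value, scores bunched at the high end give a negative value, and a symmetric set gives 0.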

Normal Distribution - Skewness (figure)

Example. Means and standard deviations for all study variables Mean=50 Mean=80

The Outlier Effect • Outlier: a result that is far different from most of the results for the group; extreme value(s) that can skew the overall results • The median and mode are not sensitive to outliers. That is, they tend not to change when outliers are present • The mean is sensitive to outliers. The mean can change greatly when outliers are present.

To Address Outliers in Mean Calculations… • Trimmed mean: do not use the top and bottom five percent of scores • In this example, we have 20 values, so the lowest and highest values are the lowest 5% and highest 5% of this list: 2 40 45 46 52 52 55 59 60 61 61 63 64 66 66 66 67 69 70 259 • Mean for n = 20 is 66.2 • Trimmed mean for n = 18 is 59.0
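To see the outlier effect on this list, the sketch below computes the mean, the median, and a 5% trimmed mean in Python. The helper `trimmed_mean` is written for this example (it is not a standard-library function) and assumes trimming means dropping the same number of values from each end.

```python
import statistics

# The 20 scores from the trimmed-mean slide.
scores = [2, 40, 45, 46, 52, 52, 55, 59, 60, 61,
          61, 63, 64, 66, 66, 66, 67, 69, 70, 259]

def trimmed_mean(data, proportion=0.05):
    """Mean after dropping the lowest and highest `proportion` of the values."""
    ordered = sorted(data)
    k = round(len(ordered) * proportion)   # values to drop at each end (1 of 20 here)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

print("mean:        ", statistics.mean(scores))    # pulled upward by the 259
print("median:      ", statistics.median(scores))  # not moved by the extremes
print("trimmed mean:", trimmed_mean(scores))       # mean of the middle 18 values
```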

Which measure of central tendency should we use? • Both the median and mean are used to summarize the central tendency of quantitative variables. • To decide which to use, consider these issues: 1. Level of measurement: • the median can be used with ordinal level data (often used in scales); but, • the mean requires interval or ratio level data. • the mode should be used for nominal level data. (Think Yes=1 and No=0 data. What would 0.36 mean? And 0.72?)

Which measure of central tendency should we use? • Both the median and mean are used to summarize the central tendency of quantitative variables. • To decide which to use, consider these issues: 2. The shape of the distribution • the median should be used when the data are skewed or have many outliers • the mean should be used when the data are fairly “bell shaped” or normal. • Tip: Use the mean when the mean and median are very similar.

Mean or Median? • Shape of variable’s distribution: • The mean and median will be the same when the distribution is perfectly symmetric. • When the distribution is not symmetric, the mean is pulled in the direction of extreme values, but the median is not affected in any way by extreme values. • Purpose of the statistical summary: • If the purpose is to report the middle position, then the median is the appropriate statistic. • If the purpose is to report a mathematical average, the mean is the appropriate statistic.

Normal distributions: means and medians are very close • The arithmetic MEAN (average value) is nearly the same as the MEDIAN (50th percentile, or the value that half of the ranked data points lie above and half lie below).

Measures of Variability (Variation/Dispersion) • How different the data are from each other, reported by how the scores fall around the mean • For nominal data, simply look at how many cases fall in each category; for the other levels of measurement, use the range, variance, or standard deviation • Captures how widely and densely spread a variable’s distribution is.

Measures of Variability • Variability is usually summarized with one of four statistics: 1) The Percent of responses in each category (nominal data) 2) The Range (ordinal and higher) 3) The Variance (interval and ratio) 4) The Standard Deviation (interval and ratio)

Measures of Variability 1. Percentage & Range • For nominal data, simply report the percentage in each category (51% female, 22% social workers) • For ordinal, interval & ratio data, the range is calculated as the difference between the highest value in a distribution and the lowest value • It can be drastically altered by an extreme value (an outlier) • The version used here counts both endpoints: “Maximum value minus the minimum value + 1” • Example: 2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 20: Range is 20 – 2 + 1 = 19 • 2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 100: Range is 100 – 2 + 1 = 99 (outlier effect)
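A minimal Python sketch of the "maximum minus minimum + 1" rule from this slide, applied to both example lists. The function name `inclusive_range` is an illustrative choice; many textbooks and software packages report the plain difference (maximum minus minimum) instead.

```python
def inclusive_range(data):
    """Range as defined on the slide: maximum - minimum + 1."""
    return max(data) - min(data) + 1

original = [2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 20]
with_outlier = [2, 3, 3, 4, 5, 5, 7, 8, 9, 10, 11, 11, 14, 14, 15, 16, 18, 100]

print(inclusive_range(original))      # 20 - 2 + 1
print(inclusive_range(with_outlier))  # 100 - 2 + 1: one extreme value changes the range drastically
```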


Measures of Variability 2. Variance • The variance is the average of the squared differences from the mean. • It takes into account all the scores to determine the spread. • To calculate the variance, follow these steps: 1) Work out the mean (the simple average of the numbers) 2) For each number, subtract the mean and then square the result (the squared difference) 3) Work out the average of those squared differences.

Example of central tendency and variation • Assume mean = 5.0 • Each point varies around the mean • This variation contributes to the overall standard deviation (SD) • First, calculate the mean. Find the difference at each point. Square each difference and sum. • Variance = $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2$ • SD = $\sigma = \sqrt{\sigma^2}$

Calculations in Excel table

Variance – Example • You and your friends have just measured the heights of your dogs. The heights are: 600mm, 470mm, 170mm, 430mm and 300mm. 1. Find the Mean: Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394 2. Calculate each dog’s difference from the Mean: (600 - 394 = 206), (470 - 394 = 76), (170 - 394 = -224), (430 - 394 = 36), (300 - 394 = -94) 3. To calculate the Variance, take each difference, square it, and then average the result: σ² = (206² + 76² + (-224)² + 36² + (-94)²) / 5 = 108,520 / 5 = 21,704
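The three steps of the dog-height example can be followed line by line in Python. This sketch divides by N, as the slide does (the population variance); note that many packages default to dividing by N − 1 (the sample variance), which would give a larger number. Variable names are illustrative.

```python
import statistics

heights_mm = [600, 470, 170, 430, 300]

# Step 1: work out the mean.
mean = sum(heights_mm) / len(heights_mm)

# Step 2: each dog's difference from the mean, then squared.
differences = [h - mean for h in heights_mm]
squared = [d ** 2 for d in differences]

# Step 3: average the squared differences (divide by N, as on the slide).
variance = sum(squared) / len(heights_mm)

print(mean, differences, sum(squared), variance)

# Cross-check against the standard library's population variance.
print(statistics.pvariance(heights_mm))
```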

Measures of Variability 3. Standard Deviation • Standard deviation is the square root of the variance: √(variance) • SD tells us to what degree the values cluster around the mean. • Dog example: σ = √21,704 = 147.32... ≈ 147 • Now we can show which heights are within one Standard Deviation (147mm) of the Mean. So, using the Standard Deviation, we have a "standard" way of knowing what is normal, and what is extra large or extra small. Rottweilers are tall dogs, and Dachshunds are a bit short. • The variance and standard deviation can be calculated by software such as SPSS, Excel, and SAS, and even on hand calculators. Thank goodness for modern technology!
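As a rough illustration of "within one standard deviation of the mean", the Python sketch below recomputes the SD for the five dog heights and lists which dogs fall inside that band. It uses the population formula (divide by N) to match the slide; the variable names are assumptions for this example.

```python
import math

heights_mm = [600, 470, 170, 430, 300]

mean = sum(heights_mm) / len(heights_mm)
variance = sum((h - mean) ** 2 for h in heights_mm) / len(heights_mm)
sd = math.sqrt(variance)  # standard deviation = square root of the variance

# Heights no more than one SD above or below the mean.
within_one_sd = [h for h in heights_mm if abs(h - mean) <= sd]

print(round(sd, 2))
print(within_one_sd)
```

The tallest dog (600mm) and the shortest dog (170mm) fall outside the band, which is what the slide means by "extra large or extra small".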

Overview

Bivariate Statistics • Now that we know a bit about each of our variables, we can start comparing them to each other • We can also look at differences among groups • When comparing two variables or groups, use bivariate statistics • Multivariate statistics look at the relationships among many variables or groups at one time; they are beyond the scope of our class

Comparing variables and groups… Parametric Statistics • Parametric statistics require certain assumptions/qualities in data/variables: • Normal distributions • Dependent variable is interval/ratio • Good sample size (at least 30) • Examples of parametric statistics: 1. Correlation: Is there a relationship between variables? 2. t-tests: Are there mean differences in outcomes between two groups? 3. Analysis of Variance (ANOVA): Are there mean differences in outcomes among groups? (two or more groups; will not do in this class)

Probability Value • A report of how likely it is that the relationship we found happened by chance rather than reflecting a real (statistically significant) relationship • In other words, how sure are we that what we found was not just a fluke? • Most researchers set the level for statistical significance at 0.05 or smaller (0.01, 0.001) • Indicated by the p value, e.g., p < .05 means there is less than a 1 in 20 chance that the results are due to sampling error; p < .01, less than a 1 in 100 chance; p < .001, less than a 1 in 1,000 chance

Correlation • To determine if a relationship exists between two variables and the direction of the relationship: “What is the actual strength and direction of the relationship between variables within the sample?” • To determine the degree to which the variables are related and the probability that this relationship occurred by chance: “What is the probability that the relationship between variables within the sample is due to sampling error?” • These variables must be measured at the interval or ratio level.

Correlation • To determine if a relationship exists between two linear variables and the direction of the relationship: “What is the actual strength and direction of the relationship between variables within the sample?”


Correlation • Strength is indicated by a correlation coefficient (Pearson’s r) • The correlation coefficient (r) provides the numerical value that indicates both the strength and direction of the relationship: –1.0 = perfect negative relationship; +1.0 = perfect positive relationship • The closer the coefficient is to either +1.0 or –1.0, the stronger the linear relationship • Middle = moderate / weaker relationship • Close to 0 = no linear relationship
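To make the coefficient concrete, here is a minimal Python sketch of the textbook Pearson's r formula. The function name `pearson_r` and all the data below are hypothetical, chosen only to produce a perfect positive, a perfect negative, and a zero linear relationship; in the course, SPSS reports r for you.

```python
import math

def pearson_r(x, y):
    """Pearson's r: co-variation of x and y divided by the product of their spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    co_variation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)   # sum of squared deviations of x
    ss_y = sum((b - mean_y) ** 2 for b in y)   # sum of squared deviations of y
    return co_variation / math.sqrt(ss_x * ss_y)

# Hypothetical data, not from the lecture.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # goes up in step with x
falling = [10, 8, 6, 4, 2]   # goes down in step with x
peaked = [1, 2, 3, 2, 1]     # rises then falls: no linear trend

print(pearson_r(x, rising))
print(pearson_r(x, falling))
print(pearson_r(x, peaked))
```

The last case is a reminder that r near 0 only rules out a straight-line relationship; the rise-then-fall pattern is still a relationship, just not a linear one.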

Range of Correlation Coefficients (r) • -1.0 Perfect negative • 0.0 No correlation • +1.0 Perfect positive

Correlation Matrix • All variables are listed in the left side column & repeated in a row on the top. • Find the direction & strength of the correlation between two variables by locating the cell where the row of the first variable intersects the column headed by the second variable; that cell shows the correlation coefficient and its probability.

Example. Correlations among study variables ** Correlation is significant at the 0.01 level (2-tailed). * Correlation is significant at the 0.05 level (2-tailed).

t-tests • A statistical procedure that tests the means of two groups to determine if they are statistically different. • Two common types: 1) Independent sample t-test 2) Paired sample t-test (use when you are comparing means on the same subjects over time, each subject having two measures; e.g., useful when comparing a linear measure over two test events. These are dependent samples, not independent samples.)
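The two types differ in what gets compared. The Python sketch below computes only the t statistic and degrees of freedom for each; looking up the p value needs a t table or a statistics package such as SPSS. It assumes the classic pooled-variance (equal variances) form for the independent test. Function names and all data are hypothetical.

```python
import math
import statistics

def paired_t(before, after):
    """Paired t: mean of the per-subject differences divided by its standard error."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)   # sample SD of differences (N - 1)
    return statistics.mean(diffs) / se, n - 1     # (t, degrees of freedom)

def independent_t(group_a, group_b):
    """Independent-samples t with pooled variance (assumes equal group variances)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * statistics.variance(group_a)
                  + (n_b - 1) * statistics.variance(group_b)) / (n_a + n_b - 2)
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / se
    return t, n_a + n_b - 2                       # (t, degrees of freedom)

# Hypothetical scores: the same four clients measured at two test events.
pretest = [10, 12, 9, 11]
posttest = [12, 14, 10, 14]
print(paired_t(pretest, posttest))

# Hypothetical scores: two separate groups of three clients each.
group_1 = [1, 2, 3]
group_2 = [2, 3, 4]
print(independent_t(group_1, group_2))
```

The paired version works on one list of differences because each subject supplies both measures; the independent version has to combine the spread of two unrelated groups.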
