Edpsy 511 Exploratory Data Analysis Homework 1: Due 9/20
Landmarks in the data • Quartiles • We’re often interested in the 25th, 50th and 75th percentiles. • 39, 38, 38, 36, 36, 31, 29, 29, 28, 19 • Steps • First, order the scores from least to greatest: 19, 28, 29, 29, 31, 36, 36, 38, 38, 39. • Second, add 1 to the sample size. • Why? • Third, multiply (sample size + 1) by the percentile to find the location in the ordered scores. • Q1 location = (10 + 1) * .25 = 2.75 • Q2 location = (10 + 1) * .50 = 5.5 • Q3 location = (10 + 1) * .75 = 8.25 • If the location is a fraction, take the average of the two adjacent X values: Q1 = (28 + 29) / 2 = 28.5, Q2 = (31 + 36) / 2 = 33.5, Q3 = (38 + 38) / 2 = 38.
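The quartile steps can be sketched in Python. This is a minimal illustration of the slide's rule only (locate at (n + 1) * p, then average the two neighbouring scores when the location is fractional). The function name `quartile_by_location` is hypothetical, and the sketch assumes the location falls between 1 and n.

```python
import math

def quartile_by_location(scores, p):
    """Slide rule: location = (n + 1) * p in the ordered scores (1-based).

    A fractional location is resolved by averaging the two adjacent scores.
    Assumes 1 <= location <= n; no handling for extreme percentiles.
    """
    ordered = sorted(scores)
    location = (len(ordered) + 1) * p
    if location.is_integer():
        return float(ordered[int(location) - 1])
    lower = ordered[math.floor(location) - 1]  # score just below the location
    upper = ordered[math.ceil(location) - 1]   # score just above the location
    return (lower + upper) / 2

scores = [39, 38, 38, 36, 36, 31, 29, 29, 28, 19]
q1, q2, q3 = (quartile_by_location(scores, p) for p in (0.25, 0.50, 0.75))
print(q1, q2, q3)  # 28.5 33.5 38.0
```

Statistical packages often interpolate between the adjacent scores instead of averaging them, so their quartiles can differ slightly from this rule.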
Shapes of Distributions • Normal distribution • Positive Skew • Or right skewed • Negative Skew • Or left skewed
Statistics vs. Parameters • A parameter is a characteristic of a population. • It is a numerical or graphic way to summarize data obtained from the population • A statistic is a characteristic of a sample. • It is a numerical or graphic way to summarize data obtained from a sample
Types of Numerical Data • There are two fundamental types of numerical data: • Categorical data: obtained by determining the frequency of occurrences in each of several categories • Quantitative data: obtained by determining placement on a scale that indicates amount or degree
Techniques for Summarizing Quantitative Data • Frequency Distributions • Histograms • Stem and Leaf Plots • Distribution curves • Averages • Variability
Summary Measures • Central Tendency: Arithmetic Mean, Median, Mode • Quartile • Variation: Range, Variance, Standard Deviation
Measures of Central Tendency • Average (Mean) • Median • Mode
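Mean (Arithmetic Mean) • Mean (arithmetic mean) of data values • Sample mean: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$, where $n$ is the sample size • Population mean: $\mu = \frac{\sum_{i=1}^{N} X_i}{N}$, where $N$ is the population size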
Mean • The most common measure of central tendency • Affected by extreme values (outliers) • [Figure: two number-line plots, one on a 0 to 10 axis (Mean = 5) and one on an axis extending to 14 (Mean = 6)]
Weighted Mean • A form of mean obtained from groups of data in which the different sizes of the groups are accounted for, or weighted.
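As a rough illustration of weighting by group size, the sketch below combines two group means. The group means and sizes are made-up values and `weighted_mean` is a hypothetical helper, not something taken from the course materials.

```python
def weighted_mean(group_means, group_sizes):
    """Combine group means, weighting each by the size of its group."""
    total_size = sum(group_sizes)
    weighted_sum = sum(m * n for m, n in zip(group_means, group_sizes))
    return weighted_sum / total_size

# Hypothetical example: a class of 10 averaging 80 and a class of 30 averaging 70.
group_means = [80.0, 70.0]
group_sizes = [10, 30]

wm = weighted_mean(group_means, group_sizes)
simple = sum(group_means) / len(group_means)
print(wm, simple)  # 72.5 75.0
```

The weighted result sits closer to the larger group's mean, which is the point of weighting: the simple average of the two means treats both groups as if they were the same size.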
Median • Robust measure of central tendency • Not affected by extreme values • In an ordered array, the median is the “middle” number • If n or N is odd, the median is the middle number • If n or N is even, the median is the average of the two middle numbers • [Figure: two number-line plots, one on a 0 to 10 axis (Median = 5) and one on an axis extending to 14 (Median = 5)]
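The odd/even rule and the median's resistance to extreme values can be sketched as follows. The two data sets are hypothetical, chosen only so that they reproduce the values on the slides (means of 5 and 6, medians of 5 and 5); the original plotted data did not survive.

```python
from statistics import mean

def median_of(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical data: the second set replaces 9 with an extreme value, 14.
without_outlier = [1, 3, 5, 7, 9]
with_outlier = [1, 3, 5, 7, 14]

print(mean(without_outlier), median_of(without_outlier))  # 5 5
print(mean(with_outlier), median_of(with_outlier))        # 6 5
```

The mean moves when the largest score is stretched; the median does not, because only the order of the scores matters for it.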
Mode • A measure of central tendency • Value that occurs most often • Not affected by extreme values • Used for either numerical or categorical data • There may be no mode • There may be several modes • [Figure: two number-line plots, one on a 0 to 6 axis (No Mode) and one on a 0 to 14 axis (Mode = 9)]
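A small sketch of the three cases (no mode, one mode, several modes). It assumes the convention that a data set has no mode when no value repeats; the helper name `modes` and the sample data are hypothetical.

```python
from collections import Counter

def modes(values):
    """Return every most-frequent value; an empty list means no mode.

    Assumption: if no value occurs more than once, the data have no mode.
    """
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)

print(modes([0, 1, 2, 3, 4, 5, 6]))    # [] (no value repeats)
print(modes([2, 9, 9, 9, 11, 14]))     # [9]
print(modes([1, 1, 2, 2, 3]))          # [1, 2] (two modes)
print(modes(["red", "blue", "red"]))   # ['red'] (categorical data)
```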
Variability • Refers to the extent to which the scores on a quantitative variable in a distribution are spread out. • The range represents the difference between the highest and lowest scores in a distribution. • A five number summary reports the lowest, the first quartile, the median, the third quartile, and highest score. • Five number summaries are often portrayed graphically by the use of box plots.
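The range and five-number summary can be sketched with Python's standard library, using the ten scores from the quartile example. Note the assumption: `statistics.quantiles` with `method="exclusive"` also locates quartiles at (n + 1) * p, but it interpolates between adjacent scores rather than averaging them, so its first quartile (28.75) differs slightly from the averaging rule's 28.5.

```python
from statistics import quantiles

def five_number_summary(scores):
    """Return (lowest, Q1, median, Q3, highest) for a list of scores."""
    ordered = sorted(scores)
    q1, q2, q3 = quantiles(ordered, n=4, method="exclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

scores = [39, 38, 38, 36, 36, 31, 29, 29, 28, 19]
summary = five_number_summary(scores)
score_range = summary[4] - summary[0]  # highest minus lowest

print(summary)      # (19, 28.75, 33.5, 38.0, 39)
print(score_range)  # 20
```

These five values are what a box plot draws: the box spans Q1 to Q3 with a line at the median, and the whiskers reach toward the lowest and highest scores.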
Variance • The variance, $s^2$, represents the amount of variability of the data relative to their mean • The variance is the “average” of the squared deviations of the observations about their mean, with $N - 1$ rather than $N$ in the denominator: $s^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$, where $N$ is the number of scores in the sample • $s^2$ is the sample variance and is used to estimate the actual population variance, $\sigma^2$
Standard Deviation • Considered the most useful index of variability. • It is a single number that represents the spread of a distribution. • If a distribution is normal, then the mean plus or minus 3 SD will encompass about 99.7% of all scores in the distribution.
Calculation of the Variance and Standard Deviation of a Distribution

Raw Score   Mean   X – X̄   (X – X̄)²
85          54      31      961
80          54      26      676
70          54      16      256
60          54       6       36
55          54       1        1
50          54      -4       16
45          54      -9       81
40          54     -14      196
30          54     -24      576
25          54     -29      841
                    Σ(X – X̄)² = 3640

Variance: $SD^2 = \frac{\sum (X - \bar{X})^2}{N - 1} = \frac{3640}{9} = 404.44$

Standard deviation: $SD = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}} = \sqrt{404.44} \approx 20.11$
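The same calculation can be checked in a few lines of Python. This is a sketch that follows the table step by step (mean, squared deviations, division by N - 1) and then compares the result with the standard library's `statistics.variance` and `statistics.stdev`, which also use the N - 1 denominator.

```python
from statistics import mean, stdev, variance

scores = [85, 80, 70, 60, 55, 50, 45, 40, 30, 25]

m = mean(scores)                                      # 54
sum_sq_dev = sum((x - m) ** 2 for x in scores)        # 3640
sample_variance = sum_sq_dev / (len(scores) - 1)      # 3640 / 9
sample_sd = sample_variance ** 0.5

print(m, sum_sq_dev)
print(round(sample_variance, 2), round(sample_sd, 2))  # 404.44 20.11

# The library functions should agree with the hand calculation.
print(round(variance(scores), 2), round(stdev(scores), 2))
```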
Comparing Standard Deviations • Data A: Mean = 15.5, S = 3.338 • Data B: Mean = 15.5, S = .9258 • Data C: Mean = 15.5, S = 4.57 • [Figure: each data set plotted on a number line from 11 to 21]
Facts about the Normal Distribution • 50% of all the observations fall on each side of the mean. • About 68% of scores fall within 1 SD of the mean in a normal distribution. • About 27% of the observations fall between 1 and 2 SD from the mean, so about 95% fall within 2 SD. • About 99.7% of all scores fall within 3 SD of the mean. • This is often referred to as the 68-95-99.7 rule.
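These percentages can be verified numerically. The sketch below uses `statistics.NormalDist` from the Python standard library to compute the share of a standard normal distribution within 1, 2 and 3 SD of the mean; the helper `share_within` is a hypothetical name.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def share_within(k):
    """Proportion of a normal distribution within k SD of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 1))

# Share lying between 1 and 2 SD from the mean, both sides combined.
between_1_and_2 = share_within(2) - share_within(1)
print(round(between_1_and_2 * 100, 1))
```

The rounded figures on the slide (68, 95, 99.7 and 27) are approximations of these values.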
Fifty Percent of All Scores in a Normal Curve Fall on Each Side of the Mean
Standard Scores • Standard scores use a common scale to indicate how an individual compares to other individuals in a group. • The simplest form of a standard score is a Z score. • A Z score expresses how far a raw score is from the mean in standard deviation units: $z = \frac{X - \bar{X}}{SD}$. • Standard scores provide a better basis for comparing performance on different measures than do raw scores. • A probability is a percentage stated in decimal form and refers to the likelihood of an event occurring. • T scores are z scores expressed in a different form: $T = 10z + 50$.
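A short sketch of both conversions, applied to the ten raw scores from the variance calculation (mean 54, SD about 20.11). The function names `z_score` and `t_score` are hypothetical; the formulas are the ones stated above.

```python
from statistics import mean, stdev

def z_score(raw, group_mean, group_sd):
    """Distance of a raw score from the mean, in standard deviation units."""
    return (raw - group_mean) / group_sd

def t_score(z):
    """Rescale a z score to a mean of 50 and an SD of 10."""
    return 10 * z + 50

scores = [85, 80, 70, 60, 55, 50, 45, 40, 30, 25]
m, sd = mean(scores), stdev(scores)

for raw in (85, 54, 25):
    z = z_score(raw, m, sd)
    print(raw, round(z, 2), round(t_score(z), 1))
```

A raw score equal to the mean gives z = 0 and T = 50; scores above the mean give positive z values and T scores above 50.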