Lecture 1 Review: Measurement Typologies, Univariate Distributions, Location and Dispersion
Measurement Typology Stevens (1946) identified 4 levels of measurement. They are: • Nominal or categorical • Ordinal • Interval • Ratio
Levels of Measurement • Nominal or categorical: A series of unordered categories. E.g., (male, female) or (Manitoba, Saskatchewan, Alberta). • Ordinal: A series in which there is an underlying order. We are, however, unaware of the “distance” between each category in terms of the characteristic. E.g., (short, medium, tall) or (high school, college diploma, B.A., graduate degree).
Levels of Measurement • Interval scales: The distance between possible values is constant, allowing us to compare precisely the difference in outcomes. • Ratio scales are interval scales with a true zero point: a value of zero means that none of the phenomenon being measured is present. For example, for a living person, an age value of zero is not really possible, but it is possible to have zero years of formal schooling. • Practically, these two levels are often combined as “interval/ratio” measurement.
Continuous vs. Discrete Measures A continuous variable can take on an infinite number of possible values. For example, age could theoretically be measured in infinitely precise units. On the other hand, the number of children one has is fundamentally discrete, as are other count data.
Continuous vs. Discrete Measures It is very important to consider carefully the underlying process that generates the measurement, and not only the variables as provided in the data set. For example, many survey datasets include the number of years of schooling as a measure of educational attainment. This is a discrete measure (only whole years), but one that can often be considered conceptually continuous. However, we must consider whether it makes sense to think of the variable in this way.
Univariate Distributions Frequency distribution: an ordered list of the possible values of a variable and the number or proportion of observations taking each value.
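As a rough illustration of a frequency distribution, the sketch below tabulates counts and proportions for a small made-up variable. The data (`children`) and the helper name `frequency_distribution` are assumptions for the example, not part of the lecture.

```python
from collections import Counter

# Hypothetical data: number of children reported by 10 respondents.
children = [0, 2, 1, 2, 3, 0, 2, 1, 2, 4]

def frequency_distribution(values):
    """Return (value, count, proportion) rows, ordered by value."""
    counts = Counter(values)
    n = len(values)
    return [(v, counts[v], counts[v] / n) for v in sorted(counts)]

freq_table = frequency_distribution(children)

print("value count  prop")
for value, count, proportion in freq_table:
    print(f"{value:>5} {count:>5} {proportion:>5.2f}")
```

The counts are the absolute frequencies and the proportions are the relative frequencies; either column could be plotted as a bar chart.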
Univariate Distributions Histograms and Bar Charts graphically display the absolute or relative frequencies.
Univariate Distributions Stem-and-Leaf Plots (Tukey, 1972, 1977) display more information than histograms, and are good for small data sets. They group together all the data with the same leading digits on the “stem”, and show the final digits as the “leaves”.
R's Occupational Prestige Score (1980) Stem-and-Leaf Plot
 Frequency    Stem &  Leaf
     7.00        1 .  79
    85.00        2 .  01222222223333444
   145.00        2 .  555677788888888888899999999999
   190.00        3 .  00000000001111111112222222223333344444
   174.00        3 .  5555556666666666666666778899999999
   189.00        4 .  00000000000111122222222222333344444444
   216.00        4 .  5555566666666666666777777777777888889999999
   162.00        5 .  000001111111111111111111112223444
    42.00        5 .  556778899
   103.00        6 .  000000111234444444444
    73.00        6 .  5555666666689&
    21.00        7 .  1344&
     7.00        7 .  5
     4.00 Extremes    (>=86)
 Stem width: 10
 Each leaf: 5 case(s)
 & denotes fractional leaves.
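The grouping idea behind a stem-and-leaf plot can be sketched in a few lines. This is a simplified version under stated assumptions: one row per stem and one leaf per case, unlike the plot above, which splits each stem into two rows and lets each leaf stand for 5 cases. The function name `stem_and_leaf` and the `scores` data are made up for illustration.

```python
from collections import defaultdict

def stem_and_leaf(values, stem_width=10):
    """Group non-negative integers by stem (value // stem_width).

    The leaf is the remainder (value % stem_width), so with the default
    stem width of 10 the stem is the tens digit and the leaf the units digit.
    """
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // stem_width].append(v % stem_width)
    return {stem: "".join(str(leaf) for leaf in leaves)
            for stem, leaves in stems.items()}

# Hypothetical small data set of prestige-like scores.
scores = [17, 19, 20, 21, 22, 22, 25, 31, 34, 36, 36, 42, 47, 51]
plot = stem_and_leaf(scores)

for stem in sorted(plot):
    print(f"{stem:>3} . {plot[stem]}")
```

Because every leaf is an actual final digit, the original values can be read back from the plot, which is the extra information a histogram does not show.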
Central Tendency and Dispersion • Common measures of central tendency (Location) are the mean, median, and the mode. • The mode is the most common value. • The median is the value above which half of the subjects fall (the 50th percentile).
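Central Tendency • The arithmetic mean, or average, is the sum of the values divided by the number of subjects: $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$, where n is the number of subjects and y_i is the value for the ith subject.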
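The three measures of central tendency can be computed with Python's standard `statistics` module. The `ages` values below are invented for the example.

```python
import statistics

# Hypothetical ages of seven respondents.
ages = [23, 25, 25, 31, 38, 44, 52]

mean_age = statistics.mean(ages)      # sum of values / number of subjects
median_age = statistics.median(ages)  # middle value of the sorted data
mode_age = statistics.mode(ages)      # most common value

print(mean_age, median_age, mode_age)
```

Here the mean (34) sits above the median (31) because the two oldest respondents pull the average upward, while the mode (25) reflects only the most frequent value.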
Dispersion • Distributions can have similar central tendencies, but be dramatically different in their spread, or dispersion. • One measure of dispersion is the range, the difference between the largest and smallest observations. The range is a useful measure, but is very sensitive to extreme, or outlying, values.
Dispersion • Another is the interquartile range, which measures the distance between the upper and lower quartiles. • Quartiles are the values below which 25%, 50%, and 75% of the cases fall; they divide the ordered data into four equal parts. • Other quantiles besides quartiles can be used. Deciles can be used to describe the difference in mean income for the lowest decile (bottom 10%) compared to the highest decile (top 10%).
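Dispersion • Box-and-whisker plots show the dispersion through the use of quartiles. • The box contains the middle 50% of cases (from the 25th to the 75th percentile), the line inside it indicates the median, and the whiskers extend outward from the box toward the smallest and largest values; in many versions they stop at the most extreme cases not classed as outliers.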
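The quartile-based measures can be sketched with `statistics.quantiles` from the standard library. Quartile conventions differ between textbooks and software; the `method="inclusive"` choice here is an assumption, and the `values` data are made up.

```python
import statistics

# Hypothetical data with one extreme value (90).
values = [12, 15, 18, 21, 24, 27, 30, 33, 90]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

iqr = q3 - q1                          # interquartile range
data_range = max(values) - min(values) # range

print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}, range={data_range}")
```

The single extreme value inflates the range to 78, while the interquartile range stays at 12, which illustrates why the range is described as sensitive to outlying values.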
Sums of Squares, Variance, and Standard Deviation • The variance is the average of the squared deviations of the elements in the sample or the population around their mean. • The raw deviations from the mean always sum to zero, so we square them before adding them up. The total of these squared deviations is the sum of squares, which we use in calculating the variance.
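Variance • The formula for the variance in a population is: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(Y_i - \mu)^2$, where N is the population size and μ is the population mean. • The sample variance is: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$, where n is the sample size. The denominator (n − 1) is used when we are using the sample variance to estimate the population variance.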
Standard Deviation • The standard deviation is the square root of the variance. It is used particularly because the variance is expressed in squared units, whereas the standard deviation is in the original units of the variable. Population standard deviation: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(Y_i - \mu)^2}$, where N is the population size, Y_i is the value of Y for the ith unit in the population, and μ is the population mean.
Standard Deviation Sample standard deviation: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2}$, where n is the sample size, y_i is the value of y for the ith unit in the sample, and y-bar is the sample mean.
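The chain from sum of squares to variance to standard deviation can be written out directly. This is a minimal sketch; the function names and the `y` data are illustrative assumptions, and the only difference between the population and sample versions is the denominator (N versus n − 1).

```python
import math

def sum_of_squares(values):
    """Sum of squared deviations of the values around their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def population_variance(values):
    """Divide by N: the values are treated as the whole population."""
    return sum_of_squares(values) / len(values)

def sample_variance(values):
    """Divide by n - 1: the values are a sample used to estimate the population variance."""
    return sum_of_squares(values) / (len(values) - 1)

# Hypothetical data with mean 5.
y = [2, 4, 4, 4, 5, 5, 7, 9]

pop_sd = math.sqrt(population_variance(y))
sample_sd = math.sqrt(sample_variance(y))

print(sum_of_squares(y), population_variance(y), pop_sd)
print(sample_variance(y), sample_sd)
```

For these values the sum of squares is 32, so the population variance is 32/8 = 4 and the sample variance is 32/7, slightly larger.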
The Shape of a Distribution It is common to describe distributions in terms of their shape when graphed. For example, some distributions can be described as “bathtub” (U-shaped), “inverted bathtub” (inverted U-shaped), or “bell-shaped”.
Modality Distributions may have only one mode, or have several distinct modes. For a bimodal distribution, a single mode will provide a poor description of the central tendency.
Skewness Asymmetrical curves may also be described with regard to their skewness. If the mean is higher than the median, or the right tail is considerably longer, we say that the distribution is skewed to the right. If the mean is less than the median, or the left tail is longer, it is skewed to the left.
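The mean-versus-median comparison can be turned into a quick check. This is only the rule of thumb stated above, not a formal skewness statistic; the function name `skew_direction` and the `incomes` data are assumptions for illustration.

```python
import statistics

def skew_direction(values):
    """Compare mean and median as a rough indicator of skew direction."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "right"
    if mean < median:
        return "left"
    return "none indicated"

# Hypothetical incomes (in thousands) with one very high earner.
incomes = [20, 22, 25, 27, 30, 32, 35, 150]

print(statistics.mean(incomes), statistics.median(incomes))
print(skew_direction(incomes))
```

The one large income drags the mean (42.625) well above the median (28.5), so the check reports a right skew.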
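Kurtosis Kurtosis describes how peaked a distribution is, and how heavy its tails are, relative to a normal distribution. A distribution that is noticeably more peaked than the normal is called leptokurtic; one that is flatter is called platykurtic.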
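The slide gives no formula for kurtosis. One common way to quantify it, assumed here, is the moment-based excess kurtosis: the fourth central moment divided by the squared second central moment, minus 3 so that a normal distribution scores about zero. The function name and both data sets are made up for the sketch.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

# Hypothetical data: most cases at the centre with two far-out cases.
peaked = [0, 0, 0, 0, 0, 0, 0, 0, -3, 3]
# Hypothetical data: cases split evenly between two values, no peak.
flat = [-1, 1, -1, 1]

print(excess_kurtosis(peaked))  # positive: more peaked than normal
print(excess_kurtosis(flat))    # negative: flatter than normal
```

Positive values indicate a distribution more peaked (and heavier-tailed) than the normal, and negative values a flatter one. Statistical packages often apply small-sample corrections, so their output may differ from this simple version.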
Outliers • Outliers are cases which score much higher or lower than the bulk of the other cases in the sample or the population. • They can be due to problems with the data, such as the inclusion of a case which should have been excluded from the sample (frame problems), or mis-entry of data. • They may also be legitimate cases, which simply have unusually high or low values on some variable. For this reason, outliers should never simply be discarded without investigation.
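The lecture does not give a numerical rule for spotting outliers. One widely used convention, assumed here, flags cases lying more than 1.5 interquartile ranges beyond the quartiles. The sketch below only lists such cases for investigation and does not remove them; the function name, the multiplier default, and the data are illustrative.

```python
import statistics

def flag_outliers(values, k=1.5):
    """Return the cases lying more than k * IQR beyond the quartiles.

    Assumption: the 1.5 * IQR rule and the 'inclusive' quartile method
    are conventions chosen for this sketch, not taken from the lecture.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical data with one suspicious case.
values = [12, 15, 18, 21, 24, 27, 30, 33, 90]
flagged = flag_outliers(values)

print(flagged)
```

A flagged case is a prompt to check for data-entry or sampling-frame problems; if the value turns out to be legitimate, it stays in the data.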
Questions? • What is kurtosis? • Define an “outlier”. • What information do box-and-whisker plots include? • What is the “sum of squares”?
Next Class: • Probability Distributions • Standard Normal probability distribution • Sampling distributions and estimation