Quantitative Methods – Week 2: Descriptive Statistics
Roman Studer, Nuffield College
roman.studer@nuffield.ox.ac.uk
Frequency Distributions
• Frequency distributions provide a summary presentation of the data
• They are a very good way to get a first overview of the data
• Discrete variables
  • Measure the frequency of occurrence of each of the values
  • Example: Poor Law dataset, number of workhouses per county
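A frequency table for a discrete variable can be sketched in a few lines of Python. The workhouse counts below are hypothetical values made up for illustration, not figures from the Poor Law dataset.

```python
from collections import Counter

# Hypothetical data: number of workhouses in each of ten counties.
workhouses = [0, 1, 1, 2, 0, 3, 1, 2, 1, 0]

counts = Counter(workhouses)   # absolute frequency of each distinct value
n = len(workhouses)

# One row per distinct value: (value, absolute frequency, relative frequency)
table = [(value, counts[value], counts[value] / n) for value in sorted(counts)]

print("value  abs.  rel.")
for value, absolute, relative in table:
    print(f"{value:>5} {absolute:>5} {relative:>5.0%}")
```

The absolute frequencies add up to the number of observations and the relative frequencies add up to 100%, which is a quick check on any frequency table.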
Frequency Distributions (II)
• Continuous variables
  • Choose appropriate class intervals and note the frequency of occurrence for each class
  • The number of class intervals is normally between 5 and 20
  • Example: Per capita relief payments in the parishes of Kent, 1831
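For a continuous variable the same idea works once the values are grouped into class intervals. The sketch below assumes twelve hypothetical payment values and six classes of equal width; neither the values nor the class limits come from the Kent data.

```python
# Hypothetical per capita relief payments for twelve parishes.
payments = [4.5, 7.2, 9.8, 10.0, 12.4, 13.1, 15.0, 16.7, 18.3, 21.9, 24.6, 29.5]

# Assumed class scheme: six classes of width 5, each including its lower limit
# and excluding its upper limit: [0, 5), [5, 10), ..., [25, 30).
lower, width, k = 0, 5, 6

frequencies = [0] * k
for p in payments:
    index = int((p - lower) // width)   # class into which the payment falls
    frequencies[index] += 1

for i, f in enumerate(frequencies):
    lo = lower + i * width
    print(f"{lo:>2} to under {lo + width:<2}: {f}")
```

Stating explicitly which limit belongs to a class matters for values such as 10.0 that sit exactly on a boundary.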
Descriptive Statistics
• Making frequency tables and plotting them as histograms and frequency curves is very helpful for a first overview
• But it is not a precise way to summarize the information on the variables in a dataset. To do this, we normally determine three features of a variable:
  • Which are the most central (i.e. the most common or typical) values?
  • How are the values spread (dispersed) around those central values?
  • What is the shape of the distribution?
• Each of these features can be described by one or more simple statistics. Together they form the basic elements of descriptive statistics, as they provide a precise and comprehensive summary of the variables in a dataset
Measures of Central Tendency
• The arithmetic mean
  • Add up all the values and divide this total by the number of observations
• The median
  • The value that has one-half of the observations above it and one-half below it when the series is set out in an ascending or descending array
  • Odd number of observations: Position = (number of observations + 1) / 2
  • Even number of observations: Average of the two middle observations
• The mode
  • The value that occurs most frequently
• Percentiles, deciles, and quartiles
  • Instead of dividing the observations into two equal halves (median), we can divide them into four equal quarters (quartiles), 10 equal portions (deciles), or 100 equal portions (percentiles)
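The position rule for the median can be written out directly. This is a minimal sketch on two hypothetical series; `median_by_position` is a name chosen for illustration, and Python's own `statistics.median` is used only as a cross-check.

```python
import statistics

def median_by_position(values):
    """Median via the position rule: the middle value, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        position = (n + 1) // 2            # 1-based position of the middle value
        return ordered[position - 1]
    upper = n // 2                         # 0-based index of the upper middle value
    return (ordered[upper - 1] + ordered[upper]) / 2

odd_series = [9, 3, 7, 1, 5]               # hypothetical values
even_series = [9, 3, 7, 1, 5, 11]

print(median_by_position(odd_series), statistics.median(odd_series))
print(median_by_position(even_series), statistics.median(even_series))

# Quartiles: three cut points that split the ordered series into four parts.
# The exact cut points depend on the interpolation convention a package uses.
quartiles = statistics.quantiles(even_series, n=4)
print(quartiles)
```

Different packages interpolate quartiles and percentiles slightly differently, so small discrepancies between programs are normal for short series.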
Measures of Central Tendency (II)
• Numeric example
  • What is the mean, the median, the mode? Mean = 6, median = 5, mode = 5
  • What is the effect of adding value x8? Mean = 9, median = 6, mode = 5
  • What is the absolute and what is the relative frequency of 5? Absolute = 2, relative = 25%
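The numeric example can be replayed in Python. The seven values below, and the added eighth value of 30, are assumptions: they are one hypothetical series chosen only because it reproduces the answers given on the slide, not the original data.

```python
from statistics import mean, median, mode

# Hypothetical series (an assumption, not the slide's data).
x = [2, 3, 5, 5, 7, 8, 12]
print(mean(x), median(x), mode(x))

# Adding one large value x8 (assumed to be 30) pulls the mean up sharply,
# moves the median only a little, and leaves the mode unchanged.
x_with_x8 = x + [30]
print(mean(x_with_x8), median(x_with_x8), mode(x_with_x8))

absolute = x_with_x8.count(5)              # absolute frequency of the value 5
relative = absolute / len(x_with_x8)       # relative frequency of the value 5
print(absolute, f"{relative:.0%}")
```

The comparison illustrates why the median is often preferred to the mean when a series contains extreme values.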
Measures of Dispersion
• Two variables with equal arithmetic mean, but different spread
• [Figure: frequency curves f(x) and f(y) centred on the common mean of x and y]
• Variable x is more densely distributed around the mean than variable y
Measures of Dispersion (II)
• The variance
  • The variance is equal to the arithmetic mean of the squared deviations from the mean
  • The variance is widely used in statistical work; its disadvantage is that it is expressed in squared units
• The standard deviation
  • The standard deviation is the square root of the variance
  • Interpretation: Average or typical deviation of variable x from the arithmetic mean
  • The standard deviation is the most widely used measure of dispersion. However, because it is expressed in the same units as the series, standard deviations of series with different underlying units cannot be compared directly
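Written as formulas, the two verbal definitions above read as follows. The notation is an assumption made for this sketch: $n$ observations $x_1, \dots, x_n$, mean $\bar{x}$, variance $s^2$ and standard deviation $s$. The divisor is $n$ because the variance is defined here as the arithmetic mean of the squared deviations; some textbooks and packages divide by $n - 1$ instead.

```latex
s^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2,
\qquad
s = \sqrt{s^2} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
```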
Measures of Dispersion (III)
• The coefficient of variation (CV)
  • The coefficient of variation is a measure of relative rather than absolute variation
  • It is obtained by dividing the standard deviation by the mean
  • Interpretation: Typical deviation expressed as a proportion (or, multiplied by 100, a percentage) of the mean
• The range
  • A very crude measure of dispersion, defined as the difference between the maximum and the minimum value in the series
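The four measures of dispersion can be computed side by side. The series is hypothetical; `pvariance` and `pstdev` from Python's `statistics` module are used because they divide by the number of observations, which matches the definition of the variance as the mean of the squared deviations.

```python
from statistics import mean, pstdev, pvariance

x = [2, 4, 4, 4, 5, 5, 7, 9]       # hypothetical series

variance = pvariance(x)            # mean of the squared deviations from the mean
std_dev = pstdev(x)                # square root of the variance
cv = std_dev / mean(x)             # coefficient of variation (unit-free)
value_range = max(x) - min(x)      # range: maximum minus minimum

print(f"variance = {variance}")
print(f"standard deviation = {std_dev}")
print(f"coefficient of variation = {cv:.0%}")
print(f"range = {value_range}")
```

Because the coefficient of variation has no units, it is the only one of the four that can be compared across series measured in, say, shillings and bushels.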
The Shape of Distributions
• Normal distribution
  • The normal distribution is a symmetrical, smooth, bell-shaped distribution that is fully described by its arithmetic mean and standard deviation
  • Mode, median and mean are equal
  • Skewness and kurtosis of the normal distribution are equal to 0 and 3, respectively
  • But again: Mean and standard deviation depend on the units of the series and are thus difficult to compare across series
The Shape of Distributions (II)
• Standard normal distribution
  • Every normal distribution can be transformed into a standard normal distribution using z = (x − mean) / standard deviation
  • By construction, the standard normal distribution has two further properties that a general normal distribution does not have:
    • mean = 0
    • standard deviation = 1
  • These properties make the distribution ideal for comparison
  • For this reason the standard normal distribution has a key role in inductive statistics, as it can be used to make inferences about probabilities
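The transformation can be checked numerically: after subtracting the mean and dividing by the standard deviation, any series has mean 0 and standard deviation 1. This is a minimal sketch on a hypothetical series; `standardise` is a name chosen for illustration.

```python
from statistics import mean, pstdev

def standardise(values):
    """Return z-scores: each value minus the mean, divided by the standard deviation."""
    m = mean(values)
    s = pstdev(values)                     # standard deviation with divisor n
    return [(v - m) / s for v in values]

x = [2, 4, 4, 4, 5, 5, 7, 9]               # hypothetical series
z = standardise(x)

print(z)
print(mean(z), pstdev(z))                  # close to 0 and 1
```

Note that standardising only changes location and scale. It does not change the shape of the distribution, so a skewed series remains skewed after the transformation.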
The Shape of Distributions (III)
• Skewed distributions
  • However, values need not be symmetrically distributed around the central point, i.e. distributions can be skewed
  • In these cases, mean and standard deviation are insufficient to describe the distribution
  • Socio-economic data in particular (wages, income, wealth and related variables) are frequently skewed
• [Figure: frequency curve over x that is skewed to the right (positively skewed), with mode, median and mean marked]
The Shape of Distributions (IV)
• Consequences of skewed distributions
  • Skewed variables can lead to undesirable effects in regressions
    • Non-normally distributed residuals (misspecification)
    • Heteroscedasticity; test statistics and confidence intervals are biased
  • (Roughly) normally distributed variables help to avoid these problems
• Take a look at the variable
  • If the variable is not significantly skewed, continue
  • If the variable is skewed, transform it using the “Ladder of Powers”. This is why you often find the logarithm of income, the square root of the mortality rate, etc.
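The effect of moving down the ladder of powers can be seen on a small example. The income figures are hypothetical, and the skewness measure used is the usual moment-based one (mean of the cubed deviations divided by the cubed standard deviation), which is an assumption about the definition intended in the lecture.

```python
import math
from statistics import mean, pstdev

def skewness(values):
    """Moment-based skewness: mean cubed deviation divided by the cubed standard deviation."""
    m, s = mean(values), pstdev(values)
    third_moment = sum((v - m) ** 3 for v in values) / len(values)
    return third_moment / s ** 3

# Hypothetical incomes: most are modest, a few are very large.
incomes = [10, 12, 15, 18, 22, 30, 45, 80, 150, 400]
log_incomes = [math.log(v) for v in incomes]

print(f"skewness of income:      {skewness(incomes):.2f}")
print(f"skewness of log(income): {skewness(log_incomes):.2f}")
```

For this series the logarithm pulls in the long right tail, so the transformed variable is noticeably less skewed than the raw one, although it is still not perfectly symmetrical.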
The Shape of Distributions (V)
• Kurtosis
  • Furthermore, two symmetrically distributed variables with equal mean and standard deviation can still have different distributions, i.e. they can differ in kurtosis
• [Figure: symmetric frequency curves f(x) and f(y) around the same mean, with the standard deviations of x and y marked; variable y has the larger kurtosis]
The Shape of Distributions (VI)
• Measures for skewness and kurtosis
  • Measures for skewness and kurtosis therefore tell us more about a distribution
  • Skewness and kurtosis of a normally distributed variable are zero and three, respectively
• Skewness (a3):
  • a3 > 0: distribution skewed to the right (positively skewed)
  • a3 < 0: distribution skewed to the left (negatively skewed)
• Kurtosis (a4):
  • a4 > 3: heavier tails and a sharper peak than a normal distribution
  • a4 < 3: thinner tails and a flatter peak than a normal distribution
• For a meaningful and comparable measure of a4, the distribution should be symmetrical (hence again the need for a roughly normal distribution)
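The formulas for a3 and a4 are not reproduced in the text above. A common moment-based definition, assumed here, standardises the third and fourth moments about the mean by the corresponding power of the standard deviation. The notation is also an assumption: $n$ observations $x_1, \dots, x_n$ with mean $\bar{x}$ and standard deviation $s$ (divisor $n$).

```latex
a_3 = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^3}{s^3},
\qquad
a_4 = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^4}{s^4}
```

With these definitions a normal distribution has $a_3 = 0$ and $a_4 = 3$, in line with the benchmark values quoted above. Statistical packages may apply small-sample corrections, so reported values can differ slightly from these raw moments.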
Computer Class
• Getting started with STATA
• Descriptive statistics
STATA Basics
• Stata is a statistical package for managing, analysing, and graphing data
• It can be used in two different ways
  • As a point-and-click application
    • Easy interface for those new to Stata and for those who don’t use it very often
    • … for us (at least at the beginning)!
  • As a command-driven package
    • Very fast once you are used to the commands
    • Good for communicating more complex ideas
    • One of the main advantages of Stata over SPSS
• A helpful guide
  • Hamilton, Lawrence C., Statistics with STATA (regularly updated editions)
Getting Started Together
• Various data formats
  • Data come in various formats and file extensions, most often
    • .xls : Excel
    • .sav : SPSS
    • .dta : STATA
    • .txt : Text files
  • STATA can import all these formats: File/Import/....
1) Download data file
  • Relief dataset from Feinstein & Thomas, available online: http://www.cambridge.org/uk/catalogue/catalogue.asp?isbn=9780521806633&ss=res
  • Download the Stata file and save it to your folder on the O: drive
2) Open data
  • Open/Relief dataset
3) Open data editor
  • Open the Data Editor and try to understand the structure of the dataset
  • What do the rows and columns mean?
  • Change the names of some variables
  • Sort the relief payments in ascending order: what was the minimum paid, what was the maximum?
Getting Started Together (II)
4) List some variables: Data/Describe data/List data
  • Relief
  • Income
5) Tabulate some variables:
  • Income
  • Relief
6) Frequencies
  • Get an overview of the distribution with a histogram (Graphics/Histogram)
  • The number of bins sets the number of bars (i.e. the number of categories)
  • Which variables look normally distributed, and which do not?
7) Descriptive statistics (central tendency & dispersion)
  • Mean, stdv, min, max (Data/Describe data/Summary statistics)
  • Skewness, kurtosis, median, quartiles, percentiles, etc. (Data/Describe data/Summary statistics/Display additional statistics)
Getting Started Together (III)
8) Export some tables and graphs to Word
  • Right-click and copy; insert in Word
9) If you’re stuck: Help/…
Appendix: STATA Commands
• edit : Opens the Data Editor
• sort varname : Arranges the observations in ascending order based on the values of varname
• tabulate varname : Produces a one-way table of frequency counts (absolute, relative and cumulative frequency)
• summarize varname : Calculates a variety of summary statistics (obs, mean, stdv, min, max)
• summarize varname, detail : Gives more detailed statistics, for instance kurtosis, skewness, percentiles, etc.
• histogram varname, bin(#) : Creates a histogram with # categories (bins)
Homework
• Readings:
  • Feinstein & Thomas, Ch. 3
• Problem Set 1:
  • Do exercises 1, 2 (Relief dataset), and 7 (The Old Poor Law in England) from chapter 2.7 (pp. 66-70)
  • Submit your solutions, including graphs and tables, in a Word file by noon on Monday (29 January)