Statistics bootcamp Laine Ruus Data Library Service, University of Toronto Rev. 2005-04-26
Outline • Describing a variable • Describing relationships among two or more variables
First, some vocabulary • Variable: In social science research, for each unit of analysis, each item of data (e.g., age of person, income of family, consumer price index) is called a variable. • Unit of analysis: The basic observable entity being analyzed by a study and for which data are collected in the form of variables. A unit of analysis is sometimes referred to as a case or "observation"
The variable ‘Sex’ can be coded as: • 1=‘male’ 2=‘female’ 3=‘no response’, or • 1=‘female’ 2=‘male’, or • 1=‘male’ 2=‘female’, or • ‘M’=‘male’ ‘F’=‘female’, or • ‘male’ ‘female’, or even • 1=‘yes’ 2=‘no’ 3=‘maybe’
The values a variable can take must be: • exhaustive: include the characteristics of all cases, and • mutually exclusive: each case must have one and only one value or code for each variable
What’s wrong with this coding scheme? • Under $3,000 • $3,000-$7,000 • $8,000-$12,000 • $13,000-$17,000 • $18,000-$22,000 • $23,000-$27,000 • $28,000-$32,000 • $33,000-$37,000 • $38,000 & over (Source: Census of Canada, 1961: user summary tapes)
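One way to probe a scheme like this against the two requirements (exhaustive, mutually exclusive) is to check the category boundaries mechanically. The sketch below is only an illustration: the helper name `check_scheme` is hypothetical, and it assumes incomes are recorded in whole dollars and that "Under $3,000" means up to $2,999.

```python
# Hypothetical helper; the name and the whole-dollar assumption are illustrative.
def check_scheme(bounds):
    """bounds: ascending list of (low, high) pairs; None marks an open end."""
    gaps, overlaps = [], []
    for (_, high), (low, _) in zip(bounds, bounds[1:]):
        if low > high + 1:          # whole-dollar amounts left uncovered
            gaps.append((high + 1, low - 1))
        elif low <= high:           # amounts that fit two categories
            overlaps.append((low, high))
    return gaps, overlaps

# The 1961 census income groups from the slide, in whole dollars.
census_1961 = [
    (None, 2999), (3000, 7000), (8000, 12000), (13000, 17000),
    (18000, 22000), (23000, 27000), (28000, 32000), (33000, 37000),
    (38000, None),
]
gaps, overlaps = check_scheme(census_1961)
print(len(gaps), gaps[0])   # number of uncovered ranges and the first one
print(overlaps)
```

Under that reading, an income such as $7,500 has no category to go in, so the scheme as printed is not exhaustive.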
Variables are normally coded numerically, because: • arithmetic is easier with numbers than with letters • some characteristics are inherently numeric: age, weight in kilograms or pounds, number of children ever borne, income, value of dwelling, years of schooling, etc. • space/size of data sets has, until recently, been a major consideration
Three basic types of variables • Nominal (categorical), aka nonorderable discrete, e.g., gender, ethnicity, immigrant status, province of residence, marital status, labour force status, etc. • Ordinal (categorical), aka orderable discrete, e.g., highest level of schooling, social class, left-right political identification, Likert scales, income groups, age groups, etc. • Continuous, aka interval or numeric, e.g., actual age, income, etc.
Descriptive statistics summarize the properties of a sample of observations • how the units of observation are the same (central tendency) • how they are different (dispersion) • how representative the sample is of the population at large (significance)
Nominal variable: • Central tendency: mode • Dispersion: frequency distribution; percentages, proportions, odds; index of qualitative variation (IQV) • Significance: coefficient of variation (CV) • Visualization: bar chart
Mode: the category with the largest number or percentage of observations in a frequency distribution
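As a rough illustration, the frequency distribution, the mode, and the IQV for a nominal variable can be computed as below. The responses are made up, and the IQV formula used is the usual textbook one, (K/(K-1)) × (1 - sum of squared proportions) with K categories; the slides name the index but do not give a formula.

```python
from collections import Counter

# Made-up responses for illustration only; not census data.
responses = ["married"] * 5 + ["single"] * 3 + ["widowed"] + ["divorced"]

freq = Counter(responses)                 # frequency distribution
n = sum(freq.values())
percent = {cat: 100 * count / n for cat, count in freq.items()}

# Mode: the category with the largest number of observations.
mode, mode_count = freq.most_common(1)[0]

# IQV = (K / (K - 1)) * (1 - sum of squared proportions), K = number of categories.
k = len(freq)
iqv = (k / (k - 1)) * (1 - sum((c / n) ** 2 for c in freq.values()))

print(mode, percent[mode])
print(round(iqv, 3))
```

An IQV near 1 means the cases are spread almost evenly across the categories; an IQV of 0 means they all fall in one category.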
The frequencies can be visualized as a bar chart (based on percentages):
The same distribution, from one of the Canadian overview files:
Why the differences? • What is the population in each table? • What is in the denominator in each table? • Which one is correct?
The most important thing to know about any distribution, whether it is a rate, a proportion, or a percent, is what is in the denominator. And it must always be reported.
We can also derive the distribution information from the 2001 individual pumf using a statistical package:
If we weight the distribution, both the frequencies and the percentages will almost match the distribution from the Profile file:
Just a few words on weighting: • The weight is the inverse of the chance that any member of the population (universe) had of being selected for the sample, i.e., the number of people in the population that each sampled case represents • In general: the weight can be used to produce estimates for the total population (population weight), and/or it can adjust for known deficiencies in the sample (sample weight)
and a few more words on weighting… • The 2001 census public use microdata file of individuals is a 2.7% sample of the population • The weight variable (weightp) ranges from 35.545777 to 39.464996 • Knowing who was excluded from the sample is as important as knowing who was included
And some final words on weighting… • When to use the population weight variable • when you are producing frequencies to reflect the frequency in the population • When you don’t need to use the population weight variable • when you are producing percentages, proportions, ratios, rates, etc. • When to use a sample weight variable • always
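The sketch below illustrates why the population weight matters for frequencies but hardly at all for percentages. It treats the weight as the number of people each sampled person represents (the inverse of the sampling fraction), so a 2.7% sample implies a weight of roughly 37, in line with the weightp range quoted above. The records and their weights are made up; they are not PUMF values.

```python
# A 2.7% sample implies each case stands for about 1 / 0.027 people.
implied_weight = 1 / 0.027

# Made-up (marital status, weight) records; weights mimic the narrow
# weightp range but are not real PUMF values.
sample = [
    ("married", 37.0), ("married", 36.5), ("single", 38.0),
    ("single", 37.5), ("married", 36.0), ("widowed", 39.0),
]

def distribution(records, weighted):
    """Return (totals, percents) by category, with or without weights."""
    totals = {}
    for category, weight in records:
        totals[category] = totals.get(category, 0.0) + (weight if weighted else 1.0)
    grand_total = sum(totals.values())
    percents = {c: 100 * t / grand_total for c, t in totals.items()}
    return totals, percents

raw_counts, raw_pct = distribution(sample, weighted=False)
est_counts, est_pct = distribution(sample, weighted=True)

print(round(implied_weight, 2))
print(raw_counts["married"], est_counts["married"])            # sample count vs population estimate
print(round(raw_pct["married"], 1), round(est_pct["married"], 1))
```

Because the weights vary so little, the weighted and unweighted percentages stay close, while the weighted frequencies are about 37 times the unweighted ones.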
Proportions, percents, and odds • Percent: “the percent married in the population 15 years and over is…” = (#married / population 15 years and over) × 100 = 49.47% • Proportion: “the proportion married in the population 15 years and over is…” = (#married / population 15 years and over) = .4947 • Odds: “the odds on being married are…” = (#married / #not married in the population), or = proportion married / (1 - proportion married) = .4947 / (1 - .4947) = .4947 / .5053 = .9790
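The three measures are different views of the same two numbers, as this small sketch shows. The function name `describe_share` is hypothetical, and because the slide gives only the proportion (.4947), a notional total of 10,000 is assumed to reproduce it.

```python
def describe_share(count, total):
    """Return (percent, proportion, odds) for `count` cases out of `total`."""
    proportion = count / total
    percent = 100 * proportion
    odds = proportion / (1 - proportion)   # same as count / (total - count)
    return percent, proportion, odds

# Notional counts chosen only to reproduce the slide's proportion of .4947.
percent, proportion, odds = describe_share(4947, 10_000)
print(round(percent, 2), round(proportion, 4), round(odds, 4))
```

Odds below 1 mean the outcome is less common than its complement; here slightly fewer people are married than not married.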
Coefficient of variation • measures how representative the variable in the sample is of the distribution in the population • computed as ((standard deviation/mean)*100) [we will discuss these measures in the context of continuous variables] • see Stats Can guidelines in user guides: • cv< 16.6% is ok to publish, cv>33.3% do not publish • SDA reports the cv when generating frequencies
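A minimal sketch of the calculation as the slide defines it, with the two publication thresholds applied. The function names and the data are made up for illustration, and the slide does not say how to treat a CV between 16.6% and 33.3%, so the code reports that band as unspecified rather than guessing.

```python
import statistics

def coefficient_of_variation(values):
    """CV as defined on the slide: (standard deviation / mean) * 100."""
    return statistics.stdev(values) / statistics.mean(values) * 100

def release_guideline(cv):
    # Only the two thresholds quoted on the slide are encoded here.
    if cv < 16.6:
        return "ok to publish"
    if cv > 33.3:
        return "do not publish"
    return "not specified on the slide"

values = [20, 22, 19, 21, 18]   # made-up numbers
cv = coefficient_of_variation(values)
print(round(cv, 1), release_guideline(cv))
```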
Ordinal variable: • Central tendency: median and mode • Dispersion: frequency distribution; range; percentages/quantiles, proportions, odds; index of qualitative variation (IQV) • Significance: coefficient of variation (CV) • Visualization: histogram
The median is the value that divides an orderable distribution exactly into halves. Finding the median is easier if we compute cumulative percentages, e.g., in Excel
Age group              %      Cum%
0-4 years            5.65     5.65
5-9 years            6.59    12.24
10-14 years          6.84    19.08
15-24 years         13.36    32.44
25-34 years         13.31    45.75
35-44 years         17.00    62.75
45-54 years         14.73    77.48
55-64 years          9.56    87.04
65-74 years          7.14    94.18
75-84 years          4.43    98.61
85 years and over    1.39   100.00
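The same cumulative percentages can be produced outside a spreadsheet. This is a minimal sketch using the percentages from the table; the variable names are illustrative.

```python
from itertools import accumulate

# Age distribution from the table: (category, percent).
age_pct = [
    ("0-4 years", 5.65), ("5-9 years", 6.59), ("10-14 years", 6.84),
    ("15-24 years", 13.36), ("25-34 years", 13.31), ("35-44 years", 17.00),
    ("45-54 years", 14.73), ("55-64 years", 9.56), ("65-74 years", 7.14),
    ("75-84 years", 4.43), ("85 years and over", 1.39),
]

# Running total of the percentages, in category order.
cumulative = list(accumulate(pct for _, pct in age_pct))

# The median category is the first one whose cumulative percentage reaches 50.
median_category = next(
    cat for (cat, _), cum in zip(age_pct, cumulative) if cum >= 50
)
print(median_category)
```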
So how can we describe this distribution, using the vocabulary we have so far? • What is the mode of this distribution? • What is the median? • What is the range?
Percentiles/quantiles • a percentile/quantile is the value below which a given percentage of the cases fall • the median = the 50th percentile • quartiles are the values that divide a distribution into 4 groups, each containing 25% of the cases
Interquartile range • The interquartile range is the difference between the value at 75% of cases, and the value at 25% of cases • What is the interquartile range for the following distribution?
Age group              %      Cum%
0-4 years            5.65     5.65
5-9 years            6.59    12.24
10-14 years          6.84    19.08
15-24 years         13.36    32.44
25-34 years         13.31    45.75
35-44 years         17.00    62.75
45-54 years         14.73    77.48
55-64 years          9.56    87.04
65-74 years          7.14    94.18
75-84 years          4.43    98.61
85 years and over    1.39   100.00
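With grouped data the quartiles can only be estimated. One common approach, sketched here, is linear interpolation within the group where the cumulative percentage crosses 25% or 75%. It assumes that cases are spread evenly within each age group and that a group such as 15-24 years covers ages 15 up to, but not including, 25; neither assumption comes from the slides.

```python
# (lower age bound, width in years, percent) for each closed age group in the table.
# The open-ended "85 years and over" group is left out because it has no upper bound.
groups = [
    (0, 5, 5.65), (5, 5, 6.59), (10, 5, 6.84), (15, 10, 13.36),
    (25, 10, 13.31), (35, 10, 17.00), (45, 10, 14.73), (55, 10, 9.56),
    (65, 10, 7.14), (75, 10, 4.43),
]

def percentile(groups, target):
    """Estimate the age below which `target` percent of cases fall."""
    cum = 0.0
    for lower, width, pct in groups:
        if cum + pct >= target:
            # Interpolate linearly inside the group that contains the target.
            return lower + (target - cum) / pct * width
        cum += pct
    raise ValueError("target falls in the open-ended top group")

q1 = percentile(groups, 25)
q3 = percentile(groups, 75)
print(round(q1, 1), round(q3, 1), round(q3 - q1, 1))
```

Without interpolation, the answer can only be given as categories: the 25th percentile falls in the 15-24 group and the 75th in the 45-54 group.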
Continuous variable: • Central tendency: mean, median and mode • Dispersion: range; variance or standard deviation; quantiles/percentiles; interquartile range • Significance: standard error; coefficient of variation • Visualization: polygon (line graph)
Means, variances, and standard deviations: • Mean: the average, computed by adding up the values of all observations and dividing by the number of observations (cases) • Variance: computed by subtracting the mean from each value, squaring the results, adding them up, and dividing by the number of cases (strictly, by N-1 for a sample) • Standard deviation: the square root of the variance, expressed in the same metric as the variable. In a normal distribution, about 68% of the cases fall within one standard deviation above or below the mean.
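The three definitions translate directly into a few lines of code. The income values below are made up; the hand calculation is shown next to Python's standard `statistics` module, which uses the same N-1 divisor for `variance` and `stdev`.

```python
import statistics

incomes = [12_000, 18_000, 25_000, 31_000, 64_000]   # made-up values

mean = statistics.mean(incomes)

# Sample variance: squared deviations from the mean, divided by N - 1.
variance = sum((x - mean) ** 2 for x in incomes) / (len(incomes) - 1)

# Standard deviation: square root of the variance, back in dollars.
std_dev = variance ** 0.5

print(mean, variance, round(std_dev, 2))
print(statistics.variance(incomes), round(statistics.stdev(incomes), 2))
```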
Availability of continuous variables in Stats Can products: • Stats Can rarely publishes truly continuous variables in its aggregate statistics products • Some exceptions are: • age by single years (census) • estimates of population by single years of age (Annual demographic statistics)
Statistics Canada generally reports the distribution of continuous variables as: • Measures of central tendency: an ordinal (categorical) variable; an average (mean) and standard error, variance, or standard deviation; a median; rates (other than per 100), e.g., children ever born per 1,000 women in the 1991 census 2B profile • Measures of dispersion: percentages, essentially rates per 100 population, e.g., incidence of low income as a % of economic families, in the census profiles (this includes employment and unemployment rates); quantiles (e.g., quintiles in Income trends in Canada); Gini coefficients (computed from the coefficient of variation) (e.g., Income trends in Canada) • Measures of significance: standard error
In the following distribution… • What is the median? • What is the mean? • What is the range?
Using the percentages and cumulative percentages: • What is your best estimate of the interquartile range? • What is your best estimate of the standard deviation? • Why is the average (mean) income so much higher than the median? • See pages 18-19 of your handout
Using the standard error to describe more of the distribution: • the standard error of the mean is a measure of how close the mean in the data we are looking at is likely to be to the mean in the population at large • computed as the standard deviation divided by the square root of N • the larger the N, the smaller the standard error, and the more confidence we can have that the sample mean is representative of the population
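A short sketch of the calculation, using made-up values, shows how the standard error shrinks as N grows while the spread of the data stays about the same. The function name is illustrative.

```python
import math
import statistics

def standard_error(values):
    """Standard error of the mean: standard deviation / sqrt(N)."""
    return statistics.stdev(values) / math.sqrt(len(values))

small = [28, 31, 30, 27, 34]     # made-up values, N = 5
large = small * 20               # same values repeated, N = 100

print(round(standard_error(small), 3))
print(round(standard_error(large), 3))
```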
Confidence intervals • The standard error makes most sense when we use it to compute a confidence interval around the mean • For Canada, in the previous example: • 95% upper confidence limit (UCL) = mean + 1.96 × standard error = 29769 + 1.96 × 19 = 29769 + 37.24 = $29,806.24 • 95% lower confidence limit (LCL) = mean - 1.96 × standard error = 29769 - 1.96 × 19 = 29769 - 37.24 = $29,731.76 • 1.96 is the z-value that bounds the central 95% of a normal distribution • The handout contains computed confidence intervals for each of the provinces (p. 22)
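The arithmetic for the Canada example can be reproduced as follows. The function name is illustrative; the mean (29,769) and standard error (19) are the figures quoted on the slide.

```python
def confidence_interval(mean, standard_error, z=1.96):
    """Return (lower, upper) confidence limits; z = 1.96 corresponds to 95%."""
    margin = z * standard_error
    return mean - margin, mean + margin

# Figures for Canada quoted on the slide.
lcl, ucl = confidence_interval(29769, 19)
print(round(lcl, 2), round(ucl, 2))
```

Passing a different `z` (for example 2.576 for 99%) widens or narrows the interval without changing the rest of the calculation.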
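How do we interpret this? • if we drew repeated random samples from the same population and computed a 95% confidence interval from each, about 95% of those intervals would contain the population mean; $29,732 to $29,806 is one such interval • this is not the same as saying that 95% of sample means will fall between those two limits, or that there is a 95% probability that the population mean falls within this particular interval.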
Using microdata • Statistical packages such as SAS, SPSS, etc. will compute the mean, the standard deviation, standard error, etc. for continuous variables • SDA will only report these measures for variables with fewer than 8,000 values