STK 4600: Statistical methods for social sciences. Survey sampling and statistical demography. Surveys for households and individuals. Anders Holmberg and Li-Chun Zhang (based on original notes by Jan F. Bjørnstad)
Survey sampling: 4 major topics • Traditional design-based statistical inference (5 weeks) • Likelihood considerations (1 week) • Model-based statistical inference (2 weeks) • Missing data and nonresponse (1 week)
Statistical demography (1 week) • Mortality • Life expectancy • Population projections
Course goals • Knowledge about: • planning surveys in social sciences • major sampling designs • basic concepts and the most important estimation methods in traditional applied survey sampling • Likelihood principle and its consequences for survey sampling • Use of modeling in sampling • Treatment of nonresponse • A basic knowledge of demography
What is a Survey? • The need for statistical information seems endless in modern society. • One important mode for data collection is a sample survey • In many countries, a central statistical office is mandated by law to provide statistical information about the state of the nation and surveys are an important part of this activity.
For example, in Canada, the 1971 Statistics Act mandates Statistics Canada to, ”… collect, compile, analyze, abstract, and publish statistical information relating to commercial, industrial, financial, social, economic, and general activities and condition of the people.”
What is a survey? (Dalenius: 7 points) 1. A survey concerns a set of objects comprising a population • A finite set of objects like individuals, businesses or farms • Events occurring at specified time intervals, like crimes and accidents • Processes in the environment, like land use or the occurrences of wildlife species in an area 2. The population under study has one or more measurable properties.
3. The goal is to describe the population by one or more parameters defined in terms of the measurable properties. 4. To get observational access to the population, a frame is needed. 5. A sample of objects is selected from the frame in accordance with a sampling design that specifies a probability mechanism and a sample size.
6. Observations are made on the sample in accordance with a measurement process. 7. Based on the measurements, an estimation process is applied to compute estimates of the parameters when making inference from the sample to the population.
Example • Labor force surveys • Population • Domains of interest • Variables • Population characteristics of interest • Sample
What is a Survey? (ASA) • The word is most often used to describe a method of gathering information from a sample of individuals. • The sample is usually just a fraction of the population being studied. • Data can be collected in many ways, including by telephone, by mail, by web or in person. • The size of the sample depends on the purpose of the study.
The sample is scientifically chosen so that each unit in the population will have a measurable chance (>0) of selection. • Information is collected by means of standardized procedures. • The individual respondents should never be identified in the result. The results should be presented in completely anonymous summaries.
ASA (cont.) • How large must the sample size be? • What are some common survey methods? • What survey questions do you ask? • What about confidentiality and integrity?
How to plan a survey • The first step is to lay out the objectives of the investigation: what do we want to know? • Define the target population • Determine the mode of administration • Develop the questionnaire • Design the sampling approach
Phases of a Survey (flow chart) • General problem • Statistical problem • Population • Variables • Tabulation plan • Frame • Sample • Method of measurement • Measurement instrument • Data collection • Coding, data entry • Editing, updating • Quality, documentation • Estimation/tabulation • Analysis • Publication
How to plan a survey, cont. • How to plan a survey questionnaire? • How to get good coverage? • How to choose a random sample? • How to plan in quality? • How to schedule? • How to budget?
How to collect survey data? • Mail surveys, telephone surveys, internet, interviewing, mixed mode … • CATI (computer-assisted telephone interviewing), CAPI (computer-assisted personal interviewing). • Failure to follow up non-respondents can ruin an otherwise well-designed survey. • Murphy’s Law: “If anything can go wrong it will” … but “If you didn’t check on it, it did”.
Margin of error • An estimate from a survey is unlikely to exactly equal the true population quantity of interest. • The “margin of error” is a common summary of sampling error that quantifies uncertainty about a survey result. • The sampling error as well as the non-sampling error in a survey will affect the margin of error.
Summary • Unfortunately, there are no absolute criteria to dictate the best choice of mode, questionnaire design, data collection protocol, and so on to use in each situation. Rather, survey design is guided by past experience, theories, and good advice on the advantages and disadvantages of alternative design choices, so that we can make intelligent decisions for each situation we encounter. The aim of a good design is to use practical and reliable processes whose outcomes are reasonably predictable.
Basic concepts in sampling • Population (target population): the universe of all units of interest for a certain study • Denoted, with N being the size of the population, U = {1, 2, ..., N}; all units can be identified and labeled • Ex: Political poll – all adults eligible to vote • Ex: Employment/unemployment in Norway – all persons in Norway aged 15 or more • Ex: Consumer expenditure – unit = household • Sample: a subset of the population, to be observed. The sample should be ”representative” of the population
Sampling design: • The sample is a probability sample if all units in the sample have been chosen with known probabilities, and such that each unit in the population has a positive probability of being chosen to the sample • We shall only be concerned with probability sampling • Example: simple random sample (SRS). Let n denote the sample size. Every possible subset of n units has the same chance of being the sample. Then all units in the population have the same probability n/N of being chosen to the sample. • The probability distribution for SRS on all subsets of U is an example of a sampling design, the probability plan p(s) for selecting a sample s from the population: $p(s) = 1/\binom{N}{n}$ if $|s| = n$, and $p(s) = 0$ otherwise
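The SRS design can be checked by brute force on a very small population. The sketch below is only an illustration: the values N = 5 and n = 2 are assumed, not taken from the course. It enumerates every possible sample, gives each the same probability, and confirms that every unit then has inclusion probability n/N.

```r
# Toy illustration with assumed values (not course data)
N <- 5
n <- 2

samples <- combn(N, n)        # each column is one possible sample s
p_s <- 1 / ncol(samples)      # SRS: all samples of size n are equally likely

# inclusion probability of unit i = sum of p(s) over the samples containing i
incl <- sapply(1:N, function(i) {
  sum(apply(samples, 2, function(s) i %in% s)) * p_s
})

print(p_s)    # equals 1 / choose(N, n)
print(incl)   # every entry equals n / N

# in practice one SRS is drawn without replacement
s <- sample(1:N, n)
```

Enumeration is only feasible for tiny populations; for realistic N the single call to `sample()` is what one would use.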
Basic statistical problem: Estimation • A typical survey has many variables of interest • The aim of a sample is to obtain information regarding totals or averages of these variables for the whole population • Example: Unemployment in Norway – we want to estimate the total number t of individuals unemployed. For each person i (at least 15 years old) in Norway: $y_i = 1$ if person i is unemployed, and $y_i = 0$ otherwise, so that $t = \sum_{i=1}^{N} y_i$
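In general, variable of interest: y, with $y_i$ equal to the value of y for unit i in the population, and the total is denoted $t = \sum_{i=1}^{N} y_i$ • The typical problem is to estimate t or t/N • Sometimes it is also of interest to estimate ratios of totals. Example: estimating the rate of unemployment. With $y_i = 1$ if person i is unemployed and $x_i = 1$ if person i is in the labor force (0 otherwise), the unemployment rate is $t_y / t_x = \sum_{i=1}^{N} y_i / \sum_{i=1}^{N} x_i$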
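As a small numerical illustration of a total, a mean and a ratio of totals, the sketch below uses a made-up population of 8 persons. The indicator names `unemployed` and `in_labor_force` are hypothetical, and the unemployment rate is taken here as the number unemployed divided by the number in the labor force.

```r
# Hypothetical toy population of 8 persons (made-up values)
in_labor_force <- c(1, 1, 1, 0, 1, 1, 0, 1)   # 1 if person i is in the labor force
unemployed     <- c(0, 1, 0, 0, 0, 1, 0, 0)   # 1 if person i is unemployed

N   <- length(unemployed)
t_y <- sum(unemployed)        # total number unemployed
t_x <- sum(in_labor_force)    # total number in the labor force

mean_y <- t_y / N             # population proportion unemployed, t/N
rate   <- t_y / t_x           # ratio of two totals: unemployment rate

print(c(t_y = t_y, t_x = t_x, mean_y = mean_y, rate = rate))
```

In a real survey both totals are unknown and must be estimated from the sample, which is what makes a ratio harder to handle than a single total.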
Sampling a finite population [Figure: diagram with the labels U, UF, s and r]
Rules of Association • One-to-One: example? • Many-to-One: example? • One-to-Many: example? • Many-to-Many: example?
Important properties of a frame • The frame must be virtually complete: it must provide observational access to “almost all” objects in the target population. What degree of coverage is sufficient may be a matter of judgment. • The frame must serve to yield a sample of objects, which can be unambiguously identified. • The frame must be such that it is possible to determine how the units in the frame are associated with objects in the population. The statistician must know the exact chance that the sampled object had of being selected.
Desirable properties of a frame • The frame should be simple to use. • The frame should contain “auxiliary information” to be used in the estimation process. • The frame should be reasonably stable in time. Moreover, it should be easy and inexpensive to update the frame.
Sources of error in sample surveys • Coverage errors: target population U vs frame population UF • Access to the population is through a list of units, a register UF. U and UF may not be the same. Three possible errors in UF: • Undercoverage: some units in U are not in UF • Overcoverage: some units in UF are not in U • Duplicate listings: a unit in U is listed more than once in UF • UF is sometimes called the sampling frame
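When both the target population and the frame can be listed, the three coverage errors correspond to simple set operations. The sketch below assumes hypothetical unit labels; real frames are usually matched on identifiers such as person or business numbers.

```r
# Hypothetical unit labels (made up for illustration)
U  <- c("a", "b", "c", "d", "e")          # target population
UF <- c("a", "b", "b", "d", "x", "y")     # frame (register), with one duplicate

undercoverage <- setdiff(U, UF)               # in U but missing from UF
overcoverage  <- setdiff(unique(UF), U)       # listed in UF but not in U
duplicates    <- unique(UF[duplicated(UF)])   # listed more than once in UF

print(undercoverage)
print(overcoverage)
print(duplicates)
```

In practice U is not fully known, so coverage is usually assessed indirectly, for example by comparing the frame with other registers.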
Nonresponse - missing data • Some persons cannot be contacted • Some refuse to participate in the survey • Some may be ill and incapable of responding • In postal surveys: Can be as much as 70% nonresponse • In telephone surveys: 50% nonresponse is not uncommon • Possible consequences: • Bias in the sample, not representative of the population • Estimation becomes more inaccurate • Remedies: • imputation, weighting
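The effect of nonresponse and of one simple form of weighting can be illustrated on made-up data. The sketch below assumes a sample of 10 persons in two hypothetical classes ("young", "old") with very different response rates. It uses a weighting-class adjustment, which is one common weighting method and not necessarily the one treated later in the course.

```r
# Hypothetical sample of 10 persons in two weighting classes (made-up data)
group     <- c(rep("young", 4), rep("old", 6))
responded <- c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)
y         <- c(1, NA, NA, NA, 0, 0, 1, 0, 0, NA)   # observed for respondents only

# naive estimate: mean over respondents, ignoring nonresponse
naive <- mean(y[responded])

# weighting-class adjustment: within each class, a respondent gets weight
# (number sampled in class) / (number responding in class)
w_by_class <- sapply(split(responded, group), function(r) length(r) / sum(r))
w <- unname(w_by_class[group[responded]])

adjusted <- sum(w * y[responded]) / sum(w)

print(c(naive = naive, adjusted = adjusted))
```

The adjustment removes bias only to the extent that respondents and nonrespondents within a class are similar, which is an assumption and cannot be checked from the respondents alone.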
Measurement error – the correct value of $y_i$ is not measured • In interviewer surveys: • Incorrect marking • Interviewer effect: people may say what they think the interviewer wants to hear, e.g. underreporting of alcohol use and tobacco use • Misunderstanding of the question, or not remembering correctly.
Sampling «error» • The error (uncertainty, tolerance) caused by observing a sample instead of the whole population • To assess this error we use the margin of error, which measures sample-to-sample variation • The design approach deals with calculating sampling errors for different sampling designs • One such measure is the 95% confidence interval: if we draw repeated samples, then about 95% of the calculated confidence intervals for a total t will actually include t
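The repeated-sampling interpretation of a 95% confidence interval can be checked by simulation. The sketch below is a rough illustration under assumed conditions: a synthetic population of N = 2000 uniform values, simple random samples of size n = 100, and the usual normal-approximation interval for the total.

```r
set.seed(1)                      # for reproducibility
N <- 2000
y <- runif(N, 0, 100)            # synthetic population values (assumed)
t_true <- sum(y)                 # the population total, known here by construction

n <- 100                         # sample size
B <- 2000                        # number of repeated samples
covered <- logical(B)

for (k in 1:B) {
  s     <- sample(1:N, n)                          # SRS without replacement
  t_hat <- N * mean(y[s])                          # estimate of the total
  se    <- N * sqrt((1 - n / N) * var(y[s]) / n)   # estimated standard error
  covered[k] <- abs(t_hat - t_true) <= 1.96 * se   # does the interval contain t?
}

coverage <- mean(covered)        # share of intervals that include the true total
print(coverage)
```

The observed coverage should be close to, but not exactly, 0.95, since both the simulation and the normal approximation introduce some error.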
The first 3 errors: nonsampling errors • Can be much larger than the sampling error • In this course: • Sampling error • Nonresponse bias • Shall assume that the frame population is identical to the target population • No measurement error
Summary of basic concepts • Population, target population • unit • sample • sampling design • estimation • estimator • measure of bias • measure of variance • confidence interval
survey errors: • register/frame population • measurement error • nonresponse • sampling error
Example – Psychiatric Morbidity Survey 1993 from Great Britain • Aim: Provide information about prevalence of psychiatric problems among adults in GB as well as their associated social disabilities and use of services • Target population: Adults aged 16-64 living in private households • Sample: Through several stages: 18,000 addresses were chosen and 1 adult in each household was chosen • 200 interviewers, each visiting 90 households
Result of the sampling process
Sample of addresses                        18,000
    Vacant premises                           927
    Institutions/business premises            573
    Demolished                                499
    Second home/holiday flat                  236
Private household addresses                15,765
    Extra households found                    669
Total private households                   16,434
    Households with no one aged 16-64       3,704
Eligible households                        12,730
    Nonresponse                             2,622
Sample                                     10,108 households with responding adults aged 16-64
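The counts in this breakdown can be verified step by step. The sketch below simply redoes the arithmetic in R with the figures from the slide. The response rate at the end is derived here and is not stated in the source; the variable names are chosen for illustration.

```r
addresses <- 18000

# addresses that turned out not to be private households
not_private <- c(vacant = 927, institution_business = 573,
                 demolished = 499, second_home = 236)

private_addresses <- addresses - sum(not_private)
total_households  <- private_addresses + 669    # extra households found
eligible          <- total_households - 3704    # drop households with no one aged 16-64
respondents       <- eligible - 2622            # drop nonresponding households

response_rate <- respondents / eligible         # derived, roughly 79%

print(c(private_addresses, total_households, eligible, respondents))
print(round(response_rate, 3))
```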
Why sampling? • Reduces costs for an acceptable level of accuracy (money, manpower, processing time...) • May free up resources to reduce nonsampling error and collect more information from each person in the sample • Ex: 400 interviewers at $5 per interview: lower sampling error; 200 interviewers at $10 per interview: lower nonsampling error • Much quicker results
When is a sample representative? • Balance on gender and age: • proportion of women in sample ≈ proportion in population • proportions of age groups in sample ≈ proportions in population • An ideal representative sample: • A miniature version of the population: • implying that every unit in the sample represents the characteristics of a known number of units in the population • Appropriate probability sampling ensures a representative sample ”on the average”
Alternative approaches for statistical inference based on survey sampling • Design-based: • No modeling, only stochastic element is the sample s with known distribution • Model-based: The values $y_i$ are assumed to be values of random variables $Y_i$: • Two stochastic elements: $Y = (Y_1, …, Y_N)$ and s • Assumes a parametric distribution for Y • Example: suppose we have an auxiliary variable x. Could be: age, gender, education. A typical model is a regression of $Y_i$ on $x_i$.
Statistical principles of inference imply that the model-based approach is the most sound and valid approach • We start with the design-based approach since it is the approach most widely applied in survey sampling, used by national statistical institutes and most research institutes for social sciences. • It is the easy way out: there is no need to model. All statisticians working with survey sampling in practice need to know this approach
Design-based statistical inference • Can also be viewed as a distribution-free nonparametric approach • The only stochastic element: Sample s, distribution p(s) for all subsets s of the population U={1, ..., N} • No explicit statistical modeling is done for the variable y. All yi’s are considered fixed but unknown • Focus on sampling error • Sets the sample survey theory apart from usual statistical analysis • The traditional approach, started by Neyman in 1934
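Estimation theory: simple random sample • SRS of size n: each sample s of size n has $p(s) = 1/\binom{N}{n}$ • Can be performed in principle by drawing one unit at a time at random without replacement • Estimation of the population mean of a variable y: $\mu = \frac{1}{N}\sum_{i=1}^{N} y_i = t/N$ • A natural estimator, the sample mean: $\bar{y}_s = \frac{1}{n}\sum_{i \in s} y_i$ • Desirable properties: unbiasedness, $E(\bar{y}_s) = \mu$, and small sampling variance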
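The uncertainty of an unbiased estimator is measured by its estimated sampling variance or standard error (SE): $SE(\bar{y}_s) = \sqrt{\widehat{Var}(\bar{y}_s)}$ for the sample mean $\bar{y}_s$ • Some results for SRS, with population mean $\mu$, sampling fraction $f = n/N$ and population variance $\sigma^2 = \frac{1}{N-1}\sum_{i=1}^{N}(y_i - \mu)^2$: $E(\bar{y}_s) = \mu$ and $Var(\bar{y}_s) = \frac{\sigma^2}{n}(1 - f)$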
The finite population correction 1 - f, with f = n/N, is usually unimportant in social surveys: • n = 10,000 and N = 5,000,000: 1 - f = 0.998 • n = 1000 and N = 400,000: 1 - f = 0.9975 • n = 1000 and N = 5,000,000: 1 - f = 0.9998 • The effect of changing n is much more important than the effect of changing n/N
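These figures are easy to reproduce. The sketch below assumes the usual SRS result that the standard error of the sample mean is proportional to sqrt((1 - n/N) / n); the helper names `fpc` and `se_factor` are introduced here for illustration only.

```r
# finite population correction 1 - f, with f = n / N
fpc <- function(n, N) 1 - n / N

print(fpc(10000, 5e6))
print(fpc(1000, 4e5))
print(fpc(1000, 5e6))

# the standard error of the sample mean is proportional to this factor
se_factor <- function(n, N) sqrt(fpc(n, N) / n)

# same n, very different N: almost no change
print(se_factor(1000, 4e5) / se_factor(1000, 5e6))

# same N, ten times larger n: the factor shrinks by roughly sqrt(10)
print(se_factor(1000, 5e6) / se_factor(10000, 5e6))
```

The first ratio is very close to 1 while the second is a little above 3, which is the sense in which n matters far more than n/N.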
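The estimated variance of the sample mean $\bar{y}_s$: $\widehat{Var}(\bar{y}_s) = \frac{s^2}{n}(1 - f)$, where $s^2 = \frac{1}{n-1}\sum_{i \in s}(y_i - \bar{y}_s)^2$ is the sample variance and $f = n/N$ • Usually we report the standard error of the estimate: $SE(\bar{y}_s) = \sqrt{\widehat{Var}(\bar{y}_s)}$ • Confidence intervals for $\mu$ are based on the Central Limit Theorem; an approximate 95% interval is $\bar{y}_s \pm 1.96\, SE(\bar{y}_s)$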
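The estimate, its standard error and a normal-approximation confidence interval can be wrapped in a few lines of R. This is a minimal sketch assuming an SRS without replacement; the function name `srs_mean_ci` and the sample values are made up for illustration, and with only 8 observations the Central Limit Theorem approximation would be doubtful in practice.

```r
# Estimate, SE and approximate confidence interval for a population mean
# under SRS without replacement (hypothetical helper, not from the course)
srs_mean_ci <- function(y_s, N, level = 0.95) {
  n     <- length(y_s)
  ybar  <- mean(y_s)                       # sample mean
  v_hat <- (1 - n / N) * var(y_s) / n      # estimated variance, with 1 - f
  se    <- sqrt(v_hat)                     # standard error
  z     <- qnorm(1 - (1 - level) / 2)      # about 1.96 for level = 0.95
  c(estimate = ybar, se = se, lower = ybar - z * se, upper = ybar + z * se)
}

y_s <- c(12, 15, 9, 20, 14, 11, 17, 13)    # made-up sample values
res <- srs_mean_ci(y_s, N = 1000)          # N = 1000 is an assumed population size
print(res)
```

`var()` in R already uses the divisor n - 1, so it matches the usual sample variance.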
Example – Student performance in California schools • Academic Performance Index (API) for all California schools • Based on standardized testing of students • Data from all schools with at least 100 students • Unit in population = school (Elementary/Middle/High) • Full population consists of N = 6194 observations • Concentrate on the variable: y = api00 = API in 2000 • Mean(y) = 664.7 with min(y) = 346 and max(y) = 969 • Data set in R: apipop and y = apipop$api00
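Histogram for sample mean and fitted normal density • y = API scores from 2000. Sample size n = 10, based on 10,000 simulations • R-code:

    library(survey)            # the apipop data set ships with the survey package
    data(api)                  # loads apipop (all 6194 schools)
    y <- apipop$api00          # API score in 2000
    b <- 10000                 # number of simulated samples
    N <- 6194                  # population size
    n <- 10                    # sample size
    ybar <- numeric(b)         # storage for the simulated sample means
    for (k in 1:b) {
      s <- sample(1:N, n)      # SRS without replacement
      ybar[k] <- mean(y[s])
    }
    # histogram of the sample means on the density scale, bin width 5
    hist(ybar, seq(min(ybar) - 5, max(ybar) + 5, 5), prob = TRUE)
    # normal density with the same mean and standard deviation as ybar
    x <- seq(mean(ybar) - 4 * sd(ybar), mean(ybar) + 4 * sd(ybar), 0.05)
    z <- dnorm(x, mean(ybar), sd(ybar))
    lines(x, z)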
[Figure: Histogram and fitted normal density, API scores. Sample size n = 10, based on 10,000 simulations]