
STK 4600: Statistical methods for social sciences.



  1. STK 4600: Statistical methods for social sciences. Survey sampling and statistical demography Surveys for households and individuals by Jan F. Bjørnstad

  2. Survey sampling: 4 major topics • Traditional design-based statistical inference (7 weeks) • Likelihood considerations (1 week) • Model-based statistical inference (3 weeks) • Missing data - nonresponse (2 weeks)

  3. Statistical demography • Mortality • Life expectancy • Population projections • 2 weeks

  4. Course goals • Give students knowledge about: • planning surveys in social sciences • major sampling designs • basic concepts and the most important estimation methods in traditional applied survey sampling • Likelihood principle and its consequences for survey sampling • Use of modeling in sampling • Treatment of nonresponse • A basic knowledge of demography

  5. But first: Basic concepts in sampling • Population (target population): the universe of all units of interest for a certain study • Denoted, with N being the size of the population: U = {1, 2, ..., N}. All units can be identified and labeled • Ex: Political poll – all adults eligible to vote • Ex: Employment/unemployment in Norway – all persons in Norway aged 15 or more • Ex: Consumer expenditure – unit = household • Sample: a subset of the population, to be observed. The sample should be ”representative” of the population

  6. Sampling design: • The sample is a probability sample if all units in the sample have been chosen with certain probabilities, and such that each unit in the population has a positive probability of being chosen for the sample • We shall only be concerned with probability sampling • Example: simple random sample (SRS). Let n denote the sample size. Every possible subset of n units has the same chance of being the sample. Then all units in the population have the same probability n/N of being chosen for the sample. • The probability distribution for SRS on all subsets of U is an example of a sampling design, the probability plan for selecting a sample s from the population: $p(s) = 1/\binom{N}{n}$ if $|s| = n$, and $p(s) = 0$ otherwise
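
The claim that SRS gives every unit inclusion probability n/N can be checked by brute force on a tiny population. The following is a minimal sketch, assuming a hypothetical population of N = 5 units and samples of size n = 2 (these numbers are illustrative and not from the course material); it enumerates every possible sample, assigns the SRS design probabilities, and sums them per unit.

```r
# Hypothetical small population: N = 5 units, sample size n = 2 (illustrative values)
N <- 5
n <- 2

# All possible samples of size n; each column of the matrix is one sample s
samples <- combn(N, n)
n_samples <- ncol(samples)            # equals choose(N, n)

# SRS design: every sample of size n has the same probability p(s)
p_s <- rep(1 / n_samples, n_samples)

# Inclusion probability of unit i: sum of p(s) over all samples that contain i
incl_prob <- sapply(1:N, function(i) {
  contains_i <- apply(samples, 2, function(s) i %in% s)
  sum(p_s[contains_i])
})

print(incl_prob)                      # every entry should equal n / N
```

Enumeration is only feasible for very small N, but it makes the link between the design p(s) and the unit-level probabilities explicit.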

  7. Basic statistical problem: Estimation • A typical survey has many variables of interest • The aim of a sample is to obtain information regarding totals or averages of these variables for the whole population • Example: Unemployment in Norway – want to estimate the total number t of individuals unemployed. For each person i (at least 15 years old) in Norway: $y_i = 1$ if person i is unemployed, $y_i = 0$ otherwise, so that $t = \sum_{i=1}^{N} y_i$

  8. In general, variable of interest: y, with $y_i$ equal to the value of y for unit i in the population, and the total is denoted $t = \sum_{i=1}^{N} y_i$ • The typical problem is to estimate t or t/N • Sometimes it is also of interest to estimate ratios of totals. Example: estimating the rate of unemployment. Unemployment rate: (total number unemployed)/(total number in the labor force)

  9. Sources of error in sample surveys • Target population U vs frame population U_F: access to the population is through a list of units, a register U_F. U and U_F may not be the same. Three possible errors in U_F: • Undercoverage: some units in U are not in U_F • Overcoverage: some units in U_F are not in U • Duplicate listings: a unit in U is listed more than once in U_F • U_F is sometimes called the sampling frame

  10. Nonresponse - missing data • Some persons cannot be contacted • Some refuse to participate in the survey • Some may be ill and incapable of responding • In postal surveys: Can be as much as 70% nonresponse • In telephone surveys: 50% nonresponse is not uncommon • Possible consequences: • Bias in the sample, not representative of the population • Estimation becomes more inaccurate • Remedies: • imputation, weighting

  11. Measurement error – the correct value of y_i is not measured • In interviewer surveys: • Incorrect marking • Interviewer effect: people may say what they think the interviewer wants to hear, e.g. underreporting of alcohol use and tobacco use • Respondents misunderstand the question or do not remember correctly

  12. Sampling «error» • The error (uncertainty, tolerance) caused by observing a sample instead of the whole population • To assess this error (the margin of error), measure the sample-to-sample variation • The design approach deals with calculating sampling errors for different sampling designs • One such measure: the 95% confidence interval. If we draw repeated samples, then about 95% of the calculated confidence intervals for a total t will actually include t

  13. The first 3 errors: nonsampling errors • Can be much larger than the sampling error • In this course: • Sampling error • nonresponse bias • Shall assume that the frame population is identical to the target population • No measurement error

  14. Summary of basic concepts • Population, target population • unit • sample • sampling design • estimation • estimator • measure of bias • measure of variance • confidence interval

  15. Survey errors: • register/frame population • measurement error • nonresponse • sampling error

  16. Example – Psychiatric Morbidity Survey 1993 from Great Britain • Aim: Provide information about the prevalence of psychiatric problems among adults in GB, as well as their associated social disabilities and use of services • Target population: Adults aged 16-64 living in private households • Sample: Through several stages, 18,000 addresses were chosen and 1 adult in each household was chosen • 200 interviewers, each visiting 90 households

  17. Result of the sampling process
      Sample of addresses                          18,000
        Vacant premises                               927
        Institutions/business premises                573
        Demolished                                    499
        Second home/holiday flat                      236
      Private household addresses                  15,765
        Extra households found                        669
      Total private households                     16,434
        Households with no one 16-64                3,704
      Eligible households                          12,730
        Nonresponse                                 2,622
      Sample (households with responding
        adults aged 16-64)                         10,108
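
The bookkeeping on this slide can be retraced step by step. The sketch below uses only the counts given on the slide; the response rate among eligible households at the end is a derived figure that the slide itself does not state.

```r
# Counts taken from the slide
addresses  <- 18000
ineligible <- c(vacant = 927, institutions = 573, demolished = 499, second_home = 236)

private_addresses <- addresses - sum(ineligible)   # addresses that are private households
total_households  <- private_addresses + 669       # add extra households found
eligible          <- total_households - 3704       # drop households with no one aged 16-64
respondents       <- eligible - 2622               # drop nonresponse

# Derived here (not stated on the slide): response rate among eligible households
response_rate <- respondents / eligible

print(c(private_addresses, total_households, eligible, respondents))
print(round(response_rate, 3))
```

Roughly one eligible household in five did not respond, which is the kind of nonresponse the later part of the course deals with.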

  18. Why sampling? • Reduces costs for an acceptable level of accuracy (money, manpower, processing time...) • May free up resources to reduce nonsampling error and collect more information from each person in the sample • Ex: 400 interviewers at $5 per interview gives lower sampling error; 200 interviewers at $10 per interview gives lower nonsampling error • Much quicker results

  19. When is a sample representative? • Balance on gender and age: • proportion of women in sample ≈ proportion in population • proportions of age groups in sample ≈ proportions in population • An ideal representative sample: • A miniature version of the population: • implying that every unit in the sample represents the characteristics of a known number of units in the population • Appropriate probability sampling ensures a representative sample ”on the average”

  20. Alternative approaches for statistical inference based on survey sampling • Design-based: • No modeling, the only stochastic element is the sample s with known distribution • Model-based: The values y_i are assumed to be values of random variables Y_i: • Two stochastic elements: Y = (Y_1, …, Y_N) and s • Assumes a parametric distribution for Y • Example: suppose we have an auxiliary variable x. Could be: age, gender, education. A typical model is a regression of Y_i on x_i.

  21. Statistical principles of inference imply that the model-based approach is the most sound and valid approach • Start with learning the design-based approach since it is the most applied approach to survey sampling used by national statistical institutes and most research institutes for social sciences. • Is the easy way out: Do not need to model. All statisticians working with survey sampling in practice need to know this approach

  22. Design-based statistical inference • Can also be viewed as a distribution-free nonparametric approach • The only stochastic element: sample s, with distribution p(s) for all subsets s of the population U = {1, ..., N} • No explicit statistical modeling is done for the variable y. All y_i's are considered fixed but unknown • Focus on sampling error • Sets the sample survey theory apart from usual statistical analysis • The traditional approach, started by Neyman in 1934

  23. Estimation theory - simple random sample • SRS of size n: each sample s of size n has $p(s) = 1/\binom{N}{n}$ • Can be performed in principle by drawing one unit at a time at random without replacement • Estimation of the population mean of a variable y: $\mu = \frac{1}{N}\sum_{i=1}^{N} y_i$ • A natural estimator, the sample mean: $\bar y_s = \frac{1}{n}\sum_{i \in s} y_i$ • Desirable properties:

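  24. The uncertainty of an unbiased estimator is measured by its estimated sampling variance or standard error (SE) • Some results for SRS: $E(\bar y_s) = \mu$ and $Var(\bar y_s) = \frac{\sigma^2}{n}(1 - f)$, where $f = n/N$ is the sampling fraction and $\sigma^2 = \frac{1}{N-1}\sum_{i=1}^{N}(y_i - \mu)^2$ is the population variance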

  25. The finite population correction 1 - f, with f = n/N, is usually unimportant in social surveys: • n = 10,000 and N = 5,000,000: 1 - f = 0.998 • n = 1000 and N = 400,000: 1 - f = 0.9975 • n = 1000 and N = 5,000,000: 1 - f = 0.9998 • The effect of changing n is much more important than the effect of changing n/N
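
The three correction factors above, and the point that n matters far more than n/N, can be reproduced in a few lines. This is a minimal sketch assuming the standard error of the SRS mean is proportional to sqrt((1 - f)/n); the helper names `fpc` and `rel_se` are hypothetical, and the comparison of n = 1000 with n = 4000 is an added illustration.

```r
# Finite population correction 1 - f, with sampling fraction f = n / N
fpc <- function(n, N) 1 - n / N

n <- c(10000, 1000, 1000)
N <- c(5e6, 4e5, 5e6)
one_minus_f <- fpc(n, N)
print(one_minus_f)

# Relative standard error, up to the common factor sigma (assumed SRS formula)
rel_se <- function(n, N) sqrt(fpc(n, N) / n)

# Illustration (not from the slide): quadrupling n roughly halves the SE,
# while the correction factor itself barely moves
se_ratio <- rel_se(4000, 5e6) / rel_se(1000, 5e6)
print(se_ratio)
```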

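  26. The estimated variance: $\hat V(\bar y_s) = \frac{s^2}{n}(1 - f)$, where $f = n/N$ and $s^2 = \frac{1}{n-1}\sum_{i \in s}(y_i - \bar y_s)^2$ is the sample variance • Usually we report the standard error of the estimate: $SE(\bar y_s) = \sqrt{\hat V(\bar y_s)}$ • Confidence intervals for the population mean $\mu$ are based on the Central Limit Theorem: an approximate 95% CI is $\bar y_s \pm 1.96\, SE(\bar y_s)$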

  27. Example – Student performance in California schools • Academic Performance Index (API) for all California schools • Based on standardized testing of students • Data from all schools with at least 100 students • Unit in population = school (Elementary/Middle/High) • Full population consists of N = 6194 observations • Concentrate on the variable: y = api00 = API in 2000 • Mean(y) = 664.7 with min(y) = 346 and max(y) = 969 • Data set in R: apipop (loaded with data(api) from the survey package) and y = apipop$api00

  28. Histogram of y population with fitted normal density

  29. Histogram for sample mean and fitted normal density. y = API scores from 2000. Sample size n = 10, based on 10000 simulations. R-code:
      > b = 10000
      > N = 6194
      > n = 10
      > ybar = numeric(b)
      > for (k in 1:b) {
      +   s = sample(1:N, n)
      +   ybar[k] = mean(y[s])
      + }
      > hist(ybar, seq(min(ybar) - 5, max(ybar) + 5, 5), prob = TRUE)
      > x = seq(mean(ybar) - 4 * sqrt(var(ybar)), mean(ybar) + 4 * sqrt(var(ybar)), 0.05)
      > z = dnorm(x, mean(ybar), sqrt(var(ybar)))
      > lines(x, z)

  30. Histogram and fitted normal density, API scores. Sample size n = 10, based on 10000 simulations

  31. y = api00 for 6194 California schools. 10000 simulations of SRS. Confidence level of the approximate 95% CI
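
A coverage simulation of this kind can be sketched as follows. To stay self-contained, the code does NOT use the real api00 values: it draws a synthetic normal population of the same size N = 6194 with a similar mean (the mean and the standard deviation of 130 are assumptions made here), so the coverage figures it prints are not the ones from the slide. The helper name `coverage` and the choice of 2000 replications are also assumptions.

```r
set.seed(1)

# SYNTHETIC stand-in population (assumption): not the real api00 values
N  <- 6194
y  <- rnorm(N, mean = 665, sd = 130)
mu <- mean(y)                          # true population mean of the synthetic data

# Share of approximate 95% CIs (ybar +/- 1.96 * SE) that cover mu,
# over b simple random samples of size n
coverage <- function(n, b = 2000) {
  hits <- logical(b)
  for (k in 1:b) {
    s    <- sample(1:N, n)                        # SRS without replacement
    ybar <- mean(y[s])
    se   <- sqrt(var(y[s]) / n * (1 - n / N))     # estimated SE with fpc
    hits[k] <- abs(ybar - mu) <= 1.96 * se
  }
  mean(hits)
}

cov10  <- coverage(10)
cov100 <- coverage(100)
print(c(n10 = cov10, n100 = cov100))
```

With small n the normal-based interval tends to cover somewhat less often than the nominal 95%, and the gap should shrink as n grows; replacing `y` with `apipop$api00` would repeat the exercise on the real population.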

  32. For one sample of size n = 100. R-code:
      > s = sample(1:6194, 100)
      > ybar = mean(y[s])
      > se = sqrt(var(y[s]) * (6194 - 100) / (6194 * 100))
      > ybar
      [1] 654.47
      > var(y[s])
      [1] 16179.28
      > se
      [1] 12.61668

  33. The absolute value of the sampling error is not informative when it is not related to the value of the estimate • For example, SE = 2 is small if the estimate is 1000, but very large if the estimate is 3 • The coefficient of variation for the estimate: $CV(\bar y_s) = SE(\bar y_s)/\bar y_s$ • A measure of the relative variability of an estimate • It does not depend on the unit of measurement • More stable over repeated surveys, can be used for planning, for example determining sample size • More meaningful when estimating proportions
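
As a worked instance, the coefficient of variation can be computed from the figures reported for the single sample of size 100 (estimate 654.47, SE 12.61668). The sketch assumes CV = SE / estimate and the usual approximate 95% interval of estimate ± 1.96 SE; the interval is an added illustration.

```r
# Values reported for the single sample of size n = 100
ybar <- 654.47
se   <- 12.61668

# Coefficient of variation: standard error relative to the estimate (assumed definition)
cv <- se / ybar
print(round(100 * cv, 2))     # CV in percent

# Approximate 95% confidence interval for the population mean
ci <- ybar + c(-1, 1) * 1.96 * se
print(round(ci, 1))
```

The CV comes out just under 2%, and the interval contains the known population mean of 664.7.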

  34. Estimation of a population proportion p with a certain characteristic A • p = (number of units in the population with A)/N • Let y_i = 1 if unit i has characteristic A, 0 otherwise. Then p is the population mean of the y_i's • Let X be the number of units in the sample with characteristic A. Then the sample mean can be expressed as $\hat p = \bar y_s = X/n$

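  35. So the unbiased estimate of the variance of the estimator: $\hat V(\hat p) = \frac{\hat p(1 - \hat p)}{n - 1}(1 - f)$, with $f = n/N$, since for 0/1 values the sample variance is $s^2 = \frac{n}{n-1}\hat p(1 - \hat p)$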

  36. Examples • A political poll: Suppose we have a random sample of 1000 eligible voters in Norway with 280 saying they will vote for the Labor party. Then the estimated proportion of Labor votes in Norway is given by: $\hat p = 280/1000 = 0.28$ • A confidence interval requires a normal approximation. Can use the guideline from the binomial distribution, when N-n is large:

  37. In this example: n = 1000 and N = 4,000,000 • Ex: Psychiatric Morbidity Survey 1993 from Great Britain: p = proportion with psychiatric problems, n = 9792 (partial nonresponse on this question: 316), N ≈ 40,000,000
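
The poll example can be worked through numerically. This is a minimal sketch assuming the SRS variance estimator p̂(1 - p̂)/(n - 1) · (1 - f) with f = n/N and a normal-approximation 95% interval; the variable names are illustrative.

```r
# Political poll: 280 of n = 1000 sampled voters, population N = 4,000,000
n <- 1000
N <- 4e6
x <- 280

p_hat <- x / n                                   # estimated proportion
f     <- n / N                                   # sampling fraction

# Assumed SRS variance estimator for a proportion
var_hat <- p_hat * (1 - p_hat) / (n - 1) * (1 - f)
se      <- sqrt(var_hat)

# Approximate 95% confidence interval
ci <- p_hat + c(-1, 1) * 1.96 * se

print(p_hat)
print(round(se, 4))
print(round(ci, 3))
```

The standard error is about 1.4 percentage points, so the margin of error is just under 3 points; the correction 1 - f is so close to 1 that dropping it leaves the result unchanged at this precision.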

  38. General probability sampling • Sampling design: p(s), the known probability of selection for each subset s of the population U • Actually: the sampling design is the probability distribution p(.) over all subsets of U • Typically, for most s: p(s) = 0. In SRS of size n, all s with size different from n have p(s) = 0 • The inclusion probability: $\pi_i = P(i \in s) = \sum_{s:\, i \in s} p(s)$

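  39. Illustration • U = {1,2,3,4} • Sample of size 2; 6 possible samples • Sampling design: p({1,2}) = 1/2, p({2,3}) = 1/4, p({3,4}) = 1/8, p({1,4}) = 1/8, and p(s) = 0 for the remaining two samples • The inclusion probabilities, summing p(s) over the samples containing each unit: π_1 = 1/2 + 1/8 = 5/8, π_2 = 1/2 + 1/4 = 3/4, π_3 = 1/4 + 1/8 = 3/8, π_4 = 1/8 + 1/8 = 1/4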
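
The inclusion probabilities for this design can be computed mechanically by summing p(s) over the samples that contain each unit. A minimal sketch, using only the design given on the slide (the variable names are illustrative):

```r
# Samples with positive probability under the design, and their probabilities
samples <- list(c(1, 2), c(2, 3), c(3, 4), c(1, 4))
p_s     <- c(1/2, 1/4, 1/8, 1/8)
stopifnot(sum(p_s) == 1)              # a design must sum to one

# Inclusion probability of unit i: sum of p(s) over samples containing i
incl_prob <- sapply(1:4, function(i) {
  contains_i <- sapply(samples, function(s) i %in% s)
  sum(p_s[contains_i])
})

print(incl_prob)                      # 0.625 0.750 0.375 0.250
print(sum(incl_prob))                 # 2, the fixed sample size
```

The inclusion probabilities add up to the sample size, which holds for any design with fixed sample size n.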

  40. Some results

  41. Estimation theory: probability sampling in general • Problem: Estimate a population quantity for the variable y • For the sake of illustration: the population total $t = \sum_{i=1}^{N} y_i$

  42. CV is a useful measure of uncertainty, especially when standard error increases as the estimate increases Because, typically we have that

  43. Some peculiarities in the estimation theory Example: N=3, n=2, simple random sample

  44. For this set of values of the y_i's:

  45. Let y be the population vector of the y-values. This example shows that the usual estimator is not uniformly best (minimum variance for all y) among linear design-unbiased estimators • The example shows that the ”usual” basic estimators do not have the same properties in design-based survey sampling as they do in ordinary statistical models • In fact, we have the following much stronger result: Theorem: Let p(.) be any sampling design. Assume each y_i can take at least two values. Then there exists no uniformly best design-unbiased estimator of the total t

  46. Proof: This implies that a uniformly best unbiased estimator must have variance equal to 0 for all values of y, which is impossible

  47. Determining sample size • The sample size has a decisive effect on the cost of the survey • How large n should be depends on the purpose of the survey • In a poll for determining voting preference, n = 1000 is typically enough • In the quarterly labor force survey in Norway, n = 24000 • Mainly three factors to consider: (1) Desired accuracy of the estimates for many variables. Focus on one or two variables of primary interest (2) Homogeneity of the population. Smaller samples are needed if there is little variation in the population (3) Estimation for subgroups (domains) of the population

  48. It is often factor 3 that puts the highest demand on the survey • If we want to estimate totals for domains of the population we should take a stratified sample • A sample from each domain • A stratified random sample: From each domain a simple random sample

  49. Assume the problem is to estimate a population proportion p for a certain stratum, and we use the sample proportion from the stratum to estimate p • Let n be the sample size of this stratum, and assume that n/N is negligible • Desired accuracy for this stratum: the 95% CI for p should be $\hat p \pm 0.05$ • The accuracy requirement: $1.96\sqrt{\hat p(1 - \hat p)/n} \le 0.05$, i.e. $n \ge 1.96^2\, \hat p(1 - \hat p)/0.05^2 \approx 1536\, \hat p(1 - \hat p)$

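  50. The estimate is unknown in the planning phase • Use the conservative size n = 384 (the largest value, attained at p = 0.5) or a planning value p0 with n = 1536 p0(1 - p0) • E.g., with p0 = 0.2: n = 246 • In general, with accuracy requirement d (95% CI $\hat p \pm d$): $n = 1.96^2\, p_0(1 - p_0)/d^2$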
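
The sample sizes quoted here follow from requiring a 95% margin of error of at most d when n/N is negligible. A minimal sketch, assuming the formula n = 1.96² p0(1 - p0)/d² with d = 0.05 (the value consistent with the figures 384 and 1536); the function name `sample_size` is hypothetical.

```r
# Required sample size for a 95% CI of half-width d around a proportion,
# with planning value p0 and negligible sampling fraction n/N (assumed formula)
sample_size <- function(d, p0 = 0.5, z = 1.96) {
  z^2 * p0 * (1 - p0) / d^2
}

n_conservative <- sample_size(0.05)             # worst case p0 = 0.5
n_planning     <- sample_size(0.05, p0 = 0.2)   # planning value p0 = 0.2

print(round(n_conservative))   # 384
print(round(n_planning))       # 246

# Halving the margin of error roughly quadruples the required sample size
print(round(sample_size(0.025)))
```

The slide's numbers are rounded to the nearest integer; rounding up with `ceiling()` would be the cautious choice if the requirement must be met strictly.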
