GEOGRAPHY and WEIGHTS IN THE NLS By Randall Olsen
The Plan for this Module • Basics of geographic data • Geography and sampling • Level of detail available in the NLS and GIS data • How you get geographic data • Weighting • Correct standard errors and geo-variables • Using GIS and family-level data to enhance your analysis plan
Basics of Geographic Data • Four Census Regions – this is the finest level of geography available for the original cohorts without going to a Census Research Data Center. • States and Counties • Census Tracts – size varies, thousands of persons • Block Groups – neighborhoods (almost)
Counties • Finest level routinely available in NLSY79 and NLSY97 • About 3,100 in the U.S. – Texas has the most (254), Delaware the fewest (3), Georgia has 159; FIPS codes designate them • Extensive socio-economic & demographic data are available at the county level (formerly distributed as the City-County Data Book) – you merge them in using FIPS codes
Census Tracts • In 2000 the entire U.S. was partitioned into tracts • Size and population vary, but a tract can contain several thousand people in urban areas or a few hundred in rural areas • Using these data requires clearance from BLS
Block Group • In urban areas, block groups consist of groups of blocks (did you guess?) • This is the finest level of Census geography for which detailed data are published (its blocks are the building blocks for reapportionment) • What non-rural folk think of as a neighborhood, except for those near the boundary of the block group
Sampling - Original Cohorts • Original Cohorts were drawn from an experimental CPS sample frame in the 1960s; Title 13 confidentiality restrictions prohibit release of geographic data below the Census Region • We recently geocoded these data (latitude and longitude) and one may use them at a Census Research Data Center • The exact sampling structure was kept secret a la Raiders of the Lost Ark – details may exist in a musty file in Suitland, MD, although the latitude and longitude data allow one to reverse-engineer the sampling
Sampling for the NLSYs – Multiple Stages • U.S. divided into Primary Sampling Units (PSUs) – major metropolitan areas, counties, or groups of counties (rural areas) • Selection probability proportional to the population of interest; large cities always chosen • Dividing PSUs into groups can ensure the correct fraction of rural & suburban areas is chosen – this can reduce the sampling variance relative to a simple random sample (SRS)
Next Stages • Select tracts or block groups; list them in order by income or ethnic composition and pick every nth one – this ensures an even distribution over the ordered characteristics and, again, can reduce sampling variance relative to an SRS. Segments of streets are then selected within block groups. • List all addresses in selected segments; randomly select units for a screening interview to identify eligible persons • This process generates area “clusters” of nearly contiguous respondents
Examples of PSU Clustering • NLSY97: 100 PSUs in cross sectional sample. 100 PSUs in minority oversample. • NLSY79: 102 PSUs in cross sectional sample. 100 PSUs in oversample. 38 PSUs in Military oversample. • We average about 50 respondents per PSU; the effect of clustering on statistical properties increases with the size of the cluster and degree to which variables are correlated within cluster. • Correlations within clusters increase the sampling variance and usually overcome the advantages of stratification.
PSUs in the NLSY79 initial screening (done in 1978) over-yielded; PSUs in the NLSY97 initial screening (done in 1997, screen and go) under-yielded.
Geographic Detail in NLS • States and counties available in Geocode release • Zipcode data kept at BLS and CHRR • Census tracts and block group identifiers at CHRR and BLS • Latitude and longitude (accurate to about 50 feet) at CHRR
Geocode – How you get it • You need to apply to BLS (see Web site) • Describe how you plan to use the data • If BLS approves you, CHRR sends you a CD • You need to return the CD when finished and you are subject to audit and legal liabilities if you violate terms of agreement with BLS. BLS performs many audits – keep yourself in compliance.
Geocode – How you use it • You use the state and county codes to merge in the data you need • Use standard FIPS codes • There is a variable indicating when R is in a central city (derived from zipcodes; before 1998, missing values flag zips that are not unambiguously central/non-central) • The data merge is a do-it-yourself project – a sketch follows
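A minimal sketch of such a merge in SAS, assuming both your respondent extract and your county data file carry state and county FIPS codes (all dataset and variable names here are illustrative, not NLS deliverables):

proc sort data=respondents; by fips_state fips_county; run;
proc sort data=county_facts; by fips_state fips_county; run;

data merged;
   merge respondents(in=r) county_facts;   /* county_facts holds county-level characteristics */
   by fips_state fips_county;              /* FIPS codes uniquely identify counties */
   if r;                                   /* keep only respondent records */
run;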
Zipcode Data • CD is at BLS or CHRR and is not released • The CD has Zipcodes, but matching and merging in the data you need is a do-it-yourself project • You can have CHRR create a variable you need, with BLS approval • Zipcode centroid can be used as rough location of respondent for simple distance calculations
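For example, a rough distance from the respondent's zipcode centroid to a point of interest can be computed with the GEODIST function (available in SAS 9.2 and later); the dataset and variable names below are illustrative:

data distances;
   set matched;   /* assumed to hold both pairs of coordinates */
   /* great-circle distance in miles ('M'); use 'K' for kilometers */
   miles = geodist(r_lat, r_lon, poi_lat, poi_lon, 'M');
run;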
Fine Level Location • Modern Geographic Information Systems data use latitude and longitude as the basis for linking data • We geocode respondent addresses with latitude and longitude; sometimes with GPS units (all years except 1980) • We place R within about 50 feet • Opportunities to extend analysis abound
Distance from R to: • Fast food restaurants • Employers • Doctors’ offices • Hospitals • Freeways • Schools, public & private • Post offices • Banks • Bus stops • Train stations • State licensed day care centers • Drug seizures & prices • Air quality measures • Toxic waste sites
Data at Tract and Block Group Level • Based on the Decennial Census long form or the American Community Survey (recent years) • Racial and ethnic composition of the area • Average income, poverty rate, dispersion in income, housing attributes • Population density, education, employment rates
Other Sensitive Data for Analysis • CHRR maintains the names of employers for each respondent in each round • With BLS approval we can identify persons working for a particular sort of employer or match in employer characteristics • The guiding principle is that these specialized extracts must not give you the ability to re-identify the respondent
Ideas using detailed geography • Does proximity to fast-food restaurants now and in the past correlate with BMI? • Does current and past air quality have a relationship to the incidence of asthma? • Does proximity to health care correlate with health outcomes? • Is local income inequality related to health?
Respondent location is generally chosen by the respondent; this problem of “endogenous” location may be attenuated or “solved” by using locational attributes measured at screening or at age 15 – locations that reflect primarily parental choice, not respondent choice. These past locational attributes can be used either as regressors or as instrumental variables (IV). An IV is a variable that “stands in” for a regressor that is correlated with the error term.
Some respondent choices may be endogenous to an outcome, such as smoking and the birth weight of one’s infant. One could use the incidence of smoking by one’s peers in the original PSU (or by one’s siblings) as an instrumental variable: peer smoking reflects shared socio-economic forces, but the weight of R’s baby is unlikely to affect the smoking behavior of R’s peers. One needs to avoid weak instruments, that is, instruments that explain little of the variation in the variable they stand in for.
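A minimal two-stage least squares sketch using SAS PROC SYSLIN, with hypothetical dataset and variable names (note the exogenous controls appear in both the model and the instrument list):

proc syslin data=moms 2sls;
   endogenous smokes;                     /* maternal smoking, treated as endogenous */
   instruments peer_smoke_rate age hgc;   /* peer smoking in the original PSU plus exogenous controls */
   model birth_wt = smokes age hgc;       /* outcome: infant birth weight */
run;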
Using Fine-level Geography • Make application to BLS • CHRR can often create the variable for you if it does not threaten re-identification • Rounding data reduces precision and reduces threat of re-identification of tract, block group or zipcode • Do the analysis at BLS or CHRR
DIFFUSION OF THE SAMPLES: PSU clusters in the original NLSY97 sample. But this clustering has broken down over time; here is where people live as of Round 6 in the NLSY97.
PSU clusters in the original NLSY79 sample. By Round 20 in the NLSY79 sample there is even more geographic dispersion.
Example of Segment Clustering • In the NLSY97, one cluster of respondents was picked from the Lower East Side of Manhattan and another from around Yankee Stadium.
Implications of Sample Design for Routine Data Use • All NLS samples contain oversamples of Blacks, and the NLSYs also oversample Hispanics; the NLSY79 oversamples of poor whites and military members have been discontinued. • The NLSY looks different from a simple random sample • Clusters of R’s may share unobservable characteristics
Weighting • Weight summary statistics to describe the population • For regressions, think Gauss-Markov: OLS is BLUE under standard conditions, including correct specification of the model • Heterogeneity in the model calls not for weighted regression but for appropriately weighting the group-specific regression coefficients
Using Weights • Weights for the NLSY97 Round 7 range from a high of 1,785,202 to a low of 90,060 – two implied decimal places • One respondent represents from 900 to 17,852 people, average is about 2,500 • Zero weights indicate person not interviewed • NLSY97 and NLSY79 have single round weights representing population in 1997 and 1978 – not immigrants since screening • NLSY97 has weights for cross section (no oversamples) as well as “panel” weights
A NLSY79 Example From 1994 • Blacks and Hispanics on average have lower wages than whites (see WeightingWageData.Sas). • Unweighted • Mean Wage $12.50 per hour • Median Wage $10.15 per hour • Weighted with 1994 sample weight (R50804.00) to correct for oversampling • Mean Wage $13.60 per hour • Median Wage $11.10 per hour • Weighting increases average wage by roughly $1.00 per hour
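A sketch of that comparison in SAS, assuming the 1994 weight (R50804.00) has been renamed sampwt94 and the wage variable hrwage (names are illustrative; the actual WeightingWageData.Sas may differ):

proc means data=wagedata mean median;
   var hrwage;                /* unweighted statistics */
run;

proc means data=wagedata mean median;
   weight sampwt94;           /* 1994 sample weight corrects for oversampling */
   var hrwage;
run;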
How Do I Weight Multiple Years? NLS has a custom weighting program that provides users with the ability to go beyond weighting just a single round • Web Version: http://www.nlsinfo.org/web-investigator. Allows you to weight a set of survey rounds. • PC-SAS Version: Allows you to use the code that runs the web version on your own PC. Enables you to weight any set of respondent ids. This allows you to take into account event history data and item non-response. This is a powerful tool.
PC-SAS Custom Weight Program • Contact NLS User Services. They will send you a pair of PC-SAS programs, a set of data files and an input file. Jay Zagorsky at CHRR will help you. • You must be comfortable making minor modifications to SAS programs and must have SAS installed on your computer. • Program takes as input a sorted list of ids, one id per line. Program produces same output as web version • This program allows you to weight data from an event history or other complex designs
Clustering & Standard Errors • NLS has numerous clusters of respondents who are alike: the same person in different rounds, siblings, people in the same neighborhood • Clustering means observations are not independent (not i.i.d.) – heterogeneity across persons and families plus spatial correlation • PSU clustering is more of a problem for variances than family clustering. The design effect is d.e. = 1 + ρ(k − 1), where ρ is the intra-cluster correlation and k is the cluster size; adjust standard errors by its square root. Large k produces problems – PSU clusters are larger than families. But the same person in different rounds means a large ρ.
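As an illustrative calculation (the numbers are invented for the example, not NLS estimates): with ρ = 0.05 and k = 50,

d.e. = 1 + ρ(k − 1) = 1 + 0.05 × 49 = 3.45, and sqrt(3.45) ≈ 1.86,

so correct standard errors are about 1.86 times the naive i.i.d. standard errors.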
Clustering & Standard Errors (cont.) • If intra-cluster correlations are high, the number of effective observations approaches the number of clusters, not the number of observations • OLS is still consistent and unbiased, but you must use GLS or cluster-robust standard errors for correct inference • Design effects in regressions are perhaps better described as misspecification effects, since the intra-cluster correlation is due to unobserved variables affecting the cluster
Household Clustering • NLSY97: 4,027 respondents came from homes that had multiple respondents. • There were six homes that each provided five respondents. • NLSY79: 5,914 respondents came from homes that had multiple respondents. • There were four homes that each provided six respondents. • Data on siblings allow us to separate the effects of household versus individual characteristics • For the original cohorts, refer to the multiple-respondent file to detect parents and children across cohorts and siblings both within and across cohorts
Effect of Clustering on Std Errors • Use the NLSY79 to explain the log of male hourly wages • Regress log hourly wage on race, age, education, AFQT score and marital status. Details are in WageData.sas. The error components are y_ijkt = x_ijkt'β + u_i + v_ij + w_ijk + z_ijkt, where u_i is the error component for PSU i, v_ij the component for family j in PSU i, w_ijk the component for person k in family j within PSU i, and z_ijkt is idiosyncratic.
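In SAS, the OLS specification is simply the following (using the variable names that appear in the Sudaan call later in this module; the actual WageData.sas may differ):

proc reg data=wagedata;
   /* log hourly wage on race, age, education, AFQT score, marital status */
   model Ln_Pay = Black Hispanic Age HGC AFQT Marry;
run;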
OLS Results From SAS • Results using OLS with SAS. Note high T-values.
How To Fix the Problem • There are at least two statistical packages designed to handle the clustering problem. • Sudaan (www.rti.org/sudaan) is a special-purpose package designed to fix clustering issues; it integrates with SAS. • Stata (www.stata.com) is a general-purpose statistical program. To adjust clustered means, use the svyset and svy commands; for regression, use the robust cluster() option (Huber-White). • No clustering data are available for the Original Cohorts
OLS Results From Sudaan • Here we correct for the survey’s clustering on PSU (not on person or family).
What Happened? • Adjusting for clustering using Sudaan cut most of the T-values roughly in half; most are still highly significant. • The Hispanic variable, which was highly significant in the SAS results (Pr < 0.0001), is no longer statistically significant at the most commonly used levels (Pr < 0.15). (The problem is more severe for characteristics that are themselves clustered.)
What Steps Are Needed To Adjust? • First, get geocode clearance. You need this clearance to access the replicate and PSU data. • Second, extract all variables for your research plus the replicate and PSU values. • NLSY79: The PSU variable is R02191.45, titled “Stratum Number For Primary Sampling Units,” and the replicate variable is R02191.46, titled “Within Stratum Replicate Of Primary Sampling Unit.” Compute PSU = 10*R02191.45 + R02191.46 (see the sketch below). • NLSY97: The PSU variable is R13082.00, titled “PRIMARY SAMPLING UNIT (CODED).” The replicate variable is not released; set replicate = 1 in your work.
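A sketch of the NLSY79 PSU construction in SAS, assuming R02191.45 and R02191.46 have been renamed stratum and replicat:

data nlsy79;
   set nlsy79;
   psu = 10*stratum + replicat;   /* PSU = 10*R02191.45 + R02191.46 */
run;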
What Steps Are Needed To Adjust? • Third, sort your data set by replicate and PSU. • Fourth, run Sudaan. We used the following command:

proc regress data="C:\Documents and Settings\All Users\Desktop\ClusteringandWeighting\WageData.dbs"
             filetype=ascii design=wr DEFT1 est_no=24000;  /* with-replacement design */
   weight _ONE_;                   /* equal weight for every observation */
   nest REPLICAT PSU / MISSUNIT;   /* clustering structure: replicate strata, then PSUs */
   model Ln_Pay = Black Hispanic Age HGC AFQT Marry;
Small Extension • The SAS file we used to create the previous example is called WageData.sas. • What happens when we add one more explanatory variable, height in inches? • Adding this variable investigates whether taller people earn higher wages. • The created variable “height” is already part of the SAS data set.
Extension Results • Using SAS, the OLS regression results show height’s coefficient is 0.004 with a T-value of 3.34. • In simple language, each extra inch of height is associated with a 0.4% increase in hourly wages, and the T-value of 3.34 makes the coefficient significant at the 0.1% level, suggesting height and wages are strongly related. • Using Sudaan to take clustering into account lowers the T-value to 2.0, so the coefficient is significant only at the 5% level. Hence, adjusting for clustering means we no longer have near-certainty about the relationship.
What If You Do Not Have Sudaan (or Stata)? • One way to get roughly similar results is to add extra geographic variables that track each PSU’s characteristics to the regression. • Using just SAS, we reran the wage function and included, for each respondent’s 1979 location: percent black, percent Hispanic, median income, whether the respondent resided in an SMSA of 2+ million people, and dummies for U.S. regions (see the file named WageDataPlusGeoVariables.sas); a sketch follows. • Note we get results much like Sudaan’s using only location characteristics that are 20 years old.
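A sketch of that geo-augmented regression in SAS (the names of the 1979 location characteristics are illustrative, not the actual contents of WageDataPlusGeoVariables.sas):

proc reg data=wagedata;
   model Ln_Pay = Black Hispanic Age HGC AFQT Marry
                  pct_black79 pct_hisp79 med_income79 big_smsa79
                  region2 region3 region4;   /* one region dummy omitted as the base */
run;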