SDA: a tool for teaching and research with microdata

Laine Ruus <laine.ruus@utoronto.ca>, University of Toronto, Data Library Service. 2008-12-03, revised 2009-04-14. http://www.chass.utoronto.ca/misc/mun09/sda_intro.ppt

What this session covers:
• Introduction
• Demo of main SDA capabilities
• Some tips and tricks
• Advantages and disadvantages for teaching and research
• Common questions about SDA

SDA@UT is brought to you by:
• University of California, Berkeley. Computer-assisted Survey Methods Program (CSM) – writes and supports the server-side software
• University of Toronto. Centre for Computing in the Humanities and Social Sciences (CHASS) – provides the hardware, buys the software, and provides system support wetware
• University of Toronto. Libraries – provides the budget to purchase the data, and care, feeding and user support wetware
• And Memorial University Libraries, which subscribes to the service.

Our experience with SDA:
• CHASS installed SDA in the fall of 2004
• At last count, we have 900+ data files in SDA
• Some have only the metadata that was generated from the original syntax files (SAS/SPSS/Stata), but a number also have full question text
• Most are microdata, but a few are aggregate statistics (census files)
• A number of voracious data users now expect to find the latest microdata released by Statistics Canada in SDA

Review of main SDA utilities:
• Frequencies, weighted & unweighted
• Crosstabulations
• Comparison of means (ANOVA)
• Correlations
• Regressions
• Logit/probit regressions
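To make the weighted versus unweighted distinction concrete, here is a minimal Python sketch. It is not SDA code; the records, the weights, and the function name frequency_table are all invented for illustration. Each record carries a survey weight, and the same table is computed with and without it.

```python
from collections import defaultdict

def frequency_table(records, weighted=False):
    """Return percentages per category, using survey weights only if asked."""
    totals = defaultdict(float)
    for category, weight in records:
        # Unweighted: every case counts once. Weighted: a case counts as
        # many population units as its weight says it represents.
        totals[category] += weight if weighted else 1.0
    grand_total = sum(totals.values())
    return {c: 100.0 * t / grand_total for c, t in totals.items()}

# Hypothetical (answer, weight) pairs.
sample = [("yes", 150.0), ("yes", 50.0), ("no", 300.0), ("no", 500.0)]

unweighted = frequency_table(sample)
weighted = frequency_table(sample, weighted=True)
print(unweighted)
print(weighted)
```

In this made-up sample the two answers are tied when cases are simply counted, but the weighted distribution is quite different, which is why the choice matters.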

Tips & tricks:
• Have we not gotten around to coding the missing values?
• Want to include missing values in your cross-tabulation, or other analysis?
• Collapsing continuous variables into uniform categories on the fly
• Recoding variables on the fly

Problem: in this variable, we have not yet coded value ‘5’ as missing data. Therefore it would be included in analyses.

Solution: specify, after the variable name, only those values you want to include
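The effect of listing only the wanted values can be sketched in plain Python. This illustrates the idea rather than SDA's own syntax; the response data and the set of valid values are hypothetical.

```python
from collections import Counter

# Hypothetical responses; code 5 is a "not stated" code that has not
# yet been flagged as missing in the metadata.
responses = [1, 2, 5, 3, 1, 5, 4, 2, 2]

# Keep only the values that should enter the analysis.
valid_values = {1, 2, 3, 4}
included = [r for r in responses if r in valid_values]

frequencies = Counter(included)
print(sorted(frequencies.items()))
```

The two cases coded 5 drop out, so the percentages are computed on the remaining seven cases only.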

Problem: how to include values coded as missing in descriptive statistics or analyses. Callout: This is a missing value. It will not be included in descriptive statistics or analyses.

Solution 1: specify, after the variable name, the lowest value thru **.

Solution 2: use ‘include missing data values’ under Table options

Solution 3: list the values explicitly after the variable name
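All three solutions have the same result: cases carrying a flagged missing code are counted instead of dropped. A rough Python sketch of that behaviour follows. The function frequencies, its include_missing switch, and the data are assumptions made for this illustration and are not part of SDA.

```python
from collections import Counter

def frequencies(values, missing_codes, include_missing=False):
    """Count each value, skipping flagged missing codes unless asked to keep them."""
    if include_missing:
        kept = list(values)
    else:
        kept = [v for v in values if v not in missing_codes]
    return Counter(kept)

# Hypothetical data in which code 9 is flagged as missing.
responses = [1, 2, 9, 1, 3, 9, 2]

default_table = frequencies(responses, missing_codes={9})
full_table = frequencies(responses, missing_codes={9}, include_missing=True)
print(sorted(default_table.items()))
print(sorted(full_table.items()))
```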

Problem: to generate frequencies or a cross-tabulation of a continuous variable

Solution 1: collapse to uniform categories, defining a starting point.
c:30000,-30000 means:
- collapse to uniform categories
- each category should be 30000 in size
- begin with value -30000
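The arithmetic behind uniform categories can be sketched in Python. The function name uniform_category is hypothetical, and treating each category as including its lower bound but not its upper bound is an assumption of this sketch, not a statement about how SDA labels its categories.

```python
def uniform_category(value, width, start):
    """Return (lower, upper) for the fixed-width category holding value.

    Assumption: categories are half-open, lower <= value < upper.
    """
    index = (value - start) // width   # how many whole widths above the start
    lower = start + index * width
    return lower, lower + width

# Width 30000, starting at -30000, as in the slide's example.
for income in (-1, 0, 45000):
    print(income, uniform_category(income, 30000, -30000))
```

With these settings an income of -1 falls in the first category, 0 starts the second, and 45000 lands in the category that begins at 30000.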

Solution 2: recode to desired categories. Note use of * to denote both lowest and highest values.
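Recoding into hand-picked ranges with open ends can be illustrated in Python as follows. This is a sketch, not SDA syntax: None plays the role that * plays on the slide, and the rule list, cut points, and function name recode are invented for the example.

```python
def recode(value, rules):
    """Return the new code for value from a list of (low, high, new_code) rules.

    None as low or high means "no bound on this side".
    """
    for low, high, new_code in rules:
        if (low is None or value >= low) and (high is None or value <= high):
            return new_code
    return None  # no rule matched

# Hypothetical income groups: below zero, two middle bands, and 60000 or more.
income_rules = [
    (None, -1, 1),
    (0, 29999, 2),
    (30000, 59999, 3),
    (60000, None, 4),
]

for income in (-5000, 0, 45000, 1000000):
    print(income, recode(income, income_rules))
```

The open-ended first and last rules mean the analyst does not need to know the actual minimum and maximum of the variable.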

Tips & tricks (cont’d):
• Computing percentages in aggregate data
• Dummy coding variables in regressions
• Defining an interaction on the fly

Problem: given a file of aggregate statistics, list percentages rather than counts. [NB: use the Listcase program.] Callout: These are all counts.

Solution: define percentages in the Listcase program. The example defines a percentage with v4 in the numerator and v2 in the denominator.
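The computation itself is simple and can be sketched in Python. The variable names v2 and v4 come from the slide; the area labels, the counts, and the percentage helper are hypothetical, and this is not Listcase syntax.

```python
def percentage(numerator, denominator):
    """Return numerator as a percentage of denominator, or None if undefined."""
    if denominator == 0:
        return None
    return 100.0 * numerator / denominator

# Hypothetical aggregate rows: one row per area, every variable a count.
rows = [
    {"area": "A", "v2": 2000, "v4": 500},
    {"area": "B", "v2": 800, "v4": 100},
]

# v4 in the numerator, v2 in the denominator.
listing = [(row["area"], percentage(row["v4"], row["v2"])) for row in rows]
print(listing)
```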

Problem: to use a categorical variable in a regression analysis, it needs to be ‘dummy’-coded (ie ‘1’ and ‘0’).

Solution: dummy-code categorical variables ‘on-the-fly’. Interactions can also be coded on-the-fly, including interactions with dummy-coded variables.
Callouts from the example:
- Dummy coded: values 10-14 will be coded to ‘1’, all others will be ‘0’.
- Interaction involving a dummy-coded variable and a continuous variable.
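What the dummy coding and the interaction term amount to can be sketched in Python. The 10-14 range is taken from the slide; the field names category and years, the case values, and the function name dummy are made up for illustration, and this is not SDA syntax.

```python
def dummy(value, low=10, high=14):
    """Return 1 if value lies in the range low..high (inclusive), else 0."""
    return 1 if low <= value <= high else 0

# Hypothetical cases: a categorical code and a continuous variable.
cases = [
    {"category": 12, "years": 8.0},
    {"category": 3, "years": 15.0},
    {"category": 14, "years": 11.5},
]

for case in cases:
    case["d"] = dummy(case["category"])
    # Interaction term: the dummy multiplied by the continuous variable,
    # so it is zero for every case outside the 10-14 group.
    case["d_x_years"] = case["d"] * case["years"]

print([(c["d"], c["d_x_years"]) for c in cases])
```

Both new columns would then be entered as predictors alongside the continuous variable itself.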

Advantages for teaching:
• Stable environment, 24x7 access
• Very easy to explain to novice users
• Reduces/eliminates need for computer labs with statistical software
• Allows you to teach statistics rather than software
• Students get hands on data quickly
• Switch easily between weighted and unweighted distributions

Advantages for teaching (cont’d):
• Measures of association and tests of significance comparable to SAS
• Design effects, in files in which cluster and/or stratum variables are available
• Interactive demonstration of statistical concepts
• Share recoded variables
• Can quickly mount additional data to fulfill your teaching needs

Advantages for research:
• Stable environment, 24x7 access
• Access to latest available version of the data
• Basic exploratory data analysis: eg are there enough cases for my subset?
• Design effects, where cluster/sample variables available
• Download data and import to SAS/SPSS/Stata on own workstation
• Share recoded variables
• Integrated variable descriptions (selected data files)

Advantages for data management:
• Creates metadata from SAS/SPSS/Stata syntax or DDI format xml files
• Very easy and fast to import files with good syntax files
• Control over what users can and cannot do
• Outputs include SAS/SPSS/Stata syntax or DDI format xml files
• Overhead: size of uncompressed data + about 50%

Disadvantages of SDA:
• Search for variables/values among data files not yet implemented at UT/CHASS
• Can’t download created/recoded variables – coming in spring 2009
• Graphics minimal, eg no stem-and-leaf, box-plots etc
• Doesn’t output SAS/SPSS/Stata system/export files, only raw data files plus syntax files
• Little support for Study/File level metadata (DDI)
• No support for nCubes (DDI 2)

How SDA compares to the competition • See table at: http://www.chass.utoronto.ca/datalib/misc/accoleds/2008/sda_compare.htm

Common questions from researchers & students:
• When to weight versus not to weight
• Does it only do cross-tabs?
• But I want the raw data, not a cross-tabulation!
• Differences between syntax, data, and system files.

An application we wouldn’t have tackled without SDA:
• Q: I need the average expenditure on eye care in Canada by age group of household head for as long a time-period as possible.
• A: Once we explained SDA, the student had generated this statistic from each of the FAMEX/SHS files, 1969-2004, in under 30 minutes. (He knew only Stata.)

Functions we know to be coming in SDA:
• Among-file variable searching – already available but not yet implemented on CHASS
• Downloading recoded variables
• Will allow users to load own data files (Archiver in SDA 3.1) – already available but not yet implemented on CHASS

Exercises:
• First time SDA user? Try these exercises using the Census 2001 microdata on individuals
• Experienced SDA user? Try these exercises using a variety of DLI data files

Questions:
• Question 1: Where will I find the SDA server at University of Toronto?
• Answer 1: The URL is: http://www.chass.utoronto.ca/datalib/ Select ‘Microdata analysis and extraction’.

Questions (cont’d):
• Question 2: How are files chosen to be mounted on the SDA server at UT?
• Answer 2:
  - All significant Canadian microdata files, eg by Statistics Canada as released by DLI
  - Other files based on your requests

Questions (cont’d):
• Question 3: My research is being done collaboratively with a colleague at another Canadian university. Can my colleague get access to SDA?
• Answer 3: SDA is available as a subscription service to other Canadian DLI-member universities and colleges. Current subscribers include: U of Victoria, Ryerson U, and Memorial U.
