350 likes | 485 Views
14b. Accessing Data Files in SAS ®. Prerequisites. Recommended modules to complete before viewing this module 1. Introduction to the NLTS2 Training Modules 2. NLTS2 Study Overview 3. NLTS2 Study Design and Sampling NLTS2 Data Sources, either 4. Parent and Youth Surveys or
E N D
Prerequisites • Recommended modules to complete before viewing this module • 1. Introduction to the NLTS2 Training Modules • 2. NLTS2 Study Overview • 3. NLTS2 Study Design and Sampling • NLTS2 Data Sources, either • 4. Parent and Youth Surveys or • 5. School Surveys, Student Assessments, and Transcripts • NLTS2 Documentation • 10. Overview • 11. Data Dictionaries • 12. Quick References
Overview • Purpose • Opening and viewing data files • Limiting variables • Subsetting cases • Joining/combining data files • Summary • Closing • Important information
NLTS2 restricted-use data NLTS2 data are restricted. Data used in these presentations are from a randomly selected subset of the restricted-use NLTS2 data. Results in these presentations cannot be replicated with the NLTS2 data licensed by NCES.
Purpose • Learn to • Open a data file • See what is in a file (i.e., contents of the file) • “Size” a data file for a perfect fit • Reduce the number of variables • Reduce the number of cases (i.e., subset the data) • Combine information from multiple sources • Bring in data from another source or another wave • Join or combine files • Create a new file
Open and view data files • SAS® and SPSS ® data are in separate folders. • SAS data files have a “.sas7bdat” extension. • Associated value formats are stored in a SAS library, “Formats.sas7bcat.” • SAS programming code is available for re-creating the user-defined formats in the SAS format library. • Hyperlinked from the table of contents on the database and documentation disk.
Open and view data files These results cannot be replicated with full dataset; all outputin modules generated with a random subset of the full data. • Files are either read from or written to. • Files have a name and a location where they are stored. • SAS needs to know the name of the file and where to find it. • SAS uses a LIBNAME statement to identify the path. • The path describes the nesting of folders. • C:\myprojects\NLTS2\Data is a path or location. • i.e., the file is located on Drive “C”, in the folder “Data,” which is nested inside the “myprojects” and “NLTS2” folders. • An example LIBNAME statement for a path would be LIBNAME sasdb 'C:\myprojects\NLTS2\Data' ;
Open and view data files These results cannot be replicated with full dataset; all outputin modules generated with a random subset of the full data. • An easy way to view the contents of the file is PROC CONTENTS. • Specify LIBNAME statements for the data files and format library. LIBNAME [ddname] '[path]' ; LIBNAME library '[path]' ; • Syntax PROC CONTENTS DATA = [ddname].[file name] ; RUN ;
Viewing files Use the Wave 1 teacher file Run a PROC CONTENTS. Open and View Data Files: Example These results cannot be replicated with full dataset; all outputin modules generated with a random subset of the full data.
Open and view data files: Example These results cannot be replicated with full dataset; all outputin modules generated with a random subset of the full data.
Viewing files These results cannot be replicated with full dataset; all outputin modules generated with a random subset of the full data.
Limiting variables These results cannot be replicated with full dataset; all outputin modules generated with a random subset of the full data. • How to reduce the number of variables in the file • Large files with many cases and many variables are unwieldy; simplify. • Fewer variables to search through. • Fewer cases to process. • Create data files that are limited to just those variables needed for analysis. • You have the choice of drop or keep. • Which one is best? The one that requires less typing! • If you are dropping more variables than keeping, use “keep” and vice versa.
Limiting variables These results cannot be replicated with full dataset; all outputin modules generated with a random subset of the full data. • Note: When making changes to your data • Use work files for temporary changes. • Work files are files that are in existence only for the duration of the program or SAS interactive session. • Work files are temporary files unless they are saved. • Work files have a one-level name—no ddname needed • To save modified data permanently, create a new permanent data file. • Usually it is best to create a new file rather than to modify the source file.
Limiting variables These results cannot be replicated with full dataset; all outputin modules generated with a random subset of the full data. • Syntax DATA Par_w1_lmt_vars (DROP= np1E2a np1GroupMember); SET sasdb.n2w1parent(KEEP= ID w1_DisHdr2001 w1_GendHdr2001w1_IncomeHdr2001 w1_AgeHdr2001 np1Weight np1HealthProb np1GroupMember np1ProblemCount np1E2a np1B2a) ; run ; • A “KEEP” or “DROP” option on a “SET” or “MERGE” statement controls which variables come into the data set. • A “KEEP” or “DROP” option on a “DATA” statement controls which variables are saved on the output data set. • Notice that SAS statements end in a semicolon.
Limiting Variables: Example • w2_dis12 • w2_gend2 • na_grade4 • w2_incm3 • wt_na These results cannot be replicated with full dataset; all outputin modules generated with a random subset of the full data. • Limiting variables • Create a file with fewer variables. • Create a new file called “PrScores” from n2w2dirassess. • Keep only the following variables: • ID • ndacalc_pr • ndaPC_PR • ndasyn_pr • NDaF1_friend • na_age4 • Save the new file • Review the new file with a PROC CONTENTS
Limiting variables: Example These results cannot be replicated with full dataset; all outputin modules generated with a random subset of the full data.
Subsetting cases These results cannot be replicated with full dataset; all outputin modules generated with a random subset of the full data. • How to reduce the number of cases in a data step or in a procedure • Often analysis is done on a subset; for example: • Select only youth with visual impairment. • Select only youth who are out of secondary school. • Exclude younger students.
Subsetting cases These results cannot be replicated with full dataset; all outputin modules generated with a random subset of the full data. • Example: Limit Wave 4 parent/youth interview data to those who are 21 or older, excluding youth who are 19 or 20 (W4_Age2007 = 19 or 20). • Syntax to limit cases in a file DATA AgeGT20 ; SET sasdb.n2w4paryouth (WHERE=(W4_Age2007>20)); run; or DATA AgeGT20 ; SET sasdb.n2w4paryouth ; IF W4_Age2007>20; run ;
Subsetting cases These results cannot be replicated with full dataset; all outputin modules generated with a random subset of the full data. • Syntax to limit cases in a procedure • No change to file with this syntax PROC FREQ data=sasdb.n2w4paryouth ; WHERE W4_Age2007>20 ; TABLES W4_Age2007 ; run ;
Subsetting cases: Example These results cannot be replicated with full dataset; all outputin modules generated with a random subset of the full data. • Subsetting cases • Create a small data set with a subset of cases. • Use “PrScores” created in previous example. • Limit cases to those classified with hearing. impairment only, i.e., those with a value of “5” for “w2_dis12”. • Look at notes in the log window. • Are there any clues that the file has changed?
Joining/combining data files These results cannot be replicated with full dataset; all outputin modules generated with a random subset of the full data. • How to bring in data from another file • Purpose • Learn to combine or join files • Bring in data from another source • Bring in data from another wave • Learn what to watch for • Number of cases in the combined file • How cases are joined • Key variable, i.e., which variable to match on • Keyed file, i.e., which cases to keep
Joining/combining data files These results cannot be replicated with full dataset; all outputin modules generated with a random subset of the full data. • Why do this? • Often it is necessary to combine information from different files to perform comparative analyses, create new variables, or measure differences over time. • For example, you may want to • Create composite variables from multiple sources. • Look at similar items at different points in time.
Joining/combining data files These results cannot be replicated with full dataset; all outputin modules generated with a random subset of the full data. • Example of composite variables from multiple sources. • Create a variable for “if parent attended a parent/teacher conference” using Wave 2 teacher survey item nts2C8, and fill in with parent interview item np2E1a_d if teacher data are missing. • Example of items at different points in time. • Create a variable to look at the pattern of employment between waves 2 and 3: employed both waves, either wave, or neither wave. • Set to “employed both waves” if np2HasPdJob (W2) and np3HasJob (W3) are “yes.” • Else set to “employed in either wave” if np2HasPdJob or np3HasJob are “yes.” • Else set to “not employed” if np2HasPdJob and np3HasJob are “no.”
Joining/combining data files There will be missing records across files and missing items within files. If data look like this, ask which file is the main filebeing analyzed. These results cannot be replicated with full dataset; all outputin modules generated with a random subset of the full data.
Joining/combining data files These results cannot be replicated with full dataset; all outputin modules generated with a random subset of the full data. • Data in all files must be sorted by the key variable. • The key variable matches files case by case. • Key variable is “ID.” • Files on CD should be sorted by key variable, but as you work with files they may become unsorted. • Syntax to sort data PROC SORT data=sasdb.PrScores ; by ID ; run ; • Note for those who do not want to destroy data • If you use a KEEPor DROPstatement, do a temporary sort, or otherwise change the data, save to a new file.
Joining/combining data files These results cannot be replicated with full dataset; all outputin modules generated with a random subset of the full data. • Syntax to join files DATA Emplmt ; MERGE sasdb.n2w1parent (KEEP=id np1i_3a_7) sasdb.n2w2paryouth (KEEP=id np2HasPdJob) sasdb.n2w3paryouth (KEEP=id np3HasJob) sasdb.n2w4paryouth (KEEP=id np4HasJob) ; BY ID ; run ;
Joining/combining data files These results cannot be replicated with full dataset; all outputin modules generated with a random subset of the full data. SAS log
Joining/combining data files These results cannot be replicated with full dataset; all outputin modules generated with a random subset of the full data. • Syntax to join files keeping only those who have Wave 4 data DATA Emplmt ; MERGE sasdb.n2w1parent (KEEP=id np1i_3a_7) sasdb.n2w2paryouth (KEEP=id np2HasPdJob) sasdb.n2w3paryouth (KEEP=id np3HasJob) sasdb.n2w4paryouth (KEEP=id np4HasJob in=inW4) ; BY ID ; IF inW4 ; run ;
Joining/combining data files These results cannot be replicated with full dataset; all outputin modules generated with a random subset of the full data. SAS log
Joining/combining data files These results cannot be replicated with full dataset; all outputin modules generated with a random subset of the full data. • Suggestion: Name files in a meaningful way, such as • By date: AnFile_29July • By type of Analysis: PI_CrossWave • By source: PI_W123 • By sequence: File_5 • Be sure to use a two-level data set name (using a ddname) to save the file; otherwise, it is a work or temporary file. DATA Emplmt ; /* temporary work file */ DATA sasdb.Emplmt; /* saved file */
Joining/combining data files:Example These results cannot be replicated with full dataset; all outputin modules generated with a random subset of the full data. • Joining/combining data • Combine data from another file with an existing file. • Sort PrScores by ID. • Bring in np2HasPdJob from n2w2paryouth. • Bring in np3HasJob from n2w3paryouth. • Save the file as PrScoresEmp.
Joining/combining data files:Example These results cannot be replicated with full dataset; all outputin modules generated with a random subset of the full data.
Summary • Congratulations! You have learned to • Open and view a file • Create a new file • Reduce the size of files by specifying the • Variables needed • Cases needed • Join files using a key variable • Save files with a new name
Closing • Topics discussed in this module • Purpose • Opening and viewing data files • Limiting variables • Subsetting cases • Joining/combining data files • Next module • 15b. Accessing Data: Frequencies in SAS
Important information • NLTS2 website contains reports, data tables, and other project-related information http://nlts2.org/ • Information about obtaining the NLTS2 database and documentation can be found on the NCES website http://nces.ed.gov/statprog/rudman/ • General information about restricted data licenses can be found on the NCES websitehttp://nces.ed.gov/statprog/instruct.asp • E-mail address: nlts2@sri.com