1 / 31

Lesson 2

Lesson 2. Topic - Reading in data Programs 1 and 2 in course notes Chapter 2 (Little SAS Book). Raw Data. Read in Data Process Data (Create new variables) Output Data (Create SAS Dataset). Data Step. Analyze Data Using Statistical Procedures. PROCs. Raw Data Sources.

zuwena
Download Presentation

Lesson 2

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Lesson 2 • Topic - Reading in data • Programs 1 and 2 in course notes • Chapter 2 (Little SAS Book)

  2. Raw Data Read in Data Process Data (Create new variables) Output Data (Create SAS Dataset) Data Step Analyze Data Using Statistical Procedures PROCs

  3. Raw Data Sources • You type it in the SAS program • Text file • Spreadsheet (Excel) • Database (Access, Oracle) • SAS dataset

  4. Data in Text Files Text files are simple character files that you can create or view in a text editor like Notepad. They may also be created as “dumps” from spreadsheet files like excel. • Delimited data – variables are separated by a special character (e.g. a comma) • Fixed position – data is organized into columns

  5. Data delimited with spaces: C 84 138 93 143 D 89 150 91 140 A 78 116 100 162 A . . 86 155 C 81 145 86 140 • Note: Missing data is identified with a period.

  6. Data delimited with commas C,84,138,93,143 D,89,150,91,140 A,78,116,100,162 A,.,.,86,155 C,81,145,86,140 • Note: Missing data is identified with a period.

  7. Data delimited by commas(.csv file) C,84,138,93,143 D,89,150,91,140 A,78,116,100,162 A,,,86,155 C,81,145,86,140 • Note: Missing data is identified by multiple commas.

  8. Column Data C084138093143 D089150091140 A078116100162 A 086155 C081145086140 • Note: Missing data values are blank.

  9. INFILE and INPUT Statements When you write a SAS program to read in raw data, you’ll use two key statements: • The INFILE statement tells SAS where to find the data and how it is organized. • The INPUT statement tells SAS which variables to read-in

  10. Program 1 * List Directed Input: Reading data values separated by spaces; DATA bp; INFILE DATALINES; INPUT clinic $ dbp6 sbp6 dbpbl sbpbl; DATALINES; C 84 138 93 143 D 89 150 91 140 A 78 116 100 162 A . . 86 155 C 81 145 86 140 ; RUN ; TITLE'Data Separated by Spaces'; PROCPRINTDATA=bp; RUN; Obs clinic dbp6 sbp6 dbpbl sbpbl 1 C 84 138 93 143 2 D 89 150 91 140 3 A 78 116 100 162 4 A . . 86 155 5 C 81 145 86 140

  11. PARTIAL SASLOG 1 DATA bp; 2 INFILE DATALINES; 3 INPUT clinic $ dbp6 sbp6 dbpbl sbpbl; 4 DATALINES; NOTE: The data set WORK.BP has 5 observations and 5 variables. NOTE: DATA statement used: real time 0.39 seconds cpu time 0.03 seconds

  12. * List Directed Input: Reading data values separated by commas; DATA bp; INFILE DATALINES DLM = ',' ; INPUT clinic $ dbp6 sbp6 dbpbl sbpbl; DATALINES; C,84,138,93,143 D,89,150,91,140 A,78,116,100,162 A,.,.,86,155 C,81,145,86,140 ; RUN ; TITLE'Data separated by a comma'; PROCPRINTDATA=bp; RUN;

  13. * List Directed Input: Reading .csv files DATA bp; INFILE DATALINES DLM = ','DSD ; INPUT clinic $ dbp6 sbp6 dbpbl sbpbl; DATALINES; C,84,138,93,143 D,89,150,91,140 A,78,116,100,162 A,,,86,155 C,81,145,86,140 ; TITLE 'Reading in Data using the DSD Option'; PROCPRINTDATA=bp; RUN; Consecutive commas indicate missing data

  14. * List Directed Input: Reading data values separated by tabs (.txt files); DATA bp; INFILE DATALINES DLM = '09'xDSD; INPUT clinic $ dbp6 sbp6 dbpbl sbpbl; DATALINES; C 84 138 93 143 D 89 150 91 140 A 78 116 100 162 A 86 155 C 81 145 86 140 ; TITLE'Reading in Data separated by a tab'; PROCPRINTDATA=bp; RUN;

  15. * Column Input: Data in fixed columns. DATA bp; INFILE DATALINES ; INPUT clinic $ 1-1 dbp6 2-4 sbp6 5-7 dbpbl 8-10 sbpbl 11-13 ; DATALINES; C084138093143 D089150091140 A078116100162 A 086155 C081145086140 ; Title'Reading in Data using Column Input'; PROCPRINTDATA=bp; Note: missing data is blank

  16. * Reading data using Pointers and Informats DATA bp; INFILE DATALINES ; INPUT @1 clinic $1. @2 dbp6 3. @5 sbp6 3. @8 dbpbl 3. @11 sbpbl 3. ; DATALINES; C084138093143 D089150091140 A078116100162 A 086155 C081145086140 ; Title'Reading in Data using Point/Informats'; PROCPRINTDATA=bp; Informats must end with a period.

  17. * Reading data using Informat Lists DATA quallife; INFILE DATALINES ; INPUT (QL1-QL35) (1.) ; DATALINES; 31232242414444223544545354455342324 21112353214352552525522662566553533 21122252241333356262662366266551525 ; Title'Reading in Data using Informat Lists'; PROCPRINTDATA=quallife; VAR QL1-QL35; RUN; O Q Q Q Q Q Q Q Q Q L L L L L L L L L L L L L L L L L L L L L L L L L L b L L L L L L L L L 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 s 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 1 3 1 2 3 2 2 4 2 4 1 4 4 4 4 2 2 3 5 4 4 5 4 5 3 5 4 4 5 5 3 4 2 3 2 4 2 2 1 1 1 2 3 5 3 2 1 4 3 5 2 5 5 2 5 2 5 5 2 2 6 6 2 5 6 6 5 5 3 5 3 3 3 2 1 1 2 2 2 5 2 2 4 1 3 3 3 3 5 6 2 6 2 6 6 2 3 6 6 2 6 6 5 5 1 5 2 5

  18. Program 2 * Reading data from an external file DATA bp; INFILE‘C:\SAS_Files\bp.csv'DSD FIRSTOBS = 2; INPUT clinic $ dbp6 sbp6 dbpbl sbpbl ; TITLE'Reading in Data from an External File'; PROCPRINTDATA=bp; clinic,dbp6,sbp6,dbpbl,sbpbl C,84,138,93,143 D,89,150,91,140 A,78,116,100,162 A,,,86,155 C,81,145,86,140 Content of bp.csv

  19. PARTIAL SAS LOG 7 DATA bp; 8 INFILE 'C:\SAS_Files\bp.csv' DSD FIRSTOBS=2 ; 9 INPUT clinic $ dbp6 sbp6 dbpbl sbpbl ; NOTE: The infile 'C:\SAS_Files\bp.csv' is: File Name=C:\SAS_Files\bp.csv, RECFM=V,LRECL=256 NOTE: 5 records were read from the infile 'C:\SAS_Files\bp.csv'. The minimum record length was 10. The maximum record length was 16. NOTE: The data set WORK.BP has 5 observations and 5 variables. NOTE: DATA statement used (Total process time): real time 0.10 seconds cpu time 0.01 seconds

  20. *Reading data from an external file using a FILENAME statement; FILENAME bpdata ‘C:\SAS_Files\bp.csv'; DATA bp; INFILE bpdata DSD FIRSTOBS = 2; INPUT clinic $ dbp6 sbp6 dbpbl sbpbl ; TITLE'Reading in Data Using FILENAME'; PROCPRINTDATA=bp;

  21. * Using PROC IMPORT to read in data ; * Can skip data step; * Can also try IMPORT Wizard; PROCIMPORT DATAFILE=‘C:\SAS_Files\bp.csv' OUT = bp DBMS = csv REPLACE ; GETNAMES = yes; GUESSINGROWS=9999; TITLE'Reading in Data Using PROC IMPORT'; PROCPRINTDATA=bp; PROCCONTENTS DATA=bp; Uses first row for variable names

  22. * PC SAS can read excel files directly; PROCIMPORT DATAFILE=‘C:\SAS_Files\bp.xls' OUT = bp DBMS = xls REPLACE ; GETNAMES = yes; TITLE'Reading in Data from excel'; PROCPRINTDATA=bp; PROCCONTENTS; Uses first row for variable names

  23. The CONTENTS Procedure Data Set Name WORK.BP Observations 5 Member Type DATA Variables 5 Alphabetic List of Variables and Attributes # Variable Type Len Format Informat 1 Clinic Char 1 $1. $1. 2 DBP6 Num 8 BEST12. BEST32. 4 DBPBL Num 8 BEST12. BEST32. 3 SBP6 Num 8 BEST12. BEST32. 5 SBPBL Num 8 BEST12. BEST32.

  24. SOME INFILE OPTIONS • OBS - limits number of observations read • FIRSTOBS - start reading from this obs. • MISSOVER and TRUNCOVER - used to read in data with short records • TERMSTR= used when reading PC files on a UNIX machine (or vice versa) • LRECL= needed when you have data with long records

  25. Problem when reading past default logical record length; DATA temp; INFILE‘C:\SAS_Files\tomhs.data' OBS=6 ; INPUT @260 jntpain 2. ; TITLE'Data not read in correctly because variable is past default LRECL of 256'; PROCPRINT; NOTE: Invalid data for jntpain in line 2 NOTE: SAS went to a new line when INPUT statement reached past the end of a line Obs jntpain 1 . 2 . 3 .

  26. *Add LRECL option to fix problem ; DATA temp; INFILE‘C:\…\tomhs.data' OBS=6 LRECL=500; INPUT @260 jntpain 2. ; TITLE'Data read in correctly using LRECL option'; PROCPRINT; Obs jntpain 1 1 2 1 3 1 4 1 5 1 6 2

  27. Reading Special Data • 04/11/1982 Date • 59,365 Comma in number • 086-59-9054 Long (>8) characters Informat • 04/11/1982 mmddyy10. • 59,365 comma6. • 086-59-9054 $11.

  28. * Reading special data with fixed position data; DATA info; INFILE DATALINES; INPUT @1 ssn $11. @13 taxdate mmddyy10. @25 income comma6. ; DATALINES; 086-59-9054 04/12/2001 59,365 405-65-0987 03/15/2002 26,925 212-44-9054 04/15/2003 44,999 ; TITLE'Variables with Special Formats'; PROCPRINTDATA=info; FORMAT taxdate mmddyy10.; Obs ssn taxdate income 1 086-59-9054 04/12/2001 59365 2 405-65-0987 03/15/2002 26925 3 212-44-9054 04/15/2003 44999

  29. * Reading special data with list input using colon modifier; DATA info; INFILE DATALINES DLM=“;” DSD; INPUT ssn : $11. taxdate : mmddyy10. income : comma6. ; DATALINES; 086-59-9054;04/12/2001;59,365 405-65-0987;03/15/2002;26,925 212-44-9054;04/15/2003;44,999 ; TITLE'Variables with Special Formats'; PROCPRINTDATA=info; FORMAT taxdate mmddyy10.; Obs ssn taxdate income 1 086-59-9054 04/12/2001 59365 2 405-65-0987 03/15/2002 26925 3 212-44-9054 04/15/2003 44999

  30. * Using INFORMAT statement to supply input formats; DATA info; INFILE DATALINES DLM=“;” DSD; INFORMAT ssn $11. taxdate mmddyy10. income comma6.; INPUT ssn taxdate income ; DATALINES; 086-59-9054;04/12/2001;59,365 405-65-0987;03/15/2002;26,925 212-44-9054;04/15/2003;44,999 ; TITLE'Variables with Special Formats'; PROCPRINTDATA=info; FORMAT taxdate mmddyy10.; Obs ssn taxdate income 1 086-59-9054 04/12/2001 59365 2 405-65-0987 03/15/2002 26925 3 212-44-9054 04/15/2003 44999

  31. Summary of Ways of Reading in Data • List input - data is separated by a delimiter; must read in all variables. • Column input - data is in fixed columns;must know where each variable starts and ends; can read in selected variables • Pointers and Informats - alternative to column input; most flexible; must be used for special data • PROC IMPORT

More Related