270 likes | 340 Views
Statistical Software Programming. STAT 6360 –Statistical Software Programming. Data Input in SAS Many ways to get your data into SAS: Through data entry via viewtable and other tools As part of your data step via datalines .
E N D
STAT 6360 –Statistical Software Programming Data Input in SAS Many ways to get your data into SAS: • Through data entry via viewtable and other tools • As part of your data step via datalines. • Reading from raw data files (ascii or plain text files, delimited files) • Converting the data files of other software into SAS data files. • Reading other software’s data files directly. • We will cover all of these, but concentrate on how to read raw data using the input and infilestatements.
STAT 6360 –Statistical Software Programming Using Viewtable to Enter Data An easy way to enter small datasets is via viewtable. • Click Tools → Table Editor and a blank spreadsheet will pop up in viewtable. • Now you can enter data. Columns are variables, rows are observations. • Right-click column names to create variable names and set variable properties. • Then save as a SAS dataset in any available library and use as usual.
STAT 6360 –Statistical Software Programming Example – My Stats Grades Open viewtable. Create two character variables in columns A and B: class and grades. Use default properties. Enter the statistics classes you have taken (e.g., STAT2000, STAT4210) and the grades you received (A, B+, etc.) Save as dataset grades in WORK library. Close viewtable and save changes.
STAT 6360 –Statistical Software Programming Example – My Stats Grades Run the following code: Your grades data should print: procprintdata=grades; run; • Entering data in viewtable can be convenient for small tasks but is impractical for big datasets. • Even for small jobs, better to have a source file for your data or have it permanently recorded in your code (e.g., in datalines).
STAT 6360 –Statistical Software Programming Using PROC IMPORT & Import Wizard • SAS has a proc called PROC IMPORT that will import various file types and convert them to SAS datasets. • Excel files, Access files, comma separated values (.csv files), tab and other delimited ascii files, and files in SPSS, Stata, JMP and other formats. • The import wizard is a convenient front-end to PROC IMPORT that can be used to generate PROC IMPORT code and/or to run PROC IMPORT in the background if we don’t care to see that code. • We will illustrate it to import some data on nutritional information for different types of cereal. The cereal types were sampled using stratified random sampling from four shelves at Dillon’s grocery store in Manhattan, KS, in January, 1999. Shelf is the stratum. The data are in cereal_data.xls, an Excel file, which should be placed in your “Data Files” subfolder on your USB drive.
STAT 6360 –Statistical Software Programming Using PROC IMPORT & Import Wizard • Select File → Import Data…. This will give a dialog form which you can select the type of data source you want to import. Excel is default and we’ll use that, so click Next. • Now browse and select cereal_data.xls from your Data Files folder and press ok. • Note that there is currently a bug in SAS for Windows X64 that generates an error if you’re trying to import from the 32 bit version of MS Office. If you encounter this, go back and change the data source to Microsoft Excel Workbook on PC Files Server, click Next, Browse to the file, and click OK (leave the info in the PC Files Server box unchanged).
STAT 6360 –Statistical Software Programming Using PROC IMPORT & Import Wizard • Now you can select the worksheet within the Excel workbook if it has data in multiple worksheets (use Sheet1$ in our example) and, optionally, choose from some options that fine-tune the import (don’t change these options for this example). Click Next. • Now you can select a library and name your dataset. Use the WORK library and name the dataset cereal. • The wizard will generate PROC IMPORT code to import the Excel file. It now asks if you want that code written to a file so you have it for future use. Let’s do that. • Browse to the SAScode subfolder on your USB, set the filename to import_cereal.sas, and click Save (in the browse dialog). • Now click Finish. This will import the data, which you can now view with viewtable or by using proc print.
STAT 6360 –Statistical Software Programming Using PROC IMPORT & Import Wizard • Take a look at the PROC IMPORT code that was generated. • Open import_cereal.sas in the editor. Here’s the code: Dataset to be created PROCIMPORTOUT= WORK.cereal DATAFILE= "MyPath\Data Files\cereal_data.xls" DBMS= EXCEL REPLACE; RANGE="Sheet1$"; GETNAMES=YES; MIXED=NO; SCANTEXT=YES; USEDATE=YES; SCANTIME=YES; RUN; Source file to import Overwrite output dataset if it exists. Type of source file* These are file-format-specific statements. In this case, EXCEL-specific statements. See http://tinyurl.com/k5jojys for documentation. • Variable names taken from first row by default. To change, use statement GETNAMES=NO; • * See SAS/ACCESS(R) 9.3 Interface to PC Files: Reference (http://tinyurl.com/m9dz4a8) for list of DBMS choices.
STAT 6360 –Statistical Software Programming Using PROC IMPORT & Import Wizard • Always check the log file after trying to import a file. • Should look like this in our example: • Always close the Excel file before trying to import it!!! Leaving it open may result in failure to import the file or (worse) undetected strange behavior when importing the file.
STAT 6360 –Statistical Software Programming Using PROC IMPORT & Import Wizard • PROC IMPORT can also import .csv files. cereal_data.csv is a comma-separated version of the cereal dataset. • MS Office gives such files an Excel icon, but they are plain text files! Here is cereal_data.csv in Notepad:
STAT 6360 –Statistical Software Programming Using PROC IMPORT & Import Wizard • To import cereal_data.csv • choose the .csv file type • browse to the file’s location • don’t change any options (for now) • save the dataset as cereal2 (in WORK) • and save the PROC IMPORT code as import_cereal2.sas. • This seemed to work, but look at observation 25 (in viewtable). The cereal name is incomplete. • By default, the length of character variables is determined by scanning the 1st 20 records in the file. • Add guessingrows=41; statement to the PROC IMPORT code and retry. It works!
STAT 6360 –Statistical Software Programming Reading Data from Plain Text Files • Although some plain text files can be read with PROC IMPORT (delimited files), they are more flexibly input using the INFILE and INPUT statements in a data step. • Types of input – list, column, formatted, delimited, and named. We’ll cover the first 4 of these. List Input: • We used this when using datalines(pets.sas) and in pets2.sas. • Data must be separated by spaces or delimiters. • You must read all the data in each record (line of data). • All missing values must be indicated with periods. • No spaces allowed in values of character variables. • Need length statement for character variables with length>8. • Can’t handle special data types like dates, times.
STAT 6360 –Statistical Software Programming Example: InputExamps.sas • This file illustrates many different input modes. • Example 1: we use list input to read in data on tar, nicotine, CO content for 25 cigarette brands. • Similar to pets2.sas but one complication: brand is too long (>8 characters). SAS handles this by truncating brand to length 8. • This can be fixed with an informator by adding a length statement to tell SAS brand has up to 16 characters. • Must appear before input. • Note the use of (obs=5) in PROC PRINT. • This is an example of a SAS data set option. Such options appear in parentheses next to the name of a dataset to be used in a PROC. They affect the dataset as it is brought into the PROC for manipulation/analysis. • In this case, we are telling PROC PRINT to use dataset cigs, but only use the first 5 observations.
STAT 6360 –Statistical Software Programming Example: InputExamps.sas • Example 2: Consumer Reports Data on 1993 Cars • 26 variables, so each observation spans two lines in data file. • Two ways to handle this with input statement: • Insert slash (/) to tell SAS when to go to the next line. • Precede variables on line 1 with #1, vars on line 2 with #2, etc. • Again we used length statement because manuf, model are long. • This dataset has asterisks for missing values for some numeric variables. • When SAS encounters * when expecting a number, it generates a NOTE in log window and sets the value to missing. • This is ok in this example, but better to use periods in raw data file to avoid worrisome NOTE.
STAT 6360 –Statistical Software Programming Example: InputExamps.sas • Example 3: TVs and Life Expectancy – Column Input Column Input • Useful for character or standard numeric data (no dates or special formatting) aligned in columns in raw data file. • Advantages over list input: • Spaces not required between values • Missing values can be left blank • Character values can be long and can contain spaces • Can skip unwanted values in raw data file • Easy to use. After each variable on the input statement specify the column range from which to read the variable. • Comes after $ if a character variable. • E.g.: inputname $ 1-10shoesize12-15;
STAT 6360 –Statistical Software Programming Example: InputExamps.sas • Example 3: TVs and Life Expectancy – Column Input • Data from 40 largest countries on life expectancy, TVs and doctors per capita (inverted, actually) as of 1993. • Notice long country names with spaces handled without a length statement. • Blank values treated as missing (e.g., people per TV in Tanzania)
STAT 6360 –Statistical Software Programming Example: InputExamps.sas • Example 4: Tour de France Winners – Formatted Input Formatted Input - Informats Q: Numeric data can’t contain characters (other than +,-,e) so how does SAS know to read 1,000 (one thousand) as 1000, or 07/04/1776 (Independence Day) as a numeric date? A: Informats An informat is a code to tell SAS the format to use when reading in data. • SAS also has formats, which tell it how to print or output a data value. You can think of them as “outformats”.
STAT 6360 –Statistical Software Programming Informats The three main types and their general form: • Every character informat begins with $ • The informat part is the name of the informat and, in part determines what it does. • The w part determines the total width • The d part determines how many digits follow the decimal point in a numeric format. • The period is an essential part of any informat.
STAT 6360 –Statistical Software Programming Informats Character formats do things like • Specify the length of the character value ($w.) • Trim leading blanks • Converts to UPPER CASE Numeric formats do things like • Specify width & how many decimal places (w.d) • Handle percentages, numbers with commas
STAT 6360 –Statistical Software Programming Informats Date and time formats • Read in dates (12/07/1941, 06-jun-44) of different form and convert them to SAS date values, which are the number of days since Jan. 01, 1960. • E.g., mmddyy10. reads 12/31/1959 and assigns it a SAS date value of -1 • Read in times of different form and convert them to SAS time values, which are the number of seconds since midnight of the current day. • E.g., time10. reads 00:10:07.4 and assigns it a SAS time value of 607.4 • Read in times on a particular day and convert them to SAS datetime values, which are the number of seconds since midnight, Jan. 01, 1960. • These conversions allow calculations on dates and times (e.g., how much time has passed between two dates).
STAT 6360 –Statistical Software Programming Example: InputExamps.sas • Example 4: Tour de France Winners – Formatted Input • In this example, we use @n where n is a column number to tell the INPUT statement where to start reading each variable. • Then each variable is read in from that column using the informat given after the variable name on the INPUT statement. input@1year 4. @6start_town$18. @25start_datemmddyy8. Start from col 1 and read a number that has width 4 and no decimal places into the numeric variable year Then start in col 6 and read 18 columns into a character variable called start_town Then start at column 25 and read a date that is 8 columns wide and has format mmddyy (could be mm-dd-yy or mm/dd/yy) into the variable start_date. Treat as a SAS date value.
STAT 6360 –Statistical Software Programming Example: InputExamps.sas • Example 4: Tour de France Winners – Formatted Input • The use of @n (@1, @6, @25, …) is one type of pointer control. • We are controlling where SAS points to when reading in the next data value. • Another pointer control is +n where n is the number of columns forward that you want SAS to move. • E.g., input @1 year 4. +1start_town$18.; • This tells SAS to start at column 1 and read four columns into year. SAS will then be pointing to column 5. Then +1 tells it to move ahead one column and point to column 6. Then it reads start_town with the informat$18.
STAT 6360 –Statistical Software Programming Example: InputExamps.sas • Example 5: Pizza Size – Delimited Input • Delimited (e.g., .csv) data files can be read in with INFILE and INPUT statements instead of PROC IMPORT. • DLM=‘,’ tells SAS what kind of delimiter to expect • Use DLM=‘05’X for tab-delimited data. • DSD option tells SAS • Consecutive delimiters should be interpreted as missing values. • Don’t read quotation marks as part of the data value • Treat delimiters (e.g., commas) inside quotes as part of the value, not a delimiter. • MISSOVER option tells SAS not to go to next line in input file if values of some variables not provided. Instead give those variables missing values.
STAT 6360 –Statistical Software Programming Example: InputExamps.sas • Example 5: Pizza Size – Delimited Input • Colon (:) operator on a character informat(e.g., :$9.) tells SAS to read 9 columns into the character variable, but stop if you reach a delimiter before all 9 columns have been read. • To see how DSD, MISSOVER, colon operator work, try reading the data in pizzasize_fake.csv with the code in inputexs.sas that is written for that purpose, but omitting MISSOVER, DSD, and the colon on the :$9.informat, each in turn. Finally, note that it is ok and sometimes convenient to mix input modes within a single input statement. • Good example of this section 2.9 of D&S.