310 likes | 560 Views
Data Transformation. Data cleaning. Importing Data. Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access data ODBC for external database data. Importing an Excel Spreadsheet. PROC IMPORT OUT= WORK.Fall2007
E N D
Data Transformation Data cleaning
Importing Data • Reading data from external formats • Libname/Infile/Input for text form data • Proc Import for Excel/Access data • ODBC for external database data
Importing an Excel Spreadsheet PROCIMPORT OUT= WORK.Fall2007 DATAFILE= "L:\DataWarehousing07f\CourseDatabase\Fall2007.xls" DBMS=EXCEL REPLACE; SHEET="'Fall 07$'"; GETNAMES=YES; MIXED=NO; SCANTEXT=YES; USEDATE=YES; SCANTIME=YES; RUN;
Import an Access Table PROC IMPORT OUT= WORK.OrderLine DATATABLE= "OrderLin" DBMS=ACCESS REPLACE; DATABASE="I:\DataWarehousing07f\WholesaleProducts.mdb"; SCANMEMO=YES; USEDATE=NO; SCANTIME=YES; RUN;
Good Practice • Check the metadata for a dataset PROC CONTENTS DATA= OrderLine; RUN; • Print a few records PROC PRINT DATA= OrderLine (OBS= 10); RUN;
Saving SAS Datasets LIBNAME course "L:\DataWarehousing07f\CourseDatabase"; Data course.Spring2008; set spring2008; run; Note: the name associated with the libname command (“course”) must be 8 characters or less.
LIBNAME / INFILE / INPUT for character data • LIBNAME identifies the location or folder where the data file is stored • INFILE specifies the libname to use for reading external data. • INPUT reads text format data • SET reads SAS data
INFILE with INPUT for character data files DATA Fitness; INFILE "L:\DataWarehousing07f\TransformationSAS\SAS1.txt"; INPUT NAME $ WEIGHT WAIST PULSE CHINS SITUPS JUMPS; run;
Creating Derived Attributes Generating new attributes for a table. SAS creates attributes when they are referred to in a data step. The metadata depends on the context of the code. • LENGTH statements • FORMAT statements • FORMATS and INFORMATS • PUT • INPUT
PUT and INPUT Functions TextOutput = PUT(variable, format) Note: the result of a put function is always character Note: there is also a PUT statement that writes the contents of a variable to the SAS log Output = INPUT(CharacterInput, informat) Note: the variable for an input function is always character
Formats • Formats always contain a period • Formats for character variables always start with a $ • The most used format categories are Character, Date and Time, and Numeric Note: use the SAS “search” tab to look for “Formats.” For a list of SAS formats look under: “Formats: Formats by Category”
Good Practice The following code is handy for testing functions and formats in SAS. The _Null_ dataset name tells SAS not to create the datset in the WORK library Data _Null_; InputVal= 123; OutputVal= PUT(InputVal, Roman30.); PUT InputVal OutputVal; run;
Generating Dates • Generating a Date dimension • Usually done offline in something like Excel • SAS has extensive date and datetime functions and formats • SAS formats apply to only one of datetime, date or time variable types. Convert from one type to another with SAS functions.
Creating a text variable for Date Data Orders2; Length Date $10.; Set Orders; Date= PUT( Datepart(OrderDate), MDDYY8.); • The Length statement assures that the variable will have enough space. It must come before the SET. • OrderDate has DateTime format. The DATEPART function produces a date format output. MMDDYYx. is a date format type.
SAS Functions We are especially interested in “Character” and “Date and Time” functions Note: use the SAS “search” tab to look for “Functions.” For a list of SAS functions look under: “Functions and CALL routines: Functions and CALL Routines by Category”
Useful Data Cleaning Functions • Text Manipulation: • COMPRESS, STRIP, TRIM, LEFT, RIGHT, UPCASE, LOWCASE • Text Extraction • INDEX, SCAN, SUBSTR, TRANSLATE, TRANWRD
Parsing • The process of splitting a text field into multiple fields • Uses SAS functions to extract parts of a character string. • Fixed position in a string: SUBSTR • Known delimiter: SCAN Note: it is a good idea to strip blanks before you try to parse a string.
Example of Parsing Data Customer2; LENGTH street cust_addr $20.; FORMAT street cust_addr $20.; SET Customer; Cust_Addr= TRIM(Cust_Addr); Number= Scan(Cust_Addr,1,' '); Street= Scan(Cust_Addr,2,' '); run; Note: The LENGTH and FORMAT statements clear trailing blanks for further display.
Parsing Results Obs cust_addr Number street 1 481 OAK 481 OAK 2 215 PETE 215 PETE 3 48 COLLEGE 48 COLLEGE 4 914 CHERRY 914 CHERRY 5 519 WATSON 519 WATSON 6 16 ELM 16 ELM 7 108 PINE 108 PINE
Good Practice Always print the before and after images here. Parsing free form text can be quite a problem. For example, apartment addresses ‘110b Elm’ and ‘110 b Elm’ will parse differently. In this case you may have to search the second word for things that look like apartments and correct the data.
=SUBSTR(string, position<, length>) Use this when you have a known position for characters. • String: character expression • Position: start position (starts with 1) • Length: number of characters to take (missing takes all to the end) VAR= ‘ABCDEFG’ NEWVAR= SUBSTR(VAR,2,2) NEWVAR2= SUBSTR(VAR,4) NEWVAR= ‘BC’ NEWVAR2= ‘DEFG’
SUBSTR(variable, position<,length>) = new-characters Replaces character value contents. Use this when you know where the replacement starts. a='KIDNAP'; substr(a,1,3)='CAT'; a: CATNAP substr(a,4)='TY' ; a: KIDTY
INDEX(source, excerpt) • Searches a character expression for a string of characters. Returns the location (number) where the string begins. a='ABC.DEF (X=Y)'; b='X=Y'; x=index(a,b); x: 10 x= index(a,’DEF’); x: 5
Alternative INDEX functions • INDEXC searches for a single character • INDEXW searches for a word: Syntax INDEXW(source, excerpt<,delimiter>)
Length Returns the length of a character variable • The LENGTH and LENGTHN functions return the same value for non-blank character strings. LENGTH returns a value of 1 for blank character strings, whereas LENGTHN returns a value of 0. • The LENGTH function returns the length of a character string, excluding trailing blanks, whereas the LENGTHC function returns the length of a character string, including trailing blanks. LENGTH always returns a value that is less than or equal to the value returned by LENGTHC.
Standardizing • Adjusting terms to standard format. • Based off of frequency prints. • Use functions or IF statements • TRANWRD is easy but can produce unexpected results • IF statements are safer, but less general
Standardization Code Supplier= Tranwrd(supplier, " Incorporated", ""); If Supplier= "Trinkets & Things" then supplier= "Trinkets n' Things"; More complex logic is often needed. See the course examples.
Good Practice It is a good idea to produce a change log for standardized changes: Data Products2 Changed; Set Products; SupplierOld= Supplier; * * * * Output Products2; If Trim(supplier) ^= Trim(SupplierOld) then output Changed; Proc Print Data= Changed; Var SupplierOld Supplier;
Locating Anomalies • Frequency counts are a good way to identify anomalies. • It is also helpful to identify standard changes that you do not have to review. • Probably the safest way to execute standard changes is with a “Change Table” that lists From and To values. (Advanced SAS exercise – go for it!!)
De Duplicating • Reconcile different representations of the same entity • Done after standardizing. Usually requires multi-field testing. • May use probabilistic logic, depending on the application. • Should produce a change log.
Correcting • Identifying and correcting values that are wrong • Very difficult to do. Usually based off of exception reports or range checks.