Learn to import and manipulate data in SAS for the Donor Recapture Case analysis. Explore data recoding and correction techniques to enhance data quality and analysis effectiveness.
SAS Enterprise Miner Release 4.3 A brief overview: analysis of the Donor Recapture Case (Case 3) Kevin Garsek … Class of 2006
Importing Base Data • SAS’s main drawback is that any record containing a null or blank value is disregarded entirely • In this case, if we were unable to manipulate the data, the number of available records would decrease dramatically • We can fight back by recoding the data, as will be shown in the import step
Importing Charity Data Text Editor
Text Editor We will use the text editor in Base SAS to import the Charity Case data. In order to use this editor, you simply type as you would in any text editor.
Text Editor A line-by-line example of the code that we will use is as follows:
libname charity 'C:\Documents and Settings\Kevin\Desktop\Datamining\charity.1';
    Denotes the master folder on your local PC where the raw data is housed.
data charity.raw;
    Tells SAS to create a new dataset named raw in the charity library.
infile 'chr\2.dat' missover firstobs=2;
    Points SAS to the subfolder and file in which the data is housed and tells it to import that file into the new dataset. missover sets a variable to missing when a line runs out of values instead of reading on into the next line, and firstobs=2 starts reading at the second line of the file.
input OSOURCE $;
    Names the data column OSOURCE; the $ tells SAS that this is character data (if it were left out, SAS would assume the data is numeric).
OSOURCE_D = 0;
    Because missing data is prevalent, this creates a new dummy variable named OSOURCE_D and sets it to 0 for every record.
if trim(OSOURCE) = "" then do;
    trim removes trailing blanks, and the if opens an if-then block that compensates for blank data.
OSOURCE = "0";
    Sets all missing values in the OSOURCE column to "0".
OSOURCE_D = 1;
    Sets the newly created dummy variable to 1 when OSOURCE was blank in the input file.
end;
    Closes the do block. All code from infile to end can be written on a single line in the text editor.
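Put together, the statements walked through above might look like the following sketch. The closing run; statement is an addition that is not part of the walkthrough (it marks the end of the DATA step); the paths are copied as given and are not verified.

```sas
/* Library pointing at the master folder that houses the raw data. */
libname charity 'C:\Documents and Settings\Kevin\Desktop\Datamining\charity.1';

data charity.raw;
  /* Read the raw file, starting at line 2; short lines yield missing values. */
  infile 'chr\2.dat' missover firstobs=2;
  input OSOURCE $;

  /* Dummy flag: 0 by default, 1 when OSOURCE was blank in the input file. */
  OSOURCE_D = 0;
  if trim(OSOURCE) = "" then do;
    OSOURCE = "0";
    OSOURCE_D = 1;
  end;
run;  /* assumption: added here to end the DATA step */
```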
Importing Charity Data This slide depicts the completed code. The code can easily be generated in Excel by concatenating cells with the & operator and then pasting the result into the text editor. Moving the writing process to Excel will save considerable time during this laborious process.
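As an alternative to generating the repeated lines in Excel, the same recode pattern could be wrapped in a SAS macro and called once per variable. This is only a sketch and not the method used in the slides: the macro name recode_char and the demo data set work.demo are hypothetical, and the in-line data stands in for the charity file.

```sas
/* Hypothetical macro: emits the blank-recode block for one character variable. */
%macro recode_char(var);
  &var._D = 0;                    /* dummy flag, 0 unless the value is blank */
  if trim(&var) = "" then do;
    &var = "0";                   /* replace the blank with "0" */
    &var._D = 1;                  /* record that the original value was blank */
  end;
%mend recode_char;

/* Small in-line demo so the sketch runs without the charity file. */
data work.demo;
  infile datalines missover;
  input OSOURCE $;
  %recode_char(OSOURCE)
  datalines;
ABC

XYZ
;
run;

proc print data=work.demo;
run;
```

The period in &var._D ends the macro variable name, so a call with OSOURCE produces OSOURCE_D.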
Importing Charity Data Once the code is completed, right-click in the text editor and select “submit all”. This tells SAS to read through the code in the text editor and execute it. Be prepared: because the data is large, this will take considerable time to complete.
Starting Enterprise Miner from Base SAS module You should now have a fully working dataset and you are now ready to open Enterprise Miner by following the subsequent slides.
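Before moving on, it may be worth confirming that the import did what was intended. A minimal check, assuming the charity library from the import step is still assigned and the dataset contains the OSOURCE_D dummy variable:

```sas
/* List the variables, their types, and the number of observations. */
proc contents data=charity.raw;
run;

/* Count how many records had a blank OSOURCE (flag = 1). */
proc freq data=charity.raw;
  tables OSOURCE_D;
run;
```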
Binding Data to Program • This is an exasperating activity • Even for someone who took a SAS training course in Enterprise Miner • The documentation is pathetic • I’ll document each step carefully in case this ever happens to you
Bind Data to Project Right click on tools to get this menu.
Bind Data to Project Left click on initialization, left click top edit.
Bind Data to Project Right click select; browse for library RDATA; click ok
Bind Data to Project Gotcha: you must select RAW and hit Enter, even though it is the only data set in RDATA
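The library RDATA has to be assigned before it can be browsed for. The slides do not show how that was done; one possibility is a libname statement such as the sketch below. That RDATA points at the same folder used in the import step is an assumption, not something the slides confirm.

```sas
/* Assumption: RDATA refers to the folder that holds the imported data. */
libname RDATA 'C:\Documents and Settings\Kevin\Desktop\Datamining\charity.1';

/* List the members of the library to confirm that RAW is visible. */
proc datasets library=RDATA;
run;
quit;
```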
Change to Larger Sample Left click change; the sample size was changed to 10,000 so that low-response items are represented
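For comparison, a simple random sample of 10,000 records can be drawn in Base SAS with PROC SURVEYSELECT. This is a rough analogue of the sampling setting, not what Enterprise Miner runs internally; the output name work.sample and the seed are hypothetical.

```sas
/* Simple random sample of 10,000 records; the seed makes it repeatable. */
proc surveyselect data=charity.raw
                  out=work.sample
                  method=srs
                  n=10000
                  seed=12345;
run;
```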
Click Variables Tab Notice that some variables are rejected. This is typically because the column has only one value throughout, e.g. a dummy variable that is always 0 because there is no variation in the input data.
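Columns with a single value can also be spotted outside Enterprise Miner. A minimal sketch using the NLEVELS option of PROC FREQ, assuming the dataset is available as charity.raw:

```sas
/* Print only the number of distinct levels per variable;
   a variable with exactly one level carries no information. */
proc freq data=charity.raw nlevels;
  tables _all_ / noprint;
run;
```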
Then Bad Things Happen • Who knows why. • If I hadn’t taken the course the slides would stop here. • That’s the only reason I know what to do • I’ll document this also, in case it happens to you.
Crash Recovery Right click on top level icon; select explore
Crash Recovery Open emproj; delete all files with extension .lck; open user subfolder; delete everything in user subfolder
Analysis Resumes • We’ll have a look at MAILCODE. • Enterprise Miner has some neat graphical tools that are easy to use. • The simplest and easiest are part of the data input tool.
A Histogram Right click item, select “view distribution of MAILCODE” from drop down menu
Histogram of Mailcode SAS has classified as missing data that R accepted and used!
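The same distribution can be tabulated in Base SAS. In this sketch the MISSING option makes PROC FREQ treat blank values as a category of their own instead of leaving them out of the table, which shows how many MAILCODE values SAS regards as missing. It assumes the dataset is available as charity.raw.

```sas
/* Frequency table of MAILCODE with blanks counted as their own level. */
proc freq data=charity.raw;
  tables MAILCODE / missing;
run;
```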
Must Identify TARGET_D as Target Right click row item in column “Model Role”, select “Change Model Role” from drop down menu, select “target” from next drop down menu
Histogram of Target This is what makes the problem hard: extremely low response rate!
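The response rate can be quantified with a short query. This sketch assumes that TARGET_D is a numeric donation amount that is 0 for non-responders, which the slides do not state explicitly; the column aliases are hypothetical.

```sas
/* A comparison such as TARGET_D > 0 evaluates to 1 or 0,
   so its sum counts donors and its mean is the response rate. */
proc sql;
  select count(*)          as n_records,
         sum(TARGET_D > 0)  as n_donors,
         mean(TARGET_D > 0) as response_rate format=percent8.2
  from charity.raw;
quit;

/* Histogram of the target; NOPRINT suppresses the summary tables. */
proc univariate data=charity.raw noprint;
  var TARGET_D;
  histogram TARGET_D;
run;
```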
Add Data Partition Node Drag down from tool bar above and connect line by dragging the mouse.
This is What it Does We will choose to use an 80%/20% training/validation allocation. Close box, right click, click “Run” on drop down menu.
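An 80%/20% split can be approximated in a DATA step by drawing a uniform random number per record. This is a sketch of the idea, not the Data Partition node's own algorithm; the output names work.train and work.valid and the seed are hypothetical.

```sas
/* Send roughly 80% of records to training and the rest to validation. */
data work.train work.valid;
  set charity.raw;
  if ranuni(12345) < 0.8 then output work.train;
  else output work.valid;
run;
```

Because each record is assigned independently, the split is only approximately 80/20.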
Design Philosophy Click lower tools tab. Note tools on left. One drags a tool to worksheet and connects with arrows. We’ll now drag and connect regression.
Regression Chose stepwise selection, validation error. That mimics what we did in R.
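Outside Enterprise Miner, stepwise selection judged on validation error could be sketched with PROC GLMSELECT, which is available in later SAS/STAT releases. This is not what the Regression node runs internally. The tables work.train and work.valid are hypothetical, and the predictors are the ones named later in these slides, with PEPSTRFL assumed to be a character variable.

```sas
/* Stepwise selection; the final model is the step with the
   lowest error on the validation data. */
proc glmselect data=work.train valdata=work.valid;
  class PEPSTRFL;
  model TARGET_D = LASTGIFT DOB PEPSTRFL
        / selection=stepwise(choose=validate);
run;
```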
Regression Right-click on the Regression node and select Run
Regression Regression is highlighted in green while running
Regression Let's take a look at the results; SAS has a very different interpretation of the important variables than the R analysis
Regression The error rate is not that bad, but the significant variables are not necessarily easily interpretable.
Regression Let's try it again with a few changes to the model selection
Regression Again, we get results, but nothing easily interpretable.
Regression Let's limit the regression to those variables that R determined to be significant. To do this, we will again right-click on regression and select open.
Regression Then go to the variables tab. Right-click under the status column for each unneeded variable and set the status to “don’t use”.
Regression In addition to limiting our variables to those from the R results, we are going to add an interaction as well as a squared variable. The first step is to add the squared term: add a transform variables node, right-click on the node, and select open.
Regression From the variables tab, we will right-click on DOB and select Transform.
Regression We will now select square. This will create a new variable, DOB_L1S6, which will then be used in our next regression.
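In Base SAS the equivalent transformation is a one-line assignment. In this sketch the input table work.train and the names work.train_sq and DOB_SQ are hypothetical; Enterprise Miner generates its own name (DOB_L1S6 here).

```sas
/* Add the square of DOB as a new variable. */
data work.train_sq;
  set work.train;
  DOB_SQ = DOB**2;   /* hypothetical name for the squared term */
run;
```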
Regression Our next step is to create an interaction. To do this, go back to the main diagram and double click on regression. This should bring you into the model manager where you will click on the Interaction Builder icon.
Regression On this screen, you should use the Ctrl button to highlight both Lastgift and Pepstrfl. Next, press the Cross button in order to create the new interaction variable. The new variable should be added to the available terms window and should be used in subsequent regressions.
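Written by hand, an interaction is expressed by crossing the two effects with an asterisk in the MODEL statement. A minimal sketch with PROC GLM, assuming a hypothetical training table work.train and that PEPSTRFL is a character variable; DOB*DOB supplies the squared term directly, so no pre-computed variable is needed.

```sas
/* Main effects, the LASTGIFT-by-PEPSTRFL interaction, and a squared DOB term.
   SOLUTION prints the coefficient estimates. */
proc glm data=work.train;
  class PEPSTRFL;
  model TARGET_D = LASTGIFT PEPSTRFL LASTGIFT*PEPSTRFL DOB DOB*DOB
        / solution;
run;
quit;
```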
Regression Results! While the initial bar graph may look complex, this is how SAS handles character data and creates dummy variables.
Regression As we now look at the table of coefficient estimates, we have interpretable results!
Regression Those who are interested can look at the Code tab to see the actual SAS code that one would have to write to program this regression manually.
Regression Let's add another level of analysis and try to rid the data of outliers. To do this, you will need to incorporate a Filter Outlier node between the Transform Variables and Regression nodes.
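To illustrate what outlier filtering does, the sketch below drops records whose LASTGIFT lies more than three standard deviations from the mean. The three-standard-deviation rule, the choice of LASTGIFT, and the table names are assumptions for illustration; the Filter Outlier node's actual settings are not shown in the slides.

```sas
/* Compute the mean and standard deviation of LASTGIFT. */
proc means data=work.train noprint;
  var LASTGIFT;
  output out=work.stats mean=mu std=sigma;
run;

/* Keep records within 3 standard deviations (and records with
   a missing LASTGIFT, which this filter does not judge). */
data work.train_filtered;
  if _n_ = 1 then set work.stats(keep=mu sigma);
  set work.train;
  if missing(LASTGIFT) or abs(LASTGIFT - mu) <= 3 * sigma;
  drop mu sigma;
run;
```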