Jon Peck, Senior Software Engineer, IBM (peck@us.ibm.com), November 2010. Extending and Customizing IBM SPSS Statistics with R, Python, and .NET
IBM SPSS Statistics • IBM® SPSS® Statistics has an extensive command language (syntax) for data acquisition, manipulation, and statistical and graphical procedures • Programmability and scripting dramatically extend these built-in capabilities • Allow custom user interfaces and output to be produced • Converting large SAS applications is likely to require the use of programmability
Agenda • Programmability introduction • Four examples • Automating repetitive work: applySyntaxToFiles • Integrating programs and scripting: SPSSINC MODIFY TABLES • Adding a procedure from R: SPSSINC QUANTILE REGRESSION • Adding a procedure in Python: SPSSINC TURF
Programmability increases your power, flexibility, and productivity • Generalization • React flexibly to metadata, results, and the environment • Benefit: Write fewer similar jobs • Automation • Embed program logic in jobs • Benefit: Less manual work • Extension • Tap existing R or Python statistical modules • Add your own or extend standard procedures and transformations • Benefit: More capabilities • Integration • Connect IBM SPSS Statistics inputs and outputs to other agents • Benefit: Make IBM SPSS Statistics part of a larger production process • More productivity and more fun
IBM SPSS Statistics embeds three programming languages • Plug-ins let you extend capabilities using • Python • R • .NET languages (Windows only) • Free plug-in downloads • SPSS Developer Central web site provides articles, SPSS-written modules, plug-ins and user contributions • New SPSS Community on IBM myDeveloperWorks
My first Python program • Python or R program code goes in the normal Statistics syntax window
GET FILE="c:/data/important.sav".
BEGIN PROGRAM PYTHON.
import spss
print("Hello, IBM")
END PROGRAM.
DESCRIPTIVES ....
Programmability combines SPSS Statistics with Python, R, or .NET • A program in the input stream can communicate with IBM SPSS Statistics, control it, and use Python or R facilities and modules (internal mode), for example: spss.Submit("GET FILE='c:/data/cars.sav'.") • A Python or .NET application can embed IBM SPSS Statistics inside itself (external mode) • User interface does not appear • There is a lower-level C API available in an SDK
Programmability functionality is fully integrated into IBM SPSS Statistics • Programs run in the regular syntax stream • Users can define IBM SPSS Statistics syntax for programs and scripts via the Extension mechanism • Users can create dialog boxes and menus using the Custom Dialog Builder • Not just for extensions or programs • Python and R output appears in the Viewer • plain text • pivot tables • charts
Python and R Programmability APIs cover these areas • State information of Statistics • Get/Set variable dictionary information • Get/Set data • Get Viewer output (via XML workspace) • Create tables/charts/text objects in Viewer • Run Statistics commands (Python only)
Python and VB scripting APIs cover user interface and output • Programmability is a backend (SPSS Processor) domain • Scripting is mainly a frontend (user interface, including output) domain • Managing output Viewer and objects • tables: formatting, pivoting, editing, … • objects: visibility, order, titles, outline text, … • General user interface control • Almost anything you can do via the user interface • Not available for R
.NET plug-in embeds Statistics inside another program • Example: Statistical Explorer • Statistics, graphs, and data management via Statistics • Two pages of VB.NET code
Python and R are open source software • Programmability plug-ins are an optional installation • They are free (but require a Statistics license) • They make it possible to tap the work of the Python and R communities • Python and R have license agreements • IBM non-warranty license agreement • For R, GPL license
Extension commands eliminate need for user to learn Python or R • Extension mechanism lets you define IBM SPSS Statistics-style syntax for programs • IBM SPSS Statistics takes care of validation and parsing • Passes user input to a program in an easy-to-digest form • Automatically loaded when IBM SPSS Statistics starts • Look to the user like built in commands • Easy to distribute to others
You can create and share your own additions to IBM SPSS Statistics • Write Python or R functions to implement the functionality or tap existing packages • Use input APIs to get data to Python or R • Use output APIs to create pivot tables • Each of these can be a single line of code • For extensions, • Define the syntax in an XML file • Use tools in extension.py (Python) or spsspkg (R) to receive parsed output and pass it to the implementing function • New in v18: R version of extension.py • Use the Custom Dialog Builder to create the interface • The CDB is not just for extensions • Test and document! • Package and distribute • Contributions to Developer Central are welcome • Documentation is at SPSS Developer Central
Extension commands: validation and mapping from syntax to Python or R function parameters is handled for you • Example: SPSSINC BREUSCH PAGAN • implemented using an R package • SPSSINC_BREUSCH_PAGAN.xml specifies the syntax to the Statistics parser • The R mapping code in SPSSINC_BREUSCH_PAGAN.R respecifies the syntax and invokes the executing routine with parsed parameters • overlaps with the XML syntax definition but provides additional features
SPSSINC BREUSCH PAGAN DEPENDENT = salary ENTER = educ jobcat
  /OPTIONS MISSING=LISTWISE
  /SAVE RESIDUALSDATASET=resids COEFSDATASET=coefs.
Python or, in this case, R code gets the parsed syntax, which is turned into function arguments
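A minimal Python sketch of that hand-off, for illustration only. The `parsed` dictionary layout, the `KEYWORD_TO_ARG` table, and the `breusch_pagan` stand-in are all hypothetical; the real structures passed by extension.py or spsspkg may look quite different.

```python
# Hypothetical parsed form of the SPSSINC BREUSCH PAGAN command:
# {subcommand: {keyword: value}}, with "" for keywords before the first slash.
parsed = {
    "": {"DEPENDENT": "salary", "ENTER": ["educ", "jobcat"]},
    "OPTIONS": {"MISSING": "LISTWISE"},
    "SAVE": {"RESIDUALSDATASET": "resids", "COEFSDATASET": "coefs"},
}

# Hypothetical mapping from (subcommand, keyword) to a function argument name.
KEYWORD_TO_ARG = {
    ("", "DEPENDENT"): "dependent",
    ("", "ENTER"): "enter",
    ("OPTIONS", "MISSING"): "missing",
    ("SAVE", "RESIDUALSDATASET"): "residualsdataset",
    ("SAVE", "COEFSDATASET"): "coefsdataset",
}


def to_kwargs(parsed_syntax, mapping):
    """Flatten parsed syntax into keyword arguments, rejecting unknown keywords."""
    kwargs = {}
    for subcommand, keywords in parsed_syntax.items():
        for keyword, value in keywords.items():
            key = (subcommand, keyword)
            if key not in mapping:
                raise ValueError("Unexpected keyword: %s %s" % key)
            kwargs[mapping[key]] = value
    return kwargs


def breusch_pagan(dependent, enter, missing="LISTWISE",
                  residualsdataset=None, coefsdataset=None):
    """Stand-in for the implementing routine; it only echoes its arguments."""
    return {"dependent": dependent, "enter": enter, "missing": missing,
            "residualsdataset": residualsdataset, "coefsdataset": coefsdataset}


result = breusch_pagan(**to_kwargs(parsed, KEYWORD_TO_ARG))
print(result)
```

The point of the design is that the implementing function never sees raw syntax text, only ordinary named arguments.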
Expand the audience by creating IBM SPSS Statistics syntax and dialog boxes
Example I Generalize and automate work • You have syntax files and need to process datasets not known in advance every day • applySyntaxToFiles function applies a syntax file to each file in input specification
Use programmability to automate routine processes • Apply standard processing to an unknown set of files • Produce processed data and reports
Use a program to drive processing
begin program.
import spss, spssaux3
spssaux3.applySyntaxToFiles(inputspec="c:/temp/parts/*.sav",
    syntax="c:/myjobs/dailychecks.sps",
    outputdatadir="c:/temp/processed",
    outputfiledir="c:/temp/processed",
    logfile="c:/temp/processed/report.txt")
end program.
• dailychecks.sps could apply data cleaning rules, modify data, and create reports • Could be run daily through Production Mode or C&DS job scheduler or used interactively • Extended version available as SPSSINC PROCESS FILES
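A rough sketch of the idea behind such a driver, not the spssaux3 implementation. The function name `apply_syntax_to_files` and its `submit` parameter are assumptions made for this illustration; inside Statistics one would presumably pass `spss.Submit`, while here a list collects the commands so the sketch runs anywhere.

```python
import glob
import os
import tempfile


def apply_syntax_to_files(inputspec, syntax, outputdatadir, submit):
    """Generate GET / INSERT / SAVE commands for every file matching inputspec."""
    processed = []
    for path in sorted(glob.glob(inputspec)):
        path = path.replace("\\", "/")  # forward slashes are safe in syntax
        outfile = outputdatadir.rstrip("/") + "/" + os.path.basename(path)
        submit([
            "GET FILE='%s'." % path,        # open the data file
            "INSERT FILE='%s'." % syntax,   # run the standard syntax file
            "SAVE OUTFILE='%s'." % outfile  # write the processed data
        ])
        processed.append(path)
    return processed


# Demonstration with empty placeholder files in a temporary directory.
workdir = tempfile.mkdtemp()
for name in ("east.sav", "west.sav", "notes.txt"):
    open(os.path.join(workdir, name), "w").close()

submitted = []  # stands in for the Statistics backend
done = apply_syntax_to_files(os.path.join(workdir, "*.sav"),
                             "c:/myjobs/dailychecks.sps",
                             "c:/temp/processed",
                             submitted.append)
for commands in submitted:
    print("\n".join(commands))
```

Because the file list is resolved at run time, the same job handles whatever files arrive each day.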
Example II Automate dynamic or static formatting of tables • Use integrated scripting for better table presentation
SPSSINC MODIFY TABLES extension command manipulates table formatting and structure • TableLooks provide static formatting for entire areas of a table • data cells • row and column layers • You want tables with formatting beyond TableLooks • Many users copy tables to Excel and manually format them • Basic and Python Scripting provide a programmatic way to do formatting • SPSSINC MODIFY TABLES provides syntax for extensive formatting • Eliminates need to know scripting • Uses Extension mechanism for programs and Python scripting
Use dynamic highlighting to make crosstab table easier to read
SPSSINC MODIFY TABLES SUBTYPE='Crosstabulation' DIMENSION=ROWS SELECT='Std. Residual'
  /STYLES TEXTSTYLE=BOLD BACKGROUNDCOLOR=255 0 0 APPLYTO='abs(x) >2'.
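To make the APPLYTO condition concrete, here is a small sketch of how a condition string such as 'abs(x) >2' could pick out cells. This is an assumption about the general idea, not the extension's actual code; `cells_matching` and the residual values are made up for the illustration.

```python
def cells_matching(values, condition):
    """Return (row, column) positions whose value x satisfies the condition string."""
    code = compile(condition, "<applyto>", "eval")
    # Expose only abs() to the expression; x is bound to each cell value in turn.
    allowed = {"__builtins__": {}, "abs": abs}
    hits = []
    for i, row in enumerate(values):
        for j, x in enumerate(row):
            if x is not None and eval(code, allowed, {"x": x}):
                hits.append((i, j))
    return hits


# Made-up standardized residuals from a 2 x 3 crosstab.
std_residuals = [[-2.6, 0.4, 2.1],
                 [1.3, -0.2, -1.9]]

highlight = cells_matching(std_residuals, "abs(x) >2")
print(highlight)  # cells that would be shown bold on a red background
```

Only the two cells whose residual exceeds 2 in absolute value are selected, so the formatting follows the data rather than a fixed position in the table.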
Custom dialog boxes are easy to create • Dialog created with Custom Dialog Builder • Generates extension command syntax • Easy to distribute
Use static formatting to call out parts of a table
SPSSINC MODIFY TABLES subtype='variables in the equation' SELECT="B" "Sig."
  /STYLES TEXTCOLOR = 0 0 255 BACKGROUNDCOLOR=0 255 0.
Format CTABLES totals to call them out
SPSSINC MODIFY TABLES SUBTYPE="Custom Table" SELECT = "Total" DIMENSION=ROWS
  /STYLES BACKGROUNDCOLOR=255 255 88 TEXTSTYLE = BOLD.
Use custom functions for special effects
SPSSINC MODIFY TABLES SUBTYPE='Report' SELECT="<<ALL>>"
  /STYLES APPLYTO=DATACELLS TEXTCOLOR=255 255 255 TEXTSTYLE=BOLD
  CUSTOMFUNCTION="customstylefunctions.washColumnsBlue".

def washColumnsBlue(obj, i, j, numrows, numcols, section, more):
    # the third color component grows from mincolor to maxcolor across the columns
    mincolor = 150.
    maxcolor = 255.
    increment = (maxcolor - mincolor)/(numcols-1)
    colorvalue = round(mincolor + increment * j)
    obj.SetBackgroundColorAt(i, j, RGB((mincolor, mincolor, colorvalue)))
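The arithmetic in washColumnsBlue can be checked outside Statistics. The sketch below reproduces only the color ramp; `blue_ramp` is a hypothetical helper name, and the guard for a single-column table is an addition, since the slide's formula divides by numcols-1.

```python
def blue_ramp(numcols, mincolor=150.0, maxcolor=255.0):
    """Third color component for each column, rising linearly from mincolor to maxcolor."""
    if numcols < 2:
        # Added guard: with one column the slide's formula would divide by zero.
        return [int(mincolor)] * numcols
    increment = (maxcolor - mincolor) / (numcols - 1)
    return [int(round(mincolor + increment * j)) for j in range(numcols)]


ramp = blue_ramp(4)
print(ramp)  # one value per column, left to right
```

For a four-column table the increment is 105/3 = 35, so the columns get 150, 185, 220, and 255.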
Example III Extend IBM SPSS Statistics by tapping the work of the R and Python communities • Add R procedures seamlessly to IBM SPSS Statistics
R • R is a programming language for statistics • leading edge statistics • many contributed statistics and graphics packages • free • R is not so easy to learn • Documentation by experts for experts • Feels like a complex programming language – because it is • Syntax is a lot like C • Error messages can be cryptic, for example: Error in optim(rho, f, control = control, hessian = TRUE, method = "BFGS") : initial value in 'vmmin' is not finite • Good for programmers(?); bad for users • R holds data in memory • R for SAS and SPSS Users, Bob Muenchen, Addison-Wesley, 2008
R procedures can be accessed from IBM SPSS Statistics using the R plug-in • The R plug-in makes it easy to use R packages • IBM SPSS Statistics datasets and Viewer output can be processed by R using the plug-in • Graphical, text, and table output appear in the Viewer • Pivot tables can be created with R code • New IBM SPSS Statistics datasets can be created from R • R communicates with IBM SPSS Statistics via APIs in the plug-in • Integration requires writing a little R wrapper code • IBM SPSS Statistics can provide • dialog box interface • IBM SPSS Statistics-style syntax • pivot table output • Plug-in is downloadable from Developer Central
Quantile regression models conditional quantiles • Ordinary regression models the conditional mean • Median regression is the 0.5 quantile (50th percentile) • Estimating quantiles is useful with varying spread, asymmetries, outliers • Areas of application include • empirical finance • value at risk • mutual fund investment styles • credit scoring • school quality • demand analysis • others
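A small worked example of why quantiles resist outliers where the mean does not. It fits only a constant (no predictors) by minimizing the standard quantile "check" loss over the observed values; this is a teaching sketch on made-up data, not the algorithm used by the R quantreg package.

```python
def check_loss(residual, tau):
    """Quantile (pinball) loss: tau * r for r >= 0, (tau - 1) * r for r < 0."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def total_loss(values, q, tau):
    """Total check loss when every observation is predicted by the constant q."""
    return sum(check_loss(v - q, tau) for v in values)


def best_constant(values, tau):
    """The loss is piecewise linear, so a minimum lies at one of the data points."""
    return min(sorted(set(values)), key=lambda q: total_loss(values, q, tau))


data = [1, 2, 3, 4, 100]  # made-up sample with one large outlier

mean_fit = sum(data) / float(len(data))
median_fit = best_constant(data, 0.5)
lower_quartile_fit = best_constant(data, 0.25)

print(mean_fit, median_fit, lower_quartile_fit)
```

The outlier pulls the mean to 22, while the 0.5-quantile fit stays at 3. Quantile regression applies the same loss with a linear function of the predictors in place of the constant.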
SPSSINC QUANTREG extension embeds the R quantreg package
Example IV Extend IBM SPSS Statistics by adding procedures in Python • TURF analysis
TURF Analysis is popular in market research • Total Unduplicated Reach and Frequency (TURF) • Find the highest coverage of positive responses for a small number of questions • Example: How do you reach the largest audience by advertising on a few kinds of sports? • football, cricket, basketball, cycling, ... • Example: What ice cream flavors should you offer in your shops that have three dispensing machines? • Example: What phone features should you promote? • multi-line, voicemail, paging, internet ... • Simple FREQUENCIES does not account for overlap
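The difference between reach and simple frequencies can be seen on a toy example. The respondent IDs below are made up, and `reach` and `frequency` are illustrative helper names, not part of SPSSINC TURF.

```python
# Made-up data: IDs of respondents who answered positively for each sport.
responses = {
    "football":   {1, 2, 3, 4, 5, 6},
    "basketball": {1, 2, 3, 4, 5},
    "cycling":    {7, 8, 9},
}


def reach(items):
    """Unduplicated reach: respondents positive on at least one item."""
    return len(set().union(*(responses[item] for item in items)))


def frequency(items):
    """Sum of the simple counts, which counts overlapping respondents twice."""
    return sum(len(responses[item]) for item in items)


for pair in (["football", "basketball"], ["football", "cycling"]):
    print(pair, "frequency:", frequency(pair), "reach:", reach(pair))
```

Football plus basketball has the larger summed frequency (11 versus 9) but reaches only 6 people, because the basketball fans are already football fans; football plus cycling reaches 9.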
TURF calculations are demanding • Must compute all possible set unions of positive responses (up to a maximum number of variables) • Each set is a list of case IDs with a positive response on a question • This problem is computationally explosive • [Table: calculations for best 10 combinations of variables] • Is a scripting language like Python too slow?
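A quick way to see the explosion is to count the set unions involved. The sketch assumes one union per combination of 1 up to `max_size` variables; `union_count` is a hypothetical helper, and the variable counts are illustrative.

```python
from math import comb  # Python 3.8+


def union_count(num_variables, max_size):
    """Number of variable combinations of size 1..max_size."""
    return sum(comb(num_variables, size) for size in range(1, max_size + 1))


small = union_count(9, 3)    # 9 variables, groups of up to 3
large = union_count(30, 5)   # 30 variables, groups of up to 5

print(small, large)
```

Nine variables in groups of up to three need 129 unions; thirty variables in groups of up to five already need 174,436, and each union runs over every case.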
Extension command SPSSINC TURF is implemented in Python • Provides • Dialog box interface • IBM SPSS Statistics style syntax • The computations • Pivot table output • Fewer than 300 lines of Python code • Plus dialog box definition • Plus extension command syntax definition • Executes requests involving a few million set comparisons in a few minutes • Initial version written in two days
Analysis of phone data • Telco survey (9 variables, 1000 cases) • Dialog created with Custom Dialog Builder
Results show the combinations of features with the best reach • Pivot table created from Python code • Best singles are conference calling, call forwarding, and call waiting
The best three are not the top three one at a time • Calculations completed in a few seconds
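A toy exhaustive search shows how the best combination can differ from the top individual items. The feature names echo the slide, but the respondent sets are made up and the search is a plain brute-force sketch, not the SPSSINC TURF code.

```python
from itertools import combinations

# Made-up data: IDs of respondents interested in each phone feature.
positives = {
    "conference": {1, 2, 3, 4, 5, 6},
    "forwarding": {1, 2, 3, 4, 5},
    "waiting":    {1, 2, 3, 4},
    "voicemail":  {7, 8, 9},
}


def reach(names):
    """Number of respondents interested in at least one of the named features."""
    return len(set().union(*(positives[name] for name in names)))


def best_combination(size):
    """Brute force: try every combination of the given size, keep the best reach."""
    return max(combinations(sorted(positives), size), key=reach)


# Ranking features one at a time versus searching pairs jointly.
top_two_singles = sorted(positives, key=lambda n: len(positives[n]), reverse=True)[:2]
best_pair = best_combination(2)

print(top_two_singles, reach(top_two_singles))
print(best_pair, reach(best_pair))
```

The two most popular features overlap almost completely and reach 6 respondents, while pairing the most popular feature with the least popular one reaches all 9.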
Where we have been today • Python and R integration • Unification of programs and scripts • Custom Dialog Builder • Extensions • SPSS Developer Central is your friend
Questions?
Programmability increases your power, flexibility, and productivity with IBM SPSS Statistics • Generalization and automation • applySyntaxToFiles • SPSSINC MODIFY TABLES • Extension • SPSSINC QUANTREG using R • SPSSINC TURF using Python • Many new extension commands available • Integration • applySyntaxToFiles as part of a process • And it's still more fun
Contact • Jon K Peck, Ph.D., Senior Software Engineer, IBM SPSS • peck@us.ibm.com • blog: insideout.spss.com