
Extending and Customizing IBM SPSS Statistics with R, Python, and .NET

Jon Peck, Senior Software Engineer, IBM, peck@us.ibm.com. November 2010. Extending and Customizing IBM SPSS Statistics with R, Python, and .NET. IBM SPSS Statistics.


Presentation Transcript


  1. Jon Peck Senior Software Engineer, IBM peck@us.ibm.com November, 2010 Extending and Customizing IBM SPSS Statistics with R, Python, and .NET

  2. IBM SPSS Statistics • IBM ® SPSS ® Statistics has an extensive command language (syntax) for data acquisition, manipulation, and statistical and graphical procedures • Programmability and scripting dramatically extend these built-in capabilities • Allow custom user interfaces and output to be produced • Converting large SAS applications is likely to require the use of programmability

  3. Agenda • Programmability introduction • Four examples • Automating repetitive work: applySyntaxToFiles • Integrating programs and scripting: SPSSINC MODIFY TABLES • Adding a procedure from R: SPSSINC QUANTILE REGRESSION • Adding a procedure in Python: SPSSINC TURF

  4. Programmability increases your power, flexibility, and productivity • Generalization • React flexibly to metadata, results, and the environment • Benefit: Write fewer similar jobs • Automation • Embed program logic in jobs • Benefit: Less manual work • Extension • Tap existing R or Python statistical modules • Add your own or extend standard procedures and transformations • Benefit: More capabilities • Integration • Connect IBM SPSS Statistics inputs and outputs to other agents • Benefit: Make IBM SPSS Statistics part of a larger production process • More productivity and more fun

  5. IBM SPSS Statistics embeds three programming languages • Plug-ins let you extend capabilities using • Python • R • .NET languages (Windows only) • Free plug-in downloads • SPSS Developer Central web site provides articles, SPSS-written modules, plug-ins and user contributions • New SPSS Community on IBM myDeveloperWorks

  6. My first Python program • Python or R program code goes in the normal Statistics syntax window
      GET FILE="c:/data/important.sav".
      BEGIN PROGRAM PYTHON.
      import spss
      print "Hello, IBM"
      END PROGRAM.
      DESCRIPTIVES ....

  7. Programmability combines SPSS Statistics with Python, R, or .NET • A program in the input stream can communicate with IBM SPSS Statistics and control it and use Python or R facilities and modules (internal mode) spss.Submit("GET FILE='c:/data/cars.sav'.") • A Python or .NET application can embed IBM SPSS Statistics inside itself (external mode) • User interface does not appear • There is a lower level C API available in an SDK

  8. Programmability functionality is fully integrated into IBM SPSS Statistics • Programs run in the regular syntax stream • Users can define IBM SPSS Statistics syntax for programs and scripts via the Extension mechanism • Users can create dialog boxes and menus using the Custom Dialog Builder • Not just for extensions or programs • Python and R output appears in the Viewer • plain text • pivot tables • charts

  9. Python and R Programmability API's cover these areas • State information of Statistics • Get/Set variable dictionary information • Get/Set data • Get Viewer output (via xmlworkspace) • Create tables/charts/text objects in Viewer • Run Statistics commands (Python only)
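As a small illustration of the "variable dictionary information" area listed on this slide, here is a minimal sketch in Python. `GetVariableCount` and `GetVariableName` are functions of the `spss` module supplied by the Python plug-in. The `FakeSpss` class is a hypothetical stand-in (not part of any IBM module) so that the sketch also runs outside Statistics; inside a BEGIN PROGRAM block you would pass the real `spss` module instead.

```python
def list_variable_names(spss_module):
    """Return the names of all variables in the active dataset.

    spss_module is expected to offer GetVariableCount() and
    GetVariableName(index), as the plug-in's spss module does.
    """
    count = spss_module.GetVariableCount()
    return [spss_module.GetVariableName(i) for i in range(count)]


class FakeSpss(object):
    """Hypothetical stand-in so the sketch runs without Statistics."""

    _names = ["salary", "educ", "jobcat"]

    def GetVariableCount(self):
        return len(self._names)

    def GetVariableName(self, index):
        return self._names[index]


names = list_variable_names(FakeSpss())
print(names)
```

Passing the module as an argument is only a convenience for trying the code without a Statistics installation; a real program would typically just `import spss` at the top.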

  10. Python and VB scripting API's cover user interface and output • Programmability is a backend (SPSS Processor) domain • Scripting is mainly a frontend (user interface, including output) domain • Managing output Viewer and objects • tables: formatting, pivoting, editing, … • objects: visibility, order, titles, outline text,… • General user interface control • Almost anything you can do via the user interface • Not available for R

  11. .NET plug-in embeds Statistics inside another program • Example: Statistical Explorer • Statistics, graphs, and data management via Statistics • Two pages of VB.NET code

  12. Python and R are open source software • Programmability plug-ins are an optional installation • They are free (but require a Statistics license) • They make it possible to tap the work of the Python and R communities • Python and R have license agreements • IBM Non-warranty license agreement • For R, GPL license

  13. Extension commands eliminate need for user to learn Python or R • Extension mechanism lets you define IBM SPSS Statistics-style syntax for programs • IBM SPSS Statistics takes care of validation and parsing • Passes user input to a program in an easy-to-digest form • Automatically loaded when IBM SPSS Statistics starts • Look to the user like built-in commands • Easy to distribute to others

  14. Some statistical extensions on Dev Central

  15. Some non-statistical extensions on Dev Central

  16. You can create and share your own additions to IBM SPSS Statistics • Write Python or R functions to implement the functionality or tap existing packages • Use input APIs to get data to Python or R • Use output APIs to create pivot tables • Each of these can be a single line of code • For extensions, • Define the syntax in an XML file • Use tools in extension.py (Python) or spsspkg (R) to receive parsed output and pass to implementing function • New in v18: R version of extension.py • Use the Custom Dialog Builder to create the interface • The CDB is not just for extensions • Test and document! • Package and distribute • Contributions to Developer Central are welcome • Documentation is at SPSS Developer Central

  17. Extension commands: validation and mapping from syntax to Python or R function parameters is handled for you • Example: SPSSINC BREUSCH PAGAN • implemented using an R package • SPSSINC_BREUSCH_PAGAN.xml specifies the syntax to the Statistics parser • The R mapping code in SPSSINC_BREUSCH_PAGAN.R respecifies the syntax and invokes the executing routine with parsed parameters • overlaps with XML syntax definition but provides additional features
      SPSSINC BREUSCH PAGAN DEPENDENT = salary ENTER = educ jobcat
        /OPTIONS MISSING=LISTWISE
        /SAVE RESIDUALSDATASET=resids COEFSDATASET=coefs.
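The idea of turning parsed syntax into function arguments can be sketched in plain Python. Everything here is an illustrative assumption: the nested dictionary is a hypothetical shape for parsed input (not the actual structure Statistics passes), `to_keyword_args` is a made-up helper rather than the extension.py API, and `breusch_pagan` only echoes its arguments instead of running a test. The ENTER list is assumed to be the two variables educ and jobcat.

```python
def breusch_pagan(dependent, enter, missing="LISTWISE",
                  residualsdataset=None, coefsdataset=None):
    """Hypothetical implementing function; it only echoes its arguments."""
    return {
        "dependent": dependent,
        "enter": list(enter),
        "missing": missing,
        "residualsdataset": residualsdataset,
        "coefsdataset": coefsdataset,
    }


# Hypothetical parsed form of the command: subcommand -> keyword -> value.
# The empty string stands for keywords given before any subcommand.
parsed = {
    "": {"DEPENDENT": "salary", "ENTER": ["educ", "jobcat"]},
    "OPTIONS": {"MISSING": "LISTWISE"},
    "SAVE": {"RESIDUALSDATASET": "resids", "COEFSDATASET": "coefs"},
}


def to_keyword_args(parsed_syntax):
    """Flatten subcommand keywords into lower-case Python keyword arguments."""
    kwargs = {}
    for keywords in parsed_syntax.values():
        for keyword, value in keywords.items():
            kwargs[keyword.lower()] = value
    return kwargs


result = breusch_pagan(**to_keyword_args(parsed))
print(result)
```

Because the parser has already validated the command, the implementing function can take ordinary, named parameters and never needs to look at raw syntax text.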

  18. An XML file defines the syntax to the SPSS Universal Parser

  19. Python or, in this case, R code gets the parsed syntax, which is turned into function arguments

  20. Expand the audience by creating IBM SPSS Statistics syntax and dialog boxes

  21. Example I: Generalize and automate work • Every day, you need to run existing syntax files against datasets that are not known in advance • applySyntaxToFiles function applies a syntax file to each file in input specification

  22. Use programmability to automate routine processes • Apply standard processing to an unknown set of files • Produce processed data and reports

  23. Use a program to drive processing
      begin program.
      import spss, spssaux3
      spssaux3.applySyntaxToFiles(inputspec="c:/temp/parts/*.sav",
          syntax="c:/myjobs/dailychecks.sps",
          outputdatadir="c:/temp/processed",
          outputfiledir="c:/temp/processed",
          logfile="c:/temp/processed/report.txt")
      end program.
      • dailychecks.sps could apply data cleaning rules, modify data, and create reports • Could be run daily through Production Mode or C&DS job scheduler or used interactively • Extended version available as SPSSINC PROCESS FILES
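The pattern behind this slide (expand a wildcard, process each matching file, write a log) can be sketched with the Python standard library alone. This is not the spssaux3 implementation: `apply_to_files` and its `process` callback are hypothetical names, and the demo creates empty placeholder files in a temporary directory rather than real .sav data.

```python
import glob
import os
import tempfile


def apply_to_files(inputspec, process, logfile):
    """Call process(path) for each file matching inputspec; log every outcome."""
    results = []
    with open(logfile, "w") as log:
        for path in sorted(glob.glob(inputspec)):
            try:
                process(path)
                status = "ok"
            except Exception as exc:  # keep going if one file fails
                status = "failed: %s" % exc
            log.write("%s\t%s\n" % (path, status))
            results.append((os.path.basename(path), status))
    return results


# Demo with empty placeholder files; only the *.sav names should match.
workdir = tempfile.mkdtemp()
for name in ("a.sav", "b.sav", "notes.txt"):
    open(os.path.join(workdir, name), "w").close()

processed = []  # stands in for "run the syntax file on this dataset"
results = apply_to_files(os.path.join(workdir, "*.sav"),
                         processed.append,
                         os.path.join(workdir, "report.txt"))
print(results)
```

Inside Statistics, the `process` callback might open the file and run the syntax via `spss.Submit`, for example by submitting a GET FILE command followed by an INSERT FILE command; that wiring is an assumption here, not a description of how applySyntaxToFiles works internally.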

  24. Example II Automate dynamic or static formatting of tables • Use integrated scripting for better table presentation

  25. SPSSINC MODIFY TABLES extension command manipulates table formatting and structure • TableLooks provide static formatting for entire areas of a table • data cells • row and column layers • You want tables with formatting beyond tableLooks • Many users copy tables to Excel and manually format them  • Basic and Python Scripting provide programmatic way to do formatting • SPSSINC MODIFY TABLES provides syntax for extensive formatting • Eliminates need to know scripting • Uses Extension mechanism for programs and Python scripting

  26. Use dynamic highlighting to make crosstab table easier to read SPSSINC MODIFY TABLES SUBTYPE='Crosstabulation' DIMENSION=ROWS SELECT='Std. Residual' /STYLES TEXTSTYLE=BOLD BACKGROUNDCOLOR=255 0 0 APPLYTO='abs(x) >2'.
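The condition `abs(x) > 2` in APPLYTO is evaluated per cell, with x standing for the cell value. A minimal sketch of that selection logic in plain Python follows; `cells_to_highlight` is a hypothetical helper and the residual values are made up for illustration, so this shows the idea only, not how SPSSINC MODIFY TABLES is implemented.

```python
def cells_to_highlight(values, condition=lambda x: abs(x) > 2):
    """Return the positions of the cells whose value satisfies the condition.

    Empty cells (None) are skipped rather than tested.
    """
    return [i for i, x in enumerate(values)
            if x is not None and condition(x)]


# Made-up standardized residuals from a crosstabulation row.
std_residuals = [0.4, -2.6, 1.9, 3.1, None, -0.2]
highlight = cells_to_highlight(std_residuals)
print(highlight)
```

Only the cells at positions 1 and 3 exceed 2 in absolute value, so only those would receive the bold red styling.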

  27. Custom dialog boxes are easy to create • Dialog created with Custom Dialog Builder • Generates extension command syntax • Easy to distribute

  28. Use static formatting to call out parts of a table SPSSINC MODIFY TABLES subtype='variables in the equation' SELECT="B" "Sig." /STYLES TEXTCOLOR = 0 0 255 BACKGROUNDCOLOR=0 255 0.

  29. Format CTABLES totals to call them out SPSSINC MODIFY TABLES SUBTYPE="Custom Table" SELECT = "Total" DIMENSION=ROWS /STYLES BACKGROUNDCOLOR=255 255 88 TEXTSTYLE = BOLD.

  30. Use custom functions for special effects
      SPSSINC MODIFY TABLES SUBTYPE='Report' SELECT="<<ALL>>"
        /STYLES APPLYTO=DATACELLS TEXTCOLOR=255 255 255 TEXTSTYLE=BOLD
        CUSTOMFUNCTION="customstylefunctions.washColumnsBlue".
      def washColumnsBlue(obj, i, j, numrows, numcols, section, more):
          mincolor=150.
          maxcolor=255.
          increment = (maxcolor - mincolor)/(numcols-1)
          colorvalue = round(mincolor + increment * j)
          obj.SetBackgroundColorAt(i,j, RGB((mincolor, mincolor, colorvalue)))
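The color arithmetic in washColumnsBlue can be tried outside Statistics. The sketch below reproduces only the gradient calculation; `column_blue` is a hypothetical helper, and the guard for a single-column table is an addition (the slide's formula divides by numcols - 1, which is zero in that case).

```python
def column_blue(j, numcols, mincolor=150.0, maxcolor=255.0):
    """Blue component for column j, rising linearly from mincolor to maxcolor."""
    if numcols < 2:
        # Added guard: with one column there is no gradient to compute.
        return int(mincolor)
    increment = (maxcolor - mincolor) / (numcols - 1)
    return int(round(mincolor + increment * j))


# Background colors (red, green, blue) for a four-column table.
colors = [(150, 150, column_blue(j, 4)) for j in range(4)]
print(colors)
```

For four columns the step is 35, so the blue component runs 150, 185, 220, 255 from the first column to the last.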

  31. It is possible to get carried away with this

  32. Example III Extend IBM SPSS Statistics by tapping the work of the R and Python communities • Add R procedures seamlessly to IBM SPSS Statistics

  33. R • R is a programming language for statistics • leading edge statistics • many contributed statistics and graphics packages • free • R is not so easy to learn • Documentation by experts for experts • Feels like a complex programming language – because it is • Syntax is a lot like C • Error in optim(rho, f, control = control, hessian = TRUE, method = “BFGS”) :initial value in ‘vmmin’ is not finite • Good for programmers(?); bad for users • R holds data in memory • R for SAS and SPSS Users, Bob Muenchen, Addison-Wesley, 2008

  34. R procedures can be accessed from IBM SPSS Statistics using the R plug-in • The R plug-in makes it easy to use R packages • IBM SPSS Statistics datasets and Viewer output can be processed by R using plug-in • Graphical, text, and table output appear in the Viewer • Pivot tables can be created with R code • New IBM SPSS Statistics datasets can be created from R • R communicates with IBM SPSS Statistics via API's in plug-in • Integration requires writing a little R wrapper code • IBM SPSS Statistics can provide • dialog box interface • IBM SPSS Statistics-style syntax • pivot table output • Plug-in is downloadable from Developer Central

  35. Quantile regression models conditional quantiles • Ordinary regression models conditional mean • Median regression is the 0.5 quantile (50th percentile) • Estimating quantiles is useful with varying spread, asymmetries, outliers • Areas of application include • empirical finance • value at risk • mutual fund investment styles • credit scoring • school quality • demand analysis • others
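To make the contrast between modelling the mean and modelling a quantile concrete, here is a minimal sketch of the standard "check" (pinball) loss that quantile regression minimizes, applied to the simplest case of a constant with no predictors. The function names and data are illustrative assumptions; this is not the quantreg algorithm, which fits coefficients by linear programming.

```python
def check_loss(residual, tau):
    """Check loss: weight tau on positive residuals, (1 - tau) on negative ones."""
    return residual * (tau - (1 if residual < 0 else 0))


def sample_quantile(values, tau):
    """Return the data value that minimizes the total check loss."""
    return min(sorted(values),
               key=lambda q: sum(check_loss(y - q, tau) for y in values))


data = [1, 2, 3, 4, 100]            # one large outlier
mean = sum(data) / float(len(data))
median = sample_quantile(data, 0.5)
upper = sample_quantile(data, 0.9)
print(mean, median, upper)
```

The outlier pulls the mean up to 22, while the 0.5 quantile stays at 3; asking for the 0.9 quantile instead picks out the upper tail. Quantile regression applies the same loss with a linear function of the predictors in place of the constant.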

  36. SPSSINC QUANTILE REGRESSION extension embeds R quantreg package

  37. Pivot tables and plots appear in the Viewer

  38. New datasets appear in Data Editor windows

  39. Example IV Extend IBM SPSS Statistics by adding procedures in Python • TURF analysis

  40. TURF Analysis is popular in market research • Total Unduplicated Reach and Frequency (TURF) • Find the highest coverage of positive responses for a small number of questions • Example: How do you reach the largest audience by advertising on a few kinds of sports? • football, cricket, basketball, cycling, ... • Example: What ice cream flavors should you offer in your shops that have three dispensing machines? • Example: What phone features should you promote? • multi-line, voicemail, paging, internet ... • Simple FREQUENCIES does not account for overlap

  41. TURF calculations are demanding • Must compute all possible set unions of positive responses (up to a maximum number of variables) • Each set is a list of case IDs with positive response on a question • This problem is computationally explosive • [Table: calculations for best 10 combinations of variables] • Is a scripting language like Python too slow?
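The set-union calculation described on this slide can be sketched as a brute-force search in plain Python. This is not the SPSSINC TURF code: the function name, the tiny made-up data, and the restriction to reach only (no frequency) are all assumptions made for illustration.

```python
from itertools import combinations


def turf(reach_sets, group_size, top=10):
    """Rank every combination of group_size variables by unduplicated reach.

    reach_sets maps a variable name to the set of case IDs that responded
    positively. Returns up to `top` (reach, combination) pairs, best first.
    """
    scored = []
    for combo in combinations(sorted(reach_sets), group_size):
        reached = set().union(*(reach_sets[name] for name in combo))
        scored.append((len(reached), combo))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top]


# Made-up responses: case IDs that want each phone feature.
responses = {
    "voicemail": {1, 2, 3, 4},
    "paging": {1, 2, 3},
    "internet": {5, 6},
}
best_singles = turf(responses, 1)
best_pairs = turf(responses, 2)
print(best_singles)
print(best_pairs)

# Number of groups of three among nine variables, counted by enumeration.
n_groups = sum(1 for _ in combinations(range(9), 3))
print(n_groups)
```

In this toy data the two most popular single features (voicemail and paging) overlap heavily and together reach only four cases, while voicemail plus internet reaches all six. The count of candidate groups grows combinatorially with the number of variables and the group size, which is why the slide calls the problem explosive.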

  42. Extension command SPSSINC TURF is implemented in Python • Provides • Dialog box interface • IBM SPSS Statistics style syntax • The computations • Pivot table output • Fewer than 300 lines of Python code • Plus dialog box definition • Plus extension command syntax definition • Executes requests involving a few million set comparisons in a few minutes • Initial version written in two days

  43. Analysis of phone data • Telcosurvey (9 variables, 1000 cases) • Dialog created with Custom Dialog Builder

  44. Results show the combination of features with the best reach • Pivot table created from Python code • Best singles are conference calling, call forwarding, and call waiting

  45. The best three are not the top three one at a time • Calculations completed in a few seconds

  46. Where we have been today • Python and R integration • Unification of programs and scripts • Custom Dialog Builder • Extensions • SPSS Developer Central is your friend

  47. Questions?

  48. Programmability increases your power, flexibility, and productivity with IBM SPSS Statistics • Generalization and automation • applySyntaxToFiles • SPSSINC MODIFY TABLES • Extension • SPSSINC QUANTREG using R • SPSSINC TURF using Python • Many new extension commands available • Integration • applySyntaxToFiles as part of a process • And it's still more fun

  49. Contact Jon K Peck, Ph.D., Senior Software Engineer, IBM SPSS • peck@us.ibm.com • blog: insideout.spss.com
