R: An Open Source Statistical Environment Valentin Todorov UNIDO v.todorov@unido.org MSIS 2008 (Luxembourg, 7-9 April 2008) MSIS 2008, Luxembourg: Valentin Todorov
Outline
• Introduction: the R Platform and Availability
• R Learning Curve (is R hard to learn?)
• R Extensibility (R Packages)
• R and the Others (Interfaces)
• R Graphics
• R for Time Series
• R for Survey Analysis
• R and the Outliers (Robust Statistics in R)
• More R features (Web, Missing data, OOP, GUI)
• Summary and Conclusions
What is R
• R is “a system for statistical computation and graphics. It provides, among other things, a programming language, high-level graphics, interfaces to other languages and debugging facilities”
• Developed after the S language and environment
  • S was developed at Bell Labs (John Chambers et al.)
  • S-Plus: a value-added implementation of the S language (Insightful Corporation)
  • Much code written for S runs unaltered under R
• Significantly influenced by Scheme, a Lisp dialect
What is R
• Created by Ihaka and Gentleman, University of Auckland (New Zealand)
  • 1993: a preliminary version of R
  • 1995: released under the GNU General Public License
• Now: the R core team consists of 17 members, including John Chambers
• R provides a wide variety of statistical techniques (linear and non-linear modelling, classical statistical tests, time-series analysis, classification, clustering, robust methods and many more) and graphical techniques
• R is available as Free Software under the terms of the GNU General Public License (GPL)
R Extensibility (R Packages)
• One of the most important features of R is its extensibility: users can create packages of functions and data.
• The R package system provides a framework for developing, documenting, and testing extension code.
• Packages can include R code, documentation, data and foreign code written in C or Fortran.
• Packages are distributed through the CRAN repository – http://cran.r-project.org – which currently holds more than 1300 packages covering a wide variety of statistical methods and algorithms.
• ‘base’ and ‘recommended’ packages are included in all binary distributions.
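A minimal sketch of how the package system looks from the R prompt, using only functions that ship with R. The package name in the commented-out install call is just an example taken from a later slide, and the lists printed depend on the local installation.

```r
# Packages that ship with R itself, grouped by priority.
base_pkgs <- rownames(installed.packages(priority = "base"))
rec_pkgs  <- rownames(installed.packages(priority = "recommended"))
print(base_pkgs)
print(rec_pkgs)

# Attach a package and look at a few of the objects it exports.
library(stats)
print(head(ls("package:stats")))

# Installing a contributed package from CRAN needs network access,
# so the call is shown here but not run:
# install.packages("survey")
```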
R and the Others (R Interfaces)
• Reading and writing data (text files, XML, spreadsheet-like data, e.g. Excel)
• Read and write data formats of SAS, S-Plus, SPSS, Stata, Systat, Octave – package foreign
• Emulation of Matlab – package matlab
• Communication with RDBMS – ROracle, RMySQL, RSQLite, RmSQL, RPgSQL, RODBC – for large data sets and concurrency
• Package filehash – a simple key-value style database; the data are stored on disk but are handled like data sets
• Can use compiled native code in C, C++, Fortran, Java
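The simplest of these interfaces, reading and writing text files, can be sketched with base R alone. The data frame below is made up for illustration, and the file goes to a temporary path.

```r
# A small, made-up data frame (hypothetical values for illustration).
d <- data.frame(id = 1:3,
                country = c("AT", "LU", "NZ"),
                value = c(1.5, 2.25, 3),
                stringsAsFactors = FALSE)

# Write it as CSV and read it back.
path <- tempfile(fileext = ".csv")
write.csv(d, path, row.names = FALSE)
d2 <- read.csv(path, stringsAsFactors = FALSE)
print(d2)

unlink(path)  # remove the temporary file
```

Files from other statistical systems follow the same pattern with readers from package foreign, for example `read.spss()`, `read.dta()` and `read.xport()`.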
R Graphics
• One of the most important strengths of R: simple exploratory graphics as well as well-designed, publication-quality plots.
• The graphics can include mathematical symbols and formulae where needed.
• Can produce graphics in many formats:
  • On screen
  • PS and PDF for inclusion in LaTeX and pdfLaTeX documents or for distribution
  • PNG or JPEG for the Web
  • On Windows, metafiles for Word, PowerPoint, etc.
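As a rough illustration of the points above, the sketch below draws one plot with mathematical annotation and sends it to two file formats. The helper name `draw` and the output paths are assumptions for this example. `png()` and `jpeg()` work the same way but depend on how R was built, so they are left out.

```r
out_pdf <- tempfile(fileext = ".pdf")
out_ps  <- tempfile(fileext = ".ps")

# Hypothetical helper: one plot with plotmath expressions in title and labels.
draw <- function() {
  x <- seq(0, 2 * pi, length.out = 100)
  plot(x, sin(x), type = "l",
       main = expression(y == sin(x)),
       xlab = expression(x), ylab = expression(sin(x)))
  text(pi, 0.5, expression(integral(sin(x) * dx, 0, pi) == 2))
}

# The same drawing code, two output devices.
pdf(out_pdf, width = 6, height = 4)
draw()
invisible(dev.off())

postscript(out_ps, width = 6, height = 4)
draw()
invisible(dev.off())
```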
R Graphics: basic and multipanel plots (trellis)
R Graphics: parallel plot and coplot
R for Time Series
• Package stats
  • classical time series modelling tools – arima() for Box-Jenkins type analysis
  • structural time series – StructTS()
  • filtering and decomposition – decompose() and HoltWinters()
• Package forecast – additional forecast methods and graphical tools
• Analyzing monthly or lower frequency time series:
  • TRAMO/SEATS
  • X-12-ARIMA
  • accessible through the Gretl library
• Task View Econometrics: http://cran.r-project.org/web/views/Econometrics.html
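The stats functions named above can be tried on a data set that ships with R. This is only a sketch: the choice of `AirPassengers`, the log transform and the model orders are assumptions made for the example, not recommendations from the talk.

```r
# Monthly airline passenger counts from package datasets, on the log scale.
ly <- log(AirPassengers)

# Box-Jenkins type seasonal ARIMA model.
fit_arima <- arima(ly, order = c(0, 1, 1),
                   seasonal = list(order = c(0, 1, 1), period = 12))

# Basic structural model (level, slope and seasonal components).
fit_sts <- StructTS(ly, type = "BSM")

# Classical decomposition and Holt-Winters filtering.
dec <- decompose(ly)
hw  <- HoltWinters(ly)

# Forecast one year ahead with the ARIMA model, back on the original scale.
pred <- predict(fit_arima, n.ahead = 12)
print(exp(pred$pred))
```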
R for Time Series: Example
• Fitting an ARIMA model to a univariate time series with arima() and using tsdiag() to plot time series diagnostics
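The slide's figure is not reproduced here. A comparable example can be sketched as follows, assuming the built-in `lh` series and an AR(1) model, neither of which is necessarily what the talk used.

```r
# Fit an AR(1) model to the built-in luteinizing hormone series.
fit <- arima(lh, order = c(1, 0, 0))
print(fit)

# tsdiag() draws standardized residuals, their ACF and Ljung-Box p-values.
diag_pdf <- tempfile(fileext = ".pdf")
pdf(diag_pdf)
tsdiag(fit)
invisible(dev.off())

# The same Ljung-Box check as a single test on the residuals.
bt <- Box.test(residuals(fit), lag = 10, type = "Ljung-Box", fitdf = 1)
print(bt)
```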
R for Survey Analysis
• Complex survey samples are usually analysed by specialized software packages: SUDAAN, Bascula 4 (Statistics Netherlands), etc.
• Stata provides much more comprehensive support for analysing survey data than SAS and SPSS and could successfully compete with the specialized packages
R for Survey Analysis
• R – package survey – http://faculty.washington.edu/tlumley/survey/
• Supports stratification, clustering, possibly multistage sampling, unequal sampling probabilities or weights; multistage stratified random sampling with or without replacement
• Summary statistics: means, totals, ratios, quantiles, contingency tables, regression models, for the whole sample and for domains
• Variances by Taylor linearization or by replicate weights (BRR, jackknife, bootstrap, or user-supplied)
• Graphics: histograms, hexbin scatterplots, smoothers
• Other packages: pps, sampling, sampfling
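To give a feel for replicate-based variance estimation, here is a base R sketch of the delete-one jackknife for a sample mean. It does not use the survey package and ignores weights, strata and clusters. The data values are made up. For an unweighted mean the jackknife variance coincides with the textbook formula s²/n, which the last lines check.

```r
# Hypothetical sample values, made up for illustration.
y <- c(12, 15, 9, 20, 14, 11, 18, 16)
n <- length(y)

# Point estimate and its delete-one replicates.
theta_hat <- mean(y)
theta_rep <- sapply(seq_len(n), function(i) mean(y[-i]))

# Jackknife variance of the mean.
var_jk <- (n - 1) / n * sum((theta_rep - mean(theta_rep))^2)

# Classical variance of the mean under simple random sampling
# with replacement.
var_classic <- var(y) / n

print(c(estimate = theta_hat, jackknife = var_jk, classical = var_classic))
```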
R and the Outliers (Robust Statistics in R)
• What are outliers?
  • atypical observations which are inconsistent with the rest of the data or deviate from the postulated model
  • may arise through contamination, errors in data gathering, or misspecification of the model
  • classical statistical methods are very sensitive to such data
• What are robust methods?
  • methods that produce reasonable results even when one or more outliers appear in the data
  • Robust regression – robustbase
  • Robust multivariate methods – rrcov, robustbase
  • Robust time series analysis – robust-ts
R and the Outliers: Example
• Example: Wages and Hours – http://lib.stat.cmu.edu/DASL/
• A national sample of 6000 households with a male head earning less than $15,000 annually in 1966; 9 independent variables; classified into 39 demographic groups
• Estimate y = the labour supply (average hours) from the available data (for the example we consider only one variable, x = average age of the respondents)
• We will fit an Ordinary Least Squares (OLS) model and a robust Least Trimmed Squares (LTS) model
R and the Outliers: Example OLS
R and the Outliers: Example LTS
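The OLS and LTS figures are not reproduced here, and the Wages and Hours data are not bundled with R. The contrast between the two fits can still be sketched on synthetic data. Everything below is an assumption made for illustration: the data are simulated with a known slope of 0.5 and five planted outliers, and `lqs()` from the recommended package MASS is swapped in for the robustbase functions the slides refer to.

```r
library(MASS)  # recommended package, provides lqs()

set.seed(1)
x <- 1:30
y <- 2 + 0.5 * x + rnorm(30, sd = 0.2)  # clean linear relationship
y[26:30] <- 0                           # five gross outliers (made up)

# Ordinary least squares: every point has full influence.
ols <- lm(y ~ x)

# Least trimmed squares: fits the best-fitting half or so of the data.
lts <- lqs(y ~ x, method = "lts")

print(rbind(OLS = coef(ols), LTS = coef(lts)))
```

The OLS slope is pulled towards the outliers, while the LTS slope stays close to the value used to generate the clean points.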
R and the Outliers: Example Covariance
• Maronna & Yohai (1998)
• rrcov: data set maryo
• A bivariate data set with:
  • sample correlation: 0.81
  • interchange the largest and smallest value in the first coordinate
  • the sample correlation becomes 0.05
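The same experiment can be sketched without the maryo data. The bivariate sample below is simulated, so the correlations will differ from the 0.81 and 0.05 quoted on the slide. `cov.rob()` from the recommended package MASS stands in for the rrcov estimators.

```r
library(MASS)  # recommended package, provides cov.rob()

# Simulated, positively correlated bivariate sample (hypothetical data).
set.seed(42)
x <- as.numeric(1:20)
y <- x + rnorm(20, sd = 3)
r_before <- cor(x, y)

# Interchange the largest and smallest value in the first coordinate.
i_max <- which.max(x)
i_min <- which.min(x)
x_swapped <- x
x_swapped[c(i_max, i_min)] <- x[c(i_min, i_max)]
r_after <- cor(x_swapped, y)

# Robust correlation (MCD) computed on the contaminated data.
rob <- cov.rob(cbind(x_swapped, y), method = "mcd", cor = TRUE)
r_rob <- rob$cor[1, 2]

print(c(before = r_before, after = r_after, robust = r_rob))
```

Only two of the twenty points change, yet the classical correlation drops sharply. The robust estimate downweights those two points.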
More R…
• R and the Web – several projects provide ways to use R over the Web
• R and the Missing – advanced missing value handling
  • mvnmle: ML estimation for multivariate data with missing values
  • mitools: tools for multiple imputation of missing data
  • mice: Multivariate Imputation by Chained Equations
  • EMV: Estimation of Missing Values for a Data Matrix
  • VIM: methods for the visualisation as well as imputation of missing data
• R Objects – R is an object-oriented language (however, in a quite different sense from C++, Java, C#)
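The "different sense" of object orientation can be sketched with R's S3 system, where methods belong to generic functions rather than to classes. The class name `survey_result` and its fields are hypothetical, invented for this example.

```r
# Constructor for a hypothetical S3 class "survey_result".
survey_result <- function(estimate, se) {
  structure(list(estimate = estimate, se = se), class = "survey_result")
}

# Methods are ordinary functions named generic.class.
print.survey_result <- function(x, ...) {
  cat("estimate:", x$estimate, " se:", x$se, "\n")
  invisible(x)
}

summary.survey_result <- function(object, ...) {
  # Approximate 95% confidence interval.
  c(lower = object$estimate - 1.96 * object$se,
    upper = object$estimate + 1.96 * object$se)
}

res <- survey_result(10, 0.5)
print(res)          # dispatches to print.survey_result
ci <- summary(res)  # dispatches to summary.survey_result
print(ci)
```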
More R…
• R GUI
  • R Commander: a basic statistics GUI, consisting of a window containing several menus, buttons, and information fields
  • SciViews: a suite of companion applications for Windows
• R and SDMX
• R Reports
  • package xtable: coerce data to LaTeX and HTML tables
  • Sweave: a framework for mixing text and R code for automatic report generation
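A minimal sketch of the Sweave workflow: a tiny source document with one inline expression and one code chunk is written to a temporary file and woven into LaTeX. The document content and file names are made up. Compiling the resulting .tex file to PDF needs a LaTeX installation and is not shown.

```r
rnw <- tempfile(fileext = ".Rnw")
tex <- tempfile(fileext = ".tex")

# A minimal Sweave source: LaTeX text, an inline \Sexpr and one code chunk.
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "The mean of 1 to 10 is \\Sexpr{mean(1:10)}.",
  "<<summary-chunk>>=",
  "summary(1:10)",
  "@",
  "\\end{document}"
), rnw)

# Run the R code and write the LaTeX result.
utils::Sweave(rnw, output = tex, quiet = TRUE)

out <- readLines(tex)
cat(out, sep = "\n")
```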
Summary
• Output Management System
  • SAS/SPSS: it is rarely used for routine work
  • R: output is easily passed from one function to another for further processing and to obtain more results
• Macro Language
  • SAS/SPSS: a special language with its own syntax; new functions are not run in the same way as the built-in procedures
  • R: R itself is a programming language
• Matrix Language
  • SAS/SPSS: a special language with its own syntax
  • R: a vector and matrix based language, complemented by additional packages: Matrix, SparseM
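The three points on the R side can be sketched together on the built-in `cars` data. The helper `add_ci` is a hypothetical user-defined function, written only to show that user code is called exactly like built-in code.

```r
# Output management: a fitted model is an object that other functions accept.
fit <- lm(dist ~ speed, data = cars)
b   <- coef(fit)
rss <- sum(residuals(fit)^2)

# "Macro language": a user-defined function is called like any built-in one.
add_ci <- function(model, level = 0.95) {
  cbind(estimate = coef(model), confint(model, level = level))
}
tab <- add_ci(fit)
print(tab)

# Matrix language: the same coefficients from the normal equations.
X <- cbind(1, cars$speed)
b_manual <- solve(t(X) %*% X, t(X) %*% cars$dist)
print(cbind(lm = b, manual = as.numeric(b_manual)))
```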
Summary (cont.)
• Publishing results
  • SAS/SPSS: cut and paste to a word processor or export to a file
  • R: produce LaTeX output (including graphics), for example with Sweave
• Data size
  • SAS/SPSS: limited by the size of the disk
  • R: limited by the size of the RAM; using databases for large data sets is possible but not trivial
• Data structure
  • SAS/SPSS: rectangular data set
  • R: rectangular data frame, vector, list
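A short sketch of the R data structures named above, with made-up values. The last line reports how much memory one object occupies, since R objects live in RAM.

```r
# Vector: elements of a single type, optionally named.
v <- c(a = 1.5, b = 2.5, c = 3.5)

# List: elements of arbitrary, mixed types.
l <- list(name = "example", values = v, flag = TRUE)

# Data frame: a rectangular data set whose columns may differ in type.
df <- data.frame(id = 1:3, score = c(10, 20, 30))

str(l)
str(df)
print(sapply(df, class))

# Objects are held in memory; object.size() reports the footprint.
print(object.size(df), units = "Kb")
```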
Summary (cont.)
• Interface to other programming languages
  • SAS/SPSS: not available
  • R: can be easily mixed with Fortran, C, C++ and Java
• Source code
  • SAS/SPSS: not available
  • R: the source code of R itself as well as of its packages is part of the distribution
References
• Hornik, K. and Leisch, F. (2005) R Version 2.1.0, Computational Statistics, 20(2), pp. 197-202
• Kabacoff, R. (2008) Quick-R for SAS and SPSS users, available from http://www.statmethods.net/index.html
• López-de-Lacalle, J. (2006) The R-computing language: Potential for Asian economists, Journal of Asian Economics, 17(6), pp. 1066-1081
• Muenchen, R. (2007) R for SAS and SPSS users, URL: http://oit.utk.edu/scc/RforSAS&SPSSusers.pdf
• Murrell, P. (2005) R Graphics, Chapman & Hall
• R Development Core Team (2007) R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0, URL: http://www.r-project.org/
• Templ, M. and Filzmoser, P. (2008) Visualisation of Missing Values and Robust Imputation in Environmental Surveys, submitted for publication
• Wheeler, D.A. (2007) Why Open Source Software / Free Software (OSS/FS, FLOSS, or FOSS)? Look at the Numbers!