R and Modern Statistical Computing • Robert Gentleman
Outline • Introduction • R past • R present • R future • Bioconductor
What is R? • R is an environment for data analysis and visualization • R is an open source implementation of the S language • S-Plus is a commercial implementation of the S language • The current version of R is 1.4.1 • www.r-project.org
R Core • Doug Bates, John Chambers, Peter Dalgaard, Robert Gentleman, Kurt Hornik, Stefano Iacus, Ross Ihaka, Friedrich Leisch, Thomas Lumley, Martin Maechler, Guido Masarotto, Paul Murrell, Brian Ripley, Duncan Temple Lang, and Luke Tierney • Duncan Murdoch, Martyn Plummer, Vincent Carey
Funding for R • to date R has had little funding (no formal funding) • our universities, particularly the University of Auckland, have provided support • the Dept of Biostatistics at Harvard has donated $5,000
R History • 1991: Ross Ihaka and Robert Gentleman begin work on a project that will ultimately become R • 1992: Design and implementation of pre-R. • 1993: The first announcement of R • 1995: R available by ftp under the GPL
R History • 1996: A mailing list is started and maintained by Martin Maechler at ETH • 1997: The R core group is formed • 1999: DSC meeting in Vienna, the first time many of R core meet • 2000: R 1.0.0 is released • 2002: R 1.4.1 is the current release
Open Source • R is both open source and open development • you can look at the source code and you can propose changes that we will generally adopt • R is not in the public domain • You are given a license to run our software • GPL (current) • LGPL (under consideration)
R and Omegahat • Omegahat: www.omegahat.org • Omegahat is another initiative that will allow us to explore alternative implementations and languages without disturbing the R user base too much • Current contents are largely the work of Duncan Temple Lang and John Chambers
R Design • Many of the features of S but with slightly different semantics and memory management. • We chose Scheme for our semantic model. • Much of the original code has since been replaced but the basic model remains intact.
R Internals • R is written mainly in C • Our original intention was for R to be as platform independent as practical. • We began with Macintosh as a primary delivery platform and Unix as our primary development platform.
What Platforms? • Unix of many flavours including Linux, Solaris, FreeBSD, AIX (compiles on 64 bit machines) • Windows - 95/98/NT and 2000 • both binaries and source available • R can be obtained from www.r-project.org
R Internals • One difference from S is scoping • R uses a different set of rules to bind variables to values • In S it is hard to treat programs as data • R should be source code compatible with S-Plus for most code that you will write
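As a concrete illustration of treating programs as data, a minimal sketch in plain base R (the function f is made up for the example):

```r
## Expressions can be captured, inspected, and evaluated as ordinary objects.
f <- function(x) x^2 + 1

body(f)              # the body of f as a language object
e <- quote(f(3) + 2) # an unevaluated expression
eval(e)              # evaluates the captured code: 12
```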
Environments • an environment is a mechanism for binding symbols to values (hence similar to a hash table) • each environment has a parent environment • a big difference between R and S is that R has lexical scope
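A small sketch of environments in base R, showing the hash-table-like binding of symbols to values (the names used here are made up):

```r
## An environment binds symbols to values and has a parent environment.
e <- new.env()
assign("x", 42, envir = e)   # bind the symbol x to 42 in e
get("x", envir = e)          # 42
exists("x", envir = e)       # TRUE
parent.env(e)                # the enclosing (parent) environment
```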
Environments • a function has an environment associated with it and that environment provides bindings for any free variables in the function • another way that this can be thought of is that in R functions have mutable state • Ihaka and Gentleman (JCGS, 2000) • environments are also associated with formulas in R
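A minimal sketch of lexical scope and mutable state: the inner function below keeps a reference to the environment in which it was created (the counter is just an illustration, not taken from the talk):

```r
## Lexical scope: the returned function retains the environment
## where count lives, so it carries mutable state between calls.
make.counter <- function() {
  count <- 0
  function() {
    count <<- count + 1   # modify the binding in the enclosing environment
    count
  }
}

counter <- make.counter()
counter()  # 1
counter()  # 2
```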
How did we do it? • we took advantage of certain technologies • CVS – for version control • a reasonably sophisticated checking system • every example in R is runnable and is run many times by all users • any changes made must pass the checking routine before they are committed
Testing • this very simple idea makes distributed development possible • I am responsible for writing examples for my code (and I should be because I know it) • others are responsible for making sure that they do not break my code (by running my examples)
R Package System • packages are self-contained units of code with documentation • there are automatic testing features built in • all functions must have examples and the examples must run • interesting commands: example, update.packages
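Both commands mentioned above can be tried directly; the help page used here (mean) is just an illustration:

```r
## Run the (always runnable) examples shipped with a help page.
example(mean)

## Check the repositories for newer versions of installed packages
## (by default it asks before replacing anything).
update.packages()
```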
Databases • R will talk to most databases • the ability to access large tables, execute SQL queries etc • RPgSQL has the notion of proxy objects • R symbols refer to tables in the database • these can look like data.frames in R
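The slide refers to RPgSQL's proxy-object mechanism; as a rough sketch of the general workflow of sending SQL from R, here is the DBI style of access (the SQLite driver, file name, and table are placeholders, not part of the talk):

```r
library(DBI)
library(RSQLite)   # any DBI-compliant driver would do

con <- dbConnect(SQLite(), dbname = "example.db")   # placeholder database
## Execute an SQL query and pull the result back as a data.frame
res <- dbGetQuery(con, "SELECT gene, value FROM expression WHERE value > 2")
head(res)
dbDisconnect(con)
```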
Object Oriented Programming • S3 class system is a good start but it has some major deficiencies • in Programming with Data, John Chambers introduced a new and potentially much better system • object oriented programming helps us build better programs and deal more naturally with complex data structures
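A minimal sketch of the formal class system introduced in Programming with Data, using the methods package; the class, slots, and generic below are invented for illustration:

```r
library(methods)

## A formal class with typed slots (class and slot names invented)
setClass("ArrayData",
         representation(exprs = "matrix", samples = "character"))

## A generic function and a method specialised to the new class
setGeneric("nSamples", function(object) standardGeneric("nSamples"))
setMethod("nSamples", "ArrayData",
          function(object) length(object@samples))

x <- new("ArrayData",
         exprs = matrix(rnorm(6), nrow = 2),
         samples = c("s1", "s2", "s3"))
nSamples(x)   # 3
```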
Object Oriented Programming • a formal mechanism for defining classes of objects • these provide us with an abstraction that lets us deal with complex data • generic functions and methods also reduce complexity (for the user) • plot is a generic, methods are defined to implement plot for different types of data
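Since plot is a generic, a short sketch of how S3 dispatch works for a made-up class (everything here is illustrative, not from the talk):

```r
## S3: a method is simply a function named generic.class
x <- list(time = 1:10, value = cumsum(rnorm(10)))
class(x) <- "growthCurve"   # invented class name

plot.growthCurve <- function(x, ...) {
  plot(x$time, x$value, type = "l", xlab = "time", ylab = "value", ...)
}

plot(x)   # dispatches to plot.growthCurve
```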
Object Oriented Programming • more important for developers than for users • it may not be worth defining classes and methods interactively • Vincent Carey has been working on better mechanisms for documenting the classes and methods
R as a broker • R can execute code in virtually any other language • R has connections, these can be used to access data via different protocols • R is embeddable in other languages • systems like Perl, Python, Postgres, Apache • allow the user to define and use procedural languages
R as a broker • this means that we can push the calculations to more natural places • computation can be done where the data are rather than by transporting data • this will greatly increase our ability to process large data sets
R: Future • where to next? • XML and markup languages • compilation • object oriented programming
XML • eXtensible Markup Language • has many friends, XSLT, XLINK, … • similar to HTML, but more flexible • <foo> hi there </foo> • I define my own tags, and provide information about their meaning
XML • it allows us to provide semantics/meaning to data • it separates content from presentation • content can be presented in many different ways (SAS – output) • we can use a single parser written by an expert
XML • data can be read and understood directly from the source • e.g. we want to search PubMed abstracts • these are contained in web pages at NCBI • using the XML package and htmlTreeParse this is a simple operation from within R
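A sketch of that kind of operation with the XML package; the URL and the XPath query are placeholders rather than the real PubMed interface:

```r
library(XML)

## Parse an HTML page into a tree and extract pieces of it
doc <- htmlTreeParse("http://www.ncbi.nlm.nih.gov/entrez/",  # placeholder URL
                     useInternalNodes = TRUE)
## Pull the text out of, say, every paragraph node
abstracts <- xpathSApply(doc, "//p", xmlValue)
head(abstracts)
```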
XML • will form the basis of a more flexible documentation format • documentation is really content; how you view the help page is rendering (as HTML, internal R help, etc.) • the ability to selectively run examples with lots of control
XML • live documents • reports etc can be made into live documents using XML (or similar strategies) • see Sweave (Leisch, 2002) in R 1.5.0 or from Fritz’s web site • documents can automatically update (daily/weekly etc)
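A minimal sketch of such a live document: an .Rnw file mixes LaTeX text with R code chunks, and Sweave re-runs the code whenever the report is rebuilt (the file name and chunk are illustrative):

```
% report.Rnw -- LaTeX text with embedded R chunks
\documentclass{article}
\begin{document}

Today's summary of the data:

<<summary, echo=TRUE>>=
x <- rnorm(100)        # in practice, read today's data here
summary(x)
@

\end{document}
```

Running Sweave("report.Rnw") from R replaces each chunk with its current output, producing a LaTeX file that can be regenerated on a daily or weekly schedule.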
Compilation • most users are interested in compilation because they believe it will increase speed • we are interested in it for a variety of reasons • understanding how to compile helps us understand how the language functions (where the warts are) • virtual machines: JVM, .Net
Training • we need to develop a new syllabus for statistical computing courses • tools that are needed include • computational inference • database interactions • software design and structure • markup languages (and relatives)
The Future • statistical computing can develop into a rich subject if it is encouraged • encouragement needs to take several different approaches • support: financial, career development • statistical computing is a laboratory science; it needs to be funded and run that way
Production of Code • we need to encourage (very strongly) writers of methodology to provide code that implements their methodology • the mathematical or theoretical description of a data analytic technique is really worth very little • if that technique is implemented then it is much more useful
Production of Code • the R package system is a reasonable delivery mechanism • some design principles will be needed
An Example • Bioconductor is a new software initiative • www.bioconductor.org • among the goals of this project is the deployment of high quality software for the analysis of genomic data • the challenges are varied and exciting
Genomic Data • the data are large; tens of thousands of genes across a few hundred samples • the biologists have developed high throughput methods for screening samples • we need to develop high throughput methods for analysis
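To make the scale concrete, a toy sketch of how such data are commonly held in R as a genes-by-samples matrix (the dimensions and names are invented):

```r
## Toy expression matrix: rows are genes, columns are samples
n.genes   <- 10000
n.samples <- 200
exprs <- matrix(rnorm(n.genes * n.samples),
                nrow = n.genes, ncol = n.samples,
                dimnames = list(paste("gene", 1:n.genes, sep = ""),
                                paste("sample", 1:n.samples, sep = "")))
dim(exprs)          # 10000 x 200
exprs[1:3, 1:3]     # a small corner of the matrix
```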
Genomic Data • other challenges: much of the data is non-numeric • the annotation of genes, their location on the chromosome, deletions, mutations • the role of the gene in a particular pathway
Genomics • what do we measure? • DNA (the raw thing) • mRNA (microarrays – transcribed DNA) • protein (proteomics – translated DNA) • these data gain value from annotation, from knowledge about adjacent genes or gene products • data sources are varied with different formats, error structures etc
TGF-b pathway • TGF-b (transforming growth factor beta) plays an essential role in the control of development and morphogenesis in multicellular organisms. • This is done through SMADs, a family of signal transducers and transcriptional activators.
Pathways • http://www.grt.kyushu-u.ac.jp/spad/ • There are many open questions regarding the relationship between expression level and pathways. • It is not clear whether expression level data will be informative.
Thanks • Ross Ihaka, without whom there would be no R • John Chambers, for S and gracious guidance • Luke Tierney, Vince Carey, Duncan Temple Lang • Dept of Stats, U of Auckland