R and Modern Statistical Computing


Presentation Transcript


  1. R and Modern Statistical Computing Robert Gentleman

  2. Outline • Introduction • R past • R present • R future • Bioconductor

  3. What is R? • R is an environment for data analysis and visualization • R is an open source implementation of the S language • S-Plus is a commercial implementation of the S language • The current version of R is 1.4.1 • www.r-project.org

  4. R Core • Doug Bates, John Chambers, Peter Dalgaard, Robert Gentleman, Kurt Hornik, Stefano Iacus, Ross Ihaka, Friedrich Leisch, Thomas Lumley, Martin Maechler, Guido Masarotto, Paul Murrell, Brian Ripley, Duncan Temple Lang, and Luke Tierney • Duncan Murdoch, Martyn Plummer, Vincent Carey

  5. Funding for R • to date R has had little funding (no formal funding) • our universities, particularly the University of Auckland have provided support • Dept of Biostatistics, Harvard has donated 5,000.00

  6. R History • 1991: Ross Ihaka and Robert Gentleman begin work on a project that will ultimately become R • 1992: Design and implementation of pre-R. • 1993: The first announcement of R • 1995: R available by ftp under the GPL

  7. R History • 1996: A mailing list is started and maintained by Martin Maechler at ETH • 1997: The R core group is formed • 1999: DSC meeting in Vienna, the first time many of R core meet • 2000: R 1.0.0 is released • 2002: R 1.4.1 is the current release

  8. Open Source • R is both open source and open development • you can look at the source code and you can propose changes that we will generally adopt • R is not in the public domain • You are given a license to run our software • GPL (current) • LGPL (under consideration)

  9. R and Omegahat • Omegahat: www.omegahat.org • Omegahat is another initiative that will allow us to explore alternative implementations and languages without disturbing the R user base too much • Current contents are largely the work of Duncan Temple Lang and John Chambers

  10. R Design • Many of the features of S but with slightly different semantics and memory management. • We chose Scheme for our semantic model. • Much of the original code has since been replaced but the basic model remains intact.

  11. R Internals • R is written mainly in C • Our original intention was for R to be as platform independent as practical. • We began with Macintosh as a primary delivery platform and Unix as our primary development platform.

  12. What Platforms? • Unix of many flavours including Linux, Solaris, FreeBSD, AIX (compiles on 64 bit machines) • Windows - 95/98/NT and 2000 • both binaries and source available • R can be obtained from www.r-project.org

  13. R Internals • One difference with S is scope • R uses a different set of rules to bind variables to values • In S it is hard to treat programs as data • R should be source code compatible with S-Plus for most code that you will write
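As a small illustration of treating programs as data in R, the sketch below captures an expression, rewrites it, and then evaluates it. The expression and the values bound to `a` and `b` are made up for the example.

```r
# Capture an expression without evaluating it.
expr <- quote(a + b * 2)

# The call is an ordinary R object that can be inspected and modified.
op <- as.character(expr[[1]])   # name of the function being called
expr[[1]] <- as.name("-")       # rewrite a + b * 2 into a - b * 2

# Evaluate the rewritten call with explicit bindings for a and b.
result <- eval(expr, list(a = 10, b = 3))
print(expr)
print(result)
```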

  14. Environments • an environment is a mechanism for binding symbols to values (hence similar to a hash table) • each environment has a parent environment • a big difference between R and S is that R has lexical scope

  15. Environments • a function has an environment associated with it and that environment provides bindings for any free variables in the function • another way that this can be thought of is that in R functions have mutable state • Ihaka and Gentleman (JCGS, 2000) • environments are also associated with formulas in R
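The points about lexical scope and mutable state can be sketched with a closure. The names `make_counter` and `count` are hypothetical and chosen only for this illustration.

```r
# Each call to make_counter() creates a new environment holding `count`.
make_counter <- function() {
  count <- 0
  function() {
    # `count` is a free variable here; lexical scope finds it in the
    # enclosing environment, and <<- updates that binding in place.
    count <<- count + 1
    count
  }
}

counter_a <- make_counter()
counter_b <- make_counter()

counter_a()
counter_a()
a_value <- counter_a()   # third call on counter_a
b_value <- counter_b()   # first call on counter_b, which has its own state

# The state lives in the environment attached to the function.
env_count <- get("count", envir = environment(counter_a))
print(c(a_value, b_value, env_count))
```

The two counters do not interfere because each function carries its own environment.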

  16. How did we do it? • we took advantage of certain technologies • CVS – for version control • a reasonably sophisticated checking system • every example in R is runnable and is run many times by all users • any changes made must pass the checking routine before they are committed

  17. Testing • this very simple idea makes distributed development possible • I am responsible for writing examples for my code (and I should be because I know it) • others are responsible for making sure that they do not break my code (by running my examples)

  18. R Package System • packages are self-contained units of code with documentation • there are automatic testing features built in • all functions must have examples and the examples must run • interesting commands: example, update.packages
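A minimal sketch of the `example` command follows. It assumes a standard R installation with the help pages for the `stats` package installed; the choice of `median` as the topic is arbitrary.

```r
# Fetch the example code attached to the help page for median()
# without running it (give.lines = TRUE returns the source lines).
code_lines <- example("median", package = "stats",
                      character.only = TRUE, give.lines = TRUE)
cat(code_lines, sep = "\n")

# Interactively, example("median") runs the same code, and
# update.packages() checks the configured repositories for newer
# versions of installed packages (it needs network access, so it is
# not called here).
```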

  19. Databases • R will talk to most databases • the ability to access large tables, execute SQL queries etc • RPgSQL has the notion of proxy objects • R symbols refer to tables in the database • these can look like data.frames in R

  20. Object Oriented Programming • S3 class system is a good start but it has some major deficiencies • in Programming with Data, John Chambers introduced a new and potentially much better system • object oriented programming helps us build better programs and deal more naturally with complex data structures

  21. Object Oriented Programming • a formal mechanism for defining classes of objects • these provide us with an abstraction that lets us deal with complex data • generic functions and methods also reduce complexity (for the user) • plot is a generic, methods are defined to implement plot for different types of data
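The formal class system can be sketched roughly as below, using the `methods` package that ships with R. The class `Expression`, the generic `describe`, and the data are hypothetical and exist only for this illustration.

```r
library(methods)

# A formal class definition: slot names and their types are declared.
setClass("Expression",
         representation(gene = "character", level = "numeric"))

# A generic function; the work is done by methods chosen by class.
setGeneric("describe", function(object, ...) standardGeneric("describe"))

setMethod("describe", "Expression", function(object, ...) {
  sprintf("%s: mean level %.2f over %d samples",
          object@gene, mean(object@level), length(object@level))
})

setMethod("describe", "numeric", function(object, ...) {
  sprintf("numeric vector of length %d", length(object))
})

x <- new("Expression", gene = "geneA", level = c(1.5, 2.5, 3.5))
desc_obj <- describe(x)
desc_num <- describe(c(1, 2))
print(desc_obj)
print(desc_num)
```

The user calls a single function, `describe`, and dispatch picks the method; `plot` works on the same principle.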

  22. Object Oriented Programming • more important for developers than for users • it may not be worth defining classes and methods interactively • Vincent Carey has been working on better mechanisms for documenting the classes and methods

  23. R as a broker • R can execute code in virtually any other language • R has connections, these can be used to access data via different protocols • R is embeddable in other languages • systems like Perl, Python, Postgres, Apache • allow the user to define and use procedural languages
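A short sketch of connections: the same reading code is pointed first at an in-memory text connection and then at a gzip-compressed file. The data and file name are invented for the example.

```r
csv_text <- c("gene,level", "g1,1.5", "g2,2.5")

# Source 1: an in-memory text connection.
con <- textConnection(csv_text)
dat <- read.csv(con)
close(con)

# Source 2: a gzip-compressed file, written and read through gzfile().
path <- tempfile(fileext = ".csv.gz")
gz <- gzfile(path, "w")
writeLines(csv_text, gz)
close(gz)
dat_gz <- read.csv(gzfile(path))

print(dat)
print(identical(dat, dat_gz))
```

Other connection types, such as `url()`, are used in the same way, so the reading code does not need to know where the data come from.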

  24. R as a broker • this means that we can push the calculations to more natural places • computation can be done where the data are rather than by transporting data • this will greatly increase our ability to process large data sets

  25. R: Future • where to next? • XML and markup languages • compilation • object oriented programming

  26. XML • eXtensible Markup Language • has many friends, XSLT, XLINK, … • similar to HTML, but more flexible • <foo> hi there </foo> • I define my own tags, and provide information about their meaning

  27. XML • it allows us to provide semantics/meaning to data • it separates content from presentation • content can be presented in many different ways (SAS – output) • we can use a single parser written by an expert

  28. XML • data can be read and understood directly from the source • eg: we want to search PubMed abstracts • these are contained in web pages at NCBI • using the XML package and htmlTreeParse this is a simple operation from within R

  29. XML • will form the basis of a more flexible documentation format • documentation is really content, how you view the help page is rendering (as HTML, internal R, etc.) • the ability to selectively run examples with lots of control

  30. XML • live documents • reports etc can be made into live documents using XML (or similar strategies) • see Sweave (Leisch, 2002) in R 1.5.0 or from Fritz’s web site • documents can automatically update (daily/weekly etc)
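A minimal sketch of a live document using `Sweave`, which is part of the `utils` package in current R. The file name `report.Rnw` and its contents are hypothetical; the code chunk and the `\Sexpr{}` expression are evaluated each time the document is woven, so the text follows the data.

```r
# A tiny Sweave source: LaTeX text with an embedded R code chunk.
rnw <- c(
  "\\documentclass{article}",
  "\\begin{document}",
  "<<echo=TRUE>>=",
  "x <- c(2, 4, 6)",
  "@",
  "The mean of x is \\Sexpr{mean(x)}.",
  "\\end{document}"
)

# Work in a temporary directory so no files are left behind.
old_wd <- setwd(tempdir())
writeLines(rnw, "report.Rnw")

# Weave the document: R code is run and results are inserted into the .tex file.
tex_file <- Sweave("report.Rnw", quiet = TRUE)
tex <- readLines(tex_file)
setwd(old_wd)

print(grep("The mean of x", tex, value = TRUE))
```

Rerunning `Sweave` after the data change regenerates the report, which is what allows scheduled daily or weekly updates.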

  31. Compilation • most users are interested in compilation because they believe it will increase speed • we are interested in it for a variety of reasons • understanding how to compile helps us understand how the language functions (where the warts are) • virtual machines: JVM, .Net

  32. Training • we need to develop a new syllabus for statistical computing courses • tools that are needed include • computational inference • database interactions • software design and structure • markup languages (and relatives)

  33. The Future • statistical computing can develop into a rich subject if it is encouraged • encouragement needs to take several different approaches • support: financial, career development, • statistical computing is a laboratory science, it needs to be funded and run that way

  34. Production of Code • we need to encourage (very strongly) writers of methodology to provide code that implements their methodology • the mathematical or theoretical description of a data analytic technique is really worth very little • if that technique is implemented then it is much more useful

  35. Production of Code • the R package system is a reasonable delivery mechanism • some design principles will be needed

  36. An Example • Bioconductor is a new software initiative • www.bioconductor.org • among the goals of this project is the deployment of high quality software for the analysis of genomic data • the challenges are varied and exciting

  37. Genomic Data • the data are large; tens of thousands of genes across a few hundred samples • the biologists have developed high throughput methods for screening samples • we need to develop high throughput methods for analysis

  38. Genomic Data • other challenges: much of the data is non-numeric • the annotation of genes, their location on the chromosome, deletions, mutations • the role of the gene in a particular pathway

  39. Genomics • what do we measure? • DNA (the raw thing) • mRNA (microarrays – transcribed DNA) • protein (proteomics – translated DNA) • these data gain value from annotation, from knowledge about adjacent genes or gene products • data sources are varied with different formats, error structures etc

  40. TGF-b pathway • TGF-b (transforming growth factor beta) plays an essential role in the control of development and morphogenesis in multicellular organisms. • This is done through SMADs, a family of signal transducers and transcriptional activators.

  41. Pathways • http://www.grt.kyushu-u.ac.jp/spad/ • There are many open questions regarding the relationship between expression level and pathways. • It is not clear whether expression level data will be informative.

  42. Thanks • Ross Ihaka, without whom there would be no R • John Chambers, for S and gracious guidance • Luke Tierney, Vince Carey, Duncan Temple Lang • Dept of Stats, U of Auckland
