R is a publicly available implementation of the high-level S language
for statistical computing.
(S-PLUS is a well-known commercial statistical environment also based
on the S language.)
The S language is widely used for statistical
calculations, particularly in biology and medicine.
The S language is not widely regarded as a platform for
developing scalable, high-performance codes.
In both the R and S-PLUS environments, S programs are interpreted.
Moreover,
execution of S programs typically
involves dynamic allocation of large data structures,
particularly arrays.
However, we believe that an execution environment based on an advanced
optimizing compiler will be able to execute S programs an order of magnitude
faster than a naive interpreter.
In collaboration with biostatistical researchers from the
M. D. Anderson Cancer Center, we have studied a number of
applications written in S.
Many of these applications call standard toolbox routines written by the
M. D. Anderson researchers. The S programs we examined also use
many standard programming idioms.
Our study suggests that optimized compilation of these programs will improve
their performance by a factor of between 10 and 100.
We are working to create an open-source, portable,
retargetable, high-quality R compiler suitable for use with production codes.
Compiler Architecture:
The compiler system we are building has three phases:
- Analysis of programs and libraries written in R. Currently,
we are investigating static analysis techniques for call graph construction
and dataflow analysis of R. R's combination of function variables,
lexical scoping, and assignment makes precise dataflow analysis hard.
- Translation of R programs into C. Initially, this translation
process was naive, and simply rewrote R programs to make calls to the
interpreter's
run-time support libraries. We have begun to exploit results from static
analysis of R to avoid dynamic lookup of function variable bindings and reduce
overheads associated with garbage-collected storage management. This
enables us to generate C programs that rely on the run-time
libraries less and perform operations more directly and efficiently.
- Analysis and source-to-source optimization of C programs.
This phase analyzes and optimizes R programs that have been translated into
C, together with the run-time support libraries. Goals of this effort include
replacing garbage-collected storage management with region-based storage
management, along with domain-specific optimization of library-based programs.
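
The benefit of avoiding dynamic lookup in the translation phase can be
illustrated with a small C sketch. Everything in it is hypothetical: the
Env and Binding types, env_lookup, f_naive, and f_direct are stand-ins
invented for illustration. They are not the R interpreter's run-time API
and not code emitted by RCC. The sketch assumes an R function
`f <- function(x) square(x) + 1` and assumes that static analysis has
shown that `square` is never reassigned.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for an interpreter run-time; these are NOT the
   real R internals and NOT RCC output. */
typedef double (*fn1)(double);

typedef struct Binding {
    const char *name;   /* variable name as written in the R source */
    fn1 value;          /* function currently bound to that name */
} Binding;

typedef struct Env {
    const Binding *bindings;
    size_t count;
    const struct Env *parent;   /* enclosing lexical scope, or NULL */
} Env;

/* Dynamic lookup: walk the chain of lexical environments by name.
   Returns NULL when no enclosing scope binds the name. */
fn1 env_lookup(const Env *env, const char *name) {
    for (; env != NULL; env = env->parent) {
        for (size_t i = 0; i < env->count; i++) {
            if (strcmp(env->bindings[i].name, name) == 0) {
                return env->bindings[i].value;
            }
        }
    }
    return NULL;
}

double square(double x) { return x * x; }

/* Naive translation: every call resolves "square" at run time, because
   in general an assignment elsewhere could have rebound the name. */
double f_naive(const Env *env, double x) {
    fn1 callee = env_lookup(env, "square");
    assert(callee != NULL);   /* an unbound name would be a run-time error */
    return callee(x) + 1.0;
}

/* Translation under the assumption that analysis proved "square" always
   refers to the function above: the lookup becomes a direct C call. */
double f_direct(double x) {
    return square(x) + 1.0;
}
```

Given an environment chain whose global scope binds "square", both
f_naive(&env, 3.0) and f_direct(3.0) compute the same value, but the
direct form does so without any string comparison or scope walk. The
direct form is valid only under the stated assumption. If `square` could
be rebound at run time, the lookup would have to stay.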
Current Status:
A version of the RCC compiler is in place; it uses static analysis to
improve performance for most R programs. RCC will be made publicly
available when it is ready for public use.
Downloads:
Check here soon for an updated version of RCC with bug fixes and improved optimization.
Publications:
Links:
Project Contacts:
External Collaborator:
Acknowledgements:
This work was supported in part by a Rice CITI Innovation Grant
and by the NPACI.