This talk outlines the motivation, system architecture, and benchmarks for integrating the R language runtime system with a data stream warehouse. It also discusses bidirectional calls between R and SQL, and the challenges and complexities involved.
Integrating the R Language Runtime System with a Data Stream Warehouse
Talk Outline • Motivation • Past AT&T stream systems • System architecture • Integrating the R runtime with the query processor • Bidirectional calls: R calls SQL, SQL calls R • Benchmark of data mapping & transfer
Network Data Streams • Feeds: devices, logs • Timestamps • Intermittent • Arrival out of order • Varying speed • Varying schema • Active processing, but not real-time: <5 mins • Sliding time window
Motivation: SQL • Expressive, standardized, well understood • Efficient, parallel, tunable • Extensible via UDFs
Motivation: Scaling R • Remove the RAM limitation • Go beyond single-threaded processing on one node • Parallel processing on multiple nodes • Both worlds • Manage big data in a DBMS • Exploit R math capabilities
Past AT&T systems • Gigascope: ultra-fast stream processing in the NIC (packet level), restricted form of the SQL language, no historic tables • DataDepot: stores summarized streams, band joins, POSIX file system, compiled SQL queries, integration with feed management system, UDFs • TidalRace: Big Data trend, scale out, “V”ariety
TidalRace • HDFS • Large number of nodes • Direct & fast copy of log files, no preprocessing • Multiple asynchronous stream loading • Eventual consistency: MVCC • Time-varying schema • Light DBMS for metadata • Integration with stream feed system • Compiled SQL queries
Tidalrace Architecture (diagram): data loading and update propagation; maintenance; queries; Tidalrace metadata (MySQL); Storage Manager (D3SM); data partitions and indices; file system (local, D3FS, HDFS)
Temporal Partitioning (diagram: index over time-partitioned data, with new data appended at the end of the timeline) • The primary partitioning field is the record timestamp • Stream data is mostly sorted • Most new data loads into a new partition • Avoid rebuilding indices • Simplified data expiration: roll off the oldest partitions
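A minimal R sketch of the partitioning idea, assuming fixed-width time buckets; the 5-minute width, window length, and roll-off policy below are illustrative, not Tidalrace internals:

```r
# Records are bucketed by timestamp; new data lands in the newest bucket,
# and the oldest buckets roll off as the time window slides.
partition_id <- function(ts, width_secs = 300) {
  as.integer(floor(as.numeric(ts) / width_secs))   # 5-minute partitions
}

now <- Sys.time()
ts  <- now - runif(1000, 0, 3600)                  # last hour of records
pid <- partition_id(ts)

window_parts <- 6                                  # keep ~30 minutes of data
keep <- pid > max(pid) - window_parts              # roll off the oldest partitions
table(pid[keep])                                   # records per surviving partition
```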
R runtime: challenges • Dynamic storage in RAM, variable generations • Type checking at runtime • RAM constraint to call functions • Data structures: data frames, matrices, vectors • Functional and OO language • Dynamic processing; garbage collector • Runtime based on S language, programmed in C • Block-based processing requires refactoring R libs
STAR: STream Analytics in R • Separate Unix process • 64-bit memory space • 32-bit int for arrays • Packed binary file • Pipes • Embedded R in C • Compiled query in exec()
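A minimal sketch of the header+payload exchange in base R, assuming a simple packed layout (two 4-byte ints for row and column counts, then 8-byte doubles); the actual STAR wire format over the pipe is not specified in the talk, and a temporary file stands in for the pipe so the example runs on its own:

```r
# Write and read one packed binary block: header (n, d) followed by the payload.
write_block <- function(m, con) {
  writeBin(c(nrow(m), ncol(m)), con, size = 4)     # header: row and column counts
  writeBin(as.numeric(m), con, size = 8)           # payload: doubles, column-major
}
read_block <- function(con) {
  hdr <- readBin(con, what = "integer", n = 2, size = 4)
  if (length(hdr) < 2) return(NULL)                # end of stream
  matrix(readBin(con, "double", n = hdr[1] * hdr[2], size = 8),
         nrow = hdr[1], ncol = hdr[2])
}

path <- tempfile()                                 # stand-in for the pipe
out <- file(path, open = "wb"); write_block(matrix(rnorm(12), 4, 3), out); close(out)
inp <- file(path, open = "rb"); blk <- read_block(inp); close(inp)
```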
Assumptions • Stream velocity handled at ingestion/loading • Acceptable 1-5 minute delay in stream load + analysis • Small size materialized views: time range & aggregation • Large RAM in analytic server • Unlimited size historical tables • Sliding time window: recent data • Separate Unix process: R runtime, compiled query
Mapping Data Types • Atomic types • time (POSIX) • int • float (real) • string • Data structures (a challenge in SQL: not relational!) • data frame • vector • matrix • list
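As a sketch of the atomic mapping, the helper below (hypothetical, with assumed SQL type names) returns the R storage mode for a column; tables whose columns mix these types map naturally to data frames, while all-float tables map to matrices:

```r
# Map common SQL column types to R storage modes (illustrative, not STAR's actual table).
sql_to_r_type <- function(sql_type) {
  switch(toupper(sql_type),
         "INTEGER"   = "integer",
         "BIGINT"    = "double",     # 64-bit ints do not fit in R's 32-bit integer
         "REAL"      = ,
         "FLOAT"     = ,
         "DOUBLE"    = "double",
         "VARCHAR"   = ,
         "TEXT"      = "character",
         "TIMESTAMP" = "POSIXct",    # POSIX time maps to R's POSIXct class
         stop("unsupported SQL type: ", sql_type))
}

sql_to_r_type("TIMESTAMP")
#> [1] "POSIXct"
```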
Data Transfer • Bidirectional pipe: to transform streams into tables or to transfer a data set • No parsing at all • Packed binary file (varying-length strings) • Block-based processing in R (requires reprogramming and wrapping R calls) • Programming: C vs R (speed vs abstraction)
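A minimal sketch of block-based processing over a packed binary file of doubles, keeping only running sums in RAM; the layout, column count, and block size are assumptions for illustration, not the STAR format:

```r
d <- 4                                            # number of columns
block_rows <- 50000                               # rows per block

# Write a small synthetic file (row-major doubles) so the example is self-contained.
path <- tempfile()
writeBin(as.numeric(t(matrix(rnorm(200000 * d), ncol = d))), path)

con <- file(path, open = "rb")
sums <- numeric(d); n <- 0
repeat {
  v <- readBin(con, what = "double", n = block_rows * d, size = 8)
  if (length(v) == 0) break                       # end of file
  blk <- matrix(v, ncol = d, byrow = TRUE)        # one block in RAM at a time
  sums <- sums + colSums(blk)
  n <- n + nrow(blk)
}
close(con)
col_means <- sums / n                             # streaming column means
```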
Complexity • Space • Data set: O(dn) • Model: O(d) or O(d²) in RAM • Time • O(dn) to transfer • O(d²n) for many models • Transfer time complexity is lower than query processing, which is lower than computing a model, and the same as transforming the data set
R calls SQL • Queries always have a time range: reduces result size • Block-based processing to fit in RAM • The packed binary file resembles a packet: header + payload • Log data sets always have timestamps • Every table in SQL can be processed in R, but not every R result can be sent back to the DBMS
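A minimal sketch of the R-calls-SQL path through DBI/RMySQL, with a time-ranged query and block-based fetching; the connection parameters, table, and column names are hypothetical:

```r
library(DBI)

con <- dbConnect(RMySQL::MySQL(), dbname = "warehouse", host = "localhost")

# Every query carries a time range to bound the result size.
res <- dbSendQuery(con,
  "SELECT ts, src_ip, bytes
     FROM flow_records
    WHERE ts >= NOW() - INTERVAL 5 MINUTE")

# Block-based fetch: process a bounded number of rows at a time,
# so the result never has to fit in RAM all at once.
total_bytes <- 0
while (!dbHasCompleted(res)) {
  blk <- dbFetch(res, n = 100000)                 # one block of rows
  total_bytes <- total_bytes + sum(blk$bytes)
}
dbClearResult(res)
dbDisconnect(con)
```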
SQL calls R • Via an aggregate UDF, which builds the data set • Assumption: most math models take matrices as input, so either data set layout is converted to matrix form (dense or sparse) • Conversion: table rows with floats are converted to vectors, most tables are converted to matrices, and in general tables with diverse data types are converted to data frames
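A minimal sketch (with illustrative names) of choosing the R-side layout for a data set built by the aggregate UDF: all-numeric tables become a dense matrix, since most math models take matrices as input, while tables with diverse column types stay data frames:

```r
to_model_input <- function(df) {
  if (all(vapply(df, is.numeric, logical(1)))) {
    as.matrix(df)                      # numeric-only rows -> dense matrix
  } else {
    df                                 # mixed types -> data frame
  }
}

rows <- data.frame(x1 = rnorm(5), x2 = rnorm(5))
X <- to_model_input(rows)              # becomes a 5 x 2 numeric matrix
str(X)
```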
Examples: use cases • R calls SQL: a statistical analyst needs some specific data from the DBMS, extracted with a complex query, then computes descriptive statistics and math models • SQL calls R: a BI person needs to call some mathematical function in R on a data frame (e.g. smooth a time series) or a matrix (e.g. get a correlation matrix)
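A sketch of the R-side math in both use cases, in base R on synthetic data standing in for rows pulled from (or pushed by) the DBMS:

```r
set.seed(1)

# R calls SQL use case: descriptive statistics on an extracted data set.
flows <- data.frame(bytes = rexp(1000, rate = 1e-4), pkts = rpois(1000, 20))
summary(flows)                                 # descriptive statistics

# SQL calls R use case 1: smooth a time series held in a data frame.
ts_df    <- data.frame(t = 1:200, y = sin((1:200) / 10) + rnorm(200, sd = 0.3))
smoothed <- lowess(ts_df$t, ts_df$y)           # smoothed series

# SQL calls R use case 2: correlation matrix of a numeric matrix.
X <- matrix(rnorm(200 * 4), ncol = 4)
R <- cor(X)                                    # 4 x 4 correlation matrix
```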
Benchmark: low-end equipment • Hardware: 4 cores @ 2 GHz, 4 GB RAM, 1 TB disk (real server much bigger) • Software: Linux, R, HDFS, MySQL, GNU C++ • Compare read/transfer speed in C and R • Compare text (CSV) vs binary files • Measure throughput (10X faster than query processing)
Discussion on performance • Binary files are required for high performance: 100X faster than CSV files • C is 1000X faster than R, but difficult to debug • Disk I/O does not matter for large files because access is sequential • Data transfer is not a bottleneck (an SQL query or R call takes >5 seconds on a large data set)
Conclusions • Combine SQL queries and R functions seamlessly • Data transfer at maximum speed: reaches streaming speed when coming from a row DBMS • R can process streams coming from the DBMS, and the DBMS can call R in a streaming fashion • Function calls can be fully bidirectional • Any table can be transferred in blocks to R, but only data frames can be transferred from R to the DBMS (asymmetric)
Future work • Portable interfacing program with other DBMSs; challenge: source code • Consider alternative storage in DBMS: column, array => data type mapping plus storage conversion • Parallel processing with R on multiple nodes • Evolving models on a stream (time window, visualize) • Debugging dynamic R code in a compiled SQL query