process.uber.data.frame <- function(uber.data.frame) {
# process the uber.data.frame here
}
In terms of technology, on the Java side we had rJava and REngine. On the R side, in addition to R ver. 3.4.x, we had dplyr.
R has extensive capabilities for converting CSV files into data-frames, and the approach we took leverages those:
0. While there are more chunks available:
1. Java-side: write chunk of data into a csv file
2. Java-side: notify R of the csv file
3. R-side: read the csv file and append it to the uber data-frame
4. Java-side: repeat from 0
Once Java-side has consumed all the data:
5. Java-side: invoke uber data-frame processing on R-side
To support this logic, the Java pseudo code looks as follows:
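while (moreChunksAvailable) {
    // write2csv returns the path of the csv file it wrote
    path2csv = write2csv( getNextChunk() );
    // the path is quoted so that R parses it as a string literal
    rEngine.parseAndEval( String.format("process.chunk( '%s' )", path2csv) );
}
rEngine.parseAndEval( "process.done.all.chunks()" );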
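The pseudo code leaves write2csv and the construction of the R command open. The following is a minimal sketch of both, not the code we ran. The class name ChunkCsvWriter, the method toRCall and the representation of a chunk as lists of strings are assumptions made for illustration. It uses ';' as the separator because that is the default for read.csv2, and it does not quote or escape field values.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; names and chunk representation are assumptions.
class ChunkCsvWriter {

    // Writes one chunk (header plus rows) to a temporary csv file and returns its path.
    // Fields are joined with ';' to match the default separator of read.csv2.
    // Naive on purpose: values containing ';' or line breaks are not quoted.
    static Path write2csv(List<String> header, List<List<String>> rows) throws IOException {
        List<String> lines = new ArrayList<>();
        lines.add(String.join(";", header));
        for (List<String> row : rows) {
            lines.add(String.join(";", row));
        }
        Path file = Files.createTempFile("chunk-", ".csv");
        return Files.write(file, lines, StandardCharsets.UTF_8);
    }

    // Builds the R command for one chunk. The path is wrapped in single quotes,
    // backslashes become forward slashes (R accepts those on Windows too),
    // and any single quote in the path is escaped.
    static String toRCall(Path csv) {
        String rPath = csv.toAbsolutePath().toString()
                .replace('\\', '/')
                .replace("'", "\\'");
        return String.format("process.chunk('%s')", rPath);
    }
}
```

The string returned by toRCall is what would be passed to rEngine.parseAndEval in the loop above. Note that read.csv2 also expects a decimal comma by default, so numeric columns written with a decimal point would need dec = "." on the R side.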
On the R-side we have:
library(dplyr)

process.chunk <- function( path2csv ) {
  chunk.df <- read.csv2(path2csv, ....)
  # chunk specific logic here
  if (!exists("uber.data.frame", envir = .GlobalEnv)) {
    # first chunk: it becomes the uber data-frame
    assign("uber.data.frame", chunk.df, envir = .GlobalEnv)
  } else {
    # later chunks: append their rows to the uber data-frame
    assign("uber.data.frame", bind_rows(uber.data.frame, chunk.df), envir = .GlobalEnv)
  }
  # partial uber data-frame logic here
}

process.done.all.chunks <- function() {
  process.uber.data.frame( uber.data.frame )
}

process.uber.data.frame <- function(uber.data.frame) {
  # process the uber.data.frame
}
A note on dplyr and bind_rows. R has many ways of adding rows to an existing data-frame, with rbind probably being the simplest. In our case, dplyr::bind_rows turned out to be much more memory efficient than rbind.

