    process.uber.data.frame <- function(uber.data.frame) {
      # process the uber.data.frame here
    }

In terms of technology, on the Java side we had rJava and REngine. On the R side, in addition to R ver. 3.4.x, we had dplyr.
R has extensive capabilities for converting CSV files into data-frames, and the approach we took leverages those:
0. While there are more chunks available:
1. Java-side: write chunk of data into a csv file
2. Java-side: notify R of the csv file
3. R-side: read the csv file and append it to the uber data-frame
4. Java-side: repeat from 0

Once Java-side has consumed all the data:

5. Java-side: invoke uber data-frame processing on R-side

To support this logic the Java pseudo code looks as follows:
    while (moreChunksAvailable) {
        path2csv = write2csv( getNextChunk() );
        // the path is quoted so that R receives it as a string literal
        rEngine.parseAndEval( String.format("process.chunk( '%s' )", path2csv) );
    }
    rEngine.parseAndEval( "process.done.all.chunks()" );

On the R-side we have:
    library(dplyr)

    process.chunk <- function( path2csv ) {
      chunk.df <- read.csv2(path2csv, ....)
      # chunk specific logic here
      if (!exists("uber.data.frame")) {
        # first chunk: it becomes the uber data-frame
        assign("uber.data.frame", chunk.df, envir = .GlobalEnv)
      } else {
        # later chunks: append to the existing uber data-frame
        assign("uber.data.frame", bind_rows(uber.data.frame, chunk.df), envir = .GlobalEnv)
      }
      # partial uber data-frame logic here
    }

    process.done.all.chunks <- function() {
      process.uber.data.frame( uber.data.frame )
    }

    process.uber.data.frame <- function(uber.data.frame) {
      # process the uber.data.frame
    }

A note on dplyr and bind_rows. R has many ways of adding rows to an existing data-frame, with rbind probably being the simplest. We found dplyr::bind_rows to be much more memory efficient than rbind.
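Going back to the Java side, steps 1 and 2 of the loop could be filled in roughly as follows. This is a minimal sketch under assumptions, not the code from the project: the class name `ChunkCsv` and the helpers `toCsv` and `processChunkCall` are hypothetical, a chunk is assumed to be a list of string rows, and `write2csv` is given an assumed signature. The values are joined with `;` because `read.csv2` defaults to `sep = ";"` (and `dec = ","`, so decimal values would need a matching format or an explicit `dec` argument on the R side).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper class; the names here are illustrative only.
final class ChunkCsv {

    // Renders a header and rows as ';'-separated text, one record per line.
    // Assumes no value contains ';', a quote character, or a line break.
    static String toCsv(List<String> header, List<List<String>> rows) {
        StringBuilder sb = new StringBuilder(String.join(";", header)).append('\n');
        for (List<String> row : rows) {
            sb.append(String.join(";", row)).append('\n');
        }
        return sb.toString();
    }

    // Step 1: write one chunk into a fresh temporary csv file and return its path.
    static Path write2csv(List<String> header, List<List<String>> rows) {
        try {
            Path file = Files.createTempFile("chunk-", ".csv");
            Files.write(file, toCsv(header, rows).getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Step 2: build the R expression that notifies R of the csv file.
    // Backslashes become forward slashes (R accepts them on Windows too)
    // and single quotes are escaped so the path stays a valid R string literal.
    static String processChunkCall(Path csv) {
        String path = csv.toAbsolutePath().toString()
                .replace('\\', '/')
                .replace("'", "\\'");
        return "process.chunk('" + path + "')";
    }
}
```

The string returned by `processChunkCall` is what would be handed to `rEngine.parseAndEval(...)` in the loop from the pseudo code. Deleting each temporary file once R has read it is left out of the sketch.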