Saturday, 28 October 2017

Passing data-frames from Java to R

Recently we needed to implement a scenario in which Java code incrementally hands data over to R. The data is accumulated R-side into one "uber" data-frame, which is then processed, e.g.:

    # process.uber.df and uber.df are illustrative names
    process.uber.df <- function(uber.df) {
        # process the uber data-frame here
    }

In terms of technology, on the Java side we had rJava and REngine. On the R side, in addition to R ver. 3.4.x, we had dplyr.
R has extensive capabilities for converting CSV files into data-frames, and the approach we took leverages those:
0. While there are more chunks available:
    1. Java-side: write chunk of data into a csv file
    2. Java-side: notify R of the csv file
    3. R-side: read the csv file and append it to the uber data-frame
    4. Java-side: repeat from 0
Once Java-side has consumed all the data:
5. Java-side: invoke uber data-frame processing on R-side
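
Step 1 above can be sketched roughly as follows. This is only a minimal illustration, not the code we actually ran: the class `ChunkCsvWriter` and its methods are hypothetical names, and a chunk is assumed to be a list of `Object[]` rows. The one thing worth noting is the format: since the R side reads the file with read.csv2, fields are separated by `;` and decimals use `,`, which are that function's defaults.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of rJava or REngine.
final class ChunkCsvWriter {

    // Formats one row with ';' as field separator and ',' as decimal mark,
    // matching the defaults of R's read.csv2.
    static String toCsvLine(Object[] row) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < row.length; i++) {
            if (i > 0) {
                sb.append(';');
            }
            Object v = row[i];
            if (v == null) {
                sb.append("NA"); // R's default missing-value marker
            } else if (v instanceof Double || v instanceof Float) {
                // toString always uses '.', so swap it for ','
                sb.append(v.toString().replace('.', ','));
            } else if (v instanceof Number) {
                sb.append(v);
            } else {
                // Quote text fields and double any embedded quotes
                sb.append('"').append(v.toString().replace("\"", "\"\"")).append('"');
            }
        }
        return sb.toString();
    }

    // Writes a header line plus all rows of one chunk to a temporary CSV file.
    static Path writeChunk(List<String> header, List<Object[]> rows) throws IOException {
        Path csv = Files.createTempFile("chunk-", ".csv");
        try (BufferedWriter w = Files.newBufferedWriter(csv, StandardCharsets.UTF_8)) {
            w.write(String.join(";", header));
            w.newLine();
            for (Object[] row : rows) {
                w.write(toCsvLine(row));
                w.newLine();
            }
        }
        return csv;
    }
}
```

The returned path is what gets handed to R in step 2. Temporary files are used here only for convenience; any location readable by the R process would do.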
To support this logic, the Java pseudo code looks as follows:
    while (moreChunksAvailable) {
        path2csv = write2csv( getNextChunk() )
        // the path must reach R as a quoted string literal
        rEngine.parseAndEval( String.format("process.chunk('%s')", path2csv) );
    }

    rEngine.parseAndEval( "process.done.all.chunks()" );
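
One detail that is easy to get wrong when notifying R is the path itself: it ends up inside R source text, so it has to be a valid R string literal. Backslashes (as in Windows paths) and quotes need escaping. A small sketch of how that could be handled is below; `RCalls` and its methods are hypothetical names, not part of REngine.

```java
// Hypothetical helper, not part of rJava or REngine.
final class RCalls {

    // Renders a Java string as a double-quoted R string literal,
    // escaping backslashes first and then double quotes.
    static String rStringLiteral(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Builds the R expression that hands one CSV file to process.chunk.
    static String processChunkCall(String path2csv) {
        return "process.chunk(" + rStringLiteral(path2csv) + ")";
    }
}
```

The resulting string would then be passed to `rEngine.parseAndEval(...)` in place of the `String.format` call.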

On the R-side we have:


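    library(dplyr)

    # "uber.df" and process.uber.df are illustrative names
    process.chunk <- function(path2csv) {
        chunk.df <- read.csv2(path2csv, ....)

        # chunk specific logic here

        if (!exists("uber.df", envir = .GlobalEnv)) {
            # first chunk: it becomes the uber data-frame
            assign("uber.df", chunk.df, envir = .GlobalEnv)
        } else {
            # later chunks: append to what has been accumulated so far
            assign("uber.df", bind_rows(get("uber.df", envir = .GlobalEnv), chunk.df), envir = .GlobalEnv)
        }

        # partial uber data-frame logic here
    }

    process.done.all.chunks <- function() {
        process.uber.df(get("uber.df", envir = .GlobalEnv))
    }

    process.uber.df <- function(uber.df) {
        # process the uber data-frame here
    }
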
A note on dplyr and bind_rows: R has many ways of adding rows to an existing data-frame, with rbind probably being the simplest. In our case, dplyr::bind_rows turned out to be much more memory efficient than rbind.
