AWS EMR with medium input and large output data in AWS S3

Dawid Laszuk

The last post about AWS EMR and S3 resulted in a few people messaging me directly. To save others the trouble, let me add a bit about how I approach one specific problem.

As mentioned previously, when dealing with large amounts of data some trade-offs need to be made. There isn't a single solution that fits all computation problems (obviously), but that doesn't mean there aren't better starting points.

In cases where the input data is relatively small, say less than a terabyte, and the processing is highly parallelizable and produces a larger output, it helps to do everything locally on the cluster. If the input data is in S3, or we want to store the output in S3, we can copy the data with s3-dist-cp. It's an extended version of DistCp that understands AWS S3, so it's rather safe. All EMR clusters have it installed by default, which makes it easy to either execute it from a shell after SSH-ing onto the master node or, preferably, run it as an EMR step.
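
For reference, here is a minimal sketch of submitting s3-dist-cp as an EMR step with boto3. The cluster ID and region are placeholders, and the bucket paths simply reuse the examples from this post.

import boto3

# Placeholders: the cluster ID and region are examples only.
emr = boto3.client("emr", region_name="us-east-1")

step = {
    "Name": "Copy input from S3 to HDFS",
    "ActionOnFailure": "CANCEL_AND_WAIT",
    "HadoopJarStep": {
        # command-runner.jar lets an EMR step run an arbitrary command on the master node.
        "Jar": "command-runner.jar",
        "Args": [
            "s3-dist-cp",
            "--src=s3://bucket/input/data",
            "--dest=/hadoop/input/data",
        ],
    },
}

response = emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
print(response["StepIds"])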

It's reliable enough that, for a given set of problems, it was better to write a quick wrapper which converted a single step

spark-submit s3://bucket/path/to/script.py --src=s3://bucket/input/data --dest=s3://bucket/output/data

into three steps, download-process-upload, i.e.

s3-dist-cp --src=s3://bucket/input/data --dest=/hadoop/input/data
spark-submit s3://bucket/path/to/script.py --src=/hadoop/input/data --dest=/hadoop/output/data
s3-dist-cp --src=/hadoop/output/data --dest=s3://bucket/output/data
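
A minimal sketch of how such a wrapper could look, assuming boto3 and command-runner.jar steps; the function names and the cluster ID are made up for illustration.

import boto3

def command_step(name, args, action_on_failure="CANCEL_AND_WAIT"):
    # Wrap a shell command as an EMR step using command-runner.jar.
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }

def download_process_upload(script, s3_src, s3_dest,
                            hdfs_src="/hadoop/input/data",
                            hdfs_dest="/hadoop/output/data"):
    # Expand one logical S3-to-S3 Spark step into download, process and upload steps.
    return [
        command_step("Download input",
                     ["s3-dist-cp", f"--src={s3_src}", f"--dest={hdfs_src}"]),
        command_step("Process",
                     ["spark-submit", script, f"--src={hdfs_src}", f"--dest={hdfs_dest}"]),
        command_step("Upload output",
                     ["s3-dist-cp", f"--src={hdfs_dest}", f"--dest={s3_dest}"]),
    ]

emr = boto3.client("emr", region_name="us-east-1")
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
    Steps=download_process_upload(
        "s3://bucket/path/to/script.py",
        "s3://bucket/input/data",
        "s3://bucket/output/data",
    ),
)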

This works great when we have a large number of executors, definitely more than 200. But even then, experiment. Sometimes it's better to reduce the number of executors and increase the load on each of them.
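
As an illustration of the kind of knob to turn, here are two hypothetical argument lists for the processing step, one with many small executors and one with fewer, larger ones. The numbers are examples only and need to be tuned per workload and instance type; either list could be passed as the Args of the "Process" step in the wrapper sketch above.

# Many small executors; numbers are illustrative only.
many_small_executors = [
    "spark-submit",
    "--num-executors", "400",
    "--executor-cores", "2",
    "--executor-memory", "4g",
    "s3://bucket/path/to/script.py",
    "--src=/hadoop/input/data",
    "--dest=/hadoop/output/data",
]

# Fewer, larger executors; numbers are illustrative only.
fewer_larger_executors = [
    "spark-submit",
    "--num-executors", "100",
    "--executor-cores", "4",
    "--executor-memory", "16g",
    "s3://bucket/path/to/script.py",
    "--src=/hadoop/input/data",
    "--dest=/hadoop/output/data",
]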