The last post about AWS EMR and S3 prompted a few people to message me directly. To help others, let me add something about how I approach a specific problem.
As mentioned previously, when dealing with a large amount of data some trade-offs need to be made. There isn’t a solution that fits all computation problems (obviously), but that doesn’t mean there aren’t better starting points.
When the input data is relatively small, say less than a terabyte, and the processing is highly parallelizable and produces a larger output, it helps to do everything locally. If the input data is in S3, or we want to store the output in S3, one can copy the data with s3-dist-cp. It’s an extended version of DistCp that understands AWS S3, so it’s rather safe. All EMR instances have it installed by default, which makes it easy to run either from a shell after ssh-ing onto the master or, preferably, as an EMR job step.
It’s reliable enough that for a given set of problems it was better to write a quick wrapper which converted a single step
spark-submit s3://bucket/path/to/script.py --src=s3://bucket/input/data --dest=s3://bucket/output/data
into three steps, download-process-upload, i.e.
s3-dist-cp --src=s3://bucket/input/data --dest=/hadoop/input/data
spark-submit s3://bucket/path/to/script.py --src=/hadoop/input/data --dest=/hadoop/output/data
s3-dist-cp --src=/hadoop/output/data --dest=s3://bucket/output/data
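For illustration, here is a minimal sketch of what such a wrapper could look like. It is not the original wrapper, just one way to do it: the function names (download_process_upload, _step), the step names, and the choice of CANCEL_AND_WAIT on failure are my assumptions. It only builds the step definitions, in the dictionary shape that EMR accepts for steps run through command-runner.jar, and does not talk to AWS itself.

```python
from typing import Dict, List

# Cluster-side paths taken from the example above; treat them as placeholders.
LOCAL_INPUT = "/hadoop/input/data"
LOCAL_OUTPUT = "/hadoop/output/data"


def _step(name: str, args: List[str]) -> Dict:
    """Build one EMR step that runs a command through command-runner.jar."""
    return {
        "Name": name,
        # Assumption: if one step fails, do not run the ones after it.
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }


def download_process_upload(
    script: str,
    src: str,
    dest: str,
    local_src: str = LOCAL_INPUT,
    local_dest: str = LOCAL_OUTPUT,
) -> List[Dict]:
    """Turn a single spark-submit step into download, process, upload steps."""
    return [
        _step("download", ["s3-dist-cp", f"--src={src}", f"--dest={local_src}"]),
        _step(
            "process",
            ["spark-submit", script, f"--src={local_src}", f"--dest={local_dest}"],
        ),
        _step("upload", ["s3-dist-cp", f"--src={local_dest}", f"--dest={dest}"]),
    ]


if __name__ == "__main__":
    steps = download_process_upload(
        "s3://bucket/path/to/script.py",
        "s3://bucket/input/data",
        "s3://bucket/output/data",
    )
    for step in steps:
        # Print each step as the command line it would run on the cluster.
        print(" ".join(step["HadoopJarStep"]["Args"]))
```

The resulting list can be handed to boto3, e.g. boto3.client("emr").add_job_flow_steps(JobFlowId=cluster_id, Steps=steps), where cluster_id is the ID of your own cluster.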
This is great when we have a large number of executors, definitely more than 200. But even then, experiment. Sometimes it’s better to reduce the number of executors and increase the load on each of them.