I’ve been putting off writing about my experience with AWS Glue for a couple of months now. It’s one of those things I really wanted to get off my chest, but it hurts to even talk about. Let’s end this pain, let’s end it now. Ahem… AWS Glue sucks.
We had a use case for a daily ETL job from and to Redshift. The transformation was rather simple, but since developers would maintain the logic, it was easier to write code than convoluted SQL. The job was small enough (<50k rows) that a Lambda with a longer timeout would probably have been just fine; however, since more projects requiring larger-scale processing were coming, we were looking for potential candidates. This was a great opportunity to try out a service that boasts about itself in the official blurb.
Issues started right away. The documentation is/was really terrible. It describes how "great things are" rather than what they actually do. There were two pages dedicated to Redshift, and they were convoluted enough that even the AWS support team had difficulty understanding them. When deploying through CloudFormation, some options were missing and had to be updated manually, like activating the trigger for the cron job(?!). At the time of writing, only Python 2.7 was available, with examples that looked as if they had been written by Golang users, something like:
orgs = orgs.drop_fields(['other_names',
No doubt AWS Glue will be updated, and most likely it’s already much better than it was two months ago. However, it left such a bad taste in my mouth that it’s going to be hard to convince me to give it another shot in the near future. For simple tasks, Lambda should be enough; for larger jobs on a single data source, use EMR. When there are multiple sources with dependencies, orchestrate everything using Data Pipeline. Glue seems to be an on-demand EMR with a limited, suboptimal configuration, which leaves you with little control.
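To make the "Lambda should be enough" option a bit more concrete, here is a minimal sketch of how a small row-level transform could be structured as a Lambda handler. This is not the job we actually ran: the `transform` rule, the `run_job` helper, and the field names are hypothetical, and the Redshift reads and writes are stubbed out as plain callables so the snippet stays self-contained. Only the `handler(event, context)` shape is the standard Python Lambda entry point.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


def transform(row: Row) -> Row:
    """Hypothetical business rule: drop one field and rename another."""
    out = {k: v for k, v in row.items() if k != "other_names"}
    if "id" in out:
        out["org_id"] = out.pop("id")
    return out


def run_job(read_rows: Callable[[], Iterable[Row]],
            write_rows: Callable[[List[Row]], None]) -> int:
    """Read all rows, transform them in memory, and write them back.

    Fine for tens of thousands of rows; not meant for large datasets.
    """
    rows = [transform(r) for r in read_rows()]
    write_rows(rows)
    return len(rows)


def handler(event, context):
    # Assumption: rows arrive in the event purely for illustration.
    # A real job would pass callables that query and load Redshift.
    sink: List[Row] = []
    count = run_job(lambda: event.get("rows", []), sink.extend)
    return {"processed": count, "rows": sink}
```

Keeping the transform a plain function means it can be unit-tested without any AWS service, and the same logic could later be moved to EMR if the data outgrows a single Lambda invocation.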