Habituating to AWS Glue
Despite my strong opinion against AWS Glue, with its unfriendly documentation and strange approach to just about everything... I ended up using it as the main framework for an AWS-native ETL (Extract, Transform, Load) service. The whole journey felt like trying to make divorced parents get back together: they're cooperating now, but the process felt artificial and I'm still not sure they're meant for each other. The success came down to discovering some of Glue's dirty secrets.
What's the problem?
To be completely fair, my problem with Glue comes from a use case that seems trivial but is surprisingly challenging. The goal is a workflow of dependent jobs, each of which lifts a few Redshift tables, transforms them and upserts the result back into the same cluster. Simple, right? For starters, although the Glue context can read from JDBC (via the Redshift connector), it can only read by specifying a table name, so it lifts the whole table. That would be fine for tables up to a few GB, since the data is transferred with UNLOAD, which is fast, to S3, which is cheap. In my use case, however, some tables will soon be in the TB range, so lifting the whole table is a waste of bandwidth, connection time and, most importantly, money spent on Glue, Redshift and S3.

The first workaround was to use the Spark context directly with its JDBC connector. It works nicely for lifting data with custom SQL, allowing joins and WHERE clauses and returning a DataFrame (see the sketch below). Great, almost done. But now the problem is with upserts. Redshift does not support upserts natively; the recommended method is to insert into a temporary (staging) table, delete the duplicates from the target and then append the new data. The Spark connector has an "execute SQL" method but... it doesn't support transactions, and we definitely want to avoid a situation where the deletion succeeds but the insert is corrupted.
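To make that workaround concrete, here's roughly what the plain Spark JDBC lift looks like inside a Glue job. This is just a sketch: the cluster URL, credentials, schema and query are made up, and the driver class assumes a Redshift JDBC driver is available on the classpath.

```python
# Sketch: read from Redshift through plain Spark JDBC with a custom query,
# instead of lifting a whole table. All names and credentials are placeholders.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

query = """
    SELECT o.order_id, o.amount, c.segment
    FROM analytics.orders o
    JOIN analytics.customers c ON c.customer_id = o.customer_id
    WHERE o.order_date >= DATEADD(day, -1, CURRENT_DATE)
"""

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:redshift://my-cluster.xxxxxx.eu-west-1.redshift.amazonaws.com:5439/analytics")
    .option("driver", "com.amazon.redshift.jdbc42.Driver")  # adjust if you use the PostgreSQL driver instead
    .option("dbtable", f"({query}) AS src")                  # subquery alias pushes the joins/filters down to Redshift
    .option("user", "etl_user")
    .option("password", "****")                              # in practice, pull this from a Glue connection or Secrets Manager
    .load()
)
```

The subquery-as-dbtable trick is what lets the joins and filters run on Redshift instead of dragging whole tables over the wire.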
At this point, it felt like being betrayed by Glue's promises: no native support for such a simple use case, even though it was sold as the one-stop shop for all ETL. AWS is pushing hard to make Redshift the default analytics DB, yet most of its tools only support big data dumps into Redshift or exports to other services. If I can't get AWS's support, then let me help myself and import a custom package. But since I'm going to do that anyway, why not use Lambdas with Step Functions, or Data Pipeline?
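For the record, here's roughly what that do-it-yourself route looks like with pg8000 (psycopg2 would work the same way). A sketch under a big assumption: the Spark/Glue job has already written the transformed rows into a staging table, here called analytics.orders_staging; the table, key and connection details are all made up.

```python
# Sketch: the staging-table upsert run as one transaction, so the delete and the
# append either land together or not at all. Names and credentials are invented.
import pg8000

STATEMENTS = [
    """DELETE FROM analytics.orders
       USING analytics.orders_staging
       WHERE analytics.orders.order_id = analytics.orders_staging.order_id""",
    "INSERT INTO analytics.orders SELECT * FROM analytics.orders_staging",
    "DROP TABLE analytics.orders_staging",
]

conn = pg8000.connect(
    host="my-cluster.xxxxxx.eu-west-1.redshift.amazonaws.com",
    port=5439,
    database="analytics",
    user="etl_user",
    password="****",
)
try:
    cursor = conn.cursor()
    for stmt in STATEMENTS:
        cursor.execute(stmt)   # autocommit is off, so these stay in one open transaction
    conn.commit()              # duplicates removed and new rows appended atomically
except Exception:
    conn.rollback()
    raise
finally:
    conn.close()
```

Because the driver keeps autocommit off, the delete, the append and the cleanup are committed together, which is exactly the guarantee the Spark connector's "execute SQL" hook doesn't give.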
Why not Lambda (Step Function) or Data Pipeline?
Both are viable options and both have their quirks. For one, Data Pipeline is significantly limited in what it can do. Never mind that it looks and feels like a service its owners would like to deprecate, except that it does something OK and has users depending on it. Not much has changed in the last couple of years and, besides, how seriously can you treat something that requires names to start with "My", like "MyFirstDataPipeline"? There are RedshiftCopyActivity and SqlActivity, which might be helpful here, but they still require provisioning a resource (EC2 or EMR) to run on. A micro EC2 would be fine, but if I'm going to define everything step by step anyway, I might as well not limit my activity options and go straight to AWS Step Functions.

AWS Step Functions seem like the proper solution. The list of triggering events and executable actions is constantly growing. It looks easily extendable and, given that many new services get a hook soon after release, there is hope that this is The AWS orchestration solution. What's the quirk? Well, we still need to run the query somewhere. The obvious choice is Lambda. In the majority of cases that should be enough, but there's a hard timeout of 15 minutes and we already have queries that take about 20. There was some hope that, since the query can be written as a Redshift procedure with no output, it shouldn't need an active connection to finish. Unfortunately, even though neither pg8000 nor psycopg cancels its job on the timeout, Redshift treats the dropped connection as a broken transaction and rolls back. Since a Lambda is a process that lives on until another invocation claims the same resources, some hacking might keep the connection alive past the timeout, but that wouldn't be reliable. So, a two-way-door plan: focus on Glue and, if its workflows turn out to be limited, execute the Glue jobs via Step Functions (sketched below). Either way there's boilerplate to write, so it might as well start with Glue.
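And if that escape hatch is ever needed, the Step Functions side is not much code. Here's a minimal sketch of a Lambda task that kicks off a Glue job through boto3; the job name and arguments are invented.

```python
# Sketch: a tiny Lambda that a Step Functions task can call to start a Glue job.
# The job name and arguments are placeholders.
import boto3

glue = boto3.client("glue")

def handler(event, context):
    run = glue.start_job_run(
        JobName="upsert-orders",                         # hypothetical Glue job
        Arguments={"--target_table": "analytics.orders"},
    )
    return {"JobRunId": run["JobRunId"]}
```

Step Functions can also call Glue's StartJobRun directly through its service integration (including a synchronous variant that waits for the run to finish), so the Lambda above is really just the lowest-common-denominator option.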