Leaving Amazon for a sabbatical

It’s been over a week since I left Amazon, where I worked for over three years. It seems like the right time to write out some thoughts on the journey.

Disclaimer

First of all, a disclaimer which should be obvious, but one never knows. All thoughts are mine and come from personal experience. Amazon is a huge company with a total employee count closer to a million, including a couple of hundred thousand software developers. Others have gone through different experiences since they interacted with different people and teams.
Another disclaimer is that I worked on the corporate side of the company, which is different from warehouses and deliveries. Although one of my teams was in transportation and we provided solutions to yard (roughly “parking outside warehouses”) associates, I still don’t feel confident speaking for that side.

What is good about Amazon?

There are plenty of positive things I would like to say about the company. Off the bat: I would recommend working there. Not to everyone, which is why I left, but I’m guessing it will suit most people. There is a large number of teams to try out, and internal relocation policies make resettling possible.

As a first-time corporate employee, the most impressive thing about software development at Amazon was its internal knowledge. There are plenty of resources to go through and learn from: documentation, videos, designs, discussions, an internal “Stack Overflow”… At times it can be challenging to find what you’re looking for, but that’s due to the ever-changing environment and the vast amount of inherited information, which makes scaling knowledge difficult. In addition to the written information, all teams have access to principal engineers or folks who have been around for ages, and it’s great to pick their brains too.

Another positive is the availability of AWS. One can prototype as they wish and try out new features and services. It often simplified my design process, as I could quickly verify whether something works instead of digging through layers of documentation. It also removes the burden of constantly thinking about costs. When using AWS on my own, it matters whether I’m paying $10 or $200 a month, but for work projects that’s just a prototyping cost (the threshold is usually agreed with the manager).

Some might also find the opportunity to wear many hats beneficial. I certainly enjoyed it, and it was a great learning experience. SDEs at Amazon are defined by both their scope of influence and their role. You often have to help scope projects and products, design solutions, create prototypes, lead teams, and communicate with stakeholders and customers. I’ve seen many opinions on the internet about wearing many hats as a software developer, but I’m of the opinion that one needs to know what they are building and why before actually doing it. Developers aren’t “coding monkeys” and they should have a say in whatever they’re constructing. The real question is one of balance but, since I’m talking about my experience at Amazon, that balance can be shifted as required.

Why then have I left?

The decision to leave wasn’t sudden; it had been growing in me for over a year. I came to Amazon as a software developer with a PhD and machine learning (ML) experience. I was promised that I’d be able to use my skills on challenging analytical/ML-related projects. That hasn’t happened. In the first year I was in a team operating under the Away Team model, which doesn’t have projects of its own. It’s more of a mercenary that helps others out with some self-interest. Long story short: others like to keep the interesting bits for themselves. The second team felt like salvation. More promises came, and I somehow believed them even though they were far in the future. Then the future came and it was disappointing. More promises. I was eventually included in and given more analytical projects, but they weren’t challenging; the challenge was managing others’ work rather than working out solutions myself. Higher hopes were attached to my latest org, Economical Technologies aka EconTech, which is a “reinforcement learning first” org. I was there for about 3 months and it was kind of cosy, with great expectations, but… everything is just too slow. Not only EconTech but the whole of Amazon. Taking all my experiences, my low faith in any promise, and factoring in the expected covid-19 responses, I did a mental forecast for the next year and it showed no difference. Given no expected progress, team inclusivity deteriorating due to the pandemic, my salary effectively shrinking due to the super-low number of stock grants after the fourth year, and simple annoyance at Amazon’s stance on the increasing global wealth gap while it gets richer, well, it was time to go.

Before moving on to the next thought, a quick explanation of what I mean by writing that Amazon is slow. Maybe an analogy to forest fires is suitable? On the whole it spreads quickly and is super destructive, but if you focus on a specific point on the circumference you’ll see that it’s rather slow. Slow, but at a constant, steady pace; always that one metre a minute further from the centre. The thing is that the circle at this point is huge, so adding a tiny bit in all directions can feel like exponential growth. Amazon as a company is super fast. There are plenty of new services and ideas each year, and it’s expanding its tentacles almost everywhere. Super impressive! However, if we focus on individual products and teams, it’s a different story. Most teams are slow. The phrase “it’s always day 1” to me means that everyone is new to the company and hasn’t figured out how to communicate effectively. And there’s the newcomer’s syndrome, where everyone wants to impress others on their first day, which leads to mutually self-imposed high expectations from misread peer pressure. Many will work long, unproductive, mindless hours, only decreasing the quality of the product. It’s slow because all of this surfaces as a half-baked product. Is that bad? Well, only if you want to consume that product; otherwise put it on display and it looks awesome.

What am I going to do now?

I’m taking a sabbatical for the next 6+ months. In my case, a sabbatical means taking the time to focus on the skills I loved to use, i.e. analytical thinking and artificial intelligence. During this time I’d like to catch up with all the advancements in the machine learning world and see how they could apply to the current pandemic world. I’m especially interested in focusing more on (Deep) Reinforcement Learning and creating environments and agents. I have a few thoughts on how I could give back to society by creating my own product. More on that will probably come once I have more clarity on the problem and solutions.

Having written that, I’d like to be clear that I’m not closing myself off to the outside world. I’m happy to hear about any opportunities, but I’ll be extremely picky and will prioritise interesting challenges.

And if that doesn’t pay off?

Except for money, I’m not losing anything. I don’t need much in life, though I acknowledge that I come from a fortunate position. I have everything I ever wanted and there are plans for what’s still missing. Everything goes into emergency funds and retirement. If not now, then when?

AI Traineree – PyTorch Deep Reinforcement Learning lib

tl;dr: AiTraineree is a new Deep Reinforcement Learning lib based on PyTorch.

A few months ago, through some coincidences at work and some news from newsletters, I discovered the world of Deep Reinforcement Learning. Until then it was “one of those” topics, but on closer inspection… I couldn’t take my eyes off it. Something happened and it clicked. I started playing around with some OpenAI gyms and did a nanodegree course on Udacity, and the feeling grew. So, let me share the feeling.

I’ve started yet another Python lib to play around with Deep Reinforcement Learning. It already has some of the more popular agents (DQN, PPO, DDPG, MADDPG) and is easy to use with the OpenAI gyms. The intention for the lib is to have a bigger zoo of agents, be compatible with more environments, and have tools for better development and debugging. Although it is a work-in-progress project, it is already usable. What distinguishes it from many others is the unification of types and making sure that all components play nicely with each other. The lib is based on PyTorch; I’ve seen many smaller DRL projects built on it, but they usually contain a single agent tied to a specific environment.
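
To give a flavour of the kind of loop the lib is meant to simplify, below is a minimal sketch of the usual agent/environment interaction with an OpenAI gym. The RandomAgent here is a stand-in of my own, not AiTraineree’s API; a real agent (e.g. DQN) would actually learn from the step() calls instead of ignoring them.

import gym

class RandomAgent:
    """Stand-in agent used only to illustrate the interaction loop.
    This is not AiTraineree's API; a real DRL agent would learn in `step()`."""
    def __init__(self, action_space):
        self.action_space = action_space

    def act(self, state):
        return self.action_space.sample()  # a real agent: e.g. epsilon-greedy over Q-values

    def step(self, state, action, reward, next_state, done):
        pass  # a real agent: store the transition and update its networks

env = gym.make("CartPole-v1")
agent = RandomAgent(env.action_space)

for episode in range(5):
    state, done, total_reward = env.reset(), False, 0.0
    while not done:
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)  # classic gym API: (obs, reward, done, info)
        agent.step(state, action, reward, next_state, done)
        state, total_reward = next_state, total_reward + reward
    print(f"Episode {episode}: reward={total_reward}")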

Let me know if you want anything specific in the lib. Over the next couple of weeks I’m planning to make significant contributions to it.

Timed OS light/dark theme switching

tl;dr: A GitHub gist with a command walkthrough is available here.

What

The ability to adjust themes, and in particular the dark mode, has been one of the trendiest tech features of 2019/2020. Many sites and apps now allow you to flip between the “normal” and the “dark” mode.

Why

Although I don’t belong to the die-hard zealots one can find on the internet, I do appreciate this feature in a dark environment, as I’m rather light-sensitive and on most devices even the lowest brightness is too high for me. It was a nice surprise that Ubuntu 20.04 came with a global theme setting and a couple of default themes. This lets me decide when it’s dark and then switch to the dark mode. Since many pages, e.g. stackoverflow.com or duckduckgo.com, now detect the OS’s theme mode, they will also switch into it. Neat. So, when the light goes down, my dark mode goes on, and we’re all happy.

But the night obviously comes every day, so why should I spend those 3 seconds of manual labour when I can make it automatic?

How

I won’t go into too much detail, but the proposed solution uses the service manager systemd, and more specifically its systemctl command. There are two “services”, one for each theme flip, and they are run daily at specific times.

For systemd to automagically detect your services and timers, they should be placed in ~/.config/systemd/user. It’s likely that this directory doesn’t exist yet, so create it. The code also expects a ~/.scripts directory where random utility scripts are placed.

The walkthrough code is below. Please note that none of these files exist beforehand, so you have to create them and fill them with the content shown after each cat command. Also, the script changes the default terminal profile and expects two profiles called “Light” and “Dark” for day and night, respectively.

user@host:~/$ mkdir -p ~/.config/systemd/user
user@host:~/$ mkdir -p ~/.scripts

user@host:~/$ cat ~/.config/systemd/user/light.service  # Create this file
[Unit]
Description=Automatically change the "Window Theme" to "light" in the morning.

[Service]
Environment=DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1000/bus
ExecStart=%h/.scripts/profile_changer.sh light

user@host:~/$ cat ~/.config/systemd/user/light.timer  # Create this file
[Unit]
Description=Automatically change the "Window Theme" to "light" in the morning.

[Timer]
OnCalendar=*-*-* 06:00:00
Persistent=true

[Install]
WantedBy=default.target

user@host:~/$ cat ~/.config/systemd/user/dark.service  # Create this file
[Unit]
Description=Automatically change the "Window Theme" to "dark" in the evening.

[Service]
Environment=DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1000/bus
ExecStart=%h/.scripts/profile_changer.sh dark

user@host:~/$ cat ~/.config/systemd/user/dark.timer  # Create this file
[Unit]
Description=Automatically change the "Window Theme" to "dark" in the evening.

[Timer]
OnCalendar=*-*-* 19:00:00
Persistent=true

[Install]
WantedBy=default.target

user@host:~/$ cat ~/.scripts/profile_changer.sh  # Create this file
#!/bin/bash

get_uuid() {
  # Print the UUID linked to the profile name sent in parameter
  local profile_name=$1
  profiles=($(gsettings get org.gnome.Terminal.ProfilesList list | tr -d "[]\',"))
  for i in ${!profiles[*]}
    do
      local uuid="$(dconf read /org/gnome/terminal/legacy/profiles:/:${profiles[i]}/visible-name)"
      if [[ "${uuid,,}" = "'${profile_name,,}'" ]]
        then echo "${profiles[i]}"
        return 0
      fi
  done
  echo "$profile_name"
}

if [ $1 == "dark" ]; then
  THEME='Yaru-dark'
elif [ $1 == "light" ]; then
  THEME='Yaru-light'
fi
UUID=$(get_uuid $1)

/usr/bin/gsettings set org.gnome.desktop.interface gtk-theme $THEME
/usr/bin/gsettings set org.gnome.Terminal.ProfilesList default $UUID

user@host:~$ chmod a+x .scripts/profile_changer.sh  # Make script executable
user@host:~$ systemctl --user daemon-reload
user@host:~$ systemctl --user enable dark.timer light.timer
user@host:~$ systemctl --user start dark.timer light.timer

The last three commands refresh the service daemon so it picks up the file changes, enable the timers so they run in the background on startup, and start them now.

That’s less work than initially expected. As usual, most of the work came from Stack Exchange, in this case an AskUbuntu thread. Lucky that most of the time there’s someone with a similar question and someone with a good answer.

Project closing note: Personal Progress

I’ve been evolving my productivity process for a long time and there are a few aspects that really work for me. One of them is holding official starting and ending ceremonies. By official I mean a session where you pretend to report findings to your boss, although they and everyone else are (suspiciously) quiet. Often that report means going through the project template and discussing all the notes. Then, I ceremonially move the project file/note to a directory with all the other completed (not necessarily successful) projects and write out the learnings.

Since I’ve already shared a bit about the Personal Progress, I thought I might share the concluding words as well. Below is the header of my project file for the Personal Progress. Beneath that header sits the project template, followed by notes written at every health check.


Personal Progress

Activity: Completed
Duration: Mid-term
Status: Success

Learnings

Time estimation

Although I was fairly confident in the time estimation, the target date had to be changed a few times. These changes were related to another project taking higher priority and a few requirement changes to this one. The requirement changes were due to the technology used and the companies behind it. Changing the target date for this project is not a concern; it has little impact on anything.

Technology

Many components I worked with were new to me. The frontend is in React with Bootstrap, the backend in Python with Flask, MongoDB is the database, proxying is done with Nginx, and they all run in separate Docker containers. Deployment is via docker-compose with docker-machine to Digital Ocean droplets. Despite some frustrating moments I really enjoyed learning all of that, and I will definitely try to reuse the stack for future projects.

Personal information

This project allows me to keep track of my personal views. Replying to the same questions every few months/years will let me learn more about myself.


 

What’s not written explicitly in the header is that the project is complete even though the webpage is not. There are some bugs to fix and features that would be nice to have; however, they’re not a top priority. The result is good enough for what it is intended to be. If there are requests to change something, that will be taken care of, but there’s no value in thinking about it now and being reminded every two weeks. Mental freedom and being fair with myself are more important.

Speeding up EEMD / CEEMDAN

tl;dr: The PyEMD documentation has a section on speed-up tweaks.

As the author of the PyEMD package, probably the most common question I receive is “Why does it take so long to execute EEMD/CEEMDAN?”. That’s a reasonable question because EEMD and CEEMDAN can be quite slow. Unfortunately, that is more about the nature of these methods than the implementation. (Not saying that the implementation cannot be improved.)

The question is often followed by a description of the signal: it has 20k+ samples with some weekly seasonality, collected over a couple of years at sub-hour frequency. From the perspective of EMD et al. this means there are many extrema, which in turn means that one needs plenty of disk/memory space to accommodate interim results (especially splines) and that there’s a “higher chance” of producing that odd extremum which then propagates through all the siftings. Unfortunately, it is expected for a full EEMD/CEEMDAN evaluation to take minutes even if a single EMD takes only a couple of seconds.

Even though EEMD can be parallelized across the trials in the ensemble, every added noise realisation slightly changes the signal. EMD is not robust; some perturbations will have no effect and others might return a couple more IMFs than expected. CEEMDAN is even worse performance-wise because its components depend on each other, so it is serial in nature with only parts that are parallelizable.

I have added an F.A.Q. section to PyEMD’s Readme file and updated PyEMD’s documentation with a chapter on factors that affect performance. These include the data type used, the number of iterations and the envelope spline selection. Let me know if something is not clear or if there are other factors to be added. It’s been a while since I last played with EMD, so maybe there are some significant improvements that I should be aware of.
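
For a rough idea of where those knobs live, here is a minimal sketch in PyEMD. The parameter names (trials, spline_kind) are as I recall them, so please verify them against the current documentation; the signal here is just a short synthetic stand-in.

import numpy as np
from PyEMD import EEMD

# Short synthetic signal as a stand-in for the real (much longer) data.
t = np.linspace(0, 1, 2000)
s = np.sin(2 * np.pi * 5 * t) + 0.3 * np.sin(2 * np.pi * 40 * t)

# Speed-related knobs (check names against the docs):
# - fewer ensemble trials,
# - a cheaper envelope spline than the default cubic (e.g. 'akima'),
# - a smaller data type for the input signal.
# The docs also describe capping the number of sifting iterations (FIXE / FIXE_H).
eemd = EEMD(trials=20, spline_kind='akima')
s32 = s.astype(np.float32)
t32 = t.astype(np.float32)

eIMFs = eemd.eemd(s32, t32)
print(eIMFs.shape)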

Let me just do this manually

Have you seen the cartoons drawn by XKCD or, rather, Randall Munroe? A definite recommendation to check out his comic strips; they are quite geeky but cover a wide range of topics. In particular, he has a cartoon with a spreadsheet of “how often you use it” vs “time saved”. It’s good generic guidance to consider when you’re tired of a mundane task and thinking about automating it. My personal spreadsheet is nonexistent, but if I were to make one it would have additional dimensions, e.g. what’s the learning value. Even if I’m not saving much time, the knowledge gained and the scratching of the curiosity itch is a big win.

Recently I’ve adopted and embraced containerization through the Docker family. These lightweight, all-inclusive environments allow you to develop, build and deploy locally and make sure that, when deployed, everything behaves the same. Smack everything into containers, create a yaml template for docker-compose and deploy on a remote host through docker-machine. Quick and easy. Except for some caveats.

One of the issues that almost made me regret all these containers was getting CA certificates to terminate TLS on the remote host using Let’s Encrypt. In short, to obtain the certificate you need to prove that you control the domain by responding in a specific way to a specific request. Fine, but to do that you first need to make the domain responsive, which means you need some certificates, which you don’t have since that’s exactly what you’re trying to obtain. What to do? Get some self-signed certificates, ask for the real ones, get the new certificates, replace the old ones, and serve the new ones. Doing this manually takes a couple of minutes and can be done with a combination of ssh-ing and running a script. Having this done automatically as part of deployment to any host is not that simple.

There are a number of blogs that try to describe what to do in such a situation, but most (that I’ve seen) still focus on running docker-compose from within the remote host. Unfortunately, that isn’t what I want; for small projects, I want to run a single command from my local host and have everything done automagically. So I spent days trying out some solutions. Two of my favourites, which I’ll return to in the future, are Traefik and the docker-letsencrypt-nginx-proxy-companion sidecar. The former is an nginx replacement with a dashboard and a built-in Let’s Encrypt solution, whereas the latter is a container that works with two others to do some magic. In both cases, one has to configure the relations either through environment variables or labels and these should then just work. Well, should, but I haven’t actually managed to make them work. The Traefik approach is nicely documented by the Digital Ocean writers, though it takes a while to configure everything properly. The other, nginx-based one is a bit outdated, and updating it, for example to docker-compose v3… it didn’t go that well.

All in all, I tried to make things run smoothly and automatically so I would never have to do them again. What I ended up doing was spending 10 minutes doing things manually: copy over with `scp` and update the volume references. Quick and easy. Even writing documentation on how to do it again in the future took only a couple of extra minutes, not days.

Have I done what I set out to do? Yes. I’ve learned new technologies by testing them out and knowing what lives where, and I’ve also optimized future releases by having better documentation and an explanation of why the other approaches won’t help.

The Personal Progress – opinion tracker

Presenting to you one of the side projects: the Personal Progress.

The goal of the project is to track personal opinions on everything over a long time. As we change our social surroundings, we meet new people, we learn about new experiences and we discover things we didn’t know before. Each event has an impact on us; some will confirm our beliefs and others will challenge them. Personality changes slowly but gradually, making it difficult to observe on a day-to-day basis. It’s typically more obvious to people with whom we don’t interact too often. Why leave this entertainment only to them? Go ahead and leave breadcrumbs for yourself.

Unfortunately, it’s difficult to learn about your past self unless the Past You allowed for it. Blogs and diaries definitely help with this, but their content is typically written in a form that makes it difficult to extract a specific opinion. The Personal Progress asks direct questions and reminds you every so often to check for any updates.

The result will be achieved over years. On top of a simple, side-by-side comparison of answers, each question will be classified into broad domains, and you’ll be able to check how they group together. Each answer will also be quantified on sentiment and category-specific axes, allowing an intensity value to be extracted.

Feel free to give it a go and let me know whether you (dis)like it, think something requires improvement, or simply want to ask for additional features. The tool is still a work in progress, but it’s usable and I’m using it.

AWS’ CDK for ECS Fargate, or how to run infrequent large jobs

Yep, it’s one of those titles where unless you know what it’s about, you probably don’t know what it’s about. If you feel at home with these abbreviations, go ahead and scroll down a bit. Otherwise, let me expand it here a bit: Amazon Web Services’ Cloud Development Kit for Elastic Container Service Fargate. That’s probably not much clearer, so on to the descriptions we go.

Syllabus

Cloud Development Kit

This is an SDK for CloudFormation. It allows you to write infrastructure as if it were code. It currently supports a couple of languages, including Python and Java, but it feels like the main one is TypeScript. The CDK compiles to native CloudFormation (Cfn), so whatever is missing in Cfn will also be missing in the CDK. Additionally, some modules are in an experimental phase, which means that their API isn’t fully established. I doubt the changes will be significant; most likely property naming or different default settings. However, they reserve the option to introduce breaking changes to these modules, e.g. Cognito.

Elastic Container Service (ECS)

As with most AWS services, just ignore the “Elastic” part and we’re set: Container Service. It allows you to run containers, mainly Docker, away from your machine. It has some functionality for Docker Swarm or Kubernetes-like orchestration and means to provision resources when needed. Currently there are two types of resource provisioning: self-managed and auto-managed. The self-managed option is simply called “EC2”, as it requires you to provide EC2 instances or an auto-scaling group where ECS can install its framework and use the required capacity as needed. The auto-managed option is called…

Fargate

Treat this like a heavy AWS Lambda and you won’t be too far off. The difference is that Lambda is often used to run just code, and sometimes a whole runtime provided in a single container. With ECS you have to provide at least one container; these are bundled into a group called a Task Definition. The Fargate service allows you to forget about everything except the mentioned task definitions. It does the provisioning and scaling for you (not for free), but you need to specify the metrics based on which you want to scale in and out.

How to run infrequent large jobs?

A couple of times there has been a situation where I occasionally need to run a large job/script. By large I mean that its execution on my laptop takes about 10-60 min, and it needs to run every week for 100 different configurations. A use case is retraining a prediction model with the latest weekly report. All in all, I need a medium computational job that bursts once a week. As with any problem, there are many potential solutions. Before stating my preferred design pattern, let’s strike out a couple of candidates.

Amazon Lambdas. These would be awesome if they didn’t have a timeout. Unfortunately, the process is shut down after 15 min (https://docs.aws.amazon.com/lambda/latest/dg/limits.html) and, besides, their memory tops out at about 3 GB, which sometimes might be too little. Smart people might suggest dividing the logic into finer granularity, to which I’d say that they’re not that smart and shouldn’t try to fix everything with a hammer.

Why not just have one host instance and run all these jobs one after another? Well, why not just pass an exam by changing the question and answering your own? No, I want them all done within an hour of getting the results so I can plan the following week accordingly.

Ok, maybe have a periodic trigger like a cron job or CloudWatch event run a Lambda function that provisions EC2 hosts, and… ? This quickly becomes dependency hell. You need to provision a host, then run something there, then deprovision… it quickly turns into a Step Functions workflow and you need to maintain code for both the infrastructure and its logic. Way too much hassle.

My preferred solution is ECS. Containers have this nice property that once you try them, you like them and you stay with them. What works for me is to have all the logic in a container with a specific entrypoint (simple Dockerfile example below) and wrap it into a Task Definition that provides arguments (command) to the container. The number of running tasks depends on the size of an SQS queue; if it has more than 0 messages, keep adding tasks. These messages can carry additional parameters that the logic knows how to extract. Done. That’s it. The auto-scaling property will make sure that for the majority of the time there are 0 containers, and as soon as something starts sending messages it will increase the number of containers.
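
To make the container side concrete, here is a minimal sketch of what the main.py behind the Dockerfile’s entrypoint could look like: poll the queue, run the heavy work per message, delete the message, and exit once the queue is drained. The --sqsUrl argument mirrors the command passed in the CDK example further below; run_job is a placeholder of mine for the actual computation.

import argparse
import json

import boto3


def run_job(params: dict) -> None:
    """Placeholder for the actual heavy work, e.g. retraining a model."""
    print(f"Running job with {params}")


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--sqsUrl", required=True)  # mirrors the `command` in the CDK example
    args = parser.parse_args()

    sqs = boto3.client("sqs")
    while True:
        resp = sqs.receive_message(QueueUrl=args.sqsUrl, MaxNumberOfMessages=1, WaitTimeSeconds=20)
        messages = resp.get("Messages", [])
        if not messages:
            break  # queue drained; the service scales back down to 0 tasks
        for msg in messages:
            run_job(json.loads(msg["Body"]))  # the message body carries the job parameters
            sqs.delete_message(QueueUrl=args.sqsUrl, ReceiptHandle=msg["ReceiptHandle"])


if __name__ == "__main__":
    main()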

How does the CDK come into play here? It provides a solution to do just that with only a few lines of code. The CDK has a module called ECS patterns which provides recipes for common ECS patterns like Application/Network Load Balanced clusters or periodic scheduled jobs. The one I talked about is called Queue Processing Fargate Service (there’s also an EC2 version). Excluding alarms, the whole infrastructure for the mentioned service takes about 5 lines (basic example below). There are of course additional parts depending on your service, but the infrequent scaling bit is done. Cool, right?

Example of the ECS pattern in CDK (TypeScript)

const queue = new sqs.Queue(this, 'ResourceQueue', { queueName: 'MySqsQueue' });
const image = ecs.ContainerImage.fromEcrRepository(
  ecr.Repository.fromRepositoryName(this, 'ResourceName', 'container')
);

const scalingSteps = [{ change: -1, upper: 0 }, { change: 1, lower: 0 }];
const command = ['--sqsUrl', queue.queueUrl];

const ecsService = new ecs_patterns.QueueProcessingFargateService(this, 'FargateService', {
  image, command, scalingSteps, queue,
});

Typical Dockerfile for Python jobs


FROM python:3.7-slim

WORKDIR /usr/src/app

COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

ENTRYPOINT [ "python", "main.py" ]

Refreshment. Ain’t no scientist no more.

In the software world, there’s a saying “if you want users to get excited about a new release, change buttons’ colours.” (I’m saying this.) Go ahead, look around; looks different, right?

As mentioned in previous notes, I want to write more often. Actually, I do write more often, just not on this blog. Diving into the “why”, it seems that I have high expectations for the quality. It isn’t about wording or grammar but rather the feeling that I need to write something science-related; and to be fair, not just “science” but something I have either contributed to or have some thoughts about. Given that my primary income in the past few years came from making computers do magic, let’s just put it out there: “I ain’t no scientist no more”…

… and now, quickly, let me clarify before anyone leaves. I am a scientist; I do feel like one. I’m going to why-analyze every aspect of a mundane activity until it reaches either atoms or astral powers. If anyone asks or doubts it, I’m pulling up my official email from the Uni. But that’s not to say it’s my main occupation. In the same way that some “powerful” people dumb themselves down by claiming they’re “Father. Husband. The president of the United States.”, I’m the “Average human. Scientist. A person who uses computers daily, be that for creating functional services or performing data manipulations.”

This announcement(?) needs to come now since I’m getting close to the point where I can start sharing some things I’ve been thinking about and/or working on. Needless to say, I’ll need some audience to share these ideas/projects with. I won’t go into “please like and subscribe”, but if you want to share posts or tell others about any of my work, you have my blessing.

What are the changes? The landing page gets a refresh: check. Static pages with ongoing thoughts and projects. Blog posts on random topics but hopefully on theme, as they’ll be about whatever I’m working on, which hopefully is on theme.

There. Typical blog post on how I’m going to write more blog posts. Excited?! mumble mumble…

Habituating to AWS Glue

Despite my strong opinion against AWS Glue, with its unfriendly documentation and strange approach to just about everything… I ended up using it as the main framework for an AWS-native ETL (Extract, Transform, Load) service. The whole journey felt like trying to get divorced parents back together. They’re working together, but the process felt artificial and I’m not sure whether they’re meant for each other. The success was due to finding out some of Glue’s dirty secrets.

What’s the problem?

To be completely fair, the problem with Glue arises from a use case that seems trivial but is surprisingly challenging. The goal is to have a workflow of dependent jobs, all of which lift and transform a few Redshift tables and upsert the results into the same cluster. Simple, right?
For starters, although the Glue context allows reading from JDBC (the Redshift connector), it can only read data by specifying a table name, so it lifts the whole table. That would be fine if we were dealing with tables of up to a few GB, since the data is transferred using UNLOAD, which is fast, to S3, which is cheap. In my use case, however, some tables will soon be in the TB range, so lifting the whole table is a waste of bandwidth, connection time and, most importantly, money spent on Glue, Redshift and S3.

The first workaround was to use the Spark context directly with its JDBC connector. It works nicely for lifting data with custom SQL clauses, allowing joins and wheres and resulting in a DataFrame. Great, almost done. But now the problem is with upserts. Redshift does not support upserts. The recommended method is to insert into a temporary table, delete all duplicates from the target and then append the new data. The Spark connector has an “execute SQL” method but… it doesn’t support transactions. We definitely want to avoid a situation where the deletion succeeds but the insert is corrupted.
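
For context, lifting with a custom query through Spark’s generic JDBC reader looks roughly like the sketch below. The connection details and the driver class are placeholders of mine; in a Glue job the SparkSession would come from the GlueContext rather than being built directly.

from pyspark.sql import SparkSession

# In a Glue job the session comes from the GlueContext; building one here
# only keeps the sketch self-contained.
spark = SparkSession.builder.appName("redshift-lift").getOrCreate()

# Connection details and driver class are placeholders; adjust to your cluster.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:redshift://my-cluster.example:5439/analytics")
    .option("user", "etl_user")
    .option("password", "...")
    .option("driver", "com.amazon.redshift.jdbc42.Driver")
    # A subquery instead of a bare table name, so the join and WHERE run on
    # Redshift and only the needed rows are lifted.
    .option("dbtable", "(SELECT a.id, a.value FROM table_a a JOIN table_b b ON a.id = b.id) AS src")
    .load()
)
df.show(5)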

At this point, it felt like being betrayed by the Glue promises: no native support for such a simple use case, even though they promised a one-stop shop for all ETL. AWS is pushing hard to make Redshift the default analytics DB, but most tools either support big data dumps into Redshift or exports to other services. If I can’t get AWS’ support, then let me help myself and import a custom package. But, since I’m going to do that, why not use Lambdas with Step Functions, or Data Pipeline?

Why not Lambda (Step Function) or Data Pipeline?

Both are viable options and both have their quirks. For one, Data Pipeline is significantly limited in what it can do. Never mind that it looks and feels like a service the owners want to deprecate, but it does something Ok and has dependent users. Not much has changed in the last couple of years and, besides, how seriously can you treat something that requires names to start with “My”, like “MyFirstDataPipeline”? There are RedshiftCopyActivity and SqlActivity, which might be helpful here, but they still require provisioning a resource (EC2 or EMR) to run on. A micro EC2 should be fine, but if I’m going to define everything step by step I might as well not limit my activity options and go straight to AWS Step Functions.

AWS Step Functions seem to be the proper solution. The list of triggering events and executable actions is constantly growing. It seems easily expandable and, given that many new services get a hook soon after their release, there is hope that this is The AWS orchestration solution. What’s the quirk? Well, we still need to run the query somewhere. The obvious choice is Lambda. In the majority of cases that should be enough, but there’s a max timeout of 15 min and we already have some queries that take about 20 min. There was a hope that since the query can be written as a Redshift procedure without any output, it shouldn’t require an active connection to finish. Unfortunately, even though neither pg8000 nor psycopg would cancel their job on the timeout, Redshift would treat it as a broken transaction and roll back. Since a Lambda is a process that lives on until another request needs the same resource, some hacking might allow the connection not to be killed on the timeout, but this wouldn’t be reliable. So, a two-way-door plan: let’s focus on Glue and, if its workflow is too limited, we can execute Glue jobs via Step Functions. Either way, there’s going to be boilerplate written, so it might as well start with Glue.

Revelation…

I’ve been using “Glue job” and “Glue Shell job” somewhat interchangeably, but that’s only to indicate that I tried both Glue flavours. In reality, these two are completely different beasts and shouldn’t stand close to each other (and the documentation should definitely be clearer on this). A Glue job can run either as an EMR-backed job (1 DPU) or as an EC2-backed Shell job (0.0625 DPU), in which case it’s effectively a Lambda with a max timeout of 24 hours. Strangely, the mechanism for importing custom packages is significantly different. In the case of the Glue/EMR job, one can zip packages, upload them to S3 and add them via job arguments (--extra-files). For the Glue/EC2 job, these extra packages need to be bundled into a single .egg file and already exist at the job’s creation. Either case requires pure Python code without any binary/C bindings, so no psycopg as a connector package and no usage of pip. Difficult, challenging, but that’s fine. Whilst debugging an unsuccessful import, I printed out the packages available in the environment and, lo and behold, the solution was within reach all along. It turns out that there are some packages not officially mentioned, and one of them is PyGreSQL, a PostgreSQL/Redshift connector. This allows executing any query on Redshift without any special magic; just import PyGreSQL and enjoy. With question marks still flying above my head, we reached out to AWS support and followed up with the AWS Glue team. Long story short, they’ll add the package to the officially supported list.
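
For reference, here is a minimal sketch of the transactional upsert from a Glue Python Shell job using PyGreSQL’s DB-API module (pgdb). The connection details and table names are placeholders of mine; the staging-table pattern itself is the standard Redshift merge recipe.

import pgdb  # PyGreSQL's DB-API 2.0 module

# Placeholder connection details; in practice these come from job arguments
# or a secret store.
conn = pgdb.connect(
    host="my-cluster.example.redshift.amazonaws.com:5439",
    database="analytics",
    user="etl_user",
    password="...",
)

# Stage new rows, delete matching rows from the target, append the staged rows.
# The DB-API connection is transactional, so it's all-or-nothing on commit.
statements = [
    "CREATE TEMP TABLE stage (LIKE target_table)",
    "INSERT INTO stage SELECT * FROM source_table WHERE updated_at > '2020-01-01'",
    "DELETE FROM target_table USING stage WHERE target_table.id = stage.id",
    "INSERT INTO target_table SELECT * FROM stage",
]

try:
    cur = conn.cursor()
    for stmt in statements:
        cur.execute(stmt)
    conn.commit()      # all statements succeed together
except Exception:
    conn.rollback()    # any failure leaves the target table untouched
    raise
finally:
    conn.close()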

Final solution

After a whole lot of frustration and complaining, we managed to get a lean and extensible solution. Every few hours a trigger executes a workflow of dependent jobs; some are pure SQL and some are Spark jobs. There’s no timeout problem, we have retries and alarms, and everything is in a CloudFormation script. P1 is ready and now it’s time to start pulling data instead of waiting for pushes.