Refreshment. Ain’t no scientist no more.

In the software world, there’s a saying: “if you want users to get excited about a new release, change the buttons’ colours.” (I’m the one saying this.) Go ahead, look around; looks different, right?

As mentioned in previous notes, I want to write more often. Actually, I do write more often, just not on this blog. Diving into the “why”, it seems that there are high expectations on quality. It isn’t about wording or grammar but rather the feeling that I need to write something science-related. To be fair, not just “science” in general, but science I have either contributed to or have some thoughts on. Given that my primary income in the past few years came from making computers do magic, let’s just put it out there: “I ain’t no scientist no more”…

… and now, a quick clarification before anyone leaves. I am a scientist; I do feel like one. I’m going to why-analyze every aspect of mundane activity until it reaches either atoms or astral powers. If anyone asks or doubts it, I’m pulling out my official email from the Uni. But that’s not to say it’s my main occupation. Just as some “powerful” people dumb themselves down by claiming they’re a “Father. Husband. The president of the United States.”, I’m the “Average human. Scientist. A person who uses computers daily, be that for creating functional services or performing data manipulations.”

This announcement(?) needs to come now since I’m getting close to the point where I can start sharing some things I’ve been thinking about and/or worked on. Needless to say, I’ll need some audience to share these ideas/projects with. I won’t go into “please like and subscribe”, but if you want to share posts or tell others about any of my work, you have my blessing.

What are the changes? The landing page gets a refreshment – checked. Static pages with ongoing thoughts and projects. Blog posts on random topics, but hopefully in theme with whatever I’m working on at the time.

There. Typical blog post on how I’m going to write more blog posts. Excited?! mumble mumble…


Habituating to AWS Glue

Despite my strong opinion against AWS Glue, with its unfriendly documentation and strange approach to just about everything… I ended up using it as the main framework for an AWS-native ETL (Extract, Transform, Load) service. The whole journey felt like trying to make divorced parents get back together. They’re working together, but the process feels artificial, and nobody is sure whether they’re meant for each other. The success was due to finding out some of Glue’s dirty secrets.

What’s the problem?

To be completely fair, the problem with Glue stems from a use case that seems trivial but is surprisingly challenging. The goal is to have a workflow of dependent jobs, each of which lifts and transforms a few Redshift tables and upserts the result into the same cluster. Simple, right?
For starters, although the Glue context allows reading from JDBC (Redshift connector), it can only read data by specifying a table name, thus lifting the whole table. That would be fine if we were dealing with tables up to a few GB, since the data is transferred using UNLOAD, which is fast, to S3, which is cheap. In my use case, however, some tables will soon be in the TB range, so lifting the whole table is a waste of bandwidth, connection time and, most importantly, money spent on Glue, Redshift and S3.
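
To make the limitation concrete, here’s a minimal sketch of the whole-table lift; the catalog database and table name are hypothetical, and it assumes the Redshift table was already crawled into the Data Catalog:

import sys
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['TempDir'])
glue_context = GlueContext(SparkContext.getOrCreate())

# There's no way to pass a custom SQL query here -- the whole table is
# UNLOADed to the S3 staging directory first.
orders = glue_context.create_dynamic_frame.from_catalog(
    database='my_catalog_db',          # hypothetical catalog database
    table_name='public_orders',        # hypothetical crawled Redshift table
    redshift_tmp_dir=args['TempDir'])  # S3 staging dir used by UNLOAD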

The first workaround was to use the Spark context directly with its JDBC connector. It works nicely for lifting data with custom SQL clauses, allowing for joins and wheres, and results in a DataFrame. Great, and almost done. But now the problem is with upserts. Redshift does not support upserts. The recommended method is to insert into a temporary table, delete all duplicates from the target and then append the new data. The Spark connector has an “execute SQL” method but… it doesn’t support transactions. We definitely want to avoid a situation where the deletion succeeds but the insert is corrupted.
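
For context, a rough sketch of the Spark JDBC lift with a custom query, followed by the upsert we’d want to run atomically (the JDBC URL, credentials and table names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A custom query (joins, WHERE clauses) wrapped as a subquery, so only
# the needed rows are lifted instead of the whole table.
query = "(SELECT id, amount FROM orders WHERE updated_at > '2019-01-01') AS src"
df = (spark.read.format('jdbc')
      .option('url', 'jdbc:redshift://my-cluster:5439/analytics')  # placeholder
      .option('user', 'etl_user')
      .option('password', '...')
      .option('driver', 'com.amazon.redshift.jdbc42.Driver')  # must be on the classpath
      .option('dbtable', query)
      .load())

# The write side needs these statements to land in ONE transaction,
# which the "execute SQL" hook cannot guarantee:
upsert_sql = """
BEGIN;
DELETE FROM target USING staging WHERE target.id = staging.id;
INSERT INTO target SELECT * FROM staging;
END;
"""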

At this point, it felt like being betrayed by the Glue promises; no native support for such a simple use case, and they promised a one-stop shop for all ETL. AWS is pushing hard to make Redshift the default analytics DB, but most tools support either big data dumps into Redshift or exports to other services. If I can’t get AWS’s support, then let me help myself. Let’s import a custom package. But, since I’m going to do that anyway, why not use Lambdas with Step Functions, or Data Pipeline?

Why not Lambda (Step Functions) or Data Pipeline?

Both are viable options and both have their quirks. For one, Data Pipeline is significantly limited in what it can do. Never mind that it looks and feels like a service its owners want to deprecate, but it does something OK and it has dependent users. Not much has changed in the last couple of years and, besides, how seriously can you treat something that requires names to start with “My”, like “MyFirstDataPipeline”? There are RedshiftCopyActivity and SqlActivity, which might be helpful here, but they still require provisioning a resource (EC2 or EMR) to run on. A micro EC2 should be fine, but if I’m going to define everything step by step, I might as well not limit my activity options and go straight to AWS Step Functions.

AWS Step Functions seem to be the proper solution. The list of triggering events and executable actions is constantly growing. It seems easily expandable and, given that many new services get a hook soon after their release, there is hope that this is The AWS orchestration solution. What’s the quirk? Well, we still need to run the query somewhere. The obvious choice is Lambda. In the majority of cases that should be enough, but there’s a max timeout of 15 min and we already have some queries that take about 20 min. There was a hope that, since the query can be written as a Redshift procedure without any output, it shouldn’t require an active connection to finish. Unfortunately, even though neither pg8000 nor psycopg cancels its job on the timeout, Redshift treats the dropped connection as a broken transaction and rolls back. Since a Lambda is a process that lives until another invocation claims the same resources, some hacking might keep the connection from being killed on the timeout, but this wouldn’t be reliable. So, a two-way-door plan: let’s focus on Glue and, if its workflow is too limited, we can execute Glue jobs via Step Functions. Either way, some boilerplate has to be written, so we might as well start with Glue.
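
For the record, kicking off a Glue job from a Lambda behind Step Functions is only a few lines with boto3; the job name and arguments below are hypothetical:

import boto3

glue = boto3.client('glue')

def handler(event, context):
    # Start the (hypothetical) Glue job and hand back the run id so a
    # Step Functions poller state can track its status.
    run = glue.start_job_run(
        JobName='my-etl-job',
        Arguments={'--target_table': event.get('table', 'orders')})
    return {'JobRunId': run['JobRunId']}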

Revelation…

I’ve been using “Glue job” and “Glue Shell job” somewhat interchangeably, but only to say that I tried both of Glue’s solutions. In reality, these two are completely different beasts and shouldn’t stand close to each other (and the documentation should definitely be clearer on this). A Glue job can be either an EMR job (1 DPU) or an EC2 job (0.0625 DPU), in which case it’s effectively a Lambda with a max timeout of 24 hours. Strangely, the mechanism for importing custom packages is significantly different. In the case of the Glue/EMR job, one can zip packages, upload them to S3 and add them via job arguments (--extra-py-files). For the Glue/EC2 job, these extra packages need to be bundled into a single .egg file that exists at the job’s creation. Either case requires pure-Python code without any binary/C bindings, so no psycopg as a connector package and no usage of pip. Difficult, challenging, but that’s fine. Whilst debugging an unsuccessful import, I printed out the packages available in the environment and, lo and behold, the solution was within reach all along. It turns out that there are some officially unmentioned packages, and one of them is PyGreSQL – a PostgreSQL/Redshift connector. This allows executing any query on Redshift without any special magic; just import PyGreSQL and enjoy. With question marks still flying above my head, we reached out to AWS support and followed up with the AWS Glue team. Long story short, they’ll add the package to the officially supported list.
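
A minimal sketch of the trick, using pgdb (PyGreSQL’s DB-API 2.0 module); the connection details are hypothetical:

import pgdb  # PyGreSQL, quietly available in Glue Python Shell jobs

con = pgdb.connect(
    host='my-cluster.example.eu-west-1.redshift.amazonaws.com:5439',  # placeholder
    database='analytics', user='etl_user', password='...')
try:
    cur = con.cursor()
    # The Redshift upsert as one transaction: delete duplicates, then append.
    cur.execute('DELETE FROM target USING staging WHERE target.id = staging.id')
    cur.execute('INSERT INTO target SELECT * FROM staging')
    con.commit()  # both statements land atomically
except Exception:
    con.rollback()
    raise
finally:
    con.close()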

Final solution

After a whole lot of frustration and complaining, we managed to get a lean and extensible solution. Every few hours, a trigger executes a workflow of dependent jobs; some are pure SQL and some are Spark jobs. There’s no timeout problem, we have retries and alarms, and everything is in a CloudFormation script. P1 is ready, and now it’s time to start pulling data instead of waiting for pushes.

Workspace with tmux

Most of my projects are now separated from the beginning into logical entities and put into Docker. This makes deployment and replacement much easier to manage, but development requires many ritual commands.

Tmux to the rescue. Being a terminal/cmdline guy, I like to have all my logs and dashboards in one view. The code below is used for a Flask (Python) backend, a React (JavaScript) frontend and MongoDB.

#!/bin/sh
cd ~/project

# Start MongoDB in the background, persisting data to a local folder.
sudo docker run -d -p 27017:27017 -v ~/project/db/mongo-data:/data/db mongo

# One pane per service (Flask backend, React frontend, Mongo shell),
# plus a separate window for mail.
tmux new-session -d -s Development -n server -c ~/project/server "./venv/bin/python server.py"
tmux split-window -v -c ~/project/client "yarn start"
tmux split-window -h -c ~/project/db "mongo"
tmux new-window 'mutt'
tmux -2 attach-session -d

AWS Glue? No thank you.

I’ve been postponing describing my experience with AWS Glue for a couple of months now. It’s one of those things that I really wanted to get out of my system, but it hurts even to speak up. Let’s end this pain, let’s end it now. Ahem… AWS Glue sucks.

We had a use case of doing a daily ETL job from/to Redshift. The transformation was rather simple, but given that developers would maintain the logic, it was easier to write code than convoluted SQL. The job was small enough (<50k rows) that a Lambda with a longer timeout would probably be just fine; however, since more projects were coming that required larger-scale processing, we were looking for potential candidates. This was a great opportunity to try out a service that boasts so much in its official blurb.

Issues started right away. Their documentation is/was really terrible. It describes how “great things are” rather than what they do. There were two pages dedicated to Redshift, and they were convoluted enough that even the AWS support team had difficulties understanding them. When deploying through CloudFormation, some options were missing and had to be updated manually, like activating a trigger for a cron job(?!). At the time of writing, only Python 2.7 was available, with examples written by some Golang users, something like:

orgs = (orgs
        .drop_fields(['other_names', 'identifiers'])
        .rename_field('id', 'org_id')
        .rename_field('name', 'org_name'))

No doubt AWS Glue will be updated, and most likely it’s much better now than it was 2 months ago. However, I have such a terrible aftertaste from using it that it’s going to be hard to convince me to give it another shot in the near future. For simple tasks, Lambda should be enough, and for larger jobs on a single data source, use EMR. In cases where there are multiple sources with dependencies, orchestrate everything using Data Pipeline. It seems that Glue is an on-demand EMR with a limited, suboptimal configuration, thus leaving you with limited control.

Exist – the personal dashboard

Let me share with you a thing I found on the internet. That thing is called Exist.io. They aren’t paying me for this ad, but I’m still going to do it as it reignited my push for the quantified self.

The premise is simple. Exist.io asks for your permission to access certain services and, on your behalf, collects, aggregates and presents the data as nice graphs. It’s a personal dashboard with metrics, making it easier to observe progress. They also provide means for adding your own metrics, such as a mood scale and a very limited tagging system, though based on the transparent roadmap that might change one day.

Yes, the privacy issue is huge, so you simply have to trust them not to abuse the permissions. However, based on their privacy policy and business model (monthly subscription), I’m less concerned about them than about most other services. Besides, not knowing where exactly these things go or how they are protected, I’m not going to give them access to sensitive data such as email or location. Feel free to steal my sleep time, coffee intake or step count (I might even publish these one day).

The dashboard works for me. It doesn’t give many meaningful insights, but I like that everything is in a single place and I can export it from there to my personal storage. I can also tell what is missing and how I’d like to collect the other information. It even made me start working on my own tools for collecting productivity time and forecasting mood changes and “life cycles”. It’s a great source of personal multivariate time series, understanding of which will (clickbait) literally change your life.

Overall, a big thumbs up. If you’d like to know more about yourself, and especially to have solid numbers in a single place, go for it. And, if you think about yourself in numbers, let me know. I’m planning to work on a few projects that I’d like to release to a broader audience in the “near” future.

Referral link that gives me $2: exist.io

Barren pages get some words raining

There has been a surprising shift in my approach to writing. Any external request to express or clarify something in a written form used to evoke passionate hatred. I could feel the amount of time that was going to be wasted for no good reason whatsoever. What others might complete in 5 min would take me more than an hour. The reason was that many people considered my expression style “weird” or “unusual”, and would often ask for clarifications phrased the way they wanted, just to make sure we were on the same page. Talking is easier because you quickly clarify these concerns and carry on. Writing, however, is a double pain. I’d spend hours rewriting and polishing a couple of sentences, trying to clarify whatever might be potentially unclear, and then I’d spend a couple more hours replying to feedback, which typically meant “unclear.” It’s a frustration spiral.

At times when I’d panic and plead for help, the most common advice was to read more books. Probably great advice, but it doesn’t work for me. As a (too) active person, I have trouble sitting down and reading. My most productive reading happens when I’m walking in a dull room with white noise and without anyone around. As you might expect, this isn’t always possible, so my reading is slow. How can you write clearly if you don’t know the rules, and how do you know the rules if school grammar differs from common folks’ expectations?

The solution isn’t clear, but I think I’m approaching it. Not only is starting a new paragraph no longer problematic, but I’m also spending less time polishing expressed thoughts. A year ago I’d quickly run away from any writing task, but now I’m scribbling a couple of pages every day. Some pages are for myself; personal notes and thoughts to keep my life organized and properly archived. Others are for work. Even though, at the moment, I’m a Software Development Engineer, whom some might imagine solely writing code, I update documents daily, be that documentation, design explanations or idea pitches.

What’s the change? To be honest, I’m not sure and it might be anything. But, if I were to highlight a few potential reasons, three that come to mind are:

  1. “The Elements of Style” by Strunk & White. It took me a while to start reading this book, but the delay was mainly because “it is a book” and books are scary as they typically have more than 50 pages (too much). But, depending on the edition, it’s about 30 pages long and concisely explains the “dos and don’ts” of writing. Logical rules to follow; a simple guideline; a blessing!
  2. Grammarly. A service and a web application acting as a spell-checker on steroids. It detects grammar misuses, unclear sentences, word repetitions, and has a built-in thesaurus. Each detection comes with a brief explanation of why it was triggered and how to solve the issue. Given the broad demand and lack of alternatives, there’s definitely more to come from Grammarly.
  3. Lower expectations. The quality bar for writing outside academia is significantly lower. This isn’t an insult. It’s actually great to not stress out about imperfect sentences and slightly ambiguous words. Words don’t carry that much liability, so people are encouraged to make mistakes that can be cleaned up in the process. The focus is more on the story progression and less on defending thoughts. A lower threshold realistically allows learning from experience.

This blog as a whole was intended to be scientific. Even when I wrote that it’d be less ‘scientific’ than before, I meant that it’d be less ‘precise’ in its wording. Given that I have already left academia and my ties with the scientific side are weakening, the intention needs to change. Expect more posts on general topics, but still in an engineering/technology theme. I’m still considering a separate blog dedicated to philosophy, so, who knows?

Project template

Having more time, I started to retrospect on my previous projects and attempted to extract the characteristics that lead either to success or failure. As one might expect, these characteristics change over time. Currently, I’m better off dividing work into chunks and dealing with them one at a time, but a while back it was easier to just get to work and let the immediate task pop out. Going forward, my habits and motivation will definitely change again, so it’s difficult to impose a strict ‘successful project’ template; but we aren’t setting things in stone, and if needed we’ll change them.

Analysing notes from the last 10 years, I noticed that there were a few common parts which, if defined properly, allowed me to succeed in the project or fail gracefully. I say “properly” because sometimes I tried to cheat myself, expecting that future me might forget and be fooled by the writing. However, the only thing I managed to forget was that the “forgetting” part doesn’t work. Even now, I have a vivid memory of changing the expectations for each project but had to think for a bit about why I did that. Hopefully, I’m wiser now.

Now, whenever I’m thinking about starting a new project it goes through a few phases.

Project phases

Phase 0

Quickly write a sentence or two about the project. It’s usually a sentence on what the idea is and how I came to think about it. This typically allows me to let the thought go for a while and continue with whatever I was doing before. For such quick notes I’m using Google Keep – two taps to input a note – but I’m looking to escape Google. Once it’s written, I can let it mature for some time, typically until the next evening.

Phase 1

Note: If the project is super short, or it doesn’t make sense to start it within the next few weeks, I’ll skip over this phase or delay it.

Typically, a day’s delay and a full night’s rest allow me to validate whether it makes sense to devote some time to the idea. In this phase I’ll open a text editor and try to write at least 100 words on how I see this idea growing and what it would mean to do it. Writing it out prevents me from cheating myself with the optimism of the moment. It’s a bit tedious to write about something that sounds awesome in my head, but the logic is: if I can’t be bothered to spend 5 min writing about the project, it’s unlikely I’m going to want to work on it.

Phase 2

Create a document to monitor the progress of the project and populate the top with the following template:


{Project Name}

Success: {The definition of successfully finished project.}
The **Why** statement: {Why am I doing this project?}
Expected length:
    Coarse: {Long/mid/short term}
    Fine: {XX weeks/months}
Health check: {Define period of time to check on the project.}

The goal is to check these properties every {health check} period and make sure that they’re still valid. This won’t be successful unless the table is short and easy to read. The rest of the document should be devoted to progress updates. Depending on the project, this can be in the form of a timeline with timestamps, or a living document. I’ve started using http://notion.so/ for this; it’s a collaborative, real-time doc with plenty of keyboard shortcuts and markdown support.

Phase 3

This phase is about execution. Every {health check}, review the priority of the project and set aside some time for it. If it happens that for a few {health checks} you weren’t able to do anything related, go ahead and update the project. I like to keep track of ongoing projects on a special Trello board and create individual tasks on a “daily” board. I could’ve used Notion for this as well, but I find Trello awesome for day-to-day tasks, and it’s easy to accidentally check what’s going on when you’re a click away with a constant reminder.

It’s just a phase

There’s no point in cheating yourself that you’re going to do something when you don’t want to. Almost everyone I’ve talked to has experienced the inability to do anything, caused by mental overload with things that had to be done. I’m assuming that the driver for these projects wasn’t necessity but the choice to do something awesome in one’s spare time. If that’s the case, feel free to let some go. That’s difficult due to the sunk cost fallacy, but the goal of this template is to help rationalise the decision. Our values and motivation change, and so should our priorities. (It probably won’t be long before I feel stupid for ever writing this blog post.)