Let me just do this manually

Have you seen the XKCD comics drawn by Randall Munroe? I definitely recommend checking them out. They are quite geeky but cover a wide range of topics. In particular, there is one strip drawn as a spreadsheet-like table of “how often used” vs “time saved”. It’s good generic guidance to consider when you’re tired of a mundane task and thinking about automating it. My personal spreadsheet doesn’t exist, but if I were to make one it would have additional dimensions, e.g. the value of what I learn. Even if I’m not saving much time, the knowledge gained and the curiosity itch scratched are a big win.
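
For the curious, the arithmetic behind that table fits in a few lines. This is only a rough sketch: the five-year horizon follows the comic, while `learning_bonus_hours` is my own made-up knob for the “learned value” dimension, and all function names are invented for illustration.

```python
def automation_budget_hours(seconds_saved_per_run, runs_per_day, horizon_years=5):
    """Hours the task costs over the horizon, i.e. the break-even automation effort."""
    total_seconds = seconds_saved_per_run * runs_per_day * 365 * horizon_years
    return total_seconds / 3600


def worth_it(hours_to_automate, seconds_saved_per_run, runs_per_day,
             learning_bonus_hours=0.0, horizon_years=5):
    """True if automating pays off, optionally crediting what was learned.

    learning_bonus_hours is a hypothetical extra dimension: how many hours
    of effort the knowledge gained is worth to you.
    """
    budget = automation_budget_hours(seconds_saved_per_run, runs_per_day, horizon_years)
    return hours_to_automate <= budget + learning_bonus_hours


if __name__ == "__main__":
    # A task that takes one minute, done once a day.
    print(round(automation_budget_hours(60, 1), 1), "hours of budget")
    print(worth_it(72, 60, 1))                           # three days of tinkering
    print(worth_it(72, 60, 1, learning_bonus_hours=50))  # ...but I learned a lot
```

With the learning bonus set to zero this is just the comic's table; the bonus is where the personal judgement goes.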

Recently I’ve embraced containerization through the Docker family. These lightweight, all-inclusive environments let you develop and build locally and be confident that things will behave the same once deployed. Smack everything in containers, create a YAML template for docker-compose and deploy to a remote host through docker-machine. Quick and easy. Except for some caveats.

One of the issues that almost made me regret all these containers was getting CA-signed certificates from Let’s Encrypt to terminate TLS on the remote host. In short, to obtain a certificate you need to prove that you control the domain by responding in a specific way to a specific request. Fine, but to do that you first need to make the domain responsive, so you need some certificates, which you don’t have since that’s exactly what you’re trying to get. What to do? Get some self-signed certificates, ask Let’s Encrypt for the real ones, replace the old ones and start serving the new ones. Doing this manually takes a couple of minutes and can be done with a combination of ssh-ing and running a script. Having it done automatically as part of a deployment to any host is not that simple.

A number of blogs describe what to do in this situation, but most (that I’ve seen) still focus on running docker-compose from within the remote host. Unfortunately, that isn’t what I want; for small projects, I want to run a single command from my local machine and have everything done automagically. So I spent days trying out some solutions. My two favourites, which I’ll return to in the future, are Traefik and the docker-letsencrypt-nginx-proxy-companion sidecar. The former is an nginx replacement with a dashboard and built-in Let’s Encrypt support, whereas the latter is a container that works with two others to do some magic. In both cases, you configure the relations through either environment variables or labels, and then everything should just work. Well, it should, but I haven’t actually managed to make either of them work. The Traefik approach is nicely documented by the DigitalOcean writers, though it takes a while to configure everything properly. The other, nginx-based one is a bit outdated, and updating it to, for example, docker-compose v3… didn’t go that well.

All in all, I tried to make things run smoothly and automatically so that I’d never need to do them by hand again. What I ended up doing was spending 10 minutes doing things manually: copying the files over with `scp` and updating the volume references. Quick and easy. Even writing documentation on how to do it again in the future took only a couple of extra minutes, not days.

Have I done what I set out to do? Yes. I learned new technologies by testing them out and finding out what is where, and I also streamlined future releases by having better documentation and an explanation of why the other approaches didn’t help.

AWS Glue? No thank you.

I’ve been postponing describing my experience with AWS Glue for a couple of months now. It’s one of those things that I really wanted to get off my chest, but it hurts even to speak about it. Let’s end this pain, let’s end it now. Ahem… AWS Glue sucks.

We had a use case for a daily ETL job from/to Redshift. The transformation was rather simple, but given that developers would maintain the logic, it was easier to write code than convoluted SQL. The job was small enough (<50k rows) that a Lambda with a longer timeout would probably have been just fine; however, since more projects requiring larger-scale processing were coming, we were looking for potential candidates. This was a great opportunity to try out a service that boasts so much about itself in the official blurb.

Issues started right away. The documentation is/was really terrible. It describes how "great things are" rather than what they do. There were two pages dedicated to Redshift, and they were convoluted enough that even the AWS support team had difficulties understanding them. When deploying through CloudFormation some options were missing and had to be updated manually, like activating the trigger for a cron job(?!). At the time of writing, only Python 2.7 was available, with examples that look as if they were written by some Golang users, something like:

orgs = orgs.drop_fields(['other_names',
'identifiers']).rename_field(
'id', 'org_id').rename_field(
'name', 'org_name')

No doubt AWS Glue will be updated, and most likely it’s already much better than it was 2 months ago. However, it left such a terrible aftertaste that it’s going to be hard to convince me to give it another shot in the near future. For simple tasks Lambda should be enough, and for larger ones on a single data source use EMR. When there are multiple sources with dependencies, orchestrate everything using Data Pipeline. Glue seems to be an on-demand EMR with a limited, non-optimal configuration, which leaves you with limited control.

Toggling academia status to halted

There was a significant update to my title. Since the end of November, I am officially a PhD. The relief is immense. Obviously, life goes on and nothing has significantly changed on the outside, but I can see that my approach to things has lightened up and the “Yes, can do” attitude has returned. I’m open to new projects and ideas.

Surprisingly enough, just before submitting the final version, I started (again?) to recognise the greater contribution that the work has and might have. The Machine Learning community is again gradually incorporating model-based approaches and going smaller on distance (calculus). Such progress opens up opportunities to apply my work to a broader area of interest.

When will this finish…

For the past few years, my life has been on hold. Yes, I go to work and do something there, but I’m still spending the majority of my time on the PhD. It’s such an existential trap. It’s close to the second year of trying to impress a single person who doesn’t really care. It’s close to four years of trying to improve some idea that I had and thought might work because the previous 3 years gave no results.

When I started the PhD I was motivated, interested in everything and shaking with excitement that I’d be pushing humanity forward. Now, I just want to do the minimum required. In hindsight, I’ve wasted my life. Nothing good is coming from this. Hopefully, that is “yet”. December is in or out and, at this stage, I don’t really care.

AWS Polly GUI

Although learning from books is the best, my personal relationship with reading is not the friendliest. Staying focused on the text is a huge struggle, and I often need to re-read sentences to actually take them in. That’s why I sometimes use text-to-speech (TTS) software or services.

A few years ago I discovered the Ivona text-to-speech software, which was far superior to any other TTS solution. It was able to quickly read out loud (and clear) the text from my clipboard. Not only was it better than the others, it also supported Polish, my language. Even though the default software wasn’t useful for my use cases, e.g. scientific papers have unusual formatting, it wasn’t that difficult to write a wrapper and GUI around Ivona. Unfortunately, it’s not supported anymore and one cannot download the offline version.

Currently, Ivona is owned by Amazon and its voices are accessible through the AWS Polly service. It’s a relatively cheap service, but one still has to have an internet connection and it doesn’t come with any GUI. At least officially.

I’ve written an application that uses AWS Polly. It’s a simple graphical interface with some formatting options for the text, but it does its job. The AWS Polly GUI is available from my GitHub page. It runs on Python 3 with PyQt5.
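
To give a feel for what happens under the GUI, here is a minimal sketch of the two core steps: tidying up text copied from a paper and sending it to Polly. It is not the actual code of the application. The function names are made up for this post, `polly_client` is assumed to behave like `boto3.client("polly")`, and “Ewa” is simply one of the Polish Polly voices.

```python
import re


def tidy_for_speech(text):
    """Make text copied from a PDF easier to read aloud.

    Naive on purpose: it also glues genuinely hyphenated words
    that happen to break at a line end.
    """
    text = re.sub(r"-\s*\n\s*", "", text)   # re-join words split across lines
    text = re.sub(r"\s*\n\s*", " ", text)   # remaining line breaks become spaces
    return re.sub(r"\s{2,}", " ", text).strip()


def synthesize(polly_client, text, voice_id="Ewa"):
    """Return MP3 bytes for `text`.

    polly_client is assumed to expose synthesize_speech() the way
    boto3.client("polly") does.
    """
    response = polly_client.synthesize_speech(
        Text=tidy_for_speech(text),
        OutputFormat="mp3",
        VoiceId=voice_id,
    )
    return response["AudioStream"].read()
```

Typical use would be `audio = synthesize(boto3.client("polly"), clipboard_text)` followed by writing `audio` to an `.mp3` file or handing it to a player. Polly limits the length of a single request, so longer texts would need to be split into chunks first.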

Features are added as needed, so if something might be helpful to anyone, feel free to contact me or create an issue on the repository. I’m using this for my personal work, so I’m not planning on abandoning it.

Google wants back my microphone

My “writing” work currently goes somewhere else and I have little motivation to write anything here. But there’s something that only the internet can help with, whether that’s through actual help or simply by absorbing my annoyance.

In the past few days/weeks there has been some uproar about Facebook listening to us and later subtly suggesting products that we talked about with others. With these stories it’s hard to tell who is objective, so I’ll paste links to web searches and I’m sure you’ll find some “evidence”: Google, Bing and DuckDuckGo. Let me also suggest the Reply All podcast, which recently had an episode on this, mysteriously called “Is Facebook Spying on You?”. Obviously Facebook denies all of this, but they confirm having lots of information about you, whether that’s from you directly or from your friends.

Facebook and I have not been on good terms for a long time. It’s more a fun social experiment than an actual social platform. Since it isn’t on my phone there’s nothing to complain about, but there’s another omnipresent god: Google. I actually have one of its branded phones with Google Assistant turned on, so it had to be there and had to listen to me.

Long story short, I removed microphone permissions from all Google services. Obviously some weren’t happy with this, but I can’t see how it should affect their usage. Except for Google Assistant or occasional input features, nothing should care, right? No. This is a really tough break-up, as from time to time I get spoken suggestions that are close to being commands. Google tells me: “When it’s safe, you’ll first need to use your phone’s screen and tap the notification. Then you can let the Google App access some things on your device.” This is especially annoying when I’m listening to podcasts or music.

In the beginning this would go on and on, but now it’s more like once a day. I don’t think it has some “time decreasing” variable built in, so it must be something I do that triggers it. More surprising is that even if I quickly unlock the phone, there is nothing new to give permissions to. Also, it might only happen when the phone is locked, as I haven’t seen it otherwise.

Free AWS is good. Not awesome, but good.

Amazon with its Amazon Web Services (AWS) is pretty cool. It gives you access to remote machines which you don’t have to maintain. Actually, you don’t have to do anything other than use them. The machines come in different flavours, but what tastes better than free? Granted, the free one is extremely limited, but surely we can squeeze something out of it. Right?

AWS instances, i.e. remote machines, differ in the amount of RAM, disk space, operating system, whether they have GPU access and so on. As you can expect, the free tier instance is pretty low on all of these measures. To be more precise, the free tier instance is of the t2.micro type, which is a general-purpose burstable instance with a single vCPU, 1 GiB of memory and EBS data storage (4 GB by default).

What is this good for? Depending on the needs, this might be good for almost anything that doesn’t require whatever these instances are lacking. (Did I help?) Obviously. So it’s not so good for heavy computations, training machine learning models or storing data. For these it’s better to use other services in the first place, like S3, DynamoDB, Lex or the general machine learning services. However, in case of specific requirements, it’s always better just to rent(?) a more powerful instance.

These cheap instances, in my opinion, are very good for a few tasks. The main one is web scraping. This is a tedious task that requires little CPU but constant access to the internet. Moreover, we don’t really want to make many calls in a short time period, so there needs to be a delay between downloads. That’s either because we would like to avoid being detected as a bot, or simply out of politeness to the owner of the server (not clogging their bandwidth).
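
The delay part is simple enough to sketch. This is a generic illustration, not code from any particular scraper; the function names and the 2 to 5 second range are arbitrary choices of mine. The pause is randomised so the requests don’t arrive at a perfectly regular beat.

```python
import random
import time
import urllib.request


def fetch(url, timeout_s=30):
    """Download one URL and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return response.read()


def polite_download(urls, fetch_fn=fetch, min_delay_s=2.0, max_delay_s=5.0,
                    sleep_fn=time.sleep):
    """Yield (url, body) pairs, pausing a random interval between requests."""
    for index, url in enumerate(urls):
        if index > 0:
            # No need to wait before the very first request.
            sleep_fn(random.uniform(min_delay_s, max_delay_s))
        yield url, fetch_fn(url)
```

Because it is a generator, each file can be saved as soon as it arrives: `for url, body in polite_download(track_urls): ...`. The `fetch_fn` and `sleep_fn` parameters exist only to make it easy to swap in something else, e.g. for testing without touching the network.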

The internet is full of examples of scrapers for different types of data. I’m adding my own to the collection with the r-u-listening project. The core of the project is to let users find music similar to their input. It is a bit more than a recommender, but more on this project probably in the future. The scraper itself comes in two parts, i.e. crawler.py and scraper.py. The database that I’m using is FreeMusicArchive.org, which goes by the slogan “It’s not just free music; it’s good music”. I do recommend it, and once I have something valuable I’d like to share it with them.

Unfortunately these instances don’t come with much memory or storage. By default they have only 4 GB of storage, which when downloading mp3 tracks will be enough for about 800 tracks (assuming about 5 MB per track). Again, as always, it depends on the task, but for machine learning algorithms we go with “the more, the merrier”.
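
The back-of-the-envelope estimate, as a sketch. It uses decimal units (1 GB = 1000 MB), and `reserved_gb` is a hypothetical parameter for the space taken by the operating system, which the 800 figure above ignores.

```python
def tracks_that_fit(storage_gb, track_mb=5, reserved_gb=0.0):
    """Rough number of tracks that fit on a volume (1 GB = 1000 MB)."""
    usable_mb = (storage_gb - reserved_gb) * 1000
    return int(usable_mb // track_mb)


if __name__ == "__main__":
    print(tracks_that_fit(4))                   # the default volume, all of it
    print(tracks_that_fit(4, reserved_gb=1.5))  # leaving some room for the OS
```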

Fortunately, free tier instances allow up to 32 GB. To get it, go to the EC2 service in your AWS console. In the navigation panel (left side) find Elastic Block Store (EBS) and select Volumes. Then select the volume attached to your instance, choose Actions and then Modify Volume. Simple, right? In all honesty, like many things in AWS.
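
The same can be done without clicking, through boto3. A minimal sketch, assuming `ec2_client` behaves like `boto3.client("ec2")`; the helper name `grow_volume` and the volume ID in the usage note are made up.

```python
def grow_volume(ec2_client, volume_id, new_size_gib):
    """Ask EC2 to enlarge an EBS volume and return the modification state.

    ec2_client is assumed to expose modify_volume() the way
    boto3.client("ec2") does. EBS volumes can grow but not shrink.
    """
    if new_size_gib <= 0:
        raise ValueError("new_size_gib must be positive")
    response = ec2_client.modify_volume(VolumeId=volume_id, Size=new_size_gib)
    return response["VolumeModification"]["ModificationState"]
```

Usage would look like `grow_volume(boto3.client("ec2"), "vol-0123456789abcdef0", 32)`. Either way, console or code, the volume only gets bigger at the block level; the partition and filesystem on the instance still have to be extended afterwards before the extra space shows up.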

I’ve been using AWS for a while. I’ve even finished the general AWS course, its essentials course and a 3-day onsite workshop on Architecting on AWS. Everything is pretty simple and consistent. I like it.