Google wants back my microphone

My “writing” work currently goes somewhere else and I have little motivation to write anything here. But there’s something that only the internet can help with, whether through actual help or simply by absorbing my annoyance.

In the past few days/weeks there has been some uproar about Facebook listening to us and later subtly suggesting products we had talked about with others. With these stories it’s hard to tell who is objective, so I’ll just paste links to web searches and I’m sure you’ll find some “evidence” – Google, Bing and DuckDuckGo. Let me also suggest the Reply All podcast, which recently had an episode on this, mysteriously called Is Facebook Spying on You?. Obviously Facebook denies all of this, but they confirm having lots of information about you, whether that’s from you directly or from your friends.

Facebook and I haven’t been on good terms for a long time. For me it’s more of a fun social experiment than an actual social platform. Since it isn’t on my phone there’s nothing to complain about there, but there’s another omnipresent god – Google. I actually have one of its branded phones with Google Assistant turned on, so it had to be there and it had to listen to me.

Long story short, I removed microphone permissions from all Google services. Obviously some weren’t happy with this, but I can’t see how it should affect their usage. Except for Google Assistant or the occasional voice-input feature, nothing should care, right? No. This is a really tough break-up, as from time to time I get vocal suggestions that are close to being commands. Google calls out to me: “When it’s safe, you’ll first need to use your phone’s screen and tap the notification, then you can let the Google App access some things on your device.” This is especially annoying when I’m listening to podcasts or music.

In the beginning this would go on and on, but now it’s more like once a day. I don’t think it has some “time decreasing” variable built in, so it must be reacting to something I do. More surprising is that even if I quickly unlock the phone, there won’t be anything new to give permissions to. Also, it might only be happening when the phone is locked, as I haven’t had it happen otherwise.

Free AWS is good. Not awesome, but good.

Amazon, with its Amazon Web Services (AWS), is pretty cool. It gives you access to a remote machine which you don’t have to maintain. Actually, you don’t have to do anything other than use it. The machines come in different flavours, but what tastes better than free? Granted, the free one is extremely limited, but surely we can squeeze something out of it. Right?

AWS instances, i.e. remote machines, differ in the amount of RAM, disk space, operating system, whether they have GPU access and so on. As you can expect, a free tier instance is pretty low on all of these. To be more precise, the free tier instance is of the t2.micro type, which is a general purpose burstable instance with a single vCPU, 1 GiB of memory and EBS data storage (4 GB by default).

What is this good for? Depending on your needs, it might be good for almost anything that doesn’t require whatever these instances are lacking. (Did I help?) Obviously it’s not so good for heavy computations, training machine learning models or storing data. For those it’s usually better to use other services, like S3, DynamoDB, Lex or the general machine learning offerings. And in case of more specific requirements, it’s always an option to rent a more powerful instance.

These cheap instances, in my opinion, are very good for a few tasks. The main one is web scraping. This is a tedious task that requires little CPU power but constant access to the internet. Moreover, we don’t really want to make many calls in a short time period, so there needs to be a delay between each download. That’s either because we’d like to avoid being detected as a bot, or simply out of politeness to the owner of the server (not clogging their bandwidth).
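
To make this concrete, here is a minimal sketch of such a throttled download loop (the URL list, filename handling and delay are placeholders for illustration, not the actual r-u-listening code; it assumes the requests package is installed):

import time
import requests  # third-party HTTP library; assumed to be available on the instance

TRACK_URLS = ["https://example.org/track-001.mp3"]  # placeholder list of targets
DELAY_S = 5  # seconds to wait between requests, out of politeness to the server

for url in TRACK_URLS:
    response = requests.get(url, timeout=30)
    if response.ok:
        filename = url.rsplit("/", 1)[-1]
        with open(filename, "wb") as f:
            f.write(response.content)  # save the downloaded track to disk
    time.sleep(DELAY_S)  # throttle so we don't look like an aggressive bot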

The internet is full of examples of scrapers for different types of data. I’m adding my own to the collection with the r-u-listening project. The core of the project is to allow users to find music similar to their input. It is a bit more than a recommender, but more on this project probably in the future. The scraper itself comes in two parts, i.e. crawler.py and scraper.py. The database that I’m using is FreeMusicArchive.org, which goes with the slogan “It’s not just free music; it’s good music”. I do recommend it, and once I have something valuable I’d like to share it with them.

Unfortunately these instances don’t come with much default memory and storage. By default they have only 4 GB of storage, which, when downloading mp3 tracks, is enough for about 800 tracks (assuming about 5 MB per track). Again, as always, it depends on the task, but for machine learning algorithms we go with “the more, the merrier”.

Free tier instances actually allow up to 32 GB of EBS storage. To get this, go to the EC2 service in your AWS console. In the options panel (left side) find Elastic Block Store (EBS) and select Volumes. Then select the volume attached to your instance, click Actions, and Modify Volume. Simple, right? In all honesty, like many things in AWS.
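
The same can also be done programmatically. Below is a minimal sketch using boto3, the AWS SDK for Python (the region and volume ID are placeholders, and I’m assuming credentials are already configured):

import boto3  # AWS SDK for Python; assumes credentials are already set up

ec2 = boto3.client("ec2", region_name="eu-west-1")  # example region

# Placeholder volume ID - yours is listed under EC2 -> Elastic Block Store -> Volumes.
ec2.modify_volume(VolumeId="vol-0123456789abcdef0", Size=32)  # new size in GiB

# Note: after the volume grows, the filesystem on the instance still needs to be
# extended (e.g. with growpart and resize2fs on Linux).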

I’ve been using AWS for a while. I even finished the AWS general course, its Essentials training and a 3-day onsite workshop on Architecting on AWS. It’s all pretty simple and consistent. I like it.

Matrix Multiplication with Python 3.5

Only recently have I started to use Python 3. It’s been out for a good 8+ years, and all those excuses about incompatibility with some packages were just laziness. Most packages I use have already been ported, and if I ever find that something is incompatible… well, I’ll think about it then. But for now let me pat myself on the back for this great leap, because:

In Python 3.5.3 (released today) there is an operator for matrix multiplication! Check out: PEP 465 — A dedicated infix operator for matrix multiplication. The choice of operator, @, is a bit unfortunate because of decorators and its general association with references/the internet, but seeing how few possibilities are left, it’s probably the best choice.

Yes, this is big news for me. The number of times I have confused myself with my own matrix operations is just too damn high! I cannot agree more with the author of PEP 465, so let me shamelessly copy & paste (paraphrase) his reasoning. Behold!

(…) encounter many mathematical formulas that look like:

S = (H\beta - r)^T (H V H^T)^{-1} (H\beta - r)

Here the various variables are all vectors or matrices (details for the curious: reference [5] in the PEP).

Now we need to write code to perform this calculation. In current numpy, matrix multiplication can be performed using either the function or method call syntax. Neither provides a particularly readable translation of the formula:

import numpy as np
from numpy.linalg import inv, solve

# Using dot function:
S = np.dot((np.dot(H, beta) - r).T,
           np.dot(inv(np.dot(np.dot(H, V), H.T)), np.dot(H, beta) - r))

# Using dot method:
S = (H.dot(beta) - r).T.dot(inv(H.dot(V).dot(H.T))).dot(H.dot(beta) - r)

With the @ operator, the direct translation of the above formula becomes:

S = (H @ beta - r).T @ inv(H @ V @ H.T) @ (H @ beta - r)

Notice that there is now a transparent, 1-to-1 mapping between the symbols in the original formula and the code that implements it.

Of course, an experienced programmer will probably notice that this is not the best way to compute this expression. The repeated computation of H\beta - r should perhaps be factored out; and, expressions of the form dot(inv(A), B) should almost always be replaced by the more numerically stable solve(A, B). When using @, performing these two refactorings gives us:

# Version 1 (as above)
S = (H @ beta - r).T @ inv(H @ V @ H.T) @ (H @ beta - r)

# Version 2
trans_coef = H @ beta - r
S = trans_coef.T @ inv(H @ V @ H.T) @ trans_coef

# Version 3
S = trans_coef.T @ solve(H @ V @ H.T, trans_coef)

Notice that when comparing between each pair of steps, it’s very easy to see exactly what was changed. If we apply the equivalent transformations to the code using the .dot method, then the changes are much harder to read out or verify for correctness:

# Version 1 (as above)
S = (H.dot(beta) - r).T.dot(inv(H.dot(V).dot(H.T))).dot(H.dot(beta) - r)

# Version 2
trans_coef = H.dot(beta) - r
S = trans_coef.T.dot(inv(H.dot(V).dot(H.T))).dot(trans_coef)

# Version 3
S = trans_coef.T.dot(solve(H.dot(V).dot(H.T)), trans_coef)

Readability counts! The statements using @ are shorter, contain more whitespace, can be directly and easily compared both to each other and to the textbook formula, and contain only meaningful parentheses. This last point is particularly important for readability: when using function-call syntax, the required parentheses on every operation create visual clutter that makes it very difficult to parse out the overall structure of the formula by eye, even for a relatively simple formula like this one. Eyes are terrible at parsing non-regular languages. I made and caught many errors while trying to write out the ‘dot’ formulas above. I know they still contain at least one error, maybe more. (Exercise: find it. Or them.) The @ examples, by contrast, are not only correct, they’re obviously correct at a glance.

Again: yes!

More links

A while ago I started to get a taste of how it feels to work in industry, and it feels quite nice. Maybe that’s the specificity of my field projected onto the industry, or me being tired of how academia works, but I’m enjoying immensely learning all the details about computer science, programming and the newest technologies.

In addition to the last post about Data Science, which is still my main daily ‘look for’, I’ve started to dive deep into computer science. Obviously there are plenty of good information sources and excellent tutorials. The aggregators that I’m exploiting right now are:

 

I’m planning to add a subpage with links for further reference. Any suggestions are welcome!

Data Science podcasts

I’m an avid podcast listener. Whenever there’s something that only requires sight or not much focus, I’ll try to do it with my headphones on. The great thing about podcasts is that they are more up-to-date than audiobooks and come in reasonably short lengths, so there’s always one that fits.

Wanting to be more current with machine learning topics, I’ve found a few podcasts. These are my recommendations:

Machine Learning competition on Seizure Prediction

tl;dr: Read the subject and click on the link below.

Some of you, i.e. those lucky ones with a connection to the outside world, are probably aware of the Machine Learning community trying to aggressively change our lives for the better. Regardless of whether we like it or not, they’re doing it. Some do this for money, others for fame, and the wicked ones just for fun.

Kaggle is a website that hosts Machine Learning competitions. They provide data (usually donated by companies or public organizations) and set a goal. Past goals have included detecting and classifying specific whale species from satellite images, or driving a remote car based on an hour of recording, or identifying patients who will return based on historic records, or … It’s actually pretty big. Some prizes can be as large as $500,000.

The reason behind this email is one of Kaggle’s recent competitions — Predicting seizures in long-term human intracranial EEG recordings. The challenge is to “distinguish between ten minute long data clips covering an hour prior to a seizure, and ten minute iEEG clips of interictal activity.” Yes, many people have tried this, and it’s ongoing research in many labs. The difference here is that you can actually see people’s attempts and their code. You can read their discussions and follow their logic. It looks like an amazing source of information! Moreover, good contestants are really good at machine learning and they often do their work properly, i.e. complete the challenge.

All of this is really relevant to me, as my background is largely in EEG analysis and machine learning. The timing is a bit unfortunate, but I might give it a go.

http://blog.kaggle.com/2016/10/14/getting-started-in-the-seizure-prediction-competition-impact-history-useful-resources/

Facebook birthday

This blog needs some lightening up, and what better way to do that than to post some graphs? Exactly! I like to collect data on various things and then make graphs of them. Things are much better when presented with axes and some numbers around them.

Not so long ago I had a birthday. Although it happens rather regularly, the last one was special. For the first time I had enabled the birthday notification on Facebook. It appears that people (a.k.a. friends) are rather nice and willing to write something positive about you on that day. Here’s to the memory of that special day!

Summary:
I scraped the Facebook posts and messages that arrived within 48 h of 00:01 AM on the birthday. Their sent times are aggregated on the graph, and the text content is summarised in word stats.

 

The green graph displays wish density as a function of time, obtained using Gaussian kernel density estimation. Blue dots show the cumulative wish count. Both are normalised so that the highest value is 1. Statistically speaking, green and blue show (up to constants) the probability density function and the cumulative distribution function, respectively.
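
For the curious, the graph boils down to something like the sketch below. It assumes scipy and matplotlib, and the wish times here are placeholders rather than the real data:

import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

# Placeholder wish times, in hours since 00:01 on the birthday (not the real data).
wish_hours = np.array([0.5, 7.2, 9.8, 10.1, 10.4, 11.0, 13.5, 20.2, 21.1, 22.7])

grid = np.linspace(0, 48, 500)
density = gaussian_kde(wish_hours)(grid)  # smoothed wish density
cumulative = np.searchsorted(np.sort(wish_hours), grid, side="right")  # cumulative count

# Normalise both curves so that their maximum is 1, as in the figure.
plt.plot(grid, density / density.max(), "g-", label="wish density (KDE)")
plt.plot(grid, cumulative / cumulative.max(), "b.", label="cumulative count")
plt.xlabel("hours since 00:01 on the birthday")
plt.legend()
plt.show()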

What is nice about this graph is that it generally shows at what time of day my peers are active. It seems that the majority are active in the morning, around 10–11 am, and after 8 pm, which is reassuring: they are very normal, average people.

The extracted content gives additional insight. Although there isn’t enough data, I think it nicely hints at my background. I’ll leave the interpretation to you, but will also point out that the percentage of friends who sent wishes falls close to the ratio of people who read my posts, i.e. 12–16% (even though the two aren’t really related, as the birthday notification goes to everyone).

Wishes:
Happy Birthday: 34 (38.20%)
Wszystkiego najlepszego/All the best: 11 (12.36%)
(tylko) Najlepszego/(only) The best: 10 (11.24%)
Sto lat/100+ years: 16 (17.98%)

Emoji/Emoticons:
Smileys: 25 (28.09%)
Kisses: 13 (14.61%)
Hearts: 3 (3.37%)

Reference:
Dawid: 20 (22.47%)
You: 17 (19.1%)
Boy: 2 (2.23%)

Total number of exclamation marks (!):
!: 82
!!: 11
!!!: 4
!!!!: 2
!!!!!: 1

Frequentism and Bayesianism

This won’t be a long entry. I just wanted to share a nice introductory tutorial on Bayesian statistics and its comparison with frequentism. It didn’t start my interest in Bayesian methods, but it definitely provided a few nice insights. I have referred to it many times, so it is only fair that I share it with a bigger audience.

Tutorial: http://jakevdp.github.io/blog/2014/03/11/frequentism-and-bayesianism-a-practical-intro/

Paper partially based on referred content: http://arxiv.org/abs/1411.5018

Mode mixing in EMD

One of the problems researchers claim that empirical mode decomposition (EMD) has is the difficulty of unmixing mixed modes [1]. Usually this means that EMD cannot (always) untangle two or more mono-frequency sine components s(t) = \sum_{n=0}^{N} A_n\sin(\omega_n t). Some say that this is a big issue and propose tweaks to EMD [2], additional methods [3], or restrictions on when extraction is possible [4]. The most widely analysed case is N=2, i.e. s(t) = \sin(t) + b \sin(\omega t), where b and \omega are the relative amplitude and frequency respectively. This has been studied thoroughly in [4].

Why is this not necessarily an issue? EMD is supposed to extract intrinsic mode functions (IMFs), i.e. functions with a single mode/frequency. When hearing “oscillation” many people probably think of a sinusoid, but an oscillation can actually take any form. Wavelets are an example of odd-looking oscillations (small waves that have a beginning and an end). As for the definition of oscillatory behaviour, there is no single best one. Huang et al. [5] defined an IMF as a function that:

  1. has numbers of local extrema and zero-crossings that are equal or differ by at most 1,
  2. has top and bottom envelopes whose mean is zero.

The authors were aiming to meta-define a function that is suitable for the Hilbert transform, through which they defined the instantaneous frequency \omega(t). The components of the decomposition, i.e. IMFs, are thus described by the formula c_j (t) = a(t) \cos( \phi(t) ), where a(t)>0 is an amplitude and \phi(t) = \int \omega(t) dt is a phase. Representing a function as a product of two other functions is ambiguous, just as the number 5 can be presented as 5 = 2 \cdot 2.5 or 5 = 0.2 \cdot 25.
This is where the Hilbert transform plays a role. It provides a recipe for representing a signal as an amplitude function times a phase function, but it is still only one of the possible methods. Nevertheless, once we allow a signal to have amplitude modulation, mode mixing makes little sense. Taking two sinusoids as an example, they can either be written in the form s_{I} (t) = \sin( \omega_1 t) + \sin( \omega_2 t) or, using trigonometric identities, as s_{II} (t) = 2 \cos(\frac{\omega_1 - \omega_2}{2} t ) \sin(\frac{\omega_1 + \omega_2}{2} t). In the first form we have two sines with constant amplitudes, whereas in the second there is a single sine of frequency 0.5(\omega_1+\omega_2) whose amplitude is modulated by 2\cos(\frac{\omega_1 - \omega_2}{2} t ). Both forms give exactly the same result – the difference is only in the written representation. It is up to the user to decide how to see it.
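
A quick numerical check of that equivalence, and of the amplitude/phase view, is sketched below (the frequencies are arbitrary examples, and the envelope is recovered with scipy’s Hilbert transform):

import numpy as np
from scipy.signal import hilbert

t = np.linspace(0, 2, 2000, endpoint=False)
w1, w2 = 2 * np.pi * 17.5, 2 * np.pi * 19.5  # example angular frequencies

s_one = np.sin(w1 * t) + np.sin(w2 * t)  # two constant-amplitude sines
s_two = 2 * np.cos((w1 - w2) / 2 * t) * np.sin((w1 + w2) / 2 * t)  # one modulated sine

print(np.allclose(s_one, s_two))  # True - both forms describe the same signal

# Amplitude envelope a(t) from the analytic signal; away from the edges it is
# close to |2 cos((w1 - w2)/2 * t)|, the modulating term from the second form.
envelope = np.abs(hilbert(s_one))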

A numerical example. Given two sines, EMD behaves differently depending on parameters such as the relative amplitude and the relative frequency. Figure 1 presents the EMD decomposition of the signal S_1 = \cos(15\pi t ) + \cos(39\pi t). It can be seen that the two extracted IMFs have exactly the frequencies (number of extrema divided by 2) from which the signal was synthesised. This is due to the large difference between the frequencies used. If one uses close values, i.e. allowing for beating, then most of the information is contained in the first IMF (Figure 2). To produce this figure, sines with frequencies 17.5 Hz and 19.5 Hz (S_2 = \cos(35\pi t) + \cos(39\pi t)) were used. The second row of Figure 2 shows the difference between the input signal and the first IMF – barely visible on the same amplitude scale.
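
To reproduce a similar experiment, any EMD implementation will do; here is a minimal sketch assuming the PyEMD package (not the exact code used for the figures):

import numpy as np
from PyEMD import EMD  # assuming the PyEMD package is installed

t = np.linspace(0, 1, 1000, endpoint=False)
s1 = np.cos(15 * np.pi * t) + np.cos(39 * np.pi * t)  # 7.5 Hz + 19.5 Hz (Fig. 1)
s2 = np.cos(35 * np.pi * t) + np.cos(39 * np.pi * t)  # 17.5 Hz + 19.5 Hz (Fig. 2)

emd = EMD()
imfs_far = emd(s1, t)    # well-separated frequencies: two clean IMFs expected
imfs_close = emd(s2, t)  # close frequencies: most energy stays in the first IMF

print(imfs_far.shape, imfs_close.shape)  # (number of IMFs, number of samples)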

Fig. 1. Input signal for EMD consists of two sines with frequencies 7.5 and 19.5 Hz. Those modes have been extracted as IMFs – the first contains 19.5 Hz and the second 7.5 Hz.

Fig. 2. EMD performed on a signal made of two sines with 17.5 and 19.5 Hz frequencies. In the first row the original signal (blue) and the first IMF (green) overlap. The second row shows the difference between them.

References:
[1] N. E. Huang and Z. Wu, “A review on Hilbert-Huang transform: Method and its applications to geophysical studies,” Rev. Geophys., vol. 46, no. 2, p. RG2006, Jun. 2008.
[2] Z. Wu and N. E. Huang, “Ensemble Empirical Mode Decomposition: A Noise-assisted Data Analysis Method,” Adv. Adapt. Data Anal., vol. 01, no. 01, pp. 1–41, Jan. 2009.
[3] R. Deering and J. F. Kaiser, “The Use of a Masking Signal to Improve Empirical Mode Decomposition,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’05), 2005, vol. 4, pp. 485–488.
[4] G. Rilling and P. Flandrin, “One or Two Frequencies? The Empirical Mode Decomposition Answers,” IEEE Trans. Signal Process., vol. 56, no. 1, pp. 85–95, Jan. 2008.
[5] N. E. Huang, Z. Shen, S. R. Long, M. C. Wu, H. H. Shih, Q. Zheng, C. C. Tung, H. H. Liu, and N.-C. Yen, “The Empirical Mode Decomposition and the Hilbert Spectrum for Nonlinear and Non-stationary Time Series Analysis,” Proc. R. Soc. A Math. Phys. Eng. Sci., vol. 454, no. 1971, pp. 903–995, Mar. 1998.

This blog.

It didn’t take long to forget about this blog. Now, hopefully with renewed motivation to post more regularly, I would like to reactivate it. The content should be related to the work I’m currently doing, and I will try to post at least once a week, probably on Tuesdays.
Yes, this is more of a statement for myself. A reminder for future me!
Come on lad!