Unsupervised learning in a supervised problem

Dawid Laszuk published on
3 min, 492 words

Unsupervised learning is a powerful technique in machine learning, particularly when dealing with datasets where labeled data is unavailable. It allows the model to identify patterns, structures, and relationships in the input data without any human supervision or guidance. This can be particularly useful for tasks such as categorizing images, identifying anomalies in time series data, or detecting outliers in graphs.

There is a common misconception that unsupervised learning methods, such as K-means or Isolation Forest, cannot be validated. While it may be difficult to validate the method itself, validating the results of the method and ensuring that they provide acceptable outcomes is rarely challenging. There may not be an objective truth, but there are plenty of subjective "yeah, that looks good" judgments. Humans are adept at spotting patterns and can often evaluate the quality of results simply by inspecting them. In fact, many algorithms were created to match our intuition about how a problem could be solved.

Visual inspection is a great start, but it isn't scalable. In enterprise settings, we often work with large datasets that change frequently, and we cannot validate each one by hand or by eye. In such settings, we can focus on finding a representative subset and verify results through the tried-and-true method of "I'll know it when I see it." What exactly this means depends on the problem and solution. In the case of clustering, it means ensuring that "clusters make sense," while in anomaly detection, it involves confirming that "it stands out."
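As a rough illustration of picking a subset to eyeball, here is a minimal sketch that draws a few random members from each cluster for manual review. It assumes cluster labels are already available as a plain list; the function name `sample_for_review` and its parameters are made up for this example, and uniform sampling within each cluster is only one simple way to approximate "representative."

```python
import random
from collections import defaultdict


def sample_for_review(labels, per_cluster=3, seed=0):
    """Pick up to `per_cluster` random sample indices from every cluster.

    `labels[i]` is the cluster assigned to sample i (hypothetical input).
    Returns a dict mapping each cluster label to a sorted list of indices.
    """
    rng = random.Random(seed)  # fixed seed so the review set is reproducible

    # Group sample indices by their cluster label.
    by_cluster = defaultdict(list)
    for index, label in enumerate(labels):
        by_cluster[label].append(index)

    # Small clusters are returned whole; larger ones are subsampled.
    return {
        label: sorted(rng.sample(indices, min(per_cluster, len(indices))))
        for label, indices in by_cluster.items()
    }


# Toy labels standing in for the output of a clustering method.
labels = [0, 0, 1, 0, 2, 1, 1, 0, 2, 0]
review = sample_for_review(labels, per_cluster=2)
for label, indices in sorted(review.items()):
    print(f"cluster {label}: inspect samples {indices}")
```

Sampling per cluster rather than over the whole dataset keeps small clusters from being skipped, which is usually where "does this make sense?" matters most.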

These are very vague phrases, which is why data science exists. It's about quantifying what they mean and describing them in algorithmic or mathematical terms. For example, in the case of anomaly detection, this could mean fitting a distribution to normal samples and verifying that each new anomaly is an outlier with respect to it. There could be many different types of outliers, so it might be necessary to cluster them and assess how different each group is from the normal population's distribution. In the case of time series with somewhat frequent outliers, one could also track the outliers and their classifications over time and check whether their properties change.
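To make the first of these checks concrete, here is a minimal sketch for a single numeric feature. It assumes the normal samples are roughly Gaussian, summarizes them by mean and standard deviation, and reports what share of the flagged points lie far from that distribution. The function name, the threshold of 3 standard deviations, and the toy numbers are all assumptions for illustration, not a recommended setup.

```python
import statistics


def fraction_outlying(normal_samples, flagged_samples, z_threshold=3.0):
    """Share of flagged samples that lie more than `z_threshold` standard
    deviations away from the mean of the normal samples."""
    mean = statistics.fmean(normal_samples)
    std = statistics.stdev(normal_samples)  # sample standard deviation
    if std == 0:
        raise ValueError("normal samples have zero variance")
    if not flagged_samples:
        return 0.0

    # A flagged point "stands out" if its z-score exceeds the threshold.
    outlying = [abs(x - mean) / std > z_threshold for x in flagged_samples]
    return sum(outlying) / len(outlying)


# Toy data: readings considered normal, and points a detector flagged.
normal = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
flagged = [10.1, 14.0, 5.5, 10.9]

score = fraction_outlying(normal, flagged)
print(f"{score:.0%} of flagged points are outliers under the fitted distribution")
```

A number like this is crude, but it can be tracked from one version of the detector to the next, which is what makes it useful as a validation metric.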

That's a long way of saying there are many ways to validate results and compose validation metrics. Such metrics don't need to be perfect; as long as they point in the direction we want the method to move, we can improve them iteratively. In the end, it's unlikely that the first method is the perfect method; we might not have representative data to start with, or we might realize the solution should be something else completely. There are always trade-offs. However, since the problem appeared and there was a need to come up with a solution, there should also be a gauge to tell how well we are doing. It's up to the people who look into the data and design solutions to assess how good the solution is and whether it needs improvement.