This article was originally published by Ta-Ying Cheng on Towards Data Science.
Ta-Ying Cheng is an Oxford D.Phil. student working on computer science, 3D vision, and deep learning.
The exponential growth of research and publications has introduced many terms and concepts into the domain of machine learning, yet many have degenerated into mere buzzwords, with few people fully understanding their differences.
This article demystifies the four core regimes in the field of machine learning — supervised, semi-supervised, unsupervised, and self-supervised learning — and discusses several example methods for solving these problems. Enjoy!
The most common type, and perhaps THE one we mean when talking about machine learning, is supervised learning.
In simple words, supervised learning provides a set of input-output pairs such that we can learn an intermediate system that maps inputs to correct outputs.
A naive example of supervised learning is determining the class (e.g., dog or cat) of an image based on a dataset of images and their corresponding classes, which we will refer to as their labels.
Given these input-label pairs, the currently popular approach is to directly train a deep neural network (e.g., a convolutional neural network) to output a label prediction from a given image, compute a differentiable loss between the prediction and the ground-truth label, and backpropagate through the network to update its weights and optimise the predictions.
Overall, supervised learning is the most straightforward type of learning, as it assumes the label of each input is given, which greatly simplifies what the network has to learn.
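The predict → loss → backpropagate → update loop described above can be sketched with a toy model. This is a minimal illustration, not the article's CNN setup: it trains a tiny logistic-regression classifier by gradient descent on made-up 2-D data, but the shape of the training loop is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two linearly separable "classes" of 2-D points,
# standing in for images and their labels.
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

w = np.zeros(2)   # model weights
b = 0.0           # bias
lr = 0.1          # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(200):
    p = sigmoid(X @ w + b)            # forward pass: label predictions
    grad_w = X.T @ (p - y) / len(y)   # gradient of the cross-entropy loss
    grad_b = np.mean(p - y)
    w -= lr * grad_w                  # update weights to reduce the loss
    b -= lr * grad_b

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
```

Swapping the logistic regression for a deep network and the analytic gradient for autograd gives the standard supervised pipeline.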
While supervised learning assumes that the entire dataset for a task has a corresponding label for each input, reality is not always like this. Labelling is a labour-intensive task, and input data often arrives without labels.
Semi-supervised learning aims to address this problem: how do we use a small set of input-output pairs and another set of only inputs to optimise a model for a task that we are solving?
Referring back to the image classification task, labels now exist for only part of the dataset. Is it still possible to utilise the unlabelled data?
Short answer: yes. In fact, there is a simple trick called pseudo-labelling. First, we use the correctly labelled images to train a classification model. We then use this model to label the unlabelled images. Images whose predictions the model is highly confident about are added to the training set with their predicted labels as pseudo-labels, and training continues. We iterate this process until all the data is utilised for the best classification model.
Of course, this method, while seemingly smart, can easily go wrong. If the amount of labelled data is very limited, the model is likely to overfit on the training data and produce false pseudo-labels at an early stage, derailing the entire model. It is therefore also very important to choose the confidence threshold for including an input-pseudo-label pair in training.
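The pseudo-labelling loop can be sketched as follows. This is a deliberately simple stand-in: instead of a neural network, it uses a nearest-centroid classifier, and the confidence proxy (the margin between the two centroid distances) and its threshold are illustrative choices, not a prescription.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two Gaussian clusters; only 4 of 200 points start out labelled.
X = np.vstack([rng.normal(-3, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
true_y = np.array([0] * 100 + [1] * 100)
y = np.full(200, -1)                  # -1 marks "no label"
y[[0, 1, 100, 101]] = true_y[[0, 1, 100, 101]]

THRESHOLD = 2.0  # pseudo-label only if the distance margin exceeds this

for _ in range(10):
    # "Train": compute centroids of the currently labelled points.
    c0 = X[y == 0].mean(axis=0)
    c1 = X[y == 1].mean(axis=0)
    # Predict on the still-unlabelled points.
    unl = np.where(y == -1)[0]
    if len(unl) == 0:
        break
    d0 = np.linalg.norm(X[unl] - c0, axis=1)
    d1 = np.linalg.norm(X[unl] - c1, axis=1)
    # Keep only high-confidence predictions as pseudo-labels.
    confident = np.abs(d0 - d1) > THRESHOLD
    y[unl[confident]] = (d1 < d0).astype(int)[confident]

mask = y != -1
pseudo_acc = np.mean(y[mask] == true_y[mask])
```

Raising `THRESHOLD` admits fewer but safer pseudo-labels; lowering it labels more data at the risk of the early-mistake cascade described above.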
To avoid the model overfitting at an early stage, one can also adopt data augmentation techniques to increase the size of the training set and create a wider distribution of data. If interested, you may also refer to my article on mixup, one of the most prominent augmentation strategies for image classification tasks.
Now that we understand how to use minimal labels for training, we can think one step further: a dataset with no labels at all.
Unsupervised learning is at the other end of the spectrum, where the input data have no corresponding classifications or labels at all. The goal is to find underlying patterns within the dataset.
Tasks involving unsupervised learning include customer segmentation, recommendation systems, and many more. However, how does one learn anything without any labels?
Since we have no 'correct answer' for each input, the most natural way to find patterns is to cluster the data. That is, given a set of data features, we try to find features that are similar to each other and group them together. Clustering methods include K-Means and K-Medoids.
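As a concrete illustration, here is a minimal sketch of K-Means (Lloyd's algorithm) on toy 2-D data. The farthest-point initialisation and the data are illustrative choices; a real application would cluster customer or user feature vectors instead.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: two tight, well-separated blobs.
X = np.vstack([rng.normal(-4, 0.5, (60, 2)), rng.normal(4, 0.5, (60, 2))])

k = 2
# Simple init: first point, plus the point farthest from it.
c0 = X[0]
c1 = X[np.argmax(np.linalg.norm(X - c0, axis=1))]
centroids = np.stack([c0, c1])

for _ in range(20):
    # Assignment step: each point joins its nearest centroid.
    d = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
    assign = d.argmin(axis=1)
    # Update step: each centroid moves to the mean of its members
    # (kept in place if it has no members, to avoid NaNs).
    new = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                    else centroids[j] for j in range(k)])
    if np.allclose(new, centroids):
        break  # converged
    centroids = new
```

No labels were used anywhere: the grouping emerges purely from the geometry of the features.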
Clustering alone can already create numerous insights. Take the example of a recommendation system: by grouping users based on their activities, one can recommend content one user favours to similar users without explicitly understanding what each user's interests are.
Now comes the tricky bit. It seems like we have covered the entire spectrum of learning, so what in the world is self-supervised learning!? Well, the answer may be simpler than you think!
Self-supervised learning is in some sense a type of unsupervised learning, as it follows the criterion that no labels are given. However, instead of finding high-level patterns for clustering, self-supervised learning attempts to solve tasks that are traditionally targeted by supervised learning (e.g., image classification) without any labels available.
This may seem impossible at first glance, but numerous recent works have come up with creative and interesting techniques to achieve it, one of which is the well-known contrastive learning from positive and negative pairs.
In short, one applies augmentations to the same image and labels the two views as a positive pair, treats different images as negative pairs, and then pushes the learnt features of negative pairs apart while pulling positive features close. This lets the network learn to group images of similar classes, which in turn makes tasks like classification and segmentation, which originally required fixed labels, learnable without given ground truths.
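The pull-together/push-apart objective can be sketched with a simplified NT-Xent-style contrastive loss. This is a hedged illustration: in practice `z1` and `z2` would be encoder features of two augmentations of the same batch of images (and the full loss symmetrises over both views); here they are made-up vectors, with `z1[i]` and `z2[i]` forming the positive pair and the rest of the batch serving as negatives.

```python
import numpy as np

def contrastive_loss(z1, z2, temperature=0.5):
    """Simplified NT-Xent: cross-entropy over cosine similarities,
    with the diagonal (the positive pairs) as the targets."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / temperature          # pairwise similarities
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return np.mean(logsumexp - np.diag(sim))

rng = np.random.default_rng(3)
base = rng.normal(size=(8, 16))

# Features where positives nearly match (as augmentations of the same
# image should) score a lower loss than unrelated random features.
aligned_loss = contrastive_loss(base, base + 0.05 * rng.normal(size=(8, 16)))
random_loss = contrastive_loss(base, rng.normal(size=(8, 16)))
```

Minimising this loss drives positive pairs together and negatives apart, which is exactly the grouping behaviour described above.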
If you want to experience each concept first-hand, you can take any labelled dataset and remove some or all of the labels yourself to test out each learning method.
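A simple way to do this is a small helper that masks out a fraction of the labels; the function name, the `-1` sentinel, and the fractions below are illustrative choices, not a standard API.

```python
import numpy as np

def mask_labels(y, keep_fraction, seed=0):
    """Return a copy of y with roughly `keep_fraction` of labels kept;
    the rest are replaced by -1 to mark them as unlabelled."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y).copy()
    drop = rng.random(len(y)) >= keep_fraction
    y[drop] = -1
    return y

y = np.arange(10) % 2             # stand-in labels for a toy dataset
semi = mask_labels(y, 0.3)        # semi-supervised: keep ~30% of labels
unsup = mask_labels(y, 0.0)       # unsupervised/self-supervised: none
```

The same dataset can then be fed to all three regimes for a fair comparison.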
One thing to note is that if you retrieve a dataset directly from torchvision, the labels are already pre-defined, so downloading the raw data and writing your own dataloader is better if you want to test out semi-supervised or self-supervised learning. One place I have found particularly useful for retrieving datasets is Graviti Open Dataset. With so many different datasets and papers, it is often tiring to work out which dataset to use; the Open Dataset platform organises the popular datasets so you can easily find them and be redirected to their official websites. They are also currently working on an API to simplify the process of designing data loaders, which I believe will be great for future use.
I would personally recommend testing on the MNIST or CIFAR-10 datasets, which require less computational power. They are easier to train on if a laptop is all you have available.
For more papers and techniques in self-supervised learning, you may also refer to Papers with Code, which tracks the current benchmarks and state-of-the-art methods.
And there you have it! Hopefully after this article you will know the subtle differences between supervised, semi-supervised, unsupervised, and self-supervised learning. Have fun and keep learning!