Machine learning is advancing rapidly, and ever-growing datasets are at the center of this evolution. But what do you do if you don’t have a massive dataset to start with? This is the first of two talks we held at a machine learning meetup jointly hosted by Telia and Squeed, based on real projects where data was limited. The topic of this talk was object localization in a scenario where the only data available was what we could collect ourselves.
So what is the low data regime exactly?
Good question! Computer vision is perhaps the area where machine learning has made the greatest progress during the last ten years. An important factor here is the increasing size of datasets. Just at the beginning of the year, Tencent published a computer vision dataset comprising 18M images with multi-category labels. In contrast, we were looking to develop a model with a much smaller dataset, on the order of $10^3$ images. This is not one-shot learning, and historically a dataset of this size might even have been considered relatively big. In the setting of modern computer vision, though, we believe a dataset of this size certainly qualifies as a citizen of the low-data regime.
The project in question involved annotating news broadcasts with the text from the kind of boxes seen in the picture below, as well as hardcoded subtitles. To do this we used a three-stage pipeline: textbox localization > OCR > parsing. We were not interested in text in the background of the video or in persistent text such as logos or timestamps.
The case for semantic segmentation
Nowadays, computer vision almost always means deep learning using some form of convolutional neural network. Given our task, it seems natural to directly predict the bounding boxes of the textboxes. Instead, we opted for a semantic segmentation approach where we predict, for every pixel, the likelihood that it belongs to a textbox, and then obtain the desired bounding boxes through closing and clustering post-processing. Why? Models for bounding box prediction (or object localization) typically deal with the added complexity of overlapping objects. In cases where one object might be partially obscured by another, this capability is a clear advantage over semantic segmentation. In our case, however, the objects of interest are always fully visible, and we found that the semantic segmentation approach was both faster and better performing.
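To make the post-processing step concrete, here is a minimal sketch of how a per-pixel probability map could be turned into bounding boxes. It is illustrative rather than the project's actual code: it assumes a 0.5 threshold, a 3x3 morphological closing, and 4-connected components as the "clustering" step, and it uses plain Python lists to stay dependency-free (in practice an image library would do this far faster). All function names are hypothetical.

```python
from collections import deque


def morph(mask, combine):
    """Apply `combine` (any = dilation, all = erosion) over each 3x3 neighbourhood."""
    h, w = len(mask), len(mask[0])
    return [[combine(mask[j][i]
                     for j in range(max(0, y - 1), min(h, y + 2))
                     for i in range(max(0, x - 1), min(w, x + 2)))
             for x in range(w)]
            for y in range(h)]


def close(mask):
    """Morphological closing (dilate, then erode) to fill small holes and gaps."""
    w = len(mask[0])
    # Pad with one unset pixel so the image border does not distort the result.
    padded = ([[False] * (w + 2)]
              + [[False] + list(row) + [False] for row in mask]
              + [[False] * (w + 2)])
    closed = morph(morph(padded, any), all)
    return [row[1:-1] for row in closed[1:-1]]


def bounding_boxes(prob, threshold=0.5):
    """Return (left, top, right, bottom) for each connected region of the mask."""
    mask = close([[p >= threshold for p in row] for row in prob])
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y0 in range(h):
        for x0 in range(w):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first flood fill over 4-connected neighbours.
            queue = deque([(y0, x0)])
            seen[y0][x0] = True
            left, top, right, bottom = x0, y0, x0, y0
            while queue:
                y, x = queue.popleft()
                left, right = min(left, x), max(right, x)
                top, bottom = min(top, y), max(bottom, y)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((left, top, right, bottom))
    return boxes
```

The closing step matters because a segmentation network rarely produces perfectly solid regions; without it, a single low-confidence pixel inside a textbox could split one box into several.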
Techniques for low data computer vision
Some of the techniques we looked at to compensate for our lack of data included:
- Limiting the representational capacity of the model
- Pretraining on synthetic training data
- Data augmentation
We also used an iterative annotate-train-predict cycle to speed up the data labelling procedure.
Limiting the representational capacity of the model
Rather than using a ResNet-152 as the basis of our model, we chose an architecture inspired by MobileNetV2. To decode our intermediate latent representation back to the original image dimensions we used a DeepLabV3-inspired architecture. With a small dataset, the representational power of a bigger model most likely cannot be utilized to its full extent, while the bigger model still incurs extra cost in training and inference time. It is also possible that a smaller representational capacity provides some insurance against overfitting.
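One reason MobileNet-style networks are small is that they replace standard convolutions with depthwise separable ones. The short calculation below illustrates the difference in weight count for a single layer. The layer size (3x3 kernel, 64 input channels, 128 output channels) is an assumption chosen for illustration, not a layer from our model, and the function names are hypothetical.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a k x k standard convolution (bias ignored)."""
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    """Weights in a k x k depthwise convolution followed by a 1 x 1 pointwise one."""
    return k * k * c_in + c_in * c_out


std = standard_conv_params(3, 64, 128)
sep = separable_conv_params(3, 64, 128)
print(std, sep, round(std / sep, 1))  # 73728 8768 8.4
```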
Pretraining on synthetic training data
For our case specifically it seemed feasible to generate synthetic data whose distribution should be similar enough to the true distribution of the data. Simply take a random image, sample some string from a set of company names or headlines, and construct a textbox on top of the original image.
A model that started from weights pretrained on this synthetic data did better early in training, but in the end it proved no better than a model trained directly on the real data.
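The generation recipe can be sketched roughly as follows. This is a simplified illustration under several assumptions, not the generator used in the project: random noise stands in for the background image, the text pool and box dimensions are made up, and actual glyph rendering is omitted (it would need an imaging library). The point is that each synthetic image comes with a pixel mask for free.

```python
import random

# Hypothetical text pool; the real one held company names and headlines.
HEADLINES = ["BREAKING NEWS", "Acme Corp", "Markets close higher"]


def make_synthetic_sample(height=72, width=128, rng=None):
    """Return (image, mask, box, text) for one synthetic grayscale frame."""
    if rng is None:
        rng = random.Random()
    # Random noise stands in for "a random image".
    image = [[rng.randrange(256) for _ in range(width)] for _ in range(height)]
    mask = [[0] * width for _ in range(height)]

    text = rng.choice(HEADLINES)
    # Assumed layout: 4 px per character plus padding, fixed 12 px height.
    box_w = min(width, 4 * len(text) + 8)
    box_h = min(height, 12)
    left = rng.randrange(width - box_w + 1)
    top = rng.randrange(height - box_h + 1)
    shade = rng.choice([20, 235])  # dark or light box background

    for y in range(top, top + box_h):
        for x in range(left, left + box_w):
            image[y][x] = shade  # glyph rendering omitted in this sketch
            mask[y][x] = 1       # label: this pixel belongs to a textbox
    box = (left, top, left + box_w - 1, top + box_h - 1)
    return image, mask, box, text
```

Passing a seeded `random.Random` instance as `rng` makes the generated samples reproducible.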
Data augmentation
Another technique we looked at was data augmentation, where various small perturbations are made to the real data. In computer vision, data augmentation is hardly limited to low data scenarios, but it is reasonable to believe that these scenarios are the ones that would benefit the most from it. We limited the perturbations to flips (horizontal, vertical, or both) and a couple of non-linear intensity transformations of the grayscale input images. A similar graph comparing the loss during training for a model trained with augmentation and one trained without shows a very modest decrease in the generalization gap.
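A minimal sketch of these perturbations for a segmentation task is shown below. The gamma curve and its range are assumptions standing in for the unspecified non-linear intensity transformations, and the function names are hypothetical. The detail worth noting is that geometric changes such as flips must be applied to the image and its mask together, while intensity changes apply to the image only.

```python
import random


def flip(grid, horizontal, vertical):
    """Return a flipped copy of a 2D grid (list of rows)."""
    rows = grid[::-1] if vertical else grid
    return [row[::-1] if horizontal else list(row) for row in rows]


def gamma_transform(image, gamma):
    """Non-linear intensity change on 8-bit grayscale values."""
    return [[round(255 * (p / 255) ** gamma) for p in row] for row in image]


def augment(image, mask, rng):
    """Randomly flip image and mask together, then perturb image intensity."""
    horizontal = rng.random() < 0.5
    vertical = rng.random() < 0.5
    gamma = rng.uniform(0.7, 1.5)  # assumed range, not taken from the project
    image = gamma_transform(flip(image, horizontal, vertical), gamma)
    mask = flip(mask, horizontal, vertical)  # keep labels aligned with pixels
    return image, mask
```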
Speeding up data labelling
The size of our dataset was bounded by the number of images we were willing to annotate ourselves, so the speed with which we could annotate was of high importance. To improve the annotation process we used an iterative annotate-train-predict cycle, where we would use the current model to provide predictions for the images we wanted to annotate, so that we only had to correct its mistakes rather than start from scratch.
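The cycle can be sketched as a simple loop. The function and the three callables passed to it (`train`, `predict`, `correct`) are hypothetical placeholders for the training routine, model inference, and the human correction step; the batch-wise structure is an assumption about how such a loop could be organized.

```python
def annotation_cycle(unlabelled, batch_size, train, predict, correct):
    """Label images in batches, using the latest model to pre-annotate each batch."""
    labelled = []
    model = None
    for start in range(0, len(unlabelled), batch_size):
        batch = unlabelled[start:start + batch_size]
        for image in batch:
            # No model exists in the first round, so the annotator starts from scratch.
            proposal = predict(model, image) if model is not None else None
            # A human fixes the proposal (or draws the label when it is None).
            labelled.append((image, correct(image, proposal)))
        # Retrain on everything labelled so far before the next batch.
        model = train(labelled)
    return model, labelled
```

As the model improves from round to round, the proposals need fewer corrections, which is where the time saving comes from.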
We found validation set loss to be a good proxy for the performance of the model, and have therefore not felt the need to use other metrics such as pixel F1 or bounding box mean average precision. An example of the trained model on a video from the test set can be seen below.
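For readers who do want a metric beyond the loss, pixel F1 is straightforward to compute from a binary predicted mask and a binary ground-truth mask. The sketch below is illustrative only, since we did not use this metric in the project; returning 1.0 when both masks are empty is a convention chosen here, and the function name is hypothetical.

```python
def pixel_f1(predicted, target):
    """F1 score over pixels for two binary masks of equal shape."""
    tp = fp = fn = 0
    for pred_row, true_row in zip(predicted, target):
        for p, t in zip(pred_row, true_row):
            if p and t:
                tp += 1  # textbox pixel correctly predicted
            elif p:
                fp += 1  # predicted textbox where there is none
            elif t:
                fn += 1  # missed textbox pixel
    denominator = 2 * tp + fp + fn
    # Convention (an assumption): two empty masks count as a perfect match.
    return 2 * tp / denominator if denominator else 1.0
```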
Although pretraining on synthetic data and data augmentation did not significantly improve our model, it works quite well. How do we reconcile this observation with our opening statement, where we compared our $10^3$ datapoints to Tencent's $1.8 \times 10^7$?
- We only have one class to predict. Though Tencent's dataset still contains more datapoints per class, the disparity is not as big as it initially seems
- We simplified the learning problem as much as we could, by only predicting a binary mask over the input image
- We do not require our model to achieve superhuman performance
It is important to remember that there are diminishing returns when trying to improve a model by increasing the size of the dataset. Often the deciding factor for the feasibility of a project is where these diminishing returns really start to kick in, relative to where the model achieves satisfactory performance. Hence it is worth evaluating a model even when you only have a small amount of data, because sometimes that is all you need.