Active Learning - The AI data strategy that will save you time and money

As always, my post takes an applied perspective: how different techniques in AI and machine learning can be used for real applications. Active learning is a data strategy that, to my mind, aims directly at making AI more applicable in the real world. An essential element of making AI work in the real world is making it a good business case. The investment, not just in development but also in the recurring collection and preparation of the data that feeds the AI, must be outweighed by the revenue. If collecting data, either initially or on a recurring basis, is too expensive, that becomes difficult to achieve. Active learning is a way to reduce those costs.

When working with AI we often use labeled data. If you are building AI that can translate speech to text, you will have to label hours and hours of speech as text. That’s costly and hard work. Even worse, language changes over time, and as your AI expands in use, other accents will enter its domain and you will need more data. So finding a strategy that labels as little data as possible while still getting good results is key.

Active learning definition

Active learning is a data strategy in machine learning and AI that cherry-picks data for labelling, in order to get the most relevant data labeled at any given time. It’s a proactive data collection method, contrary to classical machine learning data collection, which advises random sampling to avoid bias.

The concept is also known as optimal experimental design in statistics, in case you come across that term.

Implementing Active Learning

In short, the strategy is to implement an iterative loop that feeds data to your model as you get smarter about which data you need the most. Basically, it’s about consistently learning which data would be most valuable to label next.


[Figure: the active learning loop]

As you can see, the system is simple. You have an AI model that makes predictions. These predictions are taken in by the active learning model, which selects the unlabelled data that should be labeled next. When the AI model retrains, it is given better data. The effect of course depends on the active learning model and your available data pool.

The active learning model needs a way to decide which data is the most appropriate to train on. That is called the query selection strategy; we’ll get back to it later.
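
To make the loop concrete, here’s a minimal Python sketch of the iteration. The function names (train_fn, query_fn, label_fn) are just my own placeholders for your training code, your query selection strategy and your labelling workflow, not any particular library’s API.

```python
# A minimal sketch of the active learning loop, assuming you supply
# the three moving parts as functions.

def active_learning_loop(train_fn, query_fn, label_fn,
                         labeled, unlabeled, rounds, batch_size):
    """Run `rounds` iterations of select -> label -> retrain."""
    model = train_fn(labeled)
    for _ in range(rounds):
        # The query selection strategy picks the unlabeled examples
        # the model would benefit most from having labeled.
        batch = query_fn(model, unlabeled, batch_size)
        # Hand the batch to human annotators and collect the labels.
        labeled = labeled + label_fn(batch)
        unlabeled = [x for x in unlabeled if x not in batch]
        model = train_fn(labeled)  # retrain on the improved dataset
    return model
```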

Active learning comes with a few standard architectures, the most popular being pool-based and stream-based active learning.

Pool-based active learning

This architecture works well if you are training a comprehensive one-time model. Let’s say you want to make a model that can classify images of all animals into species. New species don’t come around all the time, so the need for recurring training is small. There are a lot of animal species out there though, so a lot of labeled data will be needed.

Now let’s say you have a pool of 1 million unlabelled images of animals. In the pool-based approach you select a small sample of this, say 10,000 images, and label them. After labelling, we train the model and start making predictions. The active learning model now selects new unlabelled data for labelling. An example could be that the model asks for more labeled images of elephants, since this is where it’s having trouble.

This approach runs in iterations, and the model is retrained with the newly selected data until you get the results you wanted (or you have spent the labelling resources allocated).
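
Here’s a runnable toy version of the pool-based loop, using scikit-learn on synthetic data and the least-confidence query strategy we’ll cover below. The known labels play the role of the human annotators, and all the numbers are just for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the pool of unlabelled animal images.
X, y = make_classification(n_samples=5000, n_features=20,
                           n_informative=10, n_classes=3, random_state=0)

rng = np.random.default_rng(0)
labeled = list(rng.choice(len(X), size=100, replace=False))  # small seed set
pool = [i for i in range(len(X)) if i not in set(labeled)]

model = LogisticRegression(max_iter=1000)
for round_no in range(10):
    model.fit(X[labeled], y[labeled])
    # Least-confidence query strategy: score each pool example by how
    # low the model's top class probability is.
    probs = model.predict_proba(X[pool])
    uncertainty = 1.0 - probs.max(axis=1)
    picked = [pool[i] for i in np.argsort(uncertainty)[-25:]]
    # In a real system humans would label `picked`; here the known
    # labels y play the role of the annotators.
    labeled += picked
    pool = [i for i in pool if i not in set(picked)]
    print(f"round {round_no}: {len(labeled)} labeled examples")
```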

Stream-based active learning

The stream-based architecture is more of an in-production solution. It works when you have an ongoing stream of classifications that should be improved. The idea is that you let the active learning model pick out examples for labelling in real time.

Let’s say you have an AI model that predicts whether emails are spam or not. In this case the active learning model can ask, on the go, for labelling of emails that were classified with high uncertainty. You could have a similar solution for the speech-to-text problem by having a team ready to transcribe the speech that was not confidently classified.

The cool thing here is that it keeps the model improving along with its changing domain. The downside is that someone has to do this work, and you will have to figure out a threshold that keeps the costs, and the inconvenience to the user, down.
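
As a sketch, the spam example could look something like this. It assumes a scikit-learn-style model with predict_proba; the threshold and the queue are things you’d have to tune and build for your own setup.

```python
from collections import deque

CONFIDENCE_THRESHOLD = 0.75         # below this, ask a human; tune to your budget
labeling_queue = deque(maxlen=500)  # cap the queue to cap the labelling cost

def handle_email(model, features):
    """Classify one incoming email; queue it for labelling if uncertain."""
    probs = model.predict_proba([features])[0]
    if probs.max() < CONFIDENCE_THRESHOLD:
        # Uncertain prediction: send it to the labelling team so the
        # next retraining round learns from exactly these cases.
        labeling_queue.append(features)
    return probs.argmax()  # still return the best guess to the user
```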

Query selection

As mentioned, query selection is the actual strategy for deciding which data is selected for labeling. There are a lot of strategies to choose from. I’ll just go into one, uncertainty sampling, but there are plenty more if you investigate further.

Uncertainty sampling

The idea is simple. When most machine learning models classify, they also provide you with a confidence score. So given a picture, a model could report a score of 82% that the picture contains an elephant.

The uncertainty sampling strategy simply advises choosing the classifications with the lowest scores for labelling. Though simple, the approach has proven highly effective.
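
In code, least-confidence uncertainty sampling is almost a one-liner over the probability matrix. Reusing the elephant example, the 82% prediction is a weaker labelling candidate than a 55% one:

```python
import numpy as np

def least_confidence(probs):
    """Uncertainty = 1 minus the model's top class probability per row."""
    return 1.0 - probs.max(axis=1)

# Rows are pictures, columns are class probabilities.
probs = np.array([[0.82, 0.10, 0.08],   # 82% elephant: fairly confident
                  [0.55, 0.30, 0.15]])  # 55% top score: send for labelling
print(least_confidence(probs))          # approximately [0.18 0.45]
```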

I’ll just add a little nuance to uncertainty sampling. Being a big fan of probabilistic programming, I see an alternative version. I haven’t seen it tested anywhere, so this is pure thought experiment. In probabilistic programming you do not get a single confidence score but a confidence interval or distribution. That means you don’t just get a score but also how sure the model is about that score. My idea here is to pick the classifications with the most uncertain distributions, since labelling those should tighten the model’s predictions where they are least reliable. If you are building a model with a need for high recall or precision, this might be a suitable way to go.
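
Here’s a sketch of that thought experiment. It assumes a probabilistic model that returns samples of the predicted score per example (from PyMC, say, or an ensemble), shaped (samples, examples). Instead of ranking by the mean score, we rank by the spread of the samples:

```python
import numpy as np

def widest_distributions(score_samples, k):
    """Indices of the k examples whose predicted score is least certain."""
    spread = score_samples.std(axis=0)  # per-example spread of the posterior
    return np.argsort(spread)[-k:]

rng = np.random.default_rng(0)
# Two examples with the same 80% mean score but different certainty:
samples = np.column_stack([rng.normal(0.80, 0.02, size=1000),   # tight
                           rng.normal(0.80, 0.20, size=1000)])  # wide
print(widest_distributions(samples, k=1))  # [1]: picks the uncertain one
```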

The unsupervised approach

I couldn’t find a name or actual cases for this anywhere, but it has been a strategy I have worked with in practice before, so I just want to share it.

In many AI architectures a mix of supervised learning (which needs labels) and unsupervised learning (which doesn’t) is used. Often unsupervised learning is used to group or cluster entries before a specialized supervised model, trained per cluster, classifies the entry. In this approach it seems obvious to also look at which clusters need more data. The idea is to get more data for the clusters with the smallest amount of labelled data. I have seen this work, at least.
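
A minimal sketch of the idea, assuming k-means for the clustering step: count the labeled examples per cluster and query unlabelled data from the thinnest cluster. The names and numbers are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def query_thinnest_cluster(X_labeled, X_pool, n_clusters=10, batch_size=25):
    """Pick pool examples from the cluster with the fewest labels."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    km.fit(np.vstack([X_labeled, X_pool]))
    # How many labeled examples does each cluster already have?
    counts = np.bincount(km.predict(X_labeled), minlength=n_clusters)
    thinnest = counts.argmin()
    candidates = np.where(km.predict(X_pool) == thinnest)[0]
    return candidates[:batch_size]  # indices into X_pool to send for labelling
```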

Challenges to Active Learning

There are of course some challenges to active learning. They don’t outweigh the benefits, but you have to keep them in mind.

Risking bias

First of all, you risk getting more biased data. In my opinion all data is biased; the trick is to get as little bias as possible. By using the default machine learning approach of selecting random data for labelling, you avoid a lot of bias. I don’t see this as a big risk, but keep it in mind.

Who asked the users?

Active learning also misses, in my opinion, a crucial element from a business application perspective: it doesn’t take into account which cases the users need the most. Let’s say the active learning model decides that we need more labeled images of polar bears, since we get really low confidence there. But from a user perspective that doesn’t matter, since we very rarely see polar bears. So if a polar bear is on average classified with 80% confidence and dogs with 90% confidence, we don’t get to train much on dogs now. But if dogs are the case 95% of the time, we might want a bit more dog training.
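
One simple way to fold the user perspective back in, sketched with the numbers from the example, is to weight each class’s uncertainty by its share of the real traffic. This weighting is my own suggestion, not a standard recipe:

```python
usage_share = {"dog": 0.95, "polar_bear": 0.05}  # how often users see each class
confidence = {"dog": 0.90, "polar_bear": 0.80}   # average model confidence

# Priority = how uncertain we are, weighted by how much it matters.
priority = {cls: usage_share[cls] * (1 - confidence[cls]) for cls in usage_share}
print(priority)  # dog ~0.095 > polar_bear ~0.010: label dogs first
```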

Cooperative learning

There’s a sub-strategy to active learning called cooperative learning. In this approach you let the AI suggest labels from its classifications to humans, who can in turn validate each label as correct or incorrect. I have used this strategy before, and it can be extremely efficient when done right. When the person tasked with labelling only has to see the data and click a button, that person can suddenly label a lot of data in no time. This does have its problems though. I have seen in some cases that when the AI reaches a certain level of quality, it becomes too easy to trust the AI whenever there’s just a tiny bit of doubt on the labeller’s side. This can quickly muddy the data and stall the progress of the AI. So if you choose to use cooperative learning, you should also deploy strategies to avoid this. It could be random control checks that make sure you catch bad habits.
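
As a sketch, the control checks could be as simple as a random blind second pass, where an annotator labels without seeing the AI’s suggestion. The audit rate and the callables here are assumptions you’d adapt to your own labelling tool:

```python
import random

AUDIT_RATE = 0.05  # 5% of items get a blind second pass

def cooperative_label(model, item, human_validate, human_label_blind):
    """Model suggests a label, the human confirms; occasionally audit blind."""
    suggestion = model.predict([item])[0]
    label = human_validate(item, suggestion)  # one click to accept or correct
    if random.random() < AUDIT_RATE:
        # Control check: a second annotator labels from scratch, without
        # seeing the suggestion, to catch rubber-stamping of the AI.
        control = human_label_blind(item)
        if control != label:
            print(f"Audit mismatch on {item!r}: {label!r} vs {control!r}")
    return label
```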

Conclusion

Active learning is a cost-effective way of finding out which data you most need labeled. There are different architectures and approaches to choose from, depending on what kind of solution you are trying to build. Building the active learning model is where the real work, and the fun, lies in this data strategy.
