Top 5 ways to make better AI with less data
1. Transfer Learning
Transfer learning is widely used in machine learning now because the benefits are big. The general idea is simple. You train a big neural network for a general purpose, with a lot of data and a lot of training. When you then have a specific problem, you sort of “cut the end off” the big network and train a few new layers with your own data. The big network has already learned a lot of general patterns, so with transfer learning you don’t have to teach the network those patterns all over again.
A good example is training a network to recognize images of different dog breeds. Without transfer learning you need a lot of data, maybe 100,000 images of different breeds, since the network has to learn everything from scratch. If you train a new model with transfer learning, you might only need 50 images per breed.
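As a rough illustration, here is a minimal sketch in PyTorch of what “cutting the end off” looks like in practice: we load a ResNet-18 pretrained on ImageNet (using the weights API from recent torchvision versions), freeze its weights, and replace the final layer with a fresh one for our own classes. The number of breeds and the learning rate are placeholders.

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

NUM_BREEDS = 120  # placeholder: however many dog breeds you have

# Load a network pretrained on ImageNet -- it already "understands"
# general visual patterns like edges, textures and shapes.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained layers so only the new head gets trained.
for param in model.parameters():
    param.requires_grad = False

# "Cut the end off": swap the final classification layer for a
# fresh one sized for our own problem.
model.fc = nn.Linear(model.fc.in_features, NUM_BREEDS)

# Only the new layer's parameters are handed to the optimizer.
optimizer = optim.Adam(model.fc.parameters(), lr=1e-3)
```

Since only the small new layer has to be trained, a few dozen images per breed can be enough.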
You can read more about Transfer Learning here.
2. Active Learning
Active learning is a data collection strategy that lets you pick the data your AI models would benefit the most from during training. Let’s stick with the dog breed example. You have trained a model that can differentiate between breeds, but for some reason it always has trouble identifying German Shepherds. With an active learning strategy you would, automatically or at least through an established process, pick out these images and send them for labelling.
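A minimal sketch of that selection step, assuming you already have a model that outputs class probabilities for a pool of unlabelled images (the probabilities below are random stand-ins for real model output): pick the images the model is least confident about and send those off for labelling.

```python
import numpy as np

# Stand-in: predicted class probabilities for 1,000 unlabelled images
# over 10 breeds. In practice these come from your trained model.
probs = np.random.dirichlet(np.ones(10), size=1000)

# Uncertainty sampling: confidence = probability of the top prediction.
confidence = probs.max(axis=1)

# Pick the 50 images the model is least sure about -- for a model that
# struggles with German Shepherds, those images tend to end up here.
to_label = np.argsort(confidence)[:50]
print("Indices to send for labelling:", to_label)
```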
I made a longer post about how active learning works here.
3. Better Data
I’ve included a strategy here that might sound obvious but is often overlooked. With better-quality data you often need far less of it, since the AI does not have to train through the same amount of noise and wrong signals. In the media, AI is often portrayed as “with enough data you can do anything”. But in many cases, making an extra effort to get rid of bad data and ensure that only correctly labelled data is used for training makes more sense than simply collecting more.
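As a toy sketch of what that cleanup can look like, assuming each example was labelled by two independent annotators (the data and column names here are made up): drop duplicates, then keep only the rows where the annotators agree.

```python
import pandas as pd

# Hypothetical labelled data set with two independent annotators.
df = pd.DataFrame({
    "image": ["a.jpg", "a.jpg", "b.jpg", "c.jpg"],
    "label_1": ["poodle", "poodle", "beagle", "boxer"],
    "label_2": ["poodle", "poodle", "pug", "boxer"],
})

# Remove exact duplicates, then keep only the examples where the two
# annotators agree -- fewer examples, but a much cleaner signal.
clean = df.drop_duplicates().loc[lambda d: d["label_1"] == d["label_2"]]
print(clean)
```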
4. GANs
GANs, or Generative Adversarial Networks, are a way to build neural networks that sounds almost futuristic in its design. Basically, this kind of model is built by having two networks compete against each other in a game: one network creates fake training examples modelled on the data set, and the other tries to guess which examples are fake and which are real. The network building fake data is called the generator, and the network trying to tell fake from real is called the discriminator. This is a deep learning approach, and both networks keep improving during the game. When the generator gets so good at generating fake data that the discriminator consistently has trouble separating fake from real, we have a finished model.
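To make the game concrete, here is a stripped-down training loop in PyTorch. The networks and data are toy stand-ins (two tiny fully connected nets on 2-D points), a sketch of the adversarial setup rather than a recipe for real image GANs.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 2

# Generator: turns random noise into fake data points.
generator = nn.Sequential(
    nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim)
)
# Discriminator: outputs a probability that its input is real.
discriminator = nn.Sequential(
    nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid()
)

loss_fn = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

# Stand-in "real" data: a cluster of 2-D points.
real_data = torch.randn(256, data_dim) + 3.0

for step in range(1000):
    # Train the discriminator: real examples -> 1, fake examples -> 0.
    fake = generator(torch.randn(64, latent_dim)).detach()
    real = real_data[torch.randint(0, len(real_data), (64,))]
    d_loss = (loss_fn(discriminator(real), torch.ones(64, 1))
              + loss_fn(discriminator(fake), torch.zeros(64, 1)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Train the generator: try to make the discriminator output 1.
    g_loss = loss_fn(discriminator(generator(torch.randn(64, latent_dim))),
                     torch.ones(64, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```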
With GANs you still need a fair amount of data, but you don’t need as much labelled data, and since labelling is usually the costly part, you can save both time and money on your data with this approach.
5. Probabilistic Programming
One of my very favorite technologies. Probabilistic programming has a lot of benefits, and one of them is that you can often get away with using less data. The reason is simply that you build “priors” into your models. That means you can code your domain knowledge into the model and let the data take it from there. In many other machine learning approaches, everything has to be learned by the model from scratch, no matter how obvious it is.
A good example here is document data capture models. In many cases, the data we are looking for is identified by the keyword to the left of it, like “ID number: #number#”, which is a common format. With probabilistic programming you can tell the model before training that you expect the data to be to the right of the keyword. A neural network trained from scratch would have to learn that pattern itself, requiring more data.
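A minimal sketch of that idea using the PyMC library as one example (the observations are made up): we give the model a prior that already favours the value appearing to the right of the keyword, so even a handful of observed documents is enough to sharpen the estimate.

```python
import pymc as pm

# Six example documents: 1 means the value was found to the right
# of the keyword, 0 means it was not (made-up observations).
observed = [1, 1, 1, 0, 1, 1]

with pm.Model() as model:
    # Prior encoding our domain knowledge: before seeing any data,
    # we already believe the value is usually to the right.
    p_right = pm.Beta("p_right", alpha=8, beta=2)

    # Likelihood of the (small) observed data set.
    pm.Bernoulli("obs", p=p_right, observed=observed)

    # The posterior combines the prior with just six observations.
    trace = pm.sample(1000)
```

Because the prior carries most of the domain knowledge, the data only has to fine-tune it instead of establishing the pattern from nothing.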
You can also read more about probabilistic programming here: https://www.danrose.ai/blog/63qy8s3vwq8p9mogsblddlti4yojon