What is synthetic data for artificial intelligence?
This article is a cutout of my forthcoming book that you can sign up for here: https://www.danrose.ai/book.
Synthetic data is probably the subject in AI I currently think about the most, to be honest. It has enormous potential to improve privacy, reduce bias and increase model accuracy all at once, and could amount to a major technological leap in the coming years. Gartner has even stated, "By 2024, 60% of the data used for the development of AI and analytics projects will be synthetically generated." That is a game-changer, considering that many people working with AI today haven't even started to adopt this technology.
Synthetic data is data that does not come from actual observations of the world. It is artificial data created by humans or algorithms. Although it is generated synthetically, its goal is the same as that of real data - to represent the world in which the AI is supposed to function. Representing the world accurately is still only a means to an end, though. Ultimately, the goal of building AI is models that predict accurately and provide a good user experience.
Types of synthetic data
Depending on the data type - text, images or tabular data - there are different approaches and use cases.
Synthetic texts
For language and text AI, you can generate synthetic texts that look like those you would find in the real world. The result might even look like gibberish to a human, but if it does the job of representing the world when used as training data, that's good enough.
I have implemented that approach before in a text classification case. I chose it because the real data could only be stored for three months, which made it hard to keep up with seasonal signals. I fed the real data to a language model and fine-tuned it so that it could produce data similar to the real examples. We could then generate unlimited data for each label, free of personal data, to train the AI models.
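To make the idea concrete, here is a minimal sketch of label-conditioned text generation with the Hugging Face transformers library. It assumes a GPT-2-style model that has already been fine-tuned on "label: text" pairs; the model name, prompt format and function name are illustrative assumptions, not details from my project.

```python
# Minimal sketch: generate synthetic texts for a given classification label.
# Assumes a causal language model fine-tuned on "<label>: <text>" examples.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # swap in your fine-tuned model


def generate_synthetic_texts(label, n=10):
    """Generate n synthetic examples for one label (hypothetical helper)."""
    prompt = f"{label}: "
    outputs = generator(
        prompt,
        max_new_tokens=40,
        num_return_sequences=n,
        do_sample=True,
        top_p=0.95,
    )
    # Strip the label prefix so only the generated text remains.
    return [o["generated_text"][len(prompt):].strip() for o in outputs]


if __name__ == "__main__":
    for text in generate_synthetic_texts("complaint", n=3):
        print(text)
```

In practice you would repeat this for every label until each class has enough examples, then train the classifier on the synthetic set.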
Synthetic images
For images, it's possible to use a text-to-image model that creates synthetic images from a short text prompt. The most famous example is OpenAI's DALL-E 2 model, which produces amazingly realistic pictures. An open-source alternative called DALL-E Mini, available on HuggingFace, can be tried for free here: https://huggingface.co/spaces/dalle-mini/dalle-mini. You can prompt the model with a short text like "squared strawberry", and it will make nine attempts at producing an image of a squared strawberry.
As the model is open-source, you can also download it and use it in your own projects.
The images produced by DALL-E Mini might not be photo-realistic, but they are still good enough to train AI models.
You can try it yourself. Go to the DALL-E Mini demo and ask the model for images of bananas and apples, using prompts such as "Banana on table" or "Banana on random background". Do the same with apples until you have 30 or so images of each. You can then upload these images to Teachable Machine to build a banana vs apple recogniser. I promise it will work. If it does not impress you at least a tiny bit that you can build AI to recognise objects from purely synthetic images, then I don't know what will.
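If you prefer to script the dataset instead of clicking through the demo, here is a rough sketch of the same idea in Python. As an assumption, it uses an open-source Stable Diffusion checkpoint through the diffusers library rather than DALL-E Mini, simply because it exposes a convenient Python API; the model name, prompts and folder layout are illustrative.

```python
# Sketch: build a small synthetic image dataset from text prompts.
from pathlib import Path
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe = pipe.to("cuda")  # use a GPU if one is available

prompts = {
    "banana": ["banana on table", "banana on random background"],
    "apple": ["apple on table", "apple on random background"],
}

out_dir = Path("synthetic_images")
for label, label_prompts in prompts.items():
    (out_dir / label).mkdir(parents=True, exist_ok=True)
    for i, prompt in enumerate(label_prompts):
        # Each call returns a list of PIL images; keep the first one.
        image = pipe(prompt).images[0]
        image.save(out_dir / label / f"{label}_{i}.png")
```

The resulting folders of labelled images can then be fed to any image classifier, just like the Teachable Machine experiment above.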
The use cases here are many. You can synthetically create objects you expect to encounter but have not seen in the training data. You can also place ordinary objects on random backgrounds to cover unknown scenarios, which makes the models more robust because a change in environment matters less.
Synthetic tabular data
Tabular data can also be generated synthetically. This is popular in healthcare, which is particularly vulnerable to data issues. Besides the endless combinations of diseases and interacting medications, there is also the privacy issue: a single patient's history of diagnostics and medication can be so unique that it identifies the individual. By generating synthetic versions of the actual data, the dataset can be extended to cover rare scenarios better and anonymised at the same time, which makes it easier to share between researchers and medical experts.
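As a toy illustration of the principle (not how production healthcare synthesizers work), the sketch below fits a simple multivariate Gaussian to a few numeric columns and samples new rows from it; the column names and values are made up for the example.

```python
# Toy sketch: synthetic tabular data from a fitted multivariate Gaussian.
import numpy as np
import pandas as pd

# Pretend this is real (and sensitive) patient data.
real = pd.DataFrame({
    "age": np.random.normal(55, 12, 500),
    "systolic_bp": np.random.normal(130, 15, 500),
    "cholesterol": np.random.normal(200, 30, 500),
})

# Fit a simple joint distribution: column means and covariance matrix.
mean = real.mean().to_numpy()
cov = real.cov().to_numpy()

# Sample as many synthetic rows as you like; no row maps back to a real patient.
synthetic = pd.DataFrame(
    np.random.multivariate_normal(mean, cov, size=1000),
    columns=real.columns,
)

print(synthetic.describe())
```

Dedicated libraries use more sophisticated models (copulas, GANs and the like) that also handle categorical columns and rare combinations, but the underlying idea is the same: learn the joint distribution, then sample from it.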
Models of the world
With synthetic models of the world, we can also experiment with AI solutions before releasing them and teach them to become better at a fraction of the cost. Self-driving cars are a perfect use case. They can be developed faster and more safely by building a synthetic model of the world that comes close to the real one, complete with physics and random scenarios. Many companies building self-driving cars today use worlds built in Unity, an engine originally intended for computer game development. In a virtual world, cars can try, crash and improve millions of times with no humans at risk before being released.
The good and the bad of synthetic data
The benefits of applying synthetic data to your solutions are many. It can provide more data at a lower price to improve model accuracy. It can reduce bias by evening out the data, adding examples of otherwise rare features or labels whose absence would disadvantage some groups. It can improve the privacy of people whose personal data might be part of the training data. And it lets us test both known and unknown scenarios.
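To illustrate the bias-reduction point, here is a tiny sketch of evening out rare labels with synthetic examples. The generate_for_label() helper is a hypothetical stand-in for whatever generator you use (such as the fine-tuned language model sketched earlier), and the labels and counts are invented.

```python
# Toy sketch: top up rare labels with synthetic examples until classes balance.
def generate_for_label(label, n):
    # Placeholder: in practice, call your synthetic data generator here.
    return [f"synthetic {label} example {i}" for i in range(n)]

label_counts = {"billing": 950, "delivery": 900, "fraud": 40}  # real data
target = max(label_counts.values())

augmented_counts = {}
for label, count in label_counts.items():
    # Generate just enough synthetic examples to match the largest class.
    synthetic = generate_for_label(label, target - count)
    augmented_counts[label] = count + len(synthetic)

print(augmented_counts)  # every label now has 950 examples
```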
But is it all good? No. Synthetic data is not a silver bullet. It carries the risk of adding bias or moving the data further from the world it is meant to represent. The challenge is that such bias is hard to trace, because synthetic data is typically used exactly where real data is scarce and therefore, by definition, difficult to reality-check. Synthetic data is a promising solution to many problems, but use it with care. Since very few people have experience with synthetic data in AI yet, many of the challenges that await are still unknown.
For more tips, sign up for the book here: https://www.danrose.ai/book.