Drop-In Class #18: Synthetic Data
Welcome to my newsletter, which I call Drop-In Class because each edition is a short, fun intro class for technology concepts. Except unlike many instructors, I'm not an expert yet: I'm learning everything at the same time you are. Thanks for following along with me as I "learn in public"!
Why synthetic data matters right now
In my ongoing quest to monitor all the big AI trends, I’ve noticed quite a few synthetic data startups popping up, like Synthesis AI, Gretel, and more. When I dug in more, I saw this stat: the number of startups focused on synthetic data went from fewer than 10 in 2017…to more than 50 in 2022.
If there are so many companies dedicated to solving a problem, it’s a problem worth looking into. And it’s estimated to be a $2.34 billion problem by 2030.
So what is that problem, exactly? Why do you need synthetic data? All shall be revealed. In the spirit of being synthetic, this week’s song pick is very synth-y. We just can’t get enough quality data!
AI models need high-quality datasets, but that’s easier said than done
Everyone’s a broken record on this. An AI model is only as good as the data that goes into it, garbage in, garbage out, blah blah. You know the drill.
Except it’s actually pretty hard to get good data, for a few reasons:
It can be off-limits due to security or privacy regulations. Using real data carries risks like compliance violations and data breaches.
It can be imbalanced or incomplete, since real-world data is happenstance — you might not see all the possible events or trends that could occur, like a fraud technique that exists but hasn’t been used on your org yet.
It might be messy or even inaccessible, even if it’s your proprietary data. Many, many orgs still can’t get a handle on their data, and it’s hidden in silos or lousy with data quality issues like missing values.
What’s the solution when you can’t use real data? Generate fake data!
But fake data isn’t that simple, either. A sample dataset might be unrealistically squeaky-clean, or it might break the statistical correlations you’d see in the real world: picture a dataset where ice cream sales go up as the weather gets colder.
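To make that concrete, here’s a toy Python sketch (the columns and numbers are all made up for illustration). Generating each column independently reproduces the averages but loses the relationship between them; fitting even a simple distribution to the real data and sampling from it keeps that relationship intact.

```python
import numpy as np

rng = np.random.default_rng(42)

# Pretend "real" data: ice cream sales go UP as temperature goes UP.
temps = rng.normal(20, 8, size=1_000)                    # degrees C
sales = 50 + 3 * temps + rng.normal(0, 10, size=1_000)   # units sold
real = np.column_stack([temps, sales])

# Naive fake data: each column generated independently.
# Looks plausible column-by-column, but the correlation is gone.
naive = np.column_stack([
    rng.normal(temps.mean(), temps.std(), size=1_000),
    rng.normal(sales.mean(), sales.std(), size=1_000),
])

# Simple synthetic data: fit a multivariate normal to the real data
# and sample from it, preserving the temperature/sales relationship.
synthetic = rng.multivariate_normal(
    mean=real.mean(axis=0), cov=np.cov(real, rowvar=False), size=1_000
)

for name, dataset in [("real", real), ("naive", naive), ("synthetic", synthetic)]:
    corr = np.corrcoef(dataset, rowvar=False)[0, 1]
    print(f"{name:>9}: temp/sales correlation = {corr:+.2f}")
# real and synthetic land near +0.9; naive lands near 0.
```

The naive version would pass a quick eyeball test, column by column, and still get the real-world relationships completely wrong.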
I’ve experienced this pain myself when creating demos. Usually the hardest thing about creating a realistic demo is finding realistic data to use. Kaggle datasets only get you so far.
Synthetic data: Improving on the real thing
If data is oil, synthetic data is like synthetic car oil: engineered to improve on the real thing.
Synthetic data is artificially generated data that mimics real-world patterns without exposing sensitive information. It used to be seen as a poor substitute for real data, but thanks to AI, we’ve gotten much better at creating realistic synthetic datasets. It’s generated with algorithms and models rather than collected through real-world observation.
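Here’s what “generated with algorithms and models” can look like in practice. This is a minimal sketch using the open-source SDV (Synthetic Data Vault) library; I’m assuming the SDV 1.x API here, and the tiny table is a stand-in for whatever real, sensitive data you’d actually have:

```python
# pip install sdv
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Stand-in for your real (sensitive) table.
real_df = pd.DataFrame({
    "age": [34, 51, 29, 62, 45, 38, 57, 26],
    "balance": [1200.0, 8800.0, 430.0, 15200.0, 2750.0, 1900.0, 9400.0, 310.0],
})

# Describe the table's schema, then fit a model to the real data.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_df)

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_df)

# Sample brand-new rows that mimic the real statistical patterns
# without copying any actual record.
synthetic_df = synthesizer.sample(num_rows=1000)
print(synthetic_df.head())
```

The Gaussian copula approach models each column’s distribution plus the correlations between columns, which is why the sampled rows keep realistic relationships instead of just realistic-looking values.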
So yes — it’s AI generating data for AI. Another funny irony where AI is both the problem and the solution!
In fact, the leading LLM developers use synthetic data. Anthropic used synthetic data to train Claude 3.5 Sonnet, and Meta used synthetic data to fine-tune Llama 3.1 405B.
A few business scenarios where synthetic data comes in handy:
For understanding a new or rare “what if” scenario: Synthetic data is useful when the real data doesn’t exist yet. For instance, if you’re a self-driving car company, you’re (hopefully) not frequently gathering real data from real crashes. If you’re a bank, you might want to understand a fraud attack pattern that hasn’t happened to your bank yet.
For quickly testing and iterating: You can prototype new use cases faster when it’s easier to generate the data needed.
For customizing a model to your specific domain: LLMs in particular come pre-trained on general information. Synthetic data can provide more specific, domain-relevant context for RAG or fine-tuning (see the sketch right after this list).
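Here’s roughly what that last scenario can look like: asking a general-purpose LLM to write domain-specific training examples for you. This sketch uses the OpenAI Python client, but the model name, prompt, and policy text are placeholders I invented; any LLM API would follow the same shape.

```python
# pip install openai
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A snippet of (pretend) internal documentation to ground the examples.
policy_text = "Refunds are available within 30 days of purchase with a receipt."

prompt = (
    "You are generating training data for a customer-support assistant.\n"
    "Based on the policy below, write 5 question/answer pairs as a JSON "
    'array of objects with "question" and "answer" keys. Output only JSON.\n\n'
    f"Policy:\n{policy_text}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use whatever model you have access to
    messages=[{"role": "user", "content": prompt}],
)

# In practice you may need to strip markdown fences before parsing.
qa_pairs = json.loads(response.choices[0].message.content)
for pair in qa_pairs:
    print(pair["question"], "->", pair["answer"])
# A human should still review these pairs before fine-tuning on them!
```

Grounding the prompt in real documentation keeps the generated pairs anchored to actual facts rather than whatever the model feels like making up.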
Synthetic data is especially helpful in healthcare, finance, retail, and other industries with unique needs and constraints, like privacy regulations and rare scenarios. I love this example: medical researchers can generate synthetic MRIs, CT scans, and X-rays to train diagnostic AI models without exposing patient records.
What’s the catch? As with anything AI, you need a human in the loop to make sure the synthetic data actually represents the real world. Synthetic data might miss nuanced patterns or edge cases that only appear in real data collection. And if the AI models generating the synthetic data are hallucinating, those hallucinations will show up in the data. You still need domain experts and data scientists involved for quality control and implementation.
Also, there are potential issues with the whole “AI generating data for AI” thing. Models get trained on synthetic data, then generate more synthetic data that trains other models. That feedback loop can lead to model collapse, where models gradually lose track of the underlying data distribution and degrade with each generation.
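Here’s a toy simulation of the gist (pure statistics, no actual LLMs, and the numbers are just for illustration): each “generation” fits the simplest possible model to the previous generation’s synthetic output, then samples from it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "real" data from the true distribution (mean 0, std 1).
data = rng.normal(loc=0.0, scale=1.0, size=20)

for generation in range(1, 51):
    # Fit the world's simplest "model": just estimate mean and std...
    mu, sigma = data.mean(), data.std()
    # ...then train the next generation ONLY on samples from that model.
    data = rng.normal(loc=mu, scale=sigma, size=20)
    if generation % 10 == 0:
        print(f"generation {generation:>2}: std = {sigma:.3f}")

# The std drifts toward 0: each generation loses a little of the
# original spread, and the losses compound.
```

Run it and you’ll watch the spread shrink generation after generation. Real model collapse is messier than this, but the compounding-loss dynamic is the same.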
With those risks in mind, it’s still looking like synthetic data is going to be an important part of the AI workflow. As AI investment keeps increasing, so will the demand for data that makes AI applications work, and more organizations will turn to synthetic data to help them keep things moving.
The cooldown: Extra reading
This is a 101-level drop-in, so I keep things at a high level! For deeper reading, check out these articles.
Is Synthetic Data the Future of AI? (Gartner)
Could Synthetic Data Be an Answer to Generative AI’s Data Problem? (WSJ CIO Journal)
The promise and perils of synthetic data (TechCrunch)
See you in the next one!
Cheers,
Alex