Data is real. We rely on real-world, substantiated and validated data inside our enterprise and consumer software applications every day. Although essentially digital (and often virtualized or abstracted) in nature, data is the lifeblood of modern systems: it encodes the real world, often at the lightning-fast speed of streamed real-time data workflows.

But not all data is real. Some data can be synthetic. This is data that is created to accurately resemble the shape, size, regularity, sensitivity and value range of an information set that we know about, but can’t necessarily get access to for reasons relating to international or local governance stipulations, intellectual property regulations or personally identifiable information concerns.
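As a deliberately simplified illustration of that idea (my own sketch, not any vendor's implementation), a generator can learn the rough shape and value range of a sensitive numeric column and then emit lookalike values. The column, its distribution and the parameters below are all invented for the example.

```python
# Minimal sketch: learn the rough shape and range of a sensitive numeric
# column, then sample synthetic lookalikes. All values here are invented.
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a real, sensitive column (e.g. salaries) we cannot share directly.
real_column = rng.lognormal(mean=10.5, sigma=0.4, size=2_000)

# "Learn" its shape: log-space mean and spread, plus the observed value range.
log_mu, log_sigma = np.log(real_column).mean(), np.log(real_column).std()
lo, hi = real_column.min(), real_column.max()

# Emit synthetic values with a similar shape, clipped to a similar range.
synthetic_column = np.clip(rng.lognormal(log_mu, log_sigma, size=2_000), lo, hi)
print(real_column.mean(), synthetic_column.mean())  # roughly comparable summary statistics
```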

When Synthetic Data Is Used

As we have explained here in relation to tools such as SAS Data Maker, synthetic data is much needed for AI analysis of extremely uncommon events that we still want to model, such as medical analysis of patients with a rare condition, tumor or infection. Despite its somewhat negative-sounding name, then, synthetic data can be a force for considerable good, especially given its role in AI development and machine learning.

But not everybody is a fan. There are some concerns that the overuse of synthetic data is leading to AI model degradation, a sentiment echoed by researchers from the UK’s Oxford University as well as commentators from US-based Rice and Duke universities. Their research suggests that there are risks associated with synthetic data, particularly the potential for “irreversible defects” in models trained predominantly on synthetic outputs. A study published in the journal Nature concurs with this suggestion, proposing that over-reliance on model-generated content can cause critical aspects of the original data distribution to vanish, leading to models that fail to accurately represent real-world scenarios.

In truth, the argument surrounding synthetic data is far more nuanced than the discussion so far suggests.

What Is AI Model Collapse?

“We can all agree that when AI is trained on information emanating from previous iterations of AI models and their resultant volumes of outputs, it has the potential to propagate errors and introduce noise, leading to a decline in output quality. This creates a self-perpetuating cycle of garbage in – garbage out, reducing the software system’s effectiveness and deviating from human-like understanding and accuracy. This is often called ‘model collapse’ or model autophagy disorder (pleasingly shortened to MAD), where AI systems progressively lose their grasp on the true data they are meant to model,” said Anand Kannappan, CEO of model evaluation platform company Patronus AI.
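To see why that feedback loop is worrying, consider a tiny, self-contained simulation (my own sketch, not Patronus AI’s): a “model” that is just a fitted Gaussian, retrained each generation only on the previous generation’s samples.

```python
# Minimal model-collapse sketch: each generation is fitted only to the
# previous generation's synthetic output, and detail in the tails erodes.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=200)   # the original "real" data

for generation in range(1, 31):
    mu, sigma = data.mean(), data.std()           # "train" the model on the current data
    data = rng.normal(mu, sigma, size=200)        # next generation: model output only
    if generation % 10 == 0:
        print(f"generation {generation:2d}: estimated spread ≈ {sigma:.3f}")

# With small samples and no fresh real data mixed in, the estimated spread
# tends to drift and rare, tail-end values disappear from later generations.
```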

But Kannappan is balanced. The ex-Meta AI research specialist agrees that this is an issue we must be acutely aware of, but argues that the answer is not to point the finger at synthetic data. Synthetic data, as it turns out, is absolutely necessary to train models.

As already noted above, while humans generate vast amounts of data, there are scenarios where specific types of data are scarce or difficult to obtain. This is particularly true in fields like medicine, where data scarcity can hinder the development of effective AI systems. Consider the challenge of creating an AI system to detect early signs of a rare genetic disorder. In such cases, researchers might only have access to a limited number of real patient cases, making it difficult to train models effectively.

Curated Custom Datasets

“To build accurate models, data that represents rare events, such as unusual medical conditions, is crucial. These events occur too infrequently to generate sufficient historical data for comprehensive analysis, so this is where synthetic data generation becomes invaluable. It allows AI engineers and software application development teams to create custom datasets with specific properties or distributions, precisely tailoring the data to their needs. This level of control ensures that the models are trained on data that accurately reflects the scenarios they are designed to address,” said Kannappan.
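A hypothetical example of what that precise tailoring can look like in practice (the feature names, distributions and prevalence below are invented, not drawn from any real patient data):

```python
# Toy sketch: build a training set with a deliberately boosted share of a rare
# condition so a classifier sees enough positive cases. Everything is invented.
import numpy as np

rng = np.random.default_rng(42)

def synth_patients(n, rare):
    """Hypothetical patient rows: [biomarker, age, label]."""
    biomarker = rng.normal(5.0 if rare else 2.0, 1.0, size=n)  # shifted for the rare condition
    age = rng.integers(20, 90, size=n)
    label = np.full(n, int(rare))
    return np.column_stack([biomarker, age, label])

# Real-world prevalence might be well under 1%; here we synthesize a 20% share.
dataset = np.vstack([synth_patients(8_000, rare=False),
                     synth_patients(2_000, rare=True)])
rng.shuffle(dataset)
print(dataset.shape, "rare fraction:", round(dataset[:, -1].mean(), 2))
```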

In reality, he says, synthetic data can be cleaner than real-world data, which often contains errors or inconsistencies. By eliminating these issues, synthetic data can lead to more robust models that perform better in real-world applications. Another factor that can’t be ignored is cost: synthetic data is cost-effective, which is particularly helpful for startups or organizations with limited resources. Training models from scratch using real-world data can be prohibitively expensive, but synthetic data provides a more affordable alternative.

Synthetic data is a powerful tool that allows engineers to replicate real-world patterns without exposing sensitive information. While most organizations won’t ever train AI models the size of GPT-4 or Llama 3.1 on their own, they will need high-quality data for fine-tuning these models for specific use cases.

From examples witnessed across the Patronus AI customer base, Kannappan and team point to finance as a case in point: multinational banks can develop advanced fraud detection systems using synthetic data, avoiding the use of actual customer transaction data. This enhances security while protecting customer privacy. Insurance companies can also create synthetic datasets to train risk assessment models, improving their underwriting processes without compromising client information.

“Synthetic data also helps organizations comply with strict regulations like GDPR and HIPAA while driving innovation. This compliance is essential in today’s regulatory environment, where protecting data privacy is critical,” said Kannappan. “Synthetic data also promotes enhanced collaboration by allowing organizations to share insights with partners and researchers without disclosing sensitive information. This capability encourages teamwork and accelerates progress across industries while maintaining trust and safeguarding privacy.”

Seven Steps To Synthetic Safety

The Patronus AI team provides seven key practices for the safe use of synthetic data.

  1. Organizations should blend synthetic and real-world data to create training sets that capture both real-world variability and simulated conditions, reducing overfitting and improving model robustness.
  2. Firms should establish solid validation processes, regularly checking synthetic data against real-world data to ensure accuracy, avoid introducing bias and maintain the reliability of trained models (a minimal validation sketch follows this list).
  3. AI teams should secure access to high-quality real-world data to provide a strong foundation for model training and validation, reducing reliance on synthetic data alone.
  4. It is important to maintain diverse data sources, incorporating data from different demographics and geographies to enhance model generalization, while being mindful of synthetic data’s limitations.
  5. It is important to capture the “long tail” of the data distribution (outliers and uncommon values), ensuring synthetic data reflects rare cases and extreme scenarios in user behavior.
  6. AI teams should practice iterative regression testing, continuously monitoring model performance as more synthetic data is introduced in order to catch potential regressions.
  7. Finally, AI engineers should perform frequent dataset reviews, regularly assessing synthetic datasets to confirm they are high quality and accurately reflect real patterns.
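As a minimal sketch of the validation step (point 2 above), here is one way to flag a synthetic column that has drifted from its real counterpart before it is blended into a training set. The data, distributions and threshold are assumptions for illustration only.

```python
# Validation sketch for step 2: compare a synthetic column against the real one
# it imitates before mixing them (step 1). Data and threshold are invented.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)
real_amounts = rng.lognormal(mean=3.0, sigma=1.0, size=5_000)       # stand-in for real values
synthetic_amounts = rng.lognormal(mean=3.1, sigma=0.9, size=5_000)  # stand-in for generator output

# Two-sample Kolmogorov-Smirnov test: a tiny p-value signals a distribution mismatch.
stat, p_value = ks_2samp(real_amounts, synthetic_amounts)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.4f}")

if p_value < 0.01:
    print("Synthetic column drifts from the real distribution; review before blending.")
else:
    blended = np.concatenate([real_amounts, synthetic_amounts])     # step 1's mix
```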

Strategic Congruency

“As more people discover the incredible potential of synthetic data, we’re seeing the perception of the technology shift,” said senior director of advanced analytics at SAS, Susan Haller. “That’s a good thing, because there’s so much value in this approach. One of the reasons we’ve prioritized SAS Data Maker is that it makes synthetic data generation more accessible for all types of businesses – they can generate data that’s ‘statistically congruent’ with real data without resorting to manual collection or buying it from a third party. There’s so much potential with this technology, and what IT leaders are doing now is just the tip of the iceberg.”

Haller says that while the healthcare industry often turns to synthetic data to help with research or to train algorithms without risking patient privacy (like our aforementioned tumor or infection research), other industries are also starting to realize benefits. Manufacturing, for example, uses synthetic data to counter the problems caused when equipment malfunctions, which can mean significant productivity losses or higher costs.

“Synthetic data also helps manufacturers simulate how equipment performs under different conditions, from normal operations to rare failure scenarios. The results enable them to build predictive models that identify problems before they happen, leading to better efficiency and productivity,” explained Haller. “We see new use cases emerging every day, the possibilities are endless and we’re really excited for what the future holds.”
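A toy version of the manufacturing scenario Haller describes (the sensor, signal shapes and failure signature below are all invented): generate “normal” and “failure” operating windows so a predictive model has failure examples to learn from, even when real failures are rare.

```python
# Toy sketch of synthetic equipment telemetry: normal windows plus rare failure
# windows with a drifting vibration signature. All values are invented.
import numpy as np

rng = np.random.default_rng(3)

def sensor_window(failing, length=60):
    """One window of vibration readings; failures drift upward over time."""
    base = rng.normal(1.0, 0.05, size=length)              # normal operation
    if failing:
        base += np.linspace(0.0, 0.8, length)              # injected failure signature
    return base

windows = [sensor_window(failing=False) for _ in range(950)] + \
          [sensor_window(failing=True) for _ in range(50)]  # rare failures, oversampled on purpose
labels = np.array([0] * 950 + [1] * 50)
X = np.stack(windows)
print(X.shape, "failure share:", labels.mean())
```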

It would appear that synthetic data isn’t the enemy of progress. Quite the opposite: it’s necessary in order to innovate. But balance is key, and teams will need to mix synthetic data with real-world information, validate rigorously and maintain diverse data sources. This will allow organizations to build while protecting privacy. Synthetic data, used smartly, can be the path to responsible and effective AI advancement. That doesn’t mean you should go back to plastic bags when you go shopping, though – get a life, a bag for life.
