From Bytes to Bedside: Improving Real-World Applications with Synthetic Data

Imagine walking into a busy hospital where every interaction, from the moment a patient arrives to the final follow-up, is meticulously recorded and immediately analysed. Despite the hurried footsteps, beeping monitors, and hushed conversations, everything is running smoothly, based on data optimisation. But there is a twist: none of the patients, doctors, or nurses are real. Instead, this hospital is a simulation populated entirely by Artificial Intelligence (AI) agents.

This scene is not the plot of a science fiction novel; it is the description of a cutting-edge research project recently conducted at Tsinghua University in China. Here, researchers have created a fully simulated hospital environment and used generative AI-based models to populate it with virtual patients and medical staff. The implications of their successful project go beyond the medical field, as it highlights the immense potential of so-called synthetic data to improve the performance of real-world AI. The simple fact that AI can be improved by training in synthetic, controlled environments before tackling the complexities of the real world offers some lessons for European businesses and policymakers to consider.

A New Kind of Training Ground

The project at Tsinghua University is called “Agent Hospital” and simulates the entire medical process, including disease onset, triage, registration, consultation, medical examination, diagnosis, treatment, convalescence, and post-hospital follow-up. The primary goal is to enable medical agents to learn how to treat diseases in this virtual environment, using large language models (LLMs), such as the one powering the popular chatbot “ChatGPT”. However, as in many real-world sectors currently trying to implement new AI methods, a key problem is the lack of data, which typically needs to be labelled in order to train an algorithm to predict, for example, a disease or the success of a medical treatment.

To get around this problem, the Chinese researchers developed a strategy called “MedAgent-Zero”, which allows the virtual medical agents to improve their skills without relying on manually labelled data. In this, they were inspired by the famous AlphaZero programme developed by DeepMind, which achieved an almost unimaginable level of chess play using only so-called reinforcement learning and self-play to train its neural networks. By relying on LLM-powered agents to perform a similar feat, the virtual doctors staffing “Agent Hospital” could gain experience from both successful and unsuccessful cases and continuously refine their skills. In practice, this means that these simulated physicians were able to handle tens of thousands of simulated cases in just a few days, a volume of experience that would take real doctors several years to accumulate.

By simulating disease onset and progression based on predefined “knowledge bases” and LLMs, the virtual hospital enables medical agents to effectively acquire and apply medical knowledge. Over time, the AI doctors demonstrated significant improvements in their performance, achieving state-of-the-art accuracy in various medical tasks. For example, after treating around ten thousand patients, the evolved doctor agent achieved 93.06% accuracy on a subset of the MedQA dataset (a popular medical dataset for machine learning research, collected from professional medical board exams) covering major respiratory diseases. The success of the Tsinghua project thus lies in its ability to create a realistic and scalable platform for medical training. As this innovative approach not only improves agent performance in the simulated environment but also translates into better outcomes in real-world medical benchmarks, it raises the obvious question of whether similar approaches could be used to advance AI in fields other than healthcare.

Beyond Simulated Hospitals: The Broader Implications

Agent Hospital is just one example of how synthetic data can transform AI training. The underlying principle is simple: synthetic data, when generated and validated correctly, can provide a vast, diverse, and cheap training set that helps AI models generalise better to real-world scenarios. Synthetic data can be generated through various methods. For instance, hand-engineered methods involve using expert knowledge to identify underlying distributions in real data and then imitating them. This method aims to ensure that the synthetic data closely mirrors the real-world scenarios it is meant to replicate. Agent-based models create synthetic data by allowing predefined agents to interact according to specific rules. Over time, these interactions generate distribution profiles similar to those found in real datasets. Generative machine learning models, which are considered state-of-the-art, learn how real data is generated and then produce synthetic data by sampling from the learned distributions.
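
The shared idea behind these methods, estimating a distribution from real data and then sampling from it, can be illustrated with a deliberately simple sketch. It assumes a single numeric attribute that is roughly normally distributed; the `real_ages` values and the function names are hypothetical and chosen only for illustration. Real generative models learn far richer distributions, but the fit-then-sample pattern is the same.

```python
import random
import statistics


def fit_gaussian(values):
    """Estimate mean and standard deviation from the real observations."""
    return statistics.mean(values), statistics.stdev(values)


def sample_synthetic(mean, stdev, n, seed=0):
    """Draw n synthetic values from the fitted normal distribution."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.gauss(mean, stdev) for _ in range(n)]


# Hypothetical "real" data: ages of ten patients (illustrative values only).
real_ages = [34, 45, 52, 61, 47, 39, 58, 66, 41, 50]

mu, sigma = fit_gaussian(real_ages)
synthetic_ages = sample_synthetic(mu, sigma, n=1000)

# The synthetic sample should roughly reproduce the fitted statistics.
print(round(mu, 1), round(statistics.mean(synthetic_ages), 1))
print(round(sigma, 1), round(statistics.stdev(synthetic_ages), 1))
```

Note that a sample like this is only as good as the assumed distribution: if the real data are skewed or multimodal, a single Gaussian will not mirror them, which is one reason validation of synthetic data matters.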

A good illustration of synthetic data generation and its transfer into real-world applications is RoboCasa, developed by researchers at the University of Texas at Austin and NVIDIA. This software simulates home environments to train home robots, with different kitchen environments and a variety of virtual objects. To achieve this, the researchers designed thousands of 3D assets across more than 150 object categories and dozens of interactable pieces of furniture and appliances. Subsequently, they enriched the diversity of this dataset with generative AI tools, such as object assets from text-to-3D models and environment textures from text-to-image models. Crucially, the integration of AI-generated textures significantly increased the visual diversity of the training data, improving the robots’ ability to generalise to real-world tasks. Robots trained on larger datasets generated within these simulated environments outperformed those trained on smaller datasets.

The implications are profound. With synthetic data, it is possible to bootstrap AI systems beyond the limitations of natural data distributions, making scaling industrial AI more cost-effective and efficient, and ultimately contributing to real-world advances like the production of smart and capable robots. For example, Google researchers recently presented a semi-automated way to create accurate and hyper-detailed image descriptions for training vision language models, going beyond the current generation of web-scraped descriptions, which are short, coarse-grained, and often contain details unrelated to the visual content. More broadly, multimodal AI models have helped robotics leap forward by grounding it more firmly in the real world. A robot trained on such models can, as The Economist summarised, “hold spoken conversations, recognise and manipulate objects, solve problems and explain its actions”. Tellingly, OpenAI has recently announced a partnership with a humanoid robot company.

Notably, synthetic data might also mitigate privacy concerns when sharing or using data for AI training, helping to comply with data protection regulations. This might be particularly needed in Europe, as recent research has suggested that the strict rules of the GDPR have led to EU firms decreasing data storage by 26% and data processing by 15% relative to comparable US firms and thereby becoming less “data-intensive”. As big data-based AI continues to evolve, synthetic data could therefore play a crucial role in strengthening Europe’s digital competitiveness by enabling scalable, cost-effective, and legally secure training of AI models and pushing the boundaries of what these systems can achieve.

Remaining Challenges

However, the use of generative AI models to produce synthetic data raises certain ethical, privacy, and legal considerations. As the New York Times’ lawsuit against OpenAI highlighted, these models sometimes produce information that is extremely close to the original training data. Similarly, the AI-centred search start-up Perplexity was found to be republishing parts of exclusive newspaper articles. Ensuring that synthetic data does not inadvertently reveal sensitive information or introduce biases is therefore crucial. So-called differential privacy techniques, which add calibrated noise to the data generation process, might help mitigate these risks by providing mathematical guarantees against inference attacks. The key idea here is that the presence or absence of any individual record in the dataset should not significantly affect the outcome of the mechanism.
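
To make this key idea concrete, the following minimal sketch applies the classic Laplace mechanism to a counting query. It is an illustration under stated assumptions rather than a production implementation: the patient records, the function names, and the choice of epsilon are hypothetical. Adding or removing one record changes a count by at most one, so noise drawn from a Laplace distribution with scale 1/epsilon is the standard calibration for this kind of query.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Return a noisy count of the records that satisfy the predicate.

    A count has sensitivity 1 (one record changes it by at most 1),
    so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical records (illustrative values only).
patients = [{"age": a} for a in (34, 45, 52, 61, 47, 39, 58, 66)]

rng = random.Random(42)  # fixed seed so the sketch is reproducible
noisy = private_count(patients, lambda p: p["age"] >= 50, epsilon=0.5, rng=rng)
print(noisy)
```

A smaller epsilon means more noise and stronger privacy, while a larger epsilon yields more accurate but less protective answers. This is the privacy-utility trade-off in its simplest form.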

How can new ways of generating synthetic data be used in AI training while preserving privacy and mitigating bias? First, the computer science community can help practitioners and stakeholders identify the use cases where synthetic data can be used safely, perhaps even in a semi-automated way, and how to implement privacy-enhancing technologies, such as differential privacy. Promisingly, researchers have created a reliable framework for detecting whether an LLM has been trained on a particular dataset, which could be used in red-teaming exercises to protect against litigation.

European policymakers also have a role to play in clarifying how companies can use more synthetic, or anonymised, data for AI training and manage future re-identification risks. Our recent survey of 1,000 European businesses showed that there is a need to clarify how synthetic data can be used in a legally secure manner and when data can be considered anonymised, for example by developing and defining uniform and workable standards. At the very least, EU data protection authorities should therefore provide workable guidelines for the generation and processing of synthetic data, prioritising the principles of transparency, accountability, and fairness. This will help to understand the types of data, tasks, and settings in which appropriate privacy-utility trade-offs can be achieved through synthetic data produced by generative models.

Conclusion

From simulated hospital beds to kitchen robots, synthetic data is revolutionising the way AI systems are trained and deployed. The new ways of creating synthetic data that have recently become available thanks to LLMs and other types of generative AI could transform the process of machine learning and herald an opportunity for European industry to catch up with US and Chinese competitors. This is especially true for smaller firms, which might not otherwise have the capacity or financial power to acquire and label large datasets for AI training. However, as Europe moves forward with this technology, it is imperative to balance innovation with ethical considerations to ensure that synthetic data serves as a powerful tool for the betterment of society. For this, the industry needs clear guidelines and consistent and workable technical standards.

Anselm Küsters is Head of Digitalisation/New Technologies at the Centrum für Europäische Politik (cep), Berlin. As a post-doctoral researcher at the Humboldt University in Berlin and as an associate researcher at the Max Planck Institute for Legal History and Theory in Frankfurt am Main, he conducts research in the field of Digital Humanities.

Küsters gained his Master’s degree in Economic History at the University of Oxford (MPhil) and his PhD at the Johann Wolfgang Goethe University in Frankfurt am Main.