News March 14, 2025 4 min read
Synthetic Data in Oncology: Can It Replace Real Slides for Training?


PAICON

As artificial intelligence (AI) continues to advance the field of oncology, the need for large, diverse, and high-quality data has never been more pressing. In digital pathology, whole slide images (WSIs) are essential for training AI models that support cancer detection, classification, and prognosis. But acquiring annotated WSIs at scale is costly, time-consuming, and often limited by privacy regulations. This is where synthetic data has entered the spotlight, promising to fill gaps where real data is …

What Is Synthetic Pathology Data?

Synthetic pathology data refers to computer-generated images that mimic real histopathological slides. These can be produced through various methods, including:

  • Generative Adversarial Networks (GANs)
  • Image augmentation techniques
  • Diffusion models
  • Simulation-based frameworks

The idea is either to generate entirely new, plausible tissue structures or to enhance existing datasets by introducing controlled variability. The potential benefits are addressing class imbalance, reducing reliance on sensitive patient data, and easing the burden of manual annotation.
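
As a rough illustration of "controlled variability", the simplest of the methods listed above, image augmentation, can be sketched in a few lines. This is a minimal sketch, not a production pipeline: the function name `augment_tile`, the 64×64 tile size, and the ±10% brightness range are assumptions made for the example, and the input tile is random noise standing in for a real WSI patch.

```python
import numpy as np

def augment_tile(tile: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Return a randomly flipped, rotated, and brightness-jittered copy of a tile.

    `tile` is assumed to be a square uint8 RGB array of shape (H, W, 3).
    """
    out = tile.copy()
    if rng.random() < 0.5:
        out = np.flip(out, axis=1)  # horizontal flip
    if rng.random() < 0.5:
        out = np.flip(out, axis=0)  # vertical flip
    # Rotate by a random multiple of 90 degrees; tissue has no canonical orientation.
    out = np.rot90(out, k=int(rng.integers(0, 4)), axes=(0, 1))
    # Mild brightness jitter (assumed range) to mimic staining or scanner variation.
    scale = rng.uniform(0.9, 1.1)
    return np.clip(out.astype(np.float32) * scale, 0, 255).astype(np.uint8)

rng = np.random.default_rng(seed=0)
# Placeholder tile: random pixels standing in for a real slide patch.
tile = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
augmented = [augment_tile(tile, rng) for _ in range(4)]
```

Augmentation of this kind only varies existing tiles. GANs and diffusion models, by contrast, aim to generate new tissue structures and need a trained generative model that is beyond the scope of a short sketch.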


Use Cases: Where Synthetic Data Adds Value

Balancing Rare Classes

In cancer research, some tumor subtypes or morphological patterns are underrepresented in datasets. Synthetic slides can help balance training sets by creating additional images of rare findings, potentially improving model sensitivity without needing more patient data.
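
To make the balancing idea concrete, the sketch below counts how many synthetic images each class would need to match the largest class. The class names and counts are invented for illustration, and `synthetic_needed` is a hypothetical helper, not part of any library.

```python
from collections import Counter

def synthetic_needed(labels, target=None):
    """Return, per class, how many synthetic images would bring it up to `target`.

    If `target` is None, the size of the largest class is used.
    """
    counts = Counter(labels)
    goal = target if target is not None else max(counts.values())
    return {cls: max(goal - n, 0) for cls, n in counts.items()}

# Hypothetical label distribution for a training set of slide tiles.
labels = ["adenocarcinoma"] * 900 + ["squamous"] * 250 + ["rare_subtype"] * 30
plan = synthetic_needed(labels)
print(plan)
```

In practice, fully equalizing a very rare class with synthetic images may not be desirable, since the generator itself has only a handful of real examples to learn from. A lower `target` is one way to cap the synthetic share.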

Pretraining and Data Augmentation

Synthetic images are often used to pretrain models before fine-tuning them on real data. This strategy helps the model learn general features of tissue morphology, which can accelerate convergence during fine-tuning.
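
The pretrain-then-fine-tune pattern can be sketched with a toy model. Everything here is an assumption made for illustration: the "features" are random numbers rather than real tile embeddings, the `shift` parameter is a crude stand-in for a domain gap, and plain logistic regression replaces the deep networks used in practice. The sketch shows only the mechanics of warm-starting from synthetic data, not any performance claim.

```python
import numpy as np

def train_logreg(X, y, w=None, lr=0.1, epochs=200):
    """Fit logistic-regression weights by gradient descent, optionally from a warm start `w`."""
    if w is None:
        w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))       # predicted probabilities
        w = w - lr * (X.T @ (p - y)) / len(y)    # gradient step on the log-loss
    return w

rng = np.random.default_rng(0)

def make_features(n, shift):
    """Toy two-class features; `shift` offsets the data to mimic a domain gap."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 5)) + y[:, None] + shift
    X = np.hstack([X, np.ones((n, 1))])          # constant column acts as the bias term
    return X, y.astype(float)

X_syn, y_syn = make_features(2000, shift=0.0)    # large "synthetic" set
X_real, y_real = make_features(100, shift=0.3)   # small "real" set

w_pre = train_logreg(X_syn, y_syn)                                # pretraining
w_ft = train_logreg(X_real, y_real, w=w_pre, lr=0.05, epochs=50)  # fine-tuning
```

The fine-tuning call uses a smaller learning rate and fewer epochs, a common choice when the aim is to adapt pretrained weights to real data rather than overwrite them.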

Data Privacy and Federated Learning

Synthetic datasets can often be shared more freely than real patient data. This opens up opportunities for multi-institutional research, collaborative development, and federated learning initiatives with less risk to patient confidentiality.

The Current Limits

Despite its promise, synthetic data comes with caveats.

  • Domain Gap: Synthetic images often lack the nuanced complexity and artifacts found in real slides. AI models trained only on synthetic data may fail to generalize when applied to clinical data.
  • Annotation Quality: Synthetic data may not come with expert annotations or a biological ground truth, making supervised training less effective.
  • Bias Amplification: If the synthetic generation process is based on a limited or biased dataset, it can replicate and even exaggerate those biases.
  • Lack of Regulatory Acceptance: Clinical-grade AI tools must be trained and validated on real-world data. Synthetic data may assist the process, but it cannot replace real-world validation.

The Role of Real Data Is Still Foundational

While synthetic data is a valuable tool in the oncology AI toolbox, particularly for augmentation, experimentation, and privacy-conscious research, it is not yet a replacement for real-world pathology data. Clinical-grade AI demands rigorous training, testing, and validation using inclusive, representative data that captures the complexity of human biology.

At PAICON, we believe in the power of data, and in real-world data above all. Our AI models are built on a foundation of diverse, high-quality whole slide images from globally and ethically sourced cohorts that are both genetically and technologically diverse. This commitment is intended to ensure that our solutions not only perform well in academic settings but are also robust, reliable, and ready for the real world.

References

  • D’Amico, S., Dall’Olio, D., Sala, C., et al. (2023). Synthetic Data Generation by Artificial Intelligence to Accelerate Research and Precision Medicine in Hematology. JCO Clinical Cancer Informatics, 7, e2300021. https://doi.org/10.1200/CCI.23.00021
  • Mol Babu, G., Wong, K. W., & Parry, J. (2022). Federated Learning for Digital Pathology: A Pilot Study. Procedia Computer Science, 207, 736–743. https://doi.org/10.1016/j.procs.2022.09.129
  • Rashidi, H. H., et al. (2024). Generative Artificial Intelligence in Pathology and Medicine: A Deeper Dive. Modern Pathology, 38(4), 100687. https://doi.org/10.1016/j.modpat.2024.100687
  • Pozzi, M., Noei, S., Robbi, E., et al. (2024). Generating Synthetic Data in Digital Pathology Through Diffusion Models: A Multifaceted Approach to Evaluation. Scientific Reports. https://doi.org/10.1038/s41598-024-79602-w
