Designing Globally Scalable Oncology Data Systems

Executive Summary

Artificial intelligence may improve cancer screening, diagnosis, treatment selection, and evidence generation, but reliable deployment across health systems depends on how the underlying data is generated, described, and governed. The global cancer burden is increasing, with IARC/WHO estimating about 20 million new cancer cases in 2022 and projecting more than 35 million by 2050 [1]. Oncology data is shaped by population biology, sample preparation, sequencing technologies, scanner hardware, governance rules, and infrastructure. A global oncology data intelligence system should preserve clinical meaning, acquisition context, provenance, and jurisdiction-specific constraints from the outset [6-10]. Although oncology is used throughout as the reference case because of the maturity of its harmonization standards, the architectural principles apply to disease data more broadly.

Models trained on geographically concentrated datasets may generalize poorly when used in different populations, laboratories, and clinical workflows.

Scalability should be treated as an architectural property. Semantic interoperability, harmonized metadata, and data governance need to be engineered into the platform rather than added after deployment [6-9].

A governed discovery and access layer is part of the operating model, not a separate concern. Conversational or guided interfaces over a harmonized catalogue can carry controlled vocabulary, intent capture, and per-tenant scoping through to the point of data request. PaiX Navigator illustrates this pattern over the PAICON Datalake [19,21].

The recommended operating model combines harmonized common data models, controlled vocabularies, federated or distributed analytics where centralization is impractical, and audit mechanisms aligned with medical-device regulation [6-10,16,17].

Why Oncology Data Needs Harmonization

Oncology data cannot be assumed to have the same meaning across sites, even when those sites use similar labels or file formats. A cancer diagnosis, sequencing result, or histopathology image can differ in biological context, acquisition method, annotation protocol, and clinical documentation. For AI development, these differences matter because they can shift the input distribution and affect external validity [4-8].

Population and Disease Heterogeneity

Cancer biology varies across tumor type, molecular subtype, inherited background, environmental exposure, screening history, and prior treatment. A model trained on a limited demographic or institutional distribution may learn patterns that are valid in the training environment but fail to generalize elsewhere. Reproducibility should therefore be evaluated across the populations, devices, and workflows where the system is expected to operate [1,3,16].

Technical Variability

Digital pathology data is influenced by fixation, staining, section thickness, scanner optics, compression, and calibration. Genomic data is influenced by sequencing chemistry, read depth, error profile, panel design, alignment pipeline, and variant-calling method. Comparative sequencing studies show that performance characteristics can vary across platforms, so acquisition variables should be recorded as metadata and considered during preprocessing, model training, validation, and monitoring [4,5].

Workflow and Documentation Variability

Oncology records contain staging, treatment line, response, recurrence, progression, adverse events, biomarkers, and survival outcomes. These elements are not documented or standardized uniformly across institutions or countries. As a result, syntactic compatibility alone is insufficient; semantic alignment is required so that clinically equivalent concepts are represented consistently [6-8].

Architectural Requirements for Global Scalability

Context Capture

A scalable system should preserve the circumstances under which data was generated. This includes source institution, jurisdiction, consent constraints, patient-level demographic variables where legally and ethically available, sample preparation, assay platform, scanner or sequencer metadata, annotation protocol, and clinical workflow context. These fields support external validation, subgroup analysis, post-market monitoring, and root-cause analysis [9,10,17].
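As a minimal sketch of how such context could travel with each data item, the record below groups a few of the fields listed above. The class name, field names, and the choice of which fields are required are illustrative assumptions, not part of any cited standard.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class AcquisitionContext:
    """Hypothetical context record attached to each data item."""
    source_institution: str
    jurisdiction: str
    consent_scope: tuple[str, ...]           # permitted uses, e.g. ("research",)
    assay_platform: Optional[str] = None     # sequencer or panel, if genomic
    scanner_model: Optional[str] = None      # slide scanner, if pathology
    sample_preparation: Optional[str] = None
    annotation_protocol: Optional[str] = None


# Assumption: these three fields are treated as mandatory at ingestion.
REQUIRED = ("source_institution", "jurisdiction", "consent_scope")


def missing_context(ctx: AcquisitionContext) -> list[str]:
    """Return the names of required fields that are empty."""
    values = asdict(ctx)
    return [name for name in REQUIRED if not values[name]]


complete = AcquisitionContext("Site A", "DE", ("research",),
                              scanner_model="hypothetical-scanner")
incomplete = AcquisitionContext("", "DE", ())
print(missing_context(complete))    # no required field is empty
print(missing_context(incomplete))  # lists the empty required fields
```

Rejecting or flagging records at ingestion when required context is missing keeps later subgroup analysis and root-cause analysis from depending on fields that were never captured.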

Semantic Harmonization

FHIR-based oncology profiles, mCODE, OMOP-CDM, and controlled vocabularies such as ICD, SNOMED, and LOINC can help preserve meaning across institutions. These standards should be implemented as part of the data model, not added as a late-stage translation layer [6-8].
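The sketch below illustrates one way to treat harmonization as part of the data model: site-specific labels are resolved to a controlled code at ingestion, and anything unmapped is held for review rather than guessed. The site names, local labels, and mapping table are invented for illustration; only the two ICD-10 codes (C18.9, malignant neoplasm of colon, unspecified; C50.9, malignant neoplasm of breast, unspecified) are real.

```python
# Illustrative mapping table; in practice this would be a versioned,
# curated artifact rather than an inline dictionary.
LOCAL_TO_ICD10 = {
    ("site_a", "colon ca"): "C18.9",
    ("site_b", "kolonkarzinom"): "C18.9",
    ("site_a", "breast ca"): "C50.9",
}


def harmonize(site: str, label: str):
    """Return the ICD-10 code for a local label, or None if unmapped."""
    return LOCAL_TO_ICD10.get((site, label.strip().lower()))


def harmonize_batch(records):
    """Split (site, label) pairs into mapped rows and unmapped pairs."""
    mapped, unmapped = [], []
    for site, label in records:
        code = harmonize(site, label)
        if code is None:
            unmapped.append((site, label))   # route to manual curation
        else:
            mapped.append({"site": site, "source_label": label, "icd10": code})
    return mapped, unmapped


mapped, unmapped = harmonize_batch([
    ("site_a", "Colon CA"),
    ("site_b", "Kolonkarzinom"),
    ("site_b", "unknown primary"),
])
print(mapped)
print(unmapped)
```

Keeping the original label next to the harmonized code preserves the source wording for audit, while the unmapped list makes gaps in the vocabulary visible instead of silently dropping records.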

Quality Control

Quality control should evaluate completeness, label consistency, image quality, assay reliability, site-specific artifacts, missingness patterns, and cohort balance. Site-level and subgroup-level validation should be routine, because aggregate performance can obscure clinically relevant failure modes [3-5,17].
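The point about aggregate performance can be shown with a small worked example. The rows below are made-up toy data, and the field names are assumptions; the function simply computes accuracy per group.

```python
from collections import defaultdict


def accuracy_by_group(rows, group_key):
    """Compute accuracy separately for each value of group_key."""
    hits, totals = defaultdict(int), defaultdict(int)
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        hits[group] += int(row["label"] == row["prediction"])
    return {group: hits[group] / totals[group] for group in totals}


# Toy data: site A contributes most cases and is predicted well,
# site B contributes few cases and is predicted badly.
rows = (
    [{"site": "A", "label": 1, "prediction": 1} for _ in range(8)]
    + [{"site": "B", "label": 1, "prediction": 0} for _ in range(2)]
)

overall = sum(r["label"] == r["prediction"] for r in rows) / len(rows)
per_site = accuracy_by_group(rows, "site")
print(overall)   # 0.8
print(per_site)  # {'A': 1.0, 'B': 0.0}
```

An aggregate accuracy of 0.8 hides the fact that every case from site B is misclassified, which is why site-level and subgroup-level reporting belongs in routine quality control.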

Federated Analytics

In many cases, legal, logistical, and ethical constraints make central pooling of raw patient-level data impractical. Federated learning and distributed analytics can allow institutions to collaborate while keeping data within local governance boundaries. This approach does not remove the need for harmonization; it increases the importance of shared definitions, validation protocols, and lineage [9,11,16,17].
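A minimal sketch of distributed analytics, assuming each site can share only aggregate statistics: every site computes a local count, sum, and sum of squares, and a coordinator combines them into a pooled mean and variance without seeing patient-level rows. Function names and values are hypothetical, and the sketch omits disclosure controls (such as minimum group sizes) that a real deployment would likely need.

```python
def local_summary(values):
    """Run at each site: reduce patient-level values to aggregates."""
    return {
        "n": len(values),
        "sum": sum(values),
        "sum_sq": sum(v * v for v in values),
    }


def pooled_mean_variance(summaries):
    """Run at the coordinator: combine site aggregates.

    Returns the pooled mean and the population variance.
    """
    n = sum(s["n"] for s in summaries)
    total = sum(s["sum"] for s in summaries)
    total_sq = sum(s["sum_sq"] for s in summaries)
    mean = total / n
    variance = total_sq / n - mean ** 2
    return mean, variance


# Each list stays inside its own site; only the summaries are exchanged.
site_1 = local_summary([1.0, 2.0, 3.0])
site_2 = local_summary([4.0, 5.0])
mean, variance = pooled_mean_variance([site_1, site_2])
print(mean, variance)
```

The result is only meaningful if both sites measure the same variable in the same unit under the same definition, which is the sense in which federation raises rather than lowers the importance of harmonization.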

Governance, Regulation, and Traceability

Regulatory Posture

AI systems used for diagnosis or clinical decision support are generally subject to more demanding oversight than administrative analytics. In the European Union, the AI Act and the European Health Data Space emphasize risk management, data quality, transparency, human oversight, and rules for health-data access. In the United States, FDA guidance on AI-enabled medical device software increasingly reflects lifecycle management rather than one-time validation [9,10].

Data Lineage

A clinical AI platform should be able to reconstruct the data and model state associated with a specific prediction. At minimum, this includes the source record, preprocessing steps, harmonization mappings, dataset version, model version, inference-time inputs, prediction output, uncertainty or confidence measures, and post-market monitoring data. This approach aligns with regulatory expectations for traceability, risk management, monitoring, and change control [9,10,17].
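One way to make such reconstruction practical is to write a lineage record at inference time and seal it with a content hash. The sketch below assumes hypothetical field names and identifiers; it uses SHA-256 over a canonical JSON serialization so that the same inputs always produce the same fingerprint.

```python
import hashlib
import json


def fingerprint(obj) -> str:
    """SHA-256 of a canonical JSON serialization of obj."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def lineage_record(source_record_id, preprocessing_steps, mapping_version,
                   dataset_version, model_version, inputs, prediction,
                   confidence):
    """Build a lineage entry for one prediction (field names are assumptions)."""
    record = {
        "source_record_id": source_record_id,
        "preprocessing_steps": list(preprocessing_steps),
        "mapping_version": mapping_version,
        "dataset_version": dataset_version,
        "model_version": model_version,
        "input_hash": fingerprint(inputs),   # hash, not the raw inputs
        "prediction": prediction,
        "confidence": confidence,
    }
    # Seal the record so later tampering or drift is detectable.
    record["record_hash"] = fingerprint(record)
    return record


args = dict(
    source_record_id="rec-001",
    preprocessing_steps=["tile", "stain-normalize"],
    mapping_version="map-v3",
    dataset_version="ds-2024-01",
    inputs={"slide_id": "s-17", "tiles": 512},
    prediction="MSI-high",
    confidence=0.91,
)
first = lineage_record(model_version="model-1.0", **args)
again = lineage_record(model_version="model-1.0", **args)
changed = lineage_record(model_version="model-1.1", **args)
print(first["record_hash"] == again["record_hash"])    # True
print(first["record_hash"] == changed["record_hash"])  # False
```

Because the record hash covers the model and dataset versions, a change to either produces a different fingerprint, which supports change control and post-deployment review.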

Security and Provenance

AI pipelines can be affected by conventional cybersecurity threats as well as AI-specific threats such as data poisoning or model poisoning. Provenance tracking, access controls, dataset versioning, and monitoring of unexpected distribution shifts are therefore safety and security controls, not merely documentation features [17].
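Monitoring for unexpected distribution shift can be sketched with the population stability index (PSI), which compares the binned distribution of a feature in a reference dataset with its distribution in current data. The feature values, bin edges, and smoothing constant below are illustrative assumptions, and alert thresholds would need to be chosen per deployment.

```python
import math


def bin_fractions(values, edges):
    """Fraction of values falling into each bin defined by sorted edges."""
    if not values:
        raise ValueError("values must be non-empty")
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[sum(v >= e for e in edges)] += 1
    return [c / len(values) for c in counts]


def psi(reference, current, edges, eps=1e-6):
    """Population stability index between two samples.

    eps avoids log(0) for empty bins; each term is non-negative.
    """
    ref = bin_fractions(reference, edges)
    cur = bin_fractions(current, edges)
    return sum((c - r) * math.log((c + eps) / (r + eps))
               for r, c in zip(ref, cur))


# Hypothetical per-slide feature, e.g. a normalized stain intensity.
reference = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
shifted = [v + 0.3 for v in reference]
edges = [0.25, 0.5, 0.75]

print(psi(reference, reference, edges))  # 0.0 for identical samples
print(psi(reference, shifted, edges))    # positive when the bins differ
```

A sudden rise in such a score after onboarding a new site or scanner can be a quality signal or a security signal, and in either case it points back to the provenance records for investigation.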

Evidence-Based Implementation Examples

AACR Project GENIE is a useful example of harmonized, multi-institutional cancer data sharing. Its value lies less in centralization itself and more in common data elements, shared definitions, and governance structures that support reproducible research across participating cancer centers [16].

Regional Research Operations

Clinical research operations also vary by region. Recent analyses describe shifts in clinical-trial activity across BRICS and G7 countries, while Brazil-specific oncology research assessments identify regulatory timelines and implementation capacity as practical constraints. These operational differences should be reflected in data-ingestion planning, quality assurance, and governance design [12,13].

MSI Prediction from H&E Pathology Images

AI-based prediction of microsatellite instability from routine H&E whole-slide images is an active area of research and product development. MSI/MMR testing is clinically relevant in colorectal cancer, and independently published work supports the feasibility of AI-based MSI pre-screening from H&E slides [18].

Digital Pathology Foundation Models

Foundation models for pathology may improve transferability if trained and validated across diverse tumor sites, scanners, staining protocols, and clinical contexts. Company-reported datasets, such as PAICON’s PaiX Datalake, may be relevant examples [5,19].

Low-Resource Diagnostic Pathways

AI-enabled point-of-care or mobile workflows may be useful in settings with limited specialist availability, but their value depends on integration with referral, confirmation testing, treatment access, data protection, maintenance, and local workforce capacity. Cervical cancer illustrates the implementation challenge: WHO reports that most cervical-cancer deaths occur in low- and middle-income countries, where screening and treatment pathways can be constrained [2,14,15].

PaiX Navigator: Governed Data Discovery in Practice

The architectural principles described in earlier sections, including harmonized vocabularies, acquisition context, federated governance, and auditable lineage, are necessary but not sufficient on their own. They need to be made operational at the point where a researcher or clinician actually tries to find and request data. PaiX Navigator, developed by PAICON, is designed to close exactly this gap.

The gap between a clinical data request and a runnable query is wide, and it is usually closed by hand. A discovery layer is the place to close it. PaiX Navigator is PAICON’s company-reported example of this pattern [21]: an agent-assisted interface over the PAICON Datalake in which a natural-language request is clarified into a structured intent, translated by a query agent against a harmonized catalogue, and routed to per-tenant audited delivery.

The catalogue surfaced through Navigator is expanding with new European biobank cohorts and customer-edge clinical data flows, each onboarded against the same controlled vocabulary so it is comparable with existing datasets at the moment of request. The customer-edge pattern is governance-driven: raw data is anonymized at the partner site, harmonized centrally, and exposed in Navigator only under that partner’s tenant, while the partner gains a governed view of their own cohort alongside the broader catalogue.
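The general pattern of a structured intent resolved against a tenant-scoped catalogue can be sketched as follows. This is not Navigator's implementation: every class, field, dataset identifier, and tenant name here is a hypothetical illustration of per-tenant scoping over a harmonized vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    dataset_id: str
    icd10: str               # harmonized diagnosis code
    modality: str            # e.g. "wsi" or "ngs"
    visible_to: frozenset    # tenant ids, or {"*"} for the shared catalogue


@dataclass(frozen=True)
class Intent:
    """Structured form of a clarified natural-language request."""
    tenant: str
    icd10: str
    modality: str


def discover(intent, catalogue):
    """Return ids of datasets matching the intent and visible to its tenant."""
    return [
        d.dataset_id
        for d in catalogue
        if d.icd10 == intent.icd10
        and d.modality == intent.modality
        and ("*" in d.visible_to or intent.tenant in d.visible_to)
    ]


catalogue = [
    Dataset("shared-crc-wsi", "C18.9", "wsi", frozenset({"*"})),
    Dataset("partner1-crc-wsi", "C18.9", "wsi", frozenset({"partner1"})),
    Dataset("shared-breast-wsi", "C50.9", "wsi", frozenset({"*"})),
]

print(discover(Intent("partner1", "C18.9", "wsi"), catalogue))
print(discover(Intent("partner2", "C18.9", "wsi"), catalogue))
```

In this sketch the partner sees its own cohort alongside the shared catalogue, while another tenant sees only the shared entries; the comparison works because every dataset is indexed by the same controlled code.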

PaiX Navigator is now accepting beta registrations. Clinical and research data teams can apply for early access.

Conclusion

Oncology AI should be evaluated as a clinical data system, not simply as a modeling exercise. Global scalability depends on whether the system can preserve meaning across institutions, capture acquisition context, validate across intended-use environments, comply with local governance requirements, and reconstruct decisions after deployment. The central design principle is straightforward: data context should travel with the data. When context, harmonization, lineage, and monitoring are engineered into the platform from the beginning, AI systems are more likely to maintain performance across varied clinical settings and to satisfy regulatory and operational requirements [6-10,17]. Discovery and request layers such as PaiX Navigator [21] are one place where these principles become operational for end users: they carry harmonized vocabulary, intent capture, and per-tenant governance through to the moment data is requested.

References

[1] WHO/IARC. Global cancer burden growing amid mounting need for services. 2024. https://www.who.int/news/item/01-02-2024-global-cancer-burden-growing--amidst-mounting-need-for-services

[2] WHO. Cervical cancer fact sheet. https://www.who.int/news-room/fact-sheets/detail/cervical-cancer

[4] Quail MA et al. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics. 2012. https://pmc.ncbi.nlm.nih.gov/articles/PMC4955563/

[5] CAMIL: Context-Aware Multiple Instance Learning for cancer whole-slide images. https://arxiv.org/abs/2305.05314

[6] HL7. FHIR mCODE Implementation Guide. https://build.fhir.org/ig/HL7/fhir-mCODE-ig/

[7] OHDSI. Data standardization and OMOP Common Data Model. https://www.ohdsi.org/data-standardization/

[8] SPHN Semantic Framework. External terminologies including ICD-O-3, SNOMED CT, and LOINC. https://sphn-semantic-framework.readthedocs.io/en/latest/external_terminologies/external_terminologies.html

[9] European Commission. European Health Data Space regulation. https://health.ec.europa.eu/ehealth-digital-health-and-care/european-health-data-space-regulation-ehds_en

[10] FDA. Marketing submission recommendations for a predetermined change control plan for AI-enabled device software functions. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/marketing-submission-recommendations-predetermined-change-control-plan-artificial-intelligence

[11] OECD. Cross-border data flows and data localization. https://www.oecd.org/en/topics/sub-issues/cross-border-data-flows.html

[12] Clinical trial trends in BRICS and G7 countries, 2018-2022. https://pmc.ncbi.nlm.nih.gov/articles/PMC11318782/

[13] Improving access to cancer clinical research in Brazil. ecancer. https://ecancer.org/en/journal/article/1698-improving-access-to-cancer-clinical-research-in-brazil-recent-advances-and-new-opportunities-expert-opinions-from-the-4th-cura-meeting-so-paulo-2023/pdf

[14] World Bank. Digital Health Blueprint Toolkit. https://www.worldbank.org/en/topic/health/brief/digital-health-blueprint-toolkit

[15] WHO. Digital implementation investment guide. https://www.who.int/publications/i/item/9789240010567

[16] AACR Project GENIE. https://www.aacr.org/professionals/research/aacr-project-genie/