Across many countries, particularly in low- and middle-income settings, diagnostic systems remain fragmented, uneven and difficult to scale. Health data are often stored across multiple platforms, collected in inconsistent formats and governed by unclear rules on access, privacy and quality. Almost half of the world’s population continues to lack timely and accurate diagnosis, and frontline health workers must rely on incomplete, siloed or poor-quality data, making it difficult to deliver equitable care. Artificial intelligence (AI)-based tools can help, but only if the data supporting them are trustworthy, representative and well governed.
Without a robust data framework, AI-based tools face predictable risks[1]: models trained on incomplete or unrepresentative data can become biased[2], results can vary across population subgroups, outputs cannot be reproduced, and confidence in AI tools erodes among clinicians, programme managers and patients. Weak storage, versioning and governance structures make it difficult to track how data evolve, which version of a dataset was used to train a model, or how decisions about data access are made. These gaps directly undermine reliability, safety and fairness, three elements that are essential if AI-based diagnostic tools are to be useful in real-world health systems.
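One lightweight way to record which version of a dataset trained a model is to fingerprint each dataset snapshot with a content hash. The sketch below is purely illustrative; the field names (`patient_id`, `result`) are hypothetical, and a production system would pair such fingerprints with proper data-versioning tooling and access logs.

```python
import hashlib
import json

def dataset_fingerprint(records):
    """Compute a deterministic content hash for a dataset snapshot.

    Sorting both the records and the keys within each record makes the
    hash independent of row order and field order, so identical data
    always yields the same version identifier.
    """
    canonical = json.dumps(
        sorted(records, key=lambda r: json.dumps(r, sort_keys=True)),
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

# Illustrative records with hypothetical field names.
v1 = [{"patient_id": "a01", "result": "negative"},
      {"patient_id": "a02", "result": "positive"}]
v2 = v1 + [{"patient_id": "a03", "result": "negative"}]

print(dataset_fingerprint(v1))  # stable ID for the training snapshot
print(dataset_fingerprint(v2))  # differs, because the data changed
```

Storing the fingerprint alongside each trained model makes the link between model and training data auditable, which is the traceability gap the paragraph above describes.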
Therefore, a diagnostic data framework is essential, as AI-based tools cannot function effectively without good data. By establishing clear rules, technical practices and governance mechanisms, a diagnostic data framework ensures that AI-based diagnostics are not only technically robust but also equitable, trustworthy and sustainable across the health systems that need them most.
This Data Framework for AI-Based Diagnostics is designed to guide ministries of health, implementing partners and developers of digital health solutions by providing a comprehensive, structured blueprint to support the ethical, equitable and technically sound implementation of AI technologies in healthcare diagnostics. The framework addresses the complete data lifecycle, including data collection, annotation, validation, sharing, monitoring and reuse, while embedding the governing principles of privacy, interoperability and inclusivity.
While the framework is country-agnostic, it is designed to be localized and adapted to national regulations and ground realities. It can be adopted as a national reference architecture, or it can be used to stress-test discrete investments, such as a tuberculosis (TB) screening pilot study, against minimum requirements for lawful, ethical and operationally feasible data use. Thus, this Data Framework for AI-Based Diagnostics aligns with widely recognized global principles for trustworthy health data and responsible AI, including the World Health Organization’s guidance on the ethics and governance of AI for health, UNESCO’s Recommendation on the Ethics of Artificial Intelligence[3], and the OECD Recommendation on Health Data Governance (World Health Organization, 2021[4]; OECD, 2016[5]; OECD, 2022[6]).
This framework is rooted in global standards and principles, such as FAIR (Findable, Accessible, Interoperable and Reusable)[7], and incorporates ethical data governance norms, including data privacy, consent, equity and responsible reuse. While technically comprehensive, the framework is also grounded in a human-centred approach that prioritizes data equity, diversity and transparency.
The framework is built on six pillars that describe the lifecycle of health data, including data collection, management, annotation and monitoring. This applies to various types of health data, including electronic medical records, electronic health records, personal health records, laboratory results and genomics information. Each pillar operates under three guiding principles: ensuring equitable access, protecting privacy and maintaining a secure, ethical and scalable foundation. The framework provides guidance on:
Governing principles: covering secure data storage, infrastructure, governance and privacy, aligned with legal and ethical norms.
Data collection: promoting demographic and clinical diversity and inclusive data types, and establishing standardized processes for digitization, terminology and language localization.
Data cleaning and validation: implementing automated pipelines for data quality checks, error reporting and integration with health data systems[8].
Data annotation and structuring: enabling diverse machine-learning approaches through appropriate data structuring, such as the creation of interoperable formats, annotation tools and semantic mapping.
Data integration and harmonization: encouraging multimodal data merging using application programming interfaces (APIs) and the FAIR principles to improve data utility and comparability.
Data sharing and reuse: ensuring responsible access, licensing and version control to support open science while protecting sensitive information.
Continuous monitoring and feedback: tracking data drift, maintaining model performance and fostering feedback loops to support ongoing improvement.
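Two of the pillars above, data cleaning and validation and continuous monitoring, lend themselves to automated checks. The following minimal sketch shows one way such checks might look; the schema (`patient_id`, `test_type`, `result`) and the drift measure are illustrative assumptions, not part of the framework itself.

```python
from collections import Counter

# Hypothetical minimum schema for a diagnostic record.
REQUIRED_FIELDS = {"patient_id", "test_type", "result"}

def validate(record):
    """Return a list of quality issues for one record (empty means clean)."""
    issues = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - record.keys())]
    if record.get("result") not in {"positive", "negative", None}:
        issues.append(f"unexpected result value: {record.get('result')}")
    return issues

def label_drift(baseline, incoming):
    """Crude drift signal: absolute change in the positive-result share
    between a baseline batch and an incoming batch of records."""
    def positive_share(records):
        counts = Counter(r.get("result") for r in records)
        total = sum(counts.values()) or 1
        return counts["positive"] / total
    return abs(positive_share(incoming) - positive_share(baseline))

# Usage: flag a record with a missing field, then compare two batches.
print(validate({"patient_id": "a01", "test_type": "tb"}))
baseline = [{"result": "positive"}, {"result": "negative"}]
incoming = [{"result": "positive"}, {"result": "positive"}, {"result": "negative"}]
print(round(label_drift(baseline, incoming), 3))
```

In practice a pipeline like this would feed its error reports back to data collectors (closing the feedback loop the last pillar calls for) and trigger review when the drift signal crosses a locally agreed threshold.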
This framework serves as a strategic reference for shaping national and institutional approaches to AI in diagnostics. Although developed for AI-based diagnostics, its principles are broadly applicable across sectors and geographies, particularly in advancing data justice and digital health equity.