Artificial intelligence and rare diseases: when data is scarce, technology innovates

The term “big data” is by now inextricably associated with artificial intelligence in medicine. But what does it mean exactly? As Prof. Diciotti clarified, the concept has a dual meaning: on one hand it refers to the large number of patients studied (the size of the sample), on the other to the multiplicity of information that can be collected on each patient — information of different kinds coming from different disciplines such as clinical medicine, genetics, imaging and other sources.
In the case of rare diseases, however, a paradox emerges: the application of AI becomes particularly complex precisely because of the scarcity of data. It is difficult, by definition, to obtain true “big data” when studying conditions that affect only a small number of patients, yet this data is essential for artificial intelligence models to function effectively.
To address this challenge, Prof. Diciotti shared the experience gained within an international network dedicated to spinocerebellar ataxias, a group of rare neurological diseases. The network involves research centers in Europe, the United States, Mexico, Australia and other countries, and has made it possible to collect data on over 300 patients with Friedreich’s ataxia and several hundred patients with other spinocerebellar ataxias, in addition to a comparable number of control subjects. A multicentre study of this scale is essential to reach sample sizes large enough for statistically meaningful analysis.
First challenge: the comparability of data between different centers
The first concrete problem concerns the comparability of data collected at different centers. To illustrate the issue, Prof. Diciotti used an illuminating example: “Let us think about body weight: it is an important clinical parameter. If I want to know how much I weigh, I use a scale. If I want to compare the weight of patients between the network’s centers, I can be confident: every scale is calibrated, zeroed before measurement, so the data are comparable.”
But what happens when an MRI scanner is used instead of a scale? An MRI scanner is not an easy instrument to calibrate. If one center uses a Siemens machine and another uses a GE device, the results can vary significantly. It is obviously not feasible to have each subject travel from center to center to be scanned on every available machine.
This is where statistical methods for post-acquisition harmonisation come in: techniques that make data acquired with different instruments comparable. However, these methods create new problems when one wants to use artificial intelligence, because the harmonisation process can interfere with the very characteristics from which AI models need to learn.
This problem was addressed by Prof. Diciotti’s research group in a recent study published in the journal Scientific Data, based on data from 1,700 subjects from 36 different centers. The study explored methodologies for balancing the need for harmonisation with the need to preserve information relevant to the training of artificial intelligence models.
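To make the idea of post-acquisition harmonisation concrete, here is a minimal sketch in Python. It is not the method used in the study: it assumes a simple location-scale adjustment in which each site's measurements are re-centred and re-scaled to the pooled mean and spread. The function name harmonise_by_site, the scanner labels and the numbers are all hypothetical.

```python
from statistics import mean, pstdev

def harmonise_by_site(values, sites):
    """Rescale each site's measurements to the pooled mean and spread.

    Simplified location-scale adjustment (an assumption for illustration,
    not the procedure from the study described in the article).
    """
    pooled_mean = mean(values)
    pooled_sd = pstdev(values)

    # Group the measurements by acquisition site.
    by_site = {}
    for value, site in zip(values, sites):
        by_site.setdefault(site, []).append(value)
    site_stats = {s: (mean(vs), pstdev(vs)) for s, vs in by_site.items()}

    harmonised = []
    for value, site in zip(values, sites):
        site_mean, site_sd = site_stats[site]
        # Standardise within the site; guard against a site with zero spread.
        z = (value - site_mean) / site_sd if site_sd > 0 else 0.0
        harmonised.append(pooled_mean + z * pooled_sd)
    return harmonised

# Hypothetical readings of the same quantity from two scanners with an offset.
values = [100.0, 102.0, 104.0, 110.0, 112.0, 114.0]
sites = ["scanner_A"] * 3 + ["scanner_B"] * 3
harmonised = harmonise_by_site(values, sites)
print(harmonised)
```

After the adjustment both sites share the same mean, so the scanner offset is gone. The sketch also shows the risk mentioned above: if the difference between sites were partly due to real differences between patients, this adjustment would remove that signal as well.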
Second challenge: privacy and federated learning
The second issue concerns the sharing of clinical data between different centers. Privacy can become a significant obstacle: some centers cannot transfer patient data outside their own site due to regulatory, ethical or security constraints.
The proposed solution is federated learning — an innovative approach that revolutionises the way AI models are trained in distributed contexts. As Prof. Diciotti explained, “in this approach, each center keeps its own data locally, trains an AI model on its own patients and shares only the ‘trained’ model — that is, the model weights — not the data.”
The mechanism works through an iterative process: locally trained models are aggregated into a common model, which is then redistributed to all centers and further refined. This method makes it possible to build powerful AI models without ever moving sensitive patient data from their original sites.
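The iterative process described above can be sketched in a few lines of Python. This is a toy illustration only, assuming a one-parameter model y = w * x and an aggregation step that averages the local weights in proportion to each center's sample count (the common "federated averaging" scheme). The center names, the data and the function names local_train and federated_average are hypothetical.

```python
def local_train(w, data, lr=0.01, epochs=20):
    """Refine the shared weight w on one center's local data (model: y = w * x)."""
    n = len(data)
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / n
        w -= lr * grad
    return w

def federated_average(local_weights, sample_counts):
    """Aggregate local weights, weighting each center by its number of samples."""
    total = sum(sample_counts)
    return sum(w * n for w, n in zip(local_weights, sample_counts)) / total

# Hypothetical local datasets; they never leave their center.
centers = {
    "center_1": [(x, 2.0 * x) for x in (0.0, 1.0, 2.0)],
    "center_2": [(x, 2.0 * x) for x in (3.0, 4.0)],
    "center_3": [(x, 2.0 * x) for x in (1.5, 2.5, 3.5, 4.5)],
}

global_w = 0.0
for _ in range(20):
    # Each center starts from the current common model and trains locally.
    local_weights = [local_train(global_w, data) for data in centers.values()]
    counts = [len(data) for data in centers.values()]
    # Only the weights are shared and aggregated, not the patient data.
    global_w = federated_average(local_weights, counts)

print(global_w)
```

In a real deployment the model would have many parameters and the exchange would go through a secure coordination server, but the loop has the same shape: train locally, aggregate, redistribute, repeat.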
For rare diseases, this approach is particularly valuable: it prevents centers from dropping out of the network because of privacy constraints, a loss that must be avoided at all costs in rare conditions, where every patient counts.
Third frontier: synthetic data
The final innovation presented concerns the generation of synthetic data — that is, artificial but extremely realistic data. Even using federated learning, the available data may not be sufficient to effectively train AI models.
Prof. Diciotti showed a particularly effective visual example: two images of human faces, one real and one artificially generated by an AI system. The synthetic image is so realistic that telling the two apart with the naked eye is practically impossible.
The same principle applies to biomedical images. The research group has developed models capable of generating fictitious brain MRI images that are practically indistinguishable from real ones. This technology opens enormous possibilities: large datasets can be created to train AI models without violating patient privacy and with a high degree of control over the quality and characteristics of the generated data.
Synthetic data can be customised to represent specific scenarios, increase dataset variability, balance underrepresented classes or simulate rare pathological conditions for which real data is extremely scarce. This technology represents a true revolution for rare disease research.
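As a very rough illustration of balancing an underrepresented class, the Python sketch below creates synthetic samples by interpolating between pairs of real ones. This is a deliberately simple stand-in, loosely inspired by oversampling techniques such as SMOTE, and not the generative image models developed by the research group. The two-feature data, the class sizes and the function name synthesise_minority are hypothetical.

```python
import random

def synthesise_minority(samples, n_new, rng):
    """Create n_new synthetic points by interpolating between random pairs of real samples."""
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(samples, 2)   # two distinct real samples
        t = rng.random()                # position along the segment from a to b
        synthetic.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical two-feature dataset: a well-represented class and a rare one.
common = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(50)]
rare = [(3.0, 3.2), (2.8, 3.5), (3.4, 2.9), (3.1, 3.0)]

# Generate enough synthetic rare samples to match the common class.
synthetic_rare = synthesise_minority(rare, len(common) - len(rare), rng)
balanced_rare = rare + synthetic_rare
print(len(common), len(balanced_rare))
```

Interpolated points never leave the region spanned by the real samples, which keeps them plausible but also limits the variability they add. Generative models aim to go further by learning the underlying distribution of the data.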
Three directions for the future of Precision Medicine
As Prof. Diciotti concluded, these three stories are only examples, but they clearly show the strategic directions in which research on artificial intelligence in medicine is heading:
Data harmonisation — Developing increasingly sophisticated methods to make data from different instruments and protocols comparable, without losing the information essential for AI.
Privacy protection with distributed AI — Implementing federated learning and other distributed learning techniques on a large scale, making it possible to build powerful models while respecting regulatory constraints and protecting patient privacy.
Creation of synthetic data — Refining artificial biomedical data generation technologies to increase the availability of training datasets, especially for rare conditions where real data is scarce.
AI and HEAL ITALIA: a strategic alliance
Prof. Diciotti’s presentation highlighted how artificial intelligence is not merely a technological tool, but an essential enabling element for Precision Medicine in rare diseases. The solutions presented — data harmonisation, federated learning and synthetic data — are already available and operational; they do not represent futuristic visions but concrete tools that can be implemented today.
In the context of the HEAL ITALIA project, these technologies take on even greater value. The national network of Precision Medicine Centers distributed across the territory can benefit enormously from federated learning to collaborate while keeping data locally. The generation of synthetic data can amplify the capacity for study even for ultra-rare diseases with very few diagnosed patients. Data harmonisation makes it possible to integrate information from different centers with different technologies, maximising the value of every single piece of data collected.
As Prof. Diciotti emphasised, these tools “will also be fundamental in the near future for effectively applying artificial intelligence to Precision Medicine, especially in the field of rare diseases.” A future in which technology does not replace but enhances the human capacity to understand, diagnose and treat even the rarest and most complex conditions, transforming data scarcity from an insurmountable obstacle into a solvable technological challenge.



