A new algorithm can diagnose allergy in children with the help of artificial intelligence (AI), using only three DNA markers. The algorithm was developed in a cross-border collaboration between data scientists at MIcompany and researchers from the University Medical Center Groningen (UMCG). It proved accurate across different populations and is the result of a large-scale study involving six unique datasets from around the world. The results, published recently in Nature Communications, contribute to a better understanding of these complex diseases and offer opportunities for innovative diagnostics in the future.
Allergies burden quality of life and are difficult to diagnose in young children
Allergic diseases, such as asthma, eczema, or hay fever, are very common childhood diseases worldwide that place a significant burden on patients’ quality of life and on healthcare systems. The prevalence of these diseases has been increasing rapidly for more than 50 years. Researchers expect half of the European population to suffer from an allergic disorder by 2030. Although genetic and environmental factors are known to play a major role in their development, the exact mechanisms are still unknown. As a result, allergies remain chronic diseases for which no permanent cure is currently available. However, with early diagnosis, preventive treatments and medicines can be provided. Prof Gerard Koppelman, pediatric pulmonologist at UMCG and co-initiator of the project, explains: “Young children often suffer from brief illnesses in which the symptoms may resemble an allergic condition, for example attacks of shortness of breath or frequent colds.”
A massive and growing amount of data is available in the biomedical field
Over the past decade, the amount of human DNA data has doubled every seven months. Marnix Bügel, founding partner of MIcompany and co-initiator of the project: “DNA data is one of the biggest and fastest growing data sources in the world. For our project, we used data from different layers of the human genome, called multi-omics, which offers a new and unprecedented level of insight into diseases.” Merlijn van Breugel, the Lead Data Scientist at MIcompany who developed the algorithm, adds: “The size and complexity of these data are beyond what I had ever seen. We literally had millions of data points per child.” The Groningen Research Institute for Asthma and COPD (GRIAC) has a birth cohort dataset which was used for the project. This dataset is unique in the world and includes data about these different omics layers for a large number of children. Before initial data filtering, over 2.8 million genetic and 435 thousand epigenetic markers from both blood and nasal cells were available. As candidate features for the modeling approach, 136 genetic markers, 353 epigenetic markers, and a range of environmental factors were used. Additionally, five genetic risk indicators were engineered based on over 6000 DNA positions.
Six AI algorithms were trained and optimized for best predictive performance
To use all this data for allergy prediction in children, six machine learning algorithms were assessed, including XGBoost, Support Vector Machines and Elastic Net (Figure 1). The models were optimized via grid search hyperparameter tuning and a step-wise feature selection approach. To overcome the imbalance in the data (allergy prevalence was just 20%), oversampling techniques were assessed to improve model performance further. Surprisingly, the final model only required 3 nasal CpG sites to accurately predict allergy, with an area under the curve (AUC) of 0.86. The model was validated using a repeated cross-validation setup. “This is a great example where less can be more”, Merlijn adds. “A simple, or parsimonious, model can be more robust with lower overfit and easier to interpret, while still delivering high accuracy.” By comparison, a model from an earlier study, which selected 30 CpG sites rather than 3, performed worse on the same dataset. Further analysis of the 3 identified sites showed that they are related to over-expression of certain genes in immune cells.
Figure 1. Performance comparison of different algorithms
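For readers who want to see how oversampling and repeated cross-validation fit together, here is a minimal sketch in plain Python. It is not the study's code: the article does not say which oversampling technique, how many folds, or how many repeats were used, so random oversampling, 5 folds, 3 repeats and all function names below are assumptions chosen for illustration. The point it shows is that only the training part of each split is rebalanced, so the held-out part keeps the original 20% prevalence.

```python
import random


def stratified_folds(labels, k, rng):
    """Split sample indices into k folds with roughly equal class proportions."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin over the folds.
        for position, i in enumerate(idx):
            folds[position % k].append(i)
    return folds


def oversample(train_idx, labels, rng):
    """Randomly duplicate minority-class indices until both classes are equal in size."""
    pos = [i for i in train_idx if labels[i] == 1]
    neg = [i for i in train_idx if labels[i] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        raise ValueError("training split must contain both classes")
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return train_idx + extra


def repeated_cv_splits(labels, k=5, repeats=3, seed=0):
    """Yield (balanced_train_idx, test_idx) for every fold of every repeat.

    k, repeats and seed are illustrative defaults, not values from the study.
    """
    rng = random.Random(seed)
    for _ in range(repeats):
        folds = stratified_folds(labels, k, rng)
        for f in range(k):
            test_idx = folds[f]
            train_idx = [i for g in range(k) if g != f for i in folds[g]]
            # Rebalance the training part only; the test part stays untouched.
            yield oversample(train_idx, labels, rng), test_idx


# Toy labels with 20% prevalence (1 = allergic, 0 = not allergic); made-up data.
labels = [1] * 20 + [0] * 80
splits = list(repeated_cv_splits(labels))
```

In a real pipeline, a model would be fitted on each balanced training set and scored on the matching test set, and the scores would be averaged over all folds and repeats.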
The algorithm was validated on the other side of the globe
The algorithm also works well in children in other populations, with a similar AUC value of 0.82 (Figure 2). The algorithm accurately diagnosed allergic diseases in an independent Puerto Rican cohort. This indicates that the algorithm indeed captures general biological signals present in other ethnic groups. This type of external validation is the gold standard in medical research to test whether the findings are reliable. However, the current algorithm was developed for 16-year-olds, and the researchers found that it is less accurate in two cohorts of 6-year-old children. Koppelman: “Although this discovery is an important step forward in the application of artificial intelligence to diagnose allergy, we need to calibrate our algorithm for the younger age group in the future.” After the publication, Koppelman was interviewed by RTV Noord to share the relevance of this study. This interview (in Dutch) can be listened to here.
Figure 2. Performance of the algorithm on 3 replication cohorts, as shown in the Nature Communications publication.
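The AUC values quoted above have a simple interpretation: the AUC is the probability that a randomly chosen allergic child receives a higher model score than a randomly chosen non-allergic child, with ties counted as half. The sketch below computes it directly from that definition. The function name and the toy scores are made up for illustration and are not data from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of cases and controls."""
    case_scores = [s for y, s in zip(labels, scores) if y == 1]
    control_scores = [s for y, s in zip(labels, scores) if y == 0]
    if not case_scores or not control_scores:
        raise ValueError("both classes must be present to compute an AUC")
    wins = 0.0
    for c in case_scores:
        for n in control_scores:
            if c > n:
                wins += 1.0   # case ranked above control
            elif c == n:
                wins += 0.5   # ties count as half
    return wins / (len(case_scores) * len(control_scores))


# Toy example: 2 cases, 3 controls; 5 of the 6 case-control pairs are ordered correctly.
toy_labels = [1, 1, 0, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.3, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)  # 5/6, roughly 0.83
```

External validation means applying the frozen model to a cohort it has never seen and computing this same quantity on that cohort's scores. An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation.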
Artificial intelligence in complex diseases
In 2019, UMCG and MIcompany joined forces to conduct research by applying the latest techniques in artificial intelligence to complex, biomedical problems. Initiated by Gerard Koppelman and Marnix Bügel, the new algorithm was developed by a joint research team as part of this public-private partnership. The combination of expertise was key to the success of this study: artificial intelligence enables researchers to analyze large and complex data sets in a new way, and a deep understanding of such data and the underlying biology is crucial to reach meaningful conclusions.
For almost three years, we have been closely collaborating with our colleagues at UMCG to apply AI in medical research. On that account, we were recently asked to share our perspective on a Nature article about SARS-CoV-2 (by Stukalov et al., 2021) for the News & Views section of Nature Immunology.