Diasorin - Advanced Clinical Trial Analysis

Diasorin uses machine learning and data augmentation techniques for advanced analysis of its clinical trial data, adopting an explainable AI approach

Solutions:

Data Platform - Data Products

Technologies:

Azure - Python

Business Summary

For more than 50 years, Diasorin has been developing, manufacturing and marketing laboratory diagnostic reagent kits worldwide through 45 companies, 4 divisions, 10 production sites and 9 research and development centres. This organisation enables Diasorin to have a broad offering of diagnostic tests and licensed technology solutions, made available through continuous investment in research.

The variety of its offering and its investment in research and innovation qualify Diasorin as the player with the widest range of specialised solutions in its sector and identify the Group as the ‘Diagnostics Specialist’.

Challenges

Technological and methodological standardisation of processes
Development of an advanced data analysis that could accurately predict relapse from a non-invasive diagnostic test
Development of a generic data model that could incorporate the results of several clinical trials carried out or commissioned by Diasorin

Diasorin’s objective was to analyse the data of a diagnostic clinical trial (already studied by Diasorin with traditional statistical methods) using advanced statistical and machine learning methods, in order to correlate the test results more robustly with the health status of the trial patients. The trial was a prospective longitudinal study conducted in Italy, France and Spain.

The primary objective was to correlate LIAISON® Calprotectin test measurements with quiescent ulcerative colitis (UC) or relapse as assessed by clinical data. Patients were evaluated every 3 months for 12 months and finally at 18 months after the start of the trial.

There were four main challenges in the analysis of this trial:

Effective data reading

The project needed a logical and relational model of clinical trial data that would allow standardised storage and efficient reading. The model had to store the data of the trial under analysis, but also had to be designed generically so that it could be reused in similar use cases.

Model selection

The task was to choose a machine learning model that would allow the data to be analysed in an optimal and replicable way. The model had to predict disease relapse taking into account all the patient features collected, and had to use the AUC (Area Under the Curve) statistic to validate the correlation.
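As a minimal illustration of the validation metric, the sketch below computes an AUC in plain Python, assuming that AUC here means the area under the ROC curve. It uses the rank interpretation: the probability that a randomly chosen relapse case receives a higher score than a randomly chosen quiescent case. The function name and the toy labels and scores are invented for illustration and are not trial data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts how often a positive (relapse) case outscores a negative
    (quiescent) case; ties count as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one case of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example (invented values): 1 = relapse, 0 = quiescent.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.35, 0.40, 0.20, 0.80, 0.30]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to a score that carries no information about relapse, while 1.0 means every relapse case is ranked above every quiescent case.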

Consolidating results

Consolidate any positive results obtained from the model (or emphasise as clearly as possible the lack of correlation between variables) by extending the dataset and simulating any stress points that the analysis would have had to withstand.

Model explanation

The model could not be a ‘black box’: it had to explain as clearly as possible why the correlations between the variables were so strong, which variables weighed most heavily in the final result, and what the impact of excluding any variable from the model would be.

Solution

In order to achieve all four goals that Diasorin had in mind, Quantyca adopted a strategy that would allow the problem to be assessed from an experimental and specific perspective for the submitted use case, but at the same time could serve as a basis for the construction of a cloud data lab that would remain for Diasorin even after the project was completed.

Model selection

The Diasorin team had already carried out a statistical analysis of the trial results. The aim was to use less conventional methods to investigate the result obtained and to explore ways of improving it, in order to understand the effectiveness of the diagnostic method from both a scientific and a business point of view.

We first organised the dataset for automatic reading, then analysed it and added calculated columns that could reveal interesting and relevant correlations.

We then developed models with several of the available algorithms (logistic regression, random forest, XGBoost…), weighing their pros and cons from a statistical point of view, in relation to the cardinality of the database, and from a technical and maintenance point of view.

Stress test

For Diasorin and the scientific team behind the clinical trial, it was essential to extract as much value as possible from the analysis of the results and to take into account any ‘confounders’ that might affect the AUC obtained. Therefore, once the model had been chosen and the main analyses conducted on the cleaned and enriched dataset, we supplemented the patient database with synthetic data (using the SMOTE and ADASYN methods) to make sure that the optimal AUC obtained did not fluctuate with the cardinality of the input data.

We analysed the distribution of the AUCs obtained with the various synthetic data injections (varying both the method and the cardinality) using several statistical measures of dispersion. The analysis showed excellent stability of the results, making the analysis robust and the diagnostic trial highly meaningful.
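The idea behind this stress test can be sketched in a few lines of plain Python. The example below is a simplified, SMOTE-style oversampler (interpolation between a minority sample and one of its nearest minority neighbours) written for illustration; it is not the SMOTE or ADASYN implementation used in the project. It also scores patients with a fixed rule (the first feature) instead of retraining a model, and all data are randomly generated toy values, not trial data.

```python
import random
import statistics


def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def smote_like(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points by interpolating between a random
    minority point and one of its k nearest minority neighbours."""
    if rng is None:
        rng = random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, mate)))
    return synthetic


# Toy data (randomly generated): two features per patient,
# the first one playing the role of the biomarker score.
data_rng = random.Random(42)
quiescent = [(data_rng.gauss(50, 15), data_rng.gauss(0, 1)) for _ in range(40)]
relapse = [(data_rng.gauss(120, 30), data_rng.gauss(0, 1)) for _ in range(10)]


def auc_with_injection(n_new, seed):
    """AUC of the fixed scoring rule after adding n_new synthetic relapses."""
    extra = smote_like(relapse, n_new, rng=random.Random(seed))
    points = quiescent + relapse + extra
    labels = [0] * len(quiescent) + [1] * (len(relapse) + len(extra))
    scores = [p[0] for p in points]
    return roc_auc(labels, scores)


# Vary both the amount of synthetic data and the random seed,
# then summarise how much the AUC moves.
aucs = [auc_with_injection(n, seed) for n in (0, 10, 20, 30) for seed in range(5)]
spread = statistics.pstdev(aucs)
print(f"mean AUC = {statistics.mean(aucs):.3f}, std dev = {spread:.4f}")
```

A small standard deviation relative to the mean AUC is one possible reading of "stable"; other dispersion measures, such as the interquartile range, could be substituted in the last step.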

Explaining the results

Several analyses were implemented with the SHAP library, supported by graphs and visual tools, to explain the level of correlation and importance that each feature used in the logistic analysis brought to the model. Dashboards supporting the explanatory analysis were then built in Python to make the results more readable and comprehensible even to non-technical users. Thanks to these tools, the discussion of the model was brought to the team of doctors, biologists and researchers who had conducted the study, so that the weight of each variable could also be assessed from a scientific point of view, in an agile way that cut across the team’s skills.
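To give an idea of what such an explanation computes, the sketch below calculates exact Shapley values for one patient by averaging each feature's marginal contribution over all feature orderings. It is a from-scratch illustration of the concept behind SHAP, not the SHAP library's API, and the linear log-odds model, its weights, and the patient and baseline values are all hypothetical.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x relative to a baseline.

    Features are switched from their baseline value to the patient's
    value one at a time, in every possible order, and each feature is
    credited with the average change it causes in the prediction.
    """
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = x[i]
            updated = predict(current)
            phi[i] += updated - previous
            previous = updated
    return [total / len(orderings) for total in phi]


# Hypothetical linear model on the log-odds scale (values invented).
WEIGHTS = [0.02, 0.5, -0.3]
BIAS = -3.0


def log_odds(features):
    return BIAS + sum(w * v for w, v in zip(WEIGHTS, features))


patient = [150.0, 1.0, 2.0]   # hypothetical feature values
baseline = [60.0, 0.0, 1.0]   # hypothetical reference patient

phi = shapley_values(log_odds, patient, baseline)
for index, value in enumerate(phi):
    print(f"feature {index}: contribution {value:+.3f} to the log-odds")
```

The contributions always add up to the difference between the patient's prediction and the baseline prediction, which is what makes them easy to discuss with non-technical readers. Exhaustive enumeration only scales to a handful of features, which is one reason dedicated libraries rely on faster, model-specific or approximate algorithms.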

Datalab, DB and MLOps

To work optimally with the available data, a cloud architecture on Azure was devised that would support this use case and ultimately remain in the Diasorin IT estate.

Within this architecture, we also set up a SQL Server database and created a data model in it to accommodate clinical trial data efficiently, with entities and tables that correspond to the language shared and approved by the various public drug agencies.
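A generic relational model of this kind could look roughly like the sketch below. It uses Python's built-in sqlite3 module as a stand-in for SQL Server so that it runs anywhere, and every table name, column name and value is a hypothetical placeholder rather than the schema actually delivered.

```python
import sqlite3

# Hypothetical, deliberately small schema: one row per trial, subject,
# visit and measurement, so that further trials can reuse the same tables.
SCHEMA = """
CREATE TABLE trial (
    trial_id   TEXT PRIMARY KEY,
    title      TEXT NOT NULL
);
CREATE TABLE subject (
    subject_id TEXT PRIMARY KEY,
    trial_id   TEXT NOT NULL REFERENCES trial(trial_id),
    country    TEXT
);
CREATE TABLE visit (
    visit_id   INTEGER PRIMARY KEY,
    subject_id TEXT NOT NULL REFERENCES subject(subject_id),
    month      INTEGER NOT NULL,
    relapse    INTEGER NOT NULL CHECK (relapse IN (0, 1))
);
CREATE TABLE measurement (
    visit_id   INTEGER NOT NULL REFERENCES visit(visit_id),
    test_code  TEXT NOT NULL,
    value      REAL NOT NULL,
    unit       TEXT NOT NULL,
    PRIMARY KEY (visit_id, test_code)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the references above
conn.executescript(SCHEMA)

# Invented example rows, for illustration only.
conn.execute("INSERT INTO trial VALUES ('T1', 'Example trial')")
conn.execute("INSERT INTO subject VALUES ('S1', 'T1', 'IT')")
conn.execute("INSERT INTO visit VALUES (1, 'S1', 0, 0)")
conn.execute("INSERT INTO visit VALUES (2, 'S1', 3, 1)")
conn.executemany(
    "INSERT INTO measurement VALUES (?, ?, ?, ?)",
    [(1, "CALPRO", 40.0, "ug/g"), (2, "CALPRO", 210.0, "ug/g")],
)

# One query returns an analysis-ready table: a row per subject and visit
# with the test value and the clinical outcome side by side.
rows = conn.execute(
    """
    SELECT s.subject_id, v.month, m.value, v.relapse
    FROM measurement AS m
    JOIN visit AS v ON v.visit_id = m.visit_id
    JOIN subject AS s ON s.subject_id = v.subject_id
    WHERE m.test_code = 'CALPRO'
    ORDER BY s.subject_id, v.month
    """
).fetchall()
print(rows)
```

Keeping measurements in a long format (one row per visit and test code) is one way to let new tests or new trials be added without changing the table structure.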

Furthermore, in order to benefit from the results of the model and to start building a deeper understanding of these analysis methods within Diasorin, we developed a web app that runs model inference: the user enters the patient data required and receives a prediction of disease relapse.
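The core of such an inference endpoint can be sketched as a single function that validates the submitted patient data and returns a relapse probability. The example below is only an illustration of the idea: the field names, coefficients and intercept are hypothetical, and it uses a plain logistic formula instead of loading the trained model.

```python
import json
import math

# Hypothetical inputs and coefficients, for illustration only.
REQUIRED_FIELDS = ("calprotectin", "age")
COEFFICIENTS = {"calprotectin": 0.02, "age": -0.01}
INTERCEPT = -2.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_relapse(payload):
    """Validate a JSON payload and return a relapse probability."""
    data = json.loads(payload)
    missing = [f for f in REQUIRED_FIELDS if f not in data]
    if missing:
        return {"ok": False, "error": "missing fields: " + ", ".join(missing)}
    try:
        values = {f: float(data[f]) for f in REQUIRED_FIELDS}
    except (TypeError, ValueError):
        return {"ok": False, "error": "all fields must be numeric"}
    z = INTERCEPT + sum(COEFFICIENTS[f] * values[f] for f in REQUIRED_FIELDS)
    return {"ok": True, "relapse_probability": sigmoid(z)}


result = predict_relapse('{"calprotectin": 180, "age": 45}')
print(result)
```

Returning a structured error instead of raising keeps the user-facing form simple: the front end can show the message next to the missing or invalid field.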

Results

Diasorin achieved several objectives:

Validation and consolidation of an important and critical clinical trial using innovative analysis methods and tools
Recognition of the validity of the trial and the method of analysis through the publication of an article in the United European Gastroenterology Journal
Efficiency of data management and interpretation of model results
Cloud architecture to handle data and models across different use cases

Resources

Whitepaper

Free

01/10/2022

LIAISON® Calprotectin for the prediction of relapse in quiescent ulcerative colitis: The EuReCa study
