Work packages

  1. Data extraction system
  2. Clinical trial infrastructure
  3. Predictive models in lung cancer
  4. Simulation of new treatments via in-silico trials
  5. Implementation

1. Data extraction system

The objective is to create an innovative IT framework that leverages data (medical, imaging, treatment, biological, lab, genomic, psycho-social) already being collected in the local databases of multiple institutions, and that makes this data available without imposing any requirements on what data is collected locally or how it is collected.
The data extraction system needs to be flexible enough to cope with all the differences in local data availability, structure, type and meaning, while at the same time presenting the data in a coherent model to facilitate analysis. For this, a common metadata model has already been developed in preliminary work; it is disease and cohort based and allows variation in data collection between diseases and institutions.
The next step is to mirror the institution's data into the local euroCAT database, which can connect to different data systems and will automatically collect data from standard databases and protocols (DICOM, HL7, etc.). The system will host the patient data as a mirror of the institution. Site-specific tools that expose the data at the euroCAT meta-levels have been developed for MAASTRO in preliminary work but will need to be customized for each participating centre. For example, different cohorts of patients can be defined based on different sets of definitions for each site. Tools developed in preliminary work that can handle missing data and perform advanced data interpretation (e.g. analysis of free text fields, see figure below) will also be used in the mirror process. Finally, data integrity checks (e.g. a weight of 120 kg is not compatible with a Body Mass Index of 20) will be applied in the euroCAT system, ensuring proper data quality before the data is made available.
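The weight and Body Mass Index example above can be made concrete with a small cross-field check. The following is a minimal sketch only: the record layout, the field names and the tolerance of 1 BMI unit are assumptions made for illustration and are not taken from the euroCAT data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PatientRecord:
    # Hypothetical field names; the actual euroCAT data model may differ.
    patient_id: str
    weight_kg: Optional[float]
    height_m: Optional[float]
    bmi: Optional[float]


def check_bmi_consistency(record: PatientRecord, tolerance: float = 1.0) -> List[str]:
    """Return a list of human-readable integrity issues for one record."""
    issues: List[str] = []
    if record.weight_kg is None or record.height_m is None or record.bmi is None:
        # Missing values cannot be cross-checked here; report nothing.
        return issues
    if record.height_m <= 0:
        issues.append("height must be positive")
        return issues
    # BMI is weight in kilograms divided by the square of height in metres.
    expected_bmi = record.weight_kg / record.height_m ** 2
    if abs(expected_bmi - record.bmi) > tolerance:
        issues.append(
            f"recorded BMI {record.bmi:.1f} does not match "
            f"BMI {expected_bmi:.1f} computed from weight and height"
        )
    return issues
```

A record with a weight of 120 kg, a height of 1.80 m and a recorded BMI of 20 would be flagged, because the BMI computed from weight and height is about 37.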


Developing and managing the data in this virtual hierarchy also makes it possible to develop privacy-preserving data mining algorithms that can be executed at various levels of the hierarchy, without any residual risk of data privacy non-conformance. To enhance security and help satisfy privacy regulations, the local euroCAT database will reside and remain within the institution. Furthermore, the euroCAT database simply holds a copy of data already present within the institution and does not interfere with the clinical process.

2. Clinical trial infrastructure

When the objective of action 1 is met, the data will be available in the euroCAT data model and local euroCAT database at each institution. However, to address the objectives of this proposal, it is necessary that the data is shared between institutions. This will enable data to be used for both development and validation of prediction models, and support the development of new applications such as in-silico clinical trials. The approach taken in this action is the development of a network that connects the local euroCAT databases, together with a unified interface that gives access to the data in a way that preserves patient privacy. One approach that will be considered is a GRID-based network. This would build upon pioneering work in this area in the Health-e-Child project, which is sponsored by the EU 6th Framework and led by Siemens (one of the partners in this proposal). The progress of the Dutch initiative Parelsnoer will also be closely monitored.

First, detailed requirements will be elicited from future users, as to how they would like to browse, query and get access to the data. From initial elicitation and the preliminary work of the project group, a definite requirement is to simply be able to query how many patients with certain characteristics are available in the database at a given institution. As shown below (see action 4), a prototype of a web-based interface that can perform these basic tasks is already in place in MAASTRO.
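The MAASTRO prototype itself is not reproduced here. Purely as an illustration of the basic requirement (counting patients with certain characteristics at one institution), the sketch below runs a parameterized count query against an in-memory SQLite database. The table name, column names and the four demo rows are invented for this example and do not reflect the euroCAT schema or real patient data.

```python
import sqlite3


def build_demo_database() -> sqlite3.Connection:
    """Create an in-memory database with a few synthetic rows (not real data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE patient ("
        "id INTEGER PRIMARY KEY, age INTEGER, tumour_stage TEXT)"
    )
    conn.executemany(
        "INSERT INTO patient (age, tumour_stage) VALUES (?, ?)",
        [(71, "III"), (64, "III"), (58, "II"), (80, "III")],
    )
    conn.commit()
    return conn


def count_patients(conn: sqlite3.Connection, stage: str, min_age: int) -> int:
    """Return only the number of matching patients, never the rows themselves."""
    row = conn.execute(
        "SELECT COUNT(*) FROM patient WHERE tumour_stage = ? AND age >= ?",
        (stage, min_age),
    ).fetchone()
    return row[0]
```

Returning a count rather than patient rows keeps the answer at the aggregate level, which is in line with the privacy-preserving intent of the interface.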

This interface can be used not only to determine whether sufficient high-quality data is available for machine learning, but could also be extended to support setting up clinical trials, in which it is important to have some certainty about the expected inclusion of patients in participating centres. The project group has active members in the EHR workshops that specifically try to link hospital data to clinical research to improve the efficacy of trials. Another known requirement is to be able to apply machine learning across institutions, i.e. to have a system that allows the data to be mined as one database. For the latter, the GRID will allow mining applications to be deployed across it, but will not allow sharing of data, nor will data be collected at a central point, which differentiates the current project from other similar network projects. Although properly de-identified data can be shared under Dutch law, we feel that keeping data within the hospital of origin will address many privacy concerns (e.g. traceability), which will make the system expandable to other countries and continents.
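One way in which data could be mined "as one database" without leaving the institutions is for each site to compute a summary on its own data and return only that summary. The sketch below illustrates this idea for a logistic regression model trained by gradient descent. It is an assumption made for illustration, not the design chosen by the project: the function names, learning rate and number of rounds are hypothetical, and a real deployment would replace the inner loop over sites with remote calls.

```python
import math
from typing import List, Sequence, Tuple

Row = Tuple[Sequence[float], int]  # (feature values, binary outcome)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def local_gradient(weights: List[float], rows: List[Row]) -> Tuple[List[float], int]:
    """Runs inside one institution; returns only a summed gradient and a count."""
    grad = [0.0] * len(weights)
    for features, outcome in rows:
        pred = sigmoid(sum(w * x for w, x in zip(weights, features)))
        for j, x in enumerate(features):
            grad[j] += (pred - outcome) * x
    return grad, len(rows)


def train_across_sites(sites: List[List[Row]], n_features: int,
                       learning_rate: float = 0.5, rounds: int = 200) -> List[float]:
    """Coordinator: combines per-site gradients without seeing patient rows."""
    weights = [0.0] * n_features
    for _ in range(rounds):
        total_grad = [0.0] * n_features
        total_n = 0
        for rows in sites:  # stands in for a remote call to each institution
            grad, n = local_gradient(weights, rows)
            total_grad = [a + b for a, b in zip(total_grad, grad)]
            total_n += n
        weights = [w - learning_rate * g / total_n
                   for w, g in zip(weights, total_grad)]
    return weights
```

Because the summed gradients of the sites equal the gradient on the pooled data, this procedure yields the same model (up to floating-point rounding) as training on one central database, while only aggregate numbers cross institutional boundaries.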

3. Predictive models in lung cancer

The abundance of new medical information at the molecular, organ and system level, and the trend to record healthcare data electronically, have opened the way to innovative bio-informatics approaches for gaining new knowledge on diseases and their management, i.e. predictive models. In this action we will develop predictive models that use data from multiple sources, because our previous work has shown that models based on a combination of imaging, clinical and biological data improve the prediction of survival, radiation-induced lung damage and esophagitis. Adding data acquired during treatment, such as FDG uptake and neutropenia, further increases the accuracy of the prediction. We have used both classical statistical models and more advanced machine learning techniques such as support vector machines. As we will have access to a large amount of data in this project, we will focus on machine-learned models, as we have found that the number of patients is at the moment the limiting factor in the accuracy (AUC) of our current models. The following figure shows the predictive model performance in terms of AUC for different numbers of patients available in the model building process; the trend analysis indicates that reaching an AUC of 0.85 or above probably requires at least 1000 patients.

Whereas diagnostic models are usually used for classification, predictive models incorporate the dimension of time, adding a stochastic element. Therefore, predictive models for a certain treatment should not rely solely on Discrimination (C-statistic or AUC), the ability to separate those with various disease states, but should also assess Calibration (the Hosmer-Lemeshow goodness-of-fit test), the ability to correctly estimate the risk of a future event that has not yet occurred. Risk classification (the Net Reclassification Index, NRI) can aid in comparing the clinical impact of various models on risk for the individual.
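To make the distinction between discrimination and calibration concrete, the sketch below computes both measures in plain Python. It is a minimal illustration under stated assumptions: the function names are hypothetical, the AUC is computed by pairwise comparison (suitable only for small data sets), and the Hosmer-Lemeshow function returns the chi-square statistic only, which would conventionally be compared with a chi-square distribution with (number of groups minus 2) degrees of freedom.

```python
from typing import Sequence


def auc(scores: Sequence[float], outcomes: Sequence[int]) -> float:
    """Probability that a random event case scores higher than a random non-event case."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def hosmer_lemeshow(probs: Sequence[float], outcomes: Sequence[int],
                    groups: int = 10) -> float:
    """Hosmer-Lemeshow chi-square statistic over groups ordered by predicted risk."""
    ordered = sorted(zip(probs, outcomes))
    n = len(ordered)
    statistic = 0.0
    for g in range(groups):
        chunk = ordered[(g * n) // groups:((g + 1) * n) // groups]
        if not chunk:
            continue
        size = len(chunk)
        expected = sum(p for p, _ in chunk)  # expected number of events
        observed = sum(y for _, y in chunk)  # observed number of events
        mean_p = expected / size
        if 0.0 < mean_p < 1.0:
            statistic += (observed - expected) ** 2 / (size * mean_p * (1.0 - mean_p))
    return statistic
```

A model can rank patients well (high AUC) and still systematically over- or underestimate risk, which is why the two measures are reported side by side.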

The models developed in this action will not only be data-driven but will also integrate available literature and local knowledge, as we have shown that this is possible and improves predictions.

With the aforementioned preliminary work in progress, the infrastructure built in actions 1 and 2 gives us the new ability to verify our models using more patients and more variety in patients, by including data from multiple centres. At the moment verification is severely hampered by the difficulty of obtaining verification sets of sufficient quality. If the predictive models turn out to have a high accuracy in independent data sets, the next step is acceptance and use of the models for decision support at the point of care by the project members and in the wider cancer community. As an example, the prototype of such an interface currently at MAASTRO is shown below. This interface is able to show a personalized survival curve for a patient at different time points after therapy; at each time point, the survival probability and the confidence interval can be read from the curve. We are currently in the process of validating this predictive model.


4. Simulation of new treatments via in-silico trials

Classically, proving the efficacy and safety of a new treatment requires large-scale clinical trials (evidence-based medicine), which are expensive and account for the majority of the cost of new drugs and treatments. With the rise in healthcare costs, it is becoming common to also include cost-effectiveness analyses in the assessment of current or new medical practice (health technology assessment). It is the objective of this project to develop an innovative framework that both reduces the cost of clinical trials (and thus treatments) and estimates the cost-effectiveness of treatments.

The core approach is that predictive models developed in this project are used to identify the patients for whom a treatment has a high benefit. Because radiotherapy has the distinct advantage that a good predictor of treatment efficacy (i.e. radiation dose) exists, the radiotherapy community, and our group in particular, already use so-called "in-silico trials" or planning studies to determine whether a new technique or treatment is expected to be a meaningful improvement over current practice. The predicted benefit will be measured by traditional outcome parameters (survival, complications) but also by the associated cost-effectiveness measures (e.g. QALY), similar to a multi-centre study we recently started for a new, costly form of radiotherapy (ion therapy). At the moment these in-silico trials are very cumbersome and time-consuming, as they require data collection and analysis at multiple centres. The infrastructure developed in actions 1 and 2 will enable us to perform these types of studies more often and more efficiently. In this action we will extend these types of trials to treatments for which predictors of outcome are scarce (e.g. chemotherapy). We will use the predictive models in lung cancer from action 3 to develop in-silico trials of new forms or schedules of radiotherapy, in combination with new pharmacological agents or surgery, that are expected to increase cure or reduce complications. We believe that having reliable information from in-silico trials early in the product cycle of a new treatment allows a more rapid transition from bench to bedside, making evidence-based personalized medicine possible at an acceptable cost. A summary of the approach taken in this action is given below:


5. Implementation

It is critical that technological advances developed in the context of the euroCAT project can be commercially exploited.

A number of commercial products and/or activities will result from this project.

1. A system supporting clinical research: data extraction tools and their exploitation.

We envisage that the data extraction system, which selects patients for trials and enables the selection of data for analysis, will be of enormous value to research- or quality-minded health care institutions, as it has been to MAASTRO. Therefore the newly developed IT solution could be sold as a product.

From our previous experience we now know that data mining applications that work on data found at only one institution will never be sufficient to undertake robust research in personalized medicine. We see the infrastructure being developed as a product that will enable researchers at multiple institutions to work together and carry out the high level of research necessary to make significant steps towards personalized medicine.

2. A Euregional company managing a Euregional Trial centre:

We also see the data extraction infrastructure, connected to the databases of different hospitals, being presented as a product to companies that want to run effective and cost-efficient multicentric and international trials with access to large numbers of patients and variables in several international centres. We will draw up a business plan to assess the viability of a company managing this service.

3. Predictive models leading to a decision support system.

A web-based decision support service centered on predictive models for various treatments will be marketed to hospitals. This service will enable hospitals and clinicians to validate and select the best treatment for patients. A further product will be a predictive or patient selection model for a company that is developing a treatment for a specific patient category. For both products, if the predictive models are extended to include cost and cost-efficiency predictions (including extra features such as the choice of the most complementary, cost-efficient diagnostic procedure), they should also be of interest to governing bodies and health insurers. We anticipate that such a decision support model will be legally obligatory within a few years. This support system will enable the doctor and the patient to make the most appropriate choice from the multiplicity of therapeutic options.

4. De-identified data allowing advanced multi-dimensional modeling.

Access to de-identified multifactorial medical data (clinical, biological, imaging and therapeutic data with outcome) will be of considerable interest to research institutes and companies, and a marketable service, as it can be used to accurately evaluate a market size, develop new prediction models and perform in-silico trials. The FDA, in a recent vision document on the future of health care, hypothesized that medical R&D will primarily be in-silico and that clinical trials will be designed to confirm the in-silico findings. It is our belief that this approach could be applied in the short term in radiation oncology, for which in-silico trials (in the form of treatment simulations) are already common practice.

The market for these types of products is largely unexploited: there are very few patents and products in this area. Nevertheless, it is clear that there is a shift away from evidence-based medicine and towards personalized medicine. Recent scientific advancements, including the decoding of the human genome, have led to an explosion in research in genomics and proteomics. This work will revolutionize the way medicine is practiced. On completion of this project, commercial companies will be able to capitalize on this paradigm shift in medicine, and design and produce products and services that serve this new field.