OpenChart-SE: crowd-sourcing training data for clinical text mining from Swedish physicians and medical students


Go directly to the OpenChart-SE form

Every year hundreds of thousands of patients are treated in Swedish hospitals. By studying them thoroughly, we could gain many new insights into disease conditions, e.g. learn to predict disease outcomes, understand common symptom combinations or detect common adverse effects. To do this, we would need to systematically go through the patients’ electronic health records and extract information about prior diseases and medications, symptoms and other patient characteristics. Unfortunately, it is not possible to do this manually as the amount of text that needs to be processed far exceeds the resources available for research.

Artificial intelligence models trained to extract specific types of information, such as symptoms, could help with this. To train such models, we need patient health records as training material. However, real health records are highly sensitive by nature and cannot be shared openly. To solve this problem, we want to generate a collection of “fake” electronic health records, written by real Swedish health care professionals about imaginary patients. These can then be shared openly with researchers and used to train and evaluate a variety of artificial intelligence models without any privacy concerns. The models will learn the language typically used by Swedish health care professionals and can then be applied to real patient records after appropriate evaluation and ethical approval.

If you are a health care professional or medical student working in Sweden, you can help create this valuable resource by filling in one or more “fake” health records in the OpenChart-SE form. The form mimics the electronic health records used in Swedish emergency departments, but some fields (e.g. image diagnostics and prior medication) have been left out on purpose to shorten the time required to fill it in.

After completion of the data collection, the data will be made publicly available so that any researcher can benefit from it. If you would like to be acknowledged for your contribution, receive feedback on your form or be informed about the data release, you can choose to leave your contact details in the form.


What can be done with this data?

A specific form of artificial intelligence, called natural language processing (NLP), can be used to extract information from text. For example, trained clinical NLP models (a specific type of deep neural network) could extract certain types of expressions (e.g. all symptom terms) to obtain patient statistics or to flag at-risk patients. For English, considerable progress has been made in training clinical NLP models, but for Swedish this development has lagged behind. Training such models requires a large collection of representative example texts, which is difficult to obtain in Sweden because of the strict protection of sensitive patient data and the technical difficulties of exporting electronic health records. For example, we have waited over a year after receiving ethical permission to gain access to patient records for our COVID-19 project. This project aims to remove this roadblock by generating training data that raises no patient privacy concerns and can thus be shared freely. This will make it much easier for researchers to work on clinical NLP for Swedish patients.
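For readers curious what "extracting symptom terms" looks like in practice, the following is a minimal sketch only. It is a simple dictionary lookup, not the trained deep neural network models this project aims to support, and everything in it is an assumption made for illustration: the term list `SYMPTOM_TERMS`, the function names and the two example notes are invented and are not part of the OpenChart-SE data.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon of Swedish symptom terms (illustrative only).
SYMPTOM_TERMS = ["feber", "hosta", "bröstsmärta", "andfåddhet", "huvudvärk"]

# One case-insensitive pattern; word boundaries ensure only whole words match.
SYMPTOM_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, SYMPTOM_TERMS)) + r")\b",
    re.IGNORECASE,
)


def extract_symptoms(note: str) -> list[str]:
    """Return the lower-cased symptom terms found in a note, in order of appearance."""
    return [match.group(1).lower() for match in SYMPTOM_PATTERN.finditer(note)]


def symptom_counts(notes: list[str]) -> Counter:
    """Count in how many notes each symptom term is mentioned at least once."""
    counts = Counter()
    for note in notes:
        counts.update(set(extract_symptoms(note)))
    return counts


# Invented example notes, written for this sketch (not real patient data).
notes = [
    "Patienten söker för feber och hosta sedan tre dagar.",
    "Inkommer med bröstsmärta. Ingen feber.",
]

for note in notes:
    print(extract_symptoms(note))
print(symptom_counts(notes))
```

Note that this lookup also counts "feber" in the second note, although the text says "Ingen feber" (no fever). Handling negation, spelling variants and abbreviations is one reason why trained models, and therefore representative training texts, are needed.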


This project is a collaboration between researchers from Lund University, Karolinska Institute and Region Skåne. It is funded by the Knut and Alice Wallenberg Foundation through the SciLifeLab National COVID-19 Research Program.

For more information about this project contact the project leaders Sonja Aits and Johanna Berg.