gms | German Medical Science

68. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)

17.09. - 21.09.23, Heilbronn

Anonymize or Synthesize? – Privacy-Preserving Methods for Heart Failure Score Analytics

Meeting Abstract

  • Tim Ingo Johann - Klaus Tschira Institute for Integrative Computational Cardiology and Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, Germany
  • Karen Otte - Medical Informatics Group, Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Berlin, Germany
  • Harald Wilhelmi - Klaus Tschira Institute for Integrative Computational Cardiology and Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, Germany
  • Martin Lablans - Complex Data Processing in Medical Informatics (CMI), Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany
  • Fabian Prasser - Medical Informatics Group, Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Berlin, Germany
  • Christoph Dieterich - Klaus Tschira Institute for Integrative Computational Cardiology and Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 68. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS). Heilbronn, 17.-21.09.2023. Düsseldorf: German Medical Science GMS Publishing House; 2023. DocAbstr. 223

doi: 10.3205/23gmds011, urn:nbn:de:0183-23gmds0117

Published: September 15, 2023

© 2023 Johann et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Text

Introduction: Developing machine learning-based AI methods usually requires access to large amounts of data. However, personal health data is particularly sensitive and not easily accessible to data scientists or software developers. To improve this situation, classical anonymization methods based on aggregation and generalization, as well as modern data synthetization methods, can be used.

Methods: We used results from our previous HiGHmed Use Case Cardiology study by Sommer et al. [1] to study the usefulness of protected data generated through traditional anonymization and data synthetization. To this end, we compared the Barcelona BioHF version 1 and MAGGIC heart failure risk scores for 1-year and 3-year mortality calculated on the original dataset with those derived from anonymized and synthesized data.

In the anonymization process, we first removed all variables that are not needed for the use case studied, i.e., the calculation of MAGGIC or BioHF heart failure scores. We then applied different transformation methods to the 17 remaining variables. This was performed as an automated process using the optimization algorithms provided by ARX [2], [3]. In terms of protection level, the dataset was transformed to fulfill the k-anonymity property with k=2, which means that every record in the resulting dataset was indistinguishable from at least one other record with respect to all variables, so that no record was unique.
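
The k-anonymity property described above can be illustrated with a minimal sketch. It only checks whether a table is k-anonymous; it does not perform the transformation or optimization that ARX carries out, and it does not use the ARX API. The function name, the column names, and the example records are hypothetical and are not taken from the study data.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k=2):
    """Return True if every combination of quasi-identifier values
    is shared by at least k records."""
    class_sizes = Counter(
        tuple(record[name] for name in quasi_identifiers) for record in records
    )
    return all(size >= k for size in class_sizes.values())


# Hypothetical, already generalized records (illustration only).
records = [
    {"age": "60-69", "sex": "m", "nyha": "I-II"},
    {"age": "60-69", "sex": "m", "nyha": "I-II"},
    {"age": "70-79", "sex": "f", "nyha": "III-IV"},
    {"age": "70-79", "sex": "f", "nyha": "III-IV"},
]
columns = ["age", "sex", "nyha"]
print(is_k_anonymous(records, columns, k=2))
```

With k=2, a single record whose value combination occurs only once is enough to make the check fail.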

For the data synthesis process, we used the open-source software suite SDV [4], [5], which provides four distinct tabular data synthesizers (GaussianCopula, TVAE, CTGAN, and CopulaGAN). We trained each of the synthesizers on the original data and subsequently generated sufficient data for heart failure (HF) score calculations. The best-performing of the four models was selected post hoc by comparing the distributions of the synthesized and original input variables using appropriate statistical tests.
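
The post hoc selection step can be sketched as follows. The abstract does not name the statistical tests that were used, so this example assumes, purely for illustration, a Pearson chi-squared homogeneity statistic on categorical variables, summed over all columns; the synthesizer whose output yields the smallest total is preferred. The helper names and the toy data are hypothetical, and the actual selection procedure may differ.

```python
from collections import Counter


def chi2_homogeneity(original, synthetic):
    """Pearson chi-squared statistic for the 2 x C contingency table
    formed by the category counts of two samples."""
    counts_o, counts_s = Counter(original), Counter(synthetic)
    n_o, n_s = len(original), len(synthetic)
    stat = 0.0
    for category in set(counts_o) | set(counts_s):
        column_total = counts_o[category] + counts_s[category]
        for observed, n in ((counts_o[category], n_o), (counts_s[category], n_s)):
            expected = n * column_total / (n_o + n_s)
            stat += (observed - expected) ** 2 / expected
    return stat


def select_best(original, candidates):
    """Return the name of the candidate whose summed statistic over all
    columns of the original table is smallest."""
    def total(synthetic):
        return sum(
            chi2_homogeneity(original[col], synthetic[col]) for col in original
        )
    return min(candidates, key=lambda name: total(candidates[name]))


# Toy data: column name -> list of categorical values (illustration only).
original = {"sex": ["m", "m", "f", "f"], "diabetes": ["y", "n", "n", "n"]}
candidates = {
    "GaussianCopula": {"sex": ["m", "f", "f", "m"], "diabetes": ["n", "n", "y", "n"]},
    "CTGAN": {"sex": ["m", "m", "m", "m"], "diabetes": ["y", "y", "y", "y"]},
}
print(select_best(original, candidates))
```

Continuous variables would need a different comparison, for example a two-sample Kolmogorov-Smirnov statistic.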

Results: In the original dataset, we computed 890 MAGGIC scores from 2441 partially incomplete patient records [1]. The number of computable MAGGIC scores was lower for the anonymized dataset (n=750) and the synthesized dataset (n=416). For the latter, we synthesized 2441 records for comparison. The median 1-year mortality risk was 0.093 for the anonymized data, 0.1065 for the synthesized data, and 0.111 for the original dataset. We also compared each protected score distribution with the original one using a two-sided Kolmogorov-Smirnov test (p=0.02362 for anonymized data and p=0.1078 for synthesized data). Similar results were obtained for the BioHF scores.
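
The comparison of score distributions can be sketched with a two-sample, two-sided Kolmogorov-Smirnov test. The version below is a self-contained approximation that uses the asymptotic Kolmogorov distribution for the p-value, which is reasonable only for fairly large samples; it is not necessarily the implementation used in the study, and the risk values in the example are made up.

```python
import math
from bisect import bisect_right
from statistics import median


def ks_two_sample(a, b):
    """Two-sided two-sample Kolmogorov-Smirnov test.

    Returns (D, p), where D is the largest absolute difference between
    the two empirical distribution functions and p is an asymptotic
    approximation of the p-value.
    """
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    d = max(abs(bisect_right(a, x) / n - bisect_right(b, x) / m) for x in a + b)
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        # The series below converges slowly here; the true value is ~1.
        return d, 1.0
    p = 2.0 * sum(
        (-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam) for j in range(1, 101)
    )
    return d, min(max(p, 0.0), 1.0)


# Made-up 1-year mortality risks (illustration only).
original_risk = [0.05, 0.08, 0.11, 0.12, 0.15, 0.20, 0.31]
protected_risk = [0.04, 0.07, 0.09, 0.10, 0.14, 0.22]
print(median(original_risk), median(protected_risk))
print(ks_two_sample(original_risk, protected_risk))
```

A small p-value indicates that the two score distributions differ more than would be expected by chance.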

Figure 1 [Fig. 1]

Discussion and conclusion: While both methods yield results in good agreement with those of the original (non-de-identified) data, each has distinct advantages and drawbacks. Synthetic data models may be trained automatically and could produce any necessary amount of data. However, they may fail to capture complex statistical properties when trained on smaller datasets. Traditional anonymization processes, on the other hand, reduce the amount of available data compared to the original dataset, but may better preserve the data structure.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1.
Sommer KK, Amr A, Bavendiek U, Beierle F, Brunecker P, Dathe H, et al. Structured, Harmonized, and Interoperable Integration of Clinical Routine Data to Compute Heart Failure Risk Scores. Life (Basel). 2022;12(5):749. DOI: 10.3390/life12050749
2.
Prasser F, Eicher J, Spengler H, Bild R, Kuhn KA. Flexible data anonymization using ARX — Current status and challenges ahead. Software: Practice and Experience. 2020;50(7):1277-1304. DOI: 10.1002/spe.2812
3.
Prasser F, Eicher J, Spengler H, Bild R, Kuhn KA. ARX Data Anonymization Tool. Available from: https://arx.deidentifier.org/downloads/
4.
Patki N, Wedge R, Veeramachaneni K. The Synthetic Data Vault. In: 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA); 2016, Oct 17-19; Montreal, QC, Canada. IEEE; 2016. p. 399-410. DOI: 10.1109/DSAA.2016.49
5.
Johann TI, Wilhelmi H, Dieterich C. ASyH - Anonymous Synthesizer for Health Data. Available from: https://github.com/dieterich-lab/ASyH