Using Spark NLP to De-Identify Doctor Notes in the German Language

The ability to extract clinical information at large scale and in real time from unstructured clinical notes is becoming a mission-critical capability for IQVIA. Key data elements such as tumor stage and size, Social Determinants of Health, and ejection fractions are not available in typical structured EMR records. Additionally, the Cures Act Final Rule brings unstructured notes into play in the US from October 2022, with the EU likely to follow suit. The market will soon expect CROs and data vendors to be able to leverage unstructured notes for clinical trial recruitment, study registries, precision medicine, and manufacturing longitudinal datasets.

The first step in unlocking the value of unstructured healthcare data is efficient handling of Personally Identifiable Information. Accurately anonymizing medical data is challenging when the data is in languages other than English, and when it consists of short doctor notes, which pose a different challenge from full documents with paragraphs and sentences.

In this talk we show the data flow for what is probably the largest multi-country EMR data platform in the world, focusing on the de-identification module that allows doctor notes to be safely ingested and opened up for analytics. We go into the details of the Spark NLP de-identification model and pipeline built to handle German texts, share results, and summarize lessons learned, including how to combine rules and models for entity types where a rule-based approach outperforms trained models.
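The idea of combining rules and models can be illustrated with a minimal Python sketch. It is not the Spark NLP API and not the pipeline presented in the talk: the regex patterns, the `Span` type, the helper names, the sample note, and the hard-coded "model" prediction are all hypothetical, chosen only to show rule matches taking precedence over model predictions for rigidly formatted entity types such as dates and phone numbers.

```python
import re
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int
    end: int
    label: str


# Hypothetical rules for entity types with a rigid surface form.
RULES = {
    "DATE": re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b"),   # e.g. 03.04.1961
    "PHONE": re.compile(r"\b0\d{2,4}[ /-]\d{5,8}\b"),     # e.g. 030 1234567
}


def rule_spans(text: str) -> List[Span]:
    """Collect every regex match as a labeled span."""
    return [
        Span(m.start(), m.end(), label)
        for label, pattern in RULES.items()
        for m in pattern.finditer(text)
    ]


def merge_spans(rules: List[Span], model: List[Span]) -> List[Span]:
    """Keep all rule spans; keep a model span only if it overlaps no rule span."""
    kept = list(rules)
    for s in model:
        if all(s.end <= r.start or s.start >= r.end for r in rules):
            kept.append(s)
    return sorted(kept)


def mask(text: str, spans: List[Span]) -> str:
    """Replace each span with a <LABEL> placeholder, left to right."""
    out, cursor = [], 0
    for s in sorted(spans):
        if s.start < cursor:  # skip anything overlapping an already masked span
            continue
        out.append(text[cursor:s.start])
        out.append(f"<{s.label}>")
        cursor = s.end
    out.append(text[cursor:])
    return "".join(out)


# Invented sample note; the model output is hard-coded as a stand-in for
# what a trained NER model might return.
note = "Pat. Anna Schmidt, geb. 03.04.1961, Tel. 030 1234567, V.a. Pneumonie."
name_start = note.index("Anna Schmidt")
model_spans = [Span(name_start, name_start + len("Anna Schmidt"), "NAME")]

merged = merge_spans(rule_spans(note), model_spans)
masked_note = mask(note, merged)
print(masked_note)
```

Giving rule matches priority is only one possible policy; a real system might instead choose per entity type based on measured precision and recall.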

About the speakers
Maciej Piotrowski

Director, IT Architecture, Real World Solutions at IQVIA

Maciej is an Enterprise Architect responsible for defining the technology strategy for the Real World Analytics Solutions organization, with a focus on data systems.

Throughout his career, Maciej has held several senior technical positions, including Enterprise Architect in the healthcare domain and Chief Technologist in the financial services sector and in a global services delivery organization.

His focus and passion have always been in the areas of Architecture, Artificial Intelligence, and Data and Analytics.

Jiri Dobes

Head of Solutions at John Snow Labs

Jiri Dobes is the Head of Solutions at John Snow Labs. He has been leading the development of machine learning solutions in healthcare and other domains for the past five years. Jiri is a PMP-certified project manager.

His previous experience includes delivering large projects in the power generation sector and consulting for the Boston Consulting Group and large pharma. Jiri holds a Ph.D. in mathematical modeling.

NLP-Summit

When

Sessions: April 5th – 6th 2022
Trainings: April 12th – 15th 2022

Contact

nlpsummit@johnsnowlabs.com

Presented by

John Snow Labs