Accelerating Clinical Data Abstraction and Real-World Data Curation with Active Learning

Large-scale structured datasets of detailed clinical information about patient journeys are a critical tool in medical research, clinical guideline development, and real-world evidence generation. They are used heavily to study everything from cancer to COVID-19, but building them is highly challenging because of the massive, specialized effort required to abstract data from noisy and unstructured sources. Automating clinical data abstraction has historically faced three challenges.

First, each project has different guidelines on what, how, and when data should be extracted and normalized.

Second, the data is often taken from natural language documents, or a combination of structured, imaging, and document sources.

Third, near-perfect accuracy is required to support medical decision-making, so models that achieve 90% accuracy, for example, are not good enough.

In this session, Dr. Dia Trambitas will share an end-to-end, semi-automated system composed of three parts: Spark NLP for Healthcare as the underlying NLP engine, a team-based data annotation tool used by human specialists, and an active learning pipeline that automatically applies experts’ feedback to retrain models. This system achieved a 4x speedup in real-world data curation and clinical data abstraction projects, enabling an order-of-magnitude scale-up while retaining the accuracy achieved by a manual process.
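To make the active learning idea concrete, here is a minimal sketch of one common variant, uncertainty sampling, in plain Python. It is not the system presented in the session and does not use the Spark NLP for Healthcare API; the toy one-dimensional classifier and every name in it (`train`, `predict_proba`, `select_uncertain`, `active_learning_loop`, `expert_label`) are assumptions made purely for illustration. The loop asks a human expert to label only the items the current model is least sure about, then retrains on the enlarged labeled set.

```python
import math


def train(labeled):
    """Fit a toy 1-D classifier: a threshold halfway between the class means.

    Assumes `labeled` contains at least one example of each class (0 and 1).
    """
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0


def predict_proba(threshold, x):
    """Logistic score for class 1; values near 0.5 mean the model is unsure."""
    return 1.0 / (1.0 + math.exp(-(x - threshold)))


def select_uncertain(threshold, pool, k):
    """Uncertainty sampling: return the k pool items with scores closest to 0.5."""
    return sorted(pool, key=lambda x: abs(predict_proba(threshold, x) - 0.5))[:k]


def active_learning_loop(labeled, pool, expert_label, rounds=3, k=2):
    """Alternate between querying the expert and retraining the model.

    `expert_label` is a hypothetical callback standing in for the human
    annotators; in a real project it would be an annotation task.
    """
    labeled = list(labeled)
    pool = list(pool)
    threshold = train(labeled)
    for _ in range(rounds):
        if not pool:
            break
        # Send only the most ambiguous items to the expert.
        for x in select_uncertain(threshold, pool, k):
            labeled.append((x, expert_label(x)))
            pool.remove(x)
        # Apply the expert's feedback by retraining on the enlarged set.
        threshold = train(labeled)
    return threshold, labeled, pool
```

In this sketch, items far from the decision boundary are never sent to the expert, which is where the saving in manual effort would come from.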

About the speaker

Dia Trambitas

Product Lead at John Snow Labs

Dia Trambitas is a computer scientist with a rich background in Natural Language Processing. She has a Ph.D. in Semantic Web from the University of Grenoble, France, where she worked on ways of describing spatial and temporal data using OWL ontologies and reasoning based on semantic annotations.

She then shifted her focus to text processing and data extraction from unstructured documents, a subject she has been working on for the last 10 years.

She has rich experience working with different annotation tools and leading document classification and NER projects in verticals such as Finance, Investment, Banking, and Healthcare.