Automating a Streaming Pipeline with OCR on Databricks Lakehouse
Health systems and payers handle large volumes of clinical documents, many of which arrive as scanned images. Most organizations need these documents for daily operations, yet struggle to build a scalable pipeline to process them.
In this talk, Amir demonstrates how to build and automate a clinical data pipeline with JSL Healthcare Solutions on the Databricks Lakehouse Platform. The pipeline uses Databricks' Auto Loader, which automates ingestion into Delta Lake by incrementally picking up new files as they arrive.
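As a rough illustration of the ingestion step, the sketch below reads scanned files with Auto Loader (the `cloudFiles` streaming source) and appends them to a Delta table. It is not the code from the talk: the function names, the `*.png` file filter, and all paths are assumptions chosen for the example, and `start_ingest` only runs on a Databricks cluster, where a `spark` session is available.

```python
from typing import Dict


def autoloader_options(file_glob: str = "*.png") -> Dict[str, str]:
    """Reader options for Auto Loader. The default glob is an assumption."""
    return {
        # Load each file as raw bytes (columns: path, modificationTime, length, content).
        "cloudFiles.format": "binaryFile",
        # Only pick up files whose names match this pattern.
        "pathGlobFilter": file_glob,
    }


def start_ingest(spark, source_dir: str, target_path: str, checkpoint_dir: str):
    """Incrementally copy new files from source_dir into a Delta table.

    All three locations are hypothetical and supplied by the caller.
    """
    raw = (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options())
        .load(source_dir)
    )
    return (
        raw.writeStream.format("delta")
        # The checkpoint records which files have already been ingested.
        .option("checkpointLocation", checkpoint_dir)
        # Process everything currently available, then stop (needs a recent runtime).
        .trigger(availableNow=True)
        .start(target_path)
    )
```

Because the checkpoint tracks processed files, rerunning the same job on a schedule would ingest only the documents that arrived since the previous run.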
The pipeline retrieves scanned images from object storage, converts them to text, extracts clinical entities, and writes the results back to the same storage location in Delta format, where they can be analyzed further for a variety of clinical applications using Databricks SQL. All of this happens within a fully managed environment, which simplifies the ETL process.
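The order of the stages described above can be sketched in plain Python. This is only a shape of the data flow, not the talk's implementation: the OCR and entity-extraction steps are passed in as callables because the JSL components themselves are not shown here, and `ExtractedDocument` and `run_pipeline` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class ExtractedDocument:
    path: str            # where the scanned image was read from
    text: str            # text produced by the OCR stage
    entities: List[str]  # clinical entities found in that text


def run_pipeline(
    images: Iterable[Tuple[str, bytes]],
    ocr: Callable[[bytes], str],
    extract_entities: Callable[[str], List[str]],
) -> List[ExtractedDocument]:
    """Apply OCR, then entity extraction, to each (path, content) pair."""
    results = []
    for path, content in images:
        text = ocr(content)
        results.append(ExtractedDocument(path, text, extract_entities(text)))
    return results
```

In the pipeline from the talk, each record of this kind would correspond to a row written to the Delta table and queried from Databricks SQL.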
Technical Director for Health & Life Sciences at Databricks
Amir is the Technical Director for Health & Life Sciences at Databricks, where he focuses on developing advanced analytics solution accelerators to help health care and life sciences organizations on their data and AI journey.
Amir's past positions include Sr. Data Scientist at Shopify, Sr. Staff Scientist at AncestryDNA, and Research Scholar in Human Genetics at the Howard Hughes Medical Institute. He holds a Ph.D. in Mathematical Biology, an M.A.Sc. in Electrical Engineering, and a B.Sc. in Physics.