Querying MIMIC-IV data
This article demonstrates how to extract and prepare clinical data from MIMIC-IV using Pathling. We use a clinical study on oxygen supplementation differences between racial groups as our example, focusing on the data preparation steps that transform raw healthcare records into analysis-ready datasets.
This work was originally published as part of the paper "SQL on FHIR - Tabular views of FHIR data using FHIRPath" in npj Digital Medicine. The full code is available in the aehrc/sql-on-fhir-evaluation repository.
Introduction
We demonstrate these data extraction techniques using a study that examined whether patients from different racial and ethnic backgrounds receive different amounts of supplemental oxygen in intensive care units. This study provides an excellent example because it requires combining several types of clinical data: patient demographics, vital signs measurements, oxygen delivery records, and blood gas results.
Our data preparation process will extract:
- Patient demographic information including race and ethnicity
- Vital signs measurements, particularly oxygen saturation
- Oxygen flow rate measurements from respiratory equipment
- Blood gas analysis results showing oxygen levels in blood samples
Importing the MIMIC-IV dataset
The MIMIC-IV on FHIR dataset is provided in FHIR NDJSON format, and we can use the NDJSON reader in Pathling to load it into a set of Spark dataframes.
MIMIC-IV is available from PhysioNet. It comes in two variants:
- The full dataset (approximately 625M resources). This dataset requires credentialed access, and users must accept a data use agreement, which includes a mandatory training course.
- A demo sample (approximately 1M resources).
Because MIMIC-IV uses a non-standard naming convention for its files, we need to provide a custom file name mapper to correctly identify the resource type for each file:
Python:

    import re
    from pathling import PathlingContext

    pc = PathlingContext.create()
    # Derive the resource type from names such as "Mimic<ResourceType><Suffix>".
    data = pc.read.ndjson(
        "/usr/share/staging/ndjson",
        file_name_mapper=lambda file_name: re.findall(r"Mimic(\w+?)(?:ED|ICU|"
            r"Chartevents|Datetimeevents|Labevents|MicroOrg|MicroSusc|MicroTest|"
            r"Outputevents|Lab|Mix|VitalSigns|VitalSignsED)?$",
            file_name))

R:

    library(sparklyr)
    library(pathling)

    pc <- pathling_connect()
    # Derive the resource type from names such as "Mimic<ResourceType><Suffix>".
    data <- pc %>%
        pathling_read_ndjson(
            "/usr/share/staging/ndjson",
            file_name_mapper = function(file_name) {
                stringr::str_extract(file_name, "(?<=Mimic)\\w+?(?=ED|ICU|Chartevents|Datetimeevents|Labevents|MicroOrg|MicroSusc|MicroTest|Outputevents|Lab|Mix|VitalSigns|VitalSignsED|$)")
            }
        )
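To see what the file name mapper does, the same regular expression can be tried on its own, without Spark or Pathling. The sketch below is illustrative only: the sample file names are assumptions about how the dataset's files are named, and it assumes the mapper receives each name without its file extension.

```python
import re

# Same pattern as in the Python loading example above.
MIMIC_FILE_PATTERN = re.compile(
    r"Mimic(\w+?)(?:ED|ICU|"
    r"Chartevents|Datetimeevents|Labevents|MicroOrg|MicroSusc|MicroTest|"
    r"Outputevents|Lab|Mix|VitalSigns|VitalSignsED)?$"
)


def resource_types(file_name):
    """Return the list of resource types encoded in a file name."""
    return MIMIC_FILE_PATTERN.findall(file_name)


# Illustrative file names (assumed, without extension).
for name in [
    "MimicPatient",
    "MimicObservationChartevents",
    "MimicMedicationAdministrationICU",
    "MimicObservationVitalSignsED",
]:
    print(name, "->", resource_types(name))
```

The lazy group `(\w+?)` captures as little as possible, so any recognised suffix is dropped and several files can map to the same resource type. A name without the `Mimic` prefix yields an empty list.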
Understanding the data extraction approach
Layered data transformation
Data is first extracted into a set of intermediate views using SQL on FHIR view definitions. These views extract all relevant elements from the FHIR data.
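As a rough illustration of this first layer, a view definition for patient demographics might look like the sketch below, written as a plain Python dictionary. The view name `rv_patient`, the column names, and the assumption that race is recorded with the US Core race extension are illustrative choices and are not taken from the study's code. How the definition is passed to Pathling is not shown here.

```python
import json

# Assumption: race is recorded using the US Core race extension.
US_CORE_RACE = "http://hl7.org/fhir/us/core/StructureDefinition/us-core-race"

# Hypothetical view definition: one row per Patient resource.
patient_view = {
    "resourceType": "ViewDefinition",
    "name": "rv_patient",  # illustrative name
    "resource": "Patient",
    "status": "active",
    "select": [
        {
            "column": [
                {"name": "subject_id", "path": "getResourceKey()"},
                {"name": "gender", "path": "gender"},
                {
                    "name": "race_category",
                    "path": f"extension('{US_CORE_RACE}')"
                    ".extension('ombCategory').value.ofType(Coding).display.first()",
                },
            ]
        }
    ],
}


def column_names(view):
    """List the output columns that a view definition declares."""
    return [c["name"] for s in view["select"] for c in s.get("column", [])]


print(json.dumps(patient_view, indent=2))
print(column_names(patient_view))
```

Each column pairs an output name with a FHIRPath expression, so the resulting table has one flat column per extracted element.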
Next, related measurements are combined into clinical concepts such as vital signs and oxygen delivery using SQL transformations.
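The second layer can be sketched with ordinary SQL over the intermediate views. The example below uses Python's built-in `sqlite3` module as a stand-in for Spark SQL so that it runs anywhere. The table name `rv_obs_chartevents`, the placeholder codes, the synthetic rows, and the rule of taking the highest flow rate per timestamp are all assumptions made for illustration, not the study's actual logic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical intermediate view: one row per charted observation.
conn.execute(
    "CREATE TABLE rv_obs_chartevents "
    "(subject_id TEXT, chart_time TEXT, code TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO rv_obs_chartevents VALUES (?, ?, ?, ?)",
    [
        # Synthetic rows with placeholder codes.
        ("p1", "2000-01-01T10:00", "O2_FLOW", 2.0),
        ("p1", "2000-01-01T10:00", "O2_FLOW_ADDITIONAL", 4.0),
        ("p1", "2000-01-01T11:00", "O2_FLOW", 3.0),
        ("p1", "2000-01-01T10:00", "HEART_RATE", 80.0),
    ],
)

# Combine related flow measurements into a single "oxygen delivery" concept:
# one row per patient and timestamp, keeping the highest recorded flow rate.
oxygen_delivery = conn.execute(
    """
    SELECT subject_id, chart_time, MAX(value) AS o2_flow
    FROM rv_obs_chartevents
    WHERE code IN ('O2_FLOW', 'O2_FLOW_ADDITIONAL')
    GROUP BY subject_id, chart_time
    ORDER BY subject_id, chart_time
    """
).fetchall()

print(oxygen_delivery)
```

Keeping the raw extraction and the clinical logic in separate layers means the grouping rule can be changed without touching the view definitions.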