FHIR encoders

The Pathling library can be used to transform FHIR Bundles or NDJSON into Spark data sets. Once your data is encoded, it can be queried using SQL, or transformed using the full library of functions that Spark provides. It can also be written to Parquet and other formats that are compatible with a wide range of tools. See the Spark documentation for more details.

Reading in NDJSON

NDJSON is a format commonly used for bulk FHIR data, and consists of files (one per resource type), each containing one JSON resource per line.
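A minimal NDJSON file can be produced with nothing more than the standard library; the resource content below is purely illustrative.

```python
import json
import os
import tempfile

# Two Patient resources, serialised as one JSON object per line
# (hypothetical example data).
resources = [
    {"resourceType": "Patient", "id": "pat-1", "gender": "female"},
    {"resourceType": "Patient", "id": "pat-2", "gender": "male"},
]

ndjson_dir = tempfile.mkdtemp()
ndjson_file = os.path.join(ndjson_dir, "Patient.ndjson")
with open(ndjson_file, "w") as f:
    for resource in resources:
        f.write(json.dumps(resource) + "\n")

# Each line of the file is now a complete, standalone JSON resource.
with open(ndjson_file) as f:
    lines = f.readlines()
```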

To use the Pathling encoders from Python, install the pathling package using pip. Note that Java 11 or later is required, with your JAVA_HOME environment variable properly set.

from pathling import PathlingContext

pc = PathlingContext.create()

# Read each line from the NDJSON into a row within a Spark data set.
ndjson_dir = '/some/path/ndjson/'
json_resources = pc.spark.read.text(ndjson_dir)

# Convert the data set of strings into a structured FHIR data set.
patients = pc.encode(json_resources, 'Patient')

# Query the encoded data set.
patients.select('id', 'gender', 'birthDate').show()

Reading in Bundles

The FHIR Bundle resource can contain a collection of FHIR resources. It is often used to represent a set of related resources, perhaps generated as part of the same event.

As with NDJSON, Bundles can be read in as plain text and then encoded:

from pathling import PathlingContext

pc = PathlingContext.create()

# Read each Bundle into a row within a Spark data set.
bundles_dir = '/some/path/bundles/'
bundles = pc.spark.read.text(bundles_dir, wholetext=True)

# Convert the data set of strings into a structured FHIR data set.
patients = pc.encode_bundle(bundles, 'Patient')

# JSON is the default format; XML Bundles can be encoded by specifying the
# input type (MimeType is importable from the pathling package).
# patients = pc.encode_bundle(bundles, 'Patient', inputType=MimeType.FHIR_XML)

# Query the encoded data set.
patients.select('id', 'gender', 'birthDate').show()