Encoders

Pathling provides a set of libraries that can be used to transform data between FHIR (JSON or XML) and Apache Spark data sets. The encoders can be used from Python, Scala and Java.

Once your data is encoded as a Spark data set, it can be queried using SQL, or transformed using the full library of functions that Spark provides. It can also be written to Parquet and other formats that are compatible with a wide range of tools. See the Spark documentation for more details.

info

We also have upcoming support for R; subscribe to this issue for updates.

Reading in NDJSON

NDJSON is a format commonly used for bulk FHIR data, and consists of files (one per resource type) that each contain one JSON resource per line.

To use the Pathling encoders from Python, install the pathling package using pip. Note that Java 11 or later is required, with your JAVA_HOME environment variable properly set.

from pathling import PathlingContext

pc = PathlingContext.create()

# Read each line from the NDJSON into a row within a Spark data set.
ndjson_dir = '/some/path/ndjson/'
json_resources = pc.spark.read.text(ndjson_dir)

# Convert the data set of strings into a structured FHIR data set.
patients = pc.encode(json_resources, 'Patient')

# Query a few elements from the encoded Patient resources.
patients.select('id', 'gender', 'birthDate').show()
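
Once encoded, the resources behave like any other Spark data set. As a follow-on sketch of the SQL and Parquet capabilities described above (the view name and output path here are illustrative), the data can be queried with Spark SQL and written out to Parquet:

# Register the encoded resources as a temporary view for SQL queries.
patients.createOrReplaceTempView('patient')
pc.spark.sql(
    "SELECT id, birthDate FROM patient WHERE gender = 'female'"
).show()

# Write the encoded resources out to Parquet for use with other tools.
patients.write.parquet('/some/path/parquet/patient.parquet')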

Reading in Bundles

The FHIR Bundle resource can contain a collection of FHIR resources. It is often used to represent a set of related resources, perhaps generated as part of the same event.

Bundles are read in whole (one file per row) using the wholetext option, then encoded in the same way:

from pathling import PathlingContext, MimeType

pc = PathlingContext.create()

# Read each Bundle into a row within a Spark data set.
bundles_dir = '/some/path/bundles/'
bundles = pc.spark.read.text(bundles_dir, wholetext=True)

# Convert the data set of strings into a structured FHIR data set.
patients = pc.encode_bundle(bundles, 'Patient')

# JSON is the default format; XML Bundles can be encoded by specifying the input type.
# patients = pc.encode_bundle(bundles, 'Patient', input_type=MimeType.FHIR_XML)

# Query a few elements from the encoded Patient resources.
patients.select('id', 'gender', 'birthDate').show()
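
A single set of Bundles usually contains several resource types, and each can be encoded from the same data set with a further encode_bundle call. A brief sketch, assuming the Bundles also contain Condition resources:

# Encode a second resource type from the same Bundles.
conditions = pc.encode_bundle(bundles, 'Condition')
conditions.select('id', 'subject.reference').show()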

Installation in Databricks

To make the Pathling encoders available within notebooks, navigate to the "Compute" section and click on the cluster. Click on the "Libraries" tab, and click "Install new".

Install both the pathling PyPI package and the au.csiro.pathling:library-api Maven package. Once the cluster is restarted, the libraries should be available for import and use within all notebooks.

See the Databricks documentation on Libraries for more information.

Spark cluster configuration

If you are running your own Spark cluster, or using a Docker image (such as jupyter/all-spark-notebook), you will need to configure Pathling as a Spark package.

You can do this by adding the following to your spark-defaults.conf file:

spark.jars.packages au.csiro.pathling:library-api:[some version]

See the Configuration page of the Spark documentation for more information about spark.jars.packages and other related configuration options.
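
If you construct the Spark session yourself, the same package can instead be supplied through the session builder. A minimal sketch, assuming PathlingContext.create accepts an existing session, and using the same version placeholder as above:

from pyspark.sql import SparkSession
from pathling import PathlingContext

# Add the Pathling library to the session at creation time.
spark = (
    SparkSession.builder
    .config('spark.jars.packages', 'au.csiro.pathling:library-api:[some version]')
    .getOrCreate()
)

# Reuse the pre-configured session within the Pathling context.
pc = PathlingContext.create(spark)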

To create a Pathling notebook Docker image, your Dockerfile might look like this:

FROM jupyter/all-spark-notebook

USER root
RUN echo "spark.jars.packages au.csiro.pathling:library-api:[some version]" >> /usr/local/spark/conf/spark-defaults.conf

USER ${NB_UID}

RUN pip install --quiet --no-cache-dir pathling && \
fix-permissions "${CONDA_DIR}" && \
fix-permissions "/home/${NB_USER}"