Version: 9.8.0

Spark configuration

Supported versions

Pathling is built and tested against Apache Spark 4.0.x. The Python library requires PySpark 4.0.x, and the R library requires sparklyr with Spark 4.0.x.
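As a quick sanity check, the version requirement above can be tested before starting a session. This is a minimal sketch: the helper name `is_tested_spark_version` is hypothetical and not part of the Pathling API, and it only compares the major and minor components of the version string.

```python
def is_tested_spark_version(version: str) -> bool:
    """Return True if `version` belongs to the 4.0.x line."""
    # Compare only the major and minor components, e.g. "4.0.1" -> ["4", "0"].
    return version.split(".")[:2] == ["4", "0"]


if __name__ == "__main__":
    try:
        import pyspark  # only available if PySpark is installed
    except ImportError:
        print("PySpark is not installed")
    else:
        tested = is_tested_spark_version(pyspark.__version__)
        status = "tested" if tested else "untested"
        print(f"PySpark {pyspark.__version__}: {status} with this Pathling release")
```

A version outside 4.0.x is reported as untested rather than rejected, since the text above only states what Pathling is built and tested against.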

ANSI mode

Pathling evaluates FHIRPath expressions and ViewDefinitions identically and correctly whether spark.sql.ansi.enabled is set to true or false. Spark 4 enables ANSI mode by default, and Pathling works correctly on a stock session. A value that does not conform to its declared type, an out-of-range decimal, or a decimal arithmetic result that overflows yields an empty result for that value rather than aborting the query.

Because the setting is session-wide, you may keep it disabled (spark.sql.ansi.enabled=false) to support portable, vendor-neutral transform SQL that relies on lenient casts, without affecting how Pathling behaves. Pathling neither sets nor requires any particular value for this flag.
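To illustrate the kind of lenient cast that your own transform SQL might rely on, here is a minimal sketch that toggles the flag on an existing session. The helper names `set_ansi_mode` and `lenient_cast_result` are hypothetical; `spark.conf.set` and `spark.sql` are standard PySpark calls, and `spark` is assumed to be a session you have already created.

```python
ANSI_KEY = "spark.sql.ansi.enabled"


def set_ansi_mode(spark, enabled: bool) -> None:
    """Set ANSI mode on an existing SparkSession at runtime."""
    spark.conf.set(ANSI_KEY, "true" if enabled else "false")


def lenient_cast_result(spark):
    """Cast a non-numeric string to INT in plain Spark SQL.

    With ANSI mode disabled, Spark returns NULL (None in Python).
    With ANSI mode enabled, the same query raises an error.
    """
    return spark.sql("SELECT CAST('abc' AS INT) AS n").first()["n"]


# Usage, given an existing session named `spark`:
#   set_ansi_mode(spark, False)
#   lenient_cast_result(spark)
```

This only affects your own SQL. As described above, Pathling's evaluation of FHIRPath expressions and ViewDefinitions is the same under either setting.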

Session configuration

When you create a PathlingContext within your Spark application, it detects any existing SparkSession and uses it. If there is no existing session, it creates one for you with a sensible default configuration. You can override this default configuration by passing your own SparkSession object when creating the PathlingContext, for example PathlingContext.create(spark) in Python.

This can be useful if you want to set other Spark configuration, for example to increase the available memory.
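A minimal sketch of the memory example mentioned above follows. The helper names `apply_conf` and `create_context` are hypothetical, and the value "4g" is an illustrative assumption rather than a recommendation. `spark.driver.memory` is a standard Spark setting, and it generally takes effect only if it is set before the driver JVM starts.

```python
from typing import Any, Mapping

# Illustrative settings: the Pathling package plus a larger driver heap.
EXTRA_CONF = {
    "spark.jars.packages": "au.csiro.pathling:library-runtime:9.8.0",
    "spark.driver.memory": "4g",  # assumed value, adjust to your workload
}


def apply_conf(builder: Any, conf: Mapping[str, str]) -> Any:
    """Apply each key/value pair to a SparkSession builder and return it."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


def create_context(conf: Mapping[str, str] = EXTRA_CONF):
    """Build a session with the given settings and wrap it in a PathlingContext."""
    # Imported here so that apply_conf can be used without Spark installed.
    from pathling import PathlingContext
    from pyspark.sql import SparkSession

    spark = apply_conf(SparkSession.builder, conf).getOrCreate()
    return PathlingContext.create(spark)
```

Calling `create_context()` would then return a PathlingContext backed by the customised session.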

The session that you provide must have the Pathling library API on the classpath. You can also optionally enable Delta Lake support. Here is an example of how to programmatically configure a session that has Delta enabled:

from pathling import PathlingContext
from pyspark.sql import SparkSession

spark = (
    # Fetch the Pathling runtime and Delta Lake as Spark packages.
    SparkSession.builder.config(
        "spark.jars.packages",
        "au.csiro.pathling:library-runtime:9.8.0," +
        "io.delta:delta-spark_2.13:4.0.0"
    )
    # Register the Delta SQL extension and catalog.
    .config(
        "spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension"
    )
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

pc = PathlingContext.create(spark)

Cluster configuration

If you are running your own Spark cluster, or using a Docker image (such as jupyter/all-spark-notebook), you will need to configure Pathling as a Spark package.

You can do this by adding the following to your spark-defaults.conf file:

spark.jars.packages au.csiro.pathling:library-runtime:[some version]

See the Configuration page of the Spark documentation for more information about spark.jars.packages and other related configuration options.
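Because spark.jars.packages takes a single comma-separated list, a cluster that already sets this key needs the Pathling coordinate merged into the existing value rather than added as a second entry. The following is a minimal sketch of that merge. The helper name `add_spark_package` is hypothetical, it assumes whitespace-separated `key value` lines, and the version in the demo is simply the one this page documents.

```python
import tempfile
from pathlib import Path

PACKAGES_KEY = "spark.jars.packages"


def add_spark_package(conf_path: Path, coordinate: str) -> None:
    """Add a Maven coordinate to spark.jars.packages, merging with any existing list."""
    lines = conf_path.read_text().splitlines() if conf_path.exists() else []
    for i, line in enumerate(lines):
        fields = line.split(None, 1)  # key, then the rest of the line
        if fields and fields[0] == PACKAGES_KEY:
            value = fields[1] if len(fields) > 1 else ""
            packages = [p.strip() for p in value.split(",") if p.strip()]
            if coordinate not in packages:
                packages.append(coordinate)
            lines[i] = f"{PACKAGES_KEY} {','.join(packages)}"
            break
    else:
        # No existing entry for the key, so append a new line.
        lines.append(f"{PACKAGES_KEY} {coordinate}")
    conf_path.write_text("\n".join(lines) + "\n")


if __name__ == "__main__":
    # Demonstrate against a temporary file instead of a real Spark installation.
    with tempfile.TemporaryDirectory() as tmp:
        conf = Path(tmp) / "spark-defaults.conf"
        add_spark_package(conf, "au.csiro.pathling:library-runtime:9.8.0")
        add_spark_package(conf, "io.delta:delta-spark_2.13:4.0.0")
        print(conf.read_text(), end="")
```

Running the helper twice with the same coordinate leaves the file unchanged, so it can be used safely in provisioning scripts.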

To create a Pathling notebook Docker image, your Dockerfile might look like this:

FROM jupyter/all-spark-notebook

USER root
RUN echo "spark.jars.packages au.csiro.pathling:library-runtime:[some version]" >> /usr/local/spark/conf/spark-defaults.conf

USER ${NB_UID}

RUN pip install --quiet --no-cache-dir pathling && \
    fix-permissions "${CONDA_DIR}" && \
    fix-permissions "/home/${NB_USER}"