Prostate cancer risk factors
This example demonstrates how to use SQL on FHIR views to query a set of risk factors from FHIR data for the purpose of identifying patients at a high-risk of being diagnosed with prostate cancer.
- Python
- R
from pathling import PathlingContext
from pyspark.sql.functions import to_date, to_timestamp, round, min, nth_value
from pyspark.sql.window import Window
library(pathling)
library(sparklyr)
library(dplyr)
As a first step, we will initialise the Pathling context.
You can call this with no arguments and it will either create a new Spark session with some sensible defaults, or pick up an existing Spark session.
- Python
- R
pc = PathlingContext.create()
pc.spark.sql("CREATE SCHEMA IF NOT EXISTS paper_lg")
pc.spark.catalog.setCurrentDatabase("paper_lg")
pc <- pathling_connect()
DBI::dbExecute(pc, "CREATE SCHEMA IF NOT EXISTS paper_lg")
DBI::dbExecute(pc, "USE paper_lg")
Read data from tables
Now we will read some data from a set of previously persisted Delta tables. The data we are using here is a Synthea dataset containing approximately 10,000 patients.
The object returned is a "data source", which contains each of the data frames that have been encoded, as well as methods to run queries over them.
- Python
- R
data = pc.read.tables()
data <- pc %>% pathling_from_tables()
Here are the resources that are contained within our new data source:
- Python
- R
data.resource_types()
data %>% ds_resource_types()
['Condition',
'Immunization',
'MedicationRequest',
'Organization',
'CareTeam',
'Device',
'DiagnosticReport',
'ImagingStudy',
'Patient',
'ExplanationOfBenefit',
'Practitioner',
'AllergyIntolerance',
'CarePlan',
'SupplyDelivery',
'Observation',
'Encounter',
'MedicationAdministration',
'Procedure',
'Claim']
Here are the row counts for some of the resources:
- Python
- R
data.read('Patient').count()
11515
data.read('Observation').count()
2881995
data.read('Condition').count()
373288
data %>% ds_read('Patient') %>% sdf_nrow()
11515
data %>% ds_read('Observation') %>% sdf_nrow()
2881995
data %>% ds_read('Condition') %>% sdf_nrow()
373288
Prostate cancer risk factors
To demonstrate the SQL on FHIR query functionality, we will describe a simple scenario that involves the extraction of data to support an analysis of prostate cancer risk factors.
We will extract some patient demographic data along with some candidate risk factors:
- Hyperlipidemia (recorded total cholesterol > 240 mg/dL)
- High BMI (recorded BMI > 30 kg/m2)
Our target view will be a table containing one row per patient, with the following columns:
- Patient ID
- Birth date
- Postal code
- Deceased status
- Prostate cancer onset
- High BMI observed date
- Hyperlipidemia observed date
- Total cholesterol (mg/dL)
In order to construct this view, we will create four different views, and then compose them together using regular Spark SQL:
- Patient demographics
- Hyperlipidemia observations
- High BMI observations
- Prostate cancer diagnoses
Patient demographic view
This view will provide the following data elements:
- Patient ID
- Birth date
- Postal code
- Deceased status
Patient ID will be extracted using the getResourceKey()
function. This returns
an identifier that can be used along with the getReferenceKey()
function to
join the related rows from different views together.
Birth date and deceased time are pretty straightforward, and are singular so do not require any kind of unnesting.
Postal code returns multiple values, as a patient can have more than one
address. In this example we have chosen to keep this as an array, by setting the
collection
element to true
.
- Python
- R
patients = data.view(
"Patient",
select=[
{
"column": [
{
"description": "Patient ID",
"path": "getResourceKey()",
"name": "id",
},
{
"description": "Birth date",
"path": "birthDate",
"name": "birth_date",
},
{
"description": "Postal code",
"path": "address.postalCode",
"name": "postal_code",
"collection": True
},
{
"description": "Deceased time",
"path": "deceased.ofType(dateTime)",
"name": "deceased",
},
]
},
],
)
patients <- data %>%
ds_view(
"Patient",
select = list(
column = list(
list(
description = "Patient ID",
path = "getResourceKey()",
name = "id"
),
list(
description = "Birth date",
path = "birthDate",
name = "birth_date"
),
list(
description = "Postal code",
path = "address.postalCode",
name = "postal_code",
collection = TRUE
),
list(
description = "Deceased time",
path = "deceased.ofType(dateTime)",
name = "deceased"
)
)
)
)
- Python
- R
patients.show(10, truncate=False)
patients %>% head(10) %>% collect()
id | birth_date | postal_code | deceased |
---|---|---|---|
Patient/801faac2-595a-3274-8a98-f6d78774310f | 1982-10-24 | [02128] | NULL |
Patient/801fb251-3ae3-8d3b-2519-89fb92592701 | 1931-03-28 | [02114] | NULL |
Patient/8024510b-8666-2688-08f1-6272e64ce607 | 2008-10-11 | [02780] | NULL |
Patient/80333a44-525d-5b2e-f86f-f49ba6ff22fb | 1957-07-27 | [02118] | 1987-06-15T08:00:45+10:00 |
Patient/80379569-fe92-82c5-5ca7-f525414f6adb | 1987-02-04 | [02148] | NULL |
Patient/803a5acc-45ab-66a4-e4d0-5ef7f035d4ab | 1982-04-15 | [] | NULL |
Patient/803a8a7d-c4ab-c281-3660-2e998737187a | 1913-10-03 | [02132] | NULL |
Patient/8051dd33-ab42-a906-1b43-08e869a6c45e | 2002-08-07 | [] | NULL |
Patient/80593f71-36be-a9da-7ada-4fc756caca86 | 2010-11-01 | [01604] | NULL |
Patient/8064f6c5-77d3-9652-10ca-332d2b80b078 | 1998-10-28 | [02118] | NULL |
Hyperlipidemia view
This view will provide the following data elements:
- Observation ID
- Patient ID
- Observation date
- Observation code
- Total cholesterol unit
- Total cholesterol value
The view will be filtered to only total cholesterol observations that exceed 240 mg/dL.
- Python
- R
cholesterol = data.view(
"Observation",
select=[
{
"column": [
{
"description": "Observation ID",
"path": "getResourceKey()",
"name": "id",
},
{
"description": "Patient ID",
"path": "subject.getReferenceKey()",
"name": "patient_id",
},
{
"description": "Observation date",
"path": "effective.ofType(dateTime)",
"name": "date",
},
],
"select": [
{
"forEach": "code.coding",
"column": [
{
"description": "Observation code",
"path": "code",
"name": "code",
},
],
},
{
"forEach": "value.ofType(Quantity)",
"column": [
{
"description": "Total cholesterol unit",
"path": "unit",
"name": "unit",
},
{
"description": "Total cholesterol value",
"path": "value",
"name": "value",
},
],
},
],
}
],
where=[
{
"description": "Total cholesterol > 240 mg/dL",
"path": "where(code.coding.exists(system = 'http://loinc.org'"
"and code = '2093-3'))"
".value.ofType(Quantity).where(code = 'mg/dL')"
".value > 240",
}
],
)
cholesterol <- data %>%
ds_view(
"Observation",
select = list(
column = list(
list(
description = "Observation ID",
path = "getResourceKey()",
name = "id"
),
list(
description = "Patient ID",
path = "subject.getReferenceKey()",
name = "patient_id"
),
list(
description = "Observation date",
path = "effective.ofType(dateTime)",
name = "date"
)
),
select = list(
list(
forEach = "code.coding",
column = list(
list(
description = "Observation code",
path = "code",
name = "code"
)
)
),
list(
forEach = "value.ofType(Quantity)",
column = list(
list(
description = "Total cholesterol unit",
path = "unit",
name = "unit"
),
list(
description = "Total cholesterol value",
path = "value",
name = "value"
)
)
)
)
),
where = list(
list(
description = "Total cholesterol > 240 mg/dL",
path = paste0(
"where(code.coding.exists(system = 'http://loinc.org'",
"and code = '2093-3'))",
".value.ofType(Quantity).where(code = 'mg/dL')",
".value > 240"
)
)
)
)
- Python
- R
cholesterol.show(10, truncate=False)
cholesterol %>% head(10) %>% collect()
id | patient_id | date | code | unit | value |
---|---|---|---|---|---|
Observation/a9d28e2b-fc1e-dc37-d2b8-54f2d7e1fce2 | Patient/68d99a0d-35f4-e8cb-35d0-eb54e3dd480a | 2018-12-16T19:09:44+10:00 | 2093-3 | mg/dL | 250.69 |
Observation/a9deed27-9d92-88a8-b3fc-abd952492300 | Patient/9f9aa9cc-b79f-b05a-baf7-4ca3a6263b9e | 2012-12-13T18:16:12+10:00 | 2093-3 | mg/dL | 256.12 |
Observation/aa177299-7073-063e-9bb3-ebd0c2d5aa5d | Patient/fa615cec-e629-3f65-b867-749101680d61 | 2020-01-18T16:42:38+10:00 | 2093-3 | mg/dL | 248.32 |
Observation/aa454888-bc42-bafb-25db-cc33cad48fd8 | Patient/af0eb7c8-c910-2a57-50c3-ccc45cc36eaf | 2017-11-24T14:04:33+10:00 | 2093-3 | mg/dL | 272.96 |
Observation/aac208e1-7989-915f-368b-c37a8ad9fe1b | Patient/bea6b76e-4d88-9a9f-2d79-dc15000e3f21 | 2015-02-21T03:44:57+10:00 | 2093-3 | mg/dL | 241.76 |
Observation/abc50b62-8502-d66a-5a49-d3b50f21a68f | Patient/139e597e-1f4d-a261-7958-555b7f4ffd80 | 2012-06-21T02:25:09+10:00 | 2093-3 | mg/dL | 240.75 |
Observation/abfcffcf-5aba-3b9a-f521-907f2f248eb4 | Patient/004d1fc0-8eb8-cc45-b3ea-4495c0b16d50 | 1972-11-15T09:17:20+10:00 | 2093-3 | mg/dL | 247.27 |
Observation/ac009d08-fc3d-b5a0-57dc-45e7555e1e5d | Patient/ff7478da-d888-bc48-b636-d5233f8de938 | 2018-03-31T03:35:07+10:00 | 2093-3 | mg/dL | 251.84 |
Observation/ac50aafa-7043-f900-e5a3-aa25d8311cf3 | Patient/b52d7e4f-632c-49cd-00f7-3b5b7d85f8ba | 2013-05-20T11:47:21+10:00 | 2093-3 | mg/dL | 291.25 |
Observation/ac5efce4-ce67-eb23-338c-2e64961794c3 | Patient/b6dc56c4-91e9-2880-a8a0-5a438d7e19ff | 2012-08-02T08:17:10+10:00 | 2093-3 | mg/dL | 261.25 |
BMI view
This view will provide the following data elements:
- Observation ID
- Patient ID
- Observation date
- Observation code
- BMI unit
- BMI value
The view will be filtered to only BMI observations that exceed 30 kg/m2.
One of the nice things about the Pathling implementation is that the encoding process includes canonicalisation of UCUM units. This means that the comparison of total cholesterol values in this query will also pick up observations that were made with different but comparable units, such as mg/dL.
- Python
- R
bmi = data.view(
"Observation",
select=[
{
"column": [
{
"description": "Observation ID",
"path": "getResourceKey()",
"name": "id",
},
{
"description": "Patient ID",
"path": "subject.getReferenceKey()",
"name": "patient_id",
},
{
"description": "Observation date",
"path": "effective.ofType(dateTime)",
"name": "date",
},
],
"select": [
{
"forEach": "code.coding",
"column": [
{
"description": "Observation code",
"path": "code",
"name": "code",
},
],
},
{
"forEach": "value.ofType(Quantity)",
"column": [
{
"description": "BMI unit",
"path": "unit",
"name": "unit",
},
{
"description": "BMI value",
"path": "value",
"name": "value",
},
],
},
],
}
],
where=[
{
"description": "BMI > 30 kg/m2",
"path": "where(code.coding.exists(system = 'http://loinc.org'"
"and code = '39156-5'))"
".value.ofType(Quantity).where(code = 'kg/m2')"
".value > 30",
}
],
)
bmi <- data %>%
ds_view(
"Observation",
select = list(
column = list(
list(
description = "Observation ID",
path = "getResourceKey()",
name = "id"
),
list(
description = "Patient ID",
path = "subject.getReferenceKey()",
name = "patient_id"
),
list(
description = "Observation date",
path = "effective.ofType(dateTime)",
name = "date"
)
),
select = list(
list(
forEach = "code.coding",
column = list(
list(
description = "Observation code",
path = "code",
name = "code"
)
)
),
list(
forEach = "value.ofType(Quantity)",
column = list(
list(
description = "BMI unit",
path = "unit",
name = "unit"
),
list(
description = "BMI value",
path = "value",
name = "value"
)
)
)
)
),
where = list(
list(
description = "BMI > 30 kg/m2",
path = paste0(
"where(code.coding.exists(system = 'http://loinc.org'",
"and code = '39156-5'))",
".value.ofType(Quantity).where(code = 'kg/m2')",
".value > 30"
)
)
)
)
- Python
- R
bmi.show(10, truncate=False)
bmi %>% head(10) %>% collect()
id | patient_id | date | code | unit | value |
---|---|---|---|---|---|
Observation/a96ce215-8a06-cd1d-639a-0e06fa771ee0 | Patient/5c1d48a9-e9c0-4b62-5384-ebe0fd0dc3bc | 2020-02-04T17:53:45+10:00 | 39156-5 | kg/m2 | 33.22 |
Observation/a96d3264-4e89-ac70-ad85-32116cd06392 | Patient/e99277b9-a765-e168-8780-2dcf54a43efc | 2004-10-29T08:28:40+10:00 | 39156-5 | kg/m2 | 30.02 |
Observation/a974a0b6-4914-0168-bf12-d852775d621c | Patient/a05ad089-2587-1e64-2c5f-6e855a8baa94 | 2015-04-09T22:24:40+10:00 | 39156-5 | kg/m2 | 30.09 |
Observation/a9752afb-d88b-cd1d-7aa8-bd954a0f9539 | Patient/de01b661-e58b-aac1-ea38-35a91690d21c | 2012-12-13T12:04:10+10:00 | 39156-5 | kg/m2 | 31.01 |
Observation/a978697c-6fff-c177-8a13-67b42f9846d1 | Patient/eab52e11-e0ee-fc45-9a00-9dee0f0f0447 | 2018-08-28T11:04:20+10:00 | 39156-5 | kg/m2 | 30.37 |
Observation/a9793ea2-a7ff-5f49-d0ae-1fbf265f482f | Patient/2b206983-2d3a-2e07-77ac-5e49f9fdc4d3 | 2016-04-03T08:26:46+10:00 | 39156-5 | kg/m2 | 32.84 |
Observation/a97bf446-27cc-235d-f886-209ca752aecd | Patient/d033196c-eaa6-5eaa-fd63-9cf3cd29ac7a | 2013-05-06T08:11:47+10:00 | 39156-5 | kg/m2 | 30.03 |
Observation/a9803b33-61fe-f8e7-460c-7ff52493ef1a | Patient/47f13b5c-3785-8897-c19f-cfae4c545362 | 2019-10-17T20:23:26+10:00 | 39156-5 | kg/m2 | 30.34 |
Observation/a986bd7f-f1b8-0291-5543-c43973c9450b | Patient/fec02692-95ba-f8f2-cca4-f0f47ecaa48f | 1972-10-07T11:50:23+10:00 | 39156-5 | kg/m2 | 30.1 |
Observation/a988cd5b-2a54-2be7-94bd-614733fe08ab | Patient/615abef5-4c9d-a53d-e6b3-bc5c45cb2cb6 | 1994-01-06T01:48:39+10:00 | 39156-5 | kg/m2 | 30.41 |
Prostate cancer diagnosis view
This view will provide the following data elements:
- Condition ID
- Patient ID
- Condition onset date
- Condition code
The view will be filtered to only prostate cancer diagnoses.
- Python
- R
prostate_cancer_diagnoses = data.view(
"Condition",
select=[
{
"column": [
{
"description": "Condition ID",
"path": "getResourceKey()",
"name": "id",
},
{
"description": "Patient ID",
"path": "subject.getReferenceKey()",
"name": "patient_id",
},
{
"description": "SNOMED CT diagnosis code",
"path": "code.coding.where(system = 'http://snomed.info/sct').code",
"name": "sct_id",
},
{
"description": "Date of onset",
"path": "onsetDateTime",
"name": "onset",
},
]
}
],
where=[
{
"description": "Neoplasm of prostate",
"path": "code.coding.exists(system = 'http://snomed.info/sct'"
"and code = '126906006')",
}
],
)
prostate_cancer_diagnoses <- data %>%
ds_view(
"Condition",
select = list(
column = list(
list(
description = "Condition ID",
path = "getResourceKey()",
name = "id"
),
list(
description = "Patient ID",
path = "subject.getReferenceKey()",
name = "patient_id"
),
list(
description = "SNOMED CT diagnosis code",
path = "code.coding.where(system = 'http://snomed.info/sct').code",
name = "sct_id"
),
list(
description = "Date of onset",
path = "onsetDateTime",
name = "onset"
)
)
),
where = list(
list(
description = "Neoplasm of prostate",
path = paste0(
"code.coding.exists(system = 'http://snomed.info/sct'",
"and code = '126906006')"
)
)
)
)
- Python
- R
prostate_cancer_diagnoses.show(10, truncate=False)
prostate_cancer_diagnoses %>%
head(10) %>%
collect()
id | patient_id | sct_id | onset |
---|---|---|---|
Condition/828b84ef-2de1-520e-9932-801c7ddfc307 | Patient/2e071829-78d6-6567-b5ce-c187c6cf8d9b | 126906006 | 2021-09-29T22:28:07+10:00 |
Condition/840ab28b-70c2-92f1-38dc-7700416b82f9 | Patient/ea20b941-d402-2a76-7f7c-1adee8d14f2a | 126906006 | 1992-01-21T14:02:25+11:00 |
Condition/847dfa4e-2b29-69b7-dbb4-9e5581480b6c | Patient/ee0db189-5938-08c2-880c-5c8cb1524540 | 126906006 | 2015-09-02T04:02:35+10:00 |
Condition/85586193-fd10-0d27-aeed-e9cdeb8a7c8e | Patient/11a0b1b5-ce48-19a6-c20f-5f6da2749850 | 126906006 | 2003-11-22T07:15:17+10:00 |
Condition/85949701-25ea-6ae0-51f5-fed415e3a472 | Patient/e2690804-78d4-0325-4e50-3f3940c3e220 | 126906006 | 1973-01-22T19:08:02+10:00 |
Condition/8ae7698d-f130-2ec9-b116-8a316d9f1fa1 | Patient/e2dd0d99-4f66-65d1-f82e-2cc5a494b162 | 126906006 | 1987-06-13T18:53:06+10:00 |
Condition/8b5961ce-acae-e0f6-8c1c-201f069e6aaa | Patient/4bac725c-f7de-7d84-317f-a91e89338578 | 126906006 | 1984-08-24T21:33:45+10:00 |
Condition/8ba6382c-4c55-a174-de38-15d66b66007f | Patient/572840df-38ac-8303-02b5-320e850565bc | 126906006 | 1999-11-27T20:47:40+10:00 |
Condition/8bc77d24-14f2-de4a-68e5-778dad91f30b | Patient/981164bb-bce7-9438-1bf4-ded5422f8200 | 126906006 | 1977-12-17T05:13:30+10:00 |
Condition/8c1b6c7e-dd88-5042-4e50-816e8d30ff09 | Patient/ea68f9e2-f1a1-b920-e0c3-e2671d5ba987 | 126906006 | 1991-04-06T08:05:13+10:00 |
Final view composition
Now that we have created the four subsidiary views, we can compose them together to create our final view.
First we will create a new column on the prostate cancer diagnosis view to get the first date of diagnosis (as there could be multiple).
- Python
- R
prostate_cancer_diagnoses = prostate_cancer_diagnoses.withColumn(
"latest_onset",
min(to_timestamp(prostate_cancer_diagnoses.onset)).over(
Window.partitionBy(prostate_cancer_diagnoses.patient_id)
),
)
prostate_cancer_diagnoses.show(10, truncate=False)
prostate_cancer_diagnoses <- prostate_cancer_diagnoses %>%
group_by(patient_id) %>%
mutate(latest_onset = min(as_datetime(onset))) %>%
ungroup()
prostate_cancer_diagnoses %>%
head(10) %>%
collect()
id | patient_id | sct_id | onset | latest_onset |
---|---|---|---|---|
Condition/f9d41b9a-7044-304b-ad86-8fea1854711b | Patient/00c813fb-1a8c-2310-008e-e57dab9e0698 | 126906006 | 2000-12-07T03:43:57+10:00 | 2000-12-06 17:43:57 |
Condition/16b1286f-e803-7941-f18e-000212f44e81 | Patient/00fe87cb-6692-5eb6-0770-67fda7c7f1e8 | 126906006 | 1982-02-28T08:41:54+10:00 | 1982-02-27 22:41:54 |
Condition/a2b879de-2ddc-a9a7-d231-997e19fd6cf4 | Patient/0141f80a-700b-75a7-3f7f-91c788096fa8 | 126906006 | 1990-11-11T17:10:28+11:00 | 1990-11-11 06:10:28 |
Condition/5efcf6ad-c1cd-de25-5500-3a8deac5ffa3 | Patient/01b4682f-47b4-b738-9ba1-e5f5506ae3ce | 126906006 | 2002-08-17T14:34:48+10:00 | 2002-08-17 04:34:48 |
Condition/2397de38-8cd4-aab0-81f9-b7a06aa23918 | Patient/0203a0e5-99cf-6aa5-be69-b5adb47d709d | 126906006 | 1984-01-01T17:34:39+10:00 | 1984-01-01 07:34:39 |
Condition/dd9fc0e5-57d2-58b4-9f0e-d20cd8838e45 | Patient/021f7437-04ed-40ce-9c49-5d940c8ae817 | 126906006 | 2006-06-04T08:03:25+10:00 | 2006-06-03 22:03:25 |
Condition/6124f3ab-32aa-1560-82a8-72fb7af1691c | Patient/0417532a-96bf-2dcd-33c2-63829dc5f236 | 126906006 | 1976-02-23T10:18:23+10:00 | 1976-02-23 00:18:23 |
Condition/0393fd53-1842-f9fc-e01a-ac63a5ce26be | Patient/06e0904b-8986-0c72-48bd-6313c723998a | 126906006 | 1975-07-20T16:07:42+10:00 | 1975-07-20 06:07:42 |
Condition/53513614-78f9-51a1-f2cc-6cf49b86efe7 | Patient/07bf2136-666c-7236-e9d8-442fa423d22a | 126906006 | 2018-03-05T11:57:53+10:00 | 2018-03-05 01:57:53 |
Condition/d5ec5877-3b01-702f-41d8-5eb37c9d6c48 | Patient/07cdcd2f-d848-f050-2022-a2b3184d5df6 | 126906006 | 2004-03-31T09:51:46+10:00 | 2004-03-30 23:51:46 |
Then we can join the diagnosis view to the patient demographics.
- Python
- R
patient_diagnoses = patients.join(
prostate_cancer_diagnoses,
patients.id == prostate_cancer_diagnoses.patient_id,
"inner",
)
patient_diagnoses.show(10, truncate=False)
patient_diagnoses <- patients %>%
inner_join(
prostate_cancer_diagnoses,
by = c("id" = "patient_id")
)
patient_diagnoses %>% head(10) %>% collect()
id | birth_date | postal_code | deceased | id | patient_id | sct_id | onset | latest_onset |
---|---|---|---|---|---|---|---|---|
Patient/00c813fb-1a8c-2310-008e-e57dab9e0698 | 1925-07-22 | [02188] | 2005-08-16T09:10:57+10:00 | Condition/f9d41b9a-7044-304b-ad86-8fea1854711b | Patient/00c813fb-1a8c-2310-008e-e57dab9e0698 | 126906006 | 2000-12-07T03:43:57+10:00 | 2000-12-06 17:43:57 |
Patient/00fe87cb-6692-5eb6-0770-67fda7c7f1e8 | 1921-01-01 | [01027] | 1985-05-17T13:13:54+10:00 | Condition/16b1286f-e803-7941-f18e-000212f44e81 | Patient/00fe87cb-6692-5eb6-0770-67fda7c7f1e8 | 126906006 | 1982-02-28T08:41:54+10:00 | 1982-02-27 22:41:54 |
Patient/0141f80a-700b-75a7-3f7f-91c788096fa8 | 1921-07-30 | [01938] | 1993-07-02T21:02:28+10:00 | Condition/a2b879de-2ddc-a9a7-d231-997e19fd6cf4 | Patient/0141f80a-700b-75a7-3f7f-91c788096fa8 | 126906006 | 1990-11-11T17:10:28+11:00 | 1990-11-11 06:10:28 |
Patient/01b4682f-47b4-b738-9ba1-e5f5506ae3ce | 1930-04-12 | [] | 2007-01-05T02:11:00+10:00 | Condition/5efcf6ad-c1cd-de25-5500-3a8deac5ffa3 | Patient/01b4682f-47b4-b738-9ba1-e5f5506ae3ce | 126906006 | 2002-08-17T14:34:48+10:00 | 2002-08-17 04:34:48 |
Patient/0203a0e5-99cf-6aa5-be69-b5adb47d709d | 1913-01-12 | [] | NULL | Condition/2397de38-8cd4-aab0-81f9-b7a06aa23918 | Patient/0203a0e5-99cf-6aa5-be69-b5adb47d709d | 126906006 | 1984-01-01T17:34:39+10:00 | 1984-01-01 07:34:39 |
Patient/021f7437-04ed-40ce-9c49-5d940c8ae817 | 1941-03-15 | [01752] | 2020-03-15T23:31:34+10:00 | Condition/dd9fc0e5-57d2-58b4-9f0e-d20cd8838e45 | Patient/021f7437-04ed-40ce-9c49-5d940c8ae817 | 126906006 | 2006-06-04T08:03:25+10:00 | 2006-06-03 22:03:25 |
Patient/0417532a-96bf-2dcd-33c2-63829dc5f236 | 1915-04-26 | [01938] | NULL | Condition/6124f3ab-32aa-1560-82a8-72fb7af1691c | Patient/0417532a-96bf-2dcd-33c2-63829dc5f236 | 126906006 | 1976-02-23T10:18:23+10:00 | 1976-02-23 00:18:23 |
Patient/06e0904b-8986-0c72-48bd-6313c723998a | 1912-05-11 | [] | NULL | Condition/0393fd53-1842-f9fc-e01a-ac63a5ce26be | Patient/06e0904b-8986-0c72-48bd-6313c723998a | 126906006 | 1975-07-20T16:07:42+10:00 | 1975-07-20 06:07:42 |
Patient/07bf2136-666c-7236-e9d8-442fa423d22a | 1940-10-06 | [] | 2021-12-17T20:47:00+10:00 | Condition/53513614-78f9-51a1-f2cc-6cf49b86efe7 | Patient/07bf2136-666c-7236-e9d8-442fa423d22a | 126906006 | 2018-03-05T11:57:53+10:00 | 2018-03-05 01:57:53 |
Patient/07cdcd2f-d848-f050-2022-a2b3184d5df6 | 1925-10-21 | [02536] | NULL | Condition/d5ec5877-3b01-702f-41d8-5eb37c9d6c48 | Patient/07cdcd2f-d848-f050-2022-a2b3184d5df6 | 126906006 | 2004-03-31T09:51:46+10:00 | 2004-03-30 23:51:46 |
For the hyperlipidemia view, the first thing we will do is join it to the patient details and diagnoses - but limiting the target rows to only those observed before the date of diagnosis.
- Python
- R
cholesterol = cholesterol.withColumn("timestamp",
to_timestamp(cholesterol.date))
with_cholesterol = (patient_diagnoses.join(
cholesterol,
(patient_diagnoses.patient_id == cholesterol.patient_id)
& (patient_diagnoses.latest_onset > cholesterol.timestamp),
"left_outer"
))
cholesterol <- cholesterol %>%
mutate(timestamp = as_datetime(date))
with_cholesterol <- patient_diagnoses %>%
left_join(
cholesterol,
by = c("patient_id" = "patient_id"),
sql_on = "patient_diagnoses.latest_onset > cholesterol.timestamp"
)
Then we will add new columns to the table to get the latest cholesterol observation date and value for each patient (prior to the diagnosis).
- Python
- R
window = Window.partitionBy(patient_diagnoses.patient_id).orderBy(
cholesterol.timestamp.desc())
with_cholesterol = with_cholesterol.withColumn(
"latest_cholesterol_date",
nth_value(cholesterol.timestamp, 1).over(window)
).withColumn(
"latest_cholesterol_value",
nth_value(cholesterol.value, 1).over(window)
).withColumn(
"latest_cholesterol_unit",
nth_value(cholesterol.unit, 1).over(window)
)
with_cholesterol <- with_cholesterol %>%
group_by(patient_id) %>%
arrange(desc(timestamp)) %>%
mutate(
latest_cholesterol_date = first(timestamp),
latest_cholesterol_value = first(value),
latest_cholesterol_unit = first(unit)
) %>%
ungroup()
Then we will select just the columns we need so far, ready for joining to the BMI view.
- Python
- R
with_cholesterol = with_cholesterol.select(
patients.id.alias("patient_id"),
patients.birth_date,
patients.postal_code,
patients.deceased,
prostate_cancer_diagnoses.latest_onset.alias("prostate_cancer_onset"),
with_cholesterol.latest_cholesterol_date,
with_cholesterol.latest_cholesterol_value,
with_cholesterol.latest_cholesterol_unit,
)
with_cholesterol <- with_cholesterol %>%
select(
patient_id = id,
birth_date,
postal_code,
deceased,
prostate_cancer_onset = latest_onset,
latest_cholesterol_date,
latest_cholesterol_value,
latest_cholesterol_unit
)
We'll do something similar with the BMI to the cholesterol - joining only to prior observations.
- Python
- R
bmi = bmi.withColumn("timestamp", to_timestamp(bmi.date))
with_bmi = (with_cholesterol.join(
bmi,
(with_cholesterol.patient_id == bmi.patient_id)
& (with_cholesterol.prostate_cancer_onset > bmi.timestamp),
"left_outer"
))
bmi <- bmi %>%
mutate(timestamp = as_datetime(date))
with_bmi <- with_cholesterol %>%
left_join(
bmi,
by = c("patient_id" = "patient_id"),
sql_on = "with_cholesterol.prostate_cancer_onset > bmi.timestamp"
)
And then we'll add new columns to get the latest BMI observation date and value for each patient (prior to the diagnosis).
- Python
- R
window = Window.partitionBy(with_cholesterol.patient_id).orderBy(
bmi.timestamp.desc())
with_bmi = with_bmi.withColumn(
"latest_bmi_date",
nth_value(bmi.timestamp, 1).over(window)
).withColumn(
"latest_bmi_value",
nth_value(bmi.value, 1).over(window)
).withColumn(
"latest_bmi_unit",
nth_value(bmi.unit, 1).over(window)
)
with_bmi <- with_bmi %>%
group_by(patient_id) %>%
arrange(desc(timestamp)) %>%
mutate(
latest_bmi_date = first(timestamp),
latest_bmi_value = first(value),
latest_bmi_unit = first(unit)
) %>%
ungroup()
The view can be finished off with a final selection of the desired columns.
- Python
- R
result = with_bmi.select(
with_cholesterol.patient_id,
to_date(with_bmi.birth_date).alias("birth_date"),
with_bmi.postal_code,
with_bmi.deceased,
with_bmi.prostate_cancer_onset,
with_bmi.latest_bmi_date.alias("high_bmi_observed"),
round(with_bmi.latest_bmi_value, 2).alias("bmi_value"),
with_bmi.latest_bmi_unit.alias("bmi_unit"),
with_bmi.latest_cholesterol_date.alias("hyperlipidemia_observed"),
with_bmi.latest_cholesterol_value.alias("total_cholesterol_value"),
with_bmi.latest_cholesterol_unit.alias("total_cholesterol_unit"),
)
result <- with_bmi %>%
select(
patient_id,
birth_date = as_date(birth_date),
postal_code,
deceased,
prostate_cancer_onset,
high_bmi_observed = latest_bmi_date,
bmi_value = round(latest_bmi_value, 2),
bmi_unit = latest_bmi_unit,
hyperlipidemia_observed = latest_cholesterol_date,
total_cholesterol_value = latest_cholesterol_value,
total_cholesterol_unit = latest_cholesterol_unit
)
- Python
- R
result.printSchema()
root
|-- patient_id: string (nullable = true)
|-- birth_date: date (nullable = true)
|-- postal_code: array (nullable = true)
| |-- element: string (containsNull = true)
|-- deceased: string (nullable = true)
|-- prostate_cancer_onset: timestamp (nullable = true)
|-- high_bmi_observed: timestamp (nullable = true)
|-- bmi_value: double (nullable = true)
|-- bmi_unit: string (nullable = true)
|-- hyperlipidemia_observed: timestamp (nullable = true)
|-- total_cholesterol_value: string (nullable = true)
|-- total_cholesterol_unit: string (nullable = true)
result.show(10, truncate=False)
result %>% glimpse()
Columns: 11
$ patient_id <chr>
$ birth_date <date>
$ postal_code <list>
$ deceased <chr>
$ prostate_cancer_onset <dttm>
$ high_bmi_observed <dttm>
$ bmi_value <dbl>
$ bmi_unit <chr>
$ hyperlipidemia_observed <dttm>
$ total_cholesterol_value <chr>
$ total_cholesterol_unit <chr>
result %>% head(10) %>% collect()
patient_id | birth_date | postal_code | deceased | prostate_cancer_onset | high_bmi_observed | bmi_value | bmi_unit | hyperlipidemia_observed | total_cholesterol_value | total_cholesterol_unit |
---|---|---|---|---|---|---|---|---|---|---|
Patient/00c813fb-1a8c-2310-008e-e57dab9e0698 | 1925-07-22 | [02188] | 2005-08-16T09:10:57+10:00 | 2000-12-06 17:43:57 | NULL | NULL | NULL | 1996-11-19 17:05:24 | 245.63 | mg/dL |
Patient/00fe87cb-6692-5eb6-0770-67fda7c7f1e8 | 1921-01-01 | [01027] | 1985-05-17T13:13:54+10:00 | 1982-02-27 22:41:54 | 1981-02-27 22:05:51 | 30.06 | kg/m2 | NULL | NULL | NULL |
Patient/00fe87cb-6692-5eb6-0770-67fda7c7f1e8 | 1921-01-01 | [01027] | 1985-05-17T13:13:54+10:00 | 1982-02-27 22:41:54 | 1981-02-27 22:05:51 | 30.06 | kg/m2 | NULL | NULL | NULL |
Patient/00fe87cb-6692-5eb6-0770-67fda7c7f1e8 | 1921-01-01 | [01027] | 1985-05-17T13:13:54+10:00 | 1982-02-27 22:41:54 | 1981-02-27 22:05:51 | 30.06 | kg/m2 | NULL | NULL | NULL |
Patient/00fe87cb-6692-5eb6-0770-67fda7c7f1e8 | 1921-01-01 | [01027] | 1985-05-17T13:13:54+10:00 | 1982-02-27 22:41:54 | 1981-02-27 22:05:51 | 30.06 | kg/m2 | NULL | NULL | NULL |
Patient/00fe87cb-6692-5eb6-0770-67fda7c7f1e8 | 1921-01-01 | [01027] | 1985-05-17T13:13:54+10:00 | 1982-02-27 22:41:54 | 1981-02-27 22:05:51 | 30.06 | kg/m2 | NULL | NULL | NULL |
Patient/0141f80a-700b-75a7-3f7f-91c788096fa8 | 1921-07-30 | [01938] | 1993-07-02T21:02:28+10:00 | 1990-11-11 06:10:28 | 1987-10-31 05:30:55 | 31.35 | kg/m2 | NULL | NULL | NULL |
Patient/0141f80a-700b-75a7-3f7f-91c788096fa8 | 1921-07-30 | [01938] | 1993-07-02T21:02:28+10:00 | 1990-11-11 06:10:28 | 1987-10-31 05:30:55 | 31.35 | kg/m2 | NULL | NULL | NULL |
Patient/0141f80a-700b-75a7-3f7f-91c788096fa8 | 1921-07-30 | [01938] | 1993-07-02T21:02:28+10:00 | 1990-11-11 06:10:28 | 1987-10-31 05:30:55 | 31.35 | kg/m2 | NULL | NULL | NULL |
Patient/0141f80a-700b-75a7-3f7f-91c788096fa8 | 1921-07-30 | [01938] | 1993-07-02T21:02:28+10:00 | 1990-11-11 06:10:28 | 1987-10-31 05:30:55 | 31.35 | kg/m2 | NULL | NULL | NULL |