Running a Spark JupyterLab Notebook
Use Cases for Spark JupyterLab Instances
Using load_cohort: it requires running SQL on Spark and provides specialized functionality that is supported only via dxdata (Python). See the sketch after this list.
Complex interactions with records/Spark must be done via Python.
Spark JupyterLab is ideal for extracting and interacting with the dataset or cohort.
Spark JupyterLab is NOT meant for downstream analysis.
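As a quick illustration of the load_cohort use case above, the sketch below loads a saved cohort with dxdata. The cohort path is a placeholder, and the .sql attribute shown is an assumption based on common dxdata usage rather than a confirmed part of this workflow.

import dxdata

# Hypothetical path to a cohort record saved in your project
cohort = dxdata.load_cohort("/path/to/your_cohort")

# The cohort's filter is expressed as SQL that dxdata runs on Spark
# (the .sql attribute name is assumed here)
print(cohort.sql)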
General “Recipe” for Utilizing Spark JupyterLab Notebooks
Create a DX JupyterLab Notebook so that it will automatically save to the Trusted Research Environment. You can do so in either of two ways:
a. Option 1 is from the Launcher:

b. Option 2 is from the DNAnexus Tab:

Start writing your JupyterLab Notebook. Select which kernel you are going to use (options will vary depending on the image you selected during setup).
Download packages and save the software environment as a snapshot.
Download Packages
pip install ___ #python
Save the Snapshot of the environment
Start writing your code.
a. Import packages using import (at minimum, you will need dxdata and pyspark)
import dxdata
import pprint
import pyspark
from pyspark.sql import functions as F
b. Load the dataset with dx extract_dataset
dx extract_dataset dataset_id -ddd --delimiter ","
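Inside the notebook itself, the dispensed dataset can also be loaded with dxdata. A minimal sketch, assuming a record ID of the form project-xxxx:record-yyyy and a main entity named "participant" (both are placeholders):

import dxdata

# Hypothetical record ID -- replace with your dispensed dataset's ID
dataset = dxdata.load_dataset(id="project-xxxx:record-yyyy")

# Access the main entity (table) of the dataset; "participant" is illustrative
participant = dataset["participant"]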
c. Initialize Spark
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
d. Retrieve data and cohorts that you are interested in
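A minimal sketch of retrieving fields into a Spark DataFrame with dxdata, assuming the dataset and cohort were loaded as above; the field name is illustrative, and the retrieve_fields/filter_sql pattern is an assumption based on common dxdata usage:

# Connect dxdata to the running Spark cluster
engine = dxdata.connect()

participant = dataset["participant"]
df = participant.retrieve_fields(
    names=["eid"],           # illustrative field name
    filter_sql=cohort.sql,   # optional: restrict rows to the cohort (attribute assumed)
    engine=engine,
)
df.show(5)                   # df is a Spark DataFrame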
e. Upload Results back to Project Space
%%bash
dx upload FILE --destination /your/path/for/results
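The upload above assumes your results already exist as local files on the instance. A minimal sketch of writing a Spark result to a local CSV first (the file name is illustrative):

# Convert the Spark DataFrame to pandas and write a local CSV
df.toPandas().to_csv("results.csv", index=False)

The file results.csv can then be uploaded with dx upload as shown above.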
Save your DX JupyterLab Notebook
Opening Notebooks from Project Storage
Notebooks can also be opened directly from project storage.

When you save in JupyterLab, the notebook gets uploaded to the platform as a new file. This goes back to the concept of immutability.
The old version of the notebook goes into the .Notebook_archive/ folder in the project.