Running a Spark JupyterLab Notebook

Use Cases for Spark JupyterLab Instances

  • Using load_cohort, which runs SQL on Spark and provides specialized functionality that is supported only through dxdata (Python).

  • Complex interactions with records/Spark must be done via Python.

  • Spark JupyterLab is ideal for extracting and interacting with the dataset or cohort.

  • Spark JupyterLab is NOT meant for downstream analysis.
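
As an illustration of the first use case, the following is a minimal sketch of pulling a few fields for a saved cohort through dxdata. The function name retrieve_cohort_fields, its parameters, and the "participant" entity name are hypothetical; the dxdata calls (connect, load_dataset, load_cohort, retrieve_fields with filter_sql) are written as they are commonly used in DNAnexus example notebooks and should be checked against the dxdata version in your image.

```python
def retrieve_cohort_fields(dataset_id, cohort_path, field_names, entity="participant"):
    """Return the requested fields for the members of a saved cohort.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    # dxdata is provided by the Spark JupyterLab image, so it is imported lazily
    # to keep this function definable outside such a session.
    import dxdata

    engine = dxdata.connect()                     # connection used to run SQL on Spark
    dataset = dxdata.load_dataset(id=dataset_id)  # e.g. the dataset's record ID
    cohort = dxdata.load_cohort(cohort_path)      # cohort saved in the project

    # Assumption: cohort.sql holds the SQL that selects the cohort's members,
    # and retrieve_fields accepts it through filter_sql.
    return dataset[entity].retrieve_fields(
        names=field_names,
        filter_sql=cohort.sql,
        engine=engine,
    )
```

The result is expected to be a Spark DataFrame, which fits the guidance above: extract the cohort here, then carry out downstream analysis elsewhere.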

General “Recipe” for Utilizing Spark JupyterLab Notebooks

  1. Create a DX JupyterLab Notebook so that it is saved automatically to the Trusted Research Environment. You can create one in either of two ways:

    a. Option 1: from the Launcher.

    b. Option 2: from the DNAnexus tab.

  1. Start writing your JupyterLab Notebook. Select the kernel you are going to use (the available kernels depend on the image you selected during setup).

  2. Download packages and save the software environment as a snapshot.

    a. Download packages

    pip install ___ #python

    b. Save a snapshot of the environment

  3. Start writing your code.

    a. Import packages using import (at minimum, you will need dxdata and pyspark)

    import dxdata
    import pprint
    import pyspark
    from pyspark.sql import functions as F

    b. Load the dataset with dx extract_dataset

    dx extract_dataset dataset_id -ddd --delimiter 

    c. Initialize Spark

    sc = pyspark.SparkContext()
    spark = pyspark.sql.SparkSession(sc)

    d. Retrieve data and cohorts that you are interested in

    e. Upload results back to the project space

    %%bash 
    dx upload FILE --destination /your/path/for/results

  4. Save your DX JupyterLab Notebook
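
Once dx extract_dataset has written the dictionary files in step b, it can help to search the data dictionary for the fields you want before retrieving them in step d. The sketch below uses only the Python standard library. The function name find_fields is hypothetical, and the column names "entity", "name", and "title" are assumptions about the data dictionary layout; check the header of your own file before relying on them.

```python
import csv


def find_fields(dictionary_csv, keyword, delimiter=","):
    """Return "entity.name" for every field whose title mentions `keyword`.

    Hypothetical helper; assumes the file has "entity", "name" and "title" columns.
    """
    matches = []
    with open(dictionary_csv, newline="") as handle:
        # Use the same delimiter that was passed to dx extract_dataset.
        for row in csv.DictReader(handle, delimiter=delimiter):
            title = row.get("title") or ""
            if keyword.lower() in title.lower():
                matches.append(f'{row["entity"]}.{row["name"]}')
    return matches
```

For example, find_fields("my_dataset.data_dictionary.csv", "age") would return the qualified names of fields with "age" in their title; the file name here is a placeholder for whatever dx extract_dataset produced in your session.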

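The upload in step e can also be driven from a Python cell instead of a %%bash cell. This is a minimal sketch that wraps the same dx upload command shown above; the helper names build_upload_command and upload_result are hypothetical, and the sketch assumes the dx command-line client is on the PATH, as it is expected to be inside a DX JupyterLab session.

```python
import shutil
import subprocess


def build_upload_command(local_file, destination):
    """Build the argument list for: dx upload FILE --destination PATH."""
    return ["dx", "upload", local_file, "--destination", destination]


def upload_result(local_file, destination):
    """Upload one result file to project storage (hypothetical helper)."""
    if shutil.which("dx") is None:
        raise RuntimeError("dx command-line client not found on PATH")
    # check=True raises CalledProcessError if the upload fails.
    subprocess.run(build_upload_command(local_file, destination), check=True)
```

Keeping the command construction separate from the call makes it easy to print and inspect the exact command before anything is uploaded.
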
Opening Notebooks from Project Storage

  • Notebooks can also be opened directly from project storage.

  • When you save in JupyterLab, the notebook is uploaded to the platform as a new file. This follows from the concept of immutability: an existing file on the platform is not modified in place.

  • The previous version of the notebook is moved to the .Notebook_archive/ folder in the project.
