Running a Spark JupyterLab Notebook
Use Cases for Spark JupyterLab Instances
Using load_cohort: it requires running SQL on Spark and provides specialized functionality that is supported only via dxdata (Python). See the sketch after this list.
Complex interactions with records/Spark must be done via Python.
Spark JupyterLab is ideal for extracting and interacting with the dataset or cohort.
Spark JupyterLab is NOT meant for downstream analysis.
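As a quick illustration of the load_cohort use case above, the sketch below loads a saved cohort with dxdata. The cohort path is a placeholder, and the .sql attribute shown is an assumption based on common dxdata usage rather than a confirmed part of this workflow.

import dxdata

# Hypothetical path to a cohort record saved in your project
cohort = dxdata.load_cohort("/path/to/your_cohort")

# The cohort's filter is expressed as SQL that dxdata runs on Spark
# (the .sql attribute name is assumed here)
print(cohort.sql)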
General “Recipe” for Utilizing Spark JupyterLab Notebooks
Create a DX JupyterLab Notebook so that it will automatically save to the Trusted Research Environment. You can do so in either of two ways:
a. Option 1 is from the Launcher:

b. Option 2 is from the DNAnexus Tab:

Start writing your JupyterLab Notebook. Select which kernel you are going to use (options will vary depending on the image you selected during setup).
Download packages and save the software environment as a snapshot.
Download Packages
pip install ___ #python
Save the Snapshot of the environment
Start writing your code.
a. Import packages using import (at minimum, you will need dxdata and pyspark)
import dxdata
import pprint
import pyspark
from pyspark.sql import functions as F
b. Load the dataset with dx extract_dataset
dx extract_dataset dataset_id -ddd --delimiter ","
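Inside the notebook itself, the dispensed dataset can also be loaded with dxdata. A minimal sketch, assuming a record ID of the form project-xxxx:record-yyyy and a main entity named "participant" (both are placeholders):

import dxdata

# Hypothetical record ID -- replace with your dispensed dataset's ID
dataset = dxdata.load_dataset(id="project-xxxx:record-yyyy")

# Access the main entity (table) of the dataset; "participant" is illustrative
participant = dataset["participant"]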
c. Initialize Spark
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
d. Retrieve data and cohorts that you are interested in
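A minimal sketch of retrieving fields into a Spark DataFrame with dxdata, assuming the dataset and cohort were loaded as above; the field name is illustrative, and the retrieve_fields/filter_sql pattern is an assumption based on common dxdata usage:

# Connect dxdata to the running Spark cluster
engine = dxdata.connect()

participant = dataset["participant"]
df = participant.retrieve_fields(
    names=["eid"],           # illustrative field name
    filter_sql=cohort.sql,   # optional: restrict rows to the cohort (attribute assumed)
    engine=engine,
)
df.show(5)                   # df is a Spark DataFrame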
e. Upload Results back to Project Space
%%bash
dx upload FILE --destination /your/path/for/results
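The upload above assumes your results already exist as local files on the instance. A minimal sketch of writing a Spark result to a local CSV first (the file name is illustrative):

# Convert the Spark DataFrame to pandas and write a local CSV
df.toPandas().to_csv("results.csv", index=False)

The file results.csv can then be uploaded with dx upload as shown above.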
Save your DX JupyterLab Notebook
Opening Notebooks from Project Storage
Notebooks can also be opened directly from project storage.

When you save in JupyterLab, the notebook gets uploaded to the platform as a new file. This goes back to the concept of immutability.
The old version of the notebook goes into the .Notebook_archive/ folder in the project.