# Running a Spark JupyterLab Notebook

## Use Cases for Spark JupyterLab Instances

* Using load\_cohort. It runs SQL on Spark and relies on specialized functionality that is supported only through dxdata (Python).
* Complex interactions with records or Spark must be done in Python.
* Spark JupyterLab is ideal for extracting and interacting with a dataset or cohort.
* Spark JupyterLab is NOT meant for downstream analysis.
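
The load\_cohort use case above can be sketched roughly as follows. This is a minimal illustration, not the definitive pattern: it assumes the dxdata calls `connect`, `load_dataset`, `load_cohort`, and `retrieve_fields` behave as in the DNAnexus example notebooks, and the entity name `participant`, the cohort path, and the field names are hypothetical placeholders. Check the calls against the dxdata version in your image.

```python
def retrieve_cohort_fields(dataset_id, cohort_path, field_names,
                           entity_name="participant"):
    """Return the requested fields, restricted to the rows of a saved cohort.

    All argument values are placeholders supplied by the caller; the default
    entity name "participant" is an assumption and may differ in your dataset.
    """
    # Imported lazily: dxdata is only available inside a DNAnexus
    # Spark JupyterLab session.
    import dxdata

    engine = dxdata.connect()                     # Spark-backed connection
    dataset = dxdata.load_dataset(id=dataset_id)  # dataset record ID
    cohort = dxdata.load_cohort(cohort_path)      # cohort saved in the project
    entity = dataset[entity_name]

    # cohort.sql carries the SQL that defines the cohort; passing it as
    # filter_sql limits the retrieved rows to cohort members.
    return entity.retrieve_fields(
        names=field_names,
        filter_sql=cohort.sql,
        engine=engine,
    )
```

A call might look like `retrieve_cohort_fields(dataset_id, "/your/cohort", ["eid"])`, where the path and field name are illustrative only. The result is a Spark DataFrame, which fits the extraction role described above; downstream analysis would happen elsewhere.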

## General “Recipe” for Utilizing Spark JupyterLab Notebooks

1. Create a DX JupyterLab Notebook so that it is saved automatically to the Trusted Research Environment. You can create one in either of two ways:
   1. Option 1 is from the Launcher:

<figure><img src="/files/P8LpEA6yYqIn6Bsd6kSw" alt=""><figcaption></figcaption></figure>

   b. Option 2 is from the DNAnexus tab:

<figure><img src="/files/aplT6kprUwHDaomQKMjv" alt=""><figcaption></figcaption></figure>

2. Start writing your JupyterLab Notebook. Select the kernel you are going to use (the options vary depending on the image you selected during setup).
3. Install the packages you need and save the software environment as a snapshot.

   1. Install packages

   ```
   pip install ___  # Python: replace ___ with the package name
   ```

   2. Save a snapshot of the environment
4. Start writing your code.

   1. Import packages using import (at minimum, you will need dxdata and pyspark)

   ```
   import dxdata
   import pprint
   import pyspark
   from pyspark.sql import functions as F
   ```

   b. Load the dataset with dx extract\_dataset

   ```
   dx extract_dataset dataset_id -ddd --delimiter ","
   ```

   c. Initialize Spark

   ```
   # Start a Spark context and wrap it in a SparkSession for DataFrame and SQL work
   sc = pyspark.SparkContext()
   spark = pyspark.sql.SparkSession(sc)
   ```

   d. Retrieve the data and cohorts that you are interested in

   e. Upload results back to the project space

   ```
   %%bash 
   dx upload FILE --destination /your/path/for/results
   ```
5. Save your DX JupyterLab Notebook
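
The two `dx` calls in the recipe (dumping the dataset dictionary and uploading results) can also be issued from a Python cell instead of a shell cell. The sketch below only wraps the commands shown above. The helper names are hypothetical, the comma delimiter is an assumption, and `dataset_id`, `FILE`, and the destination path remain placeholders you must replace.

```python
import subprocess


def extract_dataset_command(dataset_id, delimiter=","):
    """Build the `dx extract_dataset` call that dumps the dataset dictionary (-ddd).

    The default comma delimiter is an assumption; pass another if you need it.
    """
    return ["dx", "extract_dataset", dataset_id, "-ddd", "--delimiter", delimiter]


def upload_command(local_path, destination):
    """Build the `dx upload` call that copies a local file into project storage."""
    return ["dx", "upload", local_path, "--destination", destination]


def run(command):
    """Run a dx command and raise CalledProcessError if it exits non-zero."""
    subprocess.run(command, check=True)


# Example usage inside a session where the dx client is logged in:
# run(extract_dataset_command("dataset_id"))
# run(upload_command("FILE", "/your/path/for/results"))
```

Building each command as a list of arguments, rather than one shell string, avoids quoting problems when a path or delimiter contains spaces or special characters.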

## Opening Notebooks from Project Storage

* Notebooks can also be opened directly from project storage.

<figure><img src="/files/7H6NfgLu34UFZeV1aXK4" alt=""><figcaption></figcaption></figure>

* When you save in JupyterLab, the notebook is uploaded to the platform as a new file. This follows from the concept of immutability: an existing file is not modified in place.
* The old version of the notebook is moved to the .Notebook\_archive/ folder in the project.

