Getting Started with ML JupyterLab

ML JupyterLab is an app in the AI/ML Accelerator package. A license is required to use the AI/ML Accelerator package. For more information, please contact DNAnexus Sales via [email protected].

This example demonstrates the use of ML JupyterLab for hyperparameter tuning, using a proteomics dataset derived from 68 COVID-19 patients. The data comes from the study by Feyaerts et al. (2022).

Import libraries from ML-ready environments

The Python environment of ML JupyterLab has state-of-the-art ML libraries preinstalled, so you don't have to install them yourself.

# Data handling
import numpy as np
import pandas as pd

# Model and hyperparameter search
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Distributed execution on the Ray cluster
import joblib
from ray.util.joblib import register_ray

# Experiment tracking
import mlflow
import mlflow.sklearn

Load data from DNAnexus with fsspec-dnanexus

If your data is located on DNAnexus, it can be loaded using one of the following URI formats:

  • dnanexus://<PROJECT-ID>:/path/to/your/data

  • dnanexus://<PROJECT-ID>:<FILE-ID>

Behind the scenes, ML JupyterLab uses fsspec-dnanexus to retrieve data via APIs provided by dxpy. Both packages are developed by DNAnexus.

Using DNAnexus URIs instead of physical paths makes your .ipynb file much more portable. As long as your colleagues have permission to read the data, they can use your .ipynb file immediately.
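As a minimal sketch of the URI pattern, the project ID and file path below are hypothetical placeholders; the actual read call is shown but not executed, since it requires fsspec-dnanexus and a logged-in dxpy session:

```python
import pandas as pd

# Hypothetical project ID and path -- replace with your own values.
project_id = "project-XXXXXXXXXXXXXXXXXXXXXXXX"
uri = f"dnanexus://{project_id}:/data/proteomics.csv"

# Requires fsspec-dnanexus and platform credentials, so shown unexecuted:
# df = pd.read_csv(uri)
```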

QC your data with Data Profiler

Once the data is successfully retrieved, you can perform data quality control (QC) using the dxprofiler package. This tool, developed by DNAnexus, provides an interactive dashboard that enables efficient and comprehensive QC.

Helper functions

This section is for preparing the data frames for Data Profiler.

Create a DXProfile

Launch the GUI

Once the processing is finished, you can launch the Data Profiler GUI to assess the data. (Run the code below to load the illustrated images.)

In this screen, we can see that the expression and sample tables are connected by the sampleID column. The Venn diagram indicates 68 samples shared between these tables, which tells us there are no orphan IDs in the data (i.e., no sample ID that appears in only one table).

In this second screen, we look more specifically at the Mild_ModvsSevere column of the sample table. There are 43 mild and 25 severe cases, with no missing values. It looks like we are good to move forward.

Command to open the interactive Data Profiler GUI.

As you can see from this quick showcase, using dxprofiler is a neat way to understand your dataset. The screens above are just a tiny fraction of what this package can do. If you are interested, please learn more at the Data Profiler Documentation.

Hyperparameter tuning (on Ray cluster)

We will run hyperparameter tuning on a Support Vector Classifier (SVC) model with a Radial Basis Function (RBF) kernel.

Firstly, let's define our search space and model.
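A sketch of what that definition might look like; the parameter ranges here are illustrative assumptions, not the tuned values from the original notebook:

```python
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Hypothetical search space -- adjust the ranges to your data.
param_distributions = {
    "C": loguniform(1e-2, 1e3),      # regularization strength
    "gamma": loguniform(1e-4, 1e1),  # RBF kernel width
}

model = SVC(kernel="rbf", probability=True, random_state=42)
search = RandomizedSearchCV(
    model,
    param_distributions,
    n_iter=50,
    scoring="roc_auc",
    cv=5,
    random_state=42,
    n_jobs=-1,
)
```

Log-uniform sampling is a common choice for C and gamma because reasonable values span several orders of magnitude.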

Before starting the hyperparameter tuning step, let's set up an MLflow Experiment for model logging later.

At this stage, you can start running with:
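A self-contained stand-in for that step, using a small synthetic matrix in place of the real proteomics data (which would come from the loaded expression and sample tables):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the 68-sample proteomics matrix.
X, y = make_classification(n_samples=68, n_features=20, random_state=42)

search = RandomizedSearchCV(
    SVC(kernel="rbf"),
    {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]},
    n_iter=5,
    cv=3,
    random_state=42,
)
search.fit(X, y)
print(search.best_params_)
```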

That is the standard way to do hyperparameter tuning. However, ML JupyterLab is deployed on a Ray cluster. This architecture can speed up your script several times over, depending on the number of nodes. To leverage the computing power of ML JupyterLab, simply put your script in a Ray context.
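A sketch of wrapping the search in a Ray joblib backend. The data here is synthetic, and the block falls back to joblib's default local backend when Ray is not installed, so it also runs outside ML JupyterLab:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Fall back to joblib's local "loky" backend when Ray is absent.
try:
    from ray.util.joblib import register_ray
    register_ray()
    backend = "ray"
except ImportError:
    backend = "loky"

X, y = make_classification(n_samples=68, n_features=20, random_state=42)
search = RandomizedSearchCV(
    SVC(kernel="rbf"),
    {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
    n_iter=4, cv=3, random_state=42, n_jobs=-1,
)

# Inside this context, joblib distributes the CV fits across the backend.
with joblib.parallel_backend(backend):
    search.fit(X, y)
```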

Additionally, to log the best model and its parameters, let's start the MLflow run first.
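One way that logging might look, as a hedged sketch: the experiment name "COVID Severity" is taken from this example, the data is synthetic, and the block skips tracking gracefully when mlflow is unavailable:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=68, n_features=20, random_state=42)
search = RandomizedSearchCV(
    SVC(kernel="rbf"),
    {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
    n_iter=4, cv=3, random_state=42,
)

try:
    import mlflow
    import mlflow.sklearn

    mlflow.set_experiment("COVID Severity")
    with mlflow.start_run():
        search.fit(X, y)
        # Record the winning configuration and model for later retrieval.
        mlflow.log_params(search.best_params_)
        mlflow.log_metric("best_cv_score", search.best_score_)
        mlflow.sklearn.log_model(search.best_estimator_, "model")
except ImportError:
    search.fit(X, y)  # tracking skipped when mlflow is unavailable
```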

The run has been logged to the MLflow Tracking Server. To check it, open the DX MLFlow package on the ML JupyterLab homepage and access the COVID Severity experiment.

How much faster, exactly?

To see how much faster ML JupyterLab can handle that step, let's measure the execution time.
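A minimal way to take such a measurement, assuming synthetic data; on a toy problem this small the parallel run may not win, since scheduling overhead can dominate, so no speedup is asserted here:

```python
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=68, n_features=20, random_state=42)

def timed_search(n_jobs):
    """Run the same randomized search and return its wall-clock time."""
    search = RandomizedSearchCV(
        SVC(kernel="rbf"),
        {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]},
        n_iter=6, cv=3, random_state=42, n_jobs=n_jobs,
    )
    start = time.perf_counter()
    search.fit(X, y)
    return time.perf_counter() - start

serial = timed_search(1)
parallel = timed_search(-1)
print(f"serial: {serial:.2f}s, parallel: {parallel:.2f}s")
```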

As you can see, running with Ray in ML JupyterLab can speed up your script by at least 2 to 3 times. With the combined computational power of multiple instances, ML JupyterLab creates a much more scalable workspace for AI/ML development than a single node. For instance, on a single node you can use at most 128 cores (e.g., mem4_ssd1_x128), whereas with ML JupyterLab you can easily obtain a workspace that goes beyond 128 cores.

Plus, everything will be running inside a secure and compliant environment!

Evaluate your model

Evaluate on the training data

Evaluate with Cross-Validation Predictions

You can also evaluate the model's performance without a separate test set. Let's create an SVC model with the best parameters found in the previous steps.
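A sketch of that step with synthetic data; the C and gamma values below are hypothetical stand-ins for search.best_params_. Out-of-fold predictions from cross_val_predict score each sample with a model that never saw it during training, which is what makes this a fair evaluation without a held-out test set:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

X, y = make_classification(n_samples=68, n_features=20, random_state=42)

# Hypothetical best parameters -- substitute your own search.best_params_.
best_model = SVC(kernel="rbf", C=10, gamma=0.01, probability=True,
                 random_state=42)

# Out-of-fold predicted probabilities for the positive class.
y_proba = cross_val_predict(best_model, X, y, cv=5,
                            method="predict_proba")[:, 1]
```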

Next, let's create a ROC curve from the prediction result.
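One way to draw that curve, again on synthetic data with default SVC parameters (the real curve would use the tuned model and the proteomics matrix):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for non-interactive environments
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

X, y = make_classification(n_samples=68, n_features=20, random_state=42)
model = SVC(kernel="rbf", probability=True, random_state=42)
y_proba = cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]

# ROC curve from the out-of-fold probabilities.
fpr, tpr, _ = roc_curve(y, y_proba)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label=f"SVC (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
```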

Resources

Full Documentation

To create a support ticket if there are technical issues:

  1. Go to the Help header (same section where Projects and Tools are) inside the platform

  2. Select “Contact Support”

  3. Fill in the Subject and Message to submit a support ticket.
