Getting Started with ML JupyterLab
ML JupyterLab is an app in the AI/ML Accelerator package. A license is required to use the AI/ML Accelerator package. For more information, please contact DNAnexus Sales via [email protected].
This example demonstrates the use of ML JupyterLab for hyperparameter tuning, using a proteomics dataset derived from 68 COVID-19 patients. The data is obtained from the study by Feyaerts et al., 2022.
Import libraries from ML-ready environments
The Python environment of ML JupyterLab comes with state-of-the-art ML libraries preinstalled, so you don't have to install them yourself.
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
import pandas as pd
import joblib
from ray.util.joblib import register_ray
import mlflow
import mlflow.sklearn
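To check which versions are available in your session (a quick sanity check; the exact versions depend on the ML JupyterLab release), you can print them directly:
import sklearn
import ray

# Print the versions shipped with the current ML JupyterLab environment
print('numpy:', np.__version__)
print('pandas:', pd.__version__)
print('scikit-learn:', sklearn.__version__)
print('ray:', ray.__version__)
print('mlflow:', mlflow.__version__)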
Load data from DNAnexus with fsspec-dnanexus
If your data is located on DNAnexus, it can be loaded using either of the following URI syntaxes:
dnanexus://<PROJECT-ID>:/path/to/your/data
dnanexus://<PROJECT-ID>:<FILE-ID>
Behind the scenes, ML JupyterLab uses fsspec-dnanexus to retrieve data via APIs provided by dxpy. Both packages are developed by DNAnexus.
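Because pandas reads these URIs through fsspec, you can also open them directly, for example to peek at a file's header row before loading the whole table. A minimal sketch, reusing the project and path from the cells below:
import fsspec

# Peek at the first line of a CSV on DNAnexus without loading the whole file
with fsspec.open('dnanexus://project-Gx76KPQ0vqKYjfY8j6Q67fyz:/Data/COVID-19_severity/Proteomics.csv', 'r') as f:
    print(f.readline())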
Using DNAnexus URIs instead of physical paths makes your .ipynb file much more portable. As long as your colleagues have permission to read the data, they can use your .ipynb file immediately.
X = pd.read_csv('dnanexus://project-Gx76KPQ0vqKYjfY8j6Q67fyz:/Data/COVID-19_severity/Proteomics.csv', index_col=0)
y = pd.read_csv('dnanexus://project-Gx76KPQ0vqKYjfY8j6Q67fyz:file-GyZx3vQ0vqKq9JVxZjQ2FkB4', index_col=0) # You can also use DNAnexus file-id
y['Mild&ModVsSevere'] = y['Mild&ModVsSevere'].astype(bool) # Convert 0/1s to boolean values
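Before moving on, it is worth confirming that both frames loaded as expected. A quick sanity check:
# Quick sanity check on the loaded data frames
print(X.shape)  # samples x proteins
print(y['Mild&ModVsSevere'].value_counts())  # class balance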
QC your data with Data Profiler
Once the data is successfully retrieved, you can perform data quality control (QC) using the dxprofiler package. This tool, developed by DNAnexus, provides an interactive dashboard that enables efficient and comprehensive QC.
Helper functions
This section prepares the data frames for Data Profiler.
def normalize_column_names(df_raw: pd.DataFrame) -> pd.DataFrame:
    '''
    Normalize column names to SQL-compatible names
    '''
    df = df_raw.copy()
    # Replace special characters with underscores
    df.columns = df.columns.str.replace(r'[^\w]', '_', regex=True)
    # Prefix names that start with a digit
    df.columns = df.columns.map(lambda x: f'x{x}' if x[0].isdigit() else x)
    # Remove leading/trailing whitespace
    df.columns = df.columns.str.strip()
    # Rename columns that collide with reserved SQL keywords
    reserved_keywords = {"select", "from", "where", "table"}  # Add your DB's reserved words here
    df.columns = [f"{col}_col" if col.lower() in reserved_keywords else col for col in df.columns]
    return df
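To illustrate what the helper does, here is its effect on a few hypothetical messy column names (the names below are made up for this example):
# Hypothetical messy column names, for illustration only
demo = pd.DataFrame(columns=['IL-6 (pg/mL)', '2B4', 'select'])
print(normalize_column_names(demo).columns.tolist())
# ['IL_6__pg_mL_', 'x2B4', 'select_col']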
def include_index_column(df_raw: pd.DataFrame) -> pd.DataFrame:
    '''
    Turn the index into a regular column
    '''
    df = df_raw.copy()
    return df.reset_index()
def prepare_for_dxprofiler(df_raw: pd.DataFrame) -> pd.DataFrame:
    '''
    Make a data frame Data Profiler-friendly
    '''
    df = df_raw.copy()
    df = include_index_column(df)
    df = normalize_column_names(df)
    return df
Create a DXProfile
import dxprofiler
profile = dxprofiler.profile_dfs({'expression': prepare_for_dxprofiler(X), 'sample': prepare_for_dxprofiler(y)})
Launch the GUI
Once the processing is finished, you can launch the Data Profiler GUI to assess the data. (Run the code below to load the illustrative screenshots.)
from IPython.display import Image
Image(filename='/home/dnanexus/notebook_examples/images/101-sklearn_on_ray-screen1.png')
In this screen, we can see that the expression and sample tables are connected by the sampleID column. The Venn diagram indicates 68 samples shared between these tables, which tells us there are no orphan IDs in the data (i.e., no sample ID appears in only one table).
Image(filename='/home/dnanexus/notebook_examples/images/101-sklearn_on_ray-screen2.png')
In this second screen, we look more specifically at the Mild_ModVsSevere column of the sample table. There are 43 mild/moderate and 25 severe cases, with no missing values, so we are good to move forward.
Run the following command to open the interactive Data Profiler GUI:
profile.visualize()
As this quick showcase demonstrates, dxprofiler is a neat way to understand your dataset. The screens above are just a tiny fraction of what the package can do. To learn more, see the Data Profiler documentation.
Hyperparameter tuning (on Ray cluster)
We will run hyperparameter tuning on a Support Vector Classifier (SVC) model with a Radial Basis Function (RBF) kernel.
First, let's define our search space and model.
param_space = {
    'C': np.logspace(-6, 6, 30),
    'gamma': np.logspace(-8, 8, 30),
    'tol': np.logspace(-4, -1, 30),
    'class_weight': [None, 'balanced'],
}
model = SVC(kernel='rbf', probability=True)
search = RandomizedSearchCV(model, param_space, cv=5, n_iter=300, verbose=0)
# Convert y to a 1-D array of 0s and 1s for the classifier
y = y['Mild&ModVsSevere'].astype(int).values.ravel()
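For context on the search budget: the space above contains 30 × 30 × 30 × 2 = 54,000 parameter combinations, of which RandomizedSearchCV samples n_iter=300, each evaluated with 5-fold cross-validation (1,500 model fits in total). A quick check:
# Size of the full search space vs. the randomized budget
n_combos = (len(param_space['C']) * len(param_space['gamma'])
            * len(param_space['tol']) * len(param_space['class_weight']))
print(n_combos)                   # 54000 combinations
print(search.n_iter * search.cv)  # 1500 model fits during the search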
Before starting the hyperparameter tuning step, let's set up an MLflow experiment so we can log the model later.
mlflow.set_experiment("COVID Severity")
At this stage, you can start running with:
search.fit(X, y)
That is the standard way to do hyperparameter tuning. However, ML JupyterLab is deployed on a Ray cluster, an architecture that can speed up your script several times over, depending on the number of nodes. To leverage the computing power of ML JupyterLab, simply wrap your code in the Ray joblib backend context.
In addition, to log the best model and its parameters, let's start an MLflow run first.
with mlflow.start_run() as run:
    register_ray()  # Only needed once per session
    with joblib.parallel_backend('ray'):
        search.fit(X, y)
    # Log best parameters
    mlflow.log_params(search.best_params_)
    # Log best model
    mlflow.sklearn.log_model(search.best_estimator_, "SVM_Model")
    # Log best score
    mlflow.log_metric("best_score", search.best_score_)
The run has been logged to the MLflow Tracking Server. To check it, open the DX MLflow app from the ML JupyterLab homepage and access the COVID Severity experiment.
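If you want to reuse the logged model later, e.g. in another notebook, you can load it back by run ID. A minimal sketch, assuming the run object from the cell above is still in scope:
# Load the logged model back from the MLflow Tracking Server
loaded_model = mlflow.sklearn.load_model(f"runs:/{run.info.run_id}/SVM_Model")
print(loaded_model.predict(X.iloc[:5]))  # predictions for the first five samples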
How much faster, exactly?
To see how much faster ML JupyterLab can handle that step, let's measure the execution time.
import time
from functools import wraps
def track_execution_time(func):
    '''A decorator to track execution time'''
    @wraps(func)
    def wrapper(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()
        execution_time = end_time - start_time
        msg = f"{func.__name__!r} finishes in {execution_time:.4f} seconds"
        print(msg)
        return result
    return wrapper
@track_execution_time
def with_ray():
    with joblib.parallel_backend('ray'):
        search.fit(X, y)

@track_execution_time
def without_ray():
    search.fit(X, y)
with_ray()
without_ray()
As you can see, running with Ray in ML JupyterLab can speed up your script at least 2 to 3 times. With the combined computational power of multiple instances, ML JupyterLab creates a much more scalable workspace for AI/ML development than a single node: a single node offers at most 128 cores (i.e. mem4_ssd1_x128), whereas ML JupyterLab can easily provide a workspace that goes beyond 128 cores.
Plus, everything runs inside a secure and compliant environment!
Evaluate your model
Evaluate on the training data
from sklearn.metrics import classification_report
y_train_pred = search.best_estimator_.predict(X)
print(classification_report(y, y_train_pred))
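Keep in mind that metrics computed on the training data are optimistic, since the model has already seen every sample. For a complementary view of the same predictions, a quick confusion-matrix sketch:
from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y, y_train_pred))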
Evaluate with Cross-Validation Predictions
You can also evaluate the model's performance without a separate test set by using cross-validated predictions. Here, cross_val_predict refits a fresh clone of the best estimator, an SVC with the best parameters found in the previous step, on each training fold and predicts probabilities on the corresponding held-out fold.
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import cross_val_predict
y_scores = cross_val_predict(search.best_estimator_, X, y, cv=5, method='predict_proba')[:, 1]
Next, let's create a ROC curve from the prediction result.
import matplotlib.pyplot as plt
fpr, tpr, thresholds = roc_curve(y, y_scores)
roc_auc = roc_auc_score(y, y_scores)
# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f"ROC curve (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], 'k--', label='Random Guess')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='best')
plt.show()
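The arrays returned by roc_curve can also help you choose an operating threshold. A minimal sketch using Youden's J statistic (TPR minus FPR), which picks the point farthest above the diagonal:
# Pick the threshold that maximizes Youden's J statistic (tpr - fpr)
best_idx = np.argmax(tpr - fpr)
print(f"Threshold: {thresholds[best_idx]:.4f} "
      f"(TPR = {tpr[best_idx]:.2f}, FPR = {fpr[best_idx]:.2f})")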
Resources
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select “Contact Support”
Fill in the Subject and Message to submit a support ticket.