Utilizing MLflow in JupyterLab

AI/ML Accelerator - MLflow is specifically built to track your ML experiments on the DNAnexus platform via the ML JupyterLab environment (another app in the AI/ML Accelerator package). A license is required in order to use the AI/ML Accelerator package. For more information, please contact DNAnexus Sales via sales@dnanexus.com.

JupyterLab Example: MLflow Quickstart

The title of this JupyterLab notebook in the Launcher is “MLflow Quickstart”.

Getting Started with MLflow on DNAnexus

This notebook demonstrates how to log your models to DNAnexus platform storage using MLflow, and then use a logged model to make predictions on new data.

Importing Required Libraries

This demonstration uses the scikit-learn framework on the Breast Cancer dataset. The required libraries are pre-installed in the ML JupyterLab environment, so you can import them directly without installing anything.

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
import ray
from ray.util.joblib import register_ray
from joblib import parallel_backend

Data Preparation

In this step, we use the Breast Cancer dataset provided by scikit-learn. This dataset includes features extracted from breast cancer cell nuclei obtained from biopsy samples. There are 30 numeric features such as mean radius, mean texture, and mean area, and the target variable indicates whether the tumor is malignant (0) or benign (1).

To evaluate the model’s performance, we split the dataset into training and testing sets. 80% of the data is used for training, and 20% is reserved for testing.

data = load_breast_cancer()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

Define an MLflow Experiment

To group the distinct runs of a particular project or idea, we define an Experiment that collects each iteration (run) in one place. Giving the Experiment a unique name that is relevant to what we’re working on helps with organization and reduces the searching needed to find our runs later on.

mlflow.set_experiment("My MLflow Experiment on DNAnexus") # Replace this with your own experiment name

Enable MLflow Autologging

MLflow’s autologging feature automatically logs metrics, parameters, and models during training. Here, we enable it for scikit-learn, which ensures that relevant details about the training process are captured without manual intervention.

mlflow.sklearn.autolog()
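
Autologging also accepts options that control what is captured; a minimal sketch using parameters from the standard mlflow.sklearn.autolog API (these values are the defaults, and availability can vary with the MLflow version in your environment):

# Optional knobs: keep logging the fitted model, and cap how many child runs
# are created when autologging a hyperparameter search.
mlflow.sklearn.autolog(log_models=True, max_tuning_runs=5)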

Train the model and log with MLflow

This step involves training a RandomForestClassifier, a popular ensemble learning method. The training process is encapsulated in an MLflow run to capture the details.

# Initialize Ray
ray.init(ignore_reinit_error=True)  # Ensures Ray starts and avoids reinit errors
register_ray()  # Register Ray with joblib (needed only once per session)

# Run training under the Ray joblib backend and record it as an MLflow run
with parallel_backend('ray'):
    with mlflow.start_run() as run:
        # Initialize the RandomForestClassifier with specific hyperparameters.
        # n_jobs=-1 lets joblib parallelize tree building across the Ray backend.
        model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42, n_jobs=-1)

        # Train the model using the training dataset.
        model.fit(X_train, y_train)

        # Make predictions on the test dataset.
        predictions = model.predict(X_test)
        probabilities = model.predict_proba(X_test)[:, 1]  # Probability of class 1 (benign), used as the positive class for ROC AUC

        # Calculate evaluation metrics.
        accuracy = accuracy_score(y_test, predictions) # Accuracy measures the overall correctness of the predictions.
        roc_auc = roc_auc_score(y_test, probabilities) # ROC AUC measures the model's ability to distinguish between the classes.

        # Print the details of the MLflow run for reference.
        run_id = run.info.run_id
        print(f"Run ID: {run_id}")
        print(f"Test Accuracy: {accuracy}")
        print(f"Test ROC AUC: {roc_auc}")

Register the model

Once the model is logged, we register it in the MLflow Model Registry. This allows the model to be versioned and used across different environments.

model_name = "breast_cancer_model"

# Register the model to the MLflow Model Registry
registered_model = mlflow.register_model(model_uri=f"runs:/{run_id}/model", name=model_name)

print(f"Model registered with name: {model_name}")

Load the registered model and make predictions

In this step, we load the registered model from the MLflow Model Registry and use it to make predictions on new data.

from mlflow.pyfunc import load_model

# Specify the version or stage of the model to load.
model_version = 1  # Here, we load version 1, but this can be updated based on your deployment pipeline.
loaded_model = load_model(f"models:/{model_name}/{model_version}")

# The model can also be loaded by its platform path, which can be shared with collaborators. For example:
# loaded_model = load_model("dnanexus://project-xxxx:/.mlflow/...")

# Make predictions using the loaded model.
sample_data = X_test[:5]  # To demonstrate, we take a sample of 5 test data points.
sample_predictions = loaded_model.predict(sample_data)

# Print the sample predictions for verification.
print("Sample Predictions (Malignant=1, Benign=0):", sample_predictions)

Resources

To create a support ticket for technical issues:

  1. Go to the Help header (in the same section as Projects and Tools) inside the platform.

  2. Select “Contact Support”.

  3. Fill in the Subject and Message to submit the ticket.

To view the logged experiments, runs, and registered models, open the MLflow Tracking Server GUI by selecting ‘DX MLFlow’ in the JupyterLab Launcher. See the MLflow User Guide on the Academy Page (https://academy.dnanexus.com/) for more details; this guide also appears as the ML JupyterLab Example in the Launcher.
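
Besides the GUI, the same information can be queried programmatically from the notebook; a minimal sketch using standard MLflow client APIs (the experiment name matches the one set earlier in this guide):

# Query the runs recorded under our experiment; mlflow.search_runs returns a pandas DataFrame.
runs = mlflow.search_runs(experiment_names=["My MLflow Experiment on DNAnexus"])
print(runs[["run_id", "status", "start_time"]])

# List the models registered on the tracking server and their latest versions.
from mlflow.tracking import MlflowClient
client = MlflowClient()
for rm in client.search_registered_models():
    print(rm.name, [v.version for v in rm.latest_versions])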
