Utilizing MLflow in JupyterLab

AI/ML Accelerator - MLflow is specifically built to track your ML experiments on the DNAnexus platform via the ML JupyterLab environment (another app in the AI/ML Accelerator package). A license is required in order to use the AI/ML Accelerator package. For more information, please contact DNAnexus Sales via sales@dnanexus.com.

JupyterLab Example: MLflow Quickstart

The title of this JupyterLab notebook in the Launcher is “MLflow Quickstart”.

Getting Started with MLflow on DNAnexus

This notebook demonstrates how to log your models to DNAnexus platform storage using MLflow, and then use a logged model to make predictions on new data.

Importing Required Libraries

This demonstration uses the scikit-learn framework on the Breast Cancer dataset. The required libraries are pre-installed in the ML JupyterLab environment, so you can import them directly without installing anything.

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
import ray
from ray.util.joblib import register_ray
from joblib import parallel_backend

Data Preparation

In this step, we use the Breast Cancer dataset provided by scikit-learn. This dataset includes features extracted from breast cancer cell nuclei obtained from biopsy samples. There are 30 numeric features such as mean radius, mean texture, and mean area, and the target variable indicates whether the tumor is malignant (0) or benign (1).

To evaluate the model’s performance, we split the dataset into training and testing sets. 80% of the data is used for training, and 20% is reserved for testing.

data = load_breast_cancer()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

Define an MLflow Experiment

To group the distinct runs of a particular project or idea, we define an Experiment that collects each iteration (run) in one place. Giving the Experiment a unique name that is relevant to what we’re working on helps with organization and reduces the searching needed to find our runs later on.

mlflow.set_experiment("My MLflow Experiment on DNAnexus") # Replace this with your own experiment name

Enable MLflow Autologging

MLflow’s autologging feature automatically logs metrics, parameters, and models during training. Here, we enable it for scikit-learn, which ensures that relevant details about the training process are captured without manual intervention.

mlflow.sklearn.autolog()
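
Autologging also accepts options that control what is captured; a minimal sketch using parameters from the standard mlflow.sklearn.autolog API (these values are the defaults, and availability can vary with the MLflow version in your environment):

# Optional knobs: keep logging the fitted model, and cap how many child runs
# are created when autologging a hyperparameter search.
mlflow.sklearn.autolog(log_models=True, max_tuning_runs=5)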

Train the model and log with MLflow

This step involves training a RandomForestClassifier, a popular ensemble learning method. The training process is encapsulated in an MLflow run to capture the details.

# Initialize Ray
ray.init(ignore_reinit_error=True)  # Ensures Ray starts and avoids reinit errors
register_ray()  # Register Ray with joblib (needed only once per session)

# Run training under the Ray joblib backend and record it as an MLflow run
with parallel_backend('ray'):
    with mlflow.start_run() as run:
        # Initialize the RandomForestClassifier with specific hyperparameters.
        # n_jobs=-1 lets joblib parallelize tree building across the Ray backend.
        model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42, n_jobs=-1)

        # Train the model using the training dataset.
        model.fit(X_train, y_train)

        # Make predictions on the test dataset.
        predictions = model.predict(X_test)
        probabilities = model.predict_proba(X_test)[:, 1]  # Probability of class 1 (benign), used as the positive class for ROC AUC

        # Calculate evaluation metrics.
        accuracy = accuracy_score(y_test, predictions) # Accuracy measures the overall correctness of the predictions.
        roc_auc = roc_auc_score(y_test, probabilities) # ROC AUC measures the model's ability to distinguish between the classes.

        # Print the details of the MLflow run for reference.
        run_id = run.info.run_id
        print(f"Run ID: {run_id}")
        print(f"Test Accuracy: {accuracy}")
        print(f"Test ROC AUC: {roc_auc}")

Register the model

Once the model is logged, we register it in the MLflow Model Registry. This allows the model to be versioned and used across different environments.

model_name = "breast_cancer_model"

# Register the model to the MLflow Model Registry
registered_model = mlflow.register_model(model_uri=f"runs:/{run_id}/model", name=model_name)

print(f"Model registered with name: {model_name}")

Load the registered model and make predictions

In this step, we load the registered model from the MLflow Model Registry and use it to make predictions on new data.

from mlflow.pyfunc import load_model

# Specify the version or stage of the model to load.
model_version = 1  # Here, we load version 1, but this can be updated based on your deployment pipeline.
loaded_model = load_model(f"models:/{model_name}/{model_version}")

# The model can also be loaded by its platform path, which can be shared with collaborators. For example:
# loaded_model = load_model("dnanexus://project-xxxx:/.mlflow/...")

# Make predictions using the loaded model.
sample_data = X_test[:5]  # To demonstrate, we take a sample of 5 test data points.
sample_predictions = loaded_model.predict(sample_data)

# Print the sample predictions for verification.
print("Sample Predictions (Malignant=1, Benign=0):", sample_predictions)

Resources

To create a support ticket for technical issues:

  1. Go to the Help header (in the same section as Projects and Tools) inside the platform.

  2. Select “Contact Support”.

  3. Fill in the Subject and Message to submit the ticket.

To view the logged experiments, runs, and registered models, open the MLflow Tracking Server GUI by selecting ‘DX MLFlow’ in the JupyterLab Launcher. See the MLflow User Guide on the Academy Page (https://academy.dnanexus.com/) for more details; this guide also appears as the ML JupyterLab Example in the Launcher.
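
Besides the GUI, the same information can be queried programmatically from the notebook; a minimal sketch using standard MLflow client APIs (the experiment name matches the one set earlier in this guide):

# Query the runs recorded under our experiment; mlflow.search_runs returns a pandas DataFrame.
runs = mlflow.search_runs(experiment_names=["My MLflow Experiment on DNAnexus"])
print(runs[["run_id", "status", "start_time"]])

# List the models registered on the tracking server and their latest versions.
from mlflow.tracking import MlflowClient
client = MlflowClient()
for rm in client.search_registered_models():
    print(rm.name, [v.version for v in rm.latest_versions])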
