Tahoe-100M

The user is responsible for reviewing and complying with the license requirements of the software, notebooks, and data referenced in this documentation.

Users are responsible for the costs associated with analyzing the Tahoe-100M dataset and storing it in their project spaces.

Instance type availability and pricing are subject to the contract between the user or the user’s organization and DNAnexus.

Introduction

Here, we introduce the Tahoe-100M dataset and three Jupyter notebooks using this dataset on the DNAnexus platform. The notebooks demonstrate how to handle large single-cell data, using out-of-core and GPU-accelerated computing strategies. Workflows covered include data preprocessing, PCA computation, UMAP visualization, and deep learning-based cell line classification. Both CPU and GPU compute environments are supported.
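To make "out-of-core" concrete, here is a minimal pure-Python sketch of the general idea. It is not the Scanpy or RAPIDS SingleCell implementation used in the notebooks. It accumulates per-feature sums and cross-products one chunk at a time, so only a single chunk needs to be in memory, and then estimates the leading principal component by power iteration. The function names and the toy data are hypothetical and chosen only for illustration.

```python
import math
from typing import Iterable, List, Sequence

Chunk = Sequence[Sequence[float]]  # rows = cells, columns = features


def streaming_covariance(chunks: Iterable[Chunk]) -> List[List[float]]:
    """Return the sample covariance matrix, reading one chunk at a time."""
    n = 0
    sums: List[float] = []
    gram: List[List[float]] = []  # running X^T X
    for chunk in chunks:
        for row in chunk:
            if not sums:  # size the accumulators on the first row
                d = len(row)
                sums = [0.0] * d
                gram = [[0.0] * d for _ in range(d)]
            n += 1
            for i, xi in enumerate(row):
                sums[i] += xi
                for j, xj in enumerate(row):
                    gram[i][j] += xi * xj
    if n < 2:
        raise ValueError("need at least two rows to estimate covariance")
    d = len(sums)
    mean = [s / n for s in sums]
    # cov = (X^T X - n * mean * mean^T) / (n - 1)
    return [
        [(gram[i][j] - n * mean[i] * mean[j]) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def leading_component(cov: List[List[float]], iterations: int = 200) -> List[float]:
    """Estimate the first principal component by power iteration.

    Assumes the uniform start vector is not orthogonal to the true component.
    """
    d = len(cov)
    vec = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            break
        vec = [v / norm for v in nxt]
    return vec


# Toy data: two chunks of points lying on the line y = 2x.
toy_chunks = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
toy_cov = streaming_covariance(toy_chunks)
toy_pc1 = leading_component(toy_cov)
```

At the scale of Tahoe-100M, the same pattern would be applied with array libraries and chunked on-disk formats rather than Python lists. The point of the sketch is only that the accumulators have a fixed size regardless of the number of cells.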

All notebooks are built to run in the JupyterLab with Python, R, Stata, ML, Image Processing app on DNAnexus, using a pre-configured snapshot that eliminates the need for manual package installation.

List of notebooks

| Notebook | Purpose | Compute |
| --- | --- | --- |
| Tahoe_pca_tutorial_cpu_dxjupyterlab-2026-04-14.ipynb | Out-of-core PCA computation and UMAP visualization using Scanpy | CPU |
| Tahoe_pca_tutorial_gpu_dxjupyterlab-2026-04-14.ipynb | Out-of-core PCA computation and UMAP visualization using RAPIDS SingleCell | GPU |
| Tahoe_scDataset_v3_tutorial_gpu-2026-04-14.ipynb | Streaming data loading with scDataset and PyTorch linear classifier training | GPU |
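Streaming loaders for large on-disk datasets commonly read contiguous blocks of rows, which is cheap on disk, and shuffle the order of the blocks to add randomness. The sketch below illustrates that general block-shuffling idea with the Python standard library. It does not use or reproduce the scDataset API, and the function name and parameters are hypothetical.

```python
import random
from typing import Iterator, List


def block_shuffled_batches(
    n_rows: int, block_size: int, batch_size: int, seed: int = 0
) -> Iterator[List[int]]:
    """Yield minibatches of row indices, visiting contiguous blocks in random order."""
    if n_rows < 0 or block_size <= 0 or batch_size <= 0:
        raise ValueError("n_rows must be >= 0; block_size and batch_size must be > 0")
    rng = random.Random(seed)
    block_starts = list(range(0, n_rows, block_size))
    rng.shuffle(block_starts)  # shuffle the blocks, not the individual rows
    batch: List[int] = []
    for start in block_starts:
        # Rows inside a block stay contiguous, so each block is one sequential read.
        for idx in range(start, min(start + block_size, n_rows)):
            batch.append(idx)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # final partial batch
        yield batch


example_batches = list(block_shuffled_batches(n_rows=10, block_size=4, batch_size=3, seed=1))
```

Each row index appears exactly once per pass. Smaller blocks give better mixing at the cost of more random reads.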

Citations for Tahoe-100M Dataset

The paper titled Tahoe-100M: A Giga-Scale Single-Cell Perturbation Atlas for Context-Dependent Gene Function and Cellular Modeling describes how the Tahoe-100M dataset was curated. It is currently published on bioRxiv.

Tahoe-100M is hosted as part of Arc Institute’s Virtual Cell Atlas.

The instructions for Arc Institute’s official version of the dataset are hosted on their GitHub.

Overview of the Tahoe-100M Dataset

The Tahoe-100M dataset was generated using Tahoe’s Mosaic Platform in partnership with Parse Biosciences and Ultima Genomics. This dataset of 100 million single cells was curated to accelerate discovery by modeling gene-drug and gene-gene interactions at the single-cell level, training AI/ML models grounded in single-cell biology, mapping drug responses across cell types and states, and benchmarking and validating models with confidence thanks to the dataset's size. These use cases are further illustrated in the figure below:

The Tahoe-100M dataset is now available as part of the Arc Institute’s Virtual Cell Atlas, which is openly accessible for scientific use. DNAnexus has already loaded this data onto the platform, so you can use it without downloading or setting it up yourself. See the “Where to Access Tahoe-100M” section below to start accessing the dataset.

Where to Access Tahoe-100M

The following files are available for the Tahoe-100M dataset:

To use the dataset and notebooks, please copy them into your own project space. Details on how to copy the data are in the section titled "Copying Data and Notebooks into a Project".

Running Analyses on Tahoe-100M

Copying Data and Notebooks into a Project

To utilize the dataset, please copy the data from the projects listed above into your own project.

Here are the steps to copy the Tahoe-100M data into a Project Space:

  1. Create a project for your Tahoe-100M dataset, billed to your own organization. Tutorials on how to set up a project can be found on this page.

  2. Go to the Resources tab and find the project titled “Public Datasets Region”. Select the project that matches your region, then select the folder "Tahoe-100M".

  3. Select the data folder and the notebooks.

  4. Select "Copy" on the top right menu, and select the project that you created in Step 1.

  5. Then, go to the project space you created in Step 1 to start exploring the Tahoe-100M dataset and notebooks.

  6. To run the JupyterLab Notebooks, please see the JupyterLab section of the Academy Documentation.
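For users who prefer the command line, the copy step can also be sketched with the dx-toolkit `dx cp` command, wrapped here in Python. This is a sketch under assumptions, not an official procedure: it assumes dx-toolkit is installed and you are logged in, the project IDs `project-xxxx` and `project-yyyy` are placeholders, and the folder path `/Tahoe-100M` assumes the folder sits at the top level of the public dataset project. The helper names are hypothetical.

```python
import shlex
import subprocess
from typing import List


def build_dx_cp_command(
    source_project: str, source_folder: str, dest_project: str, dest_folder: str = "/"
) -> List[str]:
    """Build a `dx cp` command that copies a folder from one project into another."""
    source = f"{source_project}:{source_folder}"
    destination = f"{dest_project}:{dest_folder}"
    return ["dx", "cp", source, destination]


def copy_folder(
    source_project: str,
    source_folder: str,
    dest_project: str,
    dest_folder: str = "/",
    dry_run: bool = True,
) -> List[str]:
    """Print the command (dry run, the default) or execute it with dx-toolkit."""
    cmd = build_dx_cp_command(source_project, source_folder, dest_project, dest_folder)
    if dry_run:
        print(shlex.join(cmd))
    else:
        subprocess.run(cmd, check=True)  # requires dx-toolkit and an active login
    return cmd


# Placeholder IDs: replace with the public dataset project for your region
# and the project you created in Step 1.
copy_folder("project-xxxx", "/Tahoe-100M", "project-yyyy")
```

The dry-run default only prints the command, so you can check the source and destination before any data is copied and storage costs are incurred.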

Running the Notebooks

  • The notebooks are optimized for the JupyterLab with Python, R, Stata, ML, Image Processing app (v2.11.0). If you do not have access, please contact the Success Team at [email protected] or the Sales Team at [email protected].

  • Load the snapshot when you launch JupyterLab: snapshot-single_cell-dxjupyterlab-2026-04-08.tar.gz in the Notebook_snapshot folder.

  • Please follow Introduction to JupyterLab to learn how to load a snapshot and launch JupyterLab.

Instance Type Selection and Kernel Selection

  • Instance start times depend on queue length. Less common instance types may have longer wait times because of their limited availability.

  • GPU instances take longer to start than CPU instances because they are less widely available and more complex to set up.

  • Instance type availability and pricing are subject to the contract between the user or the user’s organization and DNAnexus.

| Notebook | Instance type | Kernel |
| --- | --- | --- |
| Tahoe_pca_tutorial_cpu_dxjupyterlab-2026-04-14.ipynb | mem2_ssd1_v2_x16 | Python 3.12 CPU |
| Tahoe_pca_tutorial_gpu_dxjupyterlab-2026-04-14.ipynb | mem2_ssd1_gpu_x48 | Python 3.12 GPU |
| Tahoe_scDataset_v3_tutorial_gpu-2026-04-14.ipynb | mem2_ssd1_gpu_x16 | Python 3.12 GPU |
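If you script your launches, the recommendations in the table can be kept as a small lookup. The sketch below only restates the table values. The `NotebookConfig` type and `config_for` helper are hypothetical names, not part of any DNAnexus tool.

```python
from typing import Dict, NamedTuple


class NotebookConfig(NamedTuple):
    instance_type: str
    kernel: str


# Values copied from the instance type and kernel table in this section.
NOTEBOOK_CONFIGS: Dict[str, NotebookConfig] = {
    "Tahoe_pca_tutorial_cpu_dxjupyterlab-2026-04-14.ipynb": NotebookConfig(
        "mem2_ssd1_v2_x16", "Python 3.12 CPU"
    ),
    "Tahoe_pca_tutorial_gpu_dxjupyterlab-2026-04-14.ipynb": NotebookConfig(
        "mem2_ssd1_gpu_x48", "Python 3.12 GPU"
    ),
    "Tahoe_scDataset_v3_tutorial_gpu-2026-04-14.ipynb": NotebookConfig(
        "mem2_ssd1_gpu_x16", "Python 3.12 GPU"
    ),
}


def config_for(notebook: str) -> NotebookConfig:
    """Return the recommended instance type and kernel for a notebook file name."""
    try:
        return NOTEBOOK_CONFIGS[notebook]
    except KeyError:
        raise ValueError(f"no recommended configuration for {notebook!r}") from None
```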

  • Before running a notebook, please run the command-line instructions provided in that notebook in a JupyterLab terminal.

  • In the snapshot snapshot-single_cell-dxjupyterlab-2026-04-08.tar.gz, two conda environments are pre-configured: one for CPU and one for GPU. Please select the appropriate kernel for each notebook according to the table above.

  • A note on quality control filtering:

      • We removed three low-quality cell lines, leaving 47 cell lines.

      • We applied the “full” filtering option for quality control. The Tahoe dataset provides two QC levels: “minimal” and “full”. The “full” level applies stricter filtering.
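The two quality control choices above amount to a filter on per-cell metadata. The sketch below shows one way to express that filter on plain Python records. The field names `pass_filter` and `cell_line` and the cell-line labels are assumptions for illustration. This documentation does not name the three removed cell lines, so none are listed here.

```python
from typing import Dict, Iterable, List, Set


def apply_qc(
    cells: Iterable[Dict[str, str]],
    excluded_cell_lines: Set[str],
    qc_level: str = "full",
) -> List[Dict[str, str]]:
    """Keep cells that carry the requested QC label and are not from an excluded cell line."""
    return [
        cell
        for cell in cells
        if cell["pass_filter"] == qc_level  # assumed metadata field
        and cell["cell_line"] not in excluded_cell_lines  # assumed metadata field
    ]


# Toy records with placeholder cell-line labels.
toy_cells = [
    {"cell_id": "c1", "cell_line": "LINE_A", "pass_filter": "full"},
    {"cell_id": "c2", "cell_line": "LINE_B", "pass_filter": "full"},
    {"cell_id": "c3", "cell_line": "LINE_A", "pass_filter": "minimal"},
]
kept_cells = apply_qc(toy_cells, excluded_cell_lines={"LINE_B"})
```

In this toy example only `c1` is kept: `c2` belongs to an excluded cell line and `c3` passes only the “minimal” level.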

Video: Setting Up the Tahoe-100M Dataset Analysis on the DNAnexus Platform
