Tahoe-100M
Necessary Disclaimers and Legal
The user is responsible for reviewing and complying with the license requirements of the software, notebooks, and data referenced in this documentation.
Users are responsible for the costs of analyzing the Tahoe-100M dataset and of storing it in their project spaces.
Instance type availability and pricing are subject to the contract between the user or the user’s organization and DNAnexus.
Introduction
This page introduces the Tahoe-100M dataset and three Jupyter notebooks that analyze it on the DNAnexus platform. The notebooks demonstrate how to handle large single-cell data using out-of-core and GPU-accelerated computing strategies. The workflows cover data preprocessing, PCA computation, UMAP visualization, and deep learning-based cell line classification. Both CPU and GPU compute environments are supported.
All notebooks are built to run within the JupyterLab with Python, R, Stata, ML, Image Processing app on DNAnexus, using a pre-configured snapshot that eliminates the need for manual package installation.
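To make the out-of-core idea concrete, here is a minimal sketch. It is not the method used in the notebooks, which rely on Scanpy or RAPIDS SingleCell. It assumes synthetic two-feature data and uses only the Python standard library. Sums and cross-products are accumulated one chunk at a time, so the full matrix never has to be held in memory, and the first principal axis is then estimated by power iteration. The names iter_chunks, streaming_covariance, and leading_component are illustrative and do not come from any library.

```python
import math
import random


def iter_chunks(n_rows, chunk_size, seed=0):
    """Yield synthetic chunks of 2-feature rows; stands in for reading slices from disk."""
    rng = random.Random(seed)
    for start in range(0, n_rows, chunk_size):
        chunk = []
        for _ in range(min(chunk_size, n_rows - start)):
            t = rng.gauss(0.0, 3.0)  # strong signal along the (1, 1) direction
            e = rng.gauss(0.0, 0.3)  # weak noise along the (1, -1) direction
            chunk.append([t + e, t - e])
        yield chunk


def streaming_covariance(chunks, n_features):
    """Accumulate sums and cross-products chunk by chunk, then form the covariance."""
    n = 0
    sums = [0.0] * n_features
    cross = [[0.0] * n_features for _ in range(n_features)]
    for chunk in chunks:
        for row in chunk:
            n += 1
            for i in range(n_features):
                sums[i] += row[i]
                for j in range(n_features):
                    cross[i][j] += row[i] * row[j]
    means = [s / n for s in sums]
    return [
        [(cross[i][j] - n * means[i] * means[j]) / (n - 1) for j in range(n_features)]
        for i in range(n_features)
    ]


def leading_component(cov, n_iter=200, seed=0):
    """Power iteration for the first principal axis of a covariance matrix."""
    d = len(cov)
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]  # random start vector
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


cov = streaming_covariance(iter_chunks(n_rows=5000, chunk_size=500), n_features=2)
pc1 = leading_component(cov)
print("first principal axis:", [round(x, 3) for x in pc1])
```

Only one chunk is held in memory at any time. The accumulated state is a small features-by-features matrix, which is why this pattern scales with the number of genes rather than the number of cells.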
List of notebooks
| Notebook | Purpose | Compute |
| --- | --- | --- |
| Tahoe_pca_tutorial_cpu_dxjupyterlab-2026-04-14.ipynb | Out-of-core PCA computation and UMAP visualization using Scanpy | CPU |
| Tahoe_pca_tutorial_gpu_dxjupyterlab-2026-04-14.ipynb | Out-of-core PCA computation and UMAP visualization using RAPIDS SingleCell | GPU |
| Tahoe_scDataset_v3_tutorial_gpu-2026-04-14.ipynb | Streaming data loading with scDataset and PyTorch linear classifier training | GPU |
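The streaming-plus-classifier pattern named in the last row can be sketched without any of the actual libraries. The example below is an assumption-laden illustration and does not use scDataset or PyTorch code. A generator yields one mini-batch at a time, and a binary logistic regression is updated after each batch. The real notebook classifies many cell lines; this sketch uses two synthetic classes to stay short. All function names are hypothetical.

```python
import math
import random


def stream_batches(n_samples, batch_size, seed=0):
    """Yield (features, labels) mini-batches one at a time, as a streaming loader would."""
    rng = random.Random(seed)
    for start in range(0, n_samples, batch_size):
        xs, ys = [], []
        for _ in range(min(batch_size, n_samples - start)):
            label = rng.randint(0, 1)
            center = 2.0 if label == 1 else -2.0  # two synthetic, well-separated classes
            xs.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
            ys.append(label)
        yield xs, ys


def sigmoid(z):
    # Clamp the input so math.exp cannot overflow on extreme logits.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train(batches, n_features, lr=0.1):
    """One pass of mini-batch gradient descent for logistic regression."""
    w = [0.0] * n_features
    b = 0.0
    for xs, ys in batches:
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(n_features):
                grad_w[i] += err * x[i]
            grad_b += err
        # Average the gradient over the batch before taking a step.
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(xs)
    return w, b


def accuracy(batches, w, b):
    """Fraction of streamed samples whose predicted label matches the true label."""
    correct = total = 0
    for xs, ys in batches:
        for x, y in zip(xs, ys):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            correct += pred == y
            total += 1
    return correct / total


w, b = train(stream_batches(2000, 64, seed=0), n_features=2)
acc = accuracy(stream_batches(500, 64, seed=1), w, b)
print(f"held-out accuracy: {acc:.3f}")
```

Because the model only ever sees one batch, memory use is bounded by the batch size rather than by the dataset size.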
Citations for Tahoe-100M Dataset
The paper titled Tahoe-100M: A Giga-Scale Single-Cell Perturbation Atlas for Context-Dependent Gene Function and Cellular Modeling describes how the Tahoe-100M dataset was curated and is currently available on bioRxiv.
Tahoe-100M is hosted as part of Arc Institute’s Virtual Cell Atlas.
The instructions for Arc Institute’s official version of the dataset are hosted on their GitHub.
Overview of the Tahoe-100M Dataset
The Tahoe-100M dataset was generated using Tahoe’s Mosaic Platform in partnership with Parse Biosciences and Ultima Genomics. This dataset of 100 million single cells was curated to accelerate discovery in four ways: modeling gene-drug and gene-gene interactions at single-cell resolution, training AI/ML models grounded in single-cell biology, mapping drug responses across cell types and states, and benchmarking and validating models with the confidence that its scale provides. These use cases are further illustrated in the figure below:

The Tahoe-100M dataset is now available as part of the Arc Institute’s Virtual Cell Atlas, which is openly accessible for scientific use. DNAnexus has already downloaded this data to the platform, so you can use it without any further download or setup. See the “Where to Access Tahoe-100M” section below to start accessing the dataset.
Where to Access Tahoe-100M
The following files are available for the Tahoe-100M dataset:
The original set of files retrieved directly from Arc Institute’s GCP storage. These files are located on the platform here for AWS US East, here for AWS Europe (Frankfurt), here for AWS Europe (London), here for Azure Amsterdam, and here for Azure US (West).
The AnnData files converted to Parquet files, for users who prefer to analyze the Tahoe-100M data with big data analytics tools such as Spark. These files are located on the platform here for AWS US East, here for AWS Europe (Frankfurt), here for AWS Europe (London), here for Azure Amsterdam, and here for Azure US (West).
Notebooks to analyze the Tahoe-100M dataset, located on the platform here for AWS US East, here for AWS Europe (Frankfurt), here for AWS Europe (London), here for Azure Amsterdam, and here for Azure US (West). The notebook file names end in .ipynb.
To use the dataset and notebooks, please copy them into your own project space. Details on how to copy the data are in the section titled "Copying Data and Notebooks into a Project".
Running Analyses on Tahoe-100M
Copying Data and Notebooks into a Project
To utilize the dataset, please copy the data from the projects listed above into your own project.
Here are the steps to copy the Tahoe-100M data into a Project Space:
1. Create a project for your Tahoe-100M dataset, billed to your own organization. Tutorials on how to set up a project can be found on this page.
2. Go to the Resources tab and find the project titled “Public Datasets Region”. Select the project that matches your region, then select the folder "Tahoe-100M".
3. Select the data folder and the notebooks.
4. Select "Copy" in the top-right menu, then select the project that you created in Step 1.
5. Go to the project space you created in Step 1 to start exploring the Tahoe-100M dataset and notebooks.
To run the JupyterLab Notebooks, please see the JupyterLab section of the Academy Documentation.
Running the Notebooks
The notebooks are optimized for the JupyterLab with Python, R, Stata, ML, Image Processing app (v2.11.0). If you do not have access, please contact the Success Team at [email protected] or the Sales Team at [email protected].
When you launch JupyterLab, load the snapshot snapshot-single_cell-dxjupyterlab-2026-04-08.tar.gz from the Notebook_snapshot folder.
Please follow Introduction to JupyterLab to learn how to load a snapshot and launch JupyterLab.
Instance Type Selection and Kernel Selection
Instance start times depend on the queue for each instance type. Less common instance types may have longer wait times because of their limited availability.
GPU instances take longer to set up than CPU instances because of their limited availability and greater complexity.
Instance type availability and pricing are subject to the contract between the user or the user’s organization and DNAnexus.
| Notebook | Instance type | Kernel |
| --- | --- | --- |
| Tahoe_pca_tutorial_cpu_dxjupyterlab-2026-04-14.ipynb | mem2_ssd1_v2_x16 | Python 3.12 CPU |
| Tahoe_pca_tutorial_gpu_dxjupyterlab-2026-04-14.ipynb | mem2_ssd1_gpu_x48 | Python 3.12 GPU |
| Tahoe_scDataset_v3_tutorial_gpu-2026-04-14.ipynb | mem2_ssd1_gpu_x16 | Python 3.12 GPU |
Before running a notebook, please run the command-line instructions provided in that notebook in the terminal.
In the snapshot snapshot-single_cell-dxjupyterlab-2026-04-08.tar.gz, two conda environments are pre-configured: one for CPU and one for GPU. Please select the appropriate kernel for each notebook according to the table above.
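If you script your setup, the recommendations in the table above can be kept in a small lookup. This is only a convenience sketch. The instance types and kernel names are transcribed from the table, while the dictionary layout and the function name recommended_settings are hypothetical and not part of any DNAnexus API.

```python
# Recommended settings, transcribed from the table above.
NOTEBOOK_SETTINGS = {
    "Tahoe_pca_tutorial_cpu_dxjupyterlab-2026-04-14.ipynb": ("mem2_ssd1_v2_x16", "Python 3.12 CPU"),
    "Tahoe_pca_tutorial_gpu_dxjupyterlab-2026-04-14.ipynb": ("mem2_ssd1_gpu_x48", "Python 3.12 GPU"),
    "Tahoe_scDataset_v3_tutorial_gpu-2026-04-14.ipynb": ("mem2_ssd1_gpu_x16", "Python 3.12 GPU"),
}


def recommended_settings(notebook_name):
    """Return the recommended instance type and kernel for a listed notebook."""
    try:
        instance_type, kernel = NOTEBOOK_SETTINGS[notebook_name]
    except KeyError:
        raise ValueError(f"No recommendation listed for {notebook_name!r}") from None
    return {"instance_type": instance_type, "kernel": kernel}


settings = recommended_settings("Tahoe_pca_tutorial_gpu_dxjupyterlab-2026-04-14.ipynb")
print(settings["instance_type"], "/", settings["kernel"])
```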
A note on quality control filtering:
We removed three low-quality cell lines, leaving 47 cell lines.
We applied the “full” filtering option for quality control. The Tahoe dataset provides two QC levels, minimal and full; “full” applies the stricter filtering.
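The two filtering decisions above amount to a simple per-cell predicate, sketched here on plain Python records. This is an illustration under stated assumptions and not the notebooks' code. The field names qc_level and cell_line and the cell line identifiers are hypothetical placeholders, because this page names neither the metadata columns nor the three removed cell lines.

```python
# Hypothetical placeholders: the three removed cell lines are not named on this page.
REMOVED_CELL_LINES = {"CELL_LINE_A", "CELL_LINE_B", "CELL_LINE_C"}


def keep_cell(record, removed=REMOVED_CELL_LINES):
    """Keep a cell only if it passed the "full" QC level and its cell line was not removed."""
    return record["qc_level"] == "full" and record["cell_line"] not in removed


# Toy records; "qc_level" and "cell_line" are assumed field names.
cells = [
    {"cell_id": "c1", "cell_line": "CELL_LINE_A", "qc_level": "full"},
    {"cell_id": "c2", "cell_line": "CELL_LINE_D", "qc_level": "minimal"},
    {"cell_id": "c3", "cell_line": "CELL_LINE_D", "qc_level": "full"},
]

kept = [c for c in cells if keep_cell(c)]
print([c["cell_id"] for c in kept])
```

In this toy input, c1 is dropped because its cell line is in the removed set, and c2 is dropped because it passed only the minimal QC level.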
Video: Setting Up the Tahoe-100M Dataset Analysis on the DNAnexus Platform