# Tahoe-100M

#### Necessary Disclaimers and Legal

The user is responsible for reviewing and complying with the license requirements of the software, notebooks, and data referenced in this documentation.

Users are responsible for the costs of analyzing the Tahoe-100M dataset and of storing it in their project spaces.

Instance type availability and pricing are subject to the contract between the user or the user’s organization and DNAnexus.

## Introduction

Here, we introduce the Tahoe-100M dataset and three Jupyter notebooks using this dataset on the DNAnexus platform. The notebooks demonstrate how to handle large single-cell data, using out-of-core and GPU-accelerated computing strategies. Workflows covered include data preprocessing, PCA computation, UMAP visualization, and deep learning-based cell line classification. Both CPU and GPU compute environments are supported.

All notebooks are built to run within [JupyterLab with Python, R, Stata, ML, Image Processing](https://academy.dnanexus.com/interactivecloudcomputing/jupyterlab/introduction) on DNAnexus, using a pre-configured snapshot that eliminates the need for manual package installation.

### List of notebooks

| Notebook                                                 | Purpose                                                                      | Compute |
| -------------------------------------------------------- | ---------------------------------------------------------------------------- | ------- |
| Tahoe\_pca\_tutorial\_cpu\_dxjupyterlab-2026-04-14.ipynb | Out-of-core PCA computation and UMAP visualization using Scanpy              | CPU     |
| Tahoe\_pca\_tutorial\_gpu\_dxjupyterlab-2026-04-14.ipynb | Out-of-core PCA computation and UMAP visualization using RAPIDS SingleCell   | GPU     |
| Tahoe\_scDataset\_v3\_tutorial\_gpu-2026-04-14.ipynb     | Streaming data loading with scDataset and PyTorch linear classifier training | GPU     |
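
The out-of-core idea behind the two PCA notebooks can be illustrated with a minimal, standard-library-only sketch: the data is visited one chunk at a time, only small running totals are kept in memory, and the leading principal component is then extracted from the accumulated covariance matrix. This is not the notebooks' implementation (they use Scanpy and RAPIDS SingleCell); the function names `iter_chunks`, `streaming_covariance`, and `top_component`, the toy data, and the use of power iteration are all assumptions made for illustration.

```python
import random


def iter_chunks(rows, chunk_size):
    """Yield successive chunks; stands in for reading slices of a file on disk."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


def streaming_covariance(chunks, n_features):
    """Accumulate per-feature sums and cross-products one chunk at a time."""
    n = 0
    sums = [0.0] * n_features
    gram = [[0.0] * n_features for _ in range(n_features)]
    for chunk in chunks:
        for row in chunk:
            n += 1
            for i in range(n_features):
                sums[i] += row[i]
                for j in range(n_features):
                    gram[i][j] += row[i] * row[j]
    mean = [s / n for s in sums]
    # Sample covariance from the running totals.
    cov = [
        [(gram[i][j] - n * mean[i] * mean[j]) / (n - 1) for j in range(n_features)]
        for i in range(n_features)
    ]
    return mean, cov


def top_component(cov, n_iter=200):
    """Power iteration for the leading eigenvector of the covariance matrix."""
    d = len(cov)
    v = [1.0] * d  # deterministic starting vector
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


# Toy data (hypothetical): feature 1 is about twice feature 0, feature 2 is noise.
rng = random.Random(0)
rows = []
for _ in range(500):
    x = rng.gauss(0, 1)
    rows.append([x, 2 * x + rng.gauss(0, 0.1), rng.gauss(0, 0.1)])

mean, cov = streaming_covariance(iter_chunks(rows, chunk_size=50), n_features=3)
pc1 = top_component(cov)
print("First principal component:", [round(x, 3) for x in pc1])
```

In this toy setting the first component should point roughly along the direction (1, 2, 0), because the second feature is about twice the first. With tens of thousands of genes, a dense covariance matrix and pure-Python loops would be impractical, which is one reason to rely on the library implementations used in the notebooks.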

## Citations for Tahoe-100M Dataset

The paper [Tahoe-100M: A Giga-Scale Single-Cell Perturbation Atlas for Context-Dependent Gene Function and Cellular Modeling](https://www.biorxiv.org/content/10.1101/2025.02.20.639398v3), currently published on bioRxiv, describes how the Tahoe-100M dataset was curated.

Tahoe-100M is hosted as part of [Arc Institute’s Virtual Cell Atlas](https://arcinstitute.org/tools/virtualcellatlas).

The instructions for Arc Institute’s official version of the dataset are hosted on [their GitHub](https://github.com/ArcInstitute/arc-virtual-cell-atlas/blob/main/tahoe-100M/README.md).

## Overview of the Tahoe-100M Dataset

The Tahoe-100M dataset was generated using Tahoe’s Mosaic Platform in partnership with Parse Biosciences and Ultima Genomics. This dataset of 100 million single cells has been curated to accelerate discovery by modeling gene-drug and gene-gene interactions at the single-cell level, training AI/ML models grounded in single-cell biology, mapping drug responses across cell types and states, and benchmarking and validating models with the confidence that a dataset of this size allows. These use cases are further illustrated in the figure below:

<figure><img src="/files/ahmgi0jEHq1IINbdfcGX" alt=""><figcaption></figcaption></figure>

The Tahoe-100M dataset is now available as part of the Arc Institute’s Virtual Cell Atlas, which is openly accessible for scientific use. DNAnexus has already downloaded this data to the platform, so you can use it without downloading it yourself or performing further setup. See the “Where to Access Tahoe-100M” section below to start accessing the dataset.

## Where to Access Tahoe-100M

The following files are available for the Tahoe-100M dataset:

* The original set of files retrieved directly from Arc Institute’s GCP storage. These files are located on the platform for [AWS US East](https://platform.dnanexus.com/panx/projects/J3JyY6j030gzQypGpk273241/data/Tahoe-100M/data/anndata), [AWS Europe (Frankfurt)](https://platform.dnanexus.com/panx/projects/J780j7848VpfB6kJ8p7y29xG/data/Tahoe-100M/data/anndata), [AWS Europe (London)](https://platform.dnanexus.com/panx/projects/J780fzpKpb7Gq5X4ZJfBP7QX/data/Tahoe-100M/data/anndata), [Azure Amsterdam](https://platform.dnanexus.com/panx/projects/J780gY0B34pvq5X4ZJfBP7YP/data/Tahoe-100M/data/anndata), and [Azure US (West)](https://platform.dnanexus.com/panx/projects/J780v289Z00G4Kx14b188ybj/data/Tahoe-100M/data/anndata).
* The AnnData files converted to Parquet, for users who prefer to analyze the Tahoe-100M data with big data analytics tools such as Spark. These files are located on the platform for [AWS US East](https://platform.dnanexus.com/panx/projects/J3JyY6j030gzQypGpk273241/data/Tahoe-100M/data/parquet), [AWS Europe (Frankfurt)](https://platform.dnanexus.com/panx/projects/J780j7848VpfB6kJ8p7y29xG/data/Tahoe-100M/data/parquet), [AWS Europe (London)](https://platform.dnanexus.com/panx/projects/J780fzpKpb7Gq5X4ZJfBP7QX/data/Tahoe-100M/data/parquet), [Azure Amsterdam](https://platform.dnanexus.com/panx/projects/J780gY0B34pvq5X4ZJfBP7YP/data/Tahoe-100M/data/parquet), and [Azure US (West)](https://platform.dnanexus.com/panx/projects/J780v289Z00G4Kx14b188ybj/data/Tahoe-100M/data/parquet).
* Notebooks to analyze the Tahoe-100M dataset. These are located on the platform for [AWS US East](https://platform.dnanexus.com/panx/projects/J3JyY6j030gzQypGpk273241/data/Tahoe-100M), [AWS Europe (Frankfurt)](https://platform.dnanexus.com/panx/projects/J780j7848VpfB6kJ8p7y29xG/data/Tahoe-100M), [AWS Europe (London)](https://platform.dnanexus.com/panx/projects/J780fzpKpb7Gq5X4ZJfBP7QX/data/Tahoe-100M/), [Azure Amsterdam](https://platform.dnanexus.com/panx/projects/J780gY0B34pvq5X4ZJfBP7YP/data/Tahoe-100M), and [Azure US (West)](https://platform.dnanexus.com/panx/projects/J780v289Z00G4Kx14b188ybj/data/Tahoe-100M). The notebook files have the `.ipynb` extension.
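
The links above all follow one pattern: a region-specific project identifier followed by a path under the `Tahoe-100M` folder. As a small convenience sketch, the helper below rebuilds those URLs from a region name. The identifiers and path layout are copied from the links in this section; the names `PUBLIC_DATASET_PROJECTS` and `tahoe_url` are hypothetical and not part of any DNAnexus API.

```python
# Project identifiers exactly as they appear in the platform URLs on this page.
PUBLIC_DATASET_PROJECTS = {
    "AWS US East": "J3JyY6j030gzQypGpk273241",
    "AWS Europe (Frankfurt)": "J780j7848VpfB6kJ8p7y29xG",
    "AWS Europe (London)": "J780fzpKpb7Gq5X4ZJfBP7QX",
    "Azure Amsterdam": "J780gY0B34pvq5X4ZJfBP7YP",
    "Azure US (West)": "J780v289Z00G4Kx14b188ybj",
}

BASE_URL = "https://platform.dnanexus.com/panx/projects"


def tahoe_url(region, subfolder=""):
    """Return the platform URL of the Tahoe-100M folder (or a subfolder) for a region."""
    project = PUBLIC_DATASET_PROJECTS[region]  # raises KeyError for unknown regions
    path = "Tahoe-100M" + (f"/{subfolder}" if subfolder else "")
    return f"{BASE_URL}/{project}/data/{path}"


anndata_url = tahoe_url("AWS US East", "data/anndata")
parquet_url = tahoe_url("Azure US (West)", "data/parquet")
notebooks_url = tahoe_url("AWS Europe (Frankfurt)")
print(anndata_url)
```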

To use the dataset and notebooks, copy them into your own project space. See the "Copying Data and Notebooks into a Project" section for details.

## Running Analyses on Tahoe-100M

### Copying Data and Notebooks into a Project

To use the dataset, copy the data from the projects listed above into your own project by following these steps:

1. Create a project for your Tahoe-100M dataset, billed to your own organization. Tutorials on how to set up a project can be found [on this page](https://academy.dnanexus.com/overview-of-the-platform/setting-up-a-project).
2. Go to the Resources tab and find the project titled “Public Datasets *Region*”. Select the project that matches your region, then select the folder "Tahoe-100M".
3. Select the data folder and the notebooks.
4. Select "Copy" in the top right menu, and select the project that you created in Step 1.
5. Go to the project you created in Step 1 to start exploring the Tahoe-100M dataset and notebooks.
6. To run the JupyterLab notebooks, see the [JupyterLab section of the Academy Documentation](/interactivecloudcomputing/jupyterlab.md).
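
The steps above use the web interface. For readers who prefer the command line, the sketch below only assembles (and does not run) a `dx cp` invocation from the dx-toolkit, which copies objects and folders between projects. Both project IDs are placeholders that must be replaced with real values, the `/Tahoe-100M` folder path is inferred from the links on this page, and `build_copy_command` is a hypothetical helper name.

```python
import shlex


def build_copy_command(source_project, dest_project,
                       folder="/Tahoe-100M", dest_folder="/"):
    """Assemble a `dx cp` argument list copying a folder between two projects."""
    source = f"{source_project}:{folder}"
    destination = f"{dest_project}:{dest_folder}"
    return ["dx", "cp", source, destination]


# Placeholder IDs: substitute the public dataset project for your region
# and the project you created in Step 1.
cmd = build_copy_command("project-SOURCE_PLACEHOLDER", "project-DEST_PLACEHOLDER")
print(shlex.join(cmd))
```

Printing the command rather than executing it makes it easy to review before anything is copied, since storage in the destination project is billed to its owner.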

### Running the Notebooks

* The notebooks are optimized for JupyterLab with Python, R, Stata, ML, Image Processing (v2.11.0). If you do not have access, please contact the Success Team at <success@dnanexus.com> or the Sales Team at <sales@dnanexus.com>.
* When you launch JupyterLab, load the snapshot **snapshot-single\_cell-dxjupyterlab-2026-04-08.tar.gz** from the Notebook\_snapshot folder.
* Follow [Introduction to JupyterLab](https://academy.dnanexus.com/interactivecloudcomputing/jupyterlab/introduction) to learn how to load a snapshot and launch JupyterLab.

#### Instance Type Selection and Kernel Selection

* Instance launch times depend on queue length. Less common instance types may have longer wait times because of their limited availability.
* GPU instances take longer to set up than CPU instances because of their limited availability and greater complexity.
* Instance type availability and pricing are subject to the contract between the user or the user’s organization and DNAnexus.

| Notebook                                                 | Instance type        | Kernel          |
| -------------------------------------------------------- | -------------------- | --------------- |
| Tahoe\_pca\_tutorial\_cpu\_dxjupyterlab-2026-04-14.ipynb | mem2\_ssd1\_v2\_x16  | Python 3.12 CPU |
| Tahoe\_pca\_tutorial\_gpu\_dxjupyterlab-2026-04-14.ipynb | mem2\_ssd1\_gpu\_x48 | Python 3.12 GPU |
| Tahoe\_scDataset\_v3\_tutorial\_gpu-2026-04-14.ipynb     | mem2\_ssd1\_gpu\_x16 | Python 3.12 GPU |
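
For scripted launches, the table above can be captured as a small lookup so that the instance type and kernel for a notebook are not retyped by hand. This is only a convenience sketch: the values are copied from the table, while `NotebookConfig`, `NOTEBOOK_CONFIGS`, and `config_for` are hypothetical names.

```python
from typing import NamedTuple


class NotebookConfig(NamedTuple):
    instance_type: str
    kernel: str


# Values copied from the table above.
NOTEBOOK_CONFIGS = {
    "Tahoe_pca_tutorial_cpu_dxjupyterlab-2026-04-14.ipynb":
        NotebookConfig("mem2_ssd1_v2_x16", "Python 3.12 CPU"),
    "Tahoe_pca_tutorial_gpu_dxjupyterlab-2026-04-14.ipynb":
        NotebookConfig("mem2_ssd1_gpu_x48", "Python 3.12 GPU"),
    "Tahoe_scDataset_v3_tutorial_gpu-2026-04-14.ipynb":
        NotebookConfig("mem2_ssd1_gpu_x16", "Python 3.12 GPU"),
}


def config_for(notebook):
    """Return the listed instance type and kernel, or fail loudly if unlisted."""
    try:
        return NOTEBOOK_CONFIGS[notebook]
    except KeyError:
        raise ValueError(f"No configuration listed for {notebook!r}") from None


cfg = config_for("Tahoe_pca_tutorial_gpu_dxjupyterlab-2026-04-14.ipynb")
print(cfg.instance_type, "/", cfg.kernel)
```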

* Before running a notebook, follow the command-line instructions provided in that notebook, running them in the terminal.
* The snapshot snapshot-single\_cell-dxjupyterlab-2026-04-08.tar.gz contains two pre-configured conda environments: one for CPU and one for GPU. Select the appropriate kernel for each notebook according to the table above.
* A note on quality control (QC) filtering:
  * We removed three low-quality cell lines, leaving 47 cell lines.
  * We applied the “full” filtering option for quality control. The Tahoe dataset provides two QC levels, minimal and full; “full” applies the stricter filtering.
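
The quality control filtering described above amounts to two conditions on each cell: it passed the “full” QC level, and it does not belong to one of the removed cell lines. The sketch below shows that logic on toy records. The field names `pass_filter` and `cell_line` are assumptions about the metadata layout, the cell line names are placeholders (the three removed lines are not named on this page), and `apply_qc` is a hypothetical helper.

```python
# Placeholder names: the three removed cell lines are not identified on this page.
EXCLUDED_CELL_LINES = {"LINE_A", "LINE_B", "LINE_C"}


def apply_qc(cells, excluded=EXCLUDED_CELL_LINES, qc_level="full"):
    """Keep cells that pass the given QC level and are not from an excluded cell line."""
    return [
        cell for cell in cells
        if cell["pass_filter"] == qc_level and cell["cell_line"] not in excluded
    ]


# Toy metadata records (hypothetical field names and values).
cells = [
    {"cell_id": "c1", "cell_line": "LINE_A", "pass_filter": "full"},     # excluded line
    {"cell_id": "c2", "cell_line": "LINE_D", "pass_filter": "full"},     # kept
    {"cell_id": "c3", "cell_line": "LINE_D", "pass_filter": "minimal"},  # fails full QC
    {"cell_id": "c4", "cell_line": "LINE_E", "pass_filter": "full"},     # kept
]

kept = apply_qc(cells)
print([cell["cell_id"] for cell in kept])
```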

### Video: Setting Up the Tahoe-100M Dataset Analysis on the DNAnexus Platform

{% embed url="<https://youtu.be/6RmcWE9QgtE>" %}
