Tahoe 100M

Necessary Disclaimers and Legal

The user is responsible for reviewing and complying with the license requirements of the software, notebooks, and data referenced in this documentation.

Users are responsible for the costs associated with analyzing the Tahoe 100M dataset and its storage in their project spaces.

Instance type availability and pricing are subject to the contract between the user or the user’s organization and DNAnexus.

Citations for Tahoe 100M Dataset

The paper titled Tahoe-100M: A Giga-Scale Single-Cell Perturbation Atlas for Context-Dependent Gene Function and Cellular Modeling describes how the Tahoe 100M dataset was curated. It is currently available as a preprint on bioRxiv.

Tahoe 100M is hosted as part of Arc Institute’s Virtual Cell Atlas.

The instructions for Arc Institute’s official version of the dataset are hosted on their GitHub.

Overview of the Tahoe 100M Dataset

The Tahoe 100M dataset was generated using Tahoe’s Mosaic Platform in partnership with Parse Biosciences and Ultima Genomics. This dataset of 100 million single cells has been curated to accelerate discovery by modeling gene-drug and gene-gene interactions at single-cell resolution, training AI/ML models grounded in single-cell biology, mapping drug responses across cell types and states, and benchmarking and validating models with the confidence that a dataset of this size provides. These use cases are further illustrated in the figure below:

The Tahoe 100M dataset is now available as part of the Arc Institute’s Virtual Cell Atlas, which is openly accessible for scientific use. DNAnexus has already downloaded the data to the platform, so you can use it there without downloading or setting it up yourself. See the “Where to Access Tahoe 100M” section below to start accessing the dataset.

Where to Access Tahoe 100M

The following files are available for the Tahoe 100M dataset:

The original set of files retrieved directly from Arc Institute’s GCP storage. The location of these files is here on the platform.
Parquet files converted from the AnnData files, for users who prefer to analyze the Tahoe-100M data with big data analytics tools such as Spark. The location of these files is here on the platform.
Notebooks to analyze the Tahoe 100M dataset, which can be found here on the platform. The notebook files end in .ipynb.
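
Once the Parquet files are in a location you can read (for example, a local directory in a JupyterLab session), a first look at them might be sketched as follows. This is a minimal sketch, not an official access method: the directory `./tahoe_parquet` and both helper names are hypothetical, and `summarize_parquet` assumes the optional `pyarrow` package is installed.

```python
from pathlib import Path
from typing import Dict, List


def find_parquet_files(root: str) -> List[Path]:
    """Return every *.parquet file under root, sorted for a stable order."""
    return sorted(Path(root).rglob("*.parquet"))


def summarize_parquet(path: Path) -> Dict[str, object]:
    """Read only the file footer (no row data) and report rows and column names.

    Assumes pyarrow is installed; it is imported lazily so that the file
    discovery above works without it.
    """
    import pyarrow.parquet as pq

    parquet_file = pq.ParquetFile(str(path))
    return {
        "rows": parquet_file.metadata.num_rows,
        "columns": parquet_file.schema_arrow.names,
    }


if __name__ == "__main__":
    # "./tahoe_parquet" is a hypothetical local path; point it at your copy.
    for parquet_path in find_parquet_files("./tahoe_parquet"):
        print(parquet_path, summarize_parquet(parquet_path))
```

Reading only the footer metadata keeps this inspection cheap even for very large files, which is useful before committing to a full Spark job.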

To use the dataset and notebooks, copy them into your own project space. Details on how to copy the data are provided in the section titled "Copying Data and Notebooks into a Project".

Running Analyses on Tahoe 100M

Copying Data and Notebooks into a Project

To utilize the dataset, please copy the data from this project into your own project.

Here are the steps to copy the Tahoe-100M data into a Project Space:

1. Create a project for your Tahoe 100M dataset, billed to your own organization. Tutorials on how to set up a project can be found on this page.
2. Go to the Resources tab, find the project titled “Public Datasets AWS US (East)”, and select the folder "Tahoe-100M".
3. Select the data folder and the notebooks.
4. Select "Copy" in the top-right menu, and choose the project that you created in Step 1.
5. Go to the project space you created in Step 1 to start exploring the Tahoe 100M dataset and notebooks.
6. To run the JupyterLab notebooks, see the JupyterLab section of the Academy Documentation or the AI/ML Accelerator - ML JupyterLab section, depending on the notebooks that you select and the apps that you have access to.
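
For users who prefer the command line, the copy steps above could also be scripted with the `dx` command-line client. The following is a minimal sketch under stated assumptions, not an official procedure: it only builds and prints the `dx cp` and `dx ls` commands without running them, it assumes the "Tahoe-100M" folder sits at the root of the public project and that the project can be resolved by name (substitute the project ID if it cannot), and `my-tahoe-project` is a hypothetical destination project.

```python
import shlex
from typing import List

# Names taken from the steps above. Assumption: the "Tahoe-100M" folder sits
# at the root of the public project and the project resolves by name.
SOURCE_PROJECT = "Public Datasets AWS US (East)"
SOURCE_FOLDER = "/Tahoe-100M"


def build_copy_commands(dest_project: str, dest_folder: str = "/") -> List[List[str]]:
    """Build (but do not run) dx CLI calls that copy the folder and list the result."""
    source = f"{SOURCE_PROJECT}:{SOURCE_FOLDER}"
    destination = f"{dest_project}:{dest_folder}"
    listing = f"{dest_project}:{dest_folder.rstrip('/')}{SOURCE_FOLDER}"
    return [
        ["dx", "cp", source, destination],  # copy the whole folder
        ["dx", "ls", listing],              # confirm that it arrived
    ]


if __name__ == "__main__":
    # "my-tahoe-project" is a hypothetical destination project name.
    for command in build_copy_commands("my-tahoe-project"):
        print(shlex.join(command))
```

Printing the commands first makes it easy to review the source and destination paths before running anything that adds storage costs to your project.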

Instance Type Selection

Instance wait times depend on queue length. Less common instance types may have longer wait times because of their limited availability.
GPU instances take longer to set up than CPU-only instance types because of their limited availability and greater complexity.
Instance type availability and pricing are subject to the contract between the user or the user’s organization and DNAnexus.
Requirements for notebooks that are present in the Tahoe 100M folder:
- Notebooks that are optimized for the JupyterLab with Python, R, Stata, ML, Image Processing App.
  - scDataset notebook is named "Tahoe_scDataset_tutorial.ipynb"
    Instance type to use: mem2_ssd1_gpu_x16
    Please follow the provided command-line instructions in the terminal that are found in the notebook example before running the notebook.
- Notebooks that are optimized for the AI/ML Accelerator - ML JupyterLab App. If you would like to use the AI/ML Accelerator and do not have access, please contact the Success Team at [email protected] or the Sales Team at [email protected].
  - PCA notebook example for CPU is titled "Tahoe_pca_cpu_ai_ml_accelerator_tutorial.ipynb"
    Instance type to use: mem2_ssd1_gpu_x16
    Please follow the provided command-line instructions in the terminal that are found in the notebook example before running the notebook.
  - PCA notebook example for GPU is titled "Tahoe_pca_gpu_ai_ml_accelerator_tutorial.ipynb"
    Instance type to use: mem2_ssd1_gpu_x48
    Please follow the provided command-line instructions in the terminal that are found in the notebook example before running the notebook.
- A note on Quality control filtering:
  - We removed three low-quality cell lines → 47 cell lines remaining.
  - We applied the “full” filtering option for quality control. The Tahoe dataset provides two QC levels: minimal and full. “full” applies stricter filtering.
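
The quality control filtering described above can be illustrated with a small sketch. Everything specific in it is an assumption for illustration: the field names `cell_line` and `pass_filter`, the placeholder cell line names, and the toy records are hypothetical, and the sketch assumes each cell is labeled with the strictest QC level it passes. It is not the code that was used to prepare the data on the platform.

```python
from typing import Dict, Iterable, List, Set


def apply_qc(
    cells: Iterable[Dict[str, str]],
    excluded_cell_lines: Set[str],
    qc_level: str = "full",
) -> List[Dict[str, str]]:
    """Keep cells that pass the chosen QC level and are not from an excluded line."""
    return [
        cell
        for cell in cells
        if cell["pass_filter"] == qc_level
        and cell["cell_line"] not in excluded_cell_lines
    ]


# Toy records, invented for illustration; not real Tahoe 100M data.
toy_cells = [
    {"cell_id": "c1", "cell_line": "LINE_A", "pass_filter": "full"},
    {"cell_id": "c2", "cell_line": "LINE_A", "pass_filter": "minimal"},
    {"cell_id": "c3", "cell_line": "LINE_B", "pass_filter": "full"},
    {"cell_id": "c4", "cell_line": "LINE_C", "pass_filter": "full"},
]

kept = apply_qc(toy_cells, excluded_cell_lines={"LINE_B"})
print([cell["cell_id"] for cell in kept])  # prints ['c1', 'c4']
```

Here `c2` is dropped because it only passes the minimal level, and `c3` is dropped because its cell line is on the exclusion list.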
