Tahoe 100M

The user is responsible for reviewing and complying with the license requirements of the software, notebooks, and data referenced in this documentation.

Users are responsible for the costs associated with analyzing the Tahoe 100M dataset and its storage in their project spaces.

Instance type availability and pricing are subject to the contract between the user or the user’s organization and DNAnexus.

Citations for Tahoe 100M Dataset

The paper titled Tahoe-100M: A Giga-Scale Single-Cell Perturbation Atlas for Context-Dependent Gene Function and Cellular Modeling, currently published on bioRxiv, describes how the Tahoe 100M dataset was curated.

Tahoe 100M is hosted as part of Arc Institute’s Virtual Cell Atlas.

The instructions for Arc Institute’s official version of the dataset are hosted on their GitHub.

Overview of the Tahoe 100M Dataset

The Tahoe 100M dataset was generated using Tahoe’s Mosaic Platform in partnership with Parse Biosciences and Ultima Genomics. This dataset of 100 million single cells has been curated to accelerate discovery in four ways: modeling gene-drug and gene-gene interactions at the single-cell level, training AI/ML models grounded in single-cell biology, mapping drug responses across cell types and states, and benchmarking and validating models with the confidence that a dataset of this size allows. These use cases are further illustrated in the figure below:

The Tahoe 100M dataset is available as part of the Arc Institute’s Virtual Cell Atlas, which is openly accessible for scientific use. DNAnexus has already loaded this data onto the platform, so you can use it without downloading or setting it up yourself. See the “Where to Access Tahoe 100M” section below to start accessing the dataset.

Where to Access Tahoe 100M

The following files are available for the Tahoe 100M dataset:

To use the dataset and notebooks, copy them into your own project space. For details, see the section titled "Copying Data and Notebooks into a Project".

Running Analyses on Tahoe 100M

Copying Data and Notebooks into a Project

To use the dataset, copy the data from this project into your own project.

Here are the steps to copy the Tahoe-100M data into a Project Space:

  1. Create a project for your Tahoe 100M dataset, billed to your own organization. Tutorials on how to set up a project can be found on this page.

  2. Go to the Resources tab, find the project titled “Public Datasets AWS US (East)”, and select the folder "Tahoe-100M".

  3. Select the data folder and the notebooks.

  4. Select "Copy" on the top right menu, and select the project that you created in Step 1.

  5. Then, go to the project space you created in Step 1 to start exploring the Tahoe 100M dataset and notebooks.

  6. To run the JupyterLab notebooks, see the JupyterLab section of the Academy Documentation or the AI/ML Accelerator - ML JupyterLab section, depending on the notebooks you select and the apps you have access to.
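
Steps 1 through 4 above can also be approximated from the command line with the dx CLI. The following is a minimal sketch, not an official procedure: it only builds the `dx new project` and `dx cp` commands so they can be reviewed before anything is run. The source project ID `project-xxxx`, the folder path `/Tahoe-100M`, the project name, and the org ID are all placeholders or assumptions; look up the real values in the Resources tab and your own account.

```python
import shlex

# Placeholder: replace with the real ID of "Public Datasets AWS US (East)".
SOURCE_PROJECT = "project-xxxx"
# Assumption: the folder sits at the root of the public project.
SOURCE_FOLDER = "/Tahoe-100M"


def build_copy_commands(dest_project_name, bill_to,
                        source_project=SOURCE_PROJECT,
                        source_folder=SOURCE_FOLDER):
    """Return dx CLI commands (as argument lists) mirroring steps 1-4.

    Nothing is executed here; the caller decides whether to run them.
    """
    # Step 1: create a project billed to your own organization.
    create = ["dx", "new", "project", dest_project_name, "--bill-to", bill_to]
    # Steps 2-4: copy the folder (dx cp copies folders recursively)
    # into the root of the new project.
    copy = ["dx", "cp", f"{source_project}:{source_folder}",
            f"{dest_project_name}:/"]
    return [create, copy]


if __name__ == "__main__":
    # Hypothetical project name and org ID, for illustration only.
    for cmd in build_copy_commands("my_tahoe_100m", "org-my_org"):
        print(shlex.join(cmd))
```

To execute the commands rather than print them, each argument list could be passed to `subprocess.run(cmd, check=True)` in an environment where the dx CLI is installed and logged in. Storage costs for the copied data are billed to the destination project, as noted at the top of this page.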

Instance Type Selection

  • Instance start times depend on queue length. Less common instance types may have longer wait times because of their limited availability.

  • GPU instances take longer to set up than CPU instance types because of their limited availability and complexity.

  • Instance type availability and pricing are subject to the contract between the user or the user’s organization and DNAnexus.

  • Requirements for notebooks that are present in the Tahoe 100M folder:

    • Notebooks that are optimized for the "JupyterLab with Python, R, Stata, ML, Image Processing" app:

      • The scDataset notebook is named "Tahoe_scDataset_tutorial.ipynb".

        • Instance type to use: mem2_ssd1_gpu_x16

        • Before running the notebook, run the command-line instructions given in the notebook example in a terminal.

      • The PCA notebook example for GPU is titled "Tahoe_pca_gpu_ml_dxjupyterlab.ipynb".

        • Instance type to use: mem2_ssd1_gpu_x48

        • Select "ML" as the feature.

        • Before running the notebook, run the command-line instructions given in the notebook example in a terminal.

    • Notebooks that are optimized for the AI/ML Accelerator - ML JupyterLab app. If you would like to use AI/ML Accelerator and do not have access, please contact the Success Team at [email protected] or the Sales Team at [email protected].

      • The PCA notebook example for CPU is titled "Tahoe_pca_cpu_ai_ml_accelerator_tutorial.ipynb".

        • Instance type to use: mem2_ssd1_gpu_x16

        • Before running the notebook, run the command-line instructions given in the notebook example in a terminal.

    • A note on quality control filtering:

      • We removed three low-quality cell lines, leaving 47 cell lines.

      • We applied the “full” filtering option for quality control. The Tahoe dataset provides two QC levels, “minimal” and “full”; “full” applies the stricter filtering.
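
The notebook requirements listed above can be kept as a small lookup table, which may be convenient when launching notebooks from scripts. This is an illustrative sketch only: the notebook names, instance types, and the "ML" feature are taken from the list above, while the class name, the function name, and the exact app label strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NotebookRequirement:
    app: str                       # app the notebook is optimized for
    instance_type: str             # instance type listed for the notebook
    feature: Optional[str] = None  # feature to select, if one is listed


# App labels as written in this page (string form is an assumption).
STANDARD_APP = "JupyterLab with Python, R, Stata, ML, Image Processing"
ACCELERATOR_APP = "AI/ML Accelerator - ML JupyterLab"

REQUIREMENTS = {
    "Tahoe_scDataset_tutorial.ipynb":
        NotebookRequirement(STANDARD_APP, "mem2_ssd1_gpu_x16"),
    "Tahoe_pca_gpu_ml_dxjupyterlab.ipynb":
        NotebookRequirement(STANDARD_APP, "mem2_ssd1_gpu_x48", feature="ML"),
    "Tahoe_pca_cpu_ai_ml_accelerator_tutorial.ipynb":
        NotebookRequirement(ACCELERATOR_APP, "mem2_ssd1_gpu_x16"),
}


def requirement_for(notebook_name):
    """Look up the listed requirement for a notebook file name."""
    try:
        return REQUIREMENTS[notebook_name]
    except KeyError:
        known = ", ".join(sorted(REQUIREMENTS))
        raise KeyError(
            f"No listed requirement for {notebook_name!r}; known: {known}"
        ) from None


if __name__ == "__main__":
    req = requirement_for("Tahoe_pca_gpu_ml_dxjupyterlab.ipynb")
    print(req.app, req.instance_type, req.feature)
```

Whether a given instance type is actually available to you still depends on your contract with DNAnexus, so treat the table as a starting point rather than a guarantee.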
