
# scvi-tools and CZ CELLxGENE

## Necessary Disclaimers and Legal

The user is responsible for reviewing and complying with the license requirements of the software, notebooks, and data referenced in this documentation.

Users are responsible for the costs of analyzing the CZ CELLxGENE dataset with scvi-tools and of storing the data in their project spaces.

Instance type availability and pricing are subject to the contract between the user or the user’s organization and DNAnexus.

## Introduction

This user guide presents a series of four Jupyter notebooks for single-cell RNA-seq (scRNA-seq) analysis on the DNAnexus platform. The workflows leverage two key resources: scvi-tools, a Python library for probabilistic deep learning and generative modeling of single-cell omics data, and CELLxGENE Census, a large-scale data resource that provides standardized, programmatic access to millions of annotated human and mouse cells.

The notebooks guide users through a complete analytical workflow, from querying and retrieving datasets from the Census, to preprocessing raw count matrices, training variational autoencoder (VAE) models with scVI, and visualizing batch-corrected embeddings with UMAP. Both CPU and GPU compute environments are supported, with GPU-accelerated notebooks recommended for faster model training on large datasets.

All notebooks are built to run in [JupyterLab with Python, R, Stata, ML, Image Processing](https://academy.dnanexus.com/interactivecloudcomputing/jupyterlab/introduction) on DNAnexus, using a pre-configured snapshot that eliminates the need for manual package installation.

### List of notebooks

| Notebook                                                         | Purpose                                                    | Compute |
| ---------------------------------------------------------------- | ---------------------------------------------------------- | ------- |
| Cellxgene\_census\_data\_fetching\_cpu-2026-04-09.ipynb          | Access & query CELLxGENE Census data                       | CPU     |
| introduction\_scvi\_tools\_tutorial\_gpu-2026-04-14.ipynb        | End-to-end scVI workflow on a tumor microenvironment atlas | GPU     |
| Cellxgene\_census\_data\_integration\_scvi\_cpu-2026-04-09.ipynb | Batch integration of multi-dataset T-cell slices           | CPU     |
| Cellxgene\_census\_data\_integration\_scvi\_gpu-2026-04-08.ipynb | Batch integration of multi-dataset T-cell slices           | GPU     |

## Citations for the scvi-tools and CZ CELLxGENE dataset

For this demonstration, we adapted the [Introduction to scvi-tools notebook](https://docs.scvi-tools.org/en/stable/tutorials/notebooks/quick_start/api_overview.html), developed by the scvi-tools development team. Users may cite the [scvi-tools manuscript](https://www.nature.com/articles/s41587-021-01206-w) published in 2022, along with the original papers describing each model, which are referenced in the corresponding documentation. In this example, we applied the scVI (single-cell Variational Inference) model, which is described in the publication [Deep generative modeling for single-cell transcriptomics](https://www.nature.com/articles/s41592-018-0229-2).

The scVI model is trained on the human single-cell RNA-seq dataset downloaded from the [CZ CELLxGENE data portal](https://cellxgene.cziscience.com/collections/3f7c572c-cd73-4b51-a313-207c7f20f188). Cite the publication associated with this dataset: [Single-cell resolution characterization of myeloid-derived cell states with implication in cancer outcome](https://www.nature.com/articles/s41467-024-49916-4).

[CZ CELLxGENE](https://cellxgene.cziscience.com/) brings together a wide range of public single-cell datasets that have been shared through the Chan Zuckerberg Initiative platform. These datasets are uploaded by the original researchers and distributed under the Creative Commons [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/). More information may be found in the [CZ CELLxGENE Data Submission Policy](https://cellxgene.cziscience.com/docs/032__Contribute%20and%20Publish%20Data).

Cite CZ CELLxGENE as: CZI Single-Cell Biology, et al. [CZ CELLxGENE Discover: A single-cell data platform for scalable exploration, analysis and modeling of aggregated data](https://www.biorxiv.org/content/10.1101/2023.10.30.563174v1). bioRxiv 2023.10.30; doi: <https://doi.org/10.1101/2023.10.30.563174>

## Overview of scvi-tools

scvi-tools is a software ecosystem designed for fully processing and modeling single-cell omics datasets. The project originates from work carried out in the Yosef Lab at UC Berkeley in collaboration with researchers at the Weizmann Institute of Science. The toolkit can be thought of in two parts:

* It offers an accessible interface for applying various probabilistic methods to single-cell data (including models like scVI, scANVI, and totalVI).
* It provides a framework for constructing new probabilistic approaches using the PyTorch, PyTorch Lightning, and Pyro libraries.

On DNAnexus, we provide a notebook that demonstrates an end-to-end single-cell RNA-seq workflow using scvi-tools, covering data preprocessing, model training, and differential expression analysis. The notebook was run with scvi-tools version 1.4.0. Please refer to the [release notes](https://docs.scvi-tools.org/en/stable/changelog.html) for more details. The scVI model description can be found in the [scvi-tools user guide](https://docs.scvi-tools.org/en/1.1.0/user_guide/models/scvi.html).

<img src="/files/XURfRuuhyiPjoqGoO16E" alt="Figure: Overview of the encoder–decoder framework for batch-corrected single-cell RNA-seq representation learning. Raw count data and covariates are encoded into a latent space that captures biological signals while removing technical effects, and subsequently decoded to reconstruct gene expression levels and dropout probabilities." height="280" width="624">
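The preprocessing, training, and embedding steps can be sketched roughly as follows. This is a minimal sketch rather than the notebook's actual code: it assumes an AnnData object with raw counts in `adata.X` and a batch column in `adata.obs` (the name `batch` is a placeholder), and the helper name `run_scvi` is hypothetical. The `seurat_v3` gene selection flavor also needs the scikit-misc package, which is assumed to be installed.

```python
def run_scvi(adata, batch_key="batch", n_top_genes=2000, max_epochs=None):
    """Train scVI on raw counts, then attach a latent embedding and UMAP to `adata`.

    `run_scvi` and its defaults are illustrative assumptions, not part of scvi-tools.
    """
    # Imported lazily so this module loads even on a kernel without these packages.
    import scanpy as sc
    import scvi

    # Keep the raw counts in a layer: scVI models counts, not normalized values.
    adata.layers["counts"] = adata.X.copy()

    # Restrict the model to highly variable genes, selected per batch.
    sc.pp.highly_variable_genes(
        adata,
        n_top_genes=n_top_genes,
        flavor="seurat_v3",
        layer="counts",
        batch_key=batch_key,
        subset=True,
    )

    # Register the count layer and the batch covariate, then train the VAE.
    scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key=batch_key)
    model = scvi.model.SCVI(adata)
    model.train(max_epochs=max_epochs)  # None lets scvi-tools choose the epoch count

    # Store the batch-corrected latent space and build the UMAP on top of it.
    adata.obsm["X_scVI"] = model.get_latent_representation()
    sc.pp.neighbors(adata, use_rep="X_scVI")
    sc.tl.umap(adata)
    return model
```

After training, `model.differential_expression(groupby=...)` can be used for the differential expression step; the grouping column depends on the annotations present in the dataset.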

## Overview of CZ CELLxGENE dataset

The dataset originates from the study “[Single-cell resolution characterization of myeloid-derived cell states with implication in cancer outcome](https://www.nature.com/articles/s41467-024-49916-4)” and is available on CZ CELLxGENE under the title “[A multi-tissue single-cell tumor microenvironment atlas](https://cellxgene.cziscience.com/collections/3f7c572c-cd73-4b51-a313-207c7f20f188)”.

It aggregates nearly 400,000 single-cell transcriptomic profiles from 13 independent studies covering eight tumor and non-tumor tissue sources (including breast, colorectal, ovary, lung, liver, skin, uvea, and PBMC). It brings together samples collected from normal tissue, primary tumors, lymph nodes, and peripheral blood, generated across three commonly used single-cell RNA-seq technologies (10x, Smart-seq2, and inDrop). The atlas provides detailed annotations of major cellular populations, with a particular emphasis on characterizing myeloid-derived cell states. At DNAnexus, we have downloaded this data for your use on the platform. See the “Where to Access Data Asset” section below to start accessing the dataset.

For the data integration notebooks, a multi-dataset T-cell slice of 89,481 cells is queried directly from the CELLxGENE Census (census\_version="2025-11-08") using the Python API, filtering for T cells from COVID-19 and normal blood samples across multiple publications. No manual download is required for this dataset.
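A Census query of this kind might look like the sketch below. The exact filter used in the notebooks is not reproduced here: the filter values, the helper names `build_obs_filter` and `fetch_slice`, and the restriction to primary data are assumptions for illustration, so the resulting cell count may differ from the 89,481 cells quoted above. The column names (`cell_type`, `tissue_general`, `disease`, `is_primary_data`) are standard Census `obs` fields.

```python
CENSUS_VERSION = "2025-11-08"  # Census release named in this guide


def build_obs_filter(cell_type="T cell", tissue_general="blood",
                     diseases=("COVID-19", "normal")):
    """Build a value-filter string for the Census obs table (illustrative values)."""
    disease_list = ", ".join(f"'{d}'" for d in diseases)
    return (
        f"cell_type == '{cell_type}' and tissue_general == '{tissue_general}' "
        f"and disease in [{disease_list}] and is_primary_data == True"
    )


def fetch_slice(obs_filter, census_version=CENSUS_VERSION):
    """Query the Census over the network and return the matching cells as AnnData."""
    # Imported lazily so the filter helper works without the package installed.
    import cellxgene_census

    with cellxgene_census.open_soma(census_version=census_version) as census:
        return cellxgene_census.get_anndata(
            census,
            organism="Homo sapiens",
            obs_value_filter=obs_filter,
        )


if __name__ == "__main__":
    # Printing the filter needs no network access; fetch_slice(...) does.
    print(build_obs_filter())
```

Note that `cell_type == 'T cell'` matches only cells carrying that exact label; T-cell subtypes have their own labels in the Census, so a broader selection would need a longer list of cell types.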

## Where to Access Data Asset

The following data are available on DNAnexus:

* The AnnData file of “A multi-tissue single-cell tumor microenvironment atlas”, downloaded directly from the CZ CELLxGENE portal. The file is stored in the DNAnexus project folder under the name A\_multi\_tissue\_single\_cell\_tumor\_microenvironment\_atlas.h5ad. It is available on the platform [here for AWS US East](https://platform.dnanexus.com/panx/projects/J3JyY6j030gzQypGpk273241/data/Single_cell_analysis/data), [here for AWS Europe (Frankfurt)](https://platform.dnanexus.com/panx/projects/J780j7848VpfB6kJ8p7y29xG/data/Single_cell_analysis/data), [here for AWS Europe (London)](https://platform.dnanexus.com/panx/projects/J780fzpKpb7Gq5X4ZJfBP7QX/data/Single_cell_analysis/data), [here for Azure Amsterdam](https://platform.dnanexus.com/panx/projects/J780gY0B34pvq5X4ZJfBP7YP/data/Single_cell_analysis/data), and [here for Azure US (West)](https://platform.dnanexus.com/panx/projects/J780v289Z00G4Kx14b188ybj/data/Single_cell_analysis/data).
* Four example notebooks demonstrating the scvi-tools analysis workflows. They can be accessed on the platform [here for AWS US East](https://platform.dnanexus.com/panx/projects/J3JyY6j030gzQypGpk273241/data/Single_cell_analysis), [here for AWS Europe (Frankfurt)](https://platform.dnanexus.com/panx/projects/J780j7848VpfB6kJ8p7y29xG/data/Single_cell_analysis), [here for AWS Europe (London)](https://platform.dnanexus.com/panx/projects/J780fzpKpb7Gq5X4ZJfBP7QX/data/Single_cell_analysis), [here for Azure Amsterdam](https://platform.dnanexus.com/panx/projects/J780gY0B34pvq5X4ZJfBP7YP/data/Single_cell_analysis), and [here for Azure US (West)](https://platform.dnanexus.com/panx/projects/J780v289Z00G4Kx14b188ybj/data/Single_cell_analysis). The notebook files have the .ipynb extension.

To use the dataset and notebooks, please copy them into your own project space. Details on how to copy the data are given in the section titled "Copying Data and Notebooks into a Project".

## Running scvi-tools on DNAnexus

### Copying Data and Notebooks into a Project

To utilize the dataset, please copy the data from the project linked above into your own project. Here are the steps to copy the data into a Project Space:

1. Create a project for your single cell analysis, billed to your own organization. Tutorials on how to set up a project can be [found on this page](https://academy.dnanexus.com/overview-of-the-platform/setting-up-a-project).
2. Go to the Resources tab and find the project titled “Public Datasets *Region*”. Select the dataset project that matches your region and open the folder "Single\_cell\_analysis".
3. Select the data folder and the notebooks.
4. Select "Copy" in the top right menu, and select the project that you created in Step 1.
5. Then, go to the project space you created in Step 1 to start exploring the CZ CELLxGENE dataset and scvi-tools notebooks.
6. To run the JupyterLab notebooks, please see the [JupyterLab section of the Academy Documentation](https://academy.dnanexus.com/interactivecloudcomputing/jupyterlab).

### Download Data from CZ CELLxGENE to DNAnexus

Here, we show how to download an example dataset from CZ CELLxGENE. The dataset is from the paper: [Single-cell resolution characterization of myeloid-derived cell states with implication in cancer outcome](https://www.nature.com/articles/s41467-024-49916-4).

1. In the Data Availability section of the paper, open the CZ CELLxGENE dataset link: [A multi-tissue single-cell tumor microenvironment atlas](https://cellxgene.cziscience.com/collections/3f7c572c-cd73-4b51-a313-207c7f20f188)
2. On the CZ CELLxGENE page, click Download
3. Select Browser
4. In Download Details, click Copy to copy the download URL

Please follow the [Importing Data into DNAnexus tutorial](https://academy.dnanexus.com/overview-of-the-platform/adding-data-to-a-project) to download the dataset to your project.
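As one possible alternative to the tutorial's approach, the copied download URL could be fetched from a terminal or notebook inside a running JupyterLab session and then uploaded to the project with the `dx upload` command. The sketch below is an assumption about how this might be done, not the documented procedure: the helper names, the local file name, and the destination path are hypothetical placeholders.

```python
import shutil
import subprocess
import urllib.request


def fetch_to_disk(url, local_name):
    """Stream a URL to a local file so a large .h5ad is never held fully in memory."""
    with urllib.request.urlopen(url) as response, open(local_name, "wb") as out:
        shutil.copyfileobj(response, out)
    return local_name


def dx_upload_command(local_name, destination):
    """Build the `dx upload` call; `destination` is a project path such as project-xxxx:/data/."""
    return ["dx", "upload", local_name, "--path", destination]


def import_dataset(url, local_name, destination):
    """Download the file, then upload it to the DNAnexus project (needs the dx CLI, logged in)."""
    fetch_to_disk(url, local_name)
    subprocess.run(dx_upload_command(local_name, destination), check=True)
```

A call such as `import_dataset(copied_url, "atlas.h5ad", "project-xxxx:/Single_cell_analysis/data/")` would use the URL copied in step 4; the project ID and folder shown here are placeholders.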

### Running the Notebooks

* The notebooks are optimized for JupyterLab with Python, R, Stata, ML, Image Processing (v2.11.0). If you do not have access, please contact the Success Team at <success@dnanexus.com> or the Sales Team at <sales@dnanexus.com>.
* When you launch JupyterLab, load the snapshot snapshot-single\_cell-dxjupyterlab-2026-04-08.tar.gz from the Notebook\_snapshot folder.
* Please follow [Introduction to JupyterLab](https://academy.dnanexus.com/interactivecloudcomputing/jupyterlab/introduction) to learn how to load a snapshot and launch JupyterLab.

#### Instance Type Selection and Kernel Selection

* Instance start times depend on queue length. Less common instance types may have longer wait times because of their limited availability.
* GPU instances take longer to start than CPU instances because of their limited availability and more complex setup.
* Instance type availability and pricing are subject to the contract between the user or the user’s organization and DNAnexus.

| Notebook                                                         | Instance Type                         | Kernel                            |
| ---------------------------------------------------------------- | ------------------------------------- | --------------------------------- |
| Cellxgene\_census\_data\_fetching\_cpu-2026-04-09.ipynb          | mem3\_ssd1\_v2\_x16                   | Python 3.12 CPU                   |
| introduction\_scvi\_tools\_tutorial\_gpu-2026-04-14.ipynb        | mem2\_ssd1\_gpu\_x16                  | Python 3.12 GPU                   |
| Cellxgene\_census\_data\_integration\_scvi\_cpu-2026-04-09.ipynb | mem3\_ssd1\_v2\_x16                   | Python 3.12 CPU                   |
| Cellxgene\_census\_data\_integration\_scvi\_gpu-2026-04-08.ipynb | mem2\_ssd1\_gpu\_x48                  | Python 3.12 GPU                   |

* Before running a notebook, follow the command-line instructions given in that notebook and run them in the terminal.
* In the snapshot snapshot-single\_cell-dxjupyterlab-2026-04-08.tar.gz, two conda environments are pre-configured: one for CPU and one for GPU. Please select the appropriate kernel for each notebook according to the table above.
* For introduction\_scvi\_tools\_tutorial\_gpu-2026-04-14.ipynb, ensure the data file A\_multi\_tissue\_single\_cell\_tumor\_microenvironment\_atlas.h5ad is available in your DNAnexus project, and update the project\_id and data\_path variables in the notebook before running it.
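As a rough illustration of the last point, the two variables might be set and used as in the sketch below. The notebook's own cells may use these variables differently; the project ID is a placeholder, the folder path assumes the data was copied to `/Single_cell_analysis/data`, and the helper names are hypothetical.

```python
import subprocess

# Placeholders: replace with your own project ID and the path where you copied the file.
project_id = "project-xxxx"
data_path = (
    "/Single_cell_analysis/data/"
    "A_multi_tissue_single_cell_tumor_microenvironment_atlas.h5ad"
)


def dx_download_command(project_id, data_path):
    """Build a `dx download` call for a file addressed as <project>:<path>."""
    return ["dx", "download", f"{project_id}:{data_path}"]


def load_atlas(project_id, data_path):
    """Download the .h5ad into the working directory and read it as AnnData."""
    subprocess.run(dx_download_command(project_id, data_path), check=True)
    # Imported lazily so the command helper works without anndata installed.
    import anndata as ad

    # dx download saves the file under its base name in the current directory.
    return ad.read_h5ad(data_path.rsplit("/", 1)[-1])
```

Calling `load_atlas(project_id, data_path)` requires the `dx` command-line client to be logged in, which is normally already the case inside a DNAnexus JupyterLab session.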

### Video: Utilizing scvi-tools and CZ CELLxGENE on the DNAnexus Platform

{% embed url="<https://youtu.be/Ok3yqz5UCB4>" %}

