scvi-tools and CZ CELLxGENE

The user is responsible for reviewing and complying with the license requirements of the software, notebooks, and data referenced in this documentation.

Users are responsible for the costs of analyzing the CZ CELLxGENE dataset with scvi-tools and of storing the data in their project spaces.

Instance type availability and pricing are subject to the contract between the user or the user’s organization and DNAnexus.

Introduction

This user guide presents a series of four Jupyter notebooks for single-cell RNA-seq (scRNA-seq) analysis on the DNAnexus platform. The workflows leverage two key resources: scvi-tools, a Python library for probabilistic deep learning and generative modeling of single-cell omics data, and CELLxGENE Census, a large-scale data resource that provides standardized, programmatic access to millions of annotated human and mouse cells.

The notebooks guide users through a complete analytical workflow, from querying and retrieving datasets from the Census, to preprocessing raw count matrices, training variational autoencoder (VAE) models with scVI, and visualizing batch-corrected embeddings with UMAP. Both CPU and GPU compute environments are supported, with GPU-accelerated notebooks recommended for faster model training on large datasets.

All notebooks are built to run within the JupyterLab with Python, R, Stata, ML, Image Processing app on DNAnexus, using a pre-configured snapshot that eliminates the need for manual package installation.

List of notebooks

| Notebook | Purpose | Compute |
| --- | --- | --- |
| Cellxgene_census_data_fetching_cpu-2026-04-09.ipynb | Access & query CELLxGENE Census data | CPU |
| introduction_scvi_tools_tutorial_gpu-2026-04-14.ipynb | End-to-end scVI workflow on a tumor microenvironment atlas | GPU |
| Cellxgene_census_data_integration_scvi_cpu-2026-04-09.ipynb | Batch integration of multi-dataset T-cell slices | CPU |
| Cellxgene_census_data_integration_scvi_gpu-2026-04-08.ipynb | Batch integration of multi-dataset T-cell slices | GPU |

Citations for the scvi-tools and CZ CELLxGENE dataset

For this demonstration, we adapted the Introduction to scvi-tools notebook, developed by the scvi-tools development team. Users may cite the scvi-tools manuscript published in 2022 along with the original papers describing each model, which are referenced in the corresponding documentation. In this example, we applied the scVI (single-cell Variational Inference) model; its description is available in the publication Deep generative modeling for single-cell transcriptomics.

The scVI model is trained on the human single-cell RNA-seq dataset downloaded from the CZ CELLxGENE data portal. Cite the publication associated with this dataset: Single-cell resolution characterization of myeloid-derived cell states with implication in cancer outcome.

CZ CELLxGENE brings together a wide range of public single-cell datasets that have been shared through the Chan Zuckerberg Initiative platform. These datasets are uploaded by the original researchers and distributed under the Creative Commons CC BY 4.0 license. More information may be found in the CZ CELLxGENE Data Submission Policy.

Cite CZ CELLxGENE Discover: A single-cell data platform for scalable exploration, analysis and modeling of aggregated data CZI Single-Cell Biology, et al. bioRxiv 2023.10.30; doi: https://doi.org/10.1101/2023.10.30.563174

Overview of scvi-tools

scvi-tools is a software ecosystem designed for fully processing and modeling single-cell omics datasets. The project originates from work carried out in the Yosef Lab at UC Berkeley in collaboration with researchers at the Weizmann Institute of Science. The toolkit can be thought of in two parts:

  • It offers an accessible interface for applying various probabilistic methods to single-cell data (including models like scVI, scANVI, and totalVI).

  • It provides a framework for constructing new probabilistic approaches using the PyTorch, PyTorch Lightning, and Pyro libraries.

On DNAnexus, we provide a notebook that demonstrates an end-to-end single-cell RNA-seq workflow using scvi-tools, covering data preprocessing, model training, and differential expression analysis. The notebook was run with scvi-tools version 1.4.0; please refer to the release notes for more details. The scVI model description can be found in the scvi-tools user guide.
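
The preprocessing and training steps of such a workflow can be sketched as follows. This is a minimal illustration based on the public scvi-tools and scanpy APIs, not the exact code of the DNAnexus notebook. The function name train_scvi, the default column name "batch", and the values for n_top_genes and n_latent are assumptions chosen for illustration, and the sketch assumes that adata.X holds raw counts.

```python
from typing import Optional


def train_scvi(adata, batch_key: str = "batch", n_top_genes: int = 2000,
               n_latent: int = 30, max_epochs: Optional[int] = None):
    """Train scVI on raw counts and attach a batch-corrected UMAP embedding.

    Hypothetical helper for illustration; parameter defaults are assumptions.
    """
    # Imported lazily so the sketch can be defined without the heavy dependencies.
    import scanpy as sc
    import scvi

    adata = adata.copy()
    adata.layers["counts"] = adata.X.copy()  # assumption: adata.X holds raw counts

    # Keep the most variable genes, selected per batch on the raw counts.
    sc.pp.highly_variable_genes(
        adata, n_top_genes=n_top_genes, flavor="seurat_v3",
        layer="counts", batch_key=batch_key, subset=True,
    )

    # Register the count layer and the batch covariate, then fit the model.
    scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key=batch_key)
    model = scvi.model.SCVI(adata, n_latent=n_latent)
    model.train(max_epochs=max_epochs)

    # Use the latent representation for the neighbor graph and UMAP.
    adata.obsm["X_scVI"] = model.get_latent_representation()
    sc.pp.neighbors(adata, use_rep="X_scVI")
    sc.tl.umap(adata)
    return model, adata
```

A call such as model, adata = train_scvi(adata, batch_key="dataset_id") would return the fitted model and the annotated data; the column name here is only an example. Differential expression can then be run with model.differential_expression, grouping on whichever annotation column the dataset provides.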

Figure: Overview of the encoder–decoder framework for batch-corrected single-cell RNA-seq representation learning. Raw count data and covariates are encoded into a latent space that captures biological signals while removing technical effects, and subsequently decoded to reconstruct gene expression levels and dropout probabilities.
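
For readers who prefer notation, the decoder side of this framework can be summarized as below. This is a condensed sketch of the generative model as described in the scVI publication and the scvi-tools user guide, not a statement of how the notebook is configured. Here z_n is the latent representation of cell n, s_n its batch covariate, l_n its library size (observed or inferred), theta_g a gene-specific dispersion, and pi_ng the dropout probability; f_w and f_h denote decoder neural networks.

```latex
\begin{align*}
z_n      &\sim \mathrm{Normal}(0, I) \\
\rho_n   &= f_w(z_n, s_n) \\
\pi_{ng} &= f_h^{g}(z_n, s_n) \\
x_{ng}   &\sim \mathrm{ZINB}\left(\ell_n \rho_{ng},\; \theta_g,\; \pi_{ng}\right)
\end{align*}
```

The encoder approximates the posterior over z_n given the observed counts and s_n. Because the decoder also receives s_n, batch effects can be explained by the covariate rather than by the latent space.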

Overview of CZ CELLxGENE dataset

The dataset originates from the study “Single-cell resolution characterization of myeloid-derived cell states with implication in cancer outcome” and is available on CZ CELLxGENE under the title “A multi-tissue single-cell tumor microenvironment atlas”.

It aggregates nearly 400,000 single-cell transcriptomic profiles from 13 independent studies covering eight tumor and non-tumor tissue sources (including breast, colorectal, ovary, lung, liver, skin, uvea, and PBMC). It brings together samples collected from normal tissue, primary tumors, lymph nodes, and peripheral blood, generated across three commonly used single-cell RNA-seq technologies (10x, Smart-seq2, and inDrop). The atlas provides detailed annotations of major cellular populations, with a particular emphasis on characterizing myeloid-derived cell states. At DNAnexus, we have downloaded this data for your use on the platform. See the “Where to Access Data Asset” section below to start accessing the dataset.

For the data integration notebooks, a multi-dataset T-cell slice of 89,481 cells is queried directly from the CELLxGENE Census (census_version="2025-11-08") using the Python API, filtering for T cells from COVID-19 and normal blood samples across multiple publications. No manual download is required for this dataset.
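
A query of this kind can be sketched with the cellxgene_census Python package as shown below. The filter criteria, the selected columns, and the helper names build_obs_filter and fetch_t_cell_slice are assumptions made for illustration; the exact filter used in the notebooks may differ, so this sketch is not expected to reproduce the 89,481-cell slice.

```python
CENSUS_VERSION = "2025-11-08"  # Census version stated in this guide


def build_obs_filter(cell_type: str, tissue_general: str, diseases: list[str]) -> str:
    """Build a Census obs_value_filter string (illustrative criteria only)."""
    disease_list = ", ".join(f"'{d}'" for d in diseases)
    return (
        f"cell_type == '{cell_type}' "
        f"and tissue_general == '{tissue_general}' "
        f"and disease in [{disease_list}] "
        "and is_primary_data == True"
    )


def fetch_t_cell_slice(obs_filter: str):
    """Query the Census and return an AnnData object (requires network access)."""
    # Imported lazily so the filter helper above works without the package.
    import cellxgene_census

    with cellxgene_census.open_soma(census_version=CENSUS_VERSION) as census:
        return cellxgene_census.get_anndata(
            census,
            organism="Homo sapiens",
            obs_value_filter=obs_filter,
            obs_column_names=["dataset_id", "cell_type", "disease", "assay"],
        )


# Hypothetical filter: T cells from COVID-19 and normal blood samples.
obs_filter = build_obs_filter("T cell", "blood", ["COVID-19", "normal"])
print(obs_filter)
```

Calling fetch_t_cell_slice(obs_filter) streams the matching cells into memory. Restricting to is_primary_data avoids counting cells that appear in more than one Census dataset.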

Where to Access Data Asset

The following data are available on DNAnexus

To use the dataset and notebooks, please copy them into your own project space. Details on how to copy the data are given in the section titled "Copying Data and Notebooks into a Project".

Running scvi-tools on DNAnexus

Copying Data and Notebooks into a Project

To utilize the dataset, please copy the data from the project linked above into your own project. Here are the steps to copy the data into a Project Space:

  1. Create a project for your single-cell analysis, billed to your own organization. Tutorials on how to set up a project can be found on this page.

  2. Go to the Resources tab and find the project titled “Public Datasets Region”. Select the dataset project that matches your region, then select the folder "Single_cell_analysis".

  3. Select the data folder and the notebooks.

  4. Select "Copy" in the top-right menu, and choose the project that you created in Step 1.

  5. Go to the project you created in Step 1 to start exploring the CZ CELLxGENE dataset and the scvi-tools notebooks.

  6. To run the JupyterLab notebooks, see the JupyterLab section of the Academy Documentation.

Download Data from CZ CELLxGENE to DNAnexus

Here, we show how to download an example dataset from CZ CELLxGENE. The dataset is from the paper: Single-cell resolution characterization of myeloid-derived cell states with implication in cancer outcome.

  1. In the Data Availability section of the paper, open the CZ CELLxGENE dataset link: A multi-tissue single-cell tumor microenvironment atlas.

  2. On the CZ CELLxGENE page, click Download.

  3. Select Browser.

  4. In Download Details, click Copy to copy the download URL.

Please follow the Importing Data into DNAnexus tutorial to download the dataset to your project.

Running the Notebooks

  • The notebooks are optimized for the JupyterLab with Python, R, Stata, ML, Image Processing app (v2.11.0). If you do not have access, please contact the Success Team at [email protected] or the Sales Team at [email protected].

  • Load the snapshot when you launch JupyterLab: snapshot-single_cell-dxjupyterlab-2026-04-08.tar.gz in the Notebook_snapshot folder.

  • Please follow Introduction to JupyterLab to learn how to load a snapshot and launch JupyterLab.

Instance Type Selection and Kernel Selection

  • Instance start times depend on queue length. Less common instance types may have longer wait times because of their limited availability.

  • GPU instances take longer to set up than CPU instances because of their limited availability and more complex configuration.

  • Instance type availability and pricing are subject to the contract between the user or the user’s organization and DNAnexus.

| Notebook | Instance Type | Kernel |
| --- | --- | --- |
| Cellxgene_census_data_fetching_cpu-2026-04-09.ipynb | mem3_ssd1_v2_x16 | Python 3.12 CPU |
| introduction_scvi_tools_tutorial_gpu-2026-04-14.ipynb | mem2_ssd1_gpu_x16 | Python 3.12 GPU |
| Cellxgene_census_data_integration_scvi_cpu-2026-04-09.ipynb | mem3_ssd1_v2_x16 | Python 3.12 CPU |
| Cellxgene_census_data_integration_scvi_gpu-2026-04-08.ipynb | mem2_ssd1_gpu_x48 | Python 3.12 GPU |

  • Before running a notebook, follow the command-line instructions provided in that notebook and run them in the terminal.

  • In the snapshot snapshot-single_cell-dxjupyterlab-2026-04-08.tar.gz, two conda environments are pre-configured: one for CPU and one for GPU. Please select the appropriate kernel for each notebook according to the table above.

  • For introduction_scvi_tools_tutorial_gpu-2026-04-14.ipynb, ensure that the data file A_multi_tissue_single_cell_tumor_microenvironment_atlas.h5ad is available in your DNAnexus project, and update the project_id and data_path variables in the notebook before running it.
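
Before starting a long training run, it can help to confirm that the selected kernel sees a GPU and that the notebook variables have been updated. The sketch below is a hypothetical pre-flight check, not part of the provided notebooks. The helper names check_notebook_inputs and cuda_available and the example path are assumptions; only the variable names project_id and data_path and the data file name come from this guide.

```python
import re
from pathlib import PurePosixPath

EXPECTED_FILE = "A_multi_tissue_single_cell_tumor_microenvironment_atlas.h5ad"
# DNAnexus project IDs have the form "project-" followed by 24 alphanumeric characters.
PROJECT_ID_PATTERN = re.compile(r"^project-[0-9A-Za-z]{24}$")


def check_notebook_inputs(project_id: str, data_path: str) -> list[str]:
    """Return a list of problems found in the two notebook variables."""
    problems = []
    if not PROJECT_ID_PATTERN.match(project_id):
        problems.append(f"project_id {project_id!r} does not look like a DNAnexus project ID")
    if PurePosixPath(data_path).name != EXPECTED_FILE:
        problems.append(f"data_path should point to {EXPECTED_FILE}")
    return problems


def cuda_available() -> bool:
    """Report whether PyTorch is installed in this kernel and can see a CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


# Example with a placeholder project ID that has not been updated yet.
example_problems = check_notebook_inputs("project-xxxx", "/data/" + EXPECTED_FILE)
print(example_problems)
print("CUDA available:", cuda_available())
```

An empty list from check_notebook_inputs means both variables look plausible; it does not verify that the file actually exists in the project. On the GPU kernel, cuda_available() is expected to return True, and a False result there suggests the wrong kernel or instance type was selected.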

Video: Utilizing scvi-tools and CZ CELLxGENE on the DNAnexus Platform
