scvi-tools and CZ CELLxGENE
Necessary Disclaimers and Legal
The user is responsible for reviewing and complying with the license requirements of the software, notebooks, and data referenced in this documentation.
Users are responsible for the costs of analyzing the CZ CELLxGENE dataset with scvi-tools and of storing the data in their project spaces.
Instance type availability and pricing are subject to the contract between the user or the user’s organization and DNAnexus.
Introduction
This user guide presents a series of four Jupyter notebooks for single-cell RNA-seq (scRNA-seq) analysis on the DNAnexus platform. The workflows leverage two key resources: scvi-tools, a Python library for probabilistic deep learning and generative modeling of single-cell omics data, and CELLxGENE Census, a large-scale data resource that provides standardized, programmatic access to millions of annotated human and mouse cells.
The notebooks guide users through a complete analytical workflow, from querying and retrieving datasets from the Census, to preprocessing raw count matrices, training variational autoencoder (VAE) models with scVI, and visualizing batch-corrected embeddings with UMAP. Both CPU and GPU compute environments are supported, with GPU-accelerated notebooks recommended for faster model training on large datasets.
All notebooks are built to run within the JupyterLab with Python, R, Stata, ML, Image Processing app on DNAnexus, using a pre-configured snapshot that eliminates the need for manual package installation.
List of notebooks
| Notebook | Purpose | Compute |
| --- | --- | --- |
| Cellxgene_census_data_fetching_cpu-2026-04-09.ipynb | Access & query CELLxGENE Census data | CPU |
| introduction_scvi_tools_tutorial_gpu-2026-04-14.ipynb | End-to-end scVI workflow on a tumor microenvironment atlas | GPU |
| Cellxgene_census_data_integration_scvi_cpu-2026-04-09.ipynb | Batch integration of multi-dataset T-cell slices | CPU |
| Cellxgene_census_data_integration_scvi_gpu-2026-04-08.ipynb | Batch integration of multi-dataset T-cell slices | GPU |
Citations for the scvi-tools and CZ CELLxGENE dataset
For this demonstration, we adapted the Introduction to scvi-tools notebook, developed by the scvi-tools development team. Users may cite the scvi-tools manuscript published in 2022 along with the original papers describing each model, which are referenced in the corresponding documentation. In this example, we applied the scVI (single-cell Variational Inference) model; its description is available in the publication Deep generative modeling for single-cell transcriptomics.
The scVI model is trained on the human single-cell RNA-seq dataset downloaded from the CZ CELLxGENE data portal. Cite the publication associated with this dataset: Single-cell resolution characterization of myeloid-derived cell states with implication in cancer outcome.
CZ CELLxGENE brings together a wide range of public single-cell datasets that have been shared through the Chan Zuckerberg Initiative platform. These datasets are uploaded by the original researchers and distributed under the Creative Commons CC BY 4.0 license. More information may be found in the CZ CELLxGENE Data Submission Policy.
Cite: CZ CELLxGENE Discover: A single-cell data platform for scalable exploration, analysis and modeling of aggregated data. CZI Single-Cell Biology, et al. bioRxiv 2023.10.30; doi: https://doi.org/10.1101/2023.10.30.563174
Overview of scvi-tools
scvi-tools is a software ecosystem designed for fully processing and modeling single-cell omics datasets. The project originates from work carried out in the Yosef Lab at UC Berkeley in collaboration with researchers at the Weizmann Institute of Science. The toolkit can be thought of in two parts:
it offers an accessible interface for applying various probabilistic methods to single-cell data (including models like scVI, scANVI, and totalVI), and
it provides a framework for constructing new probabilistic approaches using the PyTorch, PyTorch Lightning, and Pyro libraries.
On DNAnexus, we provide a notebook that demonstrates an end-to-end single-cell RNA-seq workflow using scvi-tools, covering data preprocessing, model training, and differential expression analysis. The notebook was run with scvi-tools version 1.4.0. Please refer to the release notes for more details. The scVI model description can be found in the scvi-tools user guide.
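The workflow that notebook walks through can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the notebook's actual code: it assumes `adata` is an AnnData object with raw counts in `adata.X`, that scanpy, scvi-tools, and PyTorch are installed (plus scikit-misc, which the `seurat_v3` flavor needs), and that `batch_key` names a column in `adata.obs`. The helper names `pick_accelerator` and `run_scvi` and the `n_top_genes` default are illustrative only.

```python
def pick_accelerator(gpu_available: bool) -> str:
    """Map GPU availability to the `accelerator` argument of `model.train`."""
    return "gpu" if gpu_available else "cpu"


def run_scvi(adata, batch_key: str, n_top_genes: int = 2000, max_epochs=None):
    """Train scVI on the raw counts in `adata.X` and add a UMAP of the latent space.

    Hypothetical helper: the preprocessing choices below are assumptions.
    """
    # Imported lazily so `pick_accelerator` works without these packages.
    import scanpy as sc
    import scvi
    import torch

    # Keep an untouched copy of the counts: scVI models counts, not log values.
    adata.layers["counts"] = adata.X.copy()
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)

    # Select highly variable genes from the count layer, per batch.
    sc.pp.highly_variable_genes(
        adata,
        n_top_genes=n_top_genes,
        flavor="seurat_v3",
        layer="counts",
        batch_key=batch_key,
        subset=True,
    )

    # Register the count layer and batch column, then train the VAE.
    scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key=batch_key)
    model = scvi.model.SCVI(adata)
    model.train(
        max_epochs=max_epochs,
        accelerator=pick_accelerator(torch.cuda.is_available()),
    )

    # Store the batch-corrected embedding and compute a UMAP on it.
    adata.obsm["X_scVI"] = model.get_latent_representation()
    sc.pp.neighbors(adata, use_rep="X_scVI")
    sc.tl.umap(adata)
    return model, adata
```

After training, `model.differential_expression(groupby=...)` can be called with an `adata.obs` column of cell labels; which column is appropriate depends on the dataset.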

Overview of CZ CELLxGENE dataset
The dataset originates from the study “Single-cell resolution characterization of myeloid-derived cell states with implication in cancer outcome” and is available on CZ CELLxGENE under the title “A multi-tissue single-cell tumor microenvironment atlas”.
It aggregates nearly 400,000 single-cell transcriptomic profiles from 13 independent studies covering eight tumor and non-tumor tissue sources (including breast, colorectal, ovary, lung, liver, skin, uvea, and PBMC). It brings together samples collected from normal tissue, primary tumors, lymph nodes, and peripheral blood, generated across three commonly used single-cell RNA-seq technologies (10x, Smart-seq2, and inDrop). The atlas provides detailed annotations of major cellular populations, with a particular emphasis on characterizing myeloid-derived cell states. At DNAnexus, we have downloaded this data for your use on the platform. See the “Where to Access Data Asset” section below to start accessing the dataset.
For the data integration notebooks, a multi-dataset T-cell slice of 89,481 cells is queried directly from the CELLxGENE Census (census_version="2025-11-08") using the Python API, filtering for T cells from COVID-19 and normal blood samples across multiple publications. No manual download is required for this dataset.
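A Census query of this kind might look like the sketch below. It assumes the `cellxgene_census` package is installed and that the slice can be expressed with the `cell_type`, `disease`, `tissue_general`, and `is_primary_data` columns of the Census cell metadata; the exact filter used in the notebooks may differ, and the helper names `build_obs_filter` and `fetch_slice` are hypothetical.

```python
def build_obs_filter(cell_type: str, diseases: list[str], tissue_general: str) -> str:
    """Build a Census `obs_value_filter` expression (illustrative columns only)."""
    disease_list = ", ".join(f"'{d}'" for d in diseases)
    return (
        f"cell_type == '{cell_type}' and disease in [{disease_list}] "
        f"and tissue_general == '{tissue_general}' and is_primary_data == True"
    )


def fetch_slice(obs_filter: str, census_version: str = "2025-11-08"):
    """Open the pinned Census release and return the matching cells as AnnData."""
    # Imported lazily so the filter builder can be used without the package.
    import cellxgene_census

    with cellxgene_census.open_soma(census_version=census_version) as census:
        return cellxgene_census.get_anndata(
            census,
            organism="Homo sapiens",
            obs_value_filter=obs_filter,
        )


# Assumed filter values for a T-cell slice from COVID-19 and normal blood.
t_cell_filter = build_obs_filter("T cell", ["COVID-19", "normal"], "blood")
```

Pinning `census_version` keeps the query reproducible, since each Census release can contain a different set of cells.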
Where to Access Data Asset
The following data are available on DNAnexus:
The AnnData file of “A multi-tissue single-cell tumor microenvironment atlas” was directly downloaded from CZ CELLxGENE portal. The file is stored in the DNAnexus project folder under the name: A_multi_tissue_single_cell_tumor_microenvironment_atlas.h5ad. The location of this file on the platform is here for AWS US East, here for AWS Europe (Frankfurt), here for AWS Europe (London), here for Azure Amsterdam, and here for Azure US (West).
Four example notebooks demonstrating the scvi-tools analysis workflows can be accessed on the platform here for AWS US East, here for AWS Europe (Frankfurt), here for AWS Europe (London), here for Azure Amsterdam, and here for Azure US (West). The file endings are .ipynb.
To use the dataset and notebooks, please copy them into your own project space. Details on how to copy the data are given in the section titled "Copying Data and Notebooks into a Project".
Running scvi-tools on DNAnexus
Copying Data and Notebooks into a Project
To utilize the dataset, please copy the data from the project linked above into your own project. Here are the steps to copy the data into a Project Space:
1. Create a project for your single-cell analysis, billed to your own organization. Tutorials on how to set up a project can be found on this page.
2. Go to the Resources tab and find the project titled “Public Datasets Region”. Select the dataset project that matches your region, then select the folder "Single_cell_analysis".
3. Select the data folder and the notebooks.
4. Select "Copy" in the top-right menu, and choose the project that you created in Step 1.
Then go to the project you created in Step 1 to start exploring the CZ CELLxGENE dataset and the scvi-tools notebooks.
To run the JupyterLab notebooks, please see the JupyterLab section of the Academy Documentation.
Download Data from CZ CELLxGENE to DNAnexus
Here, we show how to download an example dataset from CZ CELLxGENE. The dataset is from the paper: Single-cell resolution characterization of myeloid-derived cell states with implication in cancer outcome.
1. In the Data Availability section of the paper, open the CZ CELLxGENE dataset link: A multi-tissue single-cell tumor microenvironment atlas.
2. On the CZ CELLxGENE page, click Download.
3. Select Browser.
4. In Download Details, click Copy to copy the download URL.
5. Follow the Importing Data into DNAnexus tutorial to import the dataset into your project.
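As a hedged alternative to the tutorial referenced above, the copied URL can also be fetched from a terminal or notebook and then uploaded with the `dx` command-line client. The sketch below is not the tutorial's method: it uses only the Python standard library to stream the file to disk, and it builds (but does not run) a `dx upload` command. The function names and the `project-xxxx` placeholder are hypothetical.

```python
import shutil
import urllib.request
from pathlib import Path


def download_to_file(url: str, dest: Path, chunk_size: int = 1 << 20) -> Path:
    """Stream `url` to `dest` in chunks so a large .h5ad is never held in memory."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out, length=chunk_size)
    return dest


def dx_upload_command(local_path: Path, project_id: str, folder: str) -> list[str]:
    """Build a `dx upload` command that places the file in `project_id:folder/`."""
    target = f"{project_id}:{folder.rstrip('/')}/"
    return ["dx", "upload", str(local_path), "--path", target]
```

The returned list can be passed to `subprocess.run(..., check=True)` once the project ID and folder have been replaced with real values.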
Running the Notebooks
The notebooks are optimized for the JupyterLab with Python, R, Stata, ML, Image Processing app (v2.11.0). If you do not have access, please contact the Success Team at [email protected] or the Sales Team at [email protected].
Load the following snapshot when you launch JupyterLab: snapshot-single_cell-dxjupyterlab-2026-04-08.tar.gz, located in the Notebook_snapshot folder.
Please follow Introduction to JupyterLab to learn how to load a snapshot and launch JupyterLab.
Instance Type Selection and Kernel Selection
Instance start times depend on the queue for each instance type. Less common instance types may have longer wait times because of their limited availability.
GPU instances take longer to set up than CPU instances because of their limited availability and greater complexity.
Instance type availability and pricing are subject to the contract between the user or the user’s organization and DNAnexus.
| Notebook | Instance Type | Kernel |
| --- | --- | --- |
| Cellxgene_census_data_fetching_cpu-2026-04-09.ipynb | mem3_ssd1_v2_x16 | Python 3.12 CPU |
| introduction_scvi_tools_tutorial_gpu-2026-04-14.ipynb | mem2_ssd1_gpu_x16 | Python 3.12 GPU |
| Cellxgene_census_data_integration_scvi_cpu-2026-04-09.ipynb | mem3_ssd1_v2_x16 | Python 3.12 CPU |
| Cellxgene_census_data_integration_scvi_gpu-2026-04-08.ipynb | mem2_ssd1_gpu_x48 | Python 3.12 GPU |
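In the table above, the kernel follows the `_cpu-` or `_gpu-` marker in each notebook's file name. A small sanity check along these lines can help catch a mismatched kernel before a long training run. This is an illustrative sketch: `expected_kernel` and `gpu_visible` are hypothetical helpers, and the GPU check assumes PyTorch is installed in the active kernel.

```python
def expected_kernel(notebook_name: str) -> str:
    """Infer the kernel from the `_cpu-` / `_gpu-` marker in the notebook file name."""
    if "_gpu-" in notebook_name:
        return "Python 3.12 GPU"
    if "_cpu-" in notebook_name:
        return "Python 3.12 CPU"
    raise ValueError(f"no _cpu-/_gpu- marker in {notebook_name!r}")


def gpu_visible() -> bool:
    """Return True if PyTorch in the active kernel can see a CUDA device."""
    # Imported lazily so `expected_kernel` works without PyTorch.
    import torch

    return torch.cuda.is_available()
```

In a GPU notebook, `gpu_visible()` returning `False` suggests that either the CPU kernel or a CPU instance type was selected.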
Before running a notebook, run the command-line instructions given in that notebook in a terminal.
In the snapshot snapshot-single_cell-dxjupyterlab-2026-04-08.tar.gz, two conda environments are pre-configured: one for CPU and one for GPU. Please select the appropriate kernel for each notebook according to the table above.
For introduction_scvi_tools_tutorial_gpu-2026-04-14.ipynb, ensure the data file A_multi_tissue_single_cell_tumor_microenvironment_atlas.h5ad is available in your DNAnexus project, and update the project_id and data_path variables in the notebook before running it.
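How the notebook consumes `project_id` and `data_path` is not shown here, so the following is only a sketch of one plausible pattern: join the two values into a `project:/path` string, fetch the file with `dx download`, and read it with anndata. It assumes `project_id` has the form `project-xxxx`, that `data_path` is the file's path inside the project, and that the `dx` client and anndata are available; both function names are hypothetical.

```python
def dx_file_spec(project_id: str, data_path: str) -> str:
    """Join a project ID and a file path into a `project:/path` string."""
    if not project_id.startswith("project-"):
        raise ValueError("project_id should look like 'project-xxxx'")
    return f"{project_id}:/{data_path.lstrip('/')}"


def load_atlas(project_id: str, data_path: str):
    """Download the .h5ad from the project and load it as an AnnData object."""
    # Imported lazily so `dx_file_spec` works without anndata installed.
    import subprocess
    from pathlib import PurePosixPath

    import anndata as ad

    local_name = PurePosixPath(data_path).name
    # `-o` sets the local file name; `-f` overwrites an existing local copy.
    subprocess.run(
        ["dx", "download", dx_file_spec(project_id, data_path), "-o", local_name, "-f"],
        check=True,
    )
    return ad.read_h5ad(local_name)
```

Validating the project ID up front gives a clearer error than a failed download when the placeholder value has not been replaced.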
Video: Utilizing scvi-tools and CZ CELLxGENE on the DNAnexus Platform