Introduction to Datasets
Please note: in order to use the Cohort Browser on the Platform, an Apollo License is required.
Disclaimers for the Training Dataset
Assay Type | Source | Notes
Clinical | TCGA (via cBioPortal) | Publicly available data (the "full" set of 32 studies), downloaded from cBioPortal on October 17, 2024.
Expression | TCGA (via GDC) | Publicly available data (RNA-Seq, STAR - Counts), downloaded from GDC on May 16, 2025.
Somatic | TCGA (via GDC/cBioPortal) | Derived from public SNV, CNV, and fusion data. SNV data are publicly available and were downloaded from GDC on October 17, 2024. CNV segmented copy number data (.SEG files) are publicly available and were downloaded from GDC on October 6, 2025. Fusion data are publicly available and were downloaded from cBioPortal on September 27, 2025.
Germline | Synthetic Data Only | TCGA germline data is not publicly available. This component uses simulated genotypes.
Definitions for each of the somatic variant types that were used for data ingestion are:
General Overview

You can use both phenotypic and genomic data when creating a cohort.
The phenotypic data (one database) is processed and combined with the genomic data (another database) so that the two are paired correctly. Together they form a dataset.
You can then use the dataset in Apollo to perform various actions, such as visualizing the data, analyzing all or part of it (a part is called a cohort), and collaborating with others on a particular dataset.
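As a rough illustration of the pairing step described above, the sketch below joins two small in-memory tables on a shared patient identifier and then selects a cohort. The field names (`patient_id`, `age`, `gene`) and the sample values are hypothetical, and this is not how Apollo performs ingestion internally. It only shows the idea that phenotypic and genomic records are matched per patient before they are used together.

```python
# Hypothetical phenotypic and genomic records, keyed by a shared patient ID.
phenotypic = [
    {"patient_id": "P1", "age": 61},
    {"patient_id": "P2", "age": 47},
    {"patient_id": "P3", "age": 70},
]
genomic = [
    {"patient_id": "P1", "gene": "TP53"},
    {"patient_id": "P3", "gene": "KRAS"},
    {"patient_id": "P3", "gene": "TP53"},
]


def pair_records(phenotypic, genomic):
    """Attach each patient's genomic records to their phenotypic record."""
    genes_by_patient = {}
    for row in genomic:
        genes_by_patient.setdefault(row["patient_id"], []).append(row["gene"])
    return [
        {**row, "genes": genes_by_patient.get(row["patient_id"], [])}
        for row in phenotypic
    ]


dataset = pair_records(phenotypic, genomic)

# A cohort is a filtered part of the dataset, here patients with a TP53 record.
cohort = [row["patient_id"] for row in dataset if "TP53" in row["genes"]]
print(cohort)  # ['P1', 'P3']
```

Patients without any genomic record (P2 here) stay in the paired dataset with an empty list, so a phenotypic-only filter can still find them.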
High Level Structure of Datasets
Each dataset follows the same layered structure.
First, a dataset sits on top of a database. A dataset can be copied, moved around the platform, and even deleted. A database, however, cannot be copied, moved, or deleted without repeating the ingestion process.
Datasets are the top-level structure of the data.
Each dataset has entities, which are equivalent to tables. The tables contain fields.
Fields are the variables.
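The hierarchy described above (dataset, then entities, then fields) can be sketched as a small data model. The class names, attribute names, and example values below are hypothetical and chosen only for illustration. They do not reflect Apollo's internal representation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    """A variable, such as a patient's age."""
    name: str


@dataclass
class Entity:
    """Equivalent to a table; it groups related fields."""
    name: str
    fields: List[Field] = field(default_factory=list)


@dataclass
class Dataset:
    """Top-level structure; it sits on top of a database."""
    name: str
    database: str  # name of the underlying database (hypothetical)
    entities: List[Entity] = field(default_factory=list)

    def field_paths(self) -> List[str]:
        """List every field as 'entity.field'."""
        return [f"{e.name}.{f.name}" for e in self.entities for f in e.fields]


# Hypothetical example values.
training = Dataset(
    name="training_dataset",
    database="training_db",
    entities=[
        Entity("patient", [Field("patient_id"), Field("age")]),
        Entity("medication", [Field("patient_id"), Field("drug_name")]),
    ],
)
print(training.field_paths())
```

Keeping the database name as a plain attribute mirrors the point that the dataset is a layer over the database rather than the stored data itself.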
The graphic below also explains the relationship:

Structure of a Dataset
Datasets are patient-centric. All the information links back to the patient.
This is important for filtering. If a patient, for example, takes a medication more than once during the progression of their illness, there will be more medication instances than there are people in the cohort.
Here is a summary graphic of how the data is considered to be patient-centric:
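A minimal sketch of the medication example above, using hypothetical record and field names: counting rows in a medication table gives the number of instances, while counting distinct patient IDs gives the number of people.

```python
# Hypothetical medication records; P1 received the same drug twice.
medication_records = [
    {"patient_id": "P1", "medication": "drug_a"},
    {"patient_id": "P1", "medication": "drug_a"},
    {"patient_id": "P2", "medication": "drug_a"},
]

# Number of medication instances (one per record).
instance_count = len(medication_records)

# Number of people, found by collapsing records back to the patient.
patient_count = len({record["patient_id"] for record in medication_records})

print(instance_count, patient_count)  # 3 2
```

When a filter is applied to a field like this, it helps to check which of the two counts a chart or summary is reporting.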

Datasets, Databases, and Spark

Once data is ingested, it is available as separate Spark databases. Apollo unifies access to these databases through what is called a dataset.
A dataset can be thought of as a giant multi-omics matrix.
Datasets can be further refined into cohorts within the Apollo interface, allowing complex queries across omics types.
Underlying Apollo is a technology called Spark. All data in Apollo is stored in Spark databases.
Spark is designed to handle very large datasets and to enable fast queries that a single computer could not handle.
It does this by creating RDDs (resilient distributed datasets), which are distributed across the worker nodes. Each node handles only part of the query and reports its results back, which is why queries are fast.
Spark databases mean you can query across many columns in the dataset relatively quickly, compared to using a single computer.
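The partition-and-combine idea described above can be sketched in plain Python. This is a toy illustration only: it does not use Spark or its API, the function names are hypothetical, and the threads here stand in for worker nodes without providing a real speed-up. It shows the shape of the computation, where each worker scans only its own share of the rows and a coordinator combines the partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, n_partitions):
    """Deal rows out round-robin, loosely like data spread across workers."""
    return [rows[i::n_partitions] for i in range(n_partitions)]


def count_matches(partition, predicate):
    """Work done by one 'worker': scan only its own partition."""
    return sum(1 for row in partition if predicate(row))


def distributed_count(rows, predicate, n_partitions=4):
    """Run the count on every partition, then combine the partial counts."""
    partitions = split_into_partitions(rows, n_partitions)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partial_counts = list(
            pool.map(lambda part: count_matches(part, predicate), partitions)
        )
    # The coordinator adds up what each worker reported back.
    return sum(partial_counts)


# Hypothetical rows: 1000 patients with ages cycling from 30 to 79.
rows = [{"patient_id": i, "age": 30 + (i % 50)} for i in range(1000)]
total = distributed_count(rows, lambda row: row["age"] >= 60)
print(total)  # 400
```

The result matches a single sequential pass over all rows, because the count is a sum that can be computed per partition and added afterwards.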
Datasets, Cohorts, and Dashboards

Once data is ingested, it is available as separate Spark databases. Apollo unifies access to these databases through what is called a dataset.
A dataset can be thought of as a giant multi-omics matrix.
Datasets can be further refined into cohorts within the Apollo interface, allowing complex queries across omics types.