Introduction to Datasets

Please note: in order to use the Cohort Browser on the Platform, an Apollo License is required.

Disclaimers for the Training Dataset

Assay Type: Clinical
Source: TCGA (via cBioPortal)
Notes: Data are publicly available ("full" 32 studies) and were downloaded from cBioPortal on October 17, 2024.

Assay Type: Expression
Source: TCGA (via GDC)
Notes: Data are publicly available (RNA-Seq, STAR - Counts) from GDC and were downloaded on May 16, 2025.

Assay Type: Somatic
Source: TCGA (via GDC/cBioPortal)
Notes: Derived from public SNV, CNV, and Fusion data:

  • SNV data are publicly available and were downloaded from GDC on October 17, 2024.

  • CNV data (segmented copy number, .SEG files) are publicly available and were downloaded from GDC on October 6, 2025.

  • Fusion data are publicly available and were downloaded from cBioPortal on September 27, 2025.

  • Definitions for each of the somatic variant types that were used for data ingestion are:

Assay Type: Germline
Source: Synthetic Data Only
Notes: TCGA germline data is not publicly available. This component uses simulated genotypes.

General Overview

  • You can use both the phenotypic and genomic data when creating a cohort.

  • The phenotypic data (one database) is processed and combined with the genomic data (another database) so that the two are paired appropriately. Together they form a dataset.

  • You can then use the dataset in Apollo to perform various actions, such as visualizing the data, analyzing all or part of it (a part is called a cohort), and collaborating with others on a particular dataset.
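As a rough illustration of the pairing described above, the sketch below joins a small phenotypic table with a small genomic table on a shared patient identifier, then selects a cohort using one criterion from each. The field names (`patient_id`, `age`, `gene`), the records, and the helper `pair_by_patient` are invented for illustration. They are assumptions, not Apollo's schema or API.

```python
# Hypothetical records; field names and values are illustrative only.
phenotypic = [
    {"patient_id": "P1", "age": 61},
    {"patient_id": "P2", "age": 45},
    {"patient_id": "P3", "age": 70},
]
genomic = [
    {"patient_id": "P1", "gene": "TP53"},
    {"patient_id": "P3", "gene": "KRAS"},
    {"patient_id": "P4", "gene": "TP53"},  # no matching phenotypic record
]


def pair_by_patient(pheno_rows, geno_rows):
    """Keep only patients present in both sources, merged into one record each."""
    genes_by_id = {}
    for row in geno_rows:
        genes_by_id.setdefault(row["patient_id"], []).append(row["gene"])
    paired = []
    for row in pheno_rows:
        genes = genes_by_id.get(row["patient_id"])
        if genes is not None:
            paired.append({**row, "genes": genes})
    return paired


# The paired records play the role of the dataset.
dataset = pair_by_patient(phenotypic, genomic)

# A cohort is a subset of the dataset, here selected with one phenotypic
# criterion (age) and one genomic criterion (mutated gene).
cohort = [p for p in dataset if p["age"] > 60 and "TP53" in p["genes"]]
cohort_ids = [p["patient_id"] for p in cohort]
```

In this toy data, only patients found in both sources survive the pairing, and the cohort is then a filtered view over the paired records.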

High Level Structure of Datasets

Each dataset has an important structure.

First, a dataset lies on top of a database. A dataset can be copied, moved around the platform, and even deleted. A database, however, cannot be copied, moved, or deleted without the ingestion process having to be repeated.

Datasets are the top level structure of the data.

Each dataset has entities, which are equivalent to tables. The tables contain fields.

Fields are the variables.
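The hierarchy just described (dataset, then entities, then fields) can be sketched with a few small classes. This is only a mental model under stated assumptions. The class names, attributes, and example values are hypothetical and are not Apollo's internal representation.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    """A variable (column) within an entity."""
    name: str
    dtype: str


@dataclass
class Entity:
    """Equivalent to a table; holds fields."""
    name: str
    fields: list[Field] = field(default_factory=list)


@dataclass
class Dataset:
    """Top-level structure; sits on top of a database and groups entities."""
    name: str
    database: str  # name of the underlying database (hypothetical)
    entities: list[Entity] = field(default_factory=list)

    def field_paths(self):
        """List every field as 'entity.field'."""
        return [f"{e.name}.{f.name}" for e in self.entities for f in e.fields]


# Illustrative dataset with two entities.
training = Dataset(
    name="training_dataset",
    database="training_db",
    entities=[
        Entity("patient", [Field("patient_id", "string"), Field("age", "integer")]),
        Entity("medication", [Field("patient_id", "string"), Field("drug_name", "string")]),
    ],
)
paths = training.field_paths()
```
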

The graphic below also explains the relationship:

Structure of a Dataset

Datasets are patient-centric. All the information links back to the patient.

This is important for filtering. If a patient, for example, takes a medication more than once during the progression of their illness, there will be more instances of the medication than there are people in the cohort.
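A small sketch of why the two counts differ, using made-up medication records keyed by a hypothetical `patient_id` field. The drug names and record layout are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical medication records: one row per time a drug was given.
medication_records = [
    {"patient_id": "P1", "drug": "drug_a"},
    {"patient_id": "P1", "drug": "drug_a"},  # same patient, second course
    {"patient_id": "P2", "drug": "drug_a"},
    {"patient_id": "P3", "drug": "drug_b"},
]

# Filter on the medication, then count rows versus distinct patients.
records_on_a = [r for r in medication_records if r["drug"] == "drug_a"]
n_instances = len(records_on_a)
n_patients = len({r["patient_id"] for r in records_on_a})

# Number of drug_a instances per patient.
per_patient = Counter(r["patient_id"] for r in records_on_a)
```

Here the filter matches three medication instances but only two patients, because one patient received the drug twice.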

Here is a summary graphic of how the data is considered to be patient-centric:

Datasets, Databases, and Spark

  • Once data are ingested, they are available as separate Spark databases. Apollo unifies access to these databases through what is called a dataset.

  • A dataset can be thought of as a giant multi-omics matrix.

  • Datasets can be further refined into Cohorts within the Apollo Interface, allowing complex queries across omics types.

  • Apollo is built on a technology called Spark. All data in Apollo is stored in Spark databases.

  • Spark is designed to handle very large datasets and to enable fast queries that a single computer cannot handle.

  • It does this by creating RDDs (resilient distributed datasets), which are distributed across the worker nodes. Each node handles only part of the query and reports its results back, which is why the queries are very fast.

  • Details about RDDs can be found in the Apache Spark documentation.

  • Spark databases mean you can query across many columns in the dataset relatively quickly, compared to using a single computer.
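The idea that each node handles only part of a query can be illustrated with a toy example in plain Python. This is not Spark and does not use Spark's API. The functions `split_into_partitions` and `count_matches` and the sample rows are hypothetical, and the partitions are processed one after another here, whereas worker nodes would process them in parallel.

```python
def split_into_partitions(rows, n_partitions):
    """Distribute rows round-robin across n_partitions lists."""
    partitions = [[] for _ in range(n_partitions)]
    for i, row in enumerate(rows):
        partitions[i % n_partitions].append(row)
    return partitions


def count_matches(partition, predicate):
    """Work done by one 'worker': count matching rows in its own partition."""
    return sum(1 for row in partition if predicate(row))


# Ten illustrative rows with ages 30 through 39.
rows = [{"patient_id": f"P{i}", "age": 30 + i} for i in range(10)]
partitions = split_into_partitions(rows, n_partitions=3)

# Each partition is queried independently, then the partial results
# are combined into the final answer.
partial_counts = [count_matches(p, lambda r: r["age"] >= 35) for p in partitions]
total = sum(partial_counts)
```

Because each partial count depends only on its own partition, the per-partition work can run on separate machines, and only the small partial results need to be sent back and combined.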

Datasets, Cohorts, and Dashboards

  • Once data are ingested, they are available as separate Spark databases. Apollo unifies access to these databases through what is called a dataset.

  • A dataset can be thought of as a giant multi-omics matrix.

  • Datasets can be further refined into Cohorts within the Apollo Interface, allowing complex queries across genomics types.
