Academy Documentation
Apollo Introduction


Last updated 4 months ago


Glossary of Terms

  • Spark: An Apache framework for machine learning and for querying very large datasets

  • Parquet Files: The storage format for data in Apollo, designed to enable fast queries across columns with Spark SQL

  • Filter: Criteria applied to a dataset to produce a cohort, much like filters in Excel

  • Cohort: A subset of a dataset that has had filters applied to it

  • Dataset: An abstraction that combines multiple databases, enabling fast querying across them using Spark

  • Database: The physical storage of the fields, in Parquet format

Apollo vs Titan

  • Apollo is a layer that works on top of our Titan platform. Titan handles projects, files, and security. Titan apps are usually for secondary analysis of data.

  • Apollo works with datasets, which are aggregations of Apollo databases. Apollo apps are for tertiary analysis of data, enabling you to ask questions of multi-omics data (including phenotypic/clinical covariates and genomic features). Querying leverages Apache Spark's ability to query extremely large multi-omics datasets.

  • Titan is about managing the files, whereas Apollo is about investigating the populations.

Apollo Structure

General Overview

  • You can use both the phenotypic and genomic data to build a cohort

  • The phenotypic data (one database) is processed and combined with the genomic data (another database) to ensure that they are paired appropriately; together they form a dataset

  • You can then use the dataset in Apollo to perform various actions, such as visualizing the data, analyzing all or part of it (a cohort), and collaborating with others on a particular dataset
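To make the filter-to-cohort relationship concrete, here is a minimal sketch in plain Python. The records and filters are purely illustrative, not the actual Apollo API:

```python
# Hypothetical patient records; these fields are illustrative only.
dataset = [
    {"patient_id": "P1", "age": 67, "diagnosis": "T2D"},
    {"patient_id": "P2", "age": 45, "diagnosis": "T2D"},
    {"patient_id": "P3", "age": 71, "diagnosis": "CAD"},
]

# Filters are criteria applied to the dataset, much like Excel filters.
filters = [
    lambda rec: rec["age"] >= 60,           # phenotypic criterion
    lambda rec: rec["diagnosis"] == "T2D",  # clinical criterion
]

# A cohort is the subset of the dataset that passes every filter.
cohort = [rec for rec in dataset if all(f(rec) for f in filters)]
print([rec["patient_id"] for rec in cohort])  # → ['P1']
```

In Apollo the filtering happens through the Cohort Browser interface rather than hand-written predicates, but the underlying idea is the same: filters applied to a dataset yield a cohort.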

High Level Structure of Apollo Datasets

Each dataset has an important structure.

First, a dataset lies on top of a database. A dataset can be copied, moved around the platform, and even deleted. A database, however, cannot be copied, moved, or deleted without repeating the ingestion process.

Datasets are the top level structure of the data.

Each dataset has entities, which are equivalent to tables. These contain fields.

Fields are the variables.
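This dataset → entity → field hierarchy can be sketched with plain Python data structures. The names below are illustrative, not the platform's actual schema:

```python
# Illustrative sketch of the hierarchy: a dataset contains entities
# (equivalent to tables), and each entity contains fields (variables).
dataset = {
    "name": "example_dataset",  # hypothetical dataset name
    "entities": {
        "patient": {            # entity ≈ table
            "fields": ["patient_id", "age", "sex"],  # fields ≈ variables
        },
        "medication": {
            "fields": ["patient_id", "drug_name", "date_prescribed"],
        },
    },
}

# Every entity links back to the patient via a shared identifier field.
for name, entity in dataset["entities"].items():
    assert "patient_id" in entity["fields"]
```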

The graphic below also explains the relationship:

Structure of a Dataset

Datasets are patient-centric. All the information links back to the patient.

This is important for filtering. If a patient, for example, takes a medication more than once during the progression of their illness, there will be more instances of that medication than there are people in the cohort.
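The distinction between instances and patients can be sketched in plain Python (hypothetical records, not platform code):

```python
# Hypothetical medication records: P1 took the same drug twice,
# so there are more medication instances than there are patients.
medication_records = [
    {"patient_id": "P1", "drug": "metformin", "date": "2021-01-10"},
    {"patient_id": "P1", "drug": "metformin", "date": "2021-06-02"},
    {"patient_id": "P2", "drug": "metformin", "date": "2021-03-15"},
]

# Counting rows counts instances; counting distinct patient IDs
# counts people. Patient-centric filtering uses the latter.
n_instances = len(medication_records)
n_patients = len({rec["patient_id"] for rec in medication_records})

print(n_instances, n_patients)  # → 3 2
```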

Here is a summary graphic of how the data is considered to be patient-centric:

Datasets, Databases, and Spark

  • Once data is ingested, it is available as separate Spark databases. Apollo unifies access to these databases through what's called a dataset.

  • A dataset can be thought of as a giant multi-omics matrix.

  • Datasets can be further refined into cohorts within the Apollo interface, allowing complex queries across omics types.

  • Underlying Apollo is a technology called Spark. All data in Apollo is stored in Spark databases.

  • Spark is made to handle very large datasets and enable fast queries that can't be handled by single computers.

  • It does this by creating RDDs (resilient distributed datasets), which are distributed across worker nodes. Each node handles only part of the query and reports its results back, which is why queries are very fast.

  • Spark databases mean you can query across many columns in the dataset relatively quickly, compared to using a single computer.
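The partition-and-combine idea behind RDDs can be sketched in plain Python. This is a simplified single-machine analogy, not Spark itself:

```python
from functools import reduce

# Simplified analogy of a distributed query: the data is split into
# partitions, each "worker" computes a partial result on its own
# partition, and the partial results are combined at the end.
data = list(range(1, 101))

def partition(seq, n):
    """Split seq into n equal chunks (one per hypothetical worker node)."""
    k = len(seq) // n
    return [seq[i * k:(i + 1) * k] for i in range(n)]

partitions = partition(data, 4)

# Each worker sums only its own partition...
partial_sums = [sum(chunk) for chunk in partitions]

# ...and the driver combines the partial results.
total = reduce(lambda a, b: a + b, partial_sums)
print(total)  # → 5050
```

In real Spark the partitions live on different machines and the work happens in parallel, which is what makes queries over very large datasets fast.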


Solutions with Apollo

You can build a lot on top of the Spark databases, and the visual below showcases the process of getting data into Apollo and what you can do with it.

Resources

To create a support ticket if there are technical issues:

  1. Go to the Help header (same section where Projects and Tools are) inside the platform

  2. Select "Contact Support"

  3. Fill in the Subject and Message to submit a support ticket.

Details about RDDs can be found here and here.

Full Documentation