Apollo Introduction
Spark: Apache framework that allows you to do machine learning and querying of very large datasets
Parquet Files: Storage format for data in Apollo, designed to enable fast queries across columns with Spark SQL.
Filter: Criteria applied to a dataset to produce a cohort, much like Excel filters.
Cohort: Subset of dataset that has had filters applied to it
Dataset: Abstraction that combines multiple databases, enabling fast querying across them using Spark.
Database: Physical storage of the fields in Parquet format.
Apollo is a layer that works on top of our Titan platform. Titan handles projects, files, and security. Titan apps are usually for secondary analysis of data.
Apollo works with datasets, which are aggregations of Apollo databases. Apollo apps are for tertiary analysis of data, enabling you to ask questions of multi-omics data (including phenotypic/clinical covariates and genomic features). Querying relies on Apache Spark's ability to query extremely large multi-omics datasets.
Titan is about managing the files, whereas Apollo is about investigating the populations.
You can use both the phenotypic and genomic data to make a cohort.
The phenotypic data (which is one database) is processed and combined with the genomic data (another database) to ensure that they are paired appropriately, and that forms a dataset.
You can then use the dataset in Apollo to perform various actions, such as visualizing the data, analyzing all or part of it (a part is called a cohort), and collaborating with others on a particular dataset.
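The pairing of phenotypic and genomic data can be pictured as a join on a shared patient identifier. The following is a minimal sketch in plain Python, not Apollo's ingestion code. The field names (`patient_id`, `age`, `variant`), the sample values, and the choice to keep only patients present in both sources are assumptions made for illustration.

```python
# Hypothetical records; field names and values are illustrative only.
phenotypic = [
    {"patient_id": "P1", "age": 54},
    {"patient_id": "P2", "age": 61},
    {"patient_id": "P3", "age": 47},
]
genomic = [
    {"patient_id": "P1", "variant": "chr1:12345A>G"},
    {"patient_id": "P3", "variant": "chr7:67890C>T"},
]


def pair_by_patient(pheno_rows, geno_rows):
    """Join two sources on patient_id, keeping only patients found in both.

    The inner-join behaviour is an assumption of this sketch.
    """
    geno_by_patient = {}
    for row in geno_rows:
        geno_by_patient.setdefault(row["patient_id"], []).append(row)

    paired = []
    for pheno in pheno_rows:
        for geno in geno_by_patient.get(pheno["patient_id"], []):
            paired.append({**pheno, **geno})
    return paired


dataset = pair_by_patient(phenotypic, genomic)
paired_patients = sorted({row["patient_id"] for row in dataset})
```

In this sketch, patient P2 has no genomic record and therefore does not appear in the paired result.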
Each dataset has an important structure.
First, a dataset sits on top of a database. A dataset can be copied, moved around the platform, and even deleted. A database, however, cannot be copied, moved, or deleted without repeating the ingestion process.
Datasets are the top level structure of the data.
Each dataset has entities, which are equivalent to tables. These contain fields.
Fields are the variables.
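The dataset, entity, and field hierarchy can be sketched as nested data classes. This is only an illustration of the relationship described above, not Apollo's internal representation. The class names and the example entities (`patient`, `medication`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataField:
    name: str  # a variable, comparable to a column in a table


@dataclass
class Entity:
    name: str  # equivalent to a table
    fields: List[DataField] = field(default_factory=list)


@dataclass
class Dataset:
    name: str  # top-level structure
    entities: List[Entity] = field(default_factory=list)

    def field_paths(self):
        """List every field as 'entity.field', in declaration order."""
        return [f"{e.name}.{f.name}" for e in self.entities for f in e.fields]


# Hypothetical example: two entities, each with two fields.
example = Dataset(
    "example_dataset",
    [
        Entity("patient", [DataField("patient_id"), DataField("age")]),
        Entity("medication", [DataField("patient_id"), DataField("drug_name")]),
    ],
)
paths = example.field_paths()
```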
The graphic below also explains the relationship:
Datasets are patient-centric. All the information goes back to the patient.
This is important for filtering. If a patient, for example, takes a medication more than once during the progression of their illness, there will be more medication instances than there are people in the cohort.
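The difference between counting instances and counting patients can be shown with a small plain-Python sketch. The record layout, the drug names, and the helper `filter_records` are hypothetical and only illustrate the idea.

```python
# Hypothetical medication records; one row per time a medication was taken.
medication_records = [
    {"patient_id": "P1", "drug_name": "drug_a"},
    {"patient_id": "P1", "drug_name": "drug_a"},  # same patient, second course
    {"patient_id": "P2", "drug_name": "drug_a"},
    {"patient_id": "P3", "drug_name": "drug_b"},
]


def filter_records(records, drug_name):
    """Keep only the records for the given medication."""
    return [r for r in records if r["drug_name"] == drug_name]


matching = filter_records(medication_records, "drug_a")

# Number of medication instances that pass the filter.
record_count = len(matching)

# The cohort is the set of distinct patients behind those instances.
cohort = sorted({r["patient_id"] for r in matching})
patient_count = len(cohort)
```

Here three records match the filter, but the cohort contains only two patients, because P1 took the medication twice.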
Here is a summary graphic of how the data is considered to be patient-centric:
Once data is ingested, it is available as separate Spark databases. Apollo unifies access to the data in these databases through what's called a dataset.
A dataset can be thought of as a giant multi-omics matrix.
Datasets can be further refined into cohorts within the Apollo interface, allowing complex queries across omics types.
Underlying Apollo is a technology called Spark. All data in Apollo is stored in Spark databases.
Spark is made to handle very large datasets and to run fast queries that a single computer cannot handle.
It does this by creating RDDs (resilient distributed datasets), which are distributed across the worker nodes. Each node handles only part of the query and reports its results back, which is why the queries are very fast.
Spark databases mean you can query across many columns in the dataset relatively quickly, compared to using a single computer.
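The idea of splitting a query across workers and combining the partial results can be sketched without Spark. The example below is plain Python and does not use any Spark API. Threads stand in for worker nodes, and the function names and the sample `ages` data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, n_partitions):
    """Deal rows round-robin into n_partitions lists."""
    return [rows[i::n_partitions] for i in range(n_partitions)]


def count_matches(partition, threshold):
    """Work done by one 'worker': count values in its partition above threshold."""
    return sum(1 for value in partition if value > threshold)


def distributed_count(rows, threshold, n_partitions=4):
    """Run the count on each partition separately, then add up the partial results."""
    partitions = split_into_partitions(rows, n_partitions)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partial_counts = list(
            pool.map(lambda p: count_matches(p, threshold), partitions)
        )
    return sum(partial_counts)


# Hypothetical data: ages 0..99; count how many are above 49.
ages = list(range(100))
total = distributed_count(ages, threshold=49)
```

Each worker sees only its own partition, and the final answer is the sum of the partial counts. This mirrors, in a much simplified form, how each node handles part of a query and reports back.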
You can build a lot on top of the Spark databases. The visual shows the process of getting data into Apollo and what you can do with it.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Details about RDDs can be found and