Cloud Computing for HPC Users

HPC vs the DNAnexus Platform

| Component | HPC | DNAnexus Platform |
| --- | --- | --- |
| Driver/Requestor | Head node of the cluster | API server |
| Submission Script Language | Portable Batch System (PBS) or SLURM | dx-toolkit |
| Worker | Requested from a pool of machines in a private cluster | Requested from a pool of machines in AWS/Azure |
| Shared Storage | Shared file system for all nodes (Lustre, GPFS, etc.) | Project storage (Amazon S3/Azure storage) |
| Worker File I/O | Handled by the shared file system | Must be transferred to and from project storage by commands on the worker |

Key Players with an HPC

  • With an HPC, there is a collection of specialized hardware, including mainframe-class computers, together with a distributed processing software framework, so that this very large computer system can handle massive amounts of data and processing at high speed.

  • The goal of an HPC is to keep the files on its hardware and to run the analysis there as well. In this way, it is similar to a local computer, but with more specialized hardware and software providing greater storage and processing power.

  • Your computer: communicates with the HPC cluster to request resources.

  • HPC Cluster

    • Shared Storage: a common area where files are stored. Directories may branch out by user or follow another layout.

    • Head Node: manages the workers and the shared storage.

    • HPC Worker: part of the HPC cluster; this is where the computation runs.

  • These components work together to increase processing power and to manage jobs and queues, so that jobs run as soon as the required number of workers becomes available.

Key Players in Cloud Computing

  • In comparison, cloud computing adds layers to the analysis in order to increase computational power and storage.

  • This relationship and the layers involved are shown in the figure below:

  • Let's contrast this with processing a file on the DNAnexus platform.

    • We'll start with our computer, the DNAnexus platform, and a file from project storage.

    • We first use the dx run command, requesting to run an app on a file in project storage. This request is then sent to the platform, and an appropriate worker from the pool of workers is made available.

    • When the worker is available, we can transfer a file from the project to the worker.

    • The platform handles installing the app and its software environment to the worker as well.

    • Once our app is ready and our file is set, we can run the computation on the worker.

    • Any files that we generate must be transferred back into project storage (see the sketch below).
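
To make this concrete, here is a minimal sketch of that cycle as it looks from your own terminal. It assumes the dx-toolkit is installed and uses the platform's Swiss Army Knife app; the project name, folder, and file name are hypothetical.

```bash
# Log in and select the project whose storage holds the input
# ("My-Research-Project" and the /raw folder are hypothetical)
dx login
dx select "My-Research-Project"

# Request a worker and tell it what to do. The platform provisions the worker,
# installs the app, transfers the input file from project storage, runs the
# command, and uploads anything the command writes back into project storage.
# --brief prints only the new job ID; --yes skips the confirmation prompt.
job_id=$(dx run app-swiss-army-knife \
    -iin="/raw/sample1.fastq.gz" \
    -icmd="fastqc sample1.fastq.gz" \
    --brief --yes)

# Follow the job's logs from your computer while the worker does the work
dx watch "$job_id"
```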

Key Differences

  • HPC jobs are limited by how many workers are physically present in the HPC cluster.

  • Cloud platforms are built on newer, more scalable architecture, and workers are requested on demand, so jobs tend to run more efficiently.

Transferring Files

  • One common barrier is getting our files from project storage onto the worker so that we can run computations with them there. The other barrier we'll review is getting the output files we've generated from the worker back into project storage.

  • Cloud computing has a nested structure, and transferring files between those layers can make it difficult to learn.

  • A mental model of how cloud computing works can help us overcome these barriers.

Resolution:

  • Cloud computing is indirect, and you need to think two steps ahead.

  • Here is the visual for thinking through the steps of file management:
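
As a complement to that visual, here is a minimal sketch of the same two-steps-ahead flow from the worker's point of view, for example inside an interactive session such as Cloud Workstation or TTYD (the file names and the /raw folder are hypothetical):

```bash
# Nothing is on the worker yet: first pull the input down from project storage
dx download /raw/sample1.fastq.gz

# Compute on the worker's local copy with whatever tool is installed there
zcat sample1.fastq.gz | wc -l > sample1.line_count.txt

# Push the result back into project storage; anything left only on the worker
# is lost when the job or session ends
dx upload sample1.line_count.txt
```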

Running apps

Creating apps and running them is covered later in the documentation.

Apps serve to (at minimum):

  1. Request an EC2/Azure worker

  2. Configure the worker's environment

  3. Establish data transfer to and from project storage (see the sketch below)
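
As a preview (building applets is covered in detail later), here is a minimal, hypothetical bash applet script showing where those responsibilities surface in code. Steps 1 and 2 are handled by the platform before the script starts; the helper commands dx-download-all-inputs and dx-upload-all-outputs come from the dx-toolkit environment on the worker, and the input/output names ("reads", "counts") are placeholders that would have to match the applet's dxapp.json.

```bash
#!/bin/bash
# Hypothetical applet entry point; the worker is already provisioned and its
# environment configured by the time this runs.
set -e -x -o pipefail

main() {
    # Step 3 (inbound): fetch the declared inputs from project storage
    # into ~/in/<input_name>/ on the worker
    dx-download-all-inputs

    # Do the actual work on the local copies (placeholder command)
    mkdir -p out/counts
    zcat in/reads/* | wc -l > out/counts/line_count.txt

    # Step 3 (outbound): upload everything under ~/out/<output_name>/
    # back into project storage as the applet's outputs
    dx-upload-all-outputs
}
```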

Why do this with DNAnexus?

  • Highly secure platform with built-in compliance infrastructure

  • Fully configurable platform

    • User can run single scripts to fully-automated, production-level workflows

  • Data transfer designed to be fast and efficient

    • Read and analyze massive files directly using dxfuse (see the example after this list)

  • Instances are configured for you via apps

    • Variety of ways to configure your own environments

    • Largest Azure instances: ~4 TB RAM

    • Largest AWS instances: ~2 TB RAM
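
For example, in environments where the project is mounted through dxfuse (DNAnexus JupyterLab mounts the current project read-only at /mnt/project), large files can be read in place without downloading a local copy first; the file path below is hypothetical.

```bash
# Peek at a large project file directly through the dxfuse mount
zcat /mnt/project/raw/sample1.fastq.gz | head -n 8
```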

Equivalent Commands

| Task | dx-toolkit | PBS | SLURM |
| --- | --- | --- | --- |
| Run Job | `dx run <app-id>` | `qsub <script>` | `sbatch <script>` |
| Monitor Job | `dx find jobs` | `qstat` | `squeue` |
| Kill Job | `dx terminate <jobid>` | `qdel <jobid>` | `scancel <jobid>` |
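
A short session showing the same mapping in practice (the app ID and job ID are placeholders):

```bash
# Submit a job (counterpart of qsub / sbatch)
dx run <app-id> --yes

# Check the status of your jobs (counterpart of qstat / squeue)
dx find jobs

# Cancel a running job by its ID (counterpart of qdel / scancel)
dx terminate job-XXXXXXXXXXXXXXXXXXXXXXXX
```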

Practical Approaches

  • Single Job

    • Use `dx run` on the CLI directly

    • Use `dx run` in a shell script

    • Use a shell script to call `dx run` on multiple files (see the loop sketch below)

    • Use dxfuse to directly access files (read only)
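
A minimal sketch of the shell-script approach, assuming FASTQ files sit in a hypothetical /raw folder of the currently selected project and that the Swiss Army Knife app runs the per-file command:

```bash
#!/bin/bash
# Launch one job per FASTQ file found in the project's /raw folder
for f in $(dx ls "/raw/*.fastq.gz"); do
    dx run app-swiss-army-knife \
        -iin="/raw/${f}" \
        -icmd="fastqc ${f}" \
        --brief --yes
done
```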

Batch Processing Comparisons

| Step | HPC Recipe | Cloud Recipe |
| --- | --- | --- |
| 1 | List files | List files |
| 2 | Request 1 worker per file | Loop over each file: 1) use `dx run`, 2) transfer the file, and 3) run commands |
| 3 | Use array IDs to process 1 file per worker | |
| 4 | Submit the job to the head node | |
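
For larger batches, the dx-toolkit can build the loop for you: dx generate_batch_inputs writes one or more TSV files describing the batch, and dx run --batch-tsv launches one job per row (see the Resources section below). A sketch, assuming paired FASTQ files named like sampleA_1.fastq.gz / sampleA_2.fastq.gz and a hypothetical applet whose inputs are named reads1 and reads2:

```bash
# Group files into batches using the part of the file name captured in parentheses
dx generate_batch_inputs \
    -ireads1='(.*)_1\.fastq\.gz' \
    -ireads2='(.*)_2\.fastq\.gz'

# The command above writes dx_batch.0000.tsv (and further TSVs for large batches);
# launch one job per row
dx run my-trimming-applet --batch-tsv dx_batch.0000.tsv --yes
```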

Resources

To create a support ticket if there are technical issues:

  1. Go to the Help header (in the same menu bar as Projects and Tools) inside the platform

  2. Select "Contact Support"

  3. Fill in the Subject and Message to submit a support ticket.

  • Access to the wealth of AWS/Azure resources

  • Batch Processing: `dx generate_batch_inputs` / `dx run --batch-tsv`

  • Full Documentation