Overview of Nextflow


How Nextflow Works Locally

Nextflow pipelines are composed of processes: for example, a task such as fastqc would be one process, and read trimming would be another. Processes pass files between them using channels (queues), so every process usually has an input and an output channel. Nextflow is implicitly parallel - if it can run something in parallel, it will! There is no need to loop over channels.

For example, you could have a script with fastqc and read_trimming processes that both take in a fastq reads channel. As these two processes have no links between them, they will run at the same time.

The Nextflow workflow file is called main.nf.

Let's think about a quick workflow that takes in some single-end fastq files, runs fastqc on them, trims them, runs fastqc again, and finally runs multiqc on the fastqc outputs.

Below is example code that would achieve this workflow (the individual process scripts are not shown here; a sketch of one such process is shown after the example).

nextflow.enable.dsl=2

// params.fastq_dir will be exposed as a pipeline input and is given a default here
params.fastq_dir = "./FASTQ/*.fq.gz"

workflow {
    // make a fastq channel from the input glob
    fastq_ch = Channel.fromPath(params.fastq_dir)

    // fastqc takes in fastq_ch and outputs a channel with fastqc html and zip files
    raw_fastqc_ch = fastqc(fastq_ch)

    // read_trimming takes in fastq_ch and outputs a channel with trimmed reads
    trimmed_reads_ch = read_trimming(fastq_ch)

    // fastqc_trimmed takes in the trimmed reads channel this time
    trimmed_fastqc_ch = fastqc_trimmed(trimmed_reads_ch)

    // combine the two fastqc channels so multiqc can use both
    combined_fastqc_ch = raw_fastqc_ch.mix(trimmed_fastqc_ch)

    // collect is used here to make all files available at the same time
    multiqc(combined_fastqc_ch.collect())
}
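For reference, an individual process definition might look like the minimal sketch below. The container image, command, and output pattern are illustrative assumptions, not the exact definitions used in the workflow above.

process fastqc {
    container 'biocontainers/fastqc:v0.11.9_cv8' // assumed image, swap in your own

    input:
    path reads

    output:
    path "*_fastqc.{zip,html}"

    script:
    """
    fastqc ${reads}
    """
}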

An example local run (not on or interacting with DNAnexus) would look like the command below. This assumes you have Nextflow installed on your local machine, which is not required for DNAnexus.

nextflow run main.nf --fastq_dir "/FASTQ/SRR_*.fastq.gz"

Because we gave --fastq_dir a default, you could simply run the following if your inputs match that default:

nextflow run main.nf

How Nextflow Works on DNAnexus

DNAnexus has developed a version of the Nextflow executor that can orchestrate Nextflow runs on the DNAnexus platform.

Once you kick off a Nextflow run, a Nextflow 'head node' is spun up. It stays on for the duration of the run, and it spins up and controls the subjobs (each subjob being one instance of a process).

Head Node

  • Orchestrates the subjobs.

  • Contains the Nextflow output directory, which is usually specified by params.outdir in nf-core pipelines.

  • Copies the output directory to the DNAnexus project (the --destination) once all subjobs have completed; see the example run below.
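
As an illustration, kicking off a Nextflow applet on the platform might look like the command below. The applet name, project ID, and pipeline input are hypothetical, and the exact input names exposed by your applet may differ, so check them with dx run <applet> --help.

dx run my_nextflow_applet \
  -i nextflow_pipeline_params="--fastq_dir 'dx://project-XXXX:/FASTQ/*.fq.gz'" \
  --destination project-XXXX:/results/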

Subjobs

  • One subjob is created for every instance of a process.

  • Each subjob is one virtual machine (instance), e.g., fastqc_process(fileA) is run on one machine and fastqc_process(fileB) is run on a different machine.

  • Each subjob uses a Docker image for the process environment (see the configuration sketch below).

  • Required files are pulled onto the machine, and outputs are sent back to the head node once the subjob has completed.

  • Task execution status, temporary files, stdout/stderr logs, etc. are sent to the work directory.
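
To make the Docker image and machine sizing concrete: both are typically declared per process in nextflow.config, as in the minimal sketch below (the process name, image, and values are illustrative). On DNAnexus, these resource requests inform which instance type is chosen for each subjob.

// nextflow.config (illustrative values)
process {
    withName: 'fastqc' {
        container = 'biocontainers/fastqc:v0.11.9_cv8' // assumed image
        cpus = 2
        memory = '4 GB'
    }
}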

Work Directory

  • Nextflow uses a 'work' directory (workDir) for executing tasks. Each instance of a process gets its own folder in the work directory, and this folder stores task execution info, intermediate files, etc. (an illustrative layout is shown below).

  • Depending on whether you choose to cache your work directory, you will be able to see this work directory on the platform during/after your Nextflow run. Otherwise, the work directory exists in a temporary workspace and will be destroyed once the run has completed.
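
An illustrative layout of one task's folder inside the work directory (the hash-based folder name is shortened, and the output file name is made up):

work/
└── a1/b2c3d4.../          # one folder per task instance
    ├── .command.sh        # the script Nextflow actually ran
    ├── .command.out       # task stdout
    ├── .command.err       # task stderr
    ├── .exitcode          # task exit status
    └── trimmed.fq.gz      # intermediate output files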

Note about Batch Processing

You may have learned about batching inputs for WDL workflows previously. You do not need to do this for Nextflow applets: all parallelisation is handled automatically by Nextflow.
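For example, the channel from the earlier script already fans out on its own: one subjob is created per matching file, with no batching code required.

// each matching fastq becomes its own fastqc task, run in parallel
fastq_ch = Channel.fromPath("FASTQ/*.fq.gz")
fastqc(fastq_ch)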

Resources

To create a support ticket if there are technical issues:

  1. Go to the Help header (in the same section as Projects and Tools) inside the platform

  2. Select "Contact Support"

  3. Fill in the Subject and Message to submit a support ticket.

Some of the links on these pages will take the user to pages maintained by third parties. The accuracy and IP rights of the information on these third-party pages are the responsibility of those third parties.

For more details, see the Full Documentation.