Error Strategies for Nextflow


Last updated 9 months ago


Nextflow's errorStrategy directive allows you to define, at the process level, how the Nextflow executor manages an error condition.

There are 4 possible strategies:

  • terminate (default): terminates all subjobs as soon as any subjob has an error

  • finish: when any subjob has an error, does not start any additional subjobs and waits for existing subjobs to finish before exiting

  • ignore: reports a message that the subjob had an error, but continues running all subjobs

  • retry: when a subjob returns an error, retries that subjob
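The strategy is set with the errorStrategy directive inside a process definition or in a config file. A minimal sketch is below; the process name and script are hypothetical, and any of the four strategies can be substituted:

    // Hypothetical process showing where the errorStrategy directive goes
    process fastqc_report {
        errorStrategy 'ignore'   // or 'terminate', 'finish', 'retry'

        input:
        path reads

        script:
        """
        echo "processing ${reads}"
        """
    }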

The DNAnexus Nextflow documentation has a very detailed description of what happens for each errorStrategy.

Generally, the errorStrategy is defined either in base.config (which is referenced using includeConfig in the nextflow.config file) or directly in the nextflow.config file.
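For instance, a nextflow.config can pull in the base configuration like this (the path shown is illustrative):

    // nextflow.config (illustrative path)
    includeConfig 'conf/base.config'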

In nf-core pipelines, the default errorStrategy is usually defined in base.config and is set to 'finish', except for exit codes in a specific numeric range, which are retried.

The code below is from the nf-core/sarek base.config:

    // memory errors which should be retried. otherwise error out
    errorStrategy = { task.exitStatus in ((130..145) + 104) ? 'retry' : 'finish' }
    maxRetries    = 1
    maxErrors     = '-1'

The maxRetries directive defines the maximum number of times the exact same subjob can be re-submitted in case of failure. The maxErrors directive specifies the maximum number of times a process (across all subjobs executed for that process) can fail when using the retry error strategy.

In the code above, if the exit status of the subjob (task) is within 130 to 145, inclusive, or is equal to 104, that subjob will be retried once (maxRetries = 1). If other subjobs of the same process hit the same issue, they will also each be retried once (maxErrors = '-1' disables the cap on the total number of times a process can fail, so even if every subjob of a process fails, each can still be retried the number of times set by maxRetries). Otherwise, the finish errorStrategy is applied: no new pending subjobs are started, but running non-errored subjobs are allowed to complete.

For example, imagine you have a fastqc process that takes in one file at a time from a channel with 3 files (file_A, file_B, file_C).

The process is run for each file in parallel:

  • fastqc(file_A)

  • fastqc(file_B)

  • fastqc(file_C)

If the subjob with file_A and the subjob with file_C fail first with errors in the range 130-145 or with a 104 error, they can each be retried once when maxRetries = 1.

Now imagine that you set maxErrors = 2. In this case, there are 3 instances of the process, but only 2 errors are allowed across all instances. Thus, only 2 of the subjobs will be retried, e.g. fastqc(file_A) and fastqc(file_C).

If fastqc(file_B) then encounters an error at any point, it won't be retried, and the whole job will fall back to the finish errorStrategy.

Thus, disabling the maxErrors directive by setting it to '-1' allows all failing subjobs with the specified error codes to be retried X times, with X set by maxRetries.
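Putting the fastqc example together, a base.config sketch could look like the following. The exit-code range mirrors the nf-core snippet above; the withName selector and process name are assumptions for illustration:

    process {
        withName: 'fastqc' {
            // retry once on exit codes 130-145 or 104, otherwise finish
            errorStrategy = { task.exitStatus in ((130..145) + 104) ? 'retry' : 'finish' }
            maxRetries    = 1     // each failing subjob may be resubmitted once
            maxErrors     = '-1'  // no cap on total failures across subjobs of the process
        }
    }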

Debugging Checklist for Errors

  • Check which version of dxpy was used to build the Nextflow pipeline and make sure it is the newest

  • Look at the head-node log (ideally the pipeline was run with "debug mode" set to false, because when it is true the log gets injected with extra details that aren't always useful and can make it hard to find errors)

    • Look for the process (sub-job) which caused the error, there will be a record of the error log from that process, though it may be truncated

  • Look at the failed sub-job log

  • Look at the raw code

  • Look at the cached work directories

    • .command.run sets up the runtime environment

      • Including staging input files

      • Setting up Docker

    • .command.sh is the translated script block of the process

      • Translated because input channels are rendered as actual file locations

    • .command.log, .command.out etc are all logs

  • Look at logs with "debug mode" as true

Resources

To create a support ticket if there are technical issues:

  1. Go to the Help header (same section where Projects and Tools are) inside the platform

  2. Select "Contact Support"

  3. Fill in the Subject and Message to submit a support ticket.

Some of the links on these pages will take the user to pages that are maintained by third parties. The accuracy and IP rights of the information on those third-party pages are the responsibility of those third parties.
