Job Failures

If your Nextflow run fails, the Nextflow job log is written to the project output location (the --destination CLI flag) that you set for the applet at runtime.

However, on failure, your result files in params.outdir are not written to the project unless you are using the 'ignore' error strategy.
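If your pipeline should still publish outputs from successful tasks when some tasks fail, the 'ignore' strategy can be set in the pipeline configuration. A minimal sketch in standard Nextflow config syntax (whether ignoring failures is appropriate depends on your pipeline):

// nextflow.config: continue past failed tasks so that outputs
// from successful processes are still published on completion
process.errorStrategy = 'ignore'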

Long-running or expensive (or both!) runs can fail and leave you with no output, so think carefully about what should happen when a job fails and whether you need the ability to resume it. Nextflow's resume feature lets a failed run be resumed: successfully completed processes are not run again, saving you the cost and time of re-running work that has already finished.

Caching the Nextflow workDir

To be able to resume a failed run, you need to set preserve_cache to true for the initial run. This caches the Nextflow workDir of the run in your project on the platform, in a folder called .nextflow_cache_db/<session_id>/.

The session ID is a unique ID given to each (non-resumed) Nextflow run. Resumed Nextflow runs share the same session ID as the run they are resuming, since they use the same cache.

The cache is the Nextflow workDir, which is where Nextflow stores each task's files during a run. By default, when you run a Nextflow applet, preserve_cache is set to false. In this state, if the applet fails you cannot resume the run, and the contents of the work directory are not visible in your project.

To turn on preserve_cache for a run, add -ipreserve_cache=true to your run command:

dx run applet-xxxx -ipreserve_cache=true

In the UI, scroll to the bottom of the Nextflow run setup screen to find the preserve_cache option.

So if you are running a job and think there is a chance you might want to resume it if it fails, turn on preserve_cache.

Note that if you terminate a job manually (i.e., using the Terminate button in the UI or with dx terminate), the cache will not be preserved and you will not be able to resume the run, even if preserve_cache was set to true. The same applies if a job is terminated because a job cost limit was exceeded. Essentially, if it is not the DNAnexus executor terminating the run, the cache is not preserved, so resuming the run is not possible.
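When the cache has been preserved, you can browse the cached session folders directly in your project, for example (the project ID is a placeholder):

# list the preserved Nextflow caches in a project
dx ls "project-xxxx:/.nextflow_cache_db/"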

Cache limits

You can store up to 20 caches in a project, and a cache is stored for a maximum of 6 months. Once that limit has been reached, you will get a failure if you try to run another job with preserve_cache switched on. In practice, you should regularly delete your cache folders once you have had successful runs and no longer need them, to save on storage costs.
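For example, a cache you no longer need can be removed with dx rm (a sketch; substitute your own project and session ID):

# recursively delete one preserved session cache
dx rm -r "project-xxxx:/.nextflow_cache_db/<session-id>/"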

Resuming a run

You can make changes to the Nextflow applet, dx build it again, and/or change the run inputs before resuming a run.

When you resume a run in the CLI using the session ID, the run resumes from what is cached for that session ID in the project.

Only one Nextflow job with the same session ID can run at any time.

dx run applet-xxxx -iresume='session-id'

When resume is set to 'true' or 'last', the run determines the session ID that corresponds to the latest valid execution in the current project and resumes from it.

dx run applet-xxxx -iresume='last'

or

dx run applet-xxxx -iresume=true
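Because only one Nextflow job with the same session ID can run at any time, it can be useful to check for running jobs with a given session ID before resuming. A sketch using the nextflow_session_id job property described below:

# find running jobs carrying a given Nextflow session ID
dx find jobs --state running --property nextflow_session_id=<session-id>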

To set up the sarek run command to preserve the cache:

dx run sarek_v3.4.0_ui -ioutdir='./test_run_cli_qs_ch' -ipreserve_cache=true -inextflow_run_opts='-profile test,docker -queue-size 20' --destination 'project-ID:/USERS/FOLDERNAME' 

To resume a sarek run and preserve updates to the cache from the new run (allowing further resumes if this resumed run also fails), use the command below:

dx run sarek_v3.4.0_ui -ioutdir='./test_run_cli_qs_ch' -ipreserve_cache=true -iresume='last' -inextflow_run_opts='-profile test,docker -queue-size 20' --destination 'project-ID:/USERS/FOLDERNAME' 

To get the session ID of a run, click the run in the Monitor tab of your project and scroll to the bottom of the page. On the bottom right, you should see the session ID in the 'Properties' section.

If you know your job ID, you can also use that to get the session ID on the CLI using

dx describe job-ID --json | jq -r .properties.nextflow_session_id
# prints the Nextflow session ID

Debugging Checklist for Errors

  • Check which version of dxpy was used to build the Nextflow pipeline and make sure it is the latest (example commands follow this list)

  • Look at the head-node log (hopefully the job was run with "debug mode" set to false, because when true, the log is injected with details that aren't always useful and can make it hard to find errors)

    • Look for the process (sub-job) that caused the error; there will be a record of the error log from that process, though it may be truncated

  • Look at the failed sub-job log

  • Look at the raw code

  • Look at the cached work directories

    • .command.run sets up the runtime environment

      • Including staging files

      • Setting up Docker

    • .command.sh is the translated script block of the process

      • Translated because input channels are rendered as actual locations

    • .command.log, .command.out, etc. are all logs

  • If the error is still unclear, re-run with "debug mode" set to true and look at the more verbose logs
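A few CLI commands that can help with the checklist above (a sketch; the job and project IDs are placeholders):

# check the installed dx-toolkit (dxpy) version
dx --version

# stream the log of the head-node job or a failed sub-job
dx watch job-xxxx

# browse the cached work directory of a preserved session
dx ls "project-xxxx:/.nextflow_cache_db/<session-id>/"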

Resources

To create a support ticket if there are technical issues:

  1. Go to the Help header (same section where Projects and Tools are) inside the platform

  2. Select "Contact Support"

  3. Fill in the Subject and Message to submit a support ticket.

Full documentation on how Nextflow can be used on DNAnexus is available in the platform documentation.