Job Failures
If your Nextflow run fails, the Nextflow job log is still written to your project's output location (the --destination CLI flag) that you set for the applet at runtime. However, on failure, your result files in params.outdir are not written to the project unless you are using the 'ignore' error strategy.
To guard against long-running or expensive (or both!) runs that produce no output when they fail, think carefully about what should happen when your job fails and whether you need the ability to resume it. Resuming means that successfully completed processes won't be run again, saving you the cost and time of re-running work that already succeeded.
Nextflow has a resume feature that allows failed runs to be resumed. To be able to resume a failed run, you need to set preserve_cache to true for the initial run. This caches the Nextflow workDir of the run in your project on the platform, in a folder called .nextflow_cache_db/<session_id>/.
The session ID is a unique ID given to each (non-resumed) Nextflow run. Resumed Nextflow runs will share the same session ID as the run that they are resuming since they are using the same cache.
The cache is the Nextflow workDir, which is where Nextflow stores each task's files during a run. By default, when you run a Nextflow applet, preserve_cache is set to false. In this state, if the applet fails you will not be able to resume the run, and you will not be able to see the contents of the work directory in your project.
To turn on preserve_cache for a run, add -ipreserve_cache=true to your run command.
In the UI, scroll to the bottom of the Nextflow run setup screen to find the preserve_cache option.
So if you are running a job and think there is a chance you might want to resume it if it fails, turn on preserve_cache.
Note that if you terminate a job manually, i.e., using the Terminate button in the UI or with dx terminate, the cache will not be preserved and you will not be able to resume the run, even if preserve_cache was set to true for the run. The same applies if a job is terminated due to a job cost limit being exceeded. Essentially, if it is not the DNAnexus executor terminating the run, the cache is not preserved, so resuming the run is not possible.
You can store up to 20 caches in a project, and a cache is stored for a maximum of 6 months. Once the 20-cache limit has been reached, you will get a failure if you try to run another job with preserve_cache switched on. In practice, you should regularly delete your cache folders once you have had successful runs and no longer need them, to save on storage costs.
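For example, a minimal sketch of cleaning up caches with the dx CLI, assuming the cache layout described above (the <session-id> placeholder is the session you no longer need):

```bash
# List the cached sessions stored in the project
dx ls .nextflow_cache_db/

# Recursively delete the cache folder for a session you no longer need
dx rm -r ".nextflow_cache_db/<session-id>/"
```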
You can make changes to the Nextflow applet and dx build it again, and/or change the run inputs, before resuming a run.
When you resume a run in the CLI using the session ID, the run will resume from what is cached for that session ID in the project.
Only one Nextflow job with the same session ID can run at any time.
When resume is set to 'true' or 'last', the run will determine the session ID that corresponds to the latest valid execution in the current project and resume from it.
To set up the sarek command to preserve the cache:
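A sketch, assuming a sarek applet built with dx build --nextflow; the applet name, project, and pipeline inputs below are placeholders to adapt to your own build and data:

```bash
# -ipreserve_cache=true caches the workDir in .nextflow_cache_db/<session-id>/
# so this run can be resumed if it fails
dx run applet-sarek \
  -ipreserve_cache=true \
  -ioutdir=out \
  --destination=project-xxxx:/sarek_results \
  -y
```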
To resume a sarek run and preserve updates to the cache from the new run (which will allow further resumes in case this resumed run also fails), use the code below:
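Again a sketch with placeholder names; resume takes the session ID of the failed run (or 'true'/'last', as described above):

```bash
# Resume from the cached session and keep caching so that this
# run can itself be resumed if it also fails
dx run applet-sarek \
  -iresume=<session-id> \
  -ipreserve_cache=true \
  --destination=project-xxxx:/sarek_results \
  -y
```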
To get the session ID of a run, click the run in the Monitor tab of your project and scroll to the bottom of the page. At the bottom right, you should see the session ID in the 'Properties' section.
If you know your job ID, you can also use that to get the session ID on the CLI:
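A sketch using dx describe, assuming the session ID is stored among the job's properties (as the UI's 'Properties' section suggests):

```bash
# Show the job's metadata, including its properties
dx describe job-xxxx

# Or pull just the properties as JSON
dx describe job-xxxx --json | jq .properties
```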
Check which version of dxpy was used to build the Nextflow pipeline and make sure it is the newest.
Look at the head-node log (hopefully the job was run with debug mode set to false, because when true the log is injected with extra detail that isn't always useful and can make it hard to find errors)
Look for the process (sub-job) that caused the error; there will be a record of the error log from that process, though it may be truncated
Look at the failed sub-job log
Look at the raw code
Look at the cached work directories (a sketch for browsing them follows this list)
.command.run runs to set up the runtime environment
Including staging files
Setting up Docker
.command.sh is the translated script block of the process
'Translated' because input channels are rendered as actual file locations
.command.log, .command.out, etc. are all logs
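When preserve_cache was turned on, these files can be inspected straight from the cached workDir in the project. A sketch, assuming the cache layout described earlier; <work-dir-hash> stands for one of Nextflow's hashed task directories:

```bash
# Browse the cached work directories for the session
dx ls ".nextflow_cache_db/<session-id>/"

# Download one task's work directory to inspect .command.sh,
# .command.log, .command.out, etc. locally
dx download -r ".nextflow_cache_db/<session-id>/<work-dir-hash>/"
```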
If you are still stuck, re-run with debug mode set to true and look at the more detailed logs
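If the applet exposes debug mode as a run input (the input name debug below is an assumption; check your applet's inputs), the re-run might look like:

```bash
# Re-run with verbose debug logging: noisier, but sometimes needed
# when the normal logs don't reveal the cause of the failure
dx run applet-sarek -idebug=true -ipreserve_cache=true -y
```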
To create a support ticket if there are technical issues:
Go to the Help header (in the same section as Projects and Tools) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.