Building Nextflow Applets
Pipeline Script Folder Structure
Building and running Nextflow pipelines on DNAnexus.
A Nextflow pipeline script is structured as a folder containing Nextflow scripts, with optional configuration files and subfolders. The basic elements of the folder structure when building a Nextflow executable are:
(Required) A main Nextflow file with the extension .nf containing the pipeline. The default filename is main.nf. A different filename can be specified in the nextflow.config file using manifest.mainScript = 'myfile.nf'.
(Optional, recommended) A nextflow.config file. See here for nextflow config file information.
(Optional, recommended) A nextflow_schema.json file. If this file is present when importing or building the executable, the imported executable will expose the Nextflow input parameters to the user on the DNAnexus CLI and UI.
(Optional) Subfolders and other configuration files. These can be referenced by the main Nextflow file or nextflow.config via the include or includeConfig keyword. Ensure that all referenced subfolders and files exist under the pipeline script folder at the time of building or importing the pipeline.
(Optional) A bin folder containing scripts required by the pipeline. Nextflow adds this folder to the PATH environment variable; for more information, see the Nextflow documentation on custom scripts and tools.
For other files and folders, such as assets, an nf-core flavored folder structure is encouraged but not required.
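Before building, it can help to confirm that the folder contains the files listed above. The following is a minimal Python sketch of such a check, not part of the DNAnexus or Nextflow tooling. The function name check_pipeline_folder is hypothetical, and the sketch only recognises the dotted form manifest.mainScript = '...' in nextflow.config, not a manifest { ... } block.

```python
import re
import tempfile
from pathlib import Path


def check_pipeline_folder(folder):
    """Report which of the files described above exist in a pipeline folder."""
    folder = Path(folder)
    main_script = "main.nf"  # default entry script name
    config = folder / "nextflow.config"
    if config.is_file():
        # Look for an override such as: manifest.mainScript = 'myfile.nf'
        match = re.search(
            r"manifest\.mainScript\s*=\s*['\"]([^'\"]+)['\"]", config.read_text()
        )
        if match:
            main_script = match.group(1)
    return {
        "main_script": main_script,
        "has_main_script": (folder / main_script).is_file(),  # required
        "has_config": config.is_file(),  # optional, recommended
        "has_schema": (folder / "nextflow_schema.json").is_file(),  # optional, recommended
        "has_bin": (folder / "bin").is_dir(),  # optional
    }


if __name__ == "__main__":
    # Demonstrate on a throwaway folder that contains only main.nf
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "main.nf").write_text("workflow { }\n")
        print(check_pipeline_folder(tmp))
```

A folder that reports has_main_script as False is missing the one required file.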
Reviewing an example minimal nextflow applet
Create the code for fastqc-nf
We are going to add each file into a folder called fastqc-nf.
This is a very simple applet containing only one process, which runs FastQC on files specified using an input samplesheet or from a folder in a project on the platform.
It has only three files:
main.nf : The pipeline script file
nextflow.config : Contains config info and sets params
nextflow_schema.json : Specifies the information used by the UI/CLI run command to serve the nextflow params to the user on DNAnexus
The main.nf file
Let's look at the main.nf file. As a reminder, this file can have a different name, with the new name specified in the nextflow.config file using manifest.mainScript = 'myfile.nf' if needed.
main.nf
// Use newest nextflow dsl - not required to add this line - only dsl2 is supported on DNAnexus
nextflow.enable.dsl = 2

log.info """\
    ===================================
    F A S T Q C - E X A M P L E
    ===================================
    samplesheet : ${params.samplesheet}
    reads_dir   : ${params.reads_dir}
    outdir      : ${params.outdir}
    """
    .stripIndent()

process FASTQC {
    tag "FastQC - ${sample_id}"
    container 'quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0'
    cpus 2
    memory { 4.GB * task.attempt }
    publishDir "${params.outdir}", pattern: "*", mode: 'copy'

    input:
    tuple val(sample_id), path(reads)

    output:
    path "*"

    script:
    """
    fastqc --threads ${task.cpus} $reads
    """
}

/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
MAIN WORKFLOW
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*/
workflow {
    if (params.samplesheet != null && params.reads_dir == null) {
        // Assumes a header row with the columns sample_name, fastq_1, fastq_2.
        // The two FASTQ paths are grouped into one list so that each item matches
        // the process input: tuple val(sample_id), path(reads)
        reads_ch = Channel
            .fromPath(params.samplesheet)
            .splitCsv(header: true)
            .map { row -> tuple(row.sample_name, [file(row.fastq_1), file(row.fastq_2)]) }
        reads_ch.view()
        FASTQC(reads_ch)
    } else if (params.samplesheet == null && params.reads_dir != null) {
        // fromFilePairs emits [sample_id, [read1, read2]]
        reads_ch = Channel.fromFilePairs(params.reads_dir)
        reads_ch.view()
        FASTQC(reads_ch)
    } else {
        error "Provide exactly one of samplesheet or reads_dir"
    }
}

workflow.onComplete {
    log.info ( workflow.success ? "\nworkflow is done!\n" : "Oops .. something went wrong" )
}
DNAnexus expects Nextflow pipelines to use the Nextflow DSL2 standard. If you have learned Nextflow after December 2022 (when Nextflow version 22.12.0 was released) you are using DSL2.
From the Nextflow docs "In Nextflow version 22.03.0-edge, DSL2 became the default DSL version. In version 22.12.0-edge, DSL1 support was removed, and the Nextflow documentation was updated to use DSL2 by default."
Each process must use a Docker container to define the software environment for the process. See here for more information on using Docker containers in Nextflow processes. Here I am using a public Docker image on quay.io. This is the same container used by the nf-core fastqc module. You might notice that the container line in the nf-core fastqc module is missing 'quay.io'. This is because, for nf-core pipelines, the registry is expected to be given in the nextflow.config using
docker.registry = 'quay.io'
See here for an example in sarek. In your own pipeline, you can do it however you please!
You should define cpus, memory, or disk (at least one of these three), or you can use machineType with the name of the exact DNAnexus instance type that you want to use for this process.
For example
machineType 'mem2_ssd1_v2_x2'
If you do not specify the resources required for a process, it will use the mem2_ssd1_v2_x4 instance type by default (this is the same instance type used for the head node), and processes that require more memory than this will fail.
You should use the publishDir directive to capture the output files that you want to publish from each process. It is generally advisable to publish your output files to an output directory defined by params.outdir (the name doesn't matter as long as it is consistent within your pipeline). You can have as many subfolders of your outdir as needed, and you can use the publishDir directive multiple times in the same process to send different output files to different subfolders.
An example of using publishDir multiple times in one process to send outputs to subfolders
process foo {
    publishDir "${params.outdir}/fastqc/html", pattern: "*.html", mode: 'copy'
    publishDir "${params.outdir}/fastqc/zip", pattern: "*.zip"
    ..
}
Only the 'copy' mode of publishDir is supported on DNAnexus. If you do not specify a mode, then the DNAnexus executor will use copy by default so both of the publishDir lines in the example above are valid.
Assuming at runtime you assign outdir the value of './results', this example places all output files with the ending .html in ./results/fastqc/html and all output files with ending .zip in ./results/fastqc/zip in the head node of the nextflow run.
The entire outdir, with its subfolder structure intact, will be copied to the platform location specified by --destination in the CLI or 'Output to' in the UI once all subjobs have completed.
Only relative paths are allowed for publishDir on DNAnexus, and therefore for params.outdir as well, since this is where files are published to. See reference.
General Nextflow publishDir advice: do not attempt to access files in the publishDir directories from within a Nextflow script, as this is bad practice for many reasons. Use channels to pass files between processes.
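To make the pattern-based routing of publishDir concrete, here is a small Python sketch that mimics which subfolder each output file would land in for the two-directive example above. It is only an illustration: route_outputs is a hypothetical helper, Python's fnmatch is used as a stand-in for Nextflow's glob matching (the two are not guaranteed to agree on every pattern), and no files are actually copied.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath


def route_outputs(filenames, outdir="./results"):
    """Map each output file to the publish locations whose pattern it matches."""
    # (subfolder, glob pattern) pairs mirroring the two publishDir lines above
    rules = [
        ("fastqc/html", "*.html"),
        ("fastqc/zip", "*.zip"),
    ]
    routed = {}
    for name in filenames:
        targets = [
            str(PurePosixPath(outdir) / subfolder / name)
            for subfolder, pattern in rules
            if fnmatch(name, pattern)
        ]
        routed[name] = targets  # an empty list means the file is not published
    return routed


if __name__ == "__main__":
    outputs = ["sampleA_fastqc.html", "sampleA_fastqc.zip", "notes.txt"]
    for name, targets in route_outputs(outputs).items():
        print(name, "->", targets)
```

Files that match no pattern stay in the process work directory and are not published, which is why a catch-all pattern such as "*" is used in the fastqc-nf example.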
In this example applet, I have placed the process and workflow parts in the main.nf script. For larger multi-process applets, you can place your processes in modules/workflows/subworkflows and import them into the main script as done in nfcore pipelines.
The nextflow.config file
Full File:
// Default parameters
docker {
    enabled = true
}
params {
    samplesheet = null
    reads_dir = null
    outdir = "./results"
}
// Processes should always fail if any pipe element has a non-zero exit code.
process.shell = ['/bin/bash', '-euo', 'pipefail']
Explanation of Each Section:
Enable docker by default for this pipeline
docker {
    enabled = true
}
Define the input parameters. You can also do this in the main.nf script but by convention nfcore pipelines do it in the nextflow.config. There are three params in this workflow, 'samplesheet' which is a file input, 'reads_dir' which is a directory path and 'outdir' which is a string defining the name of the output folder.
params {
    samplesheet = null
    reads_dir = null
    outdir = "./results"
}
Here I have assigned samplesheet and reads_dir the value of null. Thus if the user does not provide a samplesheet or a reads_dir to the pipeline at runtime, the pipeline will fail. For items such as the samplesheet that should always or nearly always change at runtime, it is valuable to assign them a null value instead of a default so that a user does not accidentally run the pipeline with a default samplesheet thinking they have used a different one.
Here outdir is assigned a default of './results'. Thus, if a user does not specify a string for outdir at runtime, it will use './results'. If a user does specify an outdir, it will use the user specified one instead.
// Processes should always fail if any pipe element has a non-zero exit code.
process.shell = ['/bin/bash', '-euo', 'pipefail']
This is a common setting that makes a process fail quickly and loudly when it encounters an issue. Here is a more thorough explanation.
Error Strategy: I have not defined an error strategy in the nextflow.config file. Thus, the default strategy (for both the local Nextflow executor and the DNAnexus executor) is 'terminate'. For more detailed information on choosing an errorStrategy, see this section.
queue-size: I have also not defined the queueSize, so when this applet is run, a maximum of 5 subjobs will run in parallel at any one time, unless you pass the -queue-size flag in the nextflow_run_opts option for the applet.
The nextflow_schema.json file
The nextflow_schema.json file is needed to reflect the Nextflow params (--samplesheet, --reads_dir and --outdir in this case) as DNAnexus applet inputs in the CLI and UI. If it is not present, you will not get the -isamplesheet, -ireads_dir and -ioutdir options for your applet inputs. You can also use it to do parameter validation at runtime using plugins such as nf-validation.
nextflow_schema.json
{
    "$schema": "http://json-schema.org/draft-07/schema",
    "$id": "https://raw.githubusercontent.com/YOUR_PIPELINE/master/nextflow_schema.json",
    "title": "Nextflow pipeline parameters",
    "description": "This pipeline uses Nextflow and processes some kind of data. The JSON Schema was built using the nf-core pipeline schema builder.",
    "type": "object",
    "definitions": {
        "inputs": {
            "title": "Inputs",
            "type": "object",
            "description": "",
            "default": "",
            "properties": {
                "samplesheet": {
                    "type": "string",
                    "description": "Input samplesheet in CSV format",
                    "format": "file-path"
                },
                "reads_dir": {
                    "type": "string",
                    "description": "Reads directory for file pairs with wildcard",
                    "format": "directory-path"
                },
                "outdir": {
                    "type": "string",
                    "format": "directory-path",
                    "description": "Local path to output directory",
                    "default": "./results"
                }
            }
        }
    },
    "allOf": [
        {
            "$ref": "#/definitions/inputs"
        }
    ]
}
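The default for outdir is written twice: once in nextflow.config and once in nextflow_schema.json. A small check along the following lines may help keep the two in sync. This is a rough Python sketch under stated assumptions: the helper names are hypothetical, and the config parser only understands a flat params { name = value } block like the one in this applet, not nested blocks or Groovy expressions.

```python
import re


def config_param_defaults(config_text):
    """Extract simple `name = value` pairs from a flat params { ... } block."""
    block = re.search(r"params\s*\{(.*?)\}", config_text, re.DOTALL)
    defaults = {}
    if block is None:
        return defaults
    for line in block.group(1).splitlines():
        match = re.match(r"\s*(\w+)\s*=\s*(.+?)\s*$", line)
        if match is None:
            continue  # blank line, comment, or unsupported syntax
        name, raw = match.groups()
        # Groovy null becomes None; quoted strings lose their quotes
        defaults[name] = None if raw == "null" else raw.strip("'\"")
    return defaults


def schema_param_defaults(schema):
    """Collect the default (or None) of every parameter in every group."""
    defaults = {}
    for group in schema.get("definitions", {}).values():
        for name, spec in group.get("properties", {}).items():
            defaults[name] = spec.get("default")
    return defaults


def mismatched_defaults(config_text, schema):
    """Return {param: (config_default, schema_default)} where they differ."""
    config = config_param_defaults(config_text)
    schema_defaults = schema_param_defaults(schema)
    return {
        name: (config.get(name), schema_defaults[name])
        for name in schema_defaults
        if config.get(name) != schema_defaults[name]
    }


if __name__ == "__main__":
    config_text = '''
    params {
        samplesheet = null
        reads_dir = null
        outdir = "./results"
    }
    '''
    schema = {"definitions": {"inputs": {"properties": {
        "samplesheet": {"type": "string", "format": "file-path"},
        "reads_dir": {"type": "string", "format": "directory-path"},
        "outdir": {"type": "string", "default": "./results"},
    }}}}
    # An empty dict means the two files agree on every default
    print(mismatched_defaults(config_text, schema))
```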
Creating a nextflow_schema.json file
Once you have written your script and know your parameters, you can make the schema quite quickly using the nf-core pipeline schema builder website. Note: do not put sensitive information into this builder, as information in it is stored by nf-core for 2 weeks.
There is also the option of using the nf-core schema commands from nf-core tools on your computer to create it. You may need to manually add a format of either file-path or directory-path to some parameters if the tool doesn't do it for you.
Here we will explain how to use the nf-core pipeline schema builder website:
In the New Schema section, click the blue Submit button to start.
Near the top of the page, click the 'Add group' button. You need at least one group in your schema file for it to function on the platform. All parameters must be placed into a group (you can do this by dragging and dropping them into the group). For example, you might have one group called Inputs for all your input parameters and a group called Output for your output parameters, with the appropriate parameters placed into the correct groups.
Click required for every non-optional parameter.
The default type of input is a string input. For file and directory path input parameters, click the little wheel to the right. At the bottom of the popup, in the Format section, choose File path for a file input or Directory path for a directory path. Having these two correct is important for how you specify the inputs on the platform.
When you are finished building your schema file, click 'Finished', then 'Copy pipeline schema', and paste the information into a file called nextflow_schema.json in the same directory as your applet's main.nf and nextflow.config files.
If you note the Schema cache ID, you can type it into the website to pull up and edit that file within 14 days.
To remove an input parameter for the pipeline from the UI and CLI, you can delete it from the nextflow_schema.json file, or place it in a section of the nextflow_schema.json file that is not referenced in the allOf section at the bottom of the JSON file.
You can also remove entire sections by removing their reference from the allOf section without deleting them from the file.
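To see which parameters remain referenced after editing the allOf section, you could walk the schema yourself. The Python sketch below lists the parameter names in every group that allOf points to. It is an illustration of the rule described above rather than the logic DNAnexus actually runs; exposed_params is a hypothetical name, and only local references of the form #/definitions/<group> are followed.

```python
def exposed_params(schema):
    """Return parameter names in groups referenced from the top-level allOf."""
    definitions = schema.get("definitions", {})
    prefix = "#/definitions/"
    names = []
    for entry in schema.get("allOf", []):
        ref = entry.get("$ref", "")
        if not ref.startswith(prefix):
            continue  # only local group references are handled in this sketch
        group = definitions.get(ref[len(prefix):], {})
        names.extend(group.get("properties", {}))
    return names


if __name__ == "__main__":
    schema = {
        "definitions": {
            "inputs": {"properties": {"samplesheet": {}, "reads_dir": {}, "outdir": {}}},
            # This group is defined but not referenced in allOf below
            "hidden": {"properties": {"internal_flag": {}}},
        },
        "allOf": [{"$ref": "#/definitions/inputs"}],
    }
    print(exposed_params(schema))
```

In this example internal_flag stays in the file but is not listed, because its group is not referenced from allOf.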
Build the nextflow applet
Ensure that you are in the project in which you want to build the applet, using dx pwd or dx env. Use dx select to choose the correct project if required.
#select project
dx select project-ID
Assuming you have the folder called fastqc-nf with these contents (main.nf is required at a minimum):
main.nf
nextflow.config
nextflow_schema.json
Build the applet. By default, the applet will build in the root of your project.
If you are in the fastqc-nf folder on your machine, you will need to cd .. back a level for the command below to work.
dx build --nextflow fastqc-nf
Or build using --destination to set the project-level folder for the applet:
dx build -a --nextflow fastqc-nf --destination project-XXXXX:/TEST/fastqc-nf
Or, to build in the root of the project and just change the name to test-fastqc-nf, run:
dx build -a --nextflow fastqc-nf --destination project-XXXXX:/test-fastqc-nf
You should see an output like the one below but with a different applet ID.
{"id": "applet-ID"}
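If you build applets from a script, you may want to capture the applet ID from this output. The following Python sketch parses output of the form shown above. It assumes the ID is printed as a single-line JSON object with an "id" field, as in the example; applet_id_from_build_output is a hypothetical helper name.

```python
import json


def applet_id_from_build_output(stdout):
    """Extract the applet ID from the JSON line printed by the build."""
    for line in stdout.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip any non-JSON log lines
        data = json.loads(line)
        if "id" in data:
            return data["id"]
    raise ValueError("no JSON object with an 'id' field found in build output")


if __name__ == "__main__":
    print(applet_id_from_build_output('{"id": "applet-ID"}'))
```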
Use -a with dx build to archive previous versions of your applet and -f to force overwrite previous applet versions. The archived versions are placed in a folder called .Applet_archive in the root of the project.
You can see the build help using dx build -h or dx build --help.
How file-path and directory-path in nextflow_schema.json affect run options
In the DNAnexus UI:
file-path will be rendered as a file picker, which enables loading a file object by selecting it in the UI (only one file can be selected).
directory-path will be rendered as a string and will appear in the UI as a text box input. You can point to a directory by typing a string path such as dx://<project-id>:/test/ in the box, or to multiple files in a path such as dx://<project-id>:/test/*_R{1,2}.fastq.gz
string is rendered as a string and appears as a text box input in the UI.
Here is part of the fastqc-nf run setup screen
Notice how samplesheet has 'Select File' and a file icon but outdir and reads_dir have text input boxes.
This is because samplesheet was given 'file-path' in the nextflow_schema.json, but outdir and reads_dir were given directory-path, which renders as a string input, hence the text box.
In the DNAnexus CLI:
Run the applet with -h to see the input parameters for the applet:
dx run fastqc-nf -h
Excerpt of output from command above
usage: dx run fastqc-nf [-iINPUT_NAME=VALUE ...]
Applet: fastqc-nf
fastqc-nf
Inputs:
outdir: [-ioutdir=(string)]
(Nextflow pipeline required) Default value:./results
reads_dir: [-ireads_dir=(string)]
(Nextflow pipeline required)
samplesheet: [-isamplesheet=(file)]
(Nextflow pipeline required)
....
string will appear as class string, e.g., for param outdir:
outdir: [-ioutdir=(string)] (Nextflow pipeline required) Default value:./results
The default here is what we specified as the default in nextflow_schema.json. The applet cannot 'see' the default that we set in the nextflow.config, so make sure they match when building the JSON.
directory-path will appear as class string, e.g., for param reads_dir:
reads_dir: [-ireads_dir=(string)] (Nextflow pipeline required)
When (string) is given for a parameter (used for folder paths and strings; the input is of the 'string' class), use dx://project-XXXXX:/path/to/folder, e.g.,
dx run fastqc-nf -ireads_dir=dx://project-GgYbKGQ0QFpxF6qkPK4KxQ6Q:/FASTQ/*_{1,2}.fastq.gz
file-path will appear as class file, e.g., for param samplesheet:
samplesheet: [-isamplesheet=(file)] (Nextflow pipeline required)
When (file) is given for a parameter (i.e., the input is of the 'file' class), use project-XXXXX:/path/to/file, e.g.,
dx run fastqc-nf -isamplesheet=project-XXXXX:/samplesheet-example.csv ....
See here for more information on options for nextflow_schema.json on DNAnexus.
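The two path conventions above are easy to mix up, so here is a short Python sketch that builds the -i argument for a parameter that points at data on the platform. It simply encodes the rule described in this section: file-path inputs take project-XXXXX:/path, while directory-path inputs take a dx:// URI. The function name dx_input_flag is hypothetical, and parameters holding a local relative path (such as outdir) should be passed as plain strings instead.

```python
def dx_input_flag(name, schema_format, project, path):
    """Build -i<name>=<value> for a parameter that points at platform data."""
    if schema_format == "file-path":
        # 'file' class input: plain project:/path reference
        value = f"{project}:{path}"
    elif schema_format == "directory-path":
        # 'string' class input: dx:// URI, which may include a glob
        value = f"dx://{project}:{path}"
    else:
        raise ValueError(f"unsupported schema format: {schema_format!r}")
    return f"-i{name}={value}"


if __name__ == "__main__":
    print(dx_input_flag("samplesheet", "file-path",
                        "project-XXXXX", "/samplesheet-example.csv"))
    print(dx_input_flag("reads_dir", "directory-path",
                        "project-XXXXX", "/FASTQ/*_{1,2}.fastq.gz"))
```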
Running the Nextflow Pipeline Applet
Using samplesheets
When placing a path to a file on the DNAnexus platform in a samplesheet, use the format dx://project-xxx:/path/to/file
Here is an example of a samplesheet with one sample (the format of the samplesheet is determined by you; this is just for illustration purposes):
sample_name,fastq_1,fastq_2
sampleA,dx://project-xxx:/path/to/sampleA_r1.fastq.gz,dx://project-xxx:/path/to/sampleA_r2.fastq.gz
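For more than a handful of samples, you might generate the samplesheet rather than type it. The Python sketch below writes rows in the same three-column layout as the example above. The helper name build_samplesheet and the _r1/_r2 file naming are assumptions taken from this example, so adjust them to match how your FASTQ files are actually named.

```python
import csv
import io


def build_samplesheet(project, folder, sample_names):
    """Return CSV text with dx:// paths for paired FASTQ files."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sample_name", "fastq_1", "fastq_2"])
    base = f"dx://{project}:{folder.rstrip('/')}"
    for sample in sample_names:
        writer.writerow([
            sample,
            f"{base}/{sample}_r1.fastq.gz",  # assumed read 1 naming
            f"{base}/{sample}_r2.fastq.gz",  # assumed read 2 naming
        ])
    return buffer.getvalue()


if __name__ == "__main__":
    print(build_samplesheet("project-xxx", "/path/to", ["sampleA"]), end="")
```

The resulting text can be saved as a CSV file and uploaded to your project before running the applet.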
Run the applet from the UI
In your project on the platform, click the fastqc-nf applet.

In the run applet screen, click 'Output to' and choose your output location.
Click 'Next'
At the setup screen, either input a samplesheet or write the path for reads_dir. In the image below, I have used the reads_dir param. Replace 'project-xxx' and '/path/to/reads' with your project ID and the name of the folder that the reads are in.

Review the rest of the inputs and change anything that you want, e.g., turn on 'preserve_cache'.

Click start analysis

Review the name, output location etc

Click 'Launch Analysis'
Run the applet on the CLI
Running the fastqc applet with the reads_dir as input
I am turning on preserve_cache and using -inextflow_run_opts in the command below to demonstrate how to add them to the command, but neither is required here.
Note that the *_{1,2}.fastq.gz is needed here for Channel.fromFilePairs to correctly pair up related files.
I do not need -profile docker in -inextflow_run_opts, as docker was enabled in the nextflow.config for this applet.
--name names the job.
dx run fastqc-nf \
-ireads_dir="dx://project-ID:/FASTQ/*_{1,2}.fastq.gz" \
-ioutdir="./fastqc-out-rd" \
-ipreserve_cache=true \
-inextflow_run_opts='-queue-size 10' \
--destination "project-ID:/USERS/FOLDERNAME" \
--name fastqc-nf-with-reads-dir \
-y
Running the fastqc applet with the samplesheet as input
dx run fastqc-nf -isamplesheet="project-ID:/samplesheet-example.csv" \
-ioutdir="./fastqc-out-sh" \
--destination "project-ID:/USERS/FOLDERNAME" \
--name fastqc-nf-with-samplesheet \
-y
Notice the different way that the path to the samplesheet is specified compared to the reads_dir in the previous example. You can read more about this here.
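If you launch runs from a script, one option is to assemble the command as a list of arguments, so that a glob such as *_{1,2}.fastq.gz is passed through unchanged rather than being expanded by your local shell. The Python sketch below only builds and prints the command; it does not call dx. The helper name build_run_command is hypothetical, and only flags that appear in the commands above are used.

```python
import shlex


def build_run_command(applet, inputs, destination, name):
    """Assemble a dx run command as an argument list."""
    command = ["dx", "run", applet]
    for key, value in inputs.items():
        command.append(f"-i{key}={value}")  # one -i argument per applet input
    command += ["--destination", destination, "--name", name, "-y"]
    return command


if __name__ == "__main__":
    command = build_run_command(
        "fastqc-nf",
        {
            "reads_dir": "dx://project-ID:/FASTQ/*_{1,2}.fastq.gz",
            "outdir": "./fastqc-out-rd",
        },
        destination="project-ID:/USERS/FOLDERNAME",
        name="fastqc-nf-with-reads-dir",
    )
    # shlex.join shows a shell-safe rendering of the list for copy and paste
    print(shlex.join(command))
```

The list form can be handed to subprocess.run directly, which avoids a second layer of shell quoting.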
Resources
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Some of the links on these pages will take the user to pages that are maintained by third parties. The accuracy and IP rights of the information on these third-party pages are the responsibility of those third parties.