Building Nextflow Applets
A Nextflow pipeline script is structured as a folder with Nextflow scripts with optional configuration files and subfolders. Below are the basic elements of the folder structure when building a Nextflow executable:
(Required) A major Nextflow file with the extension .nf containing the pipeline. The default filename is main.nf. A different filename can be specified in the nextflow.config file using manifest.mainScript = 'myfile.nf'.
(Optional, recommended) A nextflow.config file.
(Optional, recommended) A nextflow_schema.json file. If this file is present when importing or building the executable, the imported executable will expose the Nextflow input parameters to the user on the DNAnexus CLI and UI.
(Optional) Subfolders and other configuration files. Subfolders and other configuration files can be referenced by the major Nextflow file or nextflow.config via the include or includeConfig keyword. Ensure that all referenced subfolders and files exist under the pipeline script folder at the time of building or importing the pipeline.
(Optional) A bin folder containing scripts required by the pipeline. Nextflow adds this folder to the PATH environment variable. For more information, see the Nextflow documentation.
For other files and folders such as assets, an nfcore-flavored folder structure is encouraged but not required.
Create the code for fastqc-nf
We are going to add each file to a folder called fastqc-nf.
This is a very simple applet containing only one process, which runs FastQC on files specified using an input samplesheet or read from a folder in a project on the platform.
It has only three files:
main.nf : The pipeline script file
nextflow.config : Contains config info and sets params
nextflow_schema.json : Specifies the information used by the UI/CLI run command to serve the nextflow params to the user on DNAnexus
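As a minimal sketch, the empty skeleton for these three files can be created from a shell. The folder and file names are the ones listed above; the files still need the content described in the following sections.

```shell
# Create the applet folder and the three (still empty) files it will contain
mkdir -p fastqc-nf
touch fastqc-nf/main.nf fastqc-nf/nextflow.config fastqc-nf/nextflow_schema.json

# List the folder to confirm the layout
ls fastqc-nf
```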
The main.nf file
Let's look at the main.nf file. As a reminder, this file can have a different name, as long as the new name is specified in the nextflow.config file using manifest.mainScript = 'myfile.nf'.
main.nf
DNAnexus expects Nextflow pipelines to use the Nextflow DSL2 standard. If you have learned Nextflow after December 2022 (when Nextflow version 22.12.0 was released) you are using DSL2.
For example: machineType 'mem2_ssd1_v2_x2'
If you do not specify the resources required for a process, it uses the mem2_ssd1_v2_x4 instance type by default (the same machine type used for the head node), and processes that require more memory than this will fail.
An example of using publishDir multiple times in one process to send outputs to subfolders
Only the 'copy' mode of publishDir is supported on DNAnexus. If you do not specify a mode, then the DNAnexus executor will use copy by default so both of the publishDir lines in the example above are valid.
Assuming you assign outdir the value './results' at runtime, this example places all output files ending in .html in ./results/fastqc/html and all output files ending in .zip in ./results/fastqc/zip on the head node of the Nextflow run.
Once all subjobs have completed, the entire outdir, with its subfolder structure intact, is copied to the platform location specified by '--destination' in the CLI or 'Output to' in the UI.
In this example applet, I have placed the process and workflow parts in the main.nf script. For larger multi-process applets, you can place your processes in modules/workflows/subworkflows and import them into the main script as done in nfcore pipelines.
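The following is a minimal sketch of what a main.nf along these lines might look like, written to disk from a shell. It is not the original file. It assumes the reads come from params.reads_dir through Channel.fromFilePairs (the samplesheet branch is left out), and the container tag, cpus and memory values are illustrative assumptions.

```shell
mkdir -p fastqc-nf

# Quoted 'EOF' stops the shell from expanding ${...} inside the Nextflow code
cat > fastqc-nf/main.nf <<'EOF'
nextflow.enable.dsl = 2

process FASTQC {
    tag "${sample_id}"

    // Assumed image tag; any public FastQC image would do
    container 'quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0'

    // Illustrative resources; machineType could be used instead
    cpus 2
    memory '4 GB'

    // First line sets copy mode explicitly, second relies on the default
    publishDir "${params.outdir}/fastqc/html", pattern: '*.html', mode: 'copy'
    publishDir "${params.outdir}/fastqc/zip", pattern: '*.zip'

    input:
    tuple val(sample_id), path(reads)

    output:
    path '*.html'
    path '*.zip'

    script:
    """
    set -euo pipefail
    fastqc --threads ${task.cpus} --outdir . ${reads}
    """
}

workflow {
    // params.reads_dir is a glob such as dx://project-xxx:/path/*_{1,2}.fastq.gz
    reads_ch = Channel.fromFilePairs(params.reads_dir)
    FASTQC(reads_ch)
}
EOF
```

Keeping the process and the workflow block in one file matches the single-file layout described above; a larger pipeline would move the process into a module.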
The nextflow.config file
Full File:
Explanation of Each Section:
Enable docker by default for this pipeline
Define the input parameters. You can also do this in the main.nf script but by convention nfcore pipelines do it in the nextflow.config. There are three params in this workflow, 'samplesheet' which is a file input, 'reads_dir' which is a directory path and 'outdir' which is a string defining the name of the output folder.
Here I have assigned samplesheet and reads_dir the value of null. Thus if the user does not provide a samplesheet or a reads_dir to the pipeline at runtime, the pipeline will fail. For items such as the samplesheet that should always or nearly always change at runtime, it is valuable to assign them a null value instead of a default so that a user does not accidentally run the pipeline with a default samplesheet thinking they have used a different one.
Here outdir is assigned a default of './results'. Thus, if a user does not specify a string for outdir at runtime, it will use './results'. If a user does specify an outdir, it will use the user specified one instead.
Queue size: I have not defined queueSize either, so when this applet runs, at most 5 subjobs run in parallel at any one time, unless you pass the -queue-size flag in the nextflow_run_opts option for the applet.
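A minimal sketch of a nextflow.config matching the settings described above, written from a shell. The values for the three params come from the text; the layout and comments are assumptions, not the original file.

```shell
mkdir -p fastqc-nf

cat > fastqc-nf/nextflow.config <<'EOF'
// Enable Docker by default for this pipeline
docker {
    enabled = true
}

// Pipeline parameters; null means the user must supply a value at runtime
params {
    samplesheet = null
    reads_dir   = null
    outdir      = './results'
}
EOF
```

No executor or errorStrategy settings appear here, so the defaults described in this section apply.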
nextflow_schema.json
In the New Schema section, click the blue Submit button to start.
Near the top of the page, click the 'Add group' button. You need at least one group in your schema file for it to function on the platform. All parameters must be placed into a group (you can do this by dragging and dropping them into the group). For example, you might have one group called Inputs for all your input parameters and a group called Output for your output parameters, with the appropriate parameters placed into the correct groups. Click required for every non-optional parameter.
The default input type is string. For file and directory path input parameters, click the little wheel to the right.
At the bottom of the popup, in the Format section, choose File path for a file input or Directory path for a directory path. Setting these two correctly is important because it determines how you specify the inputs on the platform.
When you are finished building your schema file, click 'Finished', then 'Copy pipeline schema' and paste the information into a file called nextflow_schema.json in the same directory as your applet main.nf and nextflow.config files.
If you note the Schema cache ID, you can type it into the website to pull up and edit that file within 14 days.
To remove an input parameter for the pipeline from the UI and CLI, you can delete it from the nextflow_schema.json file, or place it in a section of the nextflow_schema.json file that is not referenced in the allOf section at the bottom of the JSON file.
You can also remove entire sections by removing their reference from the allOf section without deleting them from the file.
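A minimal sketch of a schema file for the three parameters, written from a shell. It assumes the draft-07 layout with groups under "definitions" that are referenced from allOf; the group names and titles are illustrative, and a file produced by the schema builder will contain more fields.

```shell
mkdir -p fastqc-nf

# Quoted 'EOF' keeps "$schema" and "$ref" literal
cat > fastqc-nf/nextflow_schema.json <<'EOF'
{
  "$schema": "http://json-schema.org/draft-07/schema",
  "title": "fastqc-nf pipeline parameters",
  "type": "object",
  "definitions": {
    "inputs": {
      "title": "Inputs",
      "type": "object",
      "properties": {
        "samplesheet": { "type": "string", "format": "file-path" },
        "reads_dir": { "type": "string", "format": "directory-path" }
      }
    },
    "output": {
      "title": "Output",
      "type": "object",
      "properties": {
        "outdir": {
          "type": "string",
          "format": "directory-path",
          "default": "./results"
        }
      }
    }
  },
  "allOf": [
    { "$ref": "#/definitions/inputs" },
    { "$ref": "#/definitions/output" }
  ]
}
EOF
```

Removing the `{ "$ref": "#/definitions/output" }` entry from allOf would hide outdir from the UI and CLI while leaving its definition in the file.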
Ensure that you are in the project where you want to build the applet by checking dx pwd or dx env. Use dx select to switch to the correct project if required.
Assuming you have the folder called fastqc-nf with these contents (main.nf is required at a minimum):
Build the applet. By default, the applet is built in the root of your project.
If you are inside the fastqc-nf folder on your machine, run cd .. to move up a level before running the build command.
Alternatively, build using --destination to set the project-level folder for the applet,
or, to build in the root of the project and change only the applet name to test-fastqc-nf, pass the new name as the destination.
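The three build variants above can be sketched as follows. The commands are reconstructed from the descriptions rather than copied from the original page, and project-xxx:/TEST/ is a placeholder folder. The run helper only prints each command by default, so the block is safe to try without being logged in; set DRY_RUN=0 to execute for real.

```shell
# Print commands by default; export DRY_RUN=0 to actually run them
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Run from the parent directory of fastqc-nf. Pick ONE of the variants below.

# 1. Build in the root of the currently selected project
run dx build --nextflow fastqc-nf

# 2. Build into a specific project folder (placeholder path)
run dx build --nextflow fastqc-nf --destination project-xxx:/TEST/fastqc-nf

# 3. Build in the root of the project under a different applet name
run dx build --nextflow fastqc-nf --destination test-fastqc-nf
```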
You should see an output like the one below but with a different applet ID.
Use -a with dx build to archive previous versions of your applet and -f to force overwriting of previous applet versions. The archived versions are placed in a folder called .Applet_archive in the root of the project.
You can see the build help using dx build -h or dx build --help.
In the DNAnexus UI:
file-path is rendered as a file picker, which lets you load a file object by selecting it in the UI (only one file can be selected).
directory-path is rendered as a string and appears in the UI as a text box input. You can point to a directory by typing a string path such as dx://<project-id>:/test/ in the box, or to multiple files in a path such as dx://<project-id>:/test/*_R{1,2}.fastq.gz
string is rendered as a string and appears as a text box input in the UI.
Here is part of the fastqc-nf run setup screen
Notice how samplesheet has 'Select File' and a file icon, but outdir and reads_dir have text input boxes.
This is because samplesheet was given the format 'file-path' in the nextflow_schema.json, whereas outdir and reads_dir were given directory-path, which renders as a string input, hence the text box.
In the DNAnexus CLI:
Run the applet with -h (dx run fastqc-nf -h) to see the input parameters for the applet.
Excerpt of output from the command above:
string will appear as class string, e.g., for param outdir.
The default shown here is the one specified in nextflow_schema.json. The CLI cannot 'see' the default set in nextflow.config, so make sure the two match when building the JSON.
directory-path will appear as class (string), e.g., for param reads_dir.
When (string) is given for a parameter (used for folder paths and strings; the input is of the 'string' class), use dx://project-XXXXX:/path/to/folder
e.g., dx run fastqc-nf -ireads_dir=dx://project-GgYbKGQ0QFpxF6qkPK4KxQ6Q:/FASTQ/*_{1,2}.fastq.gz
file-path will appear as class file, e.g., for param samplesheet.
When (file) is given for a parameter (i.e., the input is of the 'file' class), use project-XXXXX:/path/to/file
e.g., dx run fastqc-nf -isamplesheet=project-XXXXX:/samplesheet-example.csv ....
When placing a path to a file on the DNAnexus platform in a samplesheet, use the format dx://project-xxx:/path/to/file
Here is an example of a samplesheet with one sample (the format of the samplesheet is determined by you; this is just for illustration purposes).
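As a hypothetical illustration, a one-sample samplesheet could be written from a shell like this. The column names (sample_name, fastq_1, fastq_2) and the file paths are assumptions made up for this sketch; only the dx://project-xxx:/path/to/file form comes from the text above.

```shell
# Write a one-sample CSV; replace project-xxx and the paths with real values
cat > samplesheet-example.csv <<'EOF'
sample_name,fastq_1,fastq_2
sampleA,dx://project-xxx:/path/to/sampleA_1.fastq.gz,dx://project-xxx:/path/to/sampleA_2.fastq.gz
EOF

cat samplesheet-example.csv
```

The pipeline's main.nf must parse whatever columns you choose, so the header here has to match the names used in your own script.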
In your project on the platform, click the fastqc-nf applet.
In the run applet screen, click 'Output to' and choose your output location.
Click 'Next'
At the setup screen, either input a samplesheet or write the path for reads_dir. In the image below, I have used the reads_dir param. Replace 'project-xxx' and '/path/to/reads' with your project ID and the name of the folder that contains the reads.
Review the rest of the inputs and change anything that you want, e.g., turn on 'preserve_cache'.
Click start analysis
Review the name, output location etc
Click 'Launch Analysis'
Running the fastqc applet with the reads_dir as input
I am turning on preserve_cache and using -inextflow_run_opts in this command to demonstrate how to add them, but neither is required here.
Note that the *_{1,2}.fastq.gz pattern is needed here for Channel.fromFilePairs to correctly pair up related files.
I do not need -profile docker in -inextflow_run_opts because Docker was enabled in the nextflow.config for this applet.
--name names the job.
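A sketch of a run command that fits the notes above. The project ID, paths, job name and the queue size of 10 are placeholders, and the command is a reconstruction rather than the original. The run helper prints the command by default; set DRY_RUN=0 to launch it.

```shell
# Print commands by default; export DRY_RUN=0 to actually run them
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Single quotes keep the local shell from expanding the glob and braces
run dx run fastqc-nf \
  -ireads_dir='dx://project-xxx:/path/to/reads/*_{1,2}.fastq.gz' \
  -ipreserve_cache=true \
  -inextflow_run_opts='-queue-size 10' \
  --destination 'project-xxx:/path/to/output' \
  --name fastqc-nf-reads-dir \
  -y
```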
Running the fastqc applet with the samplesheet as input
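A sketch of the equivalent command with a samplesheet, under the same assumptions (placeholder project ID, paths and job name; printed rather than executed unless DRY_RUN=0). Because samplesheet is a file-class input, its value has no dx:// prefix.

```shell
# Print commands by default; export DRY_RUN=0 to actually run them
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# File-class input: project-xxx:/path/to/file, without dx://
run dx run fastqc-nf \
  -isamplesheet='project-xxx:/samplesheet-example.csv' \
  --destination 'project-xxx:/path/to/output' \
  --name fastqc-nf-samplesheet \
  -y
```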
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Some of the links on these pages will take the user to pages that are maintained by third parties. The accuracy and IP rights of the information on these third-party pages are the responsibility of those third parties.
"In Nextflow version 22.03.0-edge, DSL2 became the default DSL version. In version 22.12.0-edge, DSL1 support was removed, and the Nextflow documentation was updated to use DSL2 by default."
Each process must use a Docker container to define the software environment for the process. See the Nextflow documentation for more information on using Docker containers in Nextflow processes. Here I am using a public Docker image on quay.io. This is the same Docker container used by the nfcore fastqc module. You might notice that the container line in the nfcore fastqc module is missing 'quay.io'. This is because, for nfcore pipelines, that part is expected to be set in the nextflow.config using docker.registry = 'quay.io'. In your own pipeline, you can do it however you please!
You should define cpus, memory, or disk (at least one of these three), or you can use machineType with the name of the exact instance type that you want to use for this process.
You should use the publishDir directive to capture the output files that you want to publish from each process. It is generally advisable to publish your output files to an output directory defined by params.outdir (the name does not matter as long as it is consistent within your pipeline). You can have as many subfolders of your outdir as needed, and you can use the publishDir directive multiple times in the same process to send different output files to different subfolders.
Only relative paths are allowed for publishDir on DNAnexus, and therefore for params.outdir as well, since this is where files are published.
General advice: do not attempt to access files in the publishDir directories from within a Nextflow script, as this is bad practice for many reasons. Use channels to pass files between processes.
A common command to make the process fail quickly and loudly when it encounters an issue .
Error strategy: I have not defined an error strategy in the nextflow.config file, so the default strategy, 'terminate', applies (for both the local Nextflow executor and the DNAnexus executor). For more detailed information on choosing an errorStrategy, see the DNAnexus documentation.
The nextflow_schema.json file is needed to expose the Nextflow params (--samplesheet, --reads_dir and --outdir in this case) as DNAnexus applet inputs in the CLI and UI. If it is not present, you will not get the -isamplesheet, -ireads_dir and -ioutdir options for your applet inputs. You can also use it to validate parameters at runtime using Nextflow plugins.
Once you have written your script and know your parameters, you can make the schema quite quickly using the nfcore schema builder website. Note: do not put sensitive information into this builder, as information in it is stored by nfcore for 2 weeks.
There is also the option of using the nfcore schema tools on your computer to create it. You may need to manually add a format of either file-path or directory-path to some parameters if the tool does not do it for you.
Here we will explain how to use the schema builder website.
See the DNAnexus documentation for more information on options for nextflow_schema.json on DNAnexus.
Notice that the path to the samplesheet is specified differently from the reads_dir path in the previous example.