# Building Nextflow Applets


### Pipeline Script Folder Structure

See the DNAnexus documentation on [building and running Nextflow pipelines on DNAnexus](https://documentation.dnanexus.com/user/running-apps-and-workflows/running-nextflow-pipelines).

A Nextflow pipeline is structured as a folder of Nextflow scripts, with optional configuration files and subfolders. Below are the basic elements of the folder structure when building a Nextflow executable:

* **(Required)** A main Nextflow file with the extension `.nf` containing the pipeline. The default filename is `main.nf`. A different filename can be specified in the nextflow\.config file using `manifest.mainScript = 'myfile.nf'`
* **(Optional, recommended)** A `nextflow.config` file. [See here for nextflow config file information](https://www.nextflow.io/docs/latest/config.html#configuration-file)
* **(Optional, recommended)** A `nextflow_schema.json` file. If this file is present when importing or building the executable, the imported executable will expose the nextflow input parameters to the user on the DNAnexus CLI and UI.
* *(Optional)* Subfolders and other configuration files. These can be referenced by the main Nextflow file or nextflow\.config via the `include` or `includeConfig` keyword. Ensure that all referenced subfolders and files exist under the pipeline script folder at the time of building or importing the pipeline.
* *(Optional)* A `bin` folder containing scripts required by the pipeline. Nextflow adds this folder to the `PATH` environment variable; for more information see the [Nextflow documentation on custom scripts and tools](https://www.nextflow.io/docs/latest/faq.html#how-do-i-invoke-custom-scripts-and-tools)
* For other files/folders such as `assets`, an [nf-core](https://nf-co.re/pipelines)-flavored folder structure is encouraged but not required
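
For the minimal applet built below, the pipeline folder might look like this (the `bin/` entry is optional and shown only for illustration; it is not used by this example):

```
fastqc-nf/
├── main.nf               # required: the pipeline script
├── nextflow.config       # optional: config and default params
├── nextflow_schema.json  # optional: exposes params on the DNAnexus UI/CLI
└── bin/                  # optional: scripts here are added to PATH
```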

### Reviewing an example minimal nextflow applet

We are going to create the code for `fastqc-nf`, adding each file into a folder called `fastqc-nf`.

This is a very simple applet containing only one process, which runs FastQC on files specified either via an input samplesheet *or* from a folder in a project on the platform.

It has only three files:

* `main.nf` : The pipeline script file
* `nextflow.config` : Contains config info and sets params
* `nextflow_schema.json` : Specifies the information used by the UI/CLI run command to serve the Nextflow params to the user on DNAnexus

**The main.nf file**

Let's look at the `main.nf` file. As a reminder, this file can have a different name, specified in the `nextflow.config` file using `manifest.mainScript = 'myfile.nf'` if needed.

**main.nf**

```
// Use newest nextflow dsl - not required to add this line - only dsl2 is supported on DNAnexus
nextflow.enable.dsl = 2

log.info """\
    ===================================
            F A S T Q C - E X A M P L E
    ===================================
    samplesheet : ${params.samplesheet}
    reads_dir   : ${params.reads_dir}
    outdir      : ${params.outdir}
    """
    .stripIndent()


process FASTQC {

    tag "FastQC - ${sample_id}"

    container 'quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0'
    cpus 2
    memory { 4.GB * task.attempt }
    

    publishDir "${params.outdir}", pattern: "*", mode:'copy'

    input:
    tuple val(sample_id), path(reads)

    output:
    path "*"

    script:
    """
    fastqc --threads ${task.cpus} $reads
    """
}


/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    MAIN WORKFLOW
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*/

workflow {
    if (params.samplesheet != null && params.reads_dir == null) {
        
        reads_ch = Channel
            .fromPath(params.samplesheet)
            .splitCsv(header: true)
            // skip the header row and pair the two FASTQ files so the
            // tuple matches the FASTQC process input shape
            .map { row -> tuple(row.sample_name, [file(row.fastq_1), file(row.fastq_2)]) }

        reads_ch.view()
        FASTQC(reads_ch)

    } else if (params.samplesheet == null && params.reads_dir != null) {
        reads_ch = Channel.fromFilePairs(params.reads_dir)

        reads_ch.view()
        FASTQC(reads_ch)

    } else {
        error "Provide exactly one of samplesheet or reads_dir, not both or neither"
    }
}


workflow.onComplete {
    log.info ( workflow.success ? "\nworkflow is done!\n" : "Oops .. something went wrong" )
}
```

1. DNAnexus expects Nextflow pipelines to use the Nextflow DSL2 standard. If you learned Nextflow after December 2022 (when Nextflow version 22.12.0-edge was released and DSL1 support was removed), you are using DSL2.
   * [From the Nextflow docs](https://www.nextflow.io/docs/latest/dsl1.html) *"In Nextflow version 22.03.0-edge, DSL2 became the default DSL version. In version 22.12.0-edge, DSL1 support was removed, and the Nextflow documentation was updated to use DSL2 by default."*
2. Each process must use a Docker container to define the software environment for the process. See [here](https://www.nextflow.io/docs/latest/container.html#id7) for more information on using Docker containers in Nextflow processes. Here I am using a public Docker image on quay.io. This is the same container used by the [nf-core fastqc module](https://github.com/nf-core/modules/blob/master/modules/nf-core/fastqc/main.nf#L7). You might notice that the container line in the nf-core fastqc module is missing 'quay.io'; this is because, for nf-core pipelines, that part is expected to be supplied in the nextflow\.config using `docker.registry = 'quay.io'`. See [here for an example in sarek](https://github.com/nf-core/sarek/blob/3.4.0/nextflow.config#L289). In your own pipeline, you can do it however you please!
3. You should define `cpus`, `memory`, and/or `disk` (at least one of these three), or use `machineType` with the name of the exact [DNAnexus instance](https://documentation.dnanexus.com/developer/api/running-analyses/instance-types) that you want to use for this process.

   For example `machineType 'mem2_ssd1_v2_x2'`

   If you do not specify the resources required for a process, it will by default use the `mem2_ssd1_v2_x4` instance type (this is the same machine type used for the head node) and processes that require more memory than this will fail.
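
   As a sketch, a process header using either approach might look like the following (the directive values are illustrative only):

   ```
   process EXAMPLE {
       // Option 1: describe the resources and let the executor pick an instance
       cpus 4
       memory '8 GB'
       disk '50 GB'

       // Option 2: instead of the three directives above, pin an exact instance type:
       // machineType 'mem2_ssd1_v2_x2'

       script:
       """
       echo "running with ${task.cpus} cpus"
       """
   }
   ```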
4. You should use the [`publishDir` directive](https://documentation.dnanexus.com/user/running-apps-and-workflows/running-nextflow-pipelines#values-of-publishdir) to capture the output files that you want to publish from each process. It is generally advisable to publish your output files to an output directory defined by `params.outdir` (the name doesn't matter as long as it's consistent within your pipeline). You can have as many subfolders of your outdir as needed, and you can use the `publishDir` directive multiple times in the same process to send different output files to different subfolders.

An example of using publishDir multiple times in one process to send outputs to subfolders

```
process foo {

    publishDir "${params.outdir}/fastqc/html", pattern: "*.html", mode:'copy'
    publishDir "${params.outdir}/fastqc/zip", pattern: "*.zip"

..
}
```

Only the 'copy' mode of publishDir is supported on DNAnexus. If you do not specify a mode, then the DNAnexus executor will use copy by default so both of the publishDir lines in the example above are valid.

Assuming at runtime you assign outdir the value of './results', this example places all output files ending in .html in ./results/fastqc/html and all output files ending in .zip in ./results/fastqc/zip on the head node of the Nextflow run.

The entire outdir, with subfolder structure intact, will be copied to the platform location specified by `--destination` on the CLI or 'Output to' in the UI once all subjobs have completed.

**Only relative paths are allowed for publishDir on DNAnexus and thus params.outdir (since this is where files are published to)** [See reference](https://documentation.dnanexus.com/user/running-apps-and-workflows/running-nextflow-pipelines#values-of-publishdir)

General [Nextflow publishDir advice](https://www.nextflow.io/docs/latest/process.html#publishdir): do not attempt to access files in the publishDir directories from within a Nextflow script, since publishing happens asynchronously and the files may not yet exist when a downstream task needs them. Use channels to pass files between processes.

5. In this example applet, I have placed the process and the workflow in the `main.nf` script. For larger multi-process applets, you can place your processes in modules/workflows/subworkflows and import them into the main script, as is done in nf-core pipelines.
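
As a hedged sketch of that modular layout (the `modules/fastqc.nf` path and its contents are hypothetical), the process definition would move into its own file and be imported in `main.nf` with `include`:

```
// main.nf
include { FASTQC } from './modules/fastqc'

workflow {
    // build the input channel as before, then call the imported process
    reads_ch = Channel.fromFilePairs(params.reads_dir)
    FASTQC(reads_ch)
}
```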

**The nextflow\.config file**

**Full File:**

```
// Default parameters

docker {
    enabled = true
}

params {
    samplesheet = null
    reads_dir = null
    outdir = "./results"
}

// Processes should always fail if any pipe element has a non-zero exit code.
process.shell = ['/bin/bash', '-euo', 'pipefail']
```

**Explanation of Each Section:**

1. Enable docker by default for this pipeline

```
docker {
    enabled = true
}
```

2. Define the input parameters. You can also do this in the `main.nf` script, but by convention nf-core pipelines do it in the nextflow\.config. There are three params in this workflow: `samplesheet`, a file input; `reads_dir`, a directory path; and `outdir`, a string defining the name of the output folder.

```
params {
    samplesheet = null
    reads_dir = null
    outdir = "./results"
}
```

3. Here I have assigned `samplesheet` and `reads_dir` the value of `null`, so if the user does not provide a samplesheet or a reads\_dir at runtime, the pipeline will fail. For parameters such as the samplesheet that should always or nearly always change at runtime, it is valuable to assign a `null` value instead of a default, so that a user does not accidentally run the pipeline with a default samplesheet thinking they have used a different one.
4. Here `outdir` is assigned a default of `./results`. If a user does not specify a string for outdir at runtime, `./results` is used; if a user does specify an outdir, the user-specified value is used instead.

```
// Processes should always fail if any pipe element has a non-zero exit code.
process.shell = ['/bin/bash', '-euo', 'pipefail']
```

5. A common setting that makes a process fail quickly and loudly when it encounters an issue. [Here is a more thorough explanation](https://gist.github.com/mohanpedala/1e2ff5661761d3abd0385e8223e16425#set--e--u--x--o-pipefail).
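
To see what `pipefail` buys you, here is a small shell demonstration (independent of Nextflow) of how it changes a pipeline's exit status:

```shell
# Without pipefail, a pipeline's exit status is that of its LAST command,
# so the failure of `false` is silently swallowed:
bash -c 'false | true'
echo "without pipefail: $?"   # prints: without pipefail: 0

# With pipefail, any failing element makes the whole pipeline fail:
bash -c 'set -o pipefail; false | true'
echo "with pipefail: $?"      # prints: with pipefail: 1
```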

**Error Strategy** I have not defined an error strategy in the `nextflow.config` file. Thus, the default strategy (for both the local Nextflow executor and the DNAnexus executor) is 'terminate'. For more detailed information on choosing an errorStrategy, [see this section](#error-strategies)

**queue-size** I have also not defined the `queueSize`, so when this applet runs, a maximum of 5 subjobs will run in parallel at any one time, unless you pass the `-queue-size` flag in the `nextflow_run_opts` option of the applet
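
If you would rather bake a different ceiling into the pipeline itself instead of passing it at runtime, standard Nextflow lets you set it in `nextflow.config` (a sketch; verify against the DNAnexus docs how this interacts with `-queue-size` on the platform):

```
// nextflow.config
executor {
    queueSize = 20   // allow up to 20 tasks to run in parallel
}
```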

#### The nextflow\_schema.json file

The `nextflow_schema.json` file is needed to expose the Nextflow params (`--samplesheet`, `--reads_dir` and `--outdir` in this case) as DNAnexus applet inputs in the CLI and UI. If it is not present, you will not get the `-isamplesheet`, `-ireads_dir` and `-ioutdir` options for your applet inputs. You can also use it to do parameter validation at runtime using plugins such as [nf-validation](https://github.com/nextflow-io/nf-validation).

**nextflow\_schema.json**

```
{
  "$schema": "http://json-schema.org/draft-07/schema",
  "$id": "https://raw.githubusercontent.com/YOUR_PIPELINE/master/nextflow_schema.json",
  "title": "Nextflow pipeline parameters",
  "description": "This pipeline uses Nextflow and processes some kind of data. The JSON Schema was built using the nf-core pipeline schema builder.",
  "type": "object",
  "definitions": {
      "inputs": {
          "title": "Inputs",
          "type": "object",
          "description": "",
          "default": "",
          "properties": {
              "samplesheet": {
                  "type": "string",
                  "description": "Input samplesheet in CSV format",
                  "format": "file-path"
              },
              "reads_dir": {
                "type": "string",
                "description": "Reads directory for file pairs with wildcard",
                "format": "directory-path"
            },             
              "outdir": {
                  "type": "string",
                  "format": "directory-path",
                  "description": "Local path to output directory",
                  "default": "./results"
              }
          }
      }
  },
  "allOf": [
      {
          "$ref": "#/definitions/inputs"
      }
  ]
}
```

#### Creating a nextflow\_schema.json file

Once you have written your script and know your parameters, you can build the schema quite quickly using the [nf-core pipeline schema builder website](https://nf-co.re/pipeline_schema_builder). *Note: do not put sensitive information into this builder, as the information in it is stored by nf-core for 2 weeks.*

There is also the option of using the [nf-core tools](https://nf-co.re/tools#build-a-pipeline-schema) package (`nf-core schema build`) on your computer to create it. You may need to manually add a `format` of either `file-path` or `directory-path` to some parameters if it doesn't do it for you.

Here we will explain how to use the [nf-core pipeline schema builder website](https://nf-co.re/pipeline_schema_builder):

1. In the `New Schema` section, click the blue `Submit` button to start.
2. Near the top of the page, click the 'Add group' button. You need at least one group in your schema file for it to function on the platform, and all parameters must be placed into a group (you can do this by dragging and dropping them into the group). For example, you might have one group called Inputs for all your input parameters and another called Output for your output parameters, with the appropriate parameters placed into the correct groups. Click `required` for every non-optional parameter.
3. The default input type is a string. For file and directory path input parameters, click the little wheel to the right.
4. At the bottom of the popup, in the Format section, choose `File path` for a file input or `Directory path` for a directory path. Getting these two right is important for how you specify the inputs on the platform.
5. When you are finished building your schema file, click 'Finished', then 'Copy pipeline schema' and paste the information into a file called nextflow\_schema.json in the same directory as your applet main.nf and nextflow\.config files.
6. If you note the `Schema cache ID`, you can type it into the website within 14 days to pull up and edit that file.

To remove an input parameter for the pipeline from the UI and CLI, you can delete it from the nextflow\_schema.json file, or place it in a section of the nextflow\_schema.json file that is not referenced in the `allOf` section at the bottom of the json file.

You can also remove entire sections by removing their reference from the `allOf` section without deleting them from the file.
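
For illustration, suppose the file had a hypothetical second section called `advanced_options` under `definitions`. Listing only `inputs` in `allOf` keeps `advanced_options` in the file but hides its parameters from the DNAnexus UI and CLI:

```
"allOf": [
    {
        "$ref": "#/definitions/inputs"
    }
]
```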

#### **Build the nextflow applet**

Ensure that you are in the project that you want to build the applet in using `dx pwd` or `dx env`. `dx select` the correct project if required.

```
# select project
dx select project-ID
```

Assuming you have the folder called fastqc-nf with these contents (main.nf is required at a minimum):

```
main.nf 
nextflow.config
nextflow_schema.json
```

Build the applet. The applet will be built in the root of your project.

If you are currently inside the fastqc-nf folder on your machine, you will need to `cd ..` up a level for the command below to work.

```
dx build --nextflow fastqc-nf
```

or build using `--destination` to set the project level folder for the applet

```
dx build -a --nextflow fastqc-nf --destination project-XXXXX:/TEST/fastqc-nf
```

or, to build in the root of the project and just change the name to test-fastqc-nf, run

```
dx build -a --nextflow fastqc-nf --destination project-XXXXX:/test-fastqc-nf
```

You should see an output like the one below but with a different applet ID.

```
{"id": "applet-ID"}
```

Use `-a` with `dx build` to archive previous versions of your applet, or `-f` to force-overwrite them. Archived versions are placed in a folder called `.Applet_archive` in the root of the project.

You can see the build help using `dx build -h` or `dx build --help`

#### How file-path and directory-path in nextflow\_schema.json affect run options

*In the DNAnexus UI*:

* `file-path` will be rendered as a file picker, which lets you select a single file object in the UI
* `directory-path` will be rendered as a string and appears in the UI as a text box. You can point to a directory by typing a string path such as `dx://<project-id>:/test/`, or to multiple files in a path such as `dx://<project-id>:/test/*_R{1,2}.fastq.gz`
* `string` is rendered as a string and appears as a text box input on the UI.

Here is part of the fastqc-nf run setup screen

![](https://1979569080-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FPtCOm9rXoRi4P9rh1ET8%2Fuploads%2Fgit-blob-b1687e675b6a92bc464d0403705195f067374caa%2Ffastqc-nf-run-screen-1.png?alt=media)

Notice how samplesheet has 'Select File' and a file icon, but outdir and reads\_dir have text input boxes.

This is because samplesheet was given 'file-path' in the nextflow\_schema.json, while outdir and reads\_dir were given 'directory-path', which renders as a string input, hence the text box.

*In the DNAnexus CLI*:

Run the applet with `-h` to see the input parameters for the applet

```
dx run fastqc-nf -h
```

Excerpt of output from command above

```
usage: dx run fastqc-nf [-iINPUT_NAME=VALUE ...]

Applet: fastqc-nf

fastqc-nf

Inputs:
  outdir: [-ioutdir=(string)]
        (Nextflow pipeline required) Default value:./results

  reads_dir: [-ireads_dir=(string)]
        (Nextflow pipeline required)

  samplesheet: [-isamplesheet=(file)]
        (Nextflow pipeline required)

        ....
```

* `string` will appear as class `string` e.g., for param `outdir`

  The default shown here is the one we specified in nextflow\_schema.json. The CLI cannot 'see' the default set in the `nextflow.config`, so make sure the two match when building the JSON.

  ```
  outdir: [-ioutdir=(string)]
      (Nextflow pipeline required) Default value:./results
  ```
* `directory-path` will appear as class `(string)` e.g., for param `reads_dir`

  ```
  reads_dir: [-ireads_dir=(string)]
      (Nextflow pipeline required)
  ```

  When `(string)` is given for a parameter (used for folder paths and strings; the input is of the 'string' class), use `dx://project-XXXXX:/path/to/folder`, e.g., `dx run fastqc-nf -ireads_dir=dx://project-GgYbKGQ0QFpxF6qkPK4KxQ6Q:/FASTQ/*_{1,2}.fastq.gz`
* `file-path` will appear as class `file` e.g. for param `samplesheet`:

  ```
  samplesheet: [-isamplesheet=(file)]
      (Nextflow pipeline required)
  ```

  When `(file)` is given for a parameter (i.e., the input is of the 'file' class), use `project-XXXXX:/path/to/file`, e.g., `dx run fastqc-nf -isamplesheet=project-XXXXX:/samplesheet-example.csv ....`

See [here](https://documentation.dnanexus.com/user/running-apps-and-workflows/running-nextflow-pipelines#nextflow-input-parameter-type-conversion-to-dnanexus-executable-input-parameter-class) for more information on options for `nextflow_schema.json` on DNAnexus.

### Running the Nextflow Pipeline Applet

#### Using samplesheets

When placing a path to a file on the DNAnexus platform in a samplesheet, use the format `dx://project-xxx:/path/to/file`

Here is an example of a samplesheet with one sample (the samplesheet format is determined by you; this one is just for illustration):

```
sample_name,fastq_1,fastq_2
sampleA,dx://project-xxx:/path/to/sampleA_r1.fastq.gz,dx://project-xxx:/path/to/sampleA_r2.fastq.gz
```

#### Run the applet from the UI

1. In your project on the platform, click the fastqc-nf applet

![](https://1979569080-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FPtCOm9rXoRi4P9rh1ET8%2Fuploads%2Fgit-blob-1e8c47e0a41403650705f0eff2ca1494f79cfb3c%2Ffastqc-nf.png?alt=media)

2. In the run applet screen, click 'Output to' and choose your output location.

   <figure><img src="https://1979569080-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FPtCOm9rXoRi4P9rh1ET8%2Fuploads%2Fgit-blob-b1687e675b6a92bc464d0403705195f067374caa%2Ffastqc-nf-run-screen-1-01.png?alt=media" alt=""><figcaption></figcaption></figure>
3. Click 'Next'
4. At the setup screen, either input a samplesheet or write the path for reads\_dir. In the image below, I have used the reads\_dir param. Replace 'project-xxx' and '/path/to/reads' with your project ID and the folder your reads are in.

![](https://1979569080-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FPtCOm9rXoRi4P9rh1ET8%2Fuploads%2Fgit-blob-d289045698e3d3eb4ff7f0d78e11aacef9e1db8d%2Ffastqc-nf-inputs.png?alt=media)

5. Review the rest of the inputs and change anything that you want, e.g., turn on 'preserve\_cache'.

![](https://1979569080-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FPtCOm9rXoRi4P9rh1ET8%2Fuploads%2Fgit-blob-c7e9546eb022c44c219862e31870978cb27c82f5%2Ffastqc-nf-run-screen-2.png?alt=media)

6. Click 'Start Analysis'

![](https://1979569080-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FPtCOm9rXoRi4P9rh1ET8%2Fuploads%2Fgit-blob-81a69a61407821fa23344ea137b60c7a9d77d97e%2Fstart-analysis-new.png?alt=media)

7. Review the name, output location etc

![](https://1979569080-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FPtCOm9rXoRi4P9rh1ET8%2Fuploads%2Fgit-blob-496afe0086e3396e7c509dceb2a2885e90b58df2%2Freview-and-start.png?alt=media)

8. Click 'Launch Analysis'

### Run the applet on the CLI

**Running the fastqc applet with the reads\_dir as input**

* I am turning on `preserve_cache` and using `-inextflow_run_opts` in the command below for demonstration of how to add them to the command but neither are required here
* Note that the `*_{1,2}.fastq.gz` is needed here for Channel.fromFilePairs to correctly pair up related files
* I do not need `-profile docker` in `-inextflow_run_opts` as docker was enabled in the `nextflow.config` for this applet
* `--name` names the job

```
dx run fastqc-nf \
-ireads_dir="dx://project-ID:/FASTQ/*_{1,2}.fastq.gz" \
-ioutdir="./fastqc-out-rd" \
-ipreserve_cache=true \
-inextflow_run_opts='-queue-size 10' \
--destination "project-ID:/USERS/FOLDERNAME" \
--name fastqc-nf-with-reads-dir \
-y
```

**Running the fastqc applet with the samplesheet as input**

```
dx run fastqc-nf -isamplesheet="project-ID:/samplesheet-example.csv" \
-ioutdir="./fastqc-out-sh" \
--destination "project-ID:/USERS/FILENAME" \
--name fastqc-nf-with-samplesheet \
-y
```

Notice the different way that the path to the samplesheet is specified compared to the reads\_dir in the previous example. You can read more about this [here](https://documentation.dnanexus.com/user/running-apps-and-workflows/running-nextflow-pipelines#formats-of-path-to-file-folder-or-wildcards).

### Resources

[Full Documentation](https://documentation.dnanexus.com/)

To create a support ticket if there are technical issues:

1. Go to the Help header (same section where Projects and Tools are) inside the platform
2. Select "Contact Support"
3. Fill in the Subject and Message to submit a support ticket.

*Some of the links on these pages take the user to pages maintained by third parties. The accuracy and IP rights of the information on those third-party pages are the responsibility of the third parties.*
