# Building Nextflow Applets


### Pipeline Script Folder Structure

See the DNAnexus documentation on [building and running Nextflow pipelines on DNAnexus](https://documentation.dnanexus.com/user/running-apps-and-workflows/running-nextflow-pipelines).

A Nextflow pipeline is structured as a folder of Nextflow scripts, with optional configuration files and subfolders. Below are the basic elements of the folder structure when building a Nextflow executable:

* **(Required)** A main Nextflow file with the extension `.nf` containing the pipeline. The default filename is `main.nf`. A different filename can be specified in the nextflow\.config file using `manifest.mainScript = 'myfile.nf'`
* **(Optional, recommended)** A `nextflow.config` file. [See here for nextflow config file information](https://www.nextflow.io/docs/latest/config.html#configuration-file)
* **(Optional, recommended)** A `nextflow_schema.json` file. If this file is present when importing or building the executable, the imported executable will expose the nextflow input parameters to the user on the DNAnexus CLI and UI.
* *(Optional)* Subfolders and other configuration files. These can be referenced by the main Nextflow file or nextflow\.config via the `include` or `includeConfig` keyword. Ensure that all referenced subfolders and files exist under the pipeline script folder at the time of building or importing the pipeline.
* *(Optional)* A `bin` folder containing scripts required by the pipeline. Nextflow adds this folder to the `PATH` environment variable; for more information see the [Nextflow documentation on custom scripts and tools](https://www.nextflow.io/docs/latest/faq.html#how-do-i-invoke-custom-scripts-and-tools)
* For other files/folders such as `assets`, an [nf-core](https://nf-co.re/pipelines)-flavored folder structure is encouraged but not required
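
For the minimal applet built below, the pipeline folder might look like this (the `bin/` entry is optional and shown only for illustration; it is not used by this example):

```
fastqc-nf/
├── main.nf               # required: the pipeline script
├── nextflow.config       # optional: config and default params
├── nextflow_schema.json  # optional: exposes params on the DNAnexus UI/CLI
└── bin/                  # optional: scripts here are added to PATH
```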

### Reviewing an example minimal nextflow applet

We are going to create the code for `fastqc-nf`, adding each file into a folder called `fastqc-nf`.

This is a very simple applet containing only one process, which runs FastQC on files specified either via an input samplesheet *or* from a folder in a project on the platform.

It has only three files:

* `main.nf` : The pipeline script file
* `nextflow.config` : Contains config info and sets params
* `nextflow_schema.json` : Specifies the information used by the UI/CLI run command to serve the Nextflow params to the user on DNAnexus

**The main.nf file**

Let's look at the `main.nf` file. As a reminder, this file can have a different name, specified in the `nextflow.config` file using `manifest.mainScript = 'myfile.nf'` if needed.

**main.nf**

```
// Use newest nextflow dsl - not required to add this line - only dsl2 is supported on DNAnexus
nextflow.enable.dsl = 2

log.info """\
    ===================================
            F A S T Q C - E X A M P L E
    ===================================
    samplesheet : ${params.samplesheet}
    reads_dir   : ${params.reads_dir}
    outdir      : ${params.outdir}
    """
    .stripIndent()


process FASTQC {

    tag "FastQC - ${sample_id}"

    container 'quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0'
    cpus 2
    memory { 4.GB * task.attempt }
    

    publishDir "${params.outdir}", pattern: "*", mode:'copy'

    input:
    tuple val(sample_id), path(reads)

    output:
    path "*"

    script:
    """
    fastqc --threads ${task.cpus} $reads
    """
}


/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    MAIN WORKFLOW
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*/

workflow {
    if (params.samplesheet != null && params.reads_dir == null) {
        
        reads_ch = Channel
            .fromPath(params.samplesheet)
            .splitCsv(header: true)
            // skip the header row and pair the two FASTQ files so the
            // tuple matches the FASTQC process input shape
            .map { row -> tuple(row.sample_name, [file(row.fastq_1), file(row.fastq_2)]) }

        reads_ch.view()
        FASTQC(reads_ch)

    } else if (params.samplesheet == null && params.reads_dir != null) {
        reads_ch = Channel.fromFilePairs(params.reads_dir)

        reads_ch.view()
        FASTQC(reads_ch)

    } else {
        error "Provide exactly one of samplesheet or reads_dir, not both or neither"
    }
}


workflow.onComplete {
    log.info ( workflow.success ? "\nworkflow is done!\n" : "Oops .. something went wrong" )
}
```

1. DNAnexus expects Nextflow pipelines to use the Nextflow DSL2 standard. If you learned Nextflow after December 2022 (when Nextflow version 22.12.0-edge was released and DSL1 support was removed), you are using DSL2.
   * [From the Nextflow docs](https://www.nextflow.io/docs/latest/dsl1.html) *"In Nextflow version 22.03.0-edge, DSL2 became the default DSL version. In version 22.12.0-edge, DSL1 support was removed, and the Nextflow documentation was updated to use DSL2 by default."*
2. Each process must use a Docker container to define the software environment for the process. See [here](https://www.nextflow.io/docs/latest/container.html#id7) for more information on using Docker containers in Nextflow processes. Here I am using a public Docker image on quay.io. This is the same container used by the [nf-core fastqc module](https://github.com/nf-core/modules/blob/master/modules/nf-core/fastqc/main.nf#L7). You might notice that the container line in the nf-core fastqc module is missing 'quay.io'; this is because, for nf-core pipelines, that part is expected to be supplied in the nextflow\.config using `docker.registry = 'quay.io'`. See [here for an example in sarek](https://github.com/nf-core/sarek/blob/3.4.0/nextflow.config#L289). In your own pipeline, you can do it however you please!
3. You should define `cpus`, `memory`, and/or `disk` (at least one of these three), or use `machineType` with the name of the exact [DNAnexus instance](https://documentation.dnanexus.com/developer/api/running-analyses/instance-types) that you want to use for this process.

   For example `machineType 'mem2_ssd1_v2_x2'`

   If you do not specify the resources required for a process, it will by default use the `mem2_ssd1_v2_x4` instance type (this is the same machine type used for the head node) and processes that require more memory than this will fail.
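
   As a sketch, a process header using either approach might look like the following (the directive values are illustrative only):

   ```
   process EXAMPLE {
       // Option 1: describe the resources and let the executor pick an instance
       cpus 4
       memory '8 GB'
       disk '50 GB'

       // Option 2: instead of the three directives above, pin an exact instance type:
       // machineType 'mem2_ssd1_v2_x2'

       script:
       """
       echo "running with ${task.cpus} cpus"
       """
   }
   ```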
4. You should use the [`publishDir` directive](https://documentation.dnanexus.com/user/running-apps-and-workflows/running-nextflow-pipelines#values-of-publishdir) to capture the output files that you want to publish from each process. It is generally advisable to publish your output files to an output directory defined by `params.outdir` (the name doesn't matter as long as it's consistent within your pipeline). You can have as many subfolders of your outdir as needed, and you can use the `publishDir` directive multiple times in the same process to send different output files to different subfolders.

An example of using publishDir multiple times in one process to send outputs to subfolders

```
process foo {

    publishDir "${params.outdir}/fastqc/html", pattern: "*.html", mode:'copy'
    publishDir "${params.outdir}/fastqc/zip", pattern: "*.zip"

..
}
```

Only the 'copy' mode of publishDir is supported on DNAnexus. If you do not specify a mode, then the DNAnexus executor will use copy by default so both of the publishDir lines in the example above are valid.

Assuming at runtime you assign outdir the value of './results', this example places all output files ending in .html in ./results/fastqc/html and all output files ending in .zip in ./results/fastqc/zip on the head node of the Nextflow run.

The entire outdir, with subfolder structure intact, will be copied to the platform location specified by `--destination` on the CLI or 'Output to' in the UI once all subjobs have completed.

**Only relative paths are allowed for publishDir on DNAnexus and thus params.outdir (since this is where files are published to)** [See reference](https://documentation.dnanexus.com/user/running-apps-and-workflows/running-nextflow-pipelines#values-of-publishdir)

General [Nextflow publishDir advice](https://www.nextflow.io/docs/latest/process.html#publishdir): do not attempt to access files in the publishDir directories from within a Nextflow script, since publishing happens asynchronously and the files may not yet exist when a downstream task needs them. Use channels to pass files between processes.

5. In this example applet, I have placed the process and the workflow in the `main.nf` script. For larger multi-process applets, you can place your processes in modules/workflows/subworkflows and import them into the main script, as is done in nf-core pipelines.
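
As a hedged sketch of that modular layout (the `modules/fastqc.nf` path and its contents are hypothetical), the process definition would move into its own file and be imported in `main.nf` with `include`:

```
// main.nf
include { FASTQC } from './modules/fastqc'

workflow {
    // build the input channel as before, then call the imported process
    reads_ch = Channel.fromFilePairs(params.reads_dir)
    FASTQC(reads_ch)
}
```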

**The nextflow\.config file**

**Full File:**

```
// Default parameters

docker {
    enabled = true
}

params {
    samplesheet = null
    reads_dir = null
    outdir = "./results"
}

// Processes should always fail if any pipe element has a non-zero exit code.
process.shell = ['/bin/bash', '-euo', 'pipefail']
```

**Explanation of Each Section:**

1. Enable docker by default for this pipeline

```
docker {
    enabled = true
}
```

2. Define the input parameters. You can also do this in the `main.nf` script, but by convention nf-core pipelines do it in the nextflow\.config. There are three params in this workflow: `samplesheet`, a file input; `reads_dir`, a directory path; and `outdir`, a string defining the name of the output folder.

```
params {
    samplesheet = null
    reads_dir = null
    outdir = "./results"
}
```

3. Here I have assigned `samplesheet` and `reads_dir` the value of `null`, so if the user does not provide a samplesheet or a reads\_dir at runtime, the pipeline will fail. For parameters such as the samplesheet that should always or nearly always change at runtime, it is valuable to assign a `null` value instead of a default, so that a user does not accidentally run the pipeline with a default samplesheet thinking they have used a different one.
4. Here `outdir` is assigned a default of `./results`. If a user does not specify a string for outdir at runtime, `./results` is used; if a user does specify an outdir, the user-specified value is used instead.

```
// Processes should always fail if any pipe element has a non-zero exit code.
process.shell = ['/bin/bash', '-euo', 'pipefail']
```

5. A common setting that makes a process fail quickly and loudly when it encounters an issue. [Here is a more thorough explanation](https://gist.github.com/mohanpedala/1e2ff5661761d3abd0385e8223e16425#set--e--u--x--o-pipefail).
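
To see what `pipefail` buys you, here is a small shell demonstration (independent of Nextflow) of how it changes a pipeline's exit status:

```shell
# Without pipefail, a pipeline's exit status is that of its LAST command,
# so the failure of `false` is silently swallowed:
bash -c 'false | true'
echo "without pipefail: $?"   # prints: without pipefail: 0

# With pipefail, any failing element makes the whole pipeline fail:
bash -c 'set -o pipefail; false | true'
echo "with pipefail: $?"      # prints: with pipefail: 1
```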

**Error Strategy** I have not defined an error strategy in the `nextflow.config` file. Thus, the default strategy (for both the local Nextflow executor and the DNAnexus executor) is 'terminate'. For more detailed information on choosing an errorStrategy, [see this section](#error-strategies)

**queue-size** I have also not defined the `queueSize`, so when this applet runs, a maximum of 5 subjobs will run in parallel at any one time, unless you pass the `-queue-size` flag in the `nextflow_run_opts` option of the applet
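
If you would rather bake a different ceiling into the pipeline itself instead of passing it at runtime, standard Nextflow lets you set it in `nextflow.config` (a sketch; verify against the DNAnexus docs how this interacts with `-queue-size` on the platform):

```
// nextflow.config
executor {
    queueSize = 20   // allow up to 20 tasks to run in parallel
}
```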

#### The nextflow\_schema.json file

The `nextflow_schema.json` file is needed to expose the Nextflow params (`--samplesheet`, `--reads_dir` and `--outdir` in this case) as DNAnexus applet inputs in the CLI and UI. If it is not present, you will not get the `-isamplesheet`, `-ireads_dir` and `-ioutdir` options for your applet inputs. You can also use it to do parameter validation at runtime using plugins such as [nf-validation](https://github.com/nextflow-io/nf-validation).

**nextflow\_schema.json**

```
{
  "$schema": "http://json-schema.org/draft-07/schema",
  "$id": "https://raw.githubusercontent.com/YOUR_PIPELINE/master/nextflow_schema.json",
  "title": "Nextflow pipeline parameters",
  "description": "This pipeline uses Nextflow and processes some kind of data. The JSON Schema was built using the nf-core pipeline schema builder.",
  "type": "object",
  "definitions": {
      "inputs": {
          "title": "Inputs",
          "type": "object",
          "description": "",
          "default": "",
          "properties": {
              "samplesheet": {
                  "type": "string",
                  "description": "Input samplesheet in CSV format",
                  "format": "file-path"
              },
              "reads_dir": {
                "type": "string",
                "description": "Reads directory for file pairs with wildcard",
                "format": "directory-path"
            },             
              "outdir": {
                  "type": "string",
                  "format": "directory-path",
                  "description": "Local path to output directory",
                  "default": "./results"
              }
          }
      }
  },
  "allOf": [
      {
          "$ref": "#/definitions/inputs"
      }
  ]
}
```

#### Creating a nextflow\_schema.json file

Once you have written your script and know your parameters, you can build the schema quite quickly using the [nf-core pipeline schema builder website](https://nf-co.re/pipeline_schema_builder). *Note: do not put sensitive information into this builder, as the information in it is stored by nf-core for 2 weeks.*

There is also the option of using the [nf-core tools](https://nf-co.re/tools#build-a-pipeline-schema) package (`nf-core schema build`) on your computer to create it. You may need to manually add a `format` of either `file-path` or `directory-path` to some parameters if it doesn't do it for you.

Here we will explain how to use the [nf-core pipeline schema builder website](https://nf-co.re/pipeline_schema_builder):

1. In the `New Schema` section, click the blue `Submit` button to start.
2. Near the top of the page, click the 'Add group' button. You need at least one group in your schema file for it to function on the platform, and all parameters must be placed into a group (you can do this by dragging and dropping them into the group). For example, you might have one group called Inputs for all your input parameters and another called Output for your output parameters, with the appropriate parameters placed into the correct groups. Click `required` for every non-optional parameter.
3. The default input type is a string. For file and directory path input parameters, click the little wheel to the right.
4. At the bottom of the popup, in the Format section, choose `File path` for a file input or `Directory path` for a directory path. Getting these two right is important for how you specify the inputs on the platform.
5. When you are finished building your schema file, click 'Finished', then 'Copy pipeline schema' and paste the information into a file called nextflow\_schema.json in the same directory as your applet main.nf and nextflow\.config files.
6. If you note the `Schema cache ID`, you can type it into the website within 14 days to pull up and edit that file.

To remove an input parameter for the pipeline from the UI and CLI, you can delete it from the nextflow\_schema.json file, or place it in a section of the nextflow\_schema.json file that is not referenced in the `allOf` section at the bottom of the json file.

You can also remove entire sections by removing their reference from the `allOf` section without deleting them from the file.
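
For illustration, suppose the file had a hypothetical second section called `advanced_options` under `definitions`. Listing only `inputs` in `allOf` keeps `advanced_options` in the file but hides its parameters from the DNAnexus UI and CLI:

```
"allOf": [
    {
        "$ref": "#/definitions/inputs"
    }
]
```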

#### **Build the nextflow applet**

Ensure that you are in the project that you want to build the applet in using `dx pwd` or `dx env`. `dx select` the correct project if required.

```
# select project
dx select project-ID
```

Assuming you have the folder called fastqc-nf with these contents (main.nf is required at a minimum):

```
main.nf 
nextflow.config
nextflow_schema.json
```

Build the applet. The applet will be built in the root of your project.

If you are currently inside the fastqc-nf folder on your machine, you will need to `cd ..` up a level for the command below to work.

```
dx build --nextflow fastqc-nf
```

or build using `--destination` to set the project level folder for the applet

```
dx build -a --nextflow fastqc-nf --destination project-XXXXX:/TEST/fastqc-nf
```

or, to build in the root of the project and just change the name to test-fastqc-nf, run

```
dx build -a --nextflow fastqc-nf --destination project-XXXXX:/test-fastqc-nf
```

You should see an output like the one below but with a different applet ID.

```
{"id": "applet-ID"}
```

Use `-a` with `dx build` to archive previous versions of your applet, or `-f` to force-overwrite them. Archived versions are placed in a folder called `.Applet_archive` in the root of the project.

You can see the build help using `dx build -h` or `dx build --help`

#### How file-path and directory-path in nextflow\_schema.json affect run options

*In the DNAnexus UI*:

* `file-path` will be rendered as a file picker, which lets you select a single file object in the UI
* `directory-path` will be rendered as a string and appears in the UI as a text box. You can point to a directory by typing a string path such as `dx://<project-id>:/test/`, or to multiple files in a path such as `dx://<project-id>:/test/*_R{1,2}.fastq.gz`
* `string` is rendered as a string and appears as a text box input on the UI.

Here is part of the fastqc-nf run setup screen

![](https://1979569080-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FPtCOm9rXoRi4P9rh1ET8%2Fuploads%2Fgit-blob-b1687e675b6a92bc464d0403705195f067374caa%2Ffastqc-nf-run-screen-1.png?alt=media)

Notice how samplesheet has 'Select File' and a file icon, but outdir and reads\_dir have text input boxes.

This is because samplesheet was given 'file-path' in the nextflow\_schema.json, while outdir and reads\_dir were given 'directory-path', which renders as a string input, hence the text box.

*In the DNAnexus CLI*:

Run the applet with `-h` to see the input parameters for the applet

```
dx run fastqc-nf -h
```

Excerpt of output from command above

```
usage: dx run fastqc-nf [-iINPUT_NAME=VALUE ...]

Applet: fastqc-nf

fastqc-nf

Inputs:
  outdir: [-ioutdir=(string)]
        (Nextflow pipeline required) Default value:./results

  reads_dir: [-ireads_dir=(string)]
        (Nextflow pipeline required)

  samplesheet: [-isamplesheet=(file)]
        (Nextflow pipeline required)

        ....
```

* `string` will appear as class `string` e.g., for param `outdir`

  The default shown here is the one we specified in nextflow\_schema.json. The CLI cannot 'see' the default set in the `nextflow.config`, so make sure the two match when building the JSON.

  ```
  outdir: [-ioutdir=(string)]
      (Nextflow pipeline required) Default value:./results
  ```
* `directory-path` will appear as class `(string)` e.g., for param `reads_dir`

  ```
  reads_dir: [-ireads_dir=(string)]
      (Nextflow pipeline required)
  ```

  When `(string)` is given for a parameter (used for folder paths and strings; the input is of the 'string' class), use `dx://project-XXXXX:/path/to/folder`, e.g., `dx run fastqc-nf -ireads_dir=dx://project-GgYbKGQ0QFpxF6qkPK4KxQ6Q:/FASTQ/*_{1,2}.fastq.gz`
* `file-path` will appear as class `file` e.g. for param `samplesheet`:

  ```
  samplesheet: [-isamplesheet=(file)]
      (Nextflow pipeline required)
  ```

  When `(file)` is given for a parameter (i.e., the input is of the 'file' class), use `project-XXXXX:/path/to/file`, e.g., `dx run fastqc-nf -isamplesheet=project-XXXXX:/samplesheet-example.csv ....`

See [here](https://documentation.dnanexus.com/user/running-apps-and-workflows/running-nextflow-pipelines#nextflow-input-parameter-type-conversion-to-dnanexus-executable-input-parameter-class) for more information on options for `nextflow_schema.json` on DNAnexus.

### Running the Nextflow Pipeline Applet

#### Using samplesheets

When placing a path to a file on the DNAnexus platform in a samplesheet, use the format `dx://project-xxx:/path/to/file`

Here is an example of a samplesheet with one sample (the samplesheet format is determined by you; this one is just for illustration):

```
sample_name,fastq_1,fastq_2
sampleA,dx://project-xxx:/path/to/sampleA_r1.fastq.gz,dx://project-xxx:/path/to/sampleA_r2.fastq.gz
```

#### Run the applet from the UI

1. In your project on the platform, click the fastqc-nf applet

![](https://1979569080-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FPtCOm9rXoRi4P9rh1ET8%2Fuploads%2Fgit-blob-1e8c47e0a41403650705f0eff2ca1494f79cfb3c%2Ffastqc-nf.png?alt=media)

2. In the run applet screen, click 'Output to' and choose your output location.

   <figure><img src="https://1979569080-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FPtCOm9rXoRi4P9rh1ET8%2Fuploads%2Fgit-blob-b1687e675b6a92bc464d0403705195f067374caa%2Ffastqc-nf-run-screen-1-01.png?alt=media" alt=""><figcaption></figcaption></figure>
3. Click 'Next'
4. At the setup screen, either input a samplesheet or write the path for reads\_dir. In the image below, I have used the reads\_dir param. Replace 'project-xxx' and '/path/to/reads' with your project ID and the folder your reads are in.

![](https://1979569080-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FPtCOm9rXoRi4P9rh1ET8%2Fuploads%2Fgit-blob-d289045698e3d3eb4ff7f0d78e11aacef9e1db8d%2Ffastqc-nf-inputs.png?alt=media)

5. Review the rest of the inputs and change anything that you want, e.g., turn on 'preserve\_cache'.

![](https://1979569080-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FPtCOm9rXoRi4P9rh1ET8%2Fuploads%2Fgit-blob-c7e9546eb022c44c219862e31870978cb27c82f5%2Ffastqc-nf-run-screen-2.png?alt=media)

6. Click 'Start Analysis'

![](https://1979569080-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FPtCOm9rXoRi4P9rh1ET8%2Fuploads%2Fgit-blob-81a69a61407821fa23344ea137b60c7a9d77d97e%2Fstart-analysis-new.png?alt=media)

7. Review the name, output location etc

![](https://1979569080-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FPtCOm9rXoRi4P9rh1ET8%2Fuploads%2Fgit-blob-496afe0086e3396e7c509dceb2a2885e90b58df2%2Freview-and-start.png?alt=media)

8. Click 'Launch Analysis'

### Run the applet on the CLI

**Running the fastqc applet with the reads\_dir as input**

* I am turning on `preserve_cache` and using `-inextflow_run_opts` in the command below for demonstration of how to add them to the command but neither are required here
* Note that the `*_{1,2}.fastq.gz` is needed here for Channel.fromFilePairs to correctly pair up related files
* I do not need `-profile docker` in `-inextflow_run_opts` as docker was enabled in the `nextflow.config` for this applet
* `--name` names the job

```
dx run fastqc-nf \
-ireads_dir="dx://project-ID:/FASTQ/*_{1,2}.fastq.gz" \
-ioutdir="./fastqc-out-rd" \
-ipreserve_cache=true \
-inextflow_run_opts='-queue-size 10' \
--destination "project-ID:/USERS/FOLDERNAME" \
--name fastqc-nf-with-reads-dir \
-y
```

**Running the fastqc applet with the samplesheet as input**

```
dx run fastqc-nf -isamplesheet="project-ID:/samplesheet-example.csv" \
-ioutdir="./fastqc-out-sh" \
--destination "project-ID:/USERS/FILENAME" \
--name fastqc-nf-with-samplesheet \
-y
```

Notice the different way that the path to the samplesheet is specified compared to the reads\_dir in the previous example. You can read more about this [here](https://documentation.dnanexus.com/user/running-apps-and-workflows/running-nextflow-pipelines#formats-of-path-to-file-folder-or-wildcards).

### Resources

[Full Documentation](https://documentation.dnanexus.com/)

To create a support ticket if there are technical issues:

1. Go to the Help header (same section where Projects and Tools are) inside the platform
2. Select "Contact Support"
3. Fill in the Subject and Message to submit a support ticket.

*Some of the links on these pages take the user to pages maintained by third parties. The accuracy and IP rights of the information on those third-party pages are the responsibility of the third parties.*
