# Example 5: workflow

In this example, you will learn how to:

* Accept a BAM file as a workflow input
* Break the BAM into slices by chromosome
* Distribute the slices in parallel to count the number of alignments in each

## Getting Started

To begin, create a new directory called *view\_and\_count* and a *workflow\.wdl* file.

Here is the `workflow` definition you should add:

```
version 1.0

workflow bam_chrom_counter { 
    input {
        File bam 
    }

    String docker_img = "quay.io/biocontainers/samtools:1.12--hd5e65b6_0" 

    call slice_bam {
        input : bam = bam, 
                docker_img = docker_img
    }

    scatter (slice in slice_bam.slices) { 
        call count_bam {
            input: bam = slice,
                   docker_img = docker_img
        }
    }

    output { 
        File bai = slice_bam.bai
        Array[Int] count = count_bam.count
    }
}
```

* The name of this workflow is *bam\_chrom\_counter*.
* The workflow accepts a single, required `File` input that will be called `bam` as it is expected to be a BAM file.
* Use a [non-input declaration](https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#non-input-declarations) to define a `String` value naming the Docker image that contains Samtools.
* The first `call` will be to the `slice_bam` task that will break the BAM into one file per chromosome. The input for this task is the workflow's BAM file.
* The [`scatter`](https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#scatter) directive in WDL causes the actions in the block to be executed in parallel, which can lead to significant performance gains. Here, each `slice` file returned from the `slice_bam` task is used as the input to a separate `count_bam` call.
* The workflow defines two outputs: a BAM index file and an array of integer values representing the number of alignments in each of the BAM slices.
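
As a rough shell analogy for the scatter step (an illustration only, not how the execution engine is implemented), the sketch below starts one background job per input and then gathers the results by input position. The `count_lines` helper is hypothetical and stands in for `count_bam`; it counts lines in a text file so that the example runs without Samtools.

```shell
#!/usr/bin/env bash
# Shell analogy for WDL scatter/gather (illustrative only).
# count_lines is a hypothetical stand-in for the count_bam task.
count_lines() {
    wc -l < "$1" | tr -d ' '
}

# Run count_lines on every argument in parallel, then print the results
# in the same order as the inputs, whichever job finished first.
scatter_gather() {
    local workdir f i=0 n
    workdir=$(mktemp -d)
    for f in "$@"; do
        count_lines "$f" > "$workdir/$i.out" &   # one background job per input
        i=$((i + 1))
    done
    wait                                         # block until every job is done
    n=$i
    for ((i = 0; i < n; i++)); do                # gather by input index
        cat "$workdir/$i.out"
    done
    rm -rf "$workdir"
}
```

Calling `scatter_gather a.txt b.txt` prints one count per line, in argument order. In the workflow, the same positional correspondence is why `count_bam.count` has type `Array[Int]`: each element belongs to the slice at the same index in `slice_bam.slices`.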

Following is the `slice_bam` task that uses [Samtools](http://www.htslib.org/) to index the input BAM file and break it into one file for each of the 22 human autosomes (chr1 through chr22):

```
task slice_bam {
    input { 
        File bam
        String docker_img
    }

    command <<< 
    set -ex
    samtools index "~{bam}" 
    mkdir slices

    for i in $(seq 22); do 
        samtools view -b -o "slices/$i.bam" "~{bam}" "chr${i}" 
    done
    >>>

    runtime { 
        docker: docker_img
    }

    output { 
        File bai = "~{bam}.bai"
        Array[File] slices = glob("slices/*.bam") 
    }
}
```

* The inputs to this task are the BAM file and the name of the Docker image.
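* The command block uses triple-angle brackets (`<<<` and `>>>`) because the shell code uses the dollar sign (`$`). In this form, WDL interpolates only `~{}` placeholders, so `$` and `${}` pass through to the shell unchanged.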
* Use [`samtools index`](http://www.htslib.org/doc/samtools-index.html) on the input BAM file for fast random access to the alignments.
* The `$()` syntax in bash is command substitution. Here it runs the `seq` command to generate the integers 1 through 22, one for each of the 22 human non-sex chromosomes.
* The [`samtools view`](http://www.htslib.org/doc/samtools-view.html) command with `-b` and `-o` writes the alignments for a region like "chr1" in BAM format to the file *slices/1.bam*. Note the mix of `~{}` for WDL variables and `$` for bash variables.
* The [`runtime`](https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#runtime-section) block allows you to define a Docker image that contains an installation of Samtools.
* The output of this task is the BAM index, which is the given BAM file plus the suffix *.bai*, and the sliced alignment files.
* The `slices` will be one or more files as indicated by `Array[File]`, and they will be found using the [`glob`](https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#globs) function to look in the *slices* directory for all files with the extension *.bam*.
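
The loop and the glob pattern described above can be tried without Samtools. The following sketch uses two hypothetical helper functions: `region_names` prints the region names the loop generates (assuming the BAM uses "chr"-prefixed contig names, as the task does), and `glob_order` creates empty placeholder files named like the slices and lists them in shell glob order.

```shell
#!/usr/bin/env bash
# Print the region names that the slice_bam loop passes to `samtools view`.
# Assumes "chr"-prefixed contig names, as the task does.
region_names() {
    local i
    for i in $(seq 22); do
        echo "chr${i}"
    done
}

# Create empty placeholder files named like the task's slices, then print
# their names in the order the shell expands slices/*.bam.
glob_order() {
    local dir i f
    dir=$(mktemp -d)
    mkdir "$dir/slices"
    for i in $(seq 22); do
        : > "$dir/slices/$i.bam"
    done
    for f in "$dir"/slices/*.bam; do
        basename "$f"
    done
    rm -rf "$dir"
}

# Show where 10.bam and 2.bam land: names sort as text, not as numbers,
# so 10.bam is listed before 2.bam.
glob_order | grep -n -x -e '10\.bam' -e '2\.bam'
```

If the execution engine returns `glob` results in the shell's sort order (an assumption worth verifying on your platform), the `slices` array, and therefore the `count` array in the workflow output, will not be in chromosome order. One possible remedy is to zero-pad the file names, for example with `printf '%02d'`.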

The `count_bam` task is written to handle just one BAM slice:

```
task count_bam {
    input {
        File bam 
        String docker_img
    }

    command <<<
        samtools view -c "~{bam}" 
    >>>

    runtime {
        docker: docker_img
    }

    output {
        Int count = read_int(stdout()) 
    }
}
```

* This BAM input will be a slice of alignments for a given region. Naming this `bam` does not interfere with the `bam` variable in the workflow or any other task.
* Use the [`samtools view`](http://www.htslib.org/doc/samtools-view.html) command with `-c|--count` to count the number of alignments in the given file.
* The output of this task uses the [`read_int`](https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#int-read_intstringfile) function to read the `STDOUT` from the command as an integer value.
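
Because `read_int` expects the file to hold a single integer, anything else written to `STDOUT` by the command would break the task output. The sketch below is a hypothetical shell helper, `parse_single_int`, that approximates this expectation so you can test a command's output locally; it is not the engine's actual parser.

```shell
#!/usr/bin/env bash
# Hypothetical helper: echo the value and succeed only if the text is exactly
# one integer, roughly the shape that read_int(stdout()) expects.
parse_single_int() {
    local text=$1
    if [[ $text =~ ^-?[0-9]+$ ]]; then
        echo "$text"
    else
        echo "not a single integer: '$text'" >&2
        return 1
    fi
}

# Command substitution strips the trailing newline before the check.
parse_single_int "$(printf '42\n')"   # prints 42
```

With Samtools installed, `parse_single_int "$(samtools view -c some.bam)"` would apply the same check to a real count, where *some.bam* is a placeholder for your own file. Note that the trace output from `set -x`, as used in `slice_bam`, goes to `STDERR`, so it would not interfere with a value read from `STDOUT`.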

At this point, I like to use `miniwdl` to check the syntax:

```
$ miniwdl check workflow.wdl
workflow.wdl
    workflow bam_chrom_counter
        call slice_bam
        scatter slice
            call count_bam
    task count_bam
    task slice_bam
```

As no errors are reported, I will compile this onto the DNAnexus platform:

```
$ java -jar ~/dxCompiler-2.10.2.jar compile workflow.wdl \
        -archive \
        -folder /workflows \
        -project project-GFPQvY007GyyXgXGP7x9zbGb
workflow-GFqF27j07GyZ33JX4vzqgK32
```

Finally, I will run this workflow using a sample BAM file:

```
$ dx run workflow-GFqF27j07GyZ33JX4vzqgK32 \
> -istage-common.bam=file-G8V38KQ0zQ713kZGF6xQQvjJ -y

Using input JSON:
{
    "stage-common.bam": {
        "$dnanexus_link": "file-G8V38KQ0zQ713kZGF6xQQvjJ"
    }
}

Calling workflow-GFqF27j07GyZ33JX4vzqgK32 with output destination
  project-GFPQvY007GyyXgXGP7x9zbGb:/

Analysis ID: analysis-GFqF7Zj07GyZQ957Jy822gQY
```

Return to the DNAnexus website to monitor the progress of the analysis.

## Placing Task Definitions in Files

As the number of tasks increases, workflow definitions can get quite long. You can shorten the *workflow\.wdl* by placing each task in a separate file, which also makes it easier to reuse a task in a separate workflow. To do this, create a subdirectory called *tasks*, and then create a file called *tasks/slice\_bam.wdl* with the following contents:

```
version 1.0

task slice_bam {
    input {
        File bam
        String docker_img
    }

    command <<<
    set -ex
    samtools index "~{bam}"
    mkdir slices

    for i in $(seq 22); do
        samtools view -b -o "slices/$i.bam" "~{bam}" "chr${i}"
    done
    >>>

    runtime {
        docker: docker_img
    }

    output {
        File bai = "~{bam}.bai"
        Array[File] slices = glob("slices/*.bam")
    }
}
```

Also create the file *tasks/count\_bam.wdl* with the following contents:

```
version 1.0

task count_bam {
    input {
        File bam
        String docker_img
    }

    command <<<
        samtools view -c "~{bam}"
    >>>

    runtime {
        docker: docker_img
    }

    output {
        Int count = read_int(stdout())
    }
}
```

Both of the preceding tasks are identical to the original definitions, but note that the files include a `version` that matches the version of the workflow. Change *workflow\.wdl* as follows:

```
version 1.0

import "./tasks/slice_bam.wdl" as task_slice_bam 
import "./tasks/count_bam.wdl" as task_count_bam

workflow bam_chrom_counter {
    input {
        File bam
    }

    String docker_img = "quay.io/biocontainers/samtools:1.12--hd5e65b6_0"

    call task_slice_bam.slice_bam as slice_bam { 
        input : bam = bam,
                docker_img = docker_img
    }

    scatter (slice in slice_bam.slices) {
        call task_count_bam.count_bam as count_bam { 
            input: bam = slice,
                   docker_img = docker_img
        }
    }

    output {
        File bai = slice_bam.bai
        Array[Int] count = count_bam.count
    }
}
```

* Use [`import`](https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#import-statements) to include WDL code from a file or URI. Note the use of the `as` clause to alias the imports using a different name.
* Call `task_slice_bam.slice_bam` from the imported file using `as` to give it the same name as in the original workflow.
* Do the same with `task_count_bam.count_bam`.

Use `miniwdl` to check your syntax, then use dxCompiler to compile the workflow as before.
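
Since each task file now carries its own `version` line, it can be handy to confirm that all of the files agree before compiling. The following is a minimal sketch using two hypothetical helper functions, `wdl_version` and `same_wdl_version`; it only inspects the `version` line and does not replace `miniwdl check`.

```shell
#!/usr/bin/env bash
# Print the WDL version declared in a file: the second word of the first
# line whose first word is "version".
wdl_version() {
    awk '$1 == "version" { print $2; exit }' "$1"
}

# Succeed only if every listed file declares the same version as the first.
same_wdl_version() {
    local expected f
    expected=$(wdl_version "$1")
    for f in "$@"; do
        if [ "$(wdl_version "$f")" != "$expected" ]; then
            echo "$f: does not declare version $expected" >&2
            return 1
        fi
    done
}
```

From the *view\_and\_count* directory, `same_wdl_version workflow.wdl tasks/*.wdl` returns a zero exit status when all three files declare the same version, and otherwise names the first mismatching file.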

## Review

In this lesson, you learned how to:

* Accept a file as a workflow input
* Define a non-input declaration
* Use `scatter` to run tasks in parallel
* Use the output from one task as the input to another task
* Mix `~` and `$` in command blocks to dereference WDL and shell variables
* Import WDL from external sources such as local files or remote URIs

## Resources

[Full Documentation](https://documentation.dnanexus.com/)

If you run into technical issues, create a support ticket:

1. Go to the Help header (in the same section as Projects and Tools) inside the platform.
2. Select "Contact Support".
3. Fill in the Subject and Message to submit a support ticket.
