# Example 4: cnvkit

To begin, you'll create a bash app to run [CNVKit](https://cnvkit.readthedocs.io/en/stable/), which will find "genome-wide copy number from high-throughput sequencing." Create a local directory to hold your work, and consider putting the contents into a source code repository like Git.

In this example, you will:

* Use various package managers to install dependencies
* Build an asset
* Learn to use `dx-download-all-inputs` and `dx-upload-all-outputs`

## Create a Project

From the web interface, select "Projects → All Projects" to see your project list. Click the "New Project" button to create a new project called "CNVkit." Alternatively, use `dx new project` to do this from the command line. However you create the project, confirm that it is selected: run `dx pwd` to check your current working directory, and use `dx select` to switch projects if needed.
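The command-line route could look something like the sketch below. It wraps the calls in a function so that nothing runs until you invoke it. The helper name `create_and_select_project` is hypothetical, and the sketch assumes the dx toolkit is installed and that you are already logged in.

```shell
#!/bin/bash
# Sketch: create a project, select it, and confirm the selection.
# Assumes the dx toolkit is installed and you have run `dx login`.
# create_and_select_project is a hypothetical helper, not part of dx.
create_and_select_project() {
    local name="$1"
    dx new project "$name"   # create the project
    dx select "$name"        # make it the current project
    dx pwd                   # print the current project and folder
}

# Usage (uncomment to run against the platform):
# create_and_select_project "CNVkit"
```

If several of your projects share the same name, `dx select` may ask you to choose among them, so selecting by project ID is the less ambiguous option.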

## Build a bash app with dx-app-wizard

Inside your working directory, run the command `dx-app-wizard cnvkit_bash` to launch the [app wizard tool](https://documentation.dnanexus.com/developer/apps/intro-to-building-apps#other-app-wizard-templates). Optionally provide a title, summary, and version at the prompts.

### The Input Specification

The app will accept two inputs:

1. One or more BAM files of the tumor samples: Give this input the name *bam\_tumor* with the label "BAM Tumor Files." For the class, choose *array:file*, and indicate that this is **not** an optional parameter.
2. A reference file: Give this input the name *reference* with the label "Reference." For the class, choose *file*, and indicate that this is **not** an optional parameter.

When prompted for the third input, press *Enter* to end the inputs.

### The Output Specification

Define three outputs, each of the type *array:file* with the following names and whatever labels you feel are appropriate:

1. *cns*
2. *cns\_filtered*
3. *plot*

Press *Enter* when prompted for the fourth output to indicate you are finished.

### Other Options

* Press *Enter* to accept the default value for the timeout policy.
* Type *bash* for the programming language.
* Type *y* to indicate that the app will need internet access.
* Type *n* to indicate that the app will **not** need access to the parent project.
* Press *Enter* to accept the default value for the instance type or select one from the list shown.

You should see a message saying the app's template was created in a directory whose name matches the app's name. For instance, I have the following:

```
$ find cnvkit_bash -type f
cnvkit_bash/dxapp.json 
cnvkit_bash/Readme.md 
cnvkit_bash/Readme.developer.md 
cnvkit_bash/src/cnvkit_bash.sh 
```

* *dxapp.json*: A JSON file containing metadata that will be used to create the app on the DNAnexus platform.
* *Readme.md*: A stub for user documentation.
* *Readme.developer.md*: A stub for developer documentation.
* *src/cnvkit\_bash.sh*: A template bash script for the app's functionality.

## Examine dxapp.json

The *dxapp.json* file that was created by the wizard should look like the following:

```
{
  "name": "cnvkit_bash",
  "title": "cnvkit_bash",
  "summary": "cnvkit_bash",
  "dxapi": "1.0.0",
  "version": "0.0.1",
  "inputSpec": [
    {
      "name": "bam_tumor",
      "label": "BAM Tumor Files",
      "class": "array:file",
      "optional": false,
      "patterns": [
        "*"
      ],
      "help": ""
    },
    {
      "name": "reference",
      "label": "Reference",
      "class": "file",
      "optional": false,
      "patterns": [
        "*"
      ],
      "help": ""
    }
  ],
  "outputSpec": [
    {
      "name": "cns",
      "label": "CNS",
      "class": "array:file",
      "patterns": [
        "*"
      ],
      "help": ""
    },
    {
      "name": "cns_filtered",
      "label": "CNS Filtered",
      "class": "array:file",
      "patterns": [
        "*"
      ],
      "help": ""
    },
    {
      "name": "plot",
      "label": "Plot",
      "class": "array:file",
      "patterns": [
        "*"
      ],
      "help": ""
    }
  ],
  "runSpec": {
    "timeoutPolicy": {
      "*": {
        "hours": 48
      }
    },
    "interpreter": "bash",
    "file": "src/cnvkit_bash.sh",
    "distribution": "Ubuntu",
    "release": "24.04",
    "version": "0"
  },
  "access": {
    "network": [
      "*"
    ]
  },
  "regionalOptions": {
    "aws:us-east-1": {
      "systemRequirements": {
        "*": {
          "instanceType": "mem1_ssd1_v2_x4"
        }
      }
    }
  }
}
```

See the [app metadata documentation](https://documentation.dnanexus.com/developer/apps/app-metadata) for a more complete understanding of all the possible fields and their implications.

## Add Python and R Module Dependencies

CNVkit has dependencies on both Python and R modules that must be installed before running. In the `dxapp.json`, you can specify dependencies that can be installed with the following package managers:

* `apt` (Ubuntu)
* `pip` (Python)
* `cpan` (Perl)
* `cran` (R)
* `gem` (Ruby)

The Python module `cnvkit` can be installed via `pip`, but the software also requires an R module called `DNAcopy` that must be installed using [Bioconductor](https://www.bioconductor.org/install/), which must first be installed using `cran`. This means you'll have to manually install the `DNAcopy` module when the app starts.

To add these runtime dependencies, use a text editor to update the *runSpec* and add the following *execDepends* section that will install the Python `cnvkit` and R `BiocManager` modules before the app is executed:

```
"runSpec": {
    "interpreter": "bash",
    "file": "src/cnvkit_bash.sh",
    "distribution": "Ubuntu",
    "release": "20.04",
    "version": "0",
    "execDepends": [
      {
        "name": "cnvkit",
        "package_manager": "pip"
      },
      {
        "name": "BiocManager",
        "package_manager": "cran"
      }
    ],
    "timeoutPolicy": {
      "*": {
        "hours": 48
      }
    }
}
```

## Specify File Patterns for Inputs

In the *inputSpec*, change the *patterns* to match the expected file extensions:

* *bam\_tumor*: \*.bam
* *reference*: \*.cnn

Your *dxapp.json* should now look like the following:

```
{
  "name": "cnvkit_bash",
  "title": "cnvkit_bash",
  "summary": "cnvkit_bash",
  "dxapi": "1.0.0",
  "version": "0.0.1",
  "inputSpec": [
    {
      "name": "bam_tumor",
      "label": "BAM Tumor Files",
      "class": "array:file",
      "optional": false,
      "patterns": [
        "*.bam"
      ],
      "help": ""
    },
    {
      "name": "reference",
      "label": "Reference",
      "class": "file",
      "optional": false,
      "patterns": [
        "*.cnn"
      ],
      "help": ""
    }
  ],
  "outputSpec": [
    {
      "name": "cns",
      "label": "CNS",
      "class": "array:file",
      "patterns": [
        "*"
      ],
      "help": ""
    },
    {
      "name": "cns_filtered",
      "label": "CNS Filtered",
      "class": "array:file",
      "patterns": [
        "*"
      ],
      "help": ""
    },
    {
      "name": "plot",
      "label": "Plot",
      "class": "array:file",
      "patterns": [
        "*"
      ],
      "help": ""
    }
  ],
  "runSpec": {
    "timeoutPolicy": {
      "*": {
        "hours": 48
      }
    },
    "execDepends": [
      {
        "name": "cnvkit",
        "package_manager": "pip"
      },
      {
        "name": "BiocManager",
        "package_manager": "cran"
      }
    ],
    "interpreter": "bash",
    "file": "src/cnvkit_bash.sh",
    "distribution": "Ubuntu",
    "release": "20.04",
    "version": "0"
  },
  "access": {
    "network": [
      "*"
    ]
  },
  "regionalOptions": {
    "aws:us-east-1": {
      "systemRequirements": {
        "*": {
          "instanceType": "mem1_ssd1_v2_x4"
        }
      }
    }
  }
}
```
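A stray quote or comma in a hand-edited JSON file is easy to miss, so it can help to check that the file parses before building. The following is a minimal sketch, assuming `python3` is on your `PATH`; the helper name `check_json` is hypothetical.

```shell
#!/bin/bash
# Sketch: check that a JSON file parses before running `dx build`.
# Assumes python3 is on PATH; check_json is a hypothetical helper name.
check_json() {
    local file="$1"
    if python3 -m json.tool "$file" > /dev/null 2>&1; then
        echo "OK: $file is valid JSON"
    else
        echo "ERROR: $file is not valid JSON" >&2
        return 1
    fi
}

# Usage: check_json cnvkit_bash/dxapp.json
```

This only confirms that the file is well-formed JSON. It does not check that the field names or values are ones the platform accepts.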

## Edit the bash Code

The default bash code generated by the wizard starts with a generous header of comments that you may or may not wish to keep. The default code prints the values of the input variables, then downloads the input files individually. The app code belongs in the middle, after downloading the inputs and before uploading the outputs:

```
main() {

    echo "Value of bam_tumor: '${bam_tumor[@]}'"
    echo "Value of reference: '$reference'"

    # The following line(s) use the dx command-line tool to download your file
    # inputs to the local file system using variable names for the filenames. To
    # recover the original filenames, you can use the output of "dx describe
    # "$variable" --name".

    dx download "$reference" -o reference
    for i in ${!bam_tumor[@]}
    do
        dx download "${bam_tumor[$i]}" -o bam_tumor-$i
    done

    # >>>>> Here is where the app code belongs <<<<<

    # The following line(s) use the dx command-line tool to upload your file
    # outputs after you have created them on the local file system.  It assumes
    # that you have used the output field name for the filename for each output,
    # but you can change that behavior to suit your needs.  Run "dx upload -h"
    # to see more options to set metadata.

    cns=$(dx upload cns --brief)
    cns_filtered=$(dx upload cns_filtered --brief)
    plot=$(dx upload plot --brief)

    # The following line(s) use the utility dx-jobutil-add-output to format and
    # add output variables to your job's output as appropriate for the output
    # class.  Run "dx-jobutil-add-output -h" for more information on what it
    # does.

    dx-jobutil-add-output cns "$cns" --class=file
    dx-jobutil-add-output cns_filtered "$cns_filtered" --class=file
    dx-jobutil-add-output plot "$plot" --class=file
}
```

Replace the contents of `src/cnvkit_bash.sh` with the following code:

```
#!/bin/bash

# Set pragmas to print commands and fail on errors
set -exuo pipefail

# Install required R module
Rscript -e "BiocManager::install('DNAcopy')"

# Verify the value of inputs
echo "Value of bam_tumor: '${bam_tumor[@]}'"
echo "Value of reference: '$reference'"

# Place all inputs into the "in" directory
dx-download-all-inputs --parallel

# Use "_path" versions of inputs for file paths
cnvkit.py batch \
    "${bam_tumor_path[@]}" \
    -r "${reference_path}" \
    -p "$(( $(nproc) - 1 ))" \
    -d cnvkit-out/ \
    --scatter

# Make out directories for each output spec
mkdir -p ~/out/cns/ ~/out/cns_filtered/ ~/out/plot/

# Move CNVkit outputs to the "out" directory for upload
mv cnvkit-out/*.call.cns    ~/out/cns_filtered/
mv cnvkit-out/*.cns         ~/out/cns/
mv cnvkit-out/*-scatter.png ~/out/plot/

# Upload and annotate all output files
dx-upload-all-outputs --parallel
```

Rather than downloading the inputs individually as in the original template, this version downloads all the inputs in parallel with the following command:

```
dx-download-all-inputs --parallel
```

This will create an *in* directory with subdirectories named according to the input names. Note that the *bam\_tumor* input is an array of files, so this directory will contain numbered subdirectories starting at 0 for each input file:

```
in/bam_tumor/0/...
in/bam_tumor/1/...
in/reference/...
```
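To get a feel for this layout without launching a job, you can mimic it locally. The sketch below makes no dx calls. It creates empty placeholder files (the file names are hypothetical) under the input names from *dxapp.json* and then collects the paths much as the app script sees them through `bam_tumor_path` and `reference_path`.

```shell
#!/bin/bash
# Sketch: mimic the "in" layout locally to see how array inputs are arranged.
# No dx calls are made; the file names below are hypothetical placeholders.
set -euo pipefail

workdir=$(mktemp -d)
mkdir -p "$workdir/in/bam_tumor/0" "$workdir/in/bam_tumor/1" "$workdir/in/reference"
touch "$workdir/in/bam_tumor/0/sample_a.bam" \
      "$workdir/in/bam_tumor/1/sample_b.bam" \
      "$workdir/in/reference/reference.cnn"

# Collect one path per array element (glob order matches the index
# order as long as there are fewer than ten files)
bam_paths=("$workdir"/in/bam_tumor/*/*.bam)
reference_path=$(echo "$workdir"/in/reference/*.cnn)

echo "Number of BAM files: ${#bam_paths[@]}"
for i in "${!bam_paths[@]}"; do
    echo "bam_tumor[$i] -> ${bam_paths[$i]#"$workdir"/}"
done
echo "reference -> ${reference_path#"$workdir"/}"
```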

Similarly, the app script calls `dx-upload-all-outputs`, which expects an *out* directory with a subdirectory for each name in the output specification.
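The order of the `mv` commands in the app script matters, because the `*.cns` pattern also matches files ending in `.call.cns`. The following sketch illustrates this with empty placeholder files in a temporary directory. It runs neither CNVkit nor dx, and the sample file names are hypothetical.

```shell
#!/bin/bash
# Sketch: sort mock CNVkit outputs into the "out" layout without running CNVkit.
# File names are hypothetical; only the glob patterns come from the app script.
set -euo pipefail

workdir=$(mktemp -d)
mkdir -p "$workdir/cnvkit-out"
touch "$workdir/cnvkit-out/sample.cns" \
      "$workdir/cnvkit-out/sample.call.cns" \
      "$workdir/cnvkit-out/sample-scatter.png"

# One subdirectory per output name in the outputSpec
mkdir -p "$workdir/out/cns" "$workdir/out/cns_filtered" "$workdir/out/plot"

# Move *.call.cns first: the broader *.cns glob would otherwise match it too
mv "$workdir"/cnvkit-out/*.call.cns    "$workdir/out/cns_filtered/"
mv "$workdir"/cnvkit-out/*.cns         "$workdir/out/cns/"
mv "$workdir"/cnvkit-out/*-scatter.png "$workdir/out/plot/"

# List where each file ended up
find "$workdir/out" -type f | sort
```

If the two `.cns` moves were swapped, both files would land in *out/cns* and, under `set -e`, the script would stop at the `*.call.cns` move because that pattern would no longer match anything.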

## Build the Applet

Use `dx pwd` to ensure you are in the correct project and `dx select` to change projects, if necessary. If you are inside the bash source directory where the *dxapp.json* file exists, you can run `dx build -f`. If you are in the parent directory, run `dx build -f cnvkit_bash`. Here is a sample output from successfully building the applet:

```
$ dx build -f
{"id": "applet-GFyV3kj0VGFkV8k04f3K11QY"}
```

The `-f|--overwrite` flag indicates you wish to overwrite any previous version of the applet. You may also want to use the `-a|--archive` flag to move any previous versions to an archived location. You won't need either of these flags the first time you build, but subsequent builds require that you indicate how to handle previous versions of the applet. Run `dx build --help` to learn more about build options.
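Because the build prints the new applet's ID as a small JSON document, you can capture the ID in a shell variable for later commands. The sketch below shows one way to do this with `sed`, using the sample output shown above. The helper name `parse_applet_id` is hypothetical.

```shell
#!/bin/bash
# Sketch: pull the applet ID out of the JSON that `dx build` prints.
# parse_applet_id is a hypothetical helper; it reads JSON on stdin.
parse_applet_id() {
    sed -n 's/.*"id": *"\([^"]*\)".*/\1/p'
}

# Demonstration with the sample build output
applet_id=$(echo '{"id": "applet-GFyV3kj0VGFkV8k04f3K11QY"}' | parse_applet_id)
echo "$applet_id"
```

In practice you would pipe the real build into the helper, for example `applet_id=$(dx build -f cnvkit_bash | parse_applet_id)`, assuming the build prints its JSON on standard output as in the sample.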

## Run the bash applet

Download this BAM file and add it to the inputs directory:

{% file src="<https://1979569080-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FPtCOm9rXoRi4P9rh1ET8%2Fuploads%2F3ObBsfC8lFAhRBqJ1JgE%2FBAM.zip?alt=media&token=e5b0dfb6-4d4d-4859-b17c-db9e94b9eeea>" %}

Indicate an output directory, click the Run button, and then click "View Log" to watch the job's progress.

You can also run the applet on the command line with the `-h|--help` flag to verify the inputs and outputs:

```
$ dx run applet-GFyV3kj0VGFkV8k04f3K11QY -h
usage: dx run applet-GFyV3kj0VGFkV8k04f3K11QY [-iINPUT_NAME=VALUE ...]

Applet: cnvkit_bash

cnvkit_bash

Inputs:
  BAM Tumor Files: -ibam_tumor=(file) [-ibam_tumor=... [...]]

  Reference: -ireference=(file)

Outputs:
  CNS: cns (array:file)

  CNS Filtered: cns_filtered (array:file)

  Plot: plot (array:file)
```

Select the input files on the web interface to note the file IDs that can be used to execute the app from the command line as follows:

```
$ dx run -y --watch applet-GFyV3kj0VGFkV8k04f3K11QY \
    -ibam_tumor=file-GFxXjV006kZVQPb20G85VXBp \
    -ireference=file-GFxXvpj06kZfP0QVKq2p2FGF \
    --destination /outputs
```

You should see output from the preceding command that includes a JSON document with the inputs:

```
Using input JSON:
{
    "bam_tumor": [
        {
            "$dnanexus_link": "file-GFxXjV006kZVQPb20G85VXBp"
        }
    ],
    "reference": {
        "$dnanexus_link": "file-GFxXvpj06kZfP0QVKq2p2FGF"
    }
}
```

Note that you can place this JSON into a file and launch the applet with the inputs specified with the `-f|--input-json-file` option, as follows. Use `dx run -h` to learn about other command-line options:

```
$ dx run -y --watch applet-GFyV3kj0VGFkV8k04f3K11QY \
    -f cnvkit_bash/inputs.json \
    --destination /outputs
```
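If you prefer to generate the inputs file rather than copy it by hand, a small helper can write it from the file IDs. This is a sketch only: `write_inputs_json` is a hypothetical name, and it handles a single BAM file to match the example above.

```shell
#!/bin/bash
# Sketch: write an inputs.json file from file IDs.
# write_inputs_json is a hypothetical helper; substitute your own file IDs.
write_inputs_json() {
    local out_file="$1" bam_id="$2" reference_id="$3"
    # \$ keeps the literal dollar sign in "$dnanexus_link"
    cat > "$out_file" <<EOF
{
    "bam_tumor": [
        {
            "\$dnanexus_link": "$bam_id"
        }
    ],
    "reference": {
        "\$dnanexus_link": "$reference_id"
    }
}
EOF
}

# Usage with the file IDs from the example above:
# write_inputs_json cnvkit_bash/inputs.json \
#     file-GFxXjV006kZVQPb20G85VXBp file-GFxXvpj06kZfP0QVKq2p2FGF
```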

Note the job ID from `dx run`, and use `dx watch` to watch the job to completion and `dx describe` to view the job's metadata. Alternatively, you can use the web platform to launch the job, using the file selector to specify each of the inputs, and then use the "Monitor" view to check the job's status and view the output files when the job completes.
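For scripted runs, the launch-and-monitor steps could be strung together as in the sketch below. The helper name `run_and_wait` is hypothetical. The sketch assumes the dx toolkit is installed and you are logged in, and it uses `dx wait` to block until the job finishes instead of streaming the log with `dx watch`.

```shell
#!/bin/bash
# Sketch: launch the applet, wait for the job, and show its metadata.
# run_and_wait is a hypothetical helper; assumes the dx toolkit and a login.
run_and_wait() {
    local applet_id="$1" inputs_json="$2" job_id
    # --brief prints only the job ID; -y skips the confirmation prompt
    job_id=$(dx run "$applet_id" -f "$inputs_json" --destination /outputs -y --brief)
    echo "Launched $job_id"
    dx wait "$job_id"          # block until the job finishes
    dx describe "$job_id"      # show the job's metadata
}

# Usage (uncomment to run against the platform):
# run_and_wait applet-GFyV3kj0VGFkV8k04f3K11QY cnvkit_bash/inputs.json
```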

## Build an Asset

You'll notice the applet takes quite a while to run (around 14 minutes for me) because of the module installations. You can build an asset for these installations and use this in *dxapp.json*. Create a directory called *cnvkit\_asset* with the following file *dxasset.json*:

```
{
    "name": "cnvkit_asset",
    "title": "cnvkit_asset",
    "description": "cnvkit_asset",
    "version": "0.0.1",
    "distribution": "Ubuntu",
    "release": "20.04",
    "execDepends": [
        {
          "name": "cnvkit",
          "package_manager": "pip"
        },
        {
          "name": "BiocManager",
          "package_manager": "cran"
        }
    ]
}
```

Also create a *Makefile* with the following contents:

```
SHELL=/bin/bash -exuo pipefail
all:
	sudo Rscript -e "BiocManager::install('DNAcopy')"
```

Run `dx build_asset` to create the asset. This will launch a job that will report the asset ID at the end:

```
Asset bundle 'record-GFyVY000X1ZK3yGg4qv32GXv' is built and can now be used
in your app/applet's dxapp.json
```
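The record ID in that message is what the applet needs to reference. If you are scripting the build, you could pull the ID out of the message text as sketched below. The helper name `extract_record_id` is hypothetical, and the sketch assumes the message format shown in the sample.

```shell
#!/bin/bash
# Sketch: extract the asset record ID from the build message.
# extract_record_id is a hypothetical helper; it reads the log text on stdin.
extract_record_id() {
    grep -o 'record-[A-Za-z0-9]*' | head -n 1
}

message="Asset bundle 'record-GFyVY000X1ZK3yGg4qv32GXv' is built and can now be used"
record_id=$(echo "$message" | extract_record_id)

# Print an assetDepends entry for the runSpec of dxapp.json
echo "\"assetDepends\": [{\"id\": \"$record_id\"}],"
```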

Update the *runSpec* in *dxapp.json* to the following:

```
  "runSpec": {
    "timeoutPolicy": {
      "*": {
        "hours": 48
      }
    },
    "assetDepends": [{"id": "record-GFyVY000X1ZK3yGg4qv32GXv"}],
    "interpreter": "bash",
    "file": "src/cnvkit_bash.sh",
    "distribution": "Ubuntu",
    "release": "20.04",
    "version": "0"
  },
```

Use `dx build -f` and note the new applet's ID. Create a JSON input as follows:

```
$ cat inputs.json
{
    "bam_tumor": [
        {
            "$dnanexus_link": "file-GFxXjV006kZVQPb20G85VXBp"
        }
    ],
    "reference": {
        "$dnanexus_link": "file-GFxXvpj06kZfP0QVKq2p2FGF"
    }
}
```

Launch the new app from the CLI with the following command:

```
$ dx run applet-GFyVppQ0VGFxvvx44j43YyPz -f inputs.json -y
```

Use `dx watch` with the new job ID to see how the run now uses the asset to run faster. I see about a 10-minute difference with the asset.

## Review

You learned two ways to include app dependencies: declaring them for package managers in *execDepends*, and building an asset, with a *Makefile* for the steps the package managers cannot handle. The first strategy installs dependencies at runtime, while the second builds all the dependencies before the applet is run, making the runtime much faster.

## Resources

[Full Documentation](https://documentation.dnanexus.com/)

If you run into technical issues, create a support ticket:

1. Go to the Help header (same section where Projects and Tools are) inside the platform
2. Select "Contact Support"
3. Fill in the Subject and Message to submit a support ticket.
