Example 4: cnvkit

To begin, you'll create a bash app to run CNVKit, which will find "genome-wide copy number from high-throughput sequencing." Create a local directory to hold your work, and consider putting the contents into a source code repository like Git.

In this example, you will:

Use various package managers to install dependencies
Build an asset
Learn to use dx-download-all-inputs and dx-upload-all-outputs

Create a Project

From the web interface, select "Projects → All Projects" to see your project list. Click the "New Project" button to create a new project called "CNVkit." Alternatively, use dx new project to do this from the command line. However you choose to create a project, be sure this has been selected by running dx pwd to check your current working directory and using dx select to select the project, if needed.

Build a bash app with dx-app-wizard

Inside your working directory, run the command dx-app-wizard cnvkit_bash to launch the app wizard tool. Optionally provide a title, summary, and version at the prompts.

The Input Specification

The app will accept two inputs:

One or more BAM files of the tumor samples: Give this input the name bam_tumor with the label "BAM Tumor Files." For the class, choose array:file, and indicate that this is not an optional parameter.
A reference file: Give this input the name reference with the label "Reference." For the class, choose file, and indicate that this is not an optional parameter.

When prompted for the third input, press Enter to end the inputs.

The Output Specification

Define three outputs, each of the type array:file with the following names and whatever labels you feel are appropriate:

cns
cns_filtered
plot

Press Enter when prompted for the fourth output to indicate you are finished.

Other Options

Press Enter to accept the default value for the timeout policy.
Type bash for the programming language.
Type y to indicate that the app will need internet access.
Type n to indicate that the app will not need access to the parent project.
Press Enter to accept the default value for the instance type or select one from the list shown.

You should see a message saying the app's template was created in a directory name matching the app's name. For instance, I have the following:

$ find cnvkit_bash -type f
cnvkit_bash/dxapp.json 
cnvkit_bash/Readme.md 
cnvkit_bash/Readme.developer.md 
cnvkit_bash/src/cnvkit_bash.sh

dxapp.json: a JSON file containing metadata that will be used to create the app on the DNAnexus platform.
Readme.md: a stub for user documentation.
Readme.developer.md: a stub for developer documentation.
src/cnvkit_bash.sh: a template bash script for the app's functionality.

Examine dxapp.json

The dxapp.json file that was created by the wizard should look like the following:

{
  "name": "cnvkit_bash",
  "title": "cnvkit_bash",
  "summary": "cnvkit_bash",
  "dxapi": "1.0.0",
  "version": "0.0.1",
  "inputSpec": [
    {
      "name": "bam_tumor",
      "label": "BAM Tumor Files",
      "class": "array:file",
      "optional": false,
      "patterns": [
        "*"
      ],
      "help": ""
    },
    {
      "name": "reference",
      "label": "Reference",
      "class": "file",
      "optional": false,
      "patterns": [
        "*"
      ],
      "help": ""
    }
  ],
  "outputSpec": [
    {
      "name": "cns",
      "label": "CNS",
      "class": "array:file",
      "patterns": [
        "*"
      ],
      "help": ""
    },
    {
      "name": "cns_filtered",
      "label": "CNS Filtered",
      "class": "array:file",
      "patterns": [
        "*"
      ],
      "help": ""
    },
    {
      "name": "plot",
      "label": "Plot",
      "class": "array:file",
      "patterns": [
        "*"
      ],
      "help": ""
    }
  ],
  "runSpec": {
    "timeoutPolicy": {
      "*": {
        "hours": 48
      }
    },
    "interpreter": "bash",
    "file": "src/cnvkit_bash.sh",
    "distribution": "Ubuntu",
    "release": "20.04",
    "version": "0"
  },
  "access": {
    "network": [
      "*"
    ]
  },
  "regionalOptions": {
    "aws:us-east-1": {
      "systemRequirements": {
        "*": {
          "instanceType": "mem1_ssd1_v2_x4"
        }
      }
    }
  }
}

See the app metadata documentation for a more complete understanding of all the possible fields and their implications.

Add Python and R Module Dependencies

CNVkit has dependencies on both Python and R modules that must be installed before running. In the dxapp.json, you can specify dependencies that can be installed with the following package managers:

apt (Ubuntu)
pip (Python)
cpan (Perl)
cran (R)
gem (Ruby)

The Python module cnvkit can be installed via pip, but the software also requires an R module called DNAcopy that must be installed using Bioconductor, which must first be installed using cran. This means you'll have to manually install the DNAcopy module when the app starts.

To add these runtime dependencies, use a text editor to update the runSpec and add the following execDepends section that will install the Python cnvkit and R BiocManager modules before the app is executed:

"runSpec": {
    "interpreter": "bash",
    "file": "src/cnvkit_bash.sh",
    "distribution": "Ubuntu",
    "release": "20.04",
    "version": "0",
    "execDepends": [
      {
        "name": "cnvkit",
        "package_manager": "pip"
      },
      {
        "name": "BiocManager",
        "package_manager": "cran"
      }
    ],
    "timeoutPolicy": {
      "*": {
        "hours": 48
      }
    }
}

Specify File Patterns for Inputs

In the inputSpec, change the patterns to match the expected file extensions:

bam_tumor: *.bam
reference: *.cnn

Your dxapp.json should now look like the following:

{
  "name": "cnvkit_bash",
  "title": "cnvkit_bash",
  "summary": "cnvkit_bash",
  "dxapi": "1.0.0",
  "version": "0.0.1",
  "inputSpec": [
    {
      "name": "bam_tumor",
      "label": "BAM Tumor Files",
      "class": "array:file",
      "optional": false,
      "patterns": [
        "*.bam"
      ],
      "help": ""
    },
    {
      "name": "reference",
      "label": "Reference",
      "class": "file",
      "optional": false,
      "patterns": [
        "*.cnn"
      ],
      "help": ""
    }
  ],
  "outputSpec": [
    {
      "name": "cns",
      "label": "CNS",
      "class": "array:file",
      "patterns": [
        "*"
      ],
      "help": ""
    },
    {
      "name": "cns_filtered",
      "label": "CNS Filtered",
      "class": "array:file",
      "patterns": [
        "*"
      ],
      "help": ""
    },
    {
      "name": "plot",
      "label": "Plot",
      "class": "array:file",
      "patterns": [
        "*"
      ],
      "help": ""
    }
  ],
  "runSpec": {
    "timeoutPolicy": {
      "*": {
        "hours": 48
      }
    },
    "execDepends": [
      {
        "name": "cnvkit",
        "package_manager": "pip"
      },
      {
        "name": "BiocManager",
        "package_manager": "cran"
      }
    ],
    "interpreter": "bash",
    "file": "src/cnvkit_bash.sh",
    "distribution": "Ubuntu",
    "release": "20.04",
    "version": "0"
  },
  "access": {
    "network": [
      "*"
    ]
  },
  "regionalOptions": {
    "aws:us-east-1": {
      "systemRequirements": {
        "*": {
          "instanceType": "mem1_ssd1_v2_x4"
        }
      }
    }
  }
}
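
Hand-editing dxapp.json makes it easy to drop a quote or a comma, and a file that is not valid JSON cannot be built. The following is a minimal sketch of a syntax check to run before building. It assumes python3 is available on your local machine, and the function name validate_json is hypothetical (it is not part of the dx toolkit):

```shell
#!/usr/bin/env bash

# validate_json is a hypothetical helper, not part of the dx toolkit.
# It returns 0 when the file parses as JSON and non-zero otherwise.
# Assumes python3 is on the PATH.
validate_json() {
    python3 -m json.tool "$1" > /dev/null 2>&1
}

# Example use from the parent of the app directory
if [ -f cnvkit_bash/dxapp.json ]; then
    if validate_json cnvkit_bash/dxapp.json; then
        echo "dxapp.json parses as JSON"
    else
        echo "dxapp.json has a syntax error" >&2
    fi
fi
```

This only checks JSON syntax. It does not check that the field names or values are meaningful to the platform.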

Edit the bash Code

The default bash code generated by the wizard starts with a generous header of comments that you may or may not wish to keep. The default code prints the values of the input variables, then downloads the input files individually. The app code belongs in the middle, after downloading the inputs and before uploading the outputs:

main() {

    echo "Value of bam_tumor: '${bam_tumor[@]}'"
    echo "Value of reference: '$reference'"

    # The following line(s) use the dx command-line tool to download your file
    # inputs to the local file system using variable names for the filenames. To
    # recover the original filenames, you can use the output of "dx describe
    # "$variable" --name".

    dx download "$reference" -o reference
    for i in ${!bam_tumor[@]}
    do
        dx download "${bam_tumor[$i]}" -o bam_tumor-$i
    done

    >>>>> Here is where the app code belongs <<<<<

    # The following line(s) use the dx command-line tool to upload your file
    # outputs after you have created them on the local file system.  It assumes
    # that you have used the output field name for the filename for each output,
    # but you can change that behavior to suit your needs.  Run "dx upload -h"
    # to see more options to set metadata.

    cns=$(dx upload cns --brief)
    cns_filtered=$(dx upload cns_filtered --brief)
    plot=$(dx upload plot --brief)

    # The following line(s) use the utility dx-jobutil-add-output to format and
    # add output variables to your job's output as appropriate for the output
    # class.  Run "dx-jobutil-add-output -h" for more information on what it
    # does.

    dx-jobutil-add-output cns "$cns" --class=file
    dx-jobutil-add-output cns_filtered "$cns_filtered" --class=file
    dx-jobutil-add-output plot "$plot" --class=file
}

Replace the contents of src/cnvkit_bash.sh with the following code:

#!/bin/bash

# Set pragmas to print commands and fail on errors
set -exuo pipefail

# Install required R module
Rscript -e "BiocManager::install('DNAcopy')"

# Verify the value of inputs
echo "Value of bam_tumor: '${bam_tumor[@]}'"
echo "Value of reference: '$reference'"

# Place all inputs into the "in" directory
dx-download-all-inputs --parallel

# Use "_path" versions of inputs for file paths; quote them so that
# file names containing spaces are passed through intact
cnvkit.py batch \
    "${bam_tumor_path[@]}" \
    -r "${reference_path}" \
    -p "$(( $(nproc) - 1 ))" \
    -d cnvkit-out/ \
    --scatter

# Make out directories for each output spec
mkdir -p ~/out/cns/ ~/out/cns_filtered/ ~/out/plot/

# Move CNVkit outputs to the "out" directory for upload
mv cnvkit-out/*.call.cns    ~/out/cns_filtered/
mv cnvkit-out/*.cns         ~/out/cns/
mv cnvkit-out/*-scatter.png ~/out/plot/

# Upload and annotate all output files
dx-upload-all-outputs --parallel

Rather than downloading the inputs individually as in the original template, this version downloads all inputs in parallel with the following command:

dx-download-all-inputs --parallel

This will create an in directory with subdirectories named according to the input names. Note that the bam_tumor input is an array of files, so its directory will contain numbered subdirectories, starting at 0, one for each input file:

in/bam_tumor/0/...
in/bam_tumor/1/...
in/reference/...
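
To make the layout concrete, here is a small sketch that mocks up the same directory tree locally and collects the BAM paths in index order. It does not call any dx tools. The file names sample_a.bam, sample_b.bam, and reference.cnn are hypothetical, and the bam_paths array only imitates the role of the bam_tumor_path variable used in the script:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Mock up the layout locally; on a worker, dx-download-all-inputs creates it.
# The file names below are hypothetical.
workdir=$(mktemp -d)
mkdir -p "$workdir/in/bam_tumor/0" "$workdir/in/bam_tumor/1" "$workdir/in/reference"
touch "$workdir/in/bam_tumor/0/sample_a.bam" \
      "$workdir/in/bam_tumor/1/sample_b.bam" \
      "$workdir/in/reference/reference.cnn"

# Collect the BAM paths, one numbered subdirectory at a time
bam_paths=()
for dir in "$workdir"/in/bam_tumor/*/; do
    for f in "$dir"*; do
        # Strip the temporary prefix so the paths start at "in/"
        bam_paths+=("${f#"$workdir"/}")
    done
done

printf '%s\n' "${bam_paths[@]}"
```

Each array element gets its own numbered subdirectory, so two input files with the same name cannot overwrite each other.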

Similarly, the script uses dx-upload-all-outputs, which expects an out directory with subdirectories named after each output in the outputSpec. Every file placed in one of these subdirectories is uploaded and attached to the matching output.
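
The order of the mv commands in the script matters, because the pattern *.cns also matches files ending in .call.cns. The sketch below reproduces the staging step on mock files so you can try it without the dx toolkit. The sample name sample_a is hypothetical, and the out directory is created under a temporary directory rather than the home directory:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Mock CNVkit results with a hypothetical sample name; no dx tools are needed
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p cnvkit-out out/cns out/cns_filtered out/plot
touch cnvkit-out/sample_a.cns \
      cnvkit-out/sample_a.call.cns \
      cnvkit-out/sample_a-scatter.png

# Move the more specific pattern first: *.cns also matches *.call.cns
mv cnvkit-out/*.call.cns    out/cns_filtered/
mv cnvkit-out/*.cns         out/cns/
mv cnvkit-out/*-scatter.png out/plot/

# Show where each file ended up
find out -type f
```

If the first two mv lines were swapped, the .call.cns files would land in out/cns and the second mv would find nothing to match, which makes the script exit under set -e.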

Build the Applet

Use dx pwd to ensure you are in the correct project and dx select to change projects, if necessary. If you are inside the bash source directory where the dxapp.json file exists, you can run dx build -f. If you are in the parent directory, run dx build -f cnvkit_bash. Here is a sample output from successfully building the applet:

$ dx build -f
{"id": "applet-GFyV3kj0VGFkV8k04f3K11QY"}

The -f|--overwrite flag indicates you wish to overwrite any previous version of the applet. You may also want to use the -a|--archive flag to move any previous versions to an archived location. You won't need either of these flags the first time you compile, but subsequent builds will require that you indicate how to handle previous versions of the applet. Run dx build --help to learn more about build options.

Run the bash applet

Download this BAM file (BAM.zip, 15MB) and add it to the inputs directory.
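
Because bam_tumor is an array:file input, each file is passed to dx run with its own -ibam_tumor argument. The following sketch assembles such a command line and prints it rather than executing it, so it runs without the dx toolkit. The helper name build_run_command and the file names tumor1.bam, tumor2.bam, and reference.cnn are hypothetical; substitute the names or IDs of files you have uploaded to your project:

```shell
#!/usr/bin/env bash
set -euo pipefail

# build_run_command is a hypothetical helper that only prints a command line.
# Usage: build_run_command APPLET REFERENCE BAM [BAM...]
build_run_command() {
    local applet=$1 reference=$2
    shift 2
    local cmd=(dx run "$applet" "-ireference=$reference")
    local bam
    for bam in "$@"; do
        # Repeating -ibam_tumor adds one element to the array:file input
        cmd+=("-ibam_tumor=$bam")
    done
    # -y skips the confirmation prompt; --watch streams the job log
    cmd+=(-y --watch)
    echo "${cmd[*]}"
}

build_run_command cnvkit_bash reference.cnn tumor1.bam tumor2.bam
```

Copy the printed command into a terminal where you are logged in to the platform to launch the job.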

Build an Asset

You'll notice the applet takes quite a while to run (around 14 minutes for me) because of the module installations. You can build an asset for these installations and use this in dxapp.json. Create a directory called cnvkit_asset with the following file dxasset.json:

{
    "name": "cnvkit_asset",
    "title": "cnvkit_asset",
    "description": "cnvkit_asset",
    "version": "0.0.1",
    "distribution": "Ubuntu",
    "release": "20.04",
    "execDepends": [
        {
          "name": "cnvkit",
          "package_manager": "pip"
        },
        {
          "name": "BiocManager",
          "package_manager": "cran"
        }
    ]
}

Also create a Makefile with the following contents:

SHELL=/bin/bash -exuo pipefail
# The recipe line below must begin with a tab character, not spaces
all:
	sudo Rscript -e "BiocManager::install('DNAcopy')"

Run dx build_asset to create the asset. This will launch a job that will report the asset ID at the end:

Asset bundle 'record-GFyVY000X1ZK3yGg4qv32GXv' is built and can now be used
in your app/applet's dxapp.json

Update the runSpec in dxapp.json to the following:

  "runSpec": {
    "timeoutPolicy": {
      "*": {
        "hours": 48
      }
    },
    "assetDepends": [{"id": "record-GFyVY000X1ZK3yGg4qv32GXv"}],
    "interpreter": "bash",
    "file": "src/cnvkit_bash.sh",
    "distribution": "Ubuntu",
    "release": "20.04",
    "version": "0"
  },

Use dx build -f and note the new applet's ID. Create a JSON input as follows:

$ cat inputs.json
{
    "bam_tumor": [
        {
            "$dnanexus_link": "file-GFxXjV006kZVQPb20G85VXBp"
        }
    ],
    "reference": {
        "$dnanexus_link": "file-GFxXvpj06kZfP0QVKq2p2FGF"
    }
}
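
If you have several BAM files, writing this JSON by hand becomes tedious. As a rough sketch, a small shell function can print the same structure from a list of file IDs. The function name make_inputs_json is hypothetical, and it assumes the IDs contain no characters that need JSON escaping, which holds for DNAnexus file IDs of the form shown here:

```shell
#!/usr/bin/env bash
set -euo pipefail

# make_inputs_json is a hypothetical helper. It prints input JSON from a
# reference file ID followed by one or more BAM file IDs.
make_inputs_json() {
    local reference=$1
    shift
    local links=() id
    for id in "$@"; do
        links+=("{\"\$dnanexus_link\": \"$id\"}")
    done
    # Join the array elements with commas
    local IFS=,
    printf '{"bam_tumor": [%s], "reference": {"$dnanexus_link": "%s"}}\n' \
        "${links[*]}" "$reference"
}

make_inputs_json file-GFxXvpj06kZfP0QVKq2p2FGF file-GFxXjV006kZVQPb20G85VXBp
```

Redirect the output to inputs.json to use it with dx run -f.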

Launch the new app from the CLI with the following command:

$ dx run applet-GFyVppQ0VGFxvvx44j43YyPz -f inputs.json -y

Use dx watch with the new job ID to see how the run now uses the asset to run faster. I see about a 10-minute difference with the asset.

Review

You learned more ways to include app dependencies using package managers and a Makefile as well as by building an asset. The first strategy happens at runtime while the latter builds all the dependencies before the applet is run, making the runtime much faster.

Resources

Full Documentation

To create a support ticket if there are technical issues:

Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
