Example 3: samtools

Building a Native Applet with Bash

Using dx-app-wizard to Create An Applet

In this applet, I'll show how to count the number of reads in a SAM or BAM file using samtools. The SAM format (Sequence Alignment Map) is a tab-delimited text description for sequence alignments, and the BAM format is the same data stored in a binary format for better compression. As the SAM format uses a line break to delineate each record, counting the alignments could be as simple as using wc -l (after excluding any header lines, which begin with @); however, the BAM format requires a program like samtools to read the input file, so I'll show how to install this into the applet's execution environment.
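To make the line-counting idea concrete, here is a minimal sketch that counts the alignment records in a plain-text SAM file. It assumes header lines (those starting with @) should be skipped, and the two-read file is made up purely for illustration. It does not work on BAM input, which is why the applet uses samtools.

```bash
#!/bin/bash

# Count alignment records in a plain-text SAM file.
# SAM header lines start with "@", so they are excluded from the count.
count_sam_records() {
    grep -vc '^@' "$1"
}

# A tiny, made-up SAM file (two header lines, two alignments).
sam_file=$(mktemp)
{
    printf '@HD\tVN:1.6\tSO:coordinate\n'
    printf '@SQ\tSN:chr1\tLN:1000\n'
    printf 'read1\t0\tchr1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
    printf 'read2\t0\tchr1\t5\t60\t4M\t*\t0\t0\tTTGA\tIIII\n'
} > "$sam_file"

count_sam_records "$sam_file"   # prints 2
```

A plain wc -l on the same file would report 4, because it also counts the two header lines.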

A minimal native applet requires just two files that exist in a directory with the same name as the applet:

dxapp.json: a JSON-formatted metadata file
a bash or Python program to execute

I'll use dx-app-wizard to create a skeleton applet structure with these files:

$ dx-app-wizard
DNAnexus App Wizard, API v1.0.0

Basic Metadata

Please enter basic metadata fields that will be used to describe your app.
Optional fields are denoted by options with square brackets.  At the end of
this wizard, the files necessary for building your app will be generated from
the answers you provide.

First, I must give my applet a name. The prompt shows that I must use only letters, numbers, a dot, underscore, and a dash. As stated earlier, this applet name will also be the name of the directory, and I'll use samtools_count:

The name of your app must be unique on the DNAnexus platform.  After
creating your app for the first time, you will be able to publish new versions
using the same app name.  App names are restricted to alphanumeric characters
(a-z, A-Z, 0-9), and the characters ".", "_", and "-".
App Name: samtools_count
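The naming rule in the prompt can be checked before running the wizard. The following is a small sketch; the function name is_valid_app_name is my own and is not part of dx-toolkit.

```bash
#!/bin/bash

# Return success if the name uses only the characters the wizard allows:
# a-z, A-Z, 0-9, ".", "_", and "-".
is_valid_app_name() {
    [[ $1 =~ ^[A-Za-z0-9._-]+$ ]]
}

for name in samtools_count "samtools count" "samtools/count"; do
    if is_valid_app_name "$name"; then
        echo "ok:  $name"
    else
        echo "bad: $name"
    fi
done
```

Only samtools_count passes; the names containing a space or a slash are rejected.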

Next, I'm asked for the title. Note that the prompt includes empty square brackets ([]), which contain the default value if I press Enter. As title is not required, it contains the empty string, but I will provide an informative title:

The title, if provided, is what is shown as the name of your app on
the website.  It can be any valid UTF-8 string.
Title []: Samtools Count

Likewise, the summary field is not required:

The summary of your app is a short phrase or one-line description of
what your app does.  It can be any UTF-8 human-readable string.
Summary []: Count SAM/BAM alignments

The version is also optional, and I will press Enter to take the default:

You can publish multiple versions of your app, and the version of your
app is a string with which to tag a particular version.  We encourage the use
of Semantic Versioning for labeling your apps (see http://semver.org/ for more
details).
Version [0.0.1]:

Input Specification

This applet requires a single input, as shown in Table 1.

Input Name   Label      Type   Optional   Default Value
bam          BAM File   file   No         (none)

When prompted for the first input, I'll enter the following:

Input Specification

You will now be prompted for each input parameter to your app.  Each parameter
should have a unique name that uses only the underscore "_" and alphanumeric
characters, and does not start with a number.

1st input name (<ENTER> to finish): bam 
Label (optional human-readable name) []: BAM File 
Your input parameter must be of one of the following classes: 
applet         array:file     array:record   file           int
array:applet   array:float    array:string   float          record
array:boolean  array:int      boolean        hash           string

Choose a class (<TAB> twice for choices): file
This is an optional parameter [y/n]: n

The name of the input will be used as a variable in the bash code, so I will use only letters, numbers, and underscores as in bam or bam_file.
The label is optional, as noted by the empty square brackets.
The types include primitives like integers, floating-point numbers, and strings, as well as arrays of primitive types.
This is a required input. If an input is optional, I can also provide a default value.
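Because each input name becomes a bash variable, it helps to see why characters such as a dash cause trouble. This sketch uses a hypothetical helper, is_valid_param_name, that mirrors the rule stated in the prompt; it is an illustration, not the wizard's own check.

```bash
#!/bin/bash

# Return success if the name is usable as a bash variable name:
# letters, digits, and underscores, not starting with a digit.
is_valid_param_name() {
    [[ $1 =~ ^[A-Za-z_][A-Za-z0-9_]*$ ]]
}

is_valid_param_name bam_file && echo "bam_file is fine"
is_valid_param_name bam-file || echo "bam-file is rejected"

# Why the dash matters: bash stops reading the variable name at "-".
bam="NA12878.bam"
echo "$bam-file"   # prints NA12878.bam-file
```

In the last line, bash expands $bam and then appends the literal text -file, so a variable named bam-file could never be referenced this way.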

When prompted for the second input, press Enter:

2nd input name (<ENTER> to finish):

Output Specification

As shown in Table 2, the applet will produce a single output file containing the number of alignments:

Output Name   Label         Type
counts        Counts File   file

When prompted for the first output name, I enter the following:

Output Specification

You will now be prompted for each output parameter of your app.  Each
parameter should have a unique name that uses only the underscore "_" and
alphanumeric characters, and does not start with a number.

1st output name (<ENTER> to finish): counts 
Label (optional human-readable name) []: Counts File 
Choose a class (<TAB> twice for choices): file

This name will also become a bash variable, so best practice is to use letters, numbers, and underscores.
The label is optional.
The class must be from the preceding list. To be reminded of the choices, press the Tab key twice.

When prompted for the second output, press Enter:

2nd output name (<ENTER> to finish):

Additional Settings

Here are the final settings I'll use to complete the wizard:

Name                       Value
Timeout Policy             48h
Programming language       bash
Access to internet         No (default)
Access to parent project   No (default)
Instance Type              mem1_ssd1_v2_x4 (default)

Applets are required to set a maximum time for running to prevent a job from running an excessively long time. While some applets may legitimately need days to run, most probably need something in the range of 12-48 hours. As noted in the prompt, I can use m, h, or d to specify minutes, hours, or days, respectively:

Timeout Policy

Set a timeout policy for your app. Any single entry point of the app
that runs longer than the specified timeout will fail with a TimeoutExceeded
error. Enter an int greater than 0 with a single-letter suffix (m=minutes,
h=hours, d=days) (e.g. "48h").
Timeout policy [48h]:
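To compare timeouts written with different suffixes, it can help to normalize them to minutes. The helper below, timeout_to_minutes, is a hypothetical name of my own; it only follows the format described in the prompt (a positive integer followed by m, h, or d).

```bash
#!/bin/bash

# Convert a timeout such as "48h" into minutes.
# Accepts a positive integer followed by m, h, or d.
timeout_to_minutes() {
    local spec=$1
    if [[ ! $spec =~ ^([1-9][0-9]*)([mhd])$ ]]; then
        echo "invalid timeout: '$spec'" >&2
        return 1
    fi
    local value=${BASH_REMATCH[1]}
    local unit=${BASH_REMATCH[2]}
    case $unit in
        m) echo "$value" ;;
        h) echo $(( value * 60 )) ;;
        d) echo $(( value * 60 * 24 )) ;;
    esac
}

timeout_to_minutes 48h   # prints 2880
timeout_to_minutes 2d    # prints 2880
```

Both calls print the same number, showing that 48h and 2d describe the same limit.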

For the template language, I must select from bash or Python for the program that is executed when the applet starts. The applet code can execute any program available in the execution environment, including custom programs written in any language. I will choose bash:

Template Options

You can write your app in any programming language, but we provide
templates for the following supported languages: Python, bash
Programming language: bash

Next, I determine if the applet has access to the internet and/or the parent project. Unless the applet specifically needs access, such as to download a file at runtime, it's best to answer no:

Access Permissions
If you request these extra permissions for your app, users will see this fact
when launching your app, and certain other restrictions will apply. For more
information, see
https://documentation.dnanexus.com/developer/apps/app-permissions.

Access to the Internet (other than accessing the DNAnexus API).
Will this app need access to the Internet? [y/N]: n

Direct access to the parent project. This is not needed if your app
specifies outputs,     which will be copied into the project after it's done
running.
Will this app need access to the parent project? [y/N]: n

Lastly, I must specify a default instance type. The prompt includes an abbreviated list of instance types. The final number indicates the number of cores, e.g., _x4 indicates 4 cores. The greater the number of cores, the more available memory and disk space. In this case, a small 4-core instance is sufficient:

Default instance type: The instance type you select here will apply to
all entry points in your app unless you override it. See https://documenta
tion.dnanexus.com/developer/api/running-analyses/instance-types for more
information.
Choose an instance type for your app [mem1_ssd1_v2_x4]:

The user is always free to override the instance type using the --instance-type option to dx run.
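As a small illustration of the naming convention, the core count can be read from the end of an instance type name with plain bash parameter expansion. The function name cores_from_instance_type is hypothetical, and the sketch assumes the name ends in _x followed by the number of cores, as described above.

```bash
#!/bin/bash

# Extract the core count from an instance type name such as
# mem1_ssd1_v2_x4 by removing everything up to the final "_x".
cores_from_instance_type() {
    local instance_type=$1
    echo "${instance_type##*_x}"
}

cores_from_instance_type mem1_ssd1_v2_x4    # prints 4
cores_from_instance_type mem1_ssd1_v2_x16   # prints 16
```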

The final output from dx-app-wizard is a summary of the files that are created:

*** Generating DNAnexus App Template... ***

Your app specification has been written to the dxapp.json file. You can
specify more app options by editing this file directly (see
https://documentation.dnanexus.com/developer for complete documentation).

Created files:
     samtools_count/Readme.developer.md # 1
     samtools_count/Readme.md  # 2
     samtools_count/dxapp.json  # 3
     samtools_count/resources/  # 4
     samtools_count/src/  # 5
     samtools_count/src/samtools_count.sh # 6
     samtools_count/test/  # 7

App directory created!  See https://documentation.dnanexus.com/developer for
tutorials on how to modify these files, or run "dx build samtools_count" or
"dx build --create-app samtools_count" while logged in with dx.

Running the DNAnexus build utility will create an executable on the DNAnexus
platform.  Any files found in the resources directory will be uploaded
so that they will be present in the root directory when the executable is run.

1. This file should contain applet implementation details.
2. This file should contain user help.
3. The answers from dx-app-wizard are used to create the app metadata.
4. The resources directory is for any additional files you want available on the runtime instance.
5. The src (pronounced "source") directory is a conventional place for source code, but it's not a requirement that code lives in this directory.
6. This is the bash script that will be executed when the applet is run.
7. The test directory is empty and will not be discussed in this section.

The contents of the resources directory will be placed into the root directory of the runtime instance. For example, if you create a file resources/my_tool, then it will be available on the runtime instance as /my_tool. You would either need to reference the full path (/my_tool) or expand the $PATH variable to include /. Best practice is to create the directory structure resources/usr/local/bin/ so that the file will be at /usr/local/bin/my_tool, because /usr/local/bin is normally part of $PATH.
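The recommended layout can be sketched as follows. The script works in a temporary directory so it is safe to run anywhere; my_tool is the placeholder name from the paragraph above, and runtime_path is a hypothetical helper that only illustrates how a path under resources/ maps to the runtime instance.

```bash
#!/bin/bash

# Create the applet's resources tree in a scratch location.
app_dir=$(mktemp -d)/samtools_count
mkdir -p "$app_dir/resources/usr/local/bin"

# A stand-in tool; any executable could be placed here.
cat > "$app_dir/resources/usr/local/bin/my_tool" <<'EOF'
#!/bin/bash
echo "hello from my_tool"
EOF
chmod +x "$app_dir/resources/usr/local/bin/my_tool"

# Map a path under resources/ to where it appears at runtime.
runtime_path() {
    echo "/${1#resources/}"
}

runtime_path resources/usr/local/bin/my_tool   # prints /usr/local/bin/my_tool
```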

Reading dxapp.json

Let's look at the dxapp.json that was generated by dx-app-wizard. Note that this is a simple text file that you can edit at any time:

{
    "name": "samtools_count",
    "title": "Samtools Count",
    "summary": "Count SAM/BAM alignments",
    "dxapi": "1.0.0",
    "version": "0.0.1",

The inputSpec has a section for patterns where I will add one or more Unix file globs to indicate the acceptable file suffixes:

    "inputSpec": [
        {
            "name": "bam",
            "label": "BAM File",
            "class": "file",
            "optional": false,
            "patterns": [
                "*.bam"
            ],
            "help": ""
        }
    ],
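To see how a glob such as *.bam relates to filenames, here is a small bash sketch. The function matches_any_pattern is my own illustration of glob matching in the shell; it is not how the platform itself evaluates patterns.

```bash
#!/bin/bash

# Return success if the filename matches at least one glob pattern.
matches_any_pattern() {
    local name=$1
    shift
    local pattern
    for pattern in "$@"; do
        # The unquoted right-hand side is treated as a glob by [[ ]].
        [[ $name == $pattern ]] && return 0
    done
    return 1
}

matches_any_pattern NA12878.bam '*.bam' && echo "NA12878.bam matches"
matches_any_pattern NA12878.txt '*.bam' || echo "NA12878.txt does not match"
```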

The outputSpec needs no update:

    "outputSpec": [
        {
            "name": "counts",
            "label": "Counts File",
            "class": "file",
            "patterns": [
                "*"
            ],
            "help": ""
        }
    ],

The runSpec contains the timeout along with the indication to use bash to run src/samtools_count.sh. If you ever wanted to change the name or location of the run script, update this section:

    "runSpec": {
        "timeoutPolicy": {
            "*": {
                "hours": 48
            }
        },
        "interpreter": "bash",
        "file": "src/samtools_count.sh",
        "distribution": "Ubuntu",
        "release": "20.04",
        "version": "0"
    },

Finally, the regionalOptions section indicates the default runtime instance:

    "regionalOptions": {
        "aws:us-east-1": {
            "systemRequirements": {
                "*": {
                    "instanceType": "mem1_ssd1_v2_x4"
                }
            }
        }
    }
}

Installing Applet Dependencies

In the preceding runSpec, note that the applet will run on Ubuntu 20.04. This instance will include dx-toolkit and several programming languages including bash, Python 3.x, Perl 5.x, and R 3.x. Anything else needed by the applet must be installed. Edit the runSpec to include the following execDepends to install samtools at runtime using the apt package manager:

{
    ...
    "runSpec": {
        "execDepends": [
            {
                "name": "samtools",
                "package_manager": "apt"
            }
        ],
        ...
    }
}

The package_manager may be one of the following:

apt (Ubuntu)
pip (Python)
gem (Ruby)
cpan (Perl)
cran (R)

Some caveats:

This runs apt install on every execution, which is fine for fast installs. Some packages may take 5-15 minutes to install, in which case you will pay for those extra minutes on every run.
This installs whatever version the package manager currently provides, which may be old. For instance, apt installs samtools v1.10 as of this writing, while the current version is v1.17.
Your applet could break if the package manager later updates to a newer version of the program that behaves differently.
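Given these caveats, one defensive habit is to confirm at the start of the script that a dependency was actually installed. The following is a sketch only; require_tool is a hypothetical helper name, and it is demonstrated here with bash itself so that it runs anywhere.

```bash
#!/bin/bash

# Fail early with a clear message if a required program is missing.
require_tool() {
    local tool=$1
    if ! command -v "$tool" > /dev/null 2>&1; then
        echo "ERROR: '$tool' not found on PATH" >&2
        return 1
    fi
    echo "Found $tool at $(command -v "$tool")"
}

# Demonstration with a program that is certainly present.
require_tool bash

# In the applet, the equivalent check would be:
#   require_tool samtools || exit 1
```

Logging the resolved path (and, if desired, the tool's version) in the job log makes it easier to diagnose a run that broke after a package update.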

Building An Asset

An alternative is to build an asset that the applet uses. Assets have many advantages, including:

Build asset once
Runtime installs are quick decompression of tarballs
Assets are static and cannot break your code

Create a new folder with the name of your asset.

Then, create the file dxasset.json in the folder with the following contents:

{
    "name": "samtools",
    "title": "samtools asset",
    "description": "samtools asset",
    "version": "1.10",
    "distribution": "Ubuntu",
    "release": "20.04",
    "execDepends": [
        {
          "name": "samtools",
          "package_manager": "apt"
        }
    ]
}

When I execute dx build_asset in the folder, a new job will run to build the asset:

$ dx build_asset
...
* samtools (create_asset_focal:main) (done) job-GXjx8yj071x69xBVz90Zypx1
  kyclark 2023-07-14 16:04:27 (runtime 0:02:05)
  Output: asset_bundle = record-GXjx9V008bgjZqj82f5ybf16

Asset bundle 'record-GXjx9V008bgjZqj82f5ybf16' is built and can now be used
in your app/applet's dxapp.json
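If you script your builds, the record ID can be pulled out of the build log with grep. This sketch assumes the log contains a record ID in the form shown above; asset_id_from_log is a hypothetical helper, and the sample line is copied from the preceding output.

```bash
#!/bin/bash

# Read log text on stdin and print the last record ID found in it.
asset_id_from_log() {
    grep -oE 'record-[A-Za-z0-9]+' | tail -n 1
}

log_line="Asset bundle 'record-GXjx9V008bgjZqj82f5ybf16' is built and can now be used"
asset_id=$(printf '%s\n' "$log_line" | asset_id_from_log)
echo "$asset_id"   # prints record-GXjx9V008bgjZqj82f5ybf16
```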

As noted, the record ID of the asset can now be used in an assetDepends section, which should replace the execDepends:

{
    ...
    "runSpec": {
        "assetDepends": [
            { "id": "record-GXjx9V008bgjZqj82f5ybf16" }
        ],
        ...
    }
}

Execute dx build_asset inside this directory to build the asset into the selected project. (You can also use the --destination option to specify where to place the asset file, which will be a tarball.)

The build process will create a new job to build the asset.

Writing Applet Code

The default src/samtools_count.sh contains many lines of comments to guide you in writing your application code. Update the file to the following:

#!/bin/bash 

main() { 
    echo "Value of bam: '$bam'" 

    dx download "$bam" -o input.bam 

    samtools view -c input.bam > counts.txt 

    counts_id=$(dx upload counts.txt --brief) 

    dx-jobutil-add-output counts "$counts_id" --class=file 
}

This is the colloquially named "shebang" line that indicates this is a bash script.
Although it's not a requirement that app code be contained in a main() function, it is best practice.
The original template uses echo to show you the runtime value of the inputs.
Download the input file.
Execute samtools to count the alignments in the input file.
Upload the results file and save the new file ID.
Add the new file ID to the job's output.

Remember that the $bam variable matches the name of the input in dxapp.json. If you ever wish to change this, be sure to update both the script and the JSON.

Building the Applet

Run dx build to create the applet on the DNAnexus platform.

$ dx build
{"id": "applet-GXqG4Z8071x9p1FZ81K5BjGQ"}

If you have previously built the applet, you will be prompted to use either the -f|--overwrite or the -a|--archive flag:

$ dx build
Error: ('An applet already exists at /samtools_count (id
applet-GXqG4Z8071x9p1FZ81K5BjGQ) and neither -f/--overwrite
nor -a/--archive were given.',)

Out of habit, I always use -f to force the build:

$ dx build -f
INFO:dxpy:Deleting applet(s) applet-GXqG4Z8071x9p1FZ81K5BjGQ
{"id": "applet-GXqG5P0071xF2j1F03qv7Zz6"}

Without the -d|--destination option, the applet will be placed into the root directory of the project. I like to make an apps folder to hold my applets:

$ dx mkdir apps
$ dx build -d /apps/ -f
{"id": "applet-GXqG7bQ071xKQq3JkbVjGbGv"}

TIP: Best practice is to create folders for applets, resources, assets, etc.

Executing the Applet

Understanding the Code

I'd like to discuss this code a little more. In bash, the echo command will print to the console. As in any language, this is a great way to see what's happening when your code is running. In the following line, the $bam variable will only have a value at runtime, so you will not be able to run this script locally:

echo "Value of bam: '$bam'"

When I execute this code, I see output like the following:

2023-07-17 12:42:23 Samtools Count STDOUT Value of bam:
'{"$dnanexus_link": "file-FpQKQk00FgkGV3Vb3jJ8xqGV"}'

That means that the following line:

dx download "$bam" -o input.bam

will execute the following command at runtime:

dx download '{"$dnanexus_link": "file-FpQKQk00FgkGV3Vb3jJ8xqGV"}' -o input.bam
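Since the value of $bam is a small JSON document rather than a bare file ID, it is sometimes handy to extract the ID for logging. The following sketch uses a bash regular expression; file_id_from_link is a hypothetical helper, and the sample value is the one shown in the log above.

```bash
#!/bin/bash

# Print the file ID embedded in a DNAnexus link string, if any.
file_id_from_link() {
    if [[ $1 =~ (file-[A-Za-z0-9]+) ]]; then
        echo "${BASH_REMATCH[1]}"
    else
        return 1
    fi
}

bam='{"$dnanexus_link": "file-FpQKQk00FgkGV3Vb3jJ8xqGV"}'
file_id_from_link "$bam"   # prints file-FpQKQk00FgkGV3Vb3jJ8xqGV
```

The single quotes around the sample value matter: they keep bash from trying to expand $dnanexus_link as a variable.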

Take a look at the usage for dx download to remind yourself that the -o option here is directing that the output file name be input.bam:

-o OUTPUT, --output OUTPUT Local filename or directory to be used
                           ("-" indicates stdout output); if not supplied or
                           a directory is given, the object's name on the
                           platform will be used, along with any applicable
                           extensions

The next line of code executes samtools view with the -c option. Execute samtools view -h to read the documentation:

-c, --count                Print only the count of matching records

I often use a cloud workstation to work through app building. It's the same execution environment (Ubuntu Linux), so I will install any programs I need there, download sample input files, run commands and verify the behavior and output of the tools, etc.

If I download the input file NA12878.bam (file-FpQKQk00FgkGV3Vb3jJ8xqGV), I can run the following command to see that there are 60,777 alignments:

$ samtools view -c NA12878.bam
60777

I can use Unix output redirection with > to place the output into the file counts.txt and cat to verify the output:

$ samtools view -c NA12878.bam > counts.txt
$ cat counts.txt
60777

Therefore, the following line of code from the bash script places the count of the input BAM file into counts.txt:

samtools view -c input.bam > counts.txt

Next, I upload the counts.txt file to the platform using the --brief option that will only show the new file ID:

$ dx upload counts.txt --brief
file-GXpvky0071x6jg2ZVV3fJ5xp

In bash, I can use either backticks (``) or $() to capture the results from a command, so the following line captures the file ID into the variable counts_id:

$ counts_id=$(dx upload counts.txt --brief)
$ echo $counts_id
file-GXqFf60071x6p2fbKYzVv9pp
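The two capture styles can be compared without touching the platform. In this sketch, printf stands in for dx upload, and file-abc123 is a made-up placeholder ID.

```bash
#!/bin/bash

# Both forms capture a command's standard output.
id_backticks=`printf 'file-%s' abc123`
id_dollar=$(printf 'file-%s' abc123)
echo "$id_backticks $id_dollar"   # prints file-abc123 file-abc123

# $() nests cleanly, which backticks cannot do without escaping.
parent=$(basename "$(dirname /home/dnanexus/in/bam/NA12878.bam)")
echo "$parent"   # prints bam
```

The nesting behavior is the main reason many bash style guides prefer $() over backticks.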

I add this new file ID as an output from the job using dx-jobutil-add-output:

$ dx-jobutil-add-output -h
usage: dx-jobutil-add-output [-h] [--class [CLASSNAME]] [--array] name value

Reads and modifies job_output.json in your home directory to be a JSON hash
with key *name* and value  *value*.

If --class is not provided or is set to "auto", auto-detection of the
output format will occur.  In particular, it will treat it as a number,
hash, or boolean if it can be successfully parsed as such.  If it is a
string which matches the pattern for a data object ID, it will encapsulate
it in a DNAnexus link hash; otherwise it is treated as a simple string.

Here is the last command of the script that sets the counts output variable defined in the dxapp.json to the new $counts_id value:

dx-jobutil-add-output counts "$counts_id" --class=file
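Based on the help text, the effect of that command is to record a DNAnexus link under the counts key of job_output.json. The sketch below prints JSON of that general shape; link_output_json is a hypothetical function for illustration only, and the real file written by dx-jobutil-add-output may differ in whitespace or detail.

```bash
#!/bin/bash

# Print a JSON hash that maps an output name to a DNAnexus link.
# This only imitates the shape described in the help text.
link_output_json() {
    local name=$1
    local file_id=$2
    printf '{"%s": {"$dnanexus_link": "%s"}}\n' "$name" "$file_id"
}

link_output_json counts file-GXqFf60071x6p2fbKYzVv9pp
```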

Using Input File Helper Variables

In the preceding applet, the output filename is always counts.txt. It would be better for each output file to use the name of the input BAM. Because I defined the bam input, I get four variables at runtime:

bam: the input file ID
bam_path: the default path to the downloaded input file
bam_name: the filename, also the output of basename($bam_path)
bam_prefix: the filename minus any file extension defined in the patterns of the dxapp.json

The default patterns value for a file input in dxapp.json is ["*"]. This matches the entire input filename, causing bam_prefix to be the empty string.

TIP: Always be sure to set patterns to the expected file extensions.
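If there is any chance that the prefix variable is empty, the script can fall back to stripping the last extension from the filename itself. This is a defensive sketch of my own, not something the template generates; output_name_for is a hypothetical helper that takes the filename and the prefix as arguments.

```bash
#!/bin/bash

# Build an output filename from the prefix, falling back to the
# filename with its last extension removed when the prefix is empty.
output_name_for() {
    local name=$1
    local prefix=$2
    echo "${prefix:-${name%.*}}.txt"
}

output_name_for NA12878.bam NA12878   # prints NA12878.txt
output_name_for NA12878.bam ""        # prints NA12878.txt
```

Inside the applet, the same idea could be written as outfile="${bam_prefix:-${bam_name%.*}}.txt".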

Given an input file of NA12878.bam, the following code will create an output file called NA12878.txt:

#!/bin/bash

main() {
    echo "Value of bam       : '$bam'" # 1
    echo "Value of bam_path  : '$bam_path'" 
    echo "Value of bam_name  : '$bam_name'"
    echo "Value of bam_prefix: '$bam_prefix'"

    dx download "$bam" -o "$bam_name"  # 2

    outfile="$bam_prefix.txt"  # 3

    samtools view -c "$bam_name" > "$outfile"  # 4

    counts_id=$(dx upload "$outfile" --brief)  # 5

    dx-jobutil-add-output counts "$counts_id" --class=file # 6
}

1. Print out the additional variables.
2. Download the input file to the filename. The -o option here is superfluous as the default behavior is to download the file to its filename. In the preceding example, I saved it to the filename input.bam.
3. Define the variable outfile to use the root of the input filename.
4. Write the output from samtools to the preferred output filename.
5. Upload the output file.
6. Add the new file ID to the job's output.

When I run this code, I can see the values of the other input file variables:

Value of bam       : '{"$dnanexus_link": "file-FpQKQk00FgkGV3Vb3jJ8xqGV"}'
Value of bam_path  : '/home/dnanexus/in/bam/NA12878.bam'
Value of bam_name  : 'NA12878.bam'
Value of bam_prefix: 'NA12878'

The bam_path value is the default path to write the bam file if I were to use dx-download-all-inputs. In this case, I used dx download with the -o option to write it to a file in the current working directory, so there is no file at that path.

Using dx-download-all-inputs

There are two ways to download the input files: one at a time or all at once. So far, I've shown the first way using dx download. The second way uses dx-download-all-inputs to download all the input files to the directory /home/dnanexus/in. This will contain a directory for each file input, so the bam input file will be placed into /home/dnanexus/in/bam as shown for the $bam_path in the preceding section. If the input is an array:file, there will be additional numbered subdirectories for each of the runtime values.

Following is the usage:

$ dx-download-all-inputs -h
usage: dx-download-all-inputs [-h] [--except EXCLUDE]
  [--parallel] [--sequential]

Note: this is a utility for use by bash apps running in the DNAnexus Platform.

Downloads all files that were supplied as inputs to the app.  By
convention, if an input parameter "FOO" has value

    {"$dnanexus_link": "file-xxxx"}

and filename INPUT.TXT, then the linked file will be downloaded into the
path:

    $HOME/in/FOO/INPUT.TXT

If an input is an array of files, then all files will be placed into
numbered subdirectories under a parent directory named for the input. For
example, if the input key is FOO, and the inputs are {A, B, C}.vcf then,
the directory structure will be:

    $HOME/in/FOO/0/A.vcf
                 1/B.vcf
                 2/C.vcf

Zero padding is used to ensure argument order. For example, if there are 12
input files {A, B, C, D, E, F, G, H, I, J, K, L}.txt, the directory
structure will be:

    $HOME/in/FOO/00/A.vcf
                 ...
                 11/L.vcf

This allows using shell globbing (FOO/*/*.vcf) to get all the files in the
input order.

options:
  -h, --help        show this help message and exit
  --except EXCLUDE  Do not download the input with this name. (May be used
                    multiple times.)
  --parallel        Download the files in parallel
  --sequential      Download the files sequentially

I can change my code to use this:

#!/bin/bash

main() {
    echo "Value of bam       : '$bam'"
    echo "Value of bam_path  : '$bam_path'"
    echo "Value of bam_name  : '$bam_name'"
    echo "Value of bam_prefix: '$bam_prefix'"

    dx-download-all-inputs # 1

    outfile="$bam_prefix.txt" # 2

    samtools view -c "$bam_path" > "$outfile" 

    counts_id=$(dx upload "$outfile" --brief)

    dx-jobutil-add-output counts "$counts_id" --class=file
}

Download the input file to the default location.
Use the $bam_prefix variable (e.g., NA12878) to create the outfile.
Use the $bam_path variable to execute samtools with the path to the in directory.

TIP: Using dx-download-all-inputs --parallel is best practice to download all input files as fast as possible.

Resources

Full Documentation

To create a support ticket if there are technical issues:

Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.

