Example 3: samtools
Last updated
Was this helpful?
Last updated
Was this helpful?
In this applet, I'll show how to count the number of reads in a SAM or BAM file using samtools
. The (Sequence Alignment Map) is a tab-delimited text description for sequence alignments, and the BAM format is the same data but stored in binary format for more compression. As the SAM format uses a line break to delineate each record, counting the alignments could be as simple as using wc -l
; however, the BAM format requires a program like samtools
to read the input file, so I'll show how to install this into the applet's execution environment.
A minimal native applet requires just two files that exist in a directory with the same name as the applet:
dxapp.json: a JSON-formatted
a bash or Python program to execute
I'll use dx-app-wizard
to create a skeleton applet structure with these files:
First, I must give my applet a name. The prompt shows that I must use only letters, numbers, a dot, underscore, and a dash. As stated earlier, this applet name will also be the name of the directory, and I'll use samtools_count:
Next, I'm asked for the title. Note that the prompt includes empty square brackets ([]
), which contain the default value if I press Enter. As title is not required, it contains the empty string, but I will provide an informative title:
Likewise, the summary field is not required:
The version is also optional, and I will press Enter to take the default:
This applet requires a single input, as shows in Table 1.
bam
BAM File
file
No
NA
When prompted for the first input, I'll enter the following:
The name of the input will be used as a variable in the bash code, so I will use only letters, numbers, and underscores as in bam
or bam_file
.
The label is optional, as noted by the empty square brackets.
The types include primitives like integers, floating-point numbers, and strings, as well as arrays of primitive types.
This is a required input. If an input is optional, I can also provide a default value.
When prompted for the second input, press Enter:
As showing in Table 2, the applet will produce a single output file containing the number of alignments:
counts
Counts File
file
When prompted for the first output name, I enter the following:
This name will also become a bash variable, so best practice is to use letters, numbers, and underscores.
The label is optional.
The class must be from the preceeding list. To be reminded of the choices, press the Tab key twice.
When prompted for the second output, press Enter:
Here are the final settings I'll use to complete the wizard:
Timeout Policy
48h
Programming language
bash
Access to internet
No (default)
Access to parent project
No (default)
Instance Type
mem1_ssd1_v2_x4 (default)
Applets are required to set a maximum time for running to prevent a job from running an excessively long time. While some applets may legitimately need days to run, most probably need something in the range of 12-48 hours. As noted in the prompt, I can use m, h, or d to specify minutes, hours, or days, respectively:
For the template language, I must select from bash or Python for the program that is executed when the applet starts. The applet code can execute any program available in the execution environment, including custom programs written in any language. I will choose bash:
Next, I determine if the applet has access to the internet and/or the parent project. Unless the applet specifically needs access, such as to download a file at runtime, it's best to answer no:
The user is always free to override the instance type using the --instance-type
option to dx run
.
The final output from dx-app-wizard
is a summary of the files that are created:
This file should contain applet implementation details.
This file should contain user help.
The answers from dx-app-wizard
are used to create the app metadata.
The resources directory is for any additional files you want available on the runtime instance.
The src (pronounced "source") is a conventional place for source code, but it's not a requirement that code lives in this directory.
This is the bash script that will be executed when the applet is run.
The test directory is empty and will not be discussed in this section.
The contents of the resources directory will be placed into the root directory of the runtime instance. For instance, if you create a file resources/my_tool, then it will be available on the runtime instance as /my_tool. You would either need to reference the full path (/my_tool) or expand the $PATH
variable to include /
. Best practice is to create the directory structure resources/usr/local/bin/, and then the file will be at /usr/local/bin/my_tool as /usr/local/bin normally part of $PATH
.
Let's look at the dxapp.json that was generated by dx-app-wizard
. Note that this is a simple text file that you can edit at any time:
The inputSpec has a section for patterns where I will add a few Unix file globs to indicate acceptable file suffix:
The outputSpec needs no update:
The runSpec contains the timeout along with the indication to use bash to run src/samtools_count.sh. If you ever wanted to change the name or location of the run script, update this section:
Finally, the regionalOptions indicates the default runtime instance.
In the preceeding runSpec, note that the applet will run on Ubuntu 20.04. This instance will include dx-toolkit
and several programming languages including bash, Python 3.x, Perl 5.x, and R 3.x. Anything else needed by the applet must be installed. Edit the runSpec to include the following execDepends to install samtools
at runtime using the apt
package manger:
The package_manager may be one of the following:
apt
(Ubuntu)
pip
(Python)
gem
(Ruby)
cpan
(Perl)
cran
(R)
Some caveats:
This runs apt install
every execution, which is fine for fast installs. Some packages may take 5-15 minutes to install, in which case you will pay for those extra minutes on every run.
Installs current version in the package manager, which may be old. For instance, apt
install v1.10 as of this writing while the current version is v1.17.
Your applet could break if the program changes if the package manager updates to a newer version.
An alternative is to build an asset that the applet uses. Assets have many advantages, including:
Build asset once
Runtime installs are quick decompression of tarballs
Assets are static and cannot break your code
Create a new folder with the name of your asset.
Then, create the file dxasset.json in the folder with the following contents:
When I execute dx build_asset
in the folder, a new job will run to build the asset:
As noted, the record ID of the asset can now be used in an assetDepends section, which should replace the execDepends:
Execute dx build_asset
inside this directory to build the asset into the selected project. (You can also use the --destination
option to specify where to place the asset file, which will be a tarball.)
The build process will create a new job to build the asset.
The default src/samtools_count.sh contains many lines of comments to guide you in writing your application code. Update the file to the following:
This is the colloquially named "shebang" line that indicates this is a bash script.
Althought it's not a requirement that app code be contained in a main()
function, it is best practice.
The original template uses echo
to show you the runtime value of the inputs.
Download the input file.
Execute samtools
to count the alignments in the input file.
Upload the results file and save the new file ID.
Add the new file ID to the job's output.
Remember that the $bam
variable matches the name of the input in dxapp.json. If you ever wish to change this, be sure to update both the script and the JSON.
Run dx build
to create the applet on the DNAnexus platform.
If you have previous built the applet, you will be prompted to use the flags -f|--overwrite
or -a|--archive
flags:
As habit, I always use -f
to force the build:
Without the -d|--destination
option, the applet will be placed into the root directory of the project. I like to make an apps folder to hold my applets:
TIP: Best practice is to create folders for applets, resources, assets, etc.
I'd like to discuss this code a little more. In bash, the echo
command will print to the console. As in any language, this is a great way to see what's happening when your code is running. In the following line, the $bam
variable will only have a value at runtime, so you will not be able to run this script locally:
When I execute this code, I see output like the following:
That means that the following line:
Will execute the following command at runtime:
Take a look at the usage for dx download
to remind yourself that the -o
option here is directing that the output file name be input.bam:
The next line of code executes samtools view
with the -c
. Execute samtools view -h
to read the documentation:
I often use a cloud workstation to work through app building. It's the same execution environment (Ubuntu Linux), so I will install any programs I need there, download sample input files, run commands and verify the behavior and output of the tools, etc.
If I download the input file NA12878.bam (file-FpQKQk00FgkGV3Vb3jJ8xqGV
), I can run the following command to see that there are 60,777 aligments:
I can use Unix output redirection with >
to place the output into the file counts.txt and cat
to verify the output:
Therefore, the following line of code from the bash script place the count of the input BAM file into counts.txt:
Next, I upload the counts.txt file to the platform using the --brief
option that will only show the new file ID:
In bash, I can use either backticks (``) or $()
to capture the results from a command, so the following line captures the file ID into the variable counts_id
:
I use add this new file ID as an output from the job using dx-jobutil-add-output
:
Here is the last command of the script that sets the counts
output variable defined in the dxapp.json to the new $counts_id
value:
In the preceeding applet, the output filename is always counts.txt. It would be better for each output file to use the name of the input BAM. When I defined the bam
input, I get four variables:
bam
: the input file ID
bam_path
: the default path to the downloaded input file
bam_name
: the filename, also the output of basename($bam_path)
bam_prefix
: the filename minus any file extension defined in the patterns
of the dxapp.json
The default patterns
for a file input in dxapp.json is ["*"]
. This matches the entire input filename, causing bam_prefix
to be the empty string.
TIP: Always be sure to set patterns
to the expected file extensions.
Given an input file of NA12878.bam, the following code will create an output file called NA12878.txt:
Print out the additional variables.
Download the input file to the filename. The -o
option here is superfluous as the default behavior is to download the file to it's filename. In the preceeding example, I saved it to the filename input.bam.
Define the variable outfile
to use root of the input filename.
Write the output from samtools
to the preferred output filename.
Upload the output file.
When I run this code, I can see the values of the other input file variables:
The bam_path
value is the default path to write the bam
file if I were to use dx-download-all-inputs
. In this case, I used dx download
with the -o
option to write it to a file in the current working directory, so there is no file at that path.
There are two ways to download the input files: one at a time or all at once. So far, I've shown the first way using dx download
. The second way uses dx-download-all-inputs
to download all the input files to the directory /home/dnanexus/in. This will contain a directory for each file input, so the bam
input file will be placed into /home/dnanexus/in/bam as shown for the $bam_path
in the preceeding section. If the input is an array:file
, there will be additional numbered subdirectories for each of the runtime values.
Following is the usage:
I can change my code to use this:
Download the input file to the default location.
Use the $bam_prefix
variable (e.g., NA12878) to create the outfile
.
Use the $bam_path
variable to execute samtools
with the path to the in directory.
TIP: Using dx-download-all-inputs --parallel
is best practice to download all input files as fast as possible.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Lastly, I must specify a default instance type. The prompt includes an abbreviated list of . The final number indicates the number of cores, e.g., _x4
indicates 4 cores. The greater the number of cores, the more available memory and disk space. In this case, a small 4-core instance is sufficient: