Example 3: samtools
Building a Native Applet with Bash
Using dx-app-wizard to Create An Applet
In this applet, I'll show how to count the number of reads in a SAM or BAM file using samtools. The SAM format (Sequence Alignment Map) is a tab-delimited text description for sequence alignments, and the BAM format is the same data but stored in binary format for more compression. As the SAM format uses a line break to delineate each record, counting the alignments could be as simple as using wc -l; however, the BAM format requires a program like samtools to read the input file, so I'll show how to install this into the applet's execution environment.
A minimal native applet requires just two files that exist in a directory with the same name as the applet:
dxapp.json: a JSON-formatted metadata file
a bash or Python program to execute
I'll use dx-app-wizard to create a skeleton applet structure with these files:
$ dx-app-wizard
DNAnexus App Wizard, API v1.0.0
Basic Metadata
Please enter basic metadata fields that will be used to describe your app.
Optional fields are denoted by options with square brackets. At the end of
this wizard, the files necessary for building your app will be generated from
the answers you provide.First, I must give my applet a name. The prompt shows that I must use only letters, numbers, a dot, underscore, and a dash. As stated earlier, this applet name will also be the name of the directory, and I'll use samtools_count:
Next, I'm asked for the title. Note that the prompt includes empty square brackets ([]), which contain the default value if I press Enter. As title is not required, it contains the empty string, but I will provide an informative title:
Likewise, the summary field is not required:
The version is also optional, and I will press Enter to take the default:
Input Specification
This applet requires a single input, as shows in Table 1.
bam
BAM File
file
No
NA
When prompted for the first input, I'll enter the following:
The name of the input will be used as a variable in the bash code, so I will use only letters, numbers, and underscores as in
bamorbam_file.The label is optional, as noted by the empty square brackets.
The types include primitives like integers, floating-point numbers, and strings, as well as arrays of primitive types.
This is a required input. If an input is optional, I can also provide a default value.
When prompted for the second input, press Enter:
Output Specification
As showing in Table 2, the applet will produce a single output file containing the number of alignments:
counts
Counts File
file
When prompted for the first output name, I enter the following:
This name will also become a bash variable, so best practice is to use letters, numbers, and underscores.
The label is optional.
The class must be from the preceeding list. To be reminded of the choices, press the Tab key twice.
When prompted for the second output, press Enter:
Additional Settings
Here are the final settings I'll use to complete the wizard:
Timeout Policy
48h
Programming language
bash
Access to internet
No (default)
Access to parent project
No (default)
Instance Type
mem1_ssd1_v2_x4 (default)
Applets are required to set a maximum time for running to prevent a job from running an excessively long time. While some applets may legitimately need days to run, most probably need something in the range of 12-48 hours. As noted in the prompt, I can use m, h, or d to specify minutes, hours, or days, respectively:
For the template language, I must select from bash or Python for the program that is executed when the applet starts. The applet code can execute any program available in the execution environment, including custom programs written in any language. I will choose bash:
Next, I determine if the applet has access to the internet and/or the parent project. Unless the applet specifically needs access, such as to download a file at runtime, it's best to answer no:
Lastly, I must specify a default instance type. The prompt includes an abbreviated list of instance types. The final number indicates the number of cores, e.g., _x4 indicates 4 cores. The greater the number of cores, the more available memory and disk space. In this case, a small 4-core instance is sufficient:
The user is always free to override the instance type using the --instance-type option to dx run.
The final output from dx-app-wizard is a summary of the files that are created:
This file should contain applet implementation details.
This file should contain user help.
The answers from
dx-app-wizardare used to create the app metadata.The resources directory is for any additional files you want available on the runtime instance.
The src (pronounced "source") is a conventional place for source code, but it's not a requirement that code lives in this directory.
This is the bash script that will be executed when the applet is run.
The test directory is empty and will not be discussed in this section.
The contents of the resources directory will be placed into the root directory of the runtime instance. For instance, if you create a file resources/my_tool, then it will be available on the runtime instance as /my_tool. You would either need to reference the full path (/my_tool) or expand the $PATH variable to include /. Best practice is to create the directory structure resources/usr/local/bin/, and then the file will be at /usr/local/bin/my_tool as /usr/local/bin normally part of $PATH.
Reading dxapp.json
Let's look at the dxapp.json that was generated by dx-app-wizard. Note that this is a simple text file that you can edit at any time:
The inputSpec has a section for patterns where I will add a few Unix file globs to indicate acceptable file suffix:
The outputSpec needs no update:
The runSpec contains the timeout along with the indication to use bash to run src/samtools_count.sh. If you ever wanted to change the name or location of the run script, update this section:
Finally, the regionalOptions indicates the default runtime instance.
Installing Applet Dependencies
In the preceeding runSpec, note that the applet will run on Ubuntu 20.04. This instance will include dx-toolkit and several programming languages including bash, Python 3.x, Perl 5.x, and R 3.x. Anything else needed by the applet must be installed. Edit the runSpec to include the following execDepends to install samtools at runtime using the apt package manger:
The package_manager may be one of the following:
apt(Ubuntu)pip(Python)gem(Ruby)cpan(Perl)cran(R)
Some caveats:
This runs
apt installevery execution, which is fine for fast installs. Some packages may take 5-15 minutes to install, in which case you will pay for those extra minutes on every run.Installs current version in the package manager, which may be old. For instance,
aptinstall v1.10 as of this writing while the current version is v1.17.Your applet could break if the program changes if the package manager updates to a newer version.
Building An Asset
An alternative is to build an asset that the applet uses. Assets have many advantages, including:
Build asset once
Runtime installs are quick decompression of tarballs
Assets are static and cannot break your code
Create a new folder with the name of your asset.
Then, create the file dxasset.json in the folder with the following contents:
When I execute dx build_asset in the folder, a new job will run to build the asset:
As noted, the record ID of the asset can now be used in an assetDepends section, which should replace the execDepends:
Execute dx build_asset inside this directory to build the asset into the selected project. (You can also use the --destination option to specify where to place the asset file, which will be a tarball.)
The build process will create a new job to build the asset.
Writing Applet Code
The default src/samtools_count.sh contains many lines of comments to guide you in writing your application code. Update the file to the following:
This is the colloquially named "shebang" line that indicates this is a bash script.
Althought it's not a requirement that app code be contained in a
main()function, it is best practice.The original template uses
echoto show you the runtime value of the inputs.Download the input file.
Execute
samtoolsto count the alignments in the input file.Upload the results file and save the new file ID.
Add the new file ID to the job's output.
Remember that the $bam variable matches the name of the input in dxapp.json. If you ever wish to change this, be sure to update both the script and the JSON.
Building the Applet
Run dx build to create the applet on the DNAnexus platform.
If you have previous built the applet, you will be prompted to use the flags -f|--overwrite or -a|--archive flags:
As habit, I always use -f to force the build:
Without the -d|--destination option, the applet will be placed into the root directory of the project. I like to make an apps folder to hold my applets:
TIP: Best practice is to create folders for applets, resources, assets, etc.
Executing the Applet
Understanding the Code
I'd like to discuss this code a little more. In bash, the echo command will print to the console. As in any language, this is a great way to see what's happening when your code is running. In the following line, the $bam variable will only have a value at runtime, so you will not be able to run this script locally:
When I execute this code, I see output like the following:
That means that the following line:
Will execute the following command at runtime:
Take a look at the usage for dx download to remind yourself that the -o option here is directing that the output file name be input.bam:
The next line of code executes samtools view with the -c. Execute samtools view -h to read the documentation:
I often use a cloud workstation to work through app building. It's the same execution environment (Ubuntu Linux), so I will install any programs I need there, download sample input files, run commands and verify the behavior and output of the tools, etc.
If I download the input file NA12878.bam (file-FpQKQk00FgkGV3Vb3jJ8xqGV), I can run the following command to see that there are 60,777 aligments:
I can use Unix output redirection with > to place the output into the file counts.txt and cat to verify the output:
Therefore, the following line of code from the bash script place the count of the input BAM file into counts.txt:
Next, I upload the counts.txt file to the platform using the --brief option that will only show the new file ID:
In bash, I can use either backticks (``) or $() to capture the results from a command, so the following line captures the file ID into the variable counts_id:
I use add this new file ID as an output from the job using dx-jobutil-add-output:
Here is the last command of the script that sets the counts output variable defined in the dxapp.json to the new $counts_id value:
Using Input File Helper Variables
In the preceeding applet, the output filename is always counts.txt. It would be better for each output file to use the name of the input BAM. When I defined the bam input, I get four variables:
bam: the input file IDbam_path: the default path to the downloaded input filebam_name: the filename, also the output ofbasename($bam_path)bam_prefix: the filename minus any file extension defined in thepatternsof the dxapp.json
The default patterns for a file input in dxapp.json is ["*"]. This matches the entire input filename, causing bam_prefix to be the empty string.
TIP: Always be sure to set patterns to the expected file extensions.
Given an input file of NA12878.bam, the following code will create an output file called NA12878.txt:
Print out the additional variables.
Download the input file to the filename. The
-ooption here is superfluous as the default behavior is to download the file to it's filename. In the preceeding example, I saved it to the filename input.bam.Define the variable
outfileto use root of the input filename.Write the output from
samtoolsto the preferred output filename.Upload the output file.
When I run this code, I can see the values of the other input file variables:
The bam_path value is the default path to write the bam file if I were to use dx-download-all-inputs. In this case, I used dx download with the -o option to write it to a file in the current working directory, so there is no file at that path.
Using dx-download-all-inputs
There are two ways to download the input files: one at a time or all at once. So far, I've shown the first way using dx download. The second way uses dx-download-all-inputs to download all the input files to the directory /home/dnanexus/in. This will contain a directory for each file input, so the bam input file will be placed into /home/dnanexus/in/bam as shown for the $bam_path in the preceeding section. If the input is an array:file, there will be additional numbered subdirectories for each of the runtime values.
Following is the usage:
I can change my code to use this:
Download the input file to the default location.
Use the
$bam_prefixvariable (e.g., NA12878) to create theoutfile.Use the
$bam_pathvariable to executesamtoolswith the path to the in directory.
TIP: Using dx-download-all-inputs --parallel is best practice to download all input files as fast as possible.
Resources
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Last updated
Was this helpful?