Example 3: cnvkit
This example builds on the asset you created in the bash version. You will:
Learn how to download inputs of type array:file
Use regular expressions to classify output files
Getting Started
We'll call our new applet python_cnvkit. If you want to start from dx-app-wizard, use the following specs for the inputs and outputs:
| Input Name | Type | Optional | Default Value |
| --- | --- | --- | --- |
| bam_tumor | array:file | No | NA |
| reference | file | No | NA |
The output specs are as follows:
| Output Name | Type |
| --- | --- |
| cns | array:file |
| cns_filtered | array:file |
| plot | array:file |
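For readers who prefer to write the specs by hand rather than answer the dx-app-wizard prompts, the following is a minimal sketch of how the fields listed above might be rendered as the inputSpec and outputSpec sections of dxapp.json. The tuple layout and variable names are illustrative only, and the sketch assumes the usual name, class, and optional keys.

```python
import json

# Fields from the specs listed above: (name, class, optional) for inputs
# and (name, class) for outputs. The tuple layout is only for this sketch.
inputs = [
    ("bam_tumor", "array:file", False),
    ("reference", "file", False),
]
outputs = [
    ("cns", "array:file"),
    ("cns_filtered", "array:file"),
    ("plot", "array:file"),
]

spec = {
    "inputSpec": [
        {"name": name, "class": cls, "optional": optional}
        for name, cls, optional in inputs
    ],
    "outputSpec": [{"name": name, "class": cls} for name, cls in outputs],
}

# Print a fragment that could be pasted into dxapp.json.
print(json.dumps(spec, indent=2))
```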
You can also copy the bash applet directory and update the runSpec in dxapp.json to run a Python script and use the CNVKit asset from before:
"runSpec": {
  "timeoutPolicy": {
    "*": {
      "hours": 48
    }
  },
  "interpreter": "python3",
  "file": "src/python_cnvkit.py",
  "distribution": "Ubuntu",
  "release": "20.04",
  "version": "0",
  "assetDepends": [{"id": "record-GgP33b00BppJKpyyFxGpZJYf"}]
}
Here is the input.json:
{
  "bam_tumor": [
    {
      "$dnanexus_link": "file-GFxXjV006kZVQPb20G85VXBp"
    }
  ],
  "reference": {
    "$dnanexus_link": "file-GFxXvpj06kZfP0QVKq2p2FGF"
  }
}
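The shape of input.json matters here: an array:file input is a list of link objects, while a single file input is one link object. As a rough sketch, the same structure can be generated from plain file IDs. The helpers make_link and build_inputs are hypothetical names introduced only for this illustration and are not part of dxpy.

```python
import json
from typing import Dict, List


def make_link(file_id: str) -> Dict[str, str]:
    """Wrap a file ID in the link structure used by input.json (hypothetical helper)."""
    return {"$dnanexus_link": file_id}


def build_inputs(bam_ids: List[str], reference_id: str) -> dict:
    """Build the applet inputs: a list of links for array:file, one link for file."""
    return {
        "bam_tumor": [make_link(file_id) for file_id in bam_ids],
        "reference": make_link(reference_id),
    }


# File IDs taken from the input.json shown above.
applet_inputs = build_inputs(
    ["file-GFxXjV006kZVQPb20G85VXBp"],
    "file-GFxXvpj06kZfP0QVKq2p2FGF",
)
print(json.dumps(applet_inputs, indent=2))
```

Adding more tumor BAM files only means passing more IDs in the first argument; the reference stays a single link.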
Python Code
Update src/python_cnvkit.py to the following:
#!/usr/bin/env python

import os
import dxpy
import re
import sys
from typing import List
from subprocess import getstatusoutput


@dxpy.entry_point("main")
def main(bam_tumor, reference):
    bam_tumor = [dxpy.DXFile(item) for item in bam_tumor]  # 1

    reference = dxpy.DXFile(reference)  # 2
    reference_name = reference.describe().get("name", "reference.cnn")
    dxpy.download_dxfile(reference.get_id(), reference_name)

    bam_dir = "bams"
    os.makedirs(bam_dir)

    bam_files = []  # 3
    for file in bam_tumor:
        desc = file.describe()
        file_id = file.get_id()
        path = os.path.join(bam_dir, desc.get("name", file_id))
        dxpy.download_dxfile(file_id, path)  # 4
        bam_files.append(path)

    out_dir = "cnvkit-out"
    cmd = (
        f"cnvkit.py batch {' '.join(bam_files)} "
        f"-r {reference_name} "
        "-p $(expr $(nproc) - 1) "
        f"-d {out_dir} --scatter"
    )
    print(cmd)

    rv, out = getstatusoutput(cmd)  # 5
    if rv != 0:
        sys.exit(out)

    out_files = [os.path.join(out_dir, file) for file in os.listdir(out_dir)]  # 6
    print(f'out_files = {",".join(out_files)}')

    # Raw strings keep the regex backslashes from being read as string escapes.
    return {
        "cns": upload(r"\.call\.cns$", out_files),  # 7
        "cns_filtered": upload(r"(?<!\.call)\.cns$", out_files),
        "plot": upload(r"-scatter\.png$", out_files),
    }


def upload(pattern: str, paths: List[str]) -> List[dict]:
    """Upload files matching a pattern and return their DX links"""
    regex = re.compile(pattern)  # 8
    return [
        dxpy.dxlink(dxpy.upload_local_file(file))  # 9
        for file in filter(regex.search, paths)  # 10
    ]


dxpy.run()
1. Use a Python list comprehension to turn each tumor BAM link into a DXFile handle.
2. Download the reference file.
3. Initialize a list to hold the downloaded BAM paths.
4. Download each BAM file into a directory and append the path to the bam_files list.
5. Create, print, and run the command to execute CNVkit.
6. Find all the files created in the output directory. The os.listdir function only returns the filenames, so prepend the directory name.
7. For each of the output file categories, filter the output files and upload those matching the expected extension.
8. Compile the given regular expression.
9. Create a DX link for each uploaded file.
10. Filter the given files for those matching the regex.
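Step 5 builds a single shell string, which works for the file names in this example but would break on a path containing spaces. As an alternative sketch (not the method used in the applet above), the same arguments can be assembled as a list, with the thread count computed in Python instead of through $(expr $(nproc) - 1). The function build_cnvkit_cmd and the file name "bams/my tumor.bam" are hypothetical; the flags are the ones already used in the command above.

```python
import os
import shlex
from typing import List, Optional


def build_cnvkit_cmd(
    bam_files: List[str],
    reference_name: str,
    out_dir: str,
    threads: Optional[int] = None,
) -> List[str]:
    """Assemble the cnvkit.py batch arguments as a list (hypothetical helper)."""
    if threads is None:
        # Leave one core free, but never ask for fewer than one thread.
        threads = max(1, (os.cpu_count() or 2) - 1)
    return [
        "cnvkit.py", "batch", *bam_files,
        "-r", reference_name,
        "-p", str(threads),
        "-d", out_dir,
        "--scatter",
    ]


cmd = build_cnvkit_cmd(["bams/my tumor.bam"], "reference.cnn", "cnvkit-out", threads=3)

# shlex.join quotes each argument, so the printed command is safe to copy into a shell.
print(shlex.join(cmd))
```

A list like this can be passed to subprocess.run(cmd, check=True), which skips the shell entirely and raises an exception on a nonzero exit status.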
NOTE: The regex (?<!\.call)\.cns$ uses a negative lookbehind to ensure that .cns is not immediately preceded by .call, so files ending in .call.cns are excluded from this category.
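To see how the listing and the three patterns behave together, here is a small self-contained sketch that needs neither dxpy nor CNVkit. It creates empty files in a temporary directory, lists them with os.listdir, prepends the directory name, and sorts the paths into the three output labels. The file names are made up for illustration and are not taken from a real CNVkit run, and classify is a hypothetical stand-in for the upload function that skips the upload step.

```python
import os
import re
import tempfile
from typing import Dict, List

# The same three patterns used in the applet, written as raw strings.
PATTERNS = {
    "cns": r"\.call\.cns$",
    "cns_filtered": r"(?<!\.call)\.cns$",
    "plot": r"-scatter\.png$",
}


def classify(paths: List[str]) -> Dict[str, List[str]]:
    """Group paths by the first-level output label whose pattern they match."""
    return {
        label: sorted(path for path in paths if re.search(pattern, path))
        for label, pattern in PATTERNS.items()
    }


# Hypothetical output file names, chosen to exercise each pattern.
names = [
    "sample.cns",
    "sample.call.cns",
    "sample.bintest.cns",
    "sample.cnr",
    "sample-scatter.png",
]

with tempfile.TemporaryDirectory() as out_dir:
    for name in names:
        open(os.path.join(out_dir, name), "w").close()

    # os.listdir returns bare file names, so join each one to the directory.
    paths = [os.path.join(out_dir, name) for name in os.listdir(out_dir)]
    result = {
        label: [os.path.basename(path) for path in found]
        for label, found in classify(paths).items()
    }

print(result)
```

In this sketch sample.call.cns lands only under cns, because the lookbehind rejects it for cns_filtered, and sample.cnr matches none of the patterns, so it would not be uploaded.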
Here is the output from the job:
Job Log
-------
Watching job job-GgP7Z30071x73vpBzXK1jk7X. Press Ctrl+C to stop watching.
* CNVKit (python_cnvkit:main) (running) job-GgP7Z30071x73vpBzXK1jk7X
kyclark 2024-02-27 17:10:52 (running for 0:01:57)
2024-02-27 17:13:28 CNVKit INFO Logging initialized (priority)
2024-02-27 17:13:28 CNVKit INFO Logging initialized (bulk)
2024-02-27 17:13:34 CNVKit INFO Downloading bundled file cnvkit_asset.tar.gz
2024-02-27 17:14:02 CNVKit STDOUT >>> Unpacking cnvkit_asset.tar.gz to /
2024-02-27 17:14:02 CNVKit STDERR tar: Removing leading `/' from member names
2024-02-27 17:15:36 CNVKit INFO Setting SSH public key
2024-02-27 17:15:39 CNVKit STDOUT dxpy/0.369.0
(Linux-5.15.0-1053-aws-x86_64-with-glibc2.29) Python/3.8.10
2024-02-27 17:15:40 CNVKit STDOUT Invoking main with {'bam_tumor':
[{'$dnanexus_link': 'file-GFxXjV006kZVQPb20G85VXBp'}], 'reference':
{'$dnanexus_link': 'file-GFxXvpj06kZfP0QVKq2p2FGF'}}
2024-02-27 17:16:16 CNVKit STDOUT Running "cnvkit.py batch
bams/HCC1187_1x_tumor_markdup.bam -r reference.cnn -p $(expr $(nproc) - 1) -d
cnvkit-out --scatter"
2024-02-27 17:19:57 CNVKit STDOUT out_files = {",".join(out_files)}
* CNVKit (python_cnvkit:main) (done) job-GgP7Z30071x73vpBzXK1jk7X
kyclark 2024-02-27 17:10:52 (runtime 0:07:54)
Output: cns = [ file-GgP7jF80K7VPVpkkkzyqBK2Q ]
cns_filtered = [ file-GgP7jF80K7V7q1jJVPYJj0pg,
file-GgP7jFQ0K7VFfb7BJ3YbYy60 ]
plot = [ file-GgP7jFQ0K7V115GPfGYB2j6b ]
Review
You used a for loop to download multiple input BAM files into a local directory.
You used regular expressions to classify the output files into the three output labels.
Resources
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.