Example 3: cnvkit

This example will build on the asset you created in the bash version. You will:

Learn how to download inputs of type array:file
Use regular expressions to classify output files

Getting Started

We'll call our new applet python_cnvkit. If you want to start from dx-app-wizard, use the following specs for the inputs and outputs:

Input Name    Type          Optional    Default Value
bam_tumor     array:file
reference     file

The output specs are as follows:

Output Name     Type
cns             array:file
cns_filtered    array:file
plot            array:file

You can also copy the bash applet directory and update the runSpec in dxapp.json so that it runs a Python script and uses the CNVkit asset from before:

    "runSpec": {
        "timeoutPolicy": {
            "*": {
                "hours": 48
            }
        },
        "interpreter": "python3",
        "file": "src/python_cnvkit.py",
        "distribution": "Ubuntu",
        "release": "20.04",
        "version": "0",
        "assetDepends": [{"id": "record-GgP33b00BppJKpyyFxGpZJYf"}]
    }
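
Because dxapp.json must be strict JSON, a stray trailing comma (for example after the last key in runSpec) is enough to make it unparseable. The following is a minimal sketch of a pre-build check using only the Python standard library. The helper name check_json and the record-xxxx ID are hypothetical placeholders, not part of the applet.

```python
import json


def check_json(text):
    """Return (True, parsed) if text is strict JSON, else (False, message)."""
    try:
        return True, json.loads(text)
    except json.JSONDecodeError as err:
        # lineno/colno point at the place where parsing failed
        return False, f"line {err.lineno}, column {err.colno}: {err.msg}"


# Hypothetical fragments: the second has a trailing comma after assetDepends.
good = '{"runSpec": {"interpreter": "python3", "assetDepends": [{"id": "record-xxxx"}]}}'
bad = '{"runSpec": {"interpreter": "python3", "assetDepends": [{"id": "record-xxxx"}],}}'

ok_good, parsed = check_json(good)
ok_bad, message = check_json(bad)
print(ok_good, ok_bad)
```

To check a real file, read its contents first and pass the text to check_json.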

Here is the input.json:

{
    "bam_tumor": [
        {
            "$dnanexus_link": "file-GFxXjV006kZVQPb20G85VXBp"
        }
    ],
    "reference": {
        "$dnanexus_link": "file-GFxXvpj06kZfP0QVKq2p2FGF"
    }
}
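
To make the shape of this input concrete, here is a small sketch that parses the same JSON and pulls out the file IDs. It shows that an array:file input arrives as a list of $dnanexus_link objects while a file input is a single object. The helper link_ids is a hypothetical name, and the sketch assumes the simple form in which each link value is a plain file ID string.

```python
import json

input_json = """
{
    "bam_tumor": [
        {"$dnanexus_link": "file-GFxXjV006kZVQPb20G85VXBp"}
    ],
    "reference": {"$dnanexus_link": "file-GFxXvpj06kZfP0QVKq2p2FGF"}
}
"""


def link_ids(value):
    """Collect IDs from one $dnanexus_link object or a list of them."""
    items = value if isinstance(value, list) else [value]
    return [item["$dnanexus_link"] for item in items]


inputs = json.loads(input_json)
bam_ids = link_ids(inputs["bam_tumor"])        # array:file -> list of links
reference_ids = link_ids(inputs["reference"])  # file -> a single link
print(bam_ids, reference_ids)
```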

Python Code

Update src/python_cnvkit.py to the following:

python_cnvkit.py

#!/usr/bin/env python3

import os
import dxpy
import re
import sys
from typing import List
from subprocess import getstatusoutput


@dxpy.entry_point("main")
def main(bam_tumor, reference):
    bam_tumor = [dxpy.DXFile(item) for item in bam_tumor] # 1

    reference = dxpy.DXFile(reference) # 2
    reference_name = reference.describe().get("name", "reference.cnn")
    dxpy.download_dxfile(reference.get_id(), reference_name)

    bam_dir = "bams"
    os.makedirs(bam_dir)

    bam_files = [] # 3
    for file in bam_tumor:
        desc = file.describe()
        file_id = file.get_id()
        path = os.path.join(bam_dir, desc.get("name", file_id))
        dxpy.download_dxfile(file_id, path) # 4
        bam_files.append(path)

    out_dir = "cnvkit-out"
    cmd = (
        f"cnvkit.py batch {' '.join(bam_files)} "
        f"-r {reference_name} "
        f"-p $(expr $(nproc) - 1) "
        f"-d {out_dir} --scatter"
    )
    print(cmd)

    rv, out = getstatusoutput(cmd) # 5
    if rv != 0:
        sys.exit(out)

    out_files = [os.path.join(out_dir, file) for file in os.listdir(out_dir)] # 6
    print(f'out_files = {",".join(out_files)}')

    return {
        "cns": upload(r"\.call\.cns$", out_files), # 7
        "cns_filtered": upload(r"(?<!\.call)\.cns$", out_files),
        "plot": upload(r"-scatter\.png$", out_files),
    }


def upload(pattern: str, paths: List[str]) -> List[dict]:
    """Upload files matching a pattern and return a DX link for each"""

    regex = re.compile(pattern) # 8
    return [
        dxpy.dxlink(dxpy.upload_local_file(file)) # 9
        for file in filter(regex.search, paths) # 10
    ]


dxpy.run()

1. Use a Python list comprehension to create a dxpy.DXFile handler for each tumor BAM link.
2. Create a handler for the reference file, look up its name, and download it.
3. Initialize a list to hold the downloaded BAM paths.
4. Download each BAM file into a directory and append the path to the bam_files list.
5. Create, print, and run the command to execute CNVkit. Exit with the command's output if it fails.
6. Find all the files created in the output directory. The os.listdir function returns only the filenames, so prepend the directory name.
7. For each output category, upload the output files whose names match the expected pattern.
8. Compile the given regular expression.
9. Upload each matching file and create a DX link to it.
10. Filter the given paths for those matching the regex.
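
The step that builds the CNVkit command joins the BAM paths with spaces, which breaks if a file name contains a space or shell metacharacter. A hedged alternative is to assemble the arguments as a list and quote them with shlex.join (available in Python 3.8 and later). The sketch below reuses the flags from the script above; the function name build_cnvkit_cmd and the sample file name are hypothetical, and os.cpu_count() stands in for the shell's nproc.

```python
import os
import shlex
from typing import List


def build_cnvkit_cmd(bam_files: List[str], reference: str, out_dir: str) -> str:
    """Build a shell-safe cnvkit.py batch command string."""
    # Leave one core free, but never request fewer than one process.
    procs = max(1, (os.cpu_count() or 2) - 1)
    args = [
        "cnvkit.py", "batch", *bam_files,
        "-r", reference,
        "-p", str(procs),
        "-d", out_dir,
        "--scatter",
    ]
    # shlex.join quotes any argument containing spaces or metacharacters
    return shlex.join(args)


# Hypothetical BAM name with a space to show the quoting.
cmd = build_cnvkit_cmd(["bams/my tumor.bam"], "reference.cnn", "cnvkit-out")
print(cmd)
```

The resulting string can be passed to getstatusoutput exactly as before.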

NOTE: The regex (?<!\.call)\.cns$ uses a negative lookbehind so that it matches file names ending in .cns only when .cns is not immediately preceded by .call.
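
The classification can be tried locally without running a job. This sketch applies the three patterns to a list of made-up file names (the sample names are assumptions chosen for illustration, not output from a real run) and groups them under the applet's output labels.

```python
import re

# Same patterns as the applet, written as raw strings.
PATTERNS = {
    "cns": r"\.call\.cns$",
    "cns_filtered": r"(?<!\.call)\.cns$",  # .cns not preceded by .call
    "plot": r"-scatter\.png$",
}


def classify(paths):
    """Group paths by the first-class output label whose regex they match."""
    return {
        label: [path for path in paths if re.search(pattern, path)]
        for label, pattern in PATTERNS.items()
    }


# Hypothetical file names for illustration only.
names = [
    "cnvkit-out/sample.cns",
    "cnvkit-out/sample.call.cns",
    "cnvkit-out/sample.bintest.cns",
    "cnvkit-out/sample-scatter.png",
    "cnvkit-out/sample.cnr",
]
result = classify(names)
for label, matched in result.items():
    print(label, matched)
```

Files that match none of the patterns, such as the .cnr file here, are simply left out of the outputs.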

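Here is the output from the job. The out_files line shows the literal text {",".join(out_files)} because the print call in this run was not an f-string, so the list was not interpolated: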

Job Log
-------
Watching job job-GgP7Z30071x73vpBzXK1jk7X. Press Ctrl+C to stop watching.
* CNVKit (python_cnvkit:main) (running) job-GgP7Z30071x73vpBzXK1jk7X
  kyclark 2024-02-27 17:10:52 (running for 0:01:57)
2024-02-27 17:13:28 CNVKit INFO Logging initialized (priority)
2024-02-27 17:13:28 CNVKit INFO Logging initialized (bulk)
2024-02-27 17:13:34 CNVKit INFO Downloading bundled file cnvkit_asset.tar.gz
2024-02-27 17:14:02 CNVKit STDOUT >>> Unpacking cnvkit_asset.tar.gz to /
2024-02-27 17:14:02 CNVKit STDERR tar: Removing leading `/' from member names
2024-02-27 17:15:36 CNVKit INFO Setting SSH public key
2024-02-27 17:15:39 CNVKit STDOUT dxpy/0.369.0
(Linux-5.15.0-1053-aws-x86_64-with-glibc2.29) Python/3.8.10
2024-02-27 17:15:40 CNVKit STDOUT Invoking main with {'bam_tumor':
[{'$dnanexus_link': 'file-GFxXjV006kZVQPb20G85VXBp'}], 'reference':
{'$dnanexus_link': 'file-GFxXvpj06kZfP0QVKq2p2FGF'}}
2024-02-27 17:16:16 CNVKit STDOUT Running "cnvkit.py batch
bams/HCC1187_1x_tumor_markdup.bam -r reference.cnn -p $(expr $(nproc) - 1) -d
cnvkit-out --scatter"
2024-02-27 17:19:57 CNVKit STDOUT out_files = {",".join(out_files)}
* CNVKit (python_cnvkit:main) (done) job-GgP7Z30071x73vpBzXK1jk7X
  kyclark 2024-02-27 17:10:52 (runtime 0:07:54)
  Output: cns = [ file-GgP7jF80K7VPVpkkkzyqBK2Q ]
          cns_filtered = [ file-GgP7jF80K7V7q1jJVPYJj0pg, 
                           file-GgP7jFQ0K7VFfb7BJ3YbYy60 ]
          plot = [ file-GgP7jFQ0K7V115GPfGYB2j6b ]

Review

You used a for loop to download multiple input BAM files into a local directory.
You used regular expressions to classify the output files into the three output labels.

Resources

Full Documentation

To create a support ticket if there are technical issues:

Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.

