Academy Documentation
Example 3: cnvkit

This example will build on the asset you created in the bash version. You will:

  • Learn how to download inputs of type array:file

  • Use regular expressions to classify output files

Getting Started

We'll call our new applet python_cnvkit. If you want to start from dx-app-wizard, use the following specs for the inputs and outputs:

Input Name    Type          Optional    Default Value
bam_tumor     array:file    No          NA
reference     file          No          NA

The output specs are as follows:

Output Name     Type
cns             array:file
cns_filtered    array:file
plot            array:file
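
For orientation, the input and output specs above correspond to the inputSpec and outputSpec arrays in dxapp.json. The following is a minimal sketch that builds those arrays as Python dictionaries and prints them as JSON. It assumes only the name, class, and optional fields; dx-app-wizard typically writes additional fields (such as labels and help text) that are omitted here, and the helper name build_specs is hypothetical.

```python
import json

# Rows from the input spec: (name, class, optional).
INPUTS = [
    ("bam_tumor", "array:file", False),
    ("reference", "file", False),
]

# Rows from the output spec: (name, class).
OUTPUTS = [
    ("cns", "array:file"),
    ("cns_filtered", "array:file"),
    ("plot", "array:file"),
]


def build_specs(inputs, outputs):
    """Return the inputSpec/outputSpec portion of a dxapp.json as a dict."""
    return {
        "inputSpec": [
            {"name": name, "class": cls, "optional": optional}
            for name, cls, optional in inputs
        ],
        "outputSpec": [{"name": name, "class": cls} for name, cls in outputs],
    }


if __name__ == "__main__":
    print(json.dumps(build_specs(INPUTS, OUTPUTS), indent=4))
```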

You can also copy the bash applet directory and update the runSpec in dxapp.json to run a Python script and use the CNVKit asset from before:

    "runSpec": {
        "timeoutPolicy": {
            "*": {
                "hours": 48
            }
        },
        "interpreter": "python3",
        "file": "src/python_cnvkit.py",
        "distribution": "Ubuntu",
        "release": "20.04",
        "version": "0",
        "assetDepends": [{"id": "record-GgP33b00BppJKpyyFxGpZJYf"}]
    }

Here is the input.json:

{
    "bam_tumor": [
        {
            "$dnanexus_link": "file-GFxXjV006kZVQPb20G85VXBp"
        }
    ],
    "reference": {
        "$dnanexus_link": "file-GFxXvpj06kZfP0QVKq2p2FGF"
    }
}
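
The $dnanexus_link wrappers are easy to mistype when written by hand. As a minimal sketch, the hypothetical helpers dx_link and make_input_json below (they are not part of dxpy) wrap plain file IDs in the same structure as the input.json above; the file IDs are the ones already shown.

```python
import json
from typing import Dict, List


def dx_link(file_id: str) -> Dict[str, str]:
    """Wrap a file ID in the link structure used by input.json."""
    return {"$dnanexus_link": file_id}


def make_input_json(bam_ids: List[str], reference_id: str) -> str:
    """Build the applet's input JSON from plain file IDs (hypothetical helper)."""
    payload = {
        # array:file inputs are a list of links, one per file.
        "bam_tumor": [dx_link(file_id) for file_id in bam_ids],
        # A plain file input is a single link.
        "reference": dx_link(reference_id),
    }
    return json.dumps(payload, indent=4)


if __name__ == "__main__":
    print(make_input_json(
        ["file-GFxXjV006kZVQPb20G85VXBp"],
        "file-GFxXvpj06kZfP0QVKq2p2FGF",
    ))
```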

Python Code

Update src/python_cnvkit.py to the following:

python_cnvkit.py
#!/usr/bin/env python

import os
import dxpy
import re
import sys
from typing import List
from subprocess import getstatusoutput


@dxpy.entry_point("main")
def main(bam_tumor, reference):
    bam_tumor = [dxpy.DXFile(item) for item in bam_tumor] # 1

    reference = dxpy.DXFile(reference) # 2
    reference_name = reference.describe().get("name", "reference.cnn")
    dxpy.download_dxfile(reference.get_id(), reference_name)

    bam_dir = "bams"
    os.makedirs(bam_dir)

    bam_files = [] # 3
    for file in bam_tumor:
        desc = file.describe()
        file_id = file.get_id()
        path = os.path.join(bam_dir, desc.get("name", file_id))
        dxpy.download_dxfile(file_id, path) # 4
        bam_files.append(path)

    out_dir = "cnvkit-out"
    cmd = (
        f"cnvkit.py batch {' '.join(bam_files)} "
        f"-r {reference_name} "
        f"-p $(expr $(nproc) - 1) "
        f"-d {out_dir} --scatter"
    )
    print(cmd)

    rv, out = getstatusoutput(cmd) # 5
    if rv != 0:
        sys.exit(out)

    out_files = [os.path.join(out_dir, file) for file in os.listdir(out_dir)] # 6
    print(f'out_files = {",".join(out_files)}')

    return {
        "cns": upload(r"\.call\.cns$", out_files), # 7
        "cns_filtered": upload(r"(?<!\.call)\.cns$", out_files),
        "plot": upload(r"-scatter\.png$", out_files),
    }


def upload(pattern: str, paths: List[str]) -> List[dict]:
    """Upload files matching a pattern and return their DX links"""

    regex = re.compile(pattern) # 8
    return [
        dxpy.dxlink(dxpy.upload_local_file(file)) # 9
        for file in filter(regex.search, paths) # 10
    ]


dxpy.run()
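  1. Use a Python list comprehension to generate a list of DXFile objects for the tumor BAM files.

  2. Download the reference file.

  3. Initialize a list to hold the downloaded BAM paths.

  4. Download each BAM file into a directory and append the path to the bam_files list.

  5. Create, print, and run the command to execute CNVkit.

  6. Find all the files created in the output directory. The os.listdir function only returns the filenames, so prepend the directory name.

  7. For each of the output file categories, filter the output files and upload those matching the expected extension.

  8. Compile the given regular expression.

  9. Create a DX file ID link for each uploaded file.

  10. Filter the given files for those matching the regex.

NOTE: The regex (?<!\.call)\.cns$ uses a negative lookbehind to ensure that .call does not precede .cns.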
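
To see how the three patterns split the output directory, here is a minimal sketch that applies them to a list of paths without uploading anything. The patterns are the ones from the applet, written as raw strings with the dot in .png escaped. The function name classify and the file names are hypothetical and chosen only to exercise each pattern; the negative lookbehind (?<!\.call) is what keeps the .call.cns file out of cns_filtered.

```python
import re
from typing import Dict, List

# The applet's patterns, written as raw strings.
PATTERNS = {
    "cns": r"\.call\.cns$",
    "cns_filtered": r"(?<!\.call)\.cns$",
    "plot": r"-scatter\.png$",
}


def classify(paths: List[str]) -> Dict[str, List[str]]:
    """Group paths by the output label whose pattern they match."""
    return {
        label: [path for path in paths if re.search(pattern, path)]
        for label, pattern in PATTERNS.items()
    }


if __name__ == "__main__":
    # Hypothetical file names, not taken from a real job.
    example = [
        "cnvkit-out/sample.call.cns",
        "cnvkit-out/sample.cns",
        "cnvkit-out/sample.bintest.cns",
        "cnvkit-out/sample-scatter.png",
        "cnvkit-out/sample.cnr",
    ]
    for label, matched in classify(example).items():
        print(label, matched)
```

Files that match none of the patterns, such as the .cnr file in this example, are simply not returned under any label.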

Here is the output from the job:

Job Log
-------
Watching job job-GgP7Z30071x73vpBzXK1jk7X. Press Ctrl+C to stop watching.
* CNVKit (python_cnvkit:main) (running) job-GgP7Z30071x73vpBzXK1jk7X
  kyclark 2024-02-27 17:10:52 (running for 0:01:57)
2024-02-27 17:13:28 CNVKit INFO Logging initialized (priority)
2024-02-27 17:13:28 CNVKit INFO Logging initialized (bulk)
2024-02-27 17:13:34 CNVKit INFO Downloading bundled file cnvkit_asset.tar.gz
2024-02-27 17:14:02 CNVKit STDOUT >>> Unpacking cnvkit_asset.tar.gz to /
2024-02-27 17:14:02 CNVKit STDERR tar: Removing leading `/' from member names
2024-02-27 17:15:36 CNVKit INFO Setting SSH public key
2024-02-27 17:15:39 CNVKit STDOUT dxpy/0.369.0
(Linux-5.15.0-1053-aws-x86_64-with-glibc2.29) Python/3.8.10
2024-02-27 17:15:40 CNVKit STDOUT Invoking main with {'bam_tumor':
[{'$dnanexus_link': 'file-GFxXjV006kZVQPb20G85VXBp'}], 'reference':
{'$dnanexus_link': 'file-GFxXvpj06kZfP0QVKq2p2FGF'}}
2024-02-27 17:16:16 CNVKit STDOUT Running "cnvkit.py batch
bams/HCC1187_1x_tumor_markdup.bam -r reference.cnn -p $(expr $(nproc) - 1) -d
cnvkit-out --scatter"
2024-02-27 17:19:57 CNVKit STDOUT out_files = {",".join(out_files)}
* CNVKit (python_cnvkit:main) (done) job-GgP7Z30071x73vpBzXK1jk7X
  kyclark 2024-02-27 17:10:52 (runtime 0:07:54)
  Output: cns = [ file-GgP7jF80K7VPVpkkkzyqBK2Q ]
          cns_filtered = [ file-GgP7jF80K7V7q1jJVPYJj0pg, 
                           file-GgP7jFQ0K7VFfb7BJ3YbYy60 ]
          plot = [ file-GgP7jFQ0K7V115GPfGYB2j6b ]
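
The CNVkit command shown in the log is passed to the shell as a single string, so a BAM name containing spaces would be split into several arguments. The following is a minimal sketch of one alternative, assuming you are free to restructure the applet: build the command as an argument list, which could then be run without a shell, and compute the process count in Python instead of with $(expr $(nproc) - 1). The function name build_cnvkit_cmd is hypothetical; the flags are the ones used in the applet.

```python
import os
import shlex
from typing import List


def build_cnvkit_cmd(bam_files: List[str], reference: str, out_dir: str) -> List[str]:
    """Return the cnvkit.py batch command as an argument list."""
    # Leave one core free, but never ask for fewer than one process.
    procs = max(1, (os.cpu_count() or 1) - 1)
    return [
        "cnvkit.py", "batch", *bam_files,
        "-r", reference,
        "-p", str(procs),
        "-d", out_dir,
        "--scatter",
    ]


if __name__ == "__main__":
    cmd = build_cnvkit_cmd(
        ["bams/HCC1187_1x_tumor_markdup.bam"], "reference.cnn", "cnvkit-out"
    )
    # shlex.join quotes any argument containing spaces or shell characters,
    # so the printed line is safe to copy into a terminal.
    print(shlex.join(cmd))
```

The max(1, ...) guard matters on a single-core instance, where subtracting one from the core count would otherwise request zero processes.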

Review

  • You used a for loop to download multiple input BAM files into a local directory.

  • You used regular expressions to classify the output files into the three output labels.

Resources

If you run into technical issues, you can create a support ticket:

  1. Go to the Help header (same section where Projects and Tools are) inside the platform

  2. Select "Contact Support"

  3. Fill in the Subject and Message to submit a support ticket.

