# Example 3: cnvkit

This example will build on the asset you created in the `bash` version. You will:

* Learn how to download an input of type `array:file`
* Use regular expressions to classify output files

## Getting Started

We'll call our new applet *python\_cnvkit*. If you want to start from `dx-app-wizard`, use the following specs for the inputs and outputs:

| Input Name  | Type         | Optional | Default Value |
| ----------- | ------------ | -------- | ------------- |
| `bam_tumor` | `array:file` | No       | NA            |
| `reference` | `file`       | No       | NA            |

The output specs are as follows:

| Output Name    | Type         |
| -------------- | ------------ |
| `cns`          | `array:file` |
| `cns_filtered` | `array:file` |
| `plot`         | `array:file` |
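
As a rough sketch, the two tables above correspond to `inputSpec` and `outputSpec` entries in *dxapp.json*. The snippet below builds them as Python dictionaries and prints the JSON. It assumes only the `name`, `class`, and `optional` fields; `dx-app-wizard` may write additional fields, such as labels and help text, that are not shown here.

```python
import json

# Input spec, mirroring the input table above.
input_spec = [
    {"name": "bam_tumor", "class": "array:file", "optional": False},
    {"name": "reference", "class": "file", "optional": False},
]

# Output spec, mirroring the output table above.
output_spec = [
    {"name": "cns", "class": "array:file"},
    {"name": "cns_filtered", "class": "array:file"},
    {"name": "plot", "class": "array:file"},
]

# Print the fragment as it might appear inside dxapp.json.
spec_json = json.dumps(
    {"inputSpec": input_spec, "outputSpec": output_spec}, indent=4
)
print(spec_json)
```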

You can also copy the `bash` applet directory and update the `runSpec` in *dxapp.json* to run a Python script and use the CNVKit asset from before:

```json
    "runSpec": {
        "timeoutPolicy": {
            "*": {
                "hours": 48
            }
        },
        "interpreter": "python3",
        "file": "src/python_cnvkit.py",
        "distribution": "Ubuntu",
        "release": "24.04",
        "version": "0",
        "assetDepends": [{"id": "record-GgP33b00BppJKpyyFxGpZJYf"}]
    }
```

Here is the *input.json*:

```json
{
    "bam_tumor": [
        {
            "$dnanexus_link": "file-GFxXjV006kZVQPb20G85VXBp"
        }
    ],
    "reference": {
        "$dnanexus_link": "file-GFxXvpj06kZfP0QVKq2p2FGF"
    }
}
```
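
If you have several BAM files, writing *input.json* by hand gets tedious. Here is a minimal sketch that generates the same structure from file IDs. The helper names `dx_link` and `make_input` are hypothetical and are not part of `dxpy`; the file IDs are the ones shown above.

```python
import json
from typing import Dict, List


def dx_link(file_id: str) -> Dict[str, str]:
    """Wrap a file ID in the link structure used by input.json."""
    return {"$dnanexus_link": file_id}


def make_input(bam_ids: List[str], reference_id: str) -> dict:
    """Build the applet input: a list of BAM links plus one reference link."""
    return {
        "bam_tumor": [dx_link(file_id) for file_id in bam_ids],
        "reference": dx_link(reference_id),
    }


applet_input = make_input(
    ["file-GFxXjV006kZVQPb20G85VXBp"], "file-GFxXvpj06kZfP0QVKq2p2FGF"
)
print(json.dumps(applet_input, indent=4))
```

Because `bam_tumor` is an `array:file`, it is always a list, even when it holds a single link.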

## Python Code

Update *src/python\_cnvkit.py* to the following:

{% code title="python\_cnvkit.py" overflow="wrap" lineNumbers="true" %}

```python
#!/usr/bin/env python

import os
import dxpy
import re
import sys
from typing import List
from subprocess import getstatusoutput


@dxpy.entry_point("main")
def main(bam_tumor, reference):
    bam_tumor = [dxpy.DXFile(item) for item in bam_tumor] # 1

    reference = dxpy.DXFile(reference) # 2
    reference_name = reference.describe().get("name", "reference.cnn")
    dxpy.download_dxfile(reference.get_id(), reference_name)

    bam_dir = "bams"
    os.makedirs(bam_dir)

    bam_files = [] # 3
    for file in bam_tumor:
        desc = file.describe()
        file_id = file.get_id()
        path = os.path.join(bam_dir, desc.get("name", file_id))
        dxpy.download_dxfile(file_id, path) # 4
        bam_files.append(path)

    out_dir = "cnvkit-out"
    cmd = (
        f"cnvkit.py batch {' '.join(bam_files)} "
        f"-r {reference_name} "
        f"-p $(expr $(nproc) - 1) "
        f"-d {out_dir} --scatter"
    )
    print(cmd)

    rv, out = getstatusoutput(cmd) # 5
    if rv != 0:
        sys.exit(out)

    out_files = [os.path.join(out_dir, file) for file in os.listdir(out_dir)] # 6
    print(f'out_files = {",".join(out_files)}')

    return {
        "cns": upload(r"\.call\.cns$", out_files), # 7
        "cns_filtered": upload(r"(?<!\.call)\.cns$", out_files),
        "plot": upload(r"-scatter\.png$", out_files),
    }


def upload(pattern: str, paths: List[str]) -> List[dict]:
    """Upload files matching a pattern and return their DX links"""

    regex = re.compile(pattern) # 8
    return [
        dxpy.dxlink(dxpy.upload_local_file(file)) # 9
        for file in filter(regex.search, paths) # 10
    ]


dxpy.run()
```

{% endcode %}

1. Use a Python [list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions) to wrap each tumor BAM link in a `dxpy.DXFile` handler.
2. Create a handler for the reference file, look up its name, and download it.
3. Initialize a list to hold the downloaded BAM paths.
4. Download each BAM file into a directory and append the path to the `bam_files` list.
5. Create, print, and run the command to execute CNVkit.
6. Find all the files created in the output directory. The [`os.listdir`](https://docs.python.org/3/library/os.html#os.listdir) function only returns the filenames, so prepend the directory name with `os.path.join`.
7. For each output category, upload the output files whose names match that category's pattern.
8. Compile the given regular expression.
9. Upload each matching file and create a DX link to it.
10. [Filter](https://docs.python.org/3/library/functions.html#filter) the given files for those matching the regex.
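
Step 5 builds the command as a single shell string, which can break if a BAM filename contains spaces or shell metacharacters. As a hedged alternative, not the approach used in the applet above, the sketch below builds the same `cnvkit.py batch` arguments as a list. The function name `build_cnvkit_command` and the example filename are hypothetical; the flags are the ones already used in the applet.

```python
import os
import shlex
from typing import List, Optional


def build_cnvkit_command(
    bam_files: List[str],
    reference: str,
    out_dir: str,
    processes: Optional[int] = None,
) -> List[str]:
    """Return the cnvkit.py batch command as an argument list."""
    if processes is None:
        # Leave one core free, but never drop below one process.
        processes = max(1, (os.cpu_count() or 1) - 1)
    return [
        "cnvkit.py", "batch", *bam_files,
        "-r", reference,
        "-p", str(processes),
        "-d", out_dir,
        "--scatter",
    ]


# Hypothetical filename containing a space, to show the quoting.
cmd = build_cnvkit_command(
    ["bams/tumor sample.bam"], "reference.cnn", "cnvkit-out", processes=3
)
# shlex.join produces a shell-safe string that is convenient for logging.
print(shlex.join(cmd))
```

A list like this could be passed to `subprocess.run(cmd, check=True)`, which runs the program without a shell and raises an exception on a non-zero exit status.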

NOTE: The regex `(?<!\.call)\.cns$` uses a [negative lookbehind](https://www.regular-expressions.info/lookaround.html) to ensure that `.call` does not immediately precede `.cns`, so files ending in `.call.cns` are excluded from that match.
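
To see how the three patterns split a set of filenames, here is a small standalone sketch that needs no DNAnexus connection. The sample filenames are hypothetical and chosen only to exercise each pattern; the `classify` helper is not part of the applet.

```python
import re
from typing import Dict, List

# Patterns from the applet's return statement, written as raw strings.
PATTERNS = {
    "cns": r"\.call\.cns$",
    "cns_filtered": r"(?<!\.call)\.cns$",
    "plot": r"-scatter\.png$",
}


def classify(paths: List[str]) -> Dict[str, List[str]]:
    """Group paths under each output label whose pattern they match."""
    return {
        label: [path for path in paths if re.search(pattern, path)]
        for label, pattern in PATTERNS.items()
    }


# Hypothetical output filenames.
sample = [
    "cnvkit-out/sample.call.cns",
    "cnvkit-out/sample.cns",
    "cnvkit-out/sample.bintest.cns",
    "cnvkit-out/sample.cnr",
    "cnvkit-out/sample-scatter.png",
]

result = classify(sample)
for label, files in result.items():
    print(label, files)
```

The `.cnr` file matches none of the patterns, so it is not uploaded under any output label.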

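Here is the output from the job. The `out_files` line shows the literal text `{",".join(out_files)}` because the `print` call in this run used a plain string rather than an f-string: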

```bash
Job Log
-------
Watching job job-GgP7Z30071x73vpBzXK1jk7X. Press Ctrl+C to stop watching.
* CNVKit (python_cnvkit:main) (running) job-GgP7Z30071x73vpBzXK1jk7X
  kyclark 2024-02-27 17:10:52 (running for 0:01:57)
2024-02-27 17:13:28 CNVKit INFO Logging initialized (priority)
2024-02-27 17:13:28 CNVKit INFO Logging initialized (bulk)
2024-02-27 17:13:34 CNVKit INFO Downloading bundled file cnvkit_asset.tar.gz
2024-02-27 17:14:02 CNVKit STDOUT >>> Unpacking cnvkit_asset.tar.gz to /
2024-02-27 17:14:02 CNVKit STDERR tar: Removing leading `/' from member names
2024-02-27 17:15:36 CNVKit INFO Setting SSH public key
2024-02-27 17:15:39 CNVKit STDOUT dxpy/0.369.0
(Linux-5.15.0-1053-aws-x86_64-with-glibc2.29) Python/3.8.10
2024-02-27 17:15:40 CNVKit STDOUT Invoking main with {'bam_tumor':
[{'$dnanexus_link': 'file-GFxXjV006kZVQPb20G85VXBp'}], 'reference':
{'$dnanexus_link': 'file-GFxXvpj06kZfP0QVKq2p2FGF'}}
2024-02-27 17:16:16 CNVKit STDOUT Running "cnvkit.py batch
bams/HCC1187_1x_tumor_markdup.bam -r reference.cnn -p $(expr $(nproc) - 1) -d
cnvkit-out --scatter"
2024-02-27 17:19:57 CNVKit STDOUT out_files = {",".join(out_files)}
* CNVKit (python_cnvkit:main) (done) job-GgP7Z30071x73vpBzXK1jk7X
  kyclark 2024-02-27 17:10:52 (runtime 0:07:54)
  Output: cns = [ file-GgP7jF80K7VPVpkkkzyqBK2Q ]
          cns_filtered = [ file-GgP7jF80K7V7q1jJVPYJj0pg, 
                           file-GgP7jFQ0K7VFfb7BJ3YbYy60 ]
          plot = [ file-GgP7jFQ0K7V115GPfGYB2j6b ]
```

## Review

* You used a `for` loop to download multiple input BAM files into a local directory.
* You used regular expressions to classify the output files into the three output labels.

## Resources

[Full Documentation](https://documentation.dnanexus.com/)

If you run into technical issues, create a support ticket:

1. Go to the Help header (same section where Projects and Tools are) inside the platform.
2. Select "Contact Support".
3. Fill in the Subject and Message to submit a support ticket.


