Welcome to DNAnexus Academy's online guidebook! This resource is designed for educational purposes to provide you with a foundational understanding of how to utilize DNAnexus for performing analyses. Please note that this guide does not aim to instruct you on every aspect of using the platform, nor does it suggest that this is the only method for leveraging DNAnexus solutions. Instead, it serves as an instructional tool with examples designed to help you begin your journey.
Included in this documentation are guides to assist with your projects, including videos and content covering the terms and concepts that we think are important for your understanding. There are also walk-through examples to get you comfortable on the platform.
As-Is Software Disclaimer: The content in this repository is delivered "As-Is". Notwithstanding anything to the contrary, DNAnexus will have no warranty, support, liability or other obligations with respect to materials provided hereunder.
If you are new to the DNAnexus platform and computational biology/ bioinformatics, these sections are recommended for you:
Background Information
General Information
Cloud Computing for Scientists
Overview of the Platform
For Titan Users
For Apollo Users
Welcome to DNAnexus!
Before you go through the information here, there is some background information that we think will be useful for you to have.
Some of the users of the platform have limited coding experience. As bioinformaticians and computational biologists, we are members of a community that wants to help alleviate that stress. On this page, we have attached some helpful links and tutorials that will hopefully make the world of computational biology a bit less intimidating. This is not a partnership or affiliation, but rather a list of what we found useful when we were learning ourselves.
Additionally, users may need resources on the different types of sequencing and their impacts, and we have gathered some here for the ever-evolving field of genetics/genomics. Again, these do not endorse any particular company, lab, or resource, but instead serve as a general guide to help fill in the gaps.
Please note, the data present on this page is synthetic data and is intended for training purposes only. Information about the data present in this documentation is listed here.
When germline variant data is present in your data ingestion for the cohort, the Germline Variants tab will appear in the Cohort Browser. The goal of viewing data within the Germline Variants tab is to view germline mutations in genes or genomic regions of interest.
To filter with phenotypic data, you can filter from the tiles that you added in the “Overview” tab, or through the “+ Add Filter” button in the Cohort Banner. These filters allow for assessing the impact of phenotypic/ clinical data and the creation of cohorts.
1. In the Cohort section, select the "+ Add Filter" button.
2. Search or select your characteristic. Ex: Diagnoses, Tumor Details > Tumor Disease Anatomic Site
If you are a Titan user, these sections are recommended for you:
Any background information that could be necessary is listed in the For HPC or For Scientists pages to get you started there as well.
When choosing an instance type, each question below maps to a part of the instance type name (for example, mem2_ssd1_v2_x16):
Does the software utilize multiple cores? The core count is the x suffix (x16 in mem2_ssd1_v2_x16).
Is the software GPU optimized? GPU instances include gpu in the name (for example, mem2_ssd1_gpu_x32).
How much memory does the software use (per core)? The memory class is the mem prefix (mem2 in mem2_ssd1_v2_x16).
How much disk space is needed for the software (per core)? The storage class is the ssd designation (ssd1 in mem2_ssd1_v2_x16).
Always use version 2 of an instance type! (the v2 in mem2_ssd1_v2_x16)
Each class (like mem1) is scaled so that each core in an instance has access to the same amount of memory/disk space:
Example: mem1_ssd1_v2_x2: 4 GB total memory / 2 cores = 2 GB per core
Example: mem1_ssd1_v2_x8: 16 GB total memory / 8 cores = 2 GB per core
Scale your usage/instance type according to usage statistics and dataset size:
If the job doesn't utilize all of the resources, use a smaller instance type.
If it runs out of memory, or is slow, consider using a larger instance type.
Each stage of a workflow is run by a different set of workers
Each stage can be customized in terms of instance type
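If an app's default does not fit your data, you can request a different instance type at run time; a minimal sketch, where app-example is a placeholder name:
# Run an app on a larger instance than its default (instance type name is only an example)
dx run app-example --instance-type mem1_ssd1_v2_x8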
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.

If you are an HPC user new to the DNAnexus platform, these sections are recommended for you:
Background Information
General Information
For HPC Users
Overview of the Platform
Command Line Interface (CLI)
JSON
For Titan Users
For Apollo Users
If you are an experienced user new to the DNAnexus platform, these sections are recommended for you:
For Titan Users
For Apollo Users
JSON
Docker
If you are an Apollo user, these sections are recommended for you:
Overview of the Platform
Billing Access and Orgs
Command Line Interface (CLI)
Cohort Browser
JupyterLab
Any background information that could be necessary is listed in the For HPC or For Scientists pages to get you started there as well.
In this section, you will build the same applet examples from bash and Python as tasks, and then graduate to building workflows by chaining tasks together.
Workflows are a set of 2 or more apps that are linked together by dependencies, meaning the output of one app/applet is the input to another app/applet. A workflow allows these apps to be run once their dependencies are met without having to submit another job (unless there is an error).
We support the following options for building workflows:
Native (GUI)
WDL
Nextflow
In order to kill a job/workflow/app/applet, you will need to terminate the job/analysis. Please use dx terminate or the Terminate option in the Monitor tab in the UI, as shown below.
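A minimal example from the command line; job-xxxx is a placeholder for the ID shown in the Monitor tab or by dx find jobs:
# Terminate a running job or analysis by its ID
dx terminate job-xxxx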
All the same examples from bash, now in Python.
Within the Germline Variants Tab, there are the following sections: the search bar for genes by gene symbol and genomic ranges, the Allele Frequency Lollipop Plot, and the Allele Table with the germline mutations that are present in the lollipop plot above. The tables and figures of the Germline Variants Tab are highlighted in the figure below:
The first figure that is shown on the tab is the lollipop plot. The x axis is the position of the mutation, and the y axis is the allele frequency. You can search for the genomic range by Gene Symbol, Genomic Range, or rsID. The lollipop plot and allele table will be updated once you search for the new genomic range.
The second figure that is shown on the tab is the Allele Table. The columns available are the location (defined by chromosome and position), rsID, Reference and Alternate nucleotide, Type of Mutation, Consequence, Cohort AF (Allele Frequency), Population AF, and GnomAD AF. You can search for the genomic range by Gene Symbol, Genomic Range, or rsID. The lollipop plot (described above) and allele table will be updated once you search for the new genomic range.
4. Make sure "Is Any of" is selected, click on empty field
5. Select details for the characteristic. Ex: selecting Ovary
6. Your cohort panel will then look like this:
Repeat steps as necessary to filter as needed to create your cohort
In this example, we are going to create 2 different filters: one where the tumor disease anatomic site is the ovary, and another where the site is the breast.
If in the cohort filter we select the tumor disease anatomic site “is ovary” AND tumor disease anatomic site “is breast”, then we have zero patients.
This is seen in the figure below:
Instead, we would need to change this to “OR” by pressing the “AND With” portion of the filter.
Now, we have a filter that has the tumor disease anatomic site as the ovary or the breast.


Please note, the data present in this page is intended for training purposes only. Information about the data present in this documentation is listed here.
The Overview tab is dedicated to the phenotypic data that has been ingested in your dataset. The phenotypic data can be displayed using tiles, and these tiles will have different tables or figures based on their data type.
Open Cohort Browser
Select "+ Add Tile" on the top right corner
Find the characteristic you want as a tile and select "Add Tile"
Repeat until you have added the number of tiles that you want (up to 15)
Used for more advanced comparisons
Add comparisons by selecting the first filter, then selecting the "+" sign for a secondary field
Then, edit the data field details
Here is the overview of the 2D plots that are available based on data types:
Open Cohort Browser
Select Add tile on the top right corner
Find the characteristic you want to start with and select it, such as biological sex. This is the same step as adding a regular tile, but you will NOT select Add Tile.
Instead, add a secondary field by selecting the "+" sign next to the second characteristic you want to view.
You will then have options to change the graph with those parameters.
Then, select the add tile button on the bottom right below the new graph. This will add it to the cohort browser.
Limited to 15 tiles overall in dashboard
Limited to 30 columns in Data Preview
Add 1-2 tiles at a time, wait for them to refresh before adding more tiles.
Billing occurs monthly based on your use of the platform. These invoices are received at the end of the month.
The relationship of DNAnexus and billing are highlighted here:
Regions and Pricing can be referred to as the "Rate Card"
These are negotiated at the time of signing
This is the area of expertise of the DNAnexus Sales Account Director. For further details, please refer to them.
For everyone else, the rate card can be useful for deciding which instances you choose to run on the platform.
Job Errors happen
Some of which are charged to you
Some of which are not
Error details are found in our
Orgs can be used to consolidate and simplify billing.
An org can be associated with a billing account. This allows all users of the org to bill projects and apps to the org billing account.
Billing a project to an org is useful when, for example, users within a group or a particular lab are working with a shared budget, and each member needs the ability to work independently within their own project.
By associating a billing account with an org, this allows groups with a shared budget to consolidate all platform activities onto one invoice.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Your Computer: When we utilize cloud resources, we as users request them from our own computer using commands from the dx toolkit.
DNAnexus platform: The platform has many working pieces, but we can treat it as one entity here. Our request gets sent to the platform, and given availability, it will grant access to a temporary DNAnexus Worker.
DNAnexus Worker: This temporary worker is the third key player and is where we do our computation. We'll see that it starts out as a blank slate.
A project contains the files, executables, and logs associated with analyses, securely stored on the platform.
The executables on the platform are referred to as apps. Apps are executables that can be run on the DNAnexus platform. Most importantly, they need to contain a software environment to run the executable.
A software environment in general is everything needed to run software on a brand new computer. This includes the software itself that you are needing as well as any dependencies that are needed to run the software. Some examples of dependencies are languages (such as R) that are needed to execute the software.
Project storage is permanent, but the workers are temporary. This means that you have to relay information back and forth as shown in the figure below.
The key concept with cloud computing: project storage can be considered as permanent on the platform. Note that workers are temporary. Because workers are temporary, we need to transfer the files we want to process to them. When we are done, we need to transfer any output files back to the project storage. If we don't do this, the files will be lost when we lose access to the worker.
On your local computer, everything is on your machine.
This includes your data and scripts, as well as your software environment and dependencies, which are also downloaded.
The results and intermediate steps are also generated and saved on your machine.
You own it and you control it.
In comparison, cloud computing adds layers into analysis to increase computational power and storage.
This relationship and the layers involved are in the figure below:
Let's contrast this with processing a file on the DNAnexus platform.
The first difference is that we need to request a worker and we only have temporary access to it. We need to bring everything to the worker, including the software environment.
The second key difference is that we need to bring our files and scripts from project storage to the worker.
Our first barrier is requesting an appropriate worker that can do our computational job.
For example, our app may require more memory, or if it is optimized for working on multiple CPUs, more CPUs.
We need to understand how big our files are and the computing requirements of our software to do this.
Our second barrier is installing the software environment on the worker, such as R.
Because we are starting from scratch on a worker, we will need ways to reproducibly install the software environment on the worker.
We'll see that this is one of the roles of Apps. As part of their job, they will install the appropriate software environment.
There is some good news. If we are running apps, they will handle both of these barriers.
Number one, all apps have a default instance type to use. We'll see that we can tailor this.
Secondly, Apps install the required software environment on their workers.
Our third barrier is getting our files onto the worker from project storage, and then doing computations with them on the worker. The last barrier we'll talk about is getting the file outputs we've generated from the worker back into the project storage.
Cloud computing has a nestedness to it and transferring files back and forth can make learning it difficult.
Having a mental model of how cloud computing works can help us overcome these barriers.
Cloud computing is indirect, and you need to think 2 steps ahead.
Here is the visual for thinking about the steps for file management:
Apps help you address installing software on the worker
Prebuilt software environment that is installed onto the temporary worker
Can build our own apps
Apps serve to (at minimum):
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
To filter with gene expression data, you can add a filter based on the tiles created in the Gene Expression tab or use the “+ Add Filter” button in the Cohort Banner.
Assessing impact of genes/ features and their expression levels
Building Cohorts based on Gene Expression Level
Gene Symbol or Ensembl ID with Expression Level
Add in your dataset
Select "+ Add Filter"
Select Assays and then under Gene Expression, select “Features/ Expression”
Select the genes that you want as well as the expression range. Please note, for the Gene/ Feature value, you can select by Gene Symbol or the ENSEMBL ID.
Please note: in order to use Cohort Browser on the Platform, an Apollo License is needed.
The cohort browser is used for browsing and visualizing data and creating cohorts. These cohorts can then be shared in a project space to your collaborators.
Projects have a series of features designed to facilitate collaboration, help project members coordinate and organize their work, and ensure appropriate control over both data and tools.
All work takes place in the context of a project. Projects allow a defined set of users and orgs to:
Access specific data
Please note, the data present is intended for training purposes only. Information about the data present in this documentation is listed here.
When somatic variant data is present in your data ingestion for the cohort, the Somatic Variants tab will appear in the Cohort Browser. The goal of viewing data within the Somatic Variants tab is to view somatic mutations present in your data, and to explore variants and events for certain genomic regions. You can also compare these values within 2 different cohorts, as long as they have the same underlying database.
In order for Nextflow to run correctly on the platform, please do the following:
Install dxpy/dx-toolkit. Details on how to do this are in the Command Line Interface section under Introduction to the CLI.
As Nextflow on DNAnexus is being updated with bugfixes and improvements on a regular basis, we recommend updating dxpy to the latest version prior to building your Nextflow applet.
You can upgrade dxpy by using the following
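pip3 install --upgrade dxpy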
To upload your files, you will need to do the following:
Create a folder with the org name for the portal. It will be org-NAME OF COMMUNITY.
Make sure all of your json files are in the folder
Make sure all of your assets/ images are in the folder.
In this section, we will build several native bash applets that will increase in complexity:
An applet that takes an input file, runs a single Unix command, and returns the result as a file.
An applet that includes a binary executable file in the resources directory.
An applet that installs the dependency cnvkit
An applet that runs samtools.
In order to kill a job/workflow/app/applet, you will need to terminate the job/analysis. Please use dx terminate or the Terminate option in the Monitor tab in the UI.







This is great, but you are limited by how much storage and computational power you have on your local machine.
This is highlighted in the figure below:
We first start out by using the dx run command, requesting to run an app on a file in project storage. This request is then sent to the platform, and an appropriate worker from the pool of workers is made available.
When the worker is available, we can transfer a file from the project to the worker.
The platform handles installing the app and its software environment to the worker as well.
Once our app is ready and our file is set, we can run the computation on the worker.
Any files that we generate must be transferred back into project storage.
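A minimal sketch of this flow from the command line; app-example, its input_file field, and the IDs are placeholders:
# Request a worker and run an app on a file stored in project storage
dx run app-example -iinput_file=file-xxxx
# Follow the job log while the worker runs
dx watch job-xxxx
When the job finishes, the output files declared by the app are placed back into project storage.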
Request a worker (Challenge 1)
Configure the worker's environment (Challenge 2)
Establish data transfer (Challenge 3)
Running apps are covered throughout the rest of the documentation.










Files have to be .json, .png, or .jpg.
Then,
Ensure that you have md5 and jq downloaded
Ensure that you have the manage_community_assets.sh script (this is already provided to you when you have a license for the portal)
Finally,
Run one of the following lines of code
To upload or update the portal assets:
To delete the portal assets:
Remember to clear your browser cache after updating the portal assets.
Please email [email protected] to create a support ticket if there are technical issues.
bash manage_community_assets.sh path/to/org-org_name 2
bash manage_community_assets.sh path/to/org-org_name 1
Here is an image of what a rate card looks like, and what each of the sections means. The details of the rate card are subject to change.
If you cannot access the rate card or are not an org admin, please see Appendix A of your order form.
When a user makes a project billable to an account, the user assigns responsibility for the charges from that project to the account.
The org admins, in this case admins A and D, have the ability to oversee and discover all projects that are billed to the org, and to revoke permissions to a project billed to the org.


Apps and Applets
Workflows
Jobs
Analyses
Records
Each object receives its own unique ID
These can be file IDs or job IDs
These NEVER change
The same file can be uploaded multiple times into a project; different objects will be created. The platform DOES NOT overwrite a file. Instead, it creates a new file ID every time you upload it.
Metadata is essential to keep track of these files and their properties, since we cannot change the file ID.
Data objects can have 2 different items of custom metadata that can be added at any point (see the example after this list). They are:
Tags: words that describe the file format, genome, etc.
Examples: fastq, control, bam, vcf
Properties: key/value pairs that can be used to describe the file
Example: key = sample_id, value = 001
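From the command line, tags and properties can also be added with the dx tag and dx set_properties commands; a minimal sketch, where file-xxxx is a placeholder for a file ID in your project:
# Add a tag to a data object
dx tag file-xxxx fastq
# Add a key/value property to a data object
dx set_properties file-xxxx sample_id=001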
Go to your project folder and find the file information. Identify the columns for the name, type/ class, and tags
You can do this in the overview of each, without selecting the file:
Or, you can do this by selecting each file individually to see a detailed view:
You can sort and organize data based on pre-established and custom metadata by selecting the “column” icon in the top right. Columns can also be sorted by hovering over the title.
You can filter by the metadata present in the project space. The options are drop-down menus above the overview of the metadata headings.
When viewing details of a particular data object, you will have a section for the data operations of a file. These include archive, copy, delete, and download. These operations will vary based on the access permission that you have for a given project. You can see the data operations available in the image below:
Browse, explore, and analyze this data
Once you have access to the platform and an org that allows for billable activity, you can start working by creating a new project in the UI.
Navigate to the Projects list page, by selecting Projects in the UI from the main menu, then clicking All Projects.
Click the New Project button (highlighted in gold).
The New Project wizard will open in a modal window.
In the Project Name field (highlighted in light blue), enter a name for your project.
In the More Info section (highlighted in gold), add in the optional fields for sorting projects, such as
Tags
Properties
Project Summary
Project Description.
In the Billing Section (highlighted in navy), select the billed to org and Region.
In the Billed To field, choose an account to which project billable activities should be charged.
In the Region field, select your region if it is not already selected.
In the Usage Limits section (highlighted in chartreuse), select the optional compute usage limit and the egress limit. Please note, if you do not have this option and would like to, please contact our sales team at [email protected] or a member of our Success Team.
Compute Usage Limits are the monthly compute usage limit for a given project. This value is in USD ($).
Egress Usage Limits are the monthly egress limits for a given project. This value is in bytes.
In the Access section (highlighted in black), specify which users will be able to conduct data-related operations within the project.
Copy Access will limit who can copy data into other projects, or who can use the data as inputs in other projects. The options are All Members or No One.
Delete Access will limit who can delete the project. The options are Contributors and Admins or Admins Only.
Download Access will limit who can download data from the project. The options are All Members or No One.
Apps in a workflow will always begin executing as soon as their inputs are satisfied, and where possible they will run independently.
Workflows can be created by clicking on the Add button and selecting the New Workflow.
This is what it will look like once you select "New Workflow"
Add the apps that you want for the workflow and order them where the dependencies are generated first
After that, add in the necessary requirements. They are featured below:
Select Start Analysis
You will have a "pre-flight" check to make sure everything that is needed is there. Once that is complete, select Start Analysis again and it will start to run.
You will be redirected once you have started the analysis.
The monitor has panels to show what is running, how long it took to complete, and the order they were done in.
You can view the information in order to see the details of the workflow.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Within the Somatic Variants Tab, there are the following sections: the Variant Frequency Matrix for the Cohort, a gene Lollipop Plot with search bar by Gene Symbol or Genomic Range, and the Variants and Events Table with the somatic mutations that are present in the lollipop plot above. The tables and figures of the Somatic Variants Tab are highlighted in the figure below:
A variant frequency matrix has the following features:
Genes are sorted (rows of the plot) in descending order of percent of affected samples.
Samples are sorted (columns of the plot) by the greatest number of mutated genes across all genes, independent of top mutated genes, in descending order.
Each Variant Frequency Matrix has a color scheme by consequence.
These features will also work while comparing cohorts.
You can also hover over the patient tiles individually for more information.
There are several options to view these Somatic Variant Frequency Matrices. You can see an overview of all of the somatic mutations, or a particular mutation type, such as Single Nucleotide Variants and Insertions/ Deletions (SNV and Indel), Structural Variants (SV), Copy Number Variants (CNV), and Fusions.
The first figure gives an overview of the top genes that are mutated in “All” categories, as shown below:
You can select the individual Variant Frequency Matrices in the drop down menu next to the heading “Variant Frequency Matrix”.
The options of the matrices are shown in the figure below. The options are SNV and Indel (top left), SV (top right), CNV (bottom left), and Fusions (bottom right).
A Lollipop Plot has the following features:
Only one gene / canonical protein can be viewed at a time
Each lollipop will be color coded by consequence
You are able to navigate to a particular Gene Symbol or Genomic range utilizing the search bar
You can select (click) a single amino acid change (one lollipop) to quickly filter the somatic variants table
Features also work while comparing cohorts
You can also hover over the patient tiles individually for more information
This is a tabular version of the data that you see in the Lollipop plot. You can quickly filter this data while using the lollipop plot (described above) or by filtering on any of the column headers in the table.
The version of dxpy that you use controls the version of the DNAnexus Nextflow executor and thus the version of Nextflow that is used for executing your pipeline.
Nextflow and dxpy versions
Most nf-core pipelines require Nextflow versions starting with '23', and you will need to use a recent version of dxpy.
For example, a Nextflow applet built using dxpy v0.370.2 will have Nextflow version 23.10.0 bundled with it, and it will use this version of Nextflow and v0.370.2 of dxpy for executing the Nextflow pipeline on the platform.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Exploratory Data Analysis
Gene Symbol or Genomic Range
Variant effect
Variant type
HGVS Notation
Variant IDs
Add in your dataset
Select "+ Add Filter"
Select Assays and then under Variant (Somatic), select “Genes/ Effects”
Select the genes/ impact/ variants that you want. Please note that the Genes/ Genomic Ranges will accept only Gene Symbols or genomic ranges.
It sets up the environment every time you utilize the snapshot. You do not need to manage dependencies every time you open a JupyterLab job if you utilize a snapshot.
Snapshots are saved in the .Notebook_Snapshots/ folder in the project space, and they have a .tar.gz file ending.
Snapshots are used in the input section when setting up the JupyterLab Job.
The input is highlighted in the figure below:
Don't save data in your snapshot - it uses storage space and impacts costs.
Snapshots can be large and take up storage space.
Make sure to rename the snapshot according to your organization's naming conventions so that you can remember what it refers to when returning to the project in the future.
There is both worker related/ JupyterLab storage, as well as what is present in the Project storage. This is annotated in the figure below:
When you are running code blocks, remember that in JupyterLab you can run them out of order. This means that you need to pay attention to the numbers on the side of the code blocks for the order. This is highlighted in gold below:
If you choose to write in Python or R primarily, you can use the following at the top of your code block to "switch" to bash scripting. Example below
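A minimal sketch, assuming the Python (IPython) kernel, where the %%bash cell magic runs the contents of the cell as a bash script:
%%bash
# everything in this cell now runs as bash, for example:
dx ls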








Please note, the data present is intended for training purposes only. Information about the data present in this documentation is listed here.
When gene expression data is present in your data ingestion for the cohort, the Gene Expression tab will appear in the Cohort Browser. The goal of viewing data within the Gene Expression tab is to view gene expression values in your cohort, and to compare between 2 cohorts within the same database.
Within the Gene Expression Tab, there are the following sections: plots for Gene Expression, where you can search for genes by Gene Symbol or Ensembl ID, and an Expression per Feature table. The tables and figures of the Gene Expression Tab are highlighted in the figure below:
To view gene expression for a specific gene, type the gene symbol or Ensembl ID into the search bar for the charts labelled “Expression Level”. There are 3 options for the plots: Expression Level with a box plot, Expression Level with a histogram, and a Feature Correlation scatter plot between 2 genes. More than 3 tiles can be added with the “Add Tile” button, and typing in the Gene Symbol or Ensembl ID.
For the box plot, you can see the distribution of the expression level for a given gene by typing in the gene symbol or Ensembl ID to the search bar. You can view the detailed distribution as a violin plot, or as a box plot. The x axis is the distribution of gene expression levels in the cohort and the y axis is the Expression Level. The options to view the detailed distribution are part of the Chart Settings. The Bar Chart with the Violin Plot (detailed distribution) is shown below:
For the histogram, you can see the distribution of the expression level for a given gene by typing in the gene symbol or Ensembl ID to the search bar. You can see the histogram with or without the display statistics. The x axis is the distribution of gene expression levels in the cohort and the y axis is the Expression Level. The options to view the detailed distribution are part of the Chart Settings. The histogram with the display statistics settings are shown below:
For the feature correlation, you can see the expression level for a given gene for the x and y axis by typing in the gene symbol or Ensembl ID to the search bar. You can see the feature correlation with or without the display statistics. The x axis is the gene expression level for one gene and the y axis is the gene expression level for another gene. The options to view the detailed distribution are part of the Chart Settings. The feature correlation with the display statistics settings are shown below:
DNAnexus apps and applets are ways to package executable code. The biggest difference between apps and applets is their visibility. Apps such as you find in the Tool Library are globally available and maintained by DNAnexus and partners like Nvidia and Sentieon. Applets are private to an organization and exist as data objects in a project. They can be shared across projects and promoted to generally available apps. Native DNAnexus applets are built using dx build to create an executable for bash or Python code, which in turn may execute any program installed on the instance.
Later, we will discuss how to build a workflow, which is a combination of two or more apps/applets. We will build native workflows using the GUI, as well as workflows written in languages like WDL (Workflow Description Language) and Nextflow combined with Docker images.
As shown in following figure, the development cycle is to write code locally, use dx build to create a native applet on the platform, and then dx run to run the applet. You can view the execution logs with dx watch, then make changes to your code to build and run again.
To install the Python modules required for this tutorial, run the following command:
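python3 -m pip install dxpy miniwdl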
You may be prompted to expand PATH with installation directory such as ~/.local/bin:
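PATH=~/.local/bin:$PATH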
Next, ensure you have a recent version of Java. For this tutorial, I'm using the following:
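$ javac -version
javac 18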
If you want to use Cromwell to execute WDL locally, you should download the Cromwell JAR file. This tutorial assumes you will place this file in your home directory using the following commands.
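cd ~
wget https://github.com/broadinstitute/cromwell/releases/download/84/cromwell-84.jar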
I suggest you use the link command (ln) to create a symlink to the filename cromwell.jar so that upgrading in the future will not break your commands:
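ln -s cromwell-84.jar cromwell.jar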
Womtool (Workflow Object Model) is also quite useful, and I suggest you similarly download it and link it to womtool.jar:
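cd ~
wget https://github.com/broadinstitute/cromwell/releases/download/84/womtool-84.jar
ln -s womtool-84.jar womtool.jar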
You will use the DNAnexus dxCompiler to build WDL applications on the platform. Find a link to the latest JAR file under the releases of the dxCompiler GitHub repository. For example, the following commands will download dxCompiler-2.10.5.jar to your home directory and symlink it to dxCompiler.jar:
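cd ~
wget https://github.com/dnanexus/dxCompiler/releases/download/2.10.5/dxCompiler-2.10.5.jar
ln -s dxCompiler-2.10.5.jar dxCompiler.jar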
Some tools may attempt to use the ShellCheck tool to validate any shell code in your WDL. To install it on Ubuntu, run the following:
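sudo apt install shellcheck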
On macOS, you can use Homebrew to install the program:
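brew install shellcheck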
If the dxpy module installed properly, you should be able to run dx on the command line. For instance, run dx all to see the full list of dx subcommands (dx will reject the invalid "all" argument and list the valid choices):
To get started, do the following:
Run dx login to identify yourself to the DNAnexus platform. Enter your username and password. You can also set up a token to log in. Information on setting up tokens can be found in the section of our Documentation.
You may also be prompted to select a project. If not, you should use dx select to select a project that will contain your work.
If you do not see a project you wish to use for your work, run dx new project to create one from the command line, or click "New Project" in the web interface.
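A minimal sketch of creating and selecting a project from the command line (the project name is just an example):
# Create a new project and make it the current project
dx new project "My Research Project" --select
Finally, run dx ssh_config to set up SSH keys for connecting to cloud instances.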
Note that each subcommand will respond to the flags -h|--help to display the usage documentation. For instance, dx new can create several object types, which you can discover by reading the documentation:
You should now be prepared to develop DNAnexus apps and workflows.
Please email to create a support ticket if there are technical issues.
To filter with gene expression data, you can add a filter based on the tiles created in the Gene Expression tab or use the “+ Add Filter” button in the Cohort Banner.
Creating a more complex cohort with Phenotypic and genomic filtering
Phenotype/ Clinical data
Germline Variants
Somatic Variants
Gene Expression Changes
Add in your dataset
Select "Add Filter"
Choose the filtering that you are interested in. Details for Phenotype Filtering, Germline Filtering, Somatic Filtering, and Gene Expression are available in previous sections of the documentation. (These will have links).
Once the initial filter is complete, select “Add Additional Criteria” next to the filter, as shown below:
Repeat the process for the next cohort filter that you need.
You can start a TTYD job the same way as you would any other job in the UI.
Select Start Analysis in the top right corner in the project space.
Select the app called “ttyd”.
Select Next and then Start Analysis.
As the last step before launching the tool, you can review and confirm various runtime settings. Click on Launch Analysis. The job will be launched and you will be redirected to the Monitor tab in a few seconds.
In the Monitor tab, select the name of the ttyd job to view more details.
Once the state of the job switches to “Running”, you will be able to enter the ttyd with the “Open Worker URL” link in the top heading of the details page. If the page to which you get redirected says “502 Bad Gateway”, the worker is not yet fully initialized. Close the page, give it a few more minutes and try to open the worker URL again.
This will open a terminal in your browser that will give you access to the files in the DNAnexus project in which the app is running by mounting it in a read-only mode in the /mnt/project directory of the worker execution environment.
Once you are done with your work in ttyd, don’t forget to terminate the job by clicking the red button Terminate in the job’s details page.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select “Contact Support”
Fill in the Subject and Message to submit a support ticket.
TTYD vs. Cloud Workstation
Purpose: TTYD provides terminal access in your web browser; the Cloud Workstation sets up a virtual workstation that lets you access and work with data stored on the DNAnexus Platform.
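Time Limits: a TTYD job has to be manually terminated and does not have an input for a time limit; the Cloud Workstation takes a time limit as an input.
Snapshots: TTYD has none; the Cloud Workstation can save snapshots.
SSH: TTYD does not need SSH access; the Cloud Workstation does need SSH access.
Common Uses: TTYD is used for CLI operations and to launch https apps within the web browser; the Cloud Workstation is used for analysis of platform data and testing applets, since the environment is what is opened when launching an app or applet.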
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select “Contact Support”
Fill in the Subject and Message to submit a support ticket.
To filter with germline data, use the “+ Add Filter” button in the Cohort Banner.
Assessing impact of ingested variants in cohorts
Note: only non-ref variants are represented in the genomic data
Building Cohorts based on Variants
Develop basket studies based on your population
Exploratory Data Analysis before GWAS
Ask questions about co-occurrence with other mutations
Gene Symbol or region
Variant effect
Variant type
Variant ID
Add in your dataset
Select "+ Add Filter"
Select Assays and then under Genome Sequencing, select “Genes/ Effects”
Then, select the genes/ impact/ variants that you want. Please note, the filtering for the Genes/ Genomic regions is by Gene Symbol or Genomic range.
Disclaimer: Portals require a license. These documents are to get you started with your portals. By no means is this the only way to make your portal, nor is this the only way to edit a json file.
Each section of a portal has a different json file.
Here is a visual of which json file edits which section of a portal:
This section defines the following:
navigation/ header bar
items that are in the header after the logo that are also not included in the branding.json
You can also add/ delete navigation items
This section defines the following:
logo
colors
if you want a login page
a home URL attached to the logo
This controls the home page for the community portal
You can specify the following:
order of the sections
components
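The components can include:
descriptions
text
tables
images
reference material/links (shown above)
links to DNAnexus projects (shown above)
featured tools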
Please email to create a support ticket if there are technical issues.
JavaScript Object Notation
Common format for communicating with Application Program Interface (API)
Used to access DNAnexus API servers
Reading and modifying JSON is at the heart of building and running apps
Understanding JSON responses from the API will help you debug jobs
Automation and Batch submissions: running the same app on multiple files
Find which jobs have failed and why
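Run the failed jobs again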
A valid JSON document is enclosed in one of two data structures, either a list of values contained in square brackets:
Or an object composed of key/value pairs contained in curly brackets:
Example:
A JSON value may be any of the following:
double-quoted string, e.g., "samtools" or "file-G4x7GX80VBzQy64k4jzgjqgY"
integer, e.g. 19 or -4
float, e.g., 3.14 or 6.67384e-11
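boolean, e.g., true or false
null
object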
Lists are braced in square brackets [ ]
Similar to Python syntax
Used for multiple values separated by commas
Example:
An object starts and ends with curly braces
An object contains key/value pairs
Keys must be quoted strings
Values may be any JSON value, including another object
Example:
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
You also do not need to define an executor like you might for some other cloud Nextflow setups. By default, the executor is 'local'. However, if you are, for instance, going to be running Nextflow in multiple locations and want different settings based on location, you could set a DNAnexus profile in your nextflow.config which explicitly defines the executor and things like the default queueSize.
Here is an example DNAnexus executor profile which also enables docker.
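profiles {
    dnanexus {
        executor {
            name = 'local'
            queueSize = 50
        }
        docker {
            enabled = true
        }
    }
    cluster {
        executor {
            name = 'sge'
            memory = '20GB'
        }
    }
}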
When running on DNAnexus, you would then give '-profile dnanexus' to 'nextflow_run_opts' in the UI; in the CLI it would be -inextflow_run_opts='-profile dnanexus'.
You could also create a test profile for testing on your own servers/cloud workstation and on DNAnexus.
If the pipeline contains inputs from external sources (such as S3, FTP, HTTPS), those files are staged on the head node and may run out of storage space (input sources from DNAnexus are not staged in this way).
The instance size of the head node can be customized: in "Applet Settings" in the UI, or with the --instance-type flag on the CLI.
20 sessions can be cached per project
The number of times any of those sessions can be resumed is unlimited
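Sessions can be deleted to allow more, or development/running can be migrated to another project, which will have its own 20-session limit.
Private S3 buckets can be referenced by adding an AWS scope to your configs: https://www.nextflow.io/docs/latest/amazons3.html?#aws-access-and-secret-keys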
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Some of the links on these pages will take the user to pages that are maintained by third parties. The accuracy and IP rights of the information on these third-party pages are the responsibility of those third parties.
You can collaborate on the DNAnexus Platform by giving project access to other users. Project access can be revoked at any time by a project administrator.
Once you've created a project, you can add members by doing the following:
From the project's Manage screen, click the Share Project button - the "two people" icon - in the top right corner of the project page.
This is a walk-through of how to add an existing Docker image to the platform and save it as a snapshot file on the platform.
To get started with this, you will either need to 1) open a ttyd or 2) have Docker installed and use your local terminal, with the dx-toolkit installed as well.
Disclaimer: Portals require a license. These documents are to get you started with your portals. By no means is this the only way to make your portal, nor is this the only way to edit a json file.
This .json file personalizes the banner that you use to navigate to different sections.
If you also have access to the ML JupyterLab (another solution in the AI/ML Accelerator Package), Data Profiler can be seamlessly opened in the JupyterLab environment, offering an intuitive and interactive tool for profiling multiple datasets directly within one workspace.
To get started, simply open an ML JupyterLab notebook, load the dataset, and profile it.
The integrated version of Data Profiler in ML JupyterLab (dxprofiler) offers four methods for loading your datasets to profile the data:






















To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Some of the links on these pages will take the user to pages that are maintained by third parties. The accuracy and IP rights of the information on these third-party pages are the responsibility of those third parties.

To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.

Loading the dataset by a list of .csv or .parquet files.
Loading the dataset by Pandas dataframes ('patient_df' and 'clinical_df')
Loading the dataset by a record object (DNAnexus Dataset or Cohort). "project-xxxx:record-yyyy" is the ID of your Apollo Dataset (or Cohort) on the DNAnexus platform.
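import dxprofiler

# Load from an explicit list of .csv or .parquet files
dataset = dxprofiler.profile_files(path_to_csv_or_parquet=['/path/to/table1.csv', '/path/to/table2.csv'], data_dictionary=None)

# Load from a directory containing the tables
dataset = dxprofiler.profile_files(path_to_csv_or_parquet='/path/to/tables/', data_dictionary=None)

# Load from Pandas dataframes
dataset = dxprofiler.profile_dfs(dataframes={'patient_df': patient, 'clinical_df': clinical}, data_dictionary=None)

# Load from a DNAnexus Dataset or Cohort record
dataset = dxprofiler.profile_cohort_record(record_id="project-xxxx:record-yyyy")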
Once you finish profiling the dataset, here is the command to open the Data Profiler GUI:
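dataset.visualize()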
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
$ dx all
usage: dx [-h] [--version] command ...
DNAnexus Command-Line Client, API v1.0.0, client v0.320.0
dx is a command-line client for interacting with the DNAnexus platform. You
can log in, navigate, upload, organize and share your data, launch analyses,
and more. For a quick tour of what the tool can do, see
https://documentation.dnanexus.com/getting-started/tutorials/cli-quickstart#quickstart-for-cli
For a breakdown of dx commands by category, run "dx help".
dx exits with exit code 3 if invalid input is provided or an invalid operation
is requested, and exit code 1 if an internal error is encountered. The latter
usually indicate bugs in dx; please report them at
https://github.com/dnanexus/dx-toolkit/issues
optional arguments:
-h, --help show this help message and exit
--env-help Display help message for overriding environment
variables
--version show program's version number and exit
dx: error: argument command: invalid choice: all
(choose from login, logout, exit, whoami, env, setenv, clearenv, invite,
uninvite, ls, tree, pwd, select, cd, cp, mv, mkdir, rmdir, rm, describe,
upload, download, make_download_url, cat, head, build, build_asset, add, list,
remove, update, install, uninstall, run, watch, ssh_config, ssh, terminate,
rmproject, new, get_details, set_details, set_visibility, add_types,
remove_types, tag, untag, rename, set_properties, unset_properties, close,
wait, get, find, api, upgrade, generate_batch_inputs,
publish, archive, unarchive, help)
$ dx new -h
usage: dx new [-h] class ...
Use this command with one of the available subcommands (classes) to create a
new project or data object from scratch. Not all data types are supported. See
'dx upload' for files and 'dx build' for applets.
positional arguments:
class
user Create a new user account
org Create new non-billable org
project Create a new project
record Create a new record
workflow Create a new workflow
optional arguments:
-h, --help show this help message and exit
[
{
"project": "project-Gg2QQx002Q7yY4kFQF7GKYPV",
"id": "applet-G1951vj0YyjJjbvGJ9FZB967",
"describe": {
"id": "applet-G1951vj0YyjJjbvGJ9FZB967",
"project": "project-Gg2QQx002Q7yY4kFQF7GKYPV"
}
},
{
"project": "project-Gg2QQx002Q7yY4kFQF7GKYPV",
"id": "file-GGy7Pbj0Xf47XZk125k22g9v",
"describe": {
"id": "file-GGy7Pbj0Xf47XZk125k22g9v",
"project": "project-Gg2QQx002Q7yY4kFQF7GKYPV"
}
}
]
{
"report_html": {
"dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY"
},
"stats_txt": {
"dnanexus_link": "file-G4x7GXQ0VBzZxFxz4fqV120B"
}
}
{
"dnanexus-link": [
"file-G4x7GXQ0VBzZxFxz4fqV120B", "file-G4x7GX80VBzQy64k4jzgjqgY"
]
}
{
"report_html": {
"dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY"
}
}
Type the username or the email address of an existing Platform user, or the ID of an org whose members you want to add to the project.
In the Access pulldown, choose the type of access the user or org will have to the project.
If you don't want the user to receive an email notification on being added to the project, click the Email Notification to "Off."
Click the Add User button.
Repeat Steps 2-5, for each user you want to add to the project.
Click Done when you're finished adding members.
To remove a user or org from a project to which you have ADMINISTER access:
1. On the project's Manage screen, click the Share Project button - the "two people" icon - in the top right corner of the page. A modal window will open, showing a list of project members.
2. Find the row showing the user you want to remove from the project.
3. Move your mouse over that row, then click the Remove from Members button at the right end of the row.
VIEW: Allows users to browse and visualize data stored in the project, download data to a local computer, and copy data to other projects.
UPLOAD: Gives users VIEW access, plus the ability to create new folders and data objects, modify the metadata of open data objects, and close data objects.
CONTRIBUTE: Gives users UPLOAD access, plus the ability to run executions directly in the project.
ADMINISTER: Gives users CONTRIBUTE access, plus the power to change project permissions and policies, including giving other users access, revoking access, transferring project ownership, and deleting the project.
Spark JupyterLab is ideal for extracting and interacting with the dataset or cohort.
Spark JupyterLab is NOT meant for downstream analysis.
Create a DX JupyterLab Notebook so that it will automatically save onto the Trusted Research Environment. You can do so by selecting these 2 different options:
Option 1 is from the Launcher:
b. Option 2 is from the DNAnexus Tab:
Start writing your JupyterLab Notebook. Select which kernel you are going to use (options will vary depending on the Image you selected in set up).
Download packages and save the software environment as a snapshot.
Download Packages
Save the Snapshot of the environment
Start writing your code.
Import Packages using import (at minimum, you will need dx data and pyspark)
b. Load the dataset with dx extract dataset
c. Initialize Spark
d. Retrieve data and cohorts that you are interested in
e. Upload Results back to Project Space
Save your DX Jupyterlab Notebook
Notebooks can also be directly opened from project storage
When you save in JupyterLab, the notebook gets uploaded to the platform as a new file. This goes back to the concept of immutability.
Old version of notebook goes into .Notebook_archive/ folder in project.
You have to pull the Docker image from the registry to the platform. For this example, the code is
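For illustration, assume a GATK image from a public registry; substitute your own image name and tag:
# Pull the image from the registry onto the machine you are working on (ttyd worker or local terminal)
docker pull broadinstitute/gatk:4.2.6.1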
That results in this view:
Notice that you will have extract and then pull complete on each of the "layers" of the image on the left hand side. This takes a few minutes depending on the size of the docker image
Now you have to save this docker image file. For this example, the code is
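Continuing the illustration above, docker save writes the pulled image to a tarball; the gatk.tar.gz name is just an example:
# Save the image to a gzipped tar archive
docker save broadinstitute/gatk:4.2.6.1 | gzip > gatk.tar.gz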
This again takes time depending on the size of your docker file.
Now you will need to upload this image back to the platform. For this example, the code is:
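A minimal sketch, assuming the dx toolkit is installed and a project is selected:
# Upload the image snapshot into the current DNAnexus project
dx upload gatk.tar.gz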
The last 2 steps have the following output:
It should then be in the project space that you have chosen. You can also check this in the GUI.
Example:
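dx run app-swiss-army-knife -iimage_file="gatk.tar.gz" -iin="data/mock.vcf" -icmd='gatk SelectVariants -V mock.vcf -O selected.snp.vcf --select-type-to-include SNP'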
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.

Make it easy to run batch jobs on multiple instances
Reproducibility - be able to run code and generate the same outputs given a set of input files
Tie all software to specific versions
Utilize Docker images with multiple bioinformatics software installed
Examples: Rocker Project, GATK4
Docker Registries
Collection of repositories that hold container images
docker pull: pulls an image from a registry down to the machine we are working on
docker commit: saves changes made in a running container as a new image (which can then be pushed back to a registry)
There are hard limits for using Docker Images.
DockerHub and other registries have a pull limit of 200 pulls/user/day
Saving a snapshot file to your project lets you scale without these limits
Especially helpful in batch processing
Use images from trusted vendors whenever possible
Examples: Official Ubuntu Image, Amazon Linux Image, Biocontainers
Avoid "kitchen sink" images - hard to manage vulnerabilities
In general: pay attention to possible vulnerabilities and whether they affect your containers
Use dockerfiles to uninstall/patch possible vulnerabilities in images
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Examples: projects in the project tab, different tools you want immediate access to in the tool section.
If you have questions about how to use a json file, please view this section
The navigation.json file for this example is blank. These are the default items in the header
This file must be at least accessible to community members.
This file is optional. It allows you to edit the feature list of projects. _projects, _tools, and supportURL are all optional.
They can be:
null, which will remove the item from the header
an array of objects, with:
text for the text of the new menu item
url as the destination
newTab for whether the link should open in a new tab
If there is another entry, it indicates that a new navigation item needs to be added.
They can be objects with a url and optional parameters; with this method, newTab makes the item a link in the navigation.
They can also be an array of objects with text, url, and newTab (which will give it a dropdown menu with listed items).
Please email [email protected] to create a support ticket if there are technical issues.











VIEW (most restrictive): view the project, move and copy data across projects.
UPLOAD: VIEW access, plus create folders and modify metadata.
CONTRIBUTE: UPLOAD access, plus run executions.
ADMINISTER: CONTRIBUTE access, plus change permissions for users, project ownership, and deletion.
is used to represent a group of users
Can be used to simplify the sharing of projects, apps, and billing
Have members and admins
control the access to
Allows access to the shared apps
This applies to the apps for which the org is an authorized user
If the org cannot use the app, the member cannot either
Allow seeing the price column in the UI monitor tab and on the command line
By default, when a project is created, the settings tab shows the following:
The owner of the project can change these
You may want to restrict them depending on your org policy
Copy access
The org allows for the sharing
of the same resources
Control the access as stated above
Org admins can remove and add users
to users performing similar functions
Sharing projects and apps within orgs allows a group of users performing similar functions to be given the same level of access to shared resources.
In this example, there is the org administrator, admin A, who provides view access to the project resources to the org. Additionally, admin A adds users B and C to the org, and also adds admin D to the org.
Admin D then provides upload permissions to the project raw data, and makes the org an authorized user of the QC app. So in addition to being a convenient way to share projects and data, the org also provides access to apps.
You can have multiple people in multiple orgs.
Example 2: Multiple Orgs
Members may be working on two separate projects and need access to different data and apps that have different budgets.
A user may need to create and work on projects that are billed to two separate teams or groups. This is where creating multiple orgs comes in handy.
Admin D is admin of both org and org-new because admin D needs to work within both of these orgs.
Admin D adds user E to both org and org-new, and only adds user F to org-new because user F only needs to work within org-new.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Before you begin, review the overview documentation and log onto the DNAnexus Platform
The tool library is a set of ready-to-use apps and workflows that are maintained by DNAnexus.
There are different categories and you can search by name of the tool.
Navigate to the Tool Library
In the Any Name search box, start entering "FASTQC...."
Click on the tool name, and you will be at the info tab of the tool.
Select the Version: If you want the same version that is loaded automatically, this is all that you will need to do. If you want a different Version, select the Versions tab and select which version you want.
You can also select "Run" to run the app
There are 2 options for running the tool. First, select "Run" where you find the tool documentation.
Then, there are 2 different UIs for setting up the app to run:
The guided set up, which is what you normally start with
Or, the I/O graph
In the Stage settings tab, you can set the version of the app you want to use, instance type and specify the output folder. By specifying the instance type, you will set the computational resources of the machine on which the analysis will be run. For example, if your input data is large, you will choose an instance type with more storage space available.
Required inputs are indicated by asterisks; the rest are optional.
It is point and click.
Can select your instance here.
can be enabled here. At this time, the feature applies to a batch of inputs. The output is aggregated in one output file. (e.g. 10 inputs results in 1 output).
Once you have selected the app you want to use and read the documentation (if applicable), you will use the guided setup to run the app in the UI.
Set the Output folder
Set the inputs. In the example of FASTQC, it is one FASTQ file
Launch the app using the start analysis button in the upper right
You will automatically be redirected to the monitor page
When the job is completed, you will have buttons to access the inputs (such as a FASTQ file) and outputs (such as an HTML file).
Here is the view when the app is completed:
Using Apps in the GUI
Batch Processing in the GUI
Monitoring An App/ Workflow
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
If you have never used a JupyterLab notebook before, please view this information:
We can interact with the platform in several different ways and install software packages in these different environments, depending on what we want to use and how we want to use it. As shown in the diagram below, we will be explaining JupyterLab Python/R/Stata and Spark JupyterLab Python/R:
Data Scientists’ tasks can be interactive. Options for interactive analysis in JupyterLab are:
Notebook-based Analysis
Exploratory Data Analysis (EDA)
Data Preprocessing/ Cleaning
Implementing New Machine Learning(ML)/ Model
The work can be done on a single machine instance
Main Use Cases:
Python/R
Image Processing
Working with very large datasets that will not fit in memory on a single instance
Using the Cohort Browser and querying a large ingested dataset
Needing to use Spark based tools such as dxdata, HAIL or GLOW
Select JupyterLab with Python, R, Stata, ML, Image Processing or JupyterLab from Spark from the Tool Library, or select “Start Analysis” from the project space and select JupyterLab from the tool list. Once selected, press “Run Selected”
Select the output location, and change the job name if desired.
Then, select the inputs you intend on using
Snapshot file (not required, and how to create a snapshot is in the Utilizing Snapshot section)
Input files (not required, can do in the notebook analysis)
Stata settings file (license required for Stata)
Then, press “Start Analysis” in the far right corner
Next, confirm the following parameters:
Job Name
Output Folder
Priority (defaults to normal, can be set to high)
Then, press “Launch Analysis”
When redirected to the monitor tab, select the job name
It will redirect you to the details of the JupyterLab job. Wait for the job to start running, and for the worker URL to appear
Press “Open Worker URL” and the JupyterLab home page will appear
Note: Sometimes, the job is still initializing, so if you press Open Worker URL immediately, it may show a 502 error message. This is okay, and the job will update when the job is finished initializing.
Running instances may take several minutes to load as the allocations become available.
Nextflow's errorStrategy directive allows you to define how the error condition is managed by the Nextflow executor at the process level.
There are 4 possible strategies:
terminate (default)
terminate all subjobs as soon as any subjob has an error
finish
when any subjob has an error, do not start any additional subjobs and wait for existing jobs to finish before exiting
ignore
pretend you didn't see it... just report a message that the subjob had an error, but continue all other subjobs
retry
when a subjob returns an error, retry (resubmit) that subjob
The DNAnexus Nextflow documentation has a
Generally the errorStrategy is defined in either the base.config (which is referenced using includeConfig in the nextflow.config file) or in the nextflow.config file.
In nf-core pipelines, the default errorStrategy is usually defined in base.config and is set to 'finish', except for error codes in a specific numeric range, which are retried.
The code below is from the
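```
// memory errors which should be retried. otherwise error out
errorStrategy = { task.exitStatus in ((130..145) + 104) ? 'retry' : 'finish' }
maxRetries = 1
maxErrors = '-1'
```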
The maxRetries directive defines the maximum number of times the exact same subjob can be re-submitted in case of failure, and the maxErrors directive specifies the maximum number of times a process (across all subjobs executed for that process) can fail when using the retry error strategy.
In the code above, if the exit status of the subjob (task) is within 130 to 145, inclusive, or is equal to 104, then that subjob will be retried once (maxRetries = 1). If other subjobs of the same process hit the same issue, they will also be retried once (maxErrors = '-1' disables the limit on how many times a process can fail, so even if every subjob of a particular process failed, each would still be retried the number of times set by maxRetries). Otherwise, the finish errorStrategy is applied: no new subjobs are started, but other running, non-errored subjobs are allowed to complete.
For example, imagine you have a fastqc process that takes in one file at a time from a channel with 3 files (file_A, file_B, file_C)
The process is as below and is run for each file in parallel
fastqc(file_A)
fastqc(file_B)
fastqc(file_C)
If the subjob with file_A and the subjob with file_C fail first with errors in range 130-145 or with a 104 error, they can each be retried once if maxRetries =1 .
Now imagine that you set maxErrors = 2. In this case, there are 3 instances of the process but only 2 errors are allowed for all instances of the process. Thus, it will only retry 2 of the subjobs e.g. fastqc(file_A) fastqc(file_C)
If fastqc(file_B) encounters an error at any point, it won't be retried and then the whole job will go to the finish errorStrategy.
Thus, disabling the maxErrors directive by setting it to '-1' allows all failing subjobs with the specified error codes to be retried X times, with X set by maxRetries.
Check what version of dxpy was used to build the Nextflow pipeline and make sure it is the newest
Look at the head-node log (hopefully it was run with "debug mode" set to false, because when true, the log gets injected with details that aren't always useful and can make it hard to find errors)
Look for the process (sub-job) which caused the error, there will be a record of the error log from that process, though it may be truncated
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Some of the links on these pages will take the user to pages that are maintained by third parties. The accuracy and IP rights of the information on these third-party pages are the responsibility of those third parties.
Please note: in order to use Cohort Browser on the Platform, an Apollo License is needed.
Cohort combine logic allows you to combine existing cohorts with Boolean Logic operations
Here is a summary of the functions for Cohort Combine
All cohorts must be from the same dataset.
All cohorts must be saved before being combined.
A cohort that is a result of combine cannot be combined a second time.
Cohorts from different projects can be combined if they use the same underlying database.
Add your cohorts into the cohort browser by selecting “Load Saved Cohort” if the cohort has already been created and saved into the project, or “New Cohort” if a new cohort needs to be created. You can select up to 10 cohorts to load into the side menu.
Pick your cohort and add it to the browser. It will look like this.
At the bottom of the cohort tab, select "Combine Cohorts"
You will then have the following screen to combine. Pick your cohort combine logic, then select combine
Important note: the order of the cohorts matters in this.
Important notes:
Cohort must be saved before creating its complement (same rule as previous)
A combined cohort (Intersection, Union, Subtraction, Unique) can be used to create a complement.
A cohort created as a complement cannot be further used for combine / complement.
Applet
App
Purpose
Early development, experiment with analyses
Applet is stable, ready to use and possibly moved to a wider audience
When publishing an app, the following items are needed:
A working applet that you have tested
A name that is unique. Generally, the recommendation is to have an abbreviation for your org as part of the name. Example: If the org is named “academy_demos” and the app is for fastqc, then the name of the app could be “academy-fastqc”, “academy_demo-fastqc”, or “academydemo-fastqc”.
Documentation to add to a README.md for users to understand what your app does
Developer notes for you to keep track of version information and added to the Developer README.md
Use dx get applet-name to have the most recent version of your applet
Make your changes to the dxapp.json
Then use dx build app_name --publish --app
Forget to add users or need to add more users? Use:
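```bash
# Add authorized users (user-username) or orgs (org-orgname) to your app;
# placeholders here are illustrative -- substitute your app and entity names
dx add users app-name user-username org-orgname
```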
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select “Contact Support”
Fill in the Subject and Message to submit a support ticket.
Before you begin, set up a DNAnexus Platform account here: https://platform.dnanexus.com/login
There are several ways to interact with the platform. All of these will be covered in future lessons/ courses/ documentation.
We are going to be focused on the user interface (highlighted in green), also known as UI.
This information can also be found in the documentation for the Platform.
First, what is a project?
It is a collaborative workspace
The smallest unit of sharing on the platform
A place to store objects that are made on the platform
Examples of these objects can be files, applets, and workflows
The user folder is the storage area for your output files
You can add more folders into your user folder for organization (maybe one for data, each project, etc. This is however you and your organization/ company wants to do this)
Data can be in one of 3 states
Open: initial, empty state, awaits upload
Closing: uploading, not instantaneous
Closed: Finalization completed, available for next steps
Log into the DNAnexus Platform.
When you login, you will see a list of projects that you are a part of.
Navigating to a project
We have prebuilt projects for you
Copying means from one project to another project
You cannot copy within the same project because of the file ID.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
R needs to run in a regular (non-Spark) notebook and is best used for downstream analysis.
If you are directly interacting with the database/dataset, it is recommended that you either 1) use Python and/or 2) use Spark for extracting the data that is relevant for the downstream analysis.
Create a DX JupyterLab Notebook so that it will automatically save onto the Trusted Research Environment. You can do so by selecting these 2 different options:
Option 1 is from the Launcher:
b. Option 2 is from the DNAnexus Tab:
Start writing your JupyterLab Notebook. Select which kernel you are going to use (options will vary depending on the Image you selected in set up).
Download packages and save the software environment as a snapshot
Download Packages
b. Save the Snapshot of the environment
Notebooks can also be directly opened from project storage
When you save in JupyterLab, the notebook gets uploaded to the platform as a new file. This goes back to the concept of immutability.
The old version of notebook goes into .Notebook_archive/ folder in project.
A license is required to access the Data Profiler on the DNAnexus Platform. For more information, please contact DNAnexus Sales (via [email protected]).
The data used in this section of Academy documentation can be found here to download: https://synthea.mitre.org/downloads
The citation for this synthetic dataset is:
Walonoski J, Klaus S, Granger E, Hall D, Gregorowicz A, Neyarapally G, Watson A, Eastman J. Synthea™ Novel coronavirus (COVID-19) model and synthetic data set. Intelligence-Based Medicine. 2020 Nov;1:100007. https://doi.org/10.1016/j.ibmed.2020.100007
PygWalker shows a sample of the dataset in a table format
PygWalker simplifies data analysis and visualization by transforming pandas dataframes into an interactive interface for easy exploration. It is available within the table-level view of the application. To use it, simply click the Go to Explorer Mode button to access the raw data slide. You can learn more about its features by referring to the documentation or watching demo videos.*
A custom plot created with PygWalker
*DNAnexus is not responsible for the accuracy or updating of any 3rd party content or applications*
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
If your nextflow run fails, the nextflow job log is written to your project Output location (CLI flag --destination) that you set for the applet at runtime.
However, on failure, your results files in params.outdir are not written to the project, unless you are using the 'ignore' error strategy.
To guard against long-running or expensive (or both!) runs that produce no output when they fail, you need to think carefully about what should happen when your job fails and whether you need the ability to resume it. Resuming means that successfully completed processes won't be run again, saving you the cost and time of re-running them.
Nextflow has a resume feature to enable runs that fail to be resumed again which
A license is required to access the Data Profiler on the DNAnexus Platform. For more information, please contact DNAnexus Sales (via ).
The data used in this section of Academy documentation can be found here to download:
The citation for this synthetic dataset is:
Walonoski J, Klaus S, Granger E, Hall D, Gregorowicz A, Neyarapally G, Watson A, Eastman J. Synthea™ Novel coronavirus (COVID-19) model and synthetic data set. Intelligence-Based Medicine. 2020 Nov;1:100007.
pip install ___ #python

docker pull broadinstitute/gatk

docker save broadinstitute/gatk -o gatk.tar.gz

dx upload gatk.tar.gz

{
}

{
"_projects": null, #deletes the current list of projects
"_tools": [
{"text": "Custom Menu Item", "url": "http://example.com"}, #creating a new item within tools
{"text": "Opens in New Tab", "url": "http://example.com", "newTab": true} #creating a new tab in tools
],
"_help": null, #removes help
"A New Menu": [
{"text": "New Menu Item", "url": "http://example.com"}, #new menu
],
"A New Link": {"url": "http://example.com", "newTab": true} #new link
}

Stata
Update the Duration if desired
Add Commands to run in the JupyterLab environment (optional)
Finally, update the Feature. For a full list of packages in each feature, please look in the Preinstalled Packages List. The options are
Python_R
ML
IMAGE_PROCESSING
STATA
MONAI_ML
Spending Limit (optional)
Instance Type (change the default value if needed)







Cohort combine operations result in very complex queries.
Beware of performance delays and timeouts as query gets more complex.
Use extra caution when:
Combining cohorts with genomic filters
Combining cohorts with complicated filters
Combining cohorts based on very large datasets
















billable activities
shared apps
shared projects
are either allowed or not allowed to access
billable activities
shared apps
shared projects
is a single user on the platform
can be an org admin or an org member
They can also be added to a project without being a member of an org, but they will not see pricing or have access to org-specific options unless they are part of the org itself.
holds one of 4 types of permissions to a project
could be to limit how the data is handled
Can be changed from all members to no one
Delete Access
Limit how the data is handled
Can be changed from Contributors and Admins to Admins only
Download Access
Limit who can see the data (this would allow accessing the data outside of the platform)
Can be changed from all members to no one
Org admins can define projects and project access
Introduce apps and app access
Looking at the permissions associated with each of these users: admin A and users B and C have access only to org, whereas admin D and user E have access to both org and org-new, and user F has access only to org-new.
Orgs are flexible tools used to represent groups of users. They can be used to simplify resource sharing, consolidate billing, and associate platform work with real-world billing structures.

















Look at the raw code
Look at the cached work directories
.command.run runs to setup the runtime environment
Including staging file
Setting up Docker
.command.sh is the translated script block of the process
Translated because input channels are rendered as actual locations
.command.log, .command.out etc are all logs
Look at logs with "debug mode" as true
when a subprocess returns an error, retry the process
None is present at the applet creation
Each time the app is built, it must be given a new version.
A default spending account set up for yourself as the app author. For published apps, they will require storage for their resources and assets, and the storage will be billed on a monthly basis to the billing account of the original author of each app. You can set multiple authors, but the original author is where the billing is tied to.
Decide if you want the app to be open source. In dxapp.json, add a key called "openSource" with a boolean value of true or false.
A consistent version policy for your meaningful updates. DNAnexus suggests Semantic Versioning.
Add authorized users. In dxapp.json, add a key called "authorizedUsers" with a value being an array of strings, corresponding to the entities allowed to run the app. Entities are encoded as user-username for single users, or org-orgname for organizations.
Perks of Each
Easy to collaborate, members of the project can edit the code, and publish
Once published, the app cannot be modified (version control is enforced), and apps can carry assets in their own private container.
Goal
Add an executable into an application for more efficient use, while keeping the ability to edit the code easily
Add an executable into an application for more efficient use, while enhancing reproducibility and minimizing risk
Applets
Apps
Location
in projects
in the Tool Library, if you are the developer or an authorized user
Naming Structure
project:/applet_ID
project:/folder/name
app-name
Can they be shared?
Through projects, as a data object
App developer manages a list of users authorized to access the app
Updating
Deleting the previous applet with the same name, and creating a new one
New version per release
Versioning
Load Packages
b. Download or Access data files to the JupyterLab environment
c. Import the data
d. Then, perform the analysis for your data
e. Upload results back to Project Space
Save your DX Jupyterlab Notebook
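A minimal sketch of steps b, c, and e in a notebook, based on the dx download, dxFUSE, and dx upload snippets shown on this page (all paths and filenames are placeholders):

```python
import pandas as pd

# Step b: bring a data file into the JupyterLab environment
# Option 1: download a copy with dx download
!dx download "PATH/TO/FILE.csv"
# Option 2: read it directly through the dxFUSE project mount
# data = pd.read_csv("/mnt/project/PATH/TO/FILE.csv")

# Step c: import the data
data = pd.read_csv("FILE.csv")

# Step d: ...perform your analysis on `data`...

# Step e: upload results back to project storage
data.to_csv("results.csv", index=False)
!dx upload results.csv --destination /your/path/for/results
```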



import dxdata
import pprint
import pyspark
from pyspark.sql import functions as F

dx extract_dataset dataset_id -ddd --delimiter

sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)

%%bash
dx upload FILE --destination /your/path/for/results

// memory errors which should be retried. otherwise error out
errorStrategy = { task.exitStatus in ((130..145) + 104) ? 'retry' : 'finish' }
maxRetries = 1
maxErrors = '-1'

dx add users USER OR ORG NAME OF APP

pip install ___ #python
install.packages() #R

import ____ #python
library() #R

%%bash
#option 1: dx download
dx download "PATH TO FILE"
#option 2: dx fuse
data = pd.read_csv("/mnt/project/PATH.csv")

import ___ as pd
NAME = pd.read_csv("PATH.csv")

%%bash
dx upload FILE --destination /your/path/for/results

Notice, you will automatically return to the Info tab for that version.
You will have a review step. This is to review the content as well as add additional parameters such as a spending limit.





A place to contain details of running jobs/ analyses and their results
In your project space, select "DNAnexus Academy 101"
Navigate to the users folder and use Add > New Folder



To be able to resume a run, set preserve_cache to true for the initial run. This will cache the Nextflow workDir of the run in your project on the platform, in a folder called .nextflow_cache_db/<session_id>/. The session ID is a unique ID given to each (non-resumed) Nextflow run. Resumed Nextflow runs will share the same session ID as the run that they are resuming, since they are using the same cache.
The cache is the nextflow workDir which is where nextflow stores each tasks files during runs. By default when you run a nextflow applet, preserve_cache is set to false. In this state, if the applet fails you will not have the ability to resume the run and you are not able to see the contents of the work directory in your project.
To turn on preserve_cache for a run add -ipreserve_cache=true to your run command.
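```bash
dx run applet-xxxx -ipreserve_cache=true
```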
In the UI, scroll to the bottom of the Nextflow run setup screen
So if you are running a job and think there is a chance that you might want to resume it if it fails, then turn on preserve_cache.
Note that if you terminate a job manually i.e., using the terminate button in the UI or with dx terminate the cache will not be preserved and you will not be able to resume the run even if preserve_cache was set to true for the run. The same applies if a job is terminated due to a job cost limit being exceeded. Essentially, if it is not the DNAnexus executor terminating the run, then the cache is not preserved and so resuming the run is not possible.
You can store up to 20 caches in a project, and a cache will be stored for a maximum of 6 months. Once that limit has been reached, you will get a failure if you try to run another job with preserve_cache switched on. In practice, you should regularly delete your cache folders once you have had successful runs and no longer need them, to save on storage costs.
You can make changes to the Nextflow applet, dx build it again and/or make changes to the run inputs before resuming a run.
When you resume a run in the CLI using the session ID, the run will resume from what is cached for the session id on the project.
Only one Nextflow job with the same session ID can run at any time.
When resume is assigned 'true' or 'last', the run will determine the session ID that corresponds to the latest valid execution in the current project and resume the run from it.
or
To set up the sarek command to preserve the cache:
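```bash
dx run sarek_v3.4.0_ui -ioutdir='./test_run_cli_qs_ch' -ipreserve_cache=true -inextflow_run_opts='-profile test,docker -queue-size 20' --destination 'project-ID:/USERS/FOLDERNAME'
```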
To resume a sarek run and preserve updates to the cache from the new run (which will allow further resumes in case this resumed run fails) use the code below:
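```bash
dx run sarek_v3.4.0_ui -ioutdir='./test_run_cli_qs_ch' -ipreserve_cache=true -iresume='last' -inextflow_run_opts='-profile test,docker -queue-size 20' --destination 'project-ID:/USERS/FOLDERNAME'
```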
To get the session-id of a run, click the run in the monitor tab of your project and scroll down to the bottom of the page. On the bottom right you should see the session ID in the 'Properties' section
If you know your job ID, you can also use that to get the session ID on the CLI using
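```bash
dx describe job-ID --json | jq -r .properties.nextflow_session_id
```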
Check what version of dxpy was used to build the Nextflow pipeline and make sure it is the newest
Look at the head-node log (hopefully it was run with "debug mode" set to false, because when true, the log gets injected with details that aren't always useful and can make it hard to find errors)
Look for the process (sub-job) which caused the error, there will be a record of the error log from that process, though it may be truncated
Look at the failed sub-job log
Look at the raw code
Look at the cached work directories
.command.run runs to setup the runtime environment
Including staging file
Setting up Docker
Look at logs with "debug mode" as true
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
The Table-level screen appears when the user selects one particular table in the Navigator.
Table-level Screen of a table in Data Profiler
Overview details on the header of the Table-level screen
On the header of the Table-level screen, the user can find overall statistics on the selected table, that include:
Table size: number of rows and columns of the table
Missing rate: the rate of empty cells in the table
Duplicate rate: the rate of duplication of an entire row in the table
Pie chart of Column types on the header of the Table-level screen
The pie chart shows the composition of column types in the table. The size of each part of the pie is determined by the number of columns of that type. The user can also hover on the chart to get the count value.
Table-level screen has a Controller section that configures the visualization in the Chart area
The main function of the Table-level Screen is the Chart Area, which is controlled by a Controller in the top right corner of the screen. There are 2 main types of visualizations: Completeness and Column Profiles.
Completeness is the default mode of the Table-level screen. It aims to provide an overview on the count/rate of non-null values in a table. Completeness has 2 options: One-way view and Two-way view
One-way view in Table-level screen
One-way view is a stacked bar chart that displays the percentage of missing values, non-duplicates, and duplicates for each column in the table. You can click on the Legend/Key to show or hide specific statistics on the chart. Hover over each column to view detailed statistics.
Two-way view in Table-level screen
Two-way view is a heat map showing data completeness for all columns in the table. The Y-axis of the heatmap is the columns of the table. The X-axis of the heatmap is the unique values of the group-by column. The value of the heatmap shows how many entities of the table (as a raw count in Raw count mode, or a percentage in Percentage mode) have non-null values in the column (y-axis) with respect to the value of the group-by column (x-axis). The user can choose another column as the grouping factor; each value of this group-by column becomes a column in the heat map. Only categorical columns with a maximum of 30 unique values will show up as options.
The Controller of Two-way view
The numbers in the heat map can be configured in two ways:
Raw count displays the exact number of values available in each column.
Percentage shows the completeness statistic as a percentage. The completeness statistic ranges from 0 to 100, where 0 means the data is completely missing, and 100 indicates that the data is 100% complete.
Two-way View: Heat map, cross-table analysis
The user can also join the current table with another table using the Join with table option. By joining with another table, the user can use a column from that table as the Group-by column.
FAQs
Question: Can I use the Two-way View to check how many female patients have sequencing data?
Answer: Yes. Assume that your question involves two metadata fields: patient_sex (from the patient table) and sequencing_run_id (from the sequencing table), and that the patient and sequencing tables are join-able by patient_id. If that is the case, you can open the patient table with the Two-way View, join it with the sequencing table, and choose patient_sex as the Group-by column. On the sequencing.sequencing_run_id row, you can see the completeness rate broken down by each sex in patient_sex.
The heatmap options controller when doing cross-table analysis. We are joining "patients" table into the "observations" table
Completeness heatmap in case of cross-table analysis. In this example, the main table is "patients", the joined table is "observations". This heatmap shows how many patients who have available data (not-null values) on the fields which respect to the patient race: white, black, asian, native, or other
Column Profiles mode shows each column as a tile. The chart type depends on the type of the column.
This screen provides detailed statistics and distribution charts for the columns in the table. For all column types, it displays the missing rate and the duplication rate.
For columns containing string data, it shows the number of unique values and the value frequency, which is represented in a distribution chart.
For columns containing float data, the screen provides information about the variance, standard deviation, and the value range frequency, which is displayed in a distribution chart. Additionally, a box plot is shown, illustrating the maximum value, Q3 (upper quartile), median, Q1 (lower quartile), and the minimum value.
For columns containing datetime data, the screen displays the variance, standard deviation, and value range frequency on a distribution chart. A box plot is also provided, showing the maximum value, Q3 (upper quartile), median, Q1 (lower quartile), and the minimum value.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
You can add different sections, links, projects, etc into the json file
If you have questions about how to use a json file, please view this section
This is the file to create what you see above
Other parameters to the header section
Please email [email protected] to create a support ticket if there are technical issues.
Portable Batch System (PBS) or SLURM
dx-toolkit
Worker
Requested from pool of machines in private cluster
requested from pool of machines in AWS/ Azure
Shared Storage
Shared file system for all nodes (Lustre, GPFS, etc)
Project storage (Amazon S3/ Azure storage)
Worker File I/O
Handled by Shared file system
Needs to be transferred to and from project storage by commands on the worker
With an HPC, there is a collection of specialized hardware, including mainframe computers, as well as a distributed processing software framework so that the incredibly large computer system can handle massive amounts of data and processing at high speeds.
The goal of an HPC is to have the files on the hardware and to also do the analysis on it. In this way, it is similar to a local computer, but with more specialty hardware and software to have more data and processing power.
Your computer: this communicates with the HPC cluster for resources
HPC Cluster
Shared Storage: common area for where files are stored. You may have directories branching out by users or in another format
Head Node: manages the workers and the shared storage
HPC Worker: is where we do our computation and is part of the HPC cluster.
These work together to increase processing power and to have jobs and queues so that when the amount of workers that are needed are available, the jobs can run.
In comparison, cloud computing adds layers into analysis to increase computational power and storage.
This relationship and the layers involved are in the figure below:
Let's contrast this with processing a file on the DNAnexus platform.
We'll start with our computer, the DNAnexus platform, and a file from project storage.
We first use the dx run command, requesting to run an app on a file in project storage. This request is then sent to the platform, and an appropriate worker from the pool of workers is made available.
When the worker is available, we can transfer a file from the project to the worker.
The platform handles installing the app and its software environment to the worker as well.
Once our app is ready and our file is set, we can run the computation on the worker.
Any files that we generate must be transferred back into project storage.
HPC jobs are limited by how many workers are physically present on the HPC.
Cloud computing, by contrast, can request workers on demand from a much larger pool, so jobs are not limited by a fixed set of machines.
One common barrier is getting our files onto the worker from project storage, and then doing computations with them on the worker. The last barrier we'll review is getting the file outputs we've generated from the worker back into the project storage.
Cloud computing has a nestedness to it and transferring files can make learning it difficult.
A mental model of how cloud computing works can help us overcome these barriers.
Cloud computing is indirect, and you need to think 2 steps ahead.
Here is the visual for thinking about the steps for file management:
Creating apps and running them is covered later in the documentation.
Apps serve to (at minimum):
Request an EC2/Azure worker
Configure the worker's environment
Establish data transfer
Highly secure platform with built-in compliance infrastructure
Fully configurable platform
User can run single scripts to fully-automated, production-level workflows
Data transfer designed to be fast and efficient
Read and analyze massive files directly using dxfuse
Instances are configured for you via apps
Variety of ways to configure your own environments
Access to the wealth of
Largest Azure instances: ~4 TB RAM
Largest AWS instances: ~2 TB RAM
Run Job
dx run <app-id> <script>
qsub <script>
sbatch <script>
Monitor Job
dx find jobs
qstat
squeue
Kill Job
dx terminate <jobid>
qdel <jobid>
Single Job
Use `dx run` on the CLI directly
Use `dx run` in a shell script
Use a shell script to use `dx run` on multiple files
Use dxFUSE to directly access files (read only)
/ dx run --batch-tsv
1
List Files
List Files
2
Request 1 worker/ file
Use loop for each file: 1) use dx run, 2) transfer file, and 3) run commands
3
use array ids to process 1 file/worker
4
submit job to head node
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Driver/ Requestor
Head Node of Cluster
API Server
Submission Script Language
There are about 100 dx commands, which you can find by executing dx help all:
add: Add one or more items to a list
add developers: Add developers for an app
add member: Grant a user membership to an org
You are now able to:
Describe how to use metadata and the dx find data command on the CLI
Create and use batch file processing using the CLI
Describe the use cases that warrant the Cloud Workstation
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
There is an existing public Docker image available for CNVkit ("etal/cnvkit:latest"), so another option is to build a WDL version that will download and use this image at runtime rather than installing the Python and R modules ourselves.
In this example, you will:
Use WDL and Docker to build the CNVkit
To start, create a new directory called cnvkit_wdl parallel to the bash directory. Inside this new directory, create the file workflow.wdl with the following contents:
Next, ensure you have a working Java compiler and then download the latest dxCompiler Jar file. You can use the following command to place the 2.10.3 release into your home directory:
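One way to do that, assuming the release asset follows the usual dxCompiler-<version>.jar naming on the dxCompiler GitHub releases page:

```bash
wget -O ~/dxCompiler-2.10.3.jar \
  https://github.com/dnanexus/dxCompiler/releases/download/2.10.3/dxCompiler-2.10.3.jar
```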
Use dxCompiler to turn workflow.wdl into an applet equivalent to the bash version. In the following command, the workflow and all related applets will be placed into a workflows directory in the given project to keep all of this neatly contained. The given project ID project-GFf2Bq8054J0v8kY8zJ1FGQF is the caris_cnvkit project, so change this if you wish to place the workflow into a different project. Note the use of the -archive option to archive any existing version of the applet and allow the new version to take precedence, and the -reorg option to reorganize the output files. As shown in the following command, successful compilation will result in printing the new workflow's ID:
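A sketch of the compile command (assuming these dxCompiler flag names; check the dxCompiler documentation for your version):

```bash
java -jar ~/dxCompiler-2.10.3.jar compile workflow.wdl \
  -project project-GFf2Bq8054J0v8kY8zJ1FGQF \
  -folder /workflows \
  -archive -reorg
```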
Run the new workflow with the -h|--help flag to verify the inputs:
As with the bash version, you can launch the workflow from the CLI as follows:
The resulting output will show the JSON you can alternatively use to launch the job:
Following is the command you can use to launch the workflow from the CLI with the JSON file:
As before, you can use the web interface to monitor the progress of the workflow and inspect the outputs.
Run the following command to start a new cloud workstation:
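A sketch, assuming the standard Cloud Workstation app name (session length and instance type can be added as inputs if needed):

```bash
dx run app-cloud_workstation --ssh
```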
From the cloud workstation, pull the CNVkit Docker image:
Save and compress the image to a file:
Add the tarball to the project:
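A combined sketch of the pull, save, and upload steps above, run from inside the cloud workstation (the tarball filename is arbitrary; the image name comes from the earlier section):

```bash
# Pull the public CNVkit image
docker pull etal/cnvkit:latest

# Save and compress the image to a tarball
docker save etal/cnvkit:latest | gzip > cnvkit_latest.tar.gz

# Upload the tarball to the project
dx upload cnvkit_latest.tar.gz
```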
Update the WDL to use the tarball:
Build the app and run it.
In this chapter, you learned another strategy for packaging an applet's dependencies using Docker and then running the applet's code inside the Docker image using WDL.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
In this chapter, you'll learn to create an applet that uses an executable from the FASTX-Toolkit collection of command-line tools for processing short-read FASTA and FASTQ files. You'll use the applet to run fastq_quality_trimmer on a FASTQ file, creating a trimmed reads file that you can then use for further analysis.
You will learn the following:
How to accept an optional integer argument from the user
How to add resource files to an applet such as a binary executable that can be used in your applet code
Run dx-app-wizard mytrimmer to create the mytrimmer applet. You have already provided the applet name on the command line, so you can press Enter when prompted for it. You can add a title, summary, and version if you would like.
Start the input specification with the input FASTQ:
Next, indicate an optional integer for the quality score:
Press Enter to skip a third input and move to the output specification, which should define a single output file:
Press enter to exit the output section.
Set a timeout policy if you would like.
Answer the remaining questions to create a bash applet. The applet does not need access to the internet or parent project, and you can choose the default instance type.
Open the mytrimmer/dxapp.json in a text editor to view the inputSpec:
To make input file selection more convenient for the user, edit the patterns for the file extensions of the input_file to be those commonly used for FASTQ files:
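For example, the input_file entry might include a patterns array like this (the exact extension list is only a suggestion):

```json
{
  "name": "input_file",
  "class": "file",
  "patterns": ["*.fq", "*.fastq", "*.fq.gz", "*.fastq.gz"]
}
```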
These patterns are used in the web interface to filter files for the user, but it's not a requirement that the input files match these patterns. The file filter can be turned off by the user, so these patterns are merely suggestions.
Next, you will add a binary executable file from the FASTX toolkit. Download and unpack the FASTX toolkit binaries:
Then run make with the provided Makefile to build the executable.
The files are also here to download and for you to unpack:
Create the directory resources/usr/bin inside the mytrimmer directory:
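```bash
mkdir -p mytrimmer/resources/usr/bin
```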
When the app is bundled, the directory structure in the resources directory will be compressed and unpacked as is on the instance, so you should create a directory that is in the standard $PATH such as /usr/bin or /usr/local/bin.
This applet only requires the fastq_quality_trimmer binary, so copy it to the preceding directory:
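```bash
# The path to the unpacked binary may differ depending on where you extracted the toolkit
cp bin/fastq_quality_trimmer mytrimmer/resources/usr/bin/
```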
You should remove the downloaded binary artefacts as they are no longer needed.
Update mytrimmer/src/mytrimmer.sh with the following code:
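A sketch of what the script might look like, assuming the input is named input_file, the optional integer is named quality, and the output is named output_file in your dxapp.json (adjust to match your actual spec):

```bash
#!/bin/bash
# mytrimmer -- sketch only; variable names assume the inputSpec/outputSpec described above

main() {
    set -e -x -o pipefail

    # Download the input FASTQ file to the worker
    dx download "$input_file" -o "$input_file_name"

    # Build the output filename from the input prefix
    output_name="${input_file_prefix}.filtered.fastq"

    # fastq_quality_trimmer is on $PATH because it was bundled in resources/usr/bin;
    # -t is the quality threshold (the optional integer input, default set in dxapp.json)
    fastq_quality_trimmer -t "$quality" -i "$input_file_name" -o "$output_name"

    # Upload the trimmed file and register it as the applet output
    trimmed_id=$(dx upload "$output_name" --brief)
    dx-jobutil-add-output output_file "$trimmed_id" --class=file
}
```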
The variables $input_file and $input_file_name are based on the inputSpec name input_file. The first is a record-like string {"$dnanexus_link": "file-GJ2k2V80vx88z3zyJbVXZj3G"}, while the latter is the filename small-celegans-sample.fastq.
The variable $input_file_prefix is the name of the input file without the file extension, so small-celegans-sample, which is used to create the output filename small-celegans-sample.filtered.fastq.
You don't need to indicate the full path to fastq_quality_trimmer because it will exist in the directory /usr/bin, which is in the standard $PATH.
Add the sample FASTQ file to the project either by using the URL importer as shown in Figure 6, or download the file to your computer and upload through the web interface or using dx upload:
Use dx build to build the applet:
Run the applet with the -h|--help flag from the CLI to see the usage:
Run the applet using the file ID of the FASTQ file you uploaded:
The job's output should end with something like the following:
You can select the output file and view the results.
You can download the output file and check that the filtering actually removed some of the input sequences by using wc to count the original file and the result:
Run the applet with a higher quality score and verify that the result includes even fewer sequences.
In this chapter, you learned how to do the following:
Indicate an optional argument with a default value
Add a binary executable to a project in the resources directory and use that binary in your applet
Use variations on the input file variables to get the full filename or the filename prefix without the extension
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
A license is required to access the Data Profiler on the DNAnexus Platform. For more information, please contact DNAnexus Sales (via [email protected]).
The data used in this section of Academy documentation can be found here to download: https://synthea.mitre.org/downloads
The citation for this synthetic dataset is:
Walonoski J, Klaus S, Granger E, Hall D, Gregorowicz A, Neyarapally G, Watson A, Eastman J. Synthea™ Novel coronavirus (COVID-19) model and synthetic data set. Intelligence-Based Medicine. 2020 Nov;1:100007. https://doi.org/10.1016/j.ibmed.2020.100007
Data Profiler helps the user explore different levels of a dataset. There are 3 levels of a dataset in Data Profiler:
Dataset level: Show relationships between tables in the dataset and overview of all tables, columns in the dataset
Table level: Show statistics of one particular table. It can also join with another table to create a joint profile.
Column level: Show statistics of one particular column of a table. It can also combine with other columns in the same table to create a joint profile.
To navigate between these 3 levels, the user can select from a navigator on the left side of the application. Once an option of the navigator is selected, the content of the main interface will change accordingly.
The user interface of Data Profiler consists of a navigator (left, highlighted in blue), which controls the content of the main section (right, highlighted in green).
Navigator controls the content on the main section of Data Profiler. The main component of the Navigator is a hierarchical structure of the dataset, called Data Hierarchy
The top level of a Data Hierarchy is All Tables, indicating the dataset level. This level is selected by default.
Under All Tables are individual tables in the dataset. Each table has a number on the far right indicating the number of columns in the table.
Once a table is selected, the Data Hierarchy will show all columns from that table. Each column has a colored tag indicating the column type.
Above the Data Hierarchy, the user can search for one or more columns. The Data Hierarchy will show tables that have at least one of the column names in the search list (OR logic).
At the bottom of the Navigator, the user can switch to an Explorer Mode to create charts on their own. The functionality of this mode is discussed in another section of this document.
The 📜 button shows the Inference Logs Screen, which shows details of the profiling process. This feature is in development.
The type of a column in Data Profiler can be specified in a data_dictionary. If that information is not available, Data Profiler will infer the column type based on the content of the column.
In Data Profiler, there are 4 column types. These types are consistent with the data types used via the Data Model Loader on the DNAnexus platform:
Null (or empty) values are allowed in all column types and they do not affect how a column type is determined.
In my data_dictionary, the type of column A is “integer”. After loading with Data Profiler, the application says column A is a “string” column. What happened?
There is at least one non-null arbitrary value in column A that cannot be cast to an integer. Therefore, the Data Profiler falls back to “string”.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select “Contact Support”
Fill in the Subject and Message to submit a support ticket.
Nextflow pipelines are composed of processes e.g., a task such as fastqc would be one process, then read trimming would be another process etc. Processes pass files between them using channels (queues) so every process usually has an input and output channel. Nextflow is implicitly parallel - if it can run something in parallel, it will! There is no need to loop over channels etc.
For example, you could have a script with fastqc and read_trimming processes that take in a fastq reads channel. As these two processes have no links between them, they will be run at the same time.
The Nextflow workflow file is called main.nf.
Let's think about a quick workflow that takes in some single-end fastq files, runs fastqc on them, then trims them, runs fastqc again, and finally runs multiqc on the fastqc outputs.
An example of code that would achieve the workflow in the image (not showing what each process script looks like here)
An example local run (not on or interacting with DNAnexus) would look like the command below. This assumes you have Nextflow on your own local machine, which is not required for DNAnexus
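A sketch of such a command (the path is a placeholder; --fastq_dir is the parameter assumed above):

```bash
nextflow run main.nf --fastq_dir 'path/to/your/fastq_files'
```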
As we gave --fastq_dir a default, if your inputs match that default you could just run
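```bash
nextflow run main.nf
```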
DNAnexus has developed a version of the Nextflow executor that can orchestrate Nextflow runs on the DNAnexus platform.
Once you kick-off a Nextflow run, a Nextflow 'head-node' is spun up. This stays on for the duration of the run and it spins up and controls the subjobs (each instance of a process).
orchestrates subjobs
contains the Nextflow output directory which is usually specified by params.outdir in nfcore pipelines
copies the output directory to the DNAnexus project once all subjobs have completed (--destination)
one for every instance of a process
each subjob is one virtual machine (instance) e.g., fastqc_process(fileA) is run on one machine and fastqc_process(fileB) is run on a different machine
Uses a Docker image for the process environment
Required files pulled onto machine and outputs sent back to head node once subjob completed
Nextflow uses a 'work' directory (workDir) for executing tasks. Each instance of a process gets its own folder in the work directory and this directory stores task execution info, intermediate files etc.
Depending on if you choose to or not, you will be able to see this work directory on the platform during/after your nextflow run.
Otherwise, the work directory exists in a and it will be destroyed once a run has completed.
You may have learned about batching some inputs for WDL workflows previously. You do not need to do this for Nextflow applets - all parallelisation is done automatically by Nextflow.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Some of the links on these pages will take the user to pages that are maintained by third parties. The accuracy and IP rights of the information on these third-party pages are the responsibility of those third parties.
A license is required to access the Data Profiler on the DNAnexus Platform. For more information, please contact DNAnexus Sales (via [email protected]).
A Note on Data:
The data used in this section of Academy documentation can be found here to download: https://synthea.mitre.org/downloads
The citation for this synthetic dataset is:
Walonoski J, Klaus S, Granger E, Hall D, Gregorowicz A, Neyarapally G, Watson A, Eastman J. Synthea™ Novel coronavirus (COVID-19) model and synthetic data set. Intelligence-Based Medicine. 2020 Nov;1:100007. https://doi.org/10.1016/j.ibmed.2020.100007
Column-level screen shows a string column
For columns containing string data, the data profiler will display several statistics and charts to help analyze the data.
The statistics include:
The missing rate, expressed as a percentage of the missing values in the column.
The number of unique values present in the column.
The charts provided include:
Top Records Bar Chart: This chart displays the top values that occur most frequently in the column. You can select how many top records to display using a dropdown list. By hovering over the bars, you can see the exact count of records for each value.
Character Length Distribution Chart: This chart shows how the lengths of the strings are distributed. By hovering over different parts of the chart, you can view the range of character lengths and how frequently each range occurs. Besides, the average length of the strings in the column and standard deviation (which measures the amount of variation in the string lengths) are also reported.
Boxplot: The boxplot provides a visual summary of the data in terms of its distribution, showing the maximum value, Q3 (upper quartile)
Column-level screen shows a float column
For columns containing float data, the data profiler provides several statistics and charts to help analyze the data.
The statistics include:
The missing rate, displayed as a percentage of missing values.
The standard deviation, which measures the spread of the data values.
The Interquartile range, which measures the difference between the 75th and 25th percentiles of the data.
The charts provided include:
Distribution Chart: This chart displays the distribution of values in the column. You can hover over the chart to view the range of values and their frequencies.
Boxplot: The boxplot visually represents the distribution of the data, showing the maximum value, Q3 (upper quartile), median, Q1 (lower quartile), and the minimum value.
Grouping Frequency Chart (Two way plot): This chart shows the frequency of unique values in the current column, grouped with values from another column. You can select the column for grouping from a dropdown list.
Column-level screen shows a datetime column
For columns containing datetime data, the data profiler provides several statistics and charts for in-depth analysis.
The statistics include:
The missing rate, displayed as a percentage of missing values.
The standard deviation, measuring the dispersion of the datetime values.
The Mode, showing the mode/format of the datetime data in the column.
The charts provided include:
Distribution Chart: This chart shows the distribution of datetime values in the column. You can hover over the chart to view the range of values and their frequencies.
Boxplot: The boxplot visually represents the distribution of the datetime data, displaying the maximum value, Q3 (upper quartile), median, Q1 (lower quartile), and the minimum value.
Radar Chart: This chart displays the frequency of values, grouped by year, month, or day. You can change the grouping option using the dropdown at the top.
Even though each column type has a different layout on the Column-level Screen, Pairwise plot between columns is a common component.
The user can create a plot between the current column and any other column from the same table. However, not all columns are available for this feature. Data Profiler will show columns that satisfy the following conditions:
Not a string column
If it is a string column:
Not a primary key
The number of unique values count is no larger than 30
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Be sure to install .
JavaScript Object Notation (JSON) is a data exchange format designed to be easy for humans and machines to read. You will encounter JSON in several places on the DNAnexus platform, such as when you create and edit native applets and workflows. As shown in Figure 1, JSON is used to communicate with the DNAnexus Application Programming Interface (API). Understanding the responses from the API will help you debug applets, find failed jobs, and relaunch analyses.
A license is required to access the Data Profiler on the DNAnexus Platform. For more information, please contact DNAnexus Sales (via ).
The data used in this section of Academy documentation can be found here to download:
The citation for this synthetic dataset is:
Walonoski J, Klaus S, Granger E, Hall D, Gregorowicz A, Neyarapally G, Watson A, Eastman J. Synthea™ Novel coronavirus (COVID-19) model and synthetic data set. Intelligence-Based Medicine. 2020 Nov;1:100007.
dx run applet-xxxx -ipreserve_cache=true

dx run applet-xxxx -iresume='session-id'

dx run applet-xxxx -iresume='last'

dx run applet-xxxx -iresume=true

dx run sarek_v3.4.0_ui -ioutdir='./test_run_cli_qs_ch' -ipreserve_cache=true -inextflow_run_opts='-profile test,docker -queue-size 20' --destination 'project-ID:/USERS/FOLDERNAME'

dx run sarek_v3.4.0_ui -ioutdir='./test_run_cli_qs_ch' -ipreserve_cache=true -iresume='last' -inextflow_run_opts='-profile test,docker -queue-size 20' --destination 'project-ID:/USERS/FOLDERNAME'

dx describe job-ID --json | jq -r .properties.nextflow_session_id
#ID{
"header": {
"logo": "#logo_header.png",
"logoOpensNewTab": true,
"hideCommunitySwitch": true,
"colors": {
"background": "#EEEEEE",
"border": "#EEEEEE",
"text": "#000000"
}
},
"homeURL": "http://academy.dnanexus.com"
}
{
"header": {
"logo": "#logo_header.png", #image for the logo; has to be an appropriate size. min 15x15px, max 50x30px
"logoOpensNewTab": true, #opens new tab if you select the logo
"hideCommunitySwitch": true,
"colors": {
"background": "#123456", #background color for the header
"border": "#123456", #border color for the header
"text": "#123456", #text color
}
} "header": {
"colors": {
"hoverBackground": "#123456", #hover background color
"userColors": ["#123456", "#234567", "#345678"], #user colors
"button": {"success": {"border-color": "green", "background":
"pink", "hover": {"background": "dusk"}}} #setting colors for buttons or hover selections
}"login": {
"logo": "#logo_login.png", #image for login
"text": "# ADD TEXT IN MARKDOWN FORMAT HERE.",
"colors": {
"loginButton": "#123456" #set color for login button here
}"register": {
"disable": true,
"logo": "#logo_register.png", #image for registering
"text": "#ADD TEXT IN MARKDOWN FORMAT HERE.",
"agreeToText": "Plain text you need to agree to before registering", #plain text, string
"region": "aws:us-east-1",
"colors": {
"registerButton": "#123456" #color for register button
}"homeURL": "http://example.com", #url for logo
"supportURL": "http://example.com/support", #support URL
"hideCommunitySwitch": true,
"description": "A short description of two or three lines for the community selector" #description for the community # Generate batch file by regex
$ dx generate_batch_inputs -iinput_fwd='(.*)_R1_001.fastq.gz' -iinput_rev='(.*)_R2_001.fastq.gz'
# Show the local file
$ cat dx_batch.0000.tsv
# Use the local batch file
$ dx run fastp --batch-tsv dx_batch.0000.tsv -iadapter_fa=/data/adapters.fa -iprefix='Sample1'
add stage: Add a stage to a workflow
add users: Add authorized users for an app
add_types: Add types to a data object
api: Call an API method
archive: Requests for the specified set of files or for the files in a single specified folder in one project to be archived on the platform
build: Create a new applet/app, or a workflow
build_asset: Build an asset bundle
cat: Print file(s) to stdout
cd: Change the current working directory
clearenv: Clears all environment variables set by dx
close: Close data object(s)
cp: Copy objects and/or folders between different projects
describe: Describe a remote object
download: Download file(s)
env: Print all environment variables in use
exit: Exit out of the interactive shell
extract_dataset: Retrieves the data or generates SQL to retrieve the data from a dataset or cohort for a set of entity.fields. Additionally, the dataset's dictionary can be extracted independently or in conjunction with data. Listing options enable enumeration of the entities and their respective fields in the dataset.
find analyses: List analyses in the current project
find apps: List available apps
find data: List data objects in the current project
find executions: List executions (jobs and analyses) in the current project
find globalworkflows: List available global workflows
find jobs: List jobs in the current project
find org apps: List apps billed to the specified org
find org members: List members in the specified org
find org projects: List projects billed to the specified org
find orgs: List orgs
find projects: List projects
generate_batch_inputs: Generate a batch plan (one or more TSV files) for batch execution
get: Download records, apps, applets, workflows, files, and databases
get_details: Get details of a data object (cf details)
head: Print part of a file
help: Display help messages and dx commands by category
install: Install an app
invite: Invite another user to a project or make it public
list database: List entities associated with a specific database
list database files: List database files associated with a specific database
list developers: List developers for an app
list stages: List the stages in a workflow
list users: List authorized users for an app
login: Log in (interactively or with an existing API token)
logout: Log out and remove credentials
ls: List folders and/or objects in a folder
make_download_url: Create a file download link for sharing
mkdir: Create a new folder
mv: Move or rename objects and/or folders inside a project
new org: Create new non-billable org
new project: Create a new project
new record: Create a new record
new user: Create a new user account
new workflow: Create a new workflow
publish: Publish an app or a global workflow
pwd: Print current working directory
remove developers: Remove developers for an app
remove member: Revoke the org membership of a user
remove stage: Remove a stage from a workflow
remove users: Remove authorized users for an app
remove_types: Remove types from a data object
rename: Rename a project or data object
rm: Remove data objects and folders
rmdir: Remove a folder
rmproject: Delete a project
run: Run an applet, app, or workflow
select: List and select a project to switch to
set_details: Set details on a data object
set_properties: Set properties of a project, data object, or execution
set_visibility: Set visibility on a data object
setenv: Sets environment variables for the session
ssh: Connect to a running job via SSH
ssh_config: Configure SSH keys for your DNAnexus account
tag: Tag a project, data object, or execution
terminate: Terminate jobs or analyses
tree: List folders and objects in a tree
unarchive: Requests for the specified set of files or for the files in a single specified folder in one project to be unarchived on the platform.
uninstall: Uninstall an app
uninvite: Revoke others' permissions on a project you administer
unset_properties: Unset properties of a project, data object, or execution
untag: Untag a project, data object, or execution
update member: Update the membership of a user in an org
update org: Update information about an org
update project: Updates a specified project with the specified options
update stage: Update the metadata for a stage in a workflow
update workflow: Update the metadata for a workflow
upgrade: Upgrade dx-toolkit (the DNAnexus SDK and this program)
upload: Upload file(s) or directory
wait: Wait for data object(s) to close or job(s) to finish
watch: Watch logs of a job and its subjobs
whoami: Print the username of the current user
Grouping Frequency Chart: This chart displays how often unique values in the current column occur when grouped with values from another column. You can choose the column to group by using a dropdown list.
Grouping Frequency Chart (Two Way Plot): This chart shows the frequency of unique datetime values in the current column, grouped with values from another column. You can select the column for grouping from a dropdown list.





.command.sh is the translated script block of the process
Translated, because input channels are rendered as actual file locations
.command.log, .command.out, etc. are all log files
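For example, once a task has completed you can inspect these files directly in its work directory. A minimal sketch, assuming a hypothetical two-level hash subdirectory (copy the real path from the Nextflow log for the task you want to inspect):
$ cd work/3f/9c1a2b        # hypothetical task work directory
$ cat .command.sh          # the rendered script block for this task
$ tail .command.log        # combined log
$ tail .command.err        # stderr only
$ cat .exitcode            # exit status of the task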




scancel <jobid>



Run fastq_quality_trimmer using the given $quality_score and write to the output filename. The -Q option is an undocumented option to indicate that the scores are in phred 33.
Upload the output file, which returns another record-like string describing the newly created file.
Add the newly uploaded record as a file output of the job.
Task execution status, temp files, stdout and stderr logs, etc. are sent to the work directory

Here is an example of objects nested inside other objects, describing the output of the FastQC app, which creates two files as outputs: an HTML report and a text file containing statistics on the input FASTQ:
In a later chapter, you will use a file called dxapp.json to build custom applets on DNAnexus. To see a full example from a working app, run dx get app-fastqc to download the source code for the FastQC app. This should create a fastqc directory that contains the file dxapp.json.
Following is a portion of this file showing a typical JSON document you'll encounter on DNAnexus:
The root element of this JSON document is an object, as denoted by the curly brackets.
The value of inputSpec is a list, as denoted by the square brackets.
Each value in the list is another object.
The first three values of this object are strings.
The patterns value is a list of strings representing file globs that match the input file extensions.
The following links explain the dxapp.json file in greater detail:
JSON is a strict format that is easy to get wrong if you are manually editing a file. For this reason, we suggest you use text editors that understand JSON syntax, highlight data structures, and spot common mistakes. For instance, a JSON object looks very similar to a Python dictionary, which allows a trailing comma in a list. Open the python3 REPL (read-evaluate-print-loop) and enter the following to verify:
A similar trailing comma in JSON would make the document invalid. To see this, go to JSONlint.com, paste this into the input box, and press the "Validate JSON" button:
The result should reformat the JSON onto three lines as follows:
The second line should be highlighted in red, and the "Results" below show that a JSON value is expected after the last comma and before the closing square bracket.
Remove the offending comma and revalidate the document to see the "Results" change to "Valid JSON." You may also want to install a command-line tool like jsonlint that can show similar errors:
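One common way to install jsonlint, assuming Node.js and npm are available on your machine:
$ npm install -g jsonlint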
JSON is not dependent on whitespace, so the previous example could be compressed to the following:
The jq program will format JSON into an indented data structure that is easier to read. In the following example, we execute jq with the filter . to indicate we wish to see the entire document, which is the last argument. Depending on your terminal, the keys may be shown in one color and the values in a different color:
The power of jq lies in the filter argument, which allows you to extract and manipulate the contents of the document. Use the filter .report_html to extract the value for key report_html that lies at the root of the document:
::: note If you request a key that does not exist, you will get the JavaScript value null, indicating no value is present: :::
Filters may chain keys to search further into the document structure. In the following example, we can extract the file identifier by chaining .report_html.dnanexus_link:
Unix-type operating systems such as Linux and FreeBSD/macOS have three special filehandles:
STDIN (standard in)
STDOUT (standard out)
STDERR (standard error)
STDOUT and STDERR control the output of programs where the first is usually the console and the second is an error channel to segregate errors from regular output. For instance, the STDOUT of jq can be redirected to a file using the > operator:
STDIN is an input filehandle created by using a pipe (|) in the following example:
Alternatively, you can read from an input redirect using <:
Many dx commands can return JSON by appending the --json flag to them. For instance, dx describe app-fastqc will return a table of metadata about the FastQC app. In the following example, I will request the same data as JSON and will pipe it into the head program to see the first 10 lines:
As with previous examples, the result is a JSON document with an object at the root level; therefore, I can pipe the output into jq .id to extract the app identifier:
I can use dx find projects --public to view a list of public projects. Using head, I can see the root of the JSON is a list:
The jq filter .[] will iterate over the values of a list at the root, so I can use .[].id in the following command to extract the project identifier of each. As this returns over 100 results, I'll use head to show the first few lines:
You can also use pipes inside of the jq filter to extract the same data:
You may wish to re-run an analysis, possibly with slightly different inputs. For this example, I'll use the job.json file rather than using a pipe.
Redirect this to a file:
::: note If you had access to the original job ID, you would run the following: :::
Edit the input.json file, perhaps to indicate a different kmer_size, then re-run the app using the new input:
Sometimes I find that some jobs have failed when processing large batches of data. I can use dx find jobs --state failed to return a list of failed jobs; jobs might fail if the input files were corrupt or especially large, causing the instances to run out of disk space or memory. First, I'll show you how to use more advanced filtering in jq. The file jobs.json shows example output from dx find jobs --json that I'll use to extract the state of the jobs:
A select statement in jq can find the "failed" jobs, and pipes join more filters to extract the job IDs and the app IDs:
To be useful in a bash loop, I need the job and app IDs on the same line, so I can use paste for this:
If I had access to the original executions and input files, I could use a bash loop to re-run these jobs. Since I don't, I'll echo the command that should be run:
This produces the following output:
If you were using dx find jobs, then the equivalent would be this:
You should now be able to:
Describe how users interact with the DNAnexus Platform
Explain the purpose of using JSON on the DNAnexus platform
Articulate the basic elements of JSON
Describe and read basic JSON structures on the platform
Parse JSON responses from the platform using jq and pipes to other filters or Unix programs
Learn the dxapp.json specification
Use an Editor like Visual Studio Code with JSON Crack plugin
Use JSON checking tools to make sure your JSON is well formed
Run through jq
Use dx get to get app code and dxapp.json for an existing app
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.

version 1.0
task cnvkit_wdl_kyc {
input {
Array[File] bam_tumor
File reference
}
command <<<
cnvkit.py batch \
~{sep=" " bam_tumor} \
-r ~{reference} \
-p $(expr $(nproc) -1) \
-d output/ \
--scatter
>>>
runtime {
docker: "etal/cnvkit:latest"
cpu: 16
}
output {
Array[File]+ cns = glob("output/[!.call]*.cns")
Array[File]+ cns_filtered = glob("output/*.call.cns")
Array[File]+ plot = glob("output/*-scatter.png")
}
}
$ cd && wget https://github.com/dnanexus/dxCompiler/releases/download/2.10.3/dxCompiler-2.10.3.jar
$ java -jar ~/dxCompiler-2.10.3.jar compile workflow.wdl \
-archive \
-reorg \
-folder /workflows \
-project project-GFf2Bq8054J0v8kY8zJ1FGQF
applet-GFyVxpQ0VGFgGQBy4vJ0kxK2
$ dx run applet-GFyVxpQ0VGFgGQBy4vJ0kxK2 -h
usage: dx run applet-GFyVxpQ0VGFgGQBy4vJ0kxK2 [-iINPUT_NAME=VALUE ...]
Applet: cnvkit_wdl_kyc
Inputs:
bam_tumor: [-ibam_tumor=(file) [-ibam_tumor=... [...]]]
reference: -ireference=(file)
Reserved for dxCompiler
overrides___: [-ioverrides___=(hash)]
overrides______dxfiles: [-ioverrides______dxfiles=(file) [-ioverrides______dx>
Outputs:
cns: cns (array:file)
cns_filtered: cns_filtered (array:file)
plot: plot (array:file)
$ dx run -y --watch applet-GFyVxpQ0VGFgGQBy4vJ0kxK2 \
-ibam_tumor=file-GFxXjV006kZVQPb20G85VXBp \
-ireference=file-GFxXvpj06kZfP0QVKq2p2FGF \
--destination project-GFyPxb00VGFz5JZQ4f5x424q:/users/kyclark
$ cat inputs.json
{
"bam_tumor": [
{
"$dnanexus_link": "file-GFxXjV006kZVQPb20G85VXBp"
}
],
"reference": {
"$dnanexus_link": "file-GFxXvpj06kZfP0QVKq2p2FGF"
}
}
$ dx run -y --watch applet-GFyVxpQ0VGFgGQBy4vJ0kxK2 -f inputs.json \
--destination project-GFyPxb00VGFz5JZQ4f5x424q:/users/kyclark
$ dx run -imax_session_length="1d" app-cloud_workstation --ssh -y
$ docker pull etal/cnvkit:latest
$ docker save etal/cnvkit:latest | gzip - > cnvkit.tar.gz
$ dx upload cnvkit.tar.gz --path project-GFyPxb00VGFz5JZQ4f5x424q:/
[===========================================================>]
Uploaded 503,092,072 of 503,092,072 bytes (100%) cnvkit.tar.gz
ID file-GFyq05j0VGFqJqq54q98pbBK
Class file
Project project-GFyPxb00VGFz5JZQ4f5x424q
Folder /
Name cnvkit.tar.gz
State closing
Visibility visible
Types -
Properties -
Tags -
Outgoing links -
Created Thu Aug 18 03:20:55 2022
Created by kyclark
via the job job-GFypx3Q0VGFgb71g4gYY3GF3
Last modified Thu Aug 18 03:20:57 2022
Media type
archivalState "live"
cloudAccount "cloudaccount-dnanexus"version 1.0
task cnvkit_wdl_tarball {
input {
Array[File] bam_tumor
File reference
}
command <<<
cnvkit.py batch \
~{sep=" " bam_tumor} \
-r ~{reference} \
-p $(expr $(nproc) -1) \
-d output/ \
--scatter
>>>
runtime {
docker: "dx://file-GFyq05j0VGFqJqq54q98pbBK"
cpu: 16
}
output {
Array[File]+ cns = glob("output/[!.call]*.cns")
Array[File]+ cns_filtered = glob("output/*.call.cns")
Array[File]+ plot = glob("output/*-scatter.png")
}
}
Input Specification
You will now be prompted for each input parameter to your app. Each parameter
should have a unique name that uses only the underscore "_" and alphanumeric
characters, and does not start with a number.
1st input name (<ENTER> to finish): input_file
Label (optional human-readable name) []: Input file
Your input parameter must be of one of the following classes:
applet array:file array:record file int
array:applet array:float array:string float record
array:boolean array:int boolean hash string
Choose a class (<TAB> twice for choices): file
This is an optional parameter [y/n]: n
2nd input name (<ENTER> to finish): quality_score
Label (optional human-readable name) []: Quality score
Choose a class (<TAB> twice for choices): int
This is an optional parameter [y/n]: y
A default value should be provided [y/n]: y
Default value: 30
Output Specification
You will now be prompted for each output parameter of your app. Each
parameter should have a unique name that uses only the underscore "_" and
alphanumeric characters, and does not start with a number.
1st output name (<ENTER> to finish): output_file
Label (optional human-readable name) []: Output file
Choose a class (<TAB> twice for choices): file
"inputSpec": [
{
"name": "input_file",
"label": "Input file",
"class": "file",
"optional": false,
"patterns": [
"*"
],
"help": ""
},
{
"name": "quality_score",
"label": "Quality score",
"class": "int",
"optional": true,
"default": 30,
"help": ""
}
],
{
"name": "input_file",
"label": "Input file",
"class": "file",
"optional": false,
"patterns": [
"*.fastq",
"*.fq"
],
"help": ""
}
wget https://github.com/agordon/fastx_toolkit/releases/download/0.0.14/fastx_toolkit-0.0.14.tar.bz2
tar xvf fastx_toolkit-0.0.14.tar.bz2
x ./bin/fasta_clipping_histogram.pl
x ./bin/fasta_formatter
x ./bin/fasta_nucleotide_changer
x ./bin/fastq_masker
x ./bin/fastq_quality_boxplot_graph.sh
x ./bin/fastq_quality_converter
x ./bin/fastq_quality_filter
x ./bin/fastq_quality_trimmer
x ./bin/fastq_to_fasta
x ./bin/fastx_artifacts_filter
x ./bin/fastx_barcode_splitter.pl
x ./bin/fastx_clipper
x ./bin/fastx_collapser
x ./bin/fastx_nucleotide_distribution_graph.sh
x ./bin/fastx_nucleotide_distribution_line_graph.sh
x ./bin/fastx_quality_stats
x ./bin/fastx_renamer
x ./bin/fastx_reverse_complement
x ./bin/fastx_trimmer
x ./bin/fastx_uncollapser
mkdir -p mytrimmer/resources/usr/bin/
cp PATH_TO_FASTX/fastq_quality_trimmer mytrimmer/resources/usr/bin/
#!/bin/bash
set -exuo pipefail
main() {
echo "Value of input_file: '$input_file'"
echo "Value of quality_score: '$quality_score'"
dx download "$input_file" -o "$input_file_name"
outfile="${input_file_prefix}.filtered.fastq"
fastq_quality_trimmer -Q 33 -t ${quality_score} -i "$input_file_name" -o "$outfile"
outfile_id=$(dx upload $outfile --brief)
dx-jobutil-add-output output_file "$outfile_id" --class=file
}
wget https://dl.dnanex.us/F/D/Bp43z7pb2JX8jpB035j4424Vp4Y6qpQ6610ZXg5F/small-celegans-sample.fastq
dx upload small-celegans-sample.fastq
[===========================================================>]
Uploaded 16,801,690 of 16,801,690 bytes (100%) small-celegans-sample.fastq
ID file-GJ2k2V80vx88z3zyJbVXZj3G
Class file
Project project-GJ2k24j0vx804FPyBbxqpQBk
Folder /
Name small-celegans-sample.fastq
State closing
Visibility visible
Types -
Properties -
Tags -
Outgoing links -
Created Tue Oct 11 08:52:37 2022
Created by kyclark
Last modified Tue Oct 11 08:52:53 2022
Media type
archivalState "live"
cloudAccount "cloudaccount-dnanexus"$ dx build mytrimmer -f
{"id": "applet-GJ2k5780vx804FPyBbxqpQQ0"}$ dx run applet-GJ2k5780vx804FPyBbxqpQQ0 -h
usage: dx run applet-GJ2k5780vx804FPyBbxqpQQ0 [-iINPUT_NAME=VALUE ...]
Applet: FastQTrimmer
mytrimmer
Inputs:
Input file: -iinput_file=(file)
Quality score: [-iquality_score=(int, default=30)]
Outputs:
Output file: output_file (file)
$ dx run applet-GJ2k5780vx804FPyBbxqpQQ0 \
> -iinput_file=file-GJ2k2V80vx88z3zyJbVXZj3G -y --watch
Using input JSON:
{
"input_file": {
"$dnanexus_link": "file-GJ2k2V80vx88z3zyJbVXZj3G"
}
}
Calling applet-GJ2k5780vx804FPyBbxqpQQ0 with output destination
project-GJ2k24j0vx804FPyBbxqpQBk:/
Job ID: job-GJ2k5F00vx84k2X3BqqZ5Zpp
Job Log
-------
Watching job job-GJ2k5F00vx84k2X3BqqZ5Zpp. Press Ctrl+C to stop watching.
2022-10-11 16:31:18 FastQTrimmer STDERR + echo 'Value of input_file:
'\''{"$dnanexus_link": "file-GJ2k2V80vx88z3zyJbVXZj3G"}'\'''
2022-10-11 16:31:18 FastQTrimmer STDERR + echo 'Value of quality_score:
'\''30'\'''
2022-10-11 16:31:18 FastQTrimmer STDOUT Value of input_file:
'{"$dnanexus_link": "file-GJ2k2V80vx88z3zyJbVXZj3G"}'
2022-10-11 16:31:18 FastQTrimmer STDOUT Value of quality_score: '30'
2022-10-11 16:31:18 FastQTrimmer STDERR + dx download '{"$dnanexus_link":
"file-GJ2k2V80vx88z3zyJbVXZj3G"}' -o small-celegans-sample.fastq
2022-10-11 16:31:19 FastQTrimmer STDERR + outfile=
small-celegans-sample.filtered.fastq
2022-10-11 16:31:19 FastQTrimmer STDERR + fastq_quality_trimmer -Q 33
-t 30 -i small-celegans-sample.fastq -o small-celegans-sample.filtered.fastq
2022-10-11 16:31:27 FastQTrimmer STDERR ++ dx upload
small-celegans-sample.filtered.fastq --brief
2022-10-11 16:31:28 FastQTrimmer STDERR + outfile_id=
file-GJ2zkYj06GbzP8XBB4bVGxQ6
2022-10-11 16:31:28 FastQTrimmer STDERR + dx-jobutil-add-output output_file
file-GJ2zkYj06GbzP8XBB4bVGxQ6 --class=file
$ dx download file-GJ2k73j08bbkVxK9Gxx8Z891
[===========================================================>]
Completed 15,557,666 of 15,557,666 bytes (100%) .../fastq_trimmer/small-celegans-sample.filtered.fastq
$ wc -l small-celegans-sample.f*
100000 small-celegans-sample.fastq
99848 small-celegans-sample.filtered.fastq
199848 total
nextflow.enable.dsl=2
//params.fastq_dir will be exposed as a pipeline input and is given a default here
params.fastq_dir = "./FASTQ/*.fq.gz"
//make a fastq ch
fastq_ch = Channel.fromPath(params.fastq_dir)
workflow {
//fastqc
// takes in a fastq_ch and outputs a channel with fastqc html and zip files
raw_fastqc_ch = fastqc(fastq_ch)
//takes in a fastq_ch and outputs a channel with trimmed reads
trimmed_reads_ch = read_trimming(fastq_ch)
//takes in the trimmed reads channel this time
trimmed_fastqc_ch = fastqc_trimmed(trimmed_reads_ch)
//combine the two channels together to use them in multiqc
combined_fastqc_ch = raw_fastqc_ch.mix(trimmed_fastqc_ch)
//takes in a channel containing fastqc files
//collect is used here to make all files available at the same time.
multiqc(combined_fastqc_ch.collect())
}
nextflow run main.nf --fastq_dir "/FASTQ/SRR_*.fastq.gz"
nextflow run main.nf
{
"report_html": {
"dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY"
},
"stats_txt": {
"dnanexus_link": "file-G4x7GXQ0VBzZxFxz4fqV120B"
}
}
{
"name": "fastqc",
"title": "FastQC Reads Quality Control",
"summary": "Generates a QC report on reads data",
"dxapi": "1.0.0",
"openSource": true,
"version": "3.0.3",
"inputSpec": [
{
"name": "reads",
"label": "Reads",
"help": "A file containing the reads to be checked. Accepted formats are gzipped-FASTQ and BAM.",
"class": "file",
"patterns": [
"*.fq.gz",
"*.fastq.gz",
"*.sam",
"*.bam"
]
},
...
}>>> { 'patterns': [ '*.bam', '*.sam', ] }
{'patterns': ['*.bam', '*.sam']}
{ "patterns": [ "*.bam", "*.sam", ] }
{
"patterns": ["*.bam", "*.sam", ]
}
Error: Parse error on line 2:
... ["*.bam", "*.sam", ]}
-----------------------^
Expecting 'STRING', 'NUMBER', 'NULL', 'TRUE', 'FALSE', '{', '[', got ']'
$ jsonlint dxapp.json
Error: Parse error on line 15:
...*.sam", ], "help
----------------------^
Expecting 'STRING', 'NUMBER', 'NULL', 'TRUE', 'FALSE', '{', '[', got ']'
$ cat minified.json
{"report_html":{"dnanexus_link":"file-G4x7GX80VBzQy64k4jzgjqgY"},"stats_txt":
{"dnanexus_link":"file-G4x7GXQ0VBzZxFxz4fqV120B"}}$ jq . minified.json
{
"report_html": {
"dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY"
},
"stats_txt": {
"dnanexus_link": "file-G4x7GXQ0VBzZxFxz4fqV120B"
}
}
$ jq .report_html example.json
{
"dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY"
}
$ jq .report_htm example.json
null
$ jq .report_html.dnanexus_link example.json
"file-G4x7GX80VBzQy64k4jzgjqgY"
$ jq . minified.json > prettified.json
$ cat prettified.json
{
"report_html": {
"dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY"
},
"stats_txt": {
"dnanexus_link": "file-G4x7GXQ0VBzZxFxz4fqV120B"
}
}
$ cat minified.json | jq .
{
"report_html": {
"dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY"
},
"stats_txt": {
"dnanexus_link": "file-G4x7GXQ0VBzZxFxz4fqV120B"
}
}
$ jq . < example.json
{
"report_html": {
"dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY"
},
"stats_txt": {
"dnanexus_link": "file-G4x7GXQ0VBzZxFxz4fqV120B"
}
}
$ dx describe app-fastqc --json | head
{
"id": "app-G81jg5j9jP7qxb310vg2xQkX",
"class": "app",
"billTo": "org-dnanexus_apps",
"created": 1644399511000,
"modified": 1644401066806,
"createdBy": "user-jkotrs",
"name": "fastqc",
"version": "3.0.3",
"aliases": [$ dx describe app-fastqc --json | jq .id
"app-G81jg5j9jP7qxb310vg2xQkX"$ dx find projects --public --json | head
[
{
"id": "project-F0yyz6j9Jz8YpxQV8B8Kk7Zy",
"level": "VIEW",
"permissionSources": [
"PUBLIC"
],
"public": true,
"describe": {
"id": "project-F0yyz6j9Jz8YpxQV8B8Kk7Zy",$ dx find projects --public --json | jq ".[].id" | head -3
"project-F0yyz6j9Jz8YpxQV8B8Kk7Zy"
"project-G4FX3QXKzJxqXxGpK2pJ7Z3K"
"project-FGX8gVQB9X7K5f1pKfPvz9yG"$ dx find projects --public --json | jq ".[] | .id" | head -n 3
"project-F0yyz6j9Jz8YpxQV8B8Kk7Zy"
"project-G4FX3QXKzJxqXxGpK2pJ7Z3K"
"project-FGX8gVQB9X7K5f1pKfPvz9yG"$ jq .input job.json
{
"reads": {
"$dnanexus_link": "file-BQbXKk80fPFj4Jbfpxb6Ffv2"
},
"format": "auto",
"kmer_size": 7,
"nogroup": true
}
$ jq .input job.json > input.json
$ dx describe job-G4x7G5j0B3K2FKzgP654ZqpK --json | jq .input > input.json
$ dx run app-G4YyQ9044b90F1vG8y9YkKk3 -f input.json
$ jq ".[].state" rap-jobs.json | sort | uniq -c | sort -rn
15 "failed"
3 "done"
2 "terminated"$ jq '.[] | select (.state | contains("failed")) | .id, .executable' rap-jobs.json | head
"job-G6jj9k8JPXfG42094KG5JFX4"
"applet-G6jj9b0JPXf5Q6ZF4G85K156"
"job-G6jj1zQJPXf34z8v4KqjZKP1"
"applet-G6jg9p8JPXf4Q9Pb4GgPK8Vp"
"job-G6jg9vQJPXfGbJb54GFkJ33Y"
"applet-G6jg9p8JPXf4Q9Pb4GgPK8Vp"
"job-G6jg7Y0JPXfG6q53G12vQZK8"
"applet-G6jg6pQJPXf7ypXq33B75Qq1"
"job-G6jg57QJPXf90Jjv4K8pgkG7"
"applet-G6jfg90JPXfGZkVb7PPxjpPY"$ jq '.[] | select (.state | contains("failed")) | .id, .executable' rap-jobs.json | paste - -
"job-G6jj9k8JPXfG42094KG5JFX4" "applet-G6jj9b0JPXf5Q6ZF4G85K156"
"job-G6jj1zQJPXf34z8v4KqjZKP1" "applet-G6jg9p8JPXf4Q9Pb4GgPK8Vp"
"job-G6jg9vQJPXfGbJb54GFkJ33Y" "applet-G6jg9p8JPXf4Q9Pb4GgPK8Vp"
"job-G6jg7Y0JPXfG6q53G12vQZK8" "applet-G6jg6pQJPXf7ypXq33B75Qq1"
"job-G6jg57QJPXf90Jjv4K8pgkG7" "applet-G6jfg90JPXfGZkVb7PPxjpPY"
"job-G6jZk6jJPXf1q1Py5VKX6gJK" "applet-G6jZjG0JPXf7ZxZP4G5v0X1k"
"job-G6jYY28JPXfFvFXY4GXB6jG2" "applet-G6jYXq0JPXf5Q6ZF4G85JVgG"
"job-G6jY9FQJPXf3pj894GFJ02jy" "applet-G6jY7zQJPXfG42094KG5Gkyy"
"job-G6jY858JPXfBKX1X0j434BY5" "applet-G6jY7zQJPXfG42094KG5Gkyy"
"job-G6jY740JPXf7V2vJ4G2Gkfj7" "applet-G6jY6zQJPXf81J984K6kfB3V"
"job-G6jY5v8JPXfPGQq15k77zPJ9" "applet-G6jY5jjJPXf6Ffqg4GqF4KPg"
"job-G6jY4k0JPXfPGQq15k77zP9Q" "applet-G6jY39jJPXfG42094KG5GkV9"
"job-G6jXPJQJPXfBbf694G3Fg07K" "applet-G6jXJJjJPXf7V2vJ4G2GkFbF"
"job-G6jX7yQJPXfFjzffKJzpqfj7" "applet-G6jX7JQJPXf3V99x4Gx7K09X"
"job-G6jVzJ0JPXf5Q6ZF4G85JG09" "applet-G6jVxQQJPXfGZ0BF33KZfX5Y"jq '.[] | select (.state | contains("failed")) | .id, .executable' \
rap-jobs.json | paste - - | \
while read JOB_ID APP_ID; do echo dx run $APP_ID --clone $JOB_ID; done
dx run "applet-G6jj9b0JPXf5Q6ZF4G85K156" --clone "job-G6jj9k8JPXfG42094KG5JFX4"
dx run "applet-G6jg9p8JPXf4Q9Pb4GgPK8Vp" --clone "job-G6jj1zQJPXf34z8v4KqjZKP1"
dx run "applet-G6jg9p8JPXf4Q9Pb4GgPK8Vp" --clone "job-G6jg9vQJPXfGbJb54GFkJ33Y"
dx run "applet-G6jg6pQJPXf7ypXq33B75Qq1" --clone "job-G6jg7Y0JPXfG6q53G12vQZK8"
dx run "applet-G6jfg90JPXfGZkVb7PPxjpPY" --clone "job-G6jg57QJPXf90Jjv4K8pgkG7"
dx run "applet-G6jZjG0JPXf7ZxZP4G5v0X1k" --clone "job-G6jZk6jJPXf1q1Py5VKX6gJK"
dx run "applet-G6jYXq0JPXf5Q6ZF4G85JVgG" --clone "job-G6jYY28JPXfFvFXY4GXB6jG2"
dx run "applet-G6jY7zQJPXfG42094KG5Gkyy" --clone "job-G6jY9FQJPXf3pj894GFJ02jy"
dx run "applet-G6jY7zQJPXfG42094KG5Gkyy" --clone "job-G6jY858JPXfBKX1X0j434BY5"
dx run "applet-G6jY6zQJPXf81J984K6kfB3V" --clone "job-G6jY740JPXf7V2vJ4G2Gkfj7"
dx run "applet-G6jY5jjJPXf6Ffqg4GqF4KPg" --clone "job-G6jY5v8JPXfPGQq15k77zPJ9"
dx run "applet-G6jY39jJPXfG42094KG5GkV9" --clone "job-G6jY4k0JPXfPGQq15k77zP9Q"
dx run "applet-G6jXJJjJPXf7V2vJ4G2GkFbF" --clone "job-G6jXPJQJPXfBbf694G3Fg07K"
dx run "applet-G6jX7JQJPXf3V99x4Gx7K09X" --clone "job-G6jX7yQJPXfFjzffKJzpqfj7"
dx run "applet-G6jVxQQJPXfGZ0BF33KZfX5Y" --clone "job-G6jVzJ0JPXf5Q6ZF4G85JG09"dx find jobs --state failed --json | jq '.[] | .id, .executable' | paste - - | \
while read JOB_ID APP_ID; do echo dx run $APP_ID --clone $JOB_ID; done

| Column type | Description | Example |
| --- | --- | --- |
| string | A string column has free-text values. This is the default fallback type when Data Profiler fails to cast a column type. | Patient’s name; Patient’s ID |
| integer | An integer column has integer values. | Number of children |
| float | A float column has float values. | Weight; Height |
| datetime | A datetime column has datetime values. The default time zone is UTC. | Date of birth |
| unknown | The column is empty. | |
The Dataset-level screen is the default screen when you open Data Profiler. It has the Table Relationships and Table Summary pages. In this section, we describe each component of the screen and its key values.
The default screen of Data Profiler is at the Table Relationships page of the Dataset level
The Manage Tables controller allows you to hide/show the table(s) from the data profile. The table(s) which are hidden from the ERD will also be hidden from the Data Hierarchy. In order to manage the table display, click on the ‘Manage’ button on the bottom right corner of the screen, then use the toggle to hide/show the tables, and click on the ‘Apply’ button to apply the changes.
Open the ‘Manage Tables’ controller to show/hide the table(s)
The data profile is updated after the ‘patients’ table is hidden
A Relationship Diagram (left) with some selected edges highlighted in blue. The selected edges create a Diagram of Overlaps (right)
This is a simplified Entity Relationship Diagram displayed as a graph. Each node represents a table in your dataset, and each edge represents a column that links two tables. The linked columns are the referenced_entity_field in the data_dictionary. The direction of an edge represents the reference from a foreign-key column to a primary-key column
FAQs
Question: There are tables supposed to be linked to each other. Why do they appear unlinked in Data Profiler?
Answer: The linkage between any two tables is determined by the data_dictionary. Data Profiler does not remove or add linkages to a dataset. You should check your data_dictionary again and make sure that the linkage is correctly specified.
By clicking on one or more edges, you can view a Diagram of Overlaps that shows how many values the linked columns share between the tables. There are several chart types for a Diagram of Overlaps:
Venn diagram is the default chart type of Diagram of Overlaps. Each set in this diagram is a table in the selection. The numbers are the values from the column in the selection.
Question: How should I interpret a Venn diagram having 2 tables, patients and measurements, and the value of their intersection is 90? The column is patient_id.
Answer: The patients and measurements tables share 90 patient_ids, which means there are 90 patients that have measurement data.
Euler diagrams share the same concept as Venn diagrams. The only difference is that the sizes of the overlap sections are proportional to the overlap values.
An Upset plot counts the values of all possible non-empty combinations of the selected tables. This plot type is more scalable than the Venn or Euler diagram.
A common use case of Upset plot is to help answer questions such as “How many patients have full information across tables?”. By creating an Upset plot between the “patients” table and other tables (e.g. diagnosis, measurement, sequence_run, etc.), we can answer the questions by looking at the number of patient ids that are shared across all tables.
The Summary page provides a summary of both tables and columns in the dataset. Below are the details of each section.
The summary of all Tables and Columns in the Dataset
The Table Summary shows information about all tables in the dataset. Each row displays various statistics for a table in your dataset, including:
# Columns, # Rows: the number of columns, the number of rows
Column types: data type of all columns in a table
Duplication Rate: the rate of duplication of a whole row in the table
Missing Rate: the rate of having an empty cell in the table
You can click on the hamburger button at the header of each column to sort or filter the data as needed.
Clicking on the hamburger button to sort or filter the data
The Column Summary provides details about every column in the dataset, with each row presenting the following information for a specific column:
Column name: name of the column
Key type: the attributes that are used to define the relationships of tables
Description: the title of a column (if provided in the data dictionary file)
Provided type: the type of data in the column which is specified in the data dictionary file. If the data dictionary is not provided, it is ‘unknown’
Inferred types: the type of data in the column inferred by Data Profiler if the data dictionary is not provided. If the data dictionary is provided, it will be the same as the Provided type
Missing Rate: the rate of having an empty cell in a column
Duplication Rate: the rate of duplication of values in a column
You can also click on the hamburger button at the header of each column to sort or filter the data as needed.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select “Contact Support”
Fill in the Subject and Message to submit a support ticket.
| Assay Type | Source | Notes |
| --- | --- | --- |
| Expression | TCGA (via GDC) | Data is publicly available (RNA-Seq, STAR - Counts) from GDC from this page and was downloaded on May 16, 2025. |
| Somatic | TCGA (via GDC/cBioPortal) | Derived from public SNV, CNV, and Fusion data: SNV data are publicly available and were downloaded from GDC on October 17, 2024; CNV segmented copy number data (.SEG files) are publicly available and were downloaded from GDC on October 6, 2025; Fusion data are publicly available and were downloaded from cBioPortal on September 27, 2025. |
| Germline | Synthetic Data Only | TCGA germline data is not publicly available. This component uses simulated genotypes. |
You can use both the phenotypical and genomic data when creating a cohort.
The phenotypic data (which is one database) is processed and combined with the genomic data (another database) to ensure that they are paired appropriately, and that forms a dataset.
You can then use the dataset in Apollo to perform various actions, such as visualizing the data, analyzing all or part of it (called a cohort), and collaborating with others on a particular dataset.
Each dataset has an important structure.
First, a dataset lies on top of a database. A dataset can be copied, moved around the platform, and even deleted. A database, however, cannot; if a database is removed, the ingestion process has to be repeated.
Datasets are the top level structure of the data.
Each dataset has entities, which are equivalent to tables. The tables contain fields.
Fields are the variables.
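If you prefer the command line, the dx extract_dataset command (described in the dx command list above) can enumerate these entities and fields directly from a dataset record. A minimal sketch; the record ID is a placeholder:
$ dx extract_dataset record-xxxx --list-entities   # one row per entity (table)
$ dx extract_dataset record-xxxx --list-fields     # entity.field names across the dataset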
The graphic below also explains the relationship:
Datasets are patient-centric. All the information goes back to the patient.
This is important for filtering. If a patient, for example, takes a medication more than once during the progression of their illness, there will be more instances of that medication than there are people in the cohort.
Here is a summary graphic of how the data is considered to be patient-centric:
Once data is ingested, they are available as separate Spark databases. Apollo unifies accessing data in these databases through what's called a dataset.
A dataset can be thought of as a giant multi-omics matrix.
Datasets can be further refined into Cohorts within the Apollo Interface, allowing complex queries across omics types.
Underlying Apollo is a technology called Spark. All data in Apollo is stored in it.
It is made to handle very large datasets and enable fast queries that can't be handled by single computers.
It does this by creating RDDs (resilient distributed datasets), which are distributed across the worker nodes. Each node handles only part of the query and reports its results back, which is why the queries are very fast.
Details about RDDs can be found and
Spark databases mean you can query across many columns in the dataset relatively quickly, compared to using a single computer.
Once data is ingested, they are available as separate Spark databases. Apollo unifies accessing data in these databases through what's called a dataset.
A dataset can be thought of as a giant multi-omics matrix.
Datasets can be further refined into Cohorts within the Apollo Interface, allowing complex queries across omics types.
| Assay Type | Source | Notes |
| --- | --- | --- |
| Clinical | TCGA (via cBioPortal) | Data is publicly available ("full" 32 studies) from cBioPortal on October 17, 2024. |









In this example, you will:
Learn to write a native DNAnexus applet that executes a Python program
Use the dxpy module to download and upload files
Use the Python subprocess module to execute an external process and check the return value
We'll use the same scarlet.txt file from the bash version of the wc applet. Start off using dx-app-wizard and define the same inputs and outputs as before, but be sure to choose Python for the Programming language:
The Python template looks like the following:
@dxpy.entry_point('main'): the DNAnexus execution environment entry point
The input_file listed in the inputSpec is passed to main.
Create a DXFile object.
Update src/python_wc.py to the following:
Import the getstatusoutput function.
Use the local filename input_file.txt.
The output file will be called output.txt.
Shadow the input_file variable, overwriting it with the creation of a new DXFile object.
NOTE: Portable Operating System Interface (POSIX) standards dictate that processes return 0 on success (i.e., zero errors) and some positive integer value (usually in the range 1-127) to indicate an error condition.
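You can see this convention in any shell; for example:
$ true; echo $?       # a successful command exits with 0
0
$ false; echo $?      # a failing command exits with a non-zero value
1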
Run dx build to build the applet. Create a job_input.json file with the file ID of your input:
Run your applet with the input file using --watch to see the output:
I can inspect the contents of the output file:
I can verify this is correct by piping the input file to a local execution of wc:
You can shorten the build/run development cycle by naming the JSON input job_input.json and executing the Python program locally:
This will download the input as input_file.txt and then create a new local file with the system call:
You have now translated the bash applet for running wc into a native DNAnexus Python applet.
You were introduced to the dxpy module that provides functions for making API calls.
You used subprocess.getstatusoutput to call an external process and interpret the return value for success or failure.
In the next section, we'll continue translating bash to Python.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
In this example, you will translate the bash app from the previous chapter into Workflow Definition Language (WDL).
You will learn how to:
Use Java Jar files to validate and compile WDL
Use WDL to define an applet's inputs, outputs, and runtime specs
Compile a WDL task into an applet
You will not use a wizard to start this applet, so manually create a directory for your work. Create a file called fastq_trimmer.wdl with the following contents:
This line indicates that the WDL follows the version 1.0 specification.
The task defines the body of the applet.
The input block defines the same inputs, a File called input_file and an Int (integer) value called quality_score with a default value of 30.
To start, validate your WDL with WOMtool:
Before compiling the WDL into an applet, use dx pwd to ensure you are in your desired project. If not, run dx select to select a different project, then use the following command to compile the applet:
Use dx run as in the previous chapter to run the applet with the -h|--help option to see that the usage looks identical to the bash version:
From the perspective of the user, there is no difference between native/bash applets and those written in WDL. You should use whichever syntax you find most convenient for the task at hand. For instance, this applet leverages an existing Docker container created by the Biocontainers project rather than adding the binary as a resource.
You can run the applet using the command-line arguments as shown, or you can create a JSON file with the arguments as follows:
You can run the applet and watch the job with the following command:
The output will look quite different from the bash app, but the basics are still the same. In this version, notice that you do not need to download the inputs or upload the outputs. Once the input files are in place, the command block is run and the input files and variables are dereferenced properly. When the job has completed, run dx describe to see the inputs and outputs:
Download the output file to ensure it looks like a correct result:
You may find it useful to create a Makefile with all the steps documented in a runnable fashion:
Now you can run make compile rather than type out the rather long Java command.
The WDL version of the FastQTrimmer applet is arguably simpler than the bash version. It uses just one file, fastq_trimmer.wdl, and about 20 lines of text, whereas the bash version requires at least dxapp.json, a bash script, and the resources tarball.
In this chapter, you learned how to:
Use a Biocontainers Docker image for the necessary binary executables from FASTX toolkit
Define the same inputs, outputs, and commands as the bash applet from Chapter 3
Use a Makefile to define project shortcuts to validate, compile, and run an applet
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
This tutorial uses the same samtools applet as the earlier example but uses a public Docker image instead of an asset.
Please start the Cloud Workstation Application by typing in the following command into the terminal:
Once the Cloud Workstation Application has started, pull the image from the repository, save the Docker image within the Workstation, and then use dx upload to put the saved image onto the project space.
First, pull the Docker Image using the following command:
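For example (the repository, image name, and tag below are placeholders; substitute the samtools image you intend to use):
$ docker pull <repository>/<image>:<tag>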
A license is required to access the Data Profiler on the DNAnexus Platform. For more information, please contact DNAnexus Sales (via [email protected]).
The Data Profiler is an app within the DNAnexus Tool Library that supports data cleaning and harmonization. It organizes your data into three levels of information: Dataset level, Table level, and Column level. Each level surfaces interactive visualizations on data quality, data coverage, and descriptive statistics to help you understand and identify potential data issues. The Data Profiler also includes an Explorer Mode where you can create customizable visualizations using simple drag-and-drop functionality, for deeper exploration beyond the standard metrics. Researchers can bring their data to the Platform and leverage the Data Profiler app to explore and quickly evaluate the readiness of the data for downstream analysis.
In this exercise, we'll demonstrate a native DNAnexus Python applet that runs the fastq_quality_trimmer binary.
You will learn:
How to use a DXFile object to get file metadata
How to use Python functions to choose an output filename using the input file's name





Upload the local output file.
Add the DX file ID to the output dictionary.
Return the output
Call dxpy.download_dxfile to download the input file identified by the file ID to the local_file name.
Execute wc on the local input file and redirect (>) the output to the chosen output filename. This function returns a tuple containing the process's return value and output (STDOUT/STDERR).
If the return value is not zero, use sys.exit to abort the program with the output from the system call.
If the program makes it to this point, the output file should have been created to upload.
Return a Python dictionary with the DNAnexus link to the new outfile object.
This line defines a variable called basename which uses the basename function to get the filename of the input file.
The command block will be executed at runtime. It uses the tilde/twiddle syntax (~{}) to dereference variables. The output is written to a filename using the basename of the input.
The output defines a single File called output_file.
The runtime block specifies a Biocontainers Docker image that contains the FASTX toolkit binaries.
The inputs and outputs are the same as in the bash version of this applet. You can start from scratch using dx-app-wizard with the following input specs:
| Name | Class | Optional | Default |
| --- | --- | --- | --- |
| input_file | file | No | NA |
| quality_score | int | Yes | 30 |

The output specs are as follows:

| Name | Class |
| --- | --- |
| output_file | file |
Or you can use the dxapp.json from the bash version and change the runSpec file to the name of your Python script and the interpreter to python3 as follows:
Inside your applet's source code, create resources/usr/local/bin and copy the fastq_quality_trimmer bin to this location. At runtime, the binary will be available at /usr/local/bin/fastq_quality_trimmer, which is in the standard $PATH.
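For example, from the directory containing your applet source (the applet directory name here is hypothetical; PATH_TO_FASTX is wherever you built or unpacked the FASTX toolkit):
$ mkdir -p python_fastq_trimmer/resources/usr/local/bin
$ cp PATH_TO_FASTX/fastq_quality_trimmer python_fastq_trimmer/resources/usr/local/bin/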
Update the Python code to the following:
The input_file will be the DNAnexus file ID (e.g., file-FvQGZb00bvyQXzG3250XGbgz), and the quality_score will be an integer value.
Use DXFile.describe to get a Python dictionary of metadata.
Choose a local filename by using either the file's name from the metadata or the file ID.
Download the input file to the chosen local filename.
Split the filename into a basename and extension.
Create an output filename using the input basename and a new extension to indicate that the data has been filtered.
Format a command string.
Print the command for debugging purposes.
Execute the command and check the return value.
If the code makes it to this point, upload the output file and return the file ID to be attached to the job's output.
Run dx build in your source directory to create the new applet. Use the new applet ID to execute the applet with a small FASTQ file:
Use dx head to verify the output looks like a FASTQ file:
To verify that the applet did winnow the number of reads, I can pipe the output of dx cat to wc to verify that the output file has fewer lines than the input file:
You used DXFile to get the input file's name
Your output filename is based on the input file's name rather than a static name like output.txt.
You can call Python's print function to add your own STDOUT/STDERR to the applet, which can be an aid in debugging your program.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Template Options
You can write your app in any programming language, but we provide
templates for the following supported languages: Python, bash
Programming language: Python
#!/usr/bin/env python
# python_wc 0.1.0
# Generated by dx-app-wizard.
#
# Basic execution pattern: Your app will run on a single machine from
# beginning to end.
#
# See https://documentation.dnanexus.com/developer for documentation and
# tutorials on how to modify this file.
#
# DNAnexus Python Bindings (dxpy) documentation:
# http://autodoc.dnanexus.com/bindings/python/current/
import os
import dxpy
@dxpy.entry_point('main') # 1
def main(input_file): # 2
# The following line(s) initialize your data object inputs on the platform
# into dxpy.DXDataObject instances that you can start using immediately.
input_file = dxpy.DXFile(input_file) # 3
# The following line(s) download your file inputs to the local file system
# using variable names for the filenames.
dxpy.download_dxfile(input_file.get_id(), "input_file") # 4
# Fill in your application code here.
# The following line(s) use the Python bindings to upload your file outputs
# after you have created them on the local file system. It assumes that you
# have used the output field name for the filename for each output, but you
# can change that behavior to suit your needs.
outfile = dxpy.upload_local_file("outfile") # 5
# The following line fills in some basic dummy output and assumes
# that you have created variables to represent your output with
# the same name as your output fields.
output = {}
output["outfile"] = dxpy.dxlink(outfile) # 6
return output # 7
dxpy.run()
#!/usr/bin/env python
import dxpy
import sys
from subprocess import getstatusoutput # 1
@dxpy.entry_point("main")
def main(input_file):
local_file = "input_file.txt" # 2
output_file = "output.txt" # 3
input_file = dxpy.DXFile(input_file) # 4
dxpy.download_dxfile(input_file.get_id(), local_file) # 5
rv, out = getstatusoutput(f"wc {local_file} > {output_file}") # 6
if rv != 0: # 7
sys.exit(out)
outfile = dxpy.upload_local_file(output_file) # 8
return {"outfile": dxpy.dxlink(outfile)} # 9
dxpy.run()
{
"input_file": {
"$dnanexus_link": "file-GgGX7Y8071x46B02JGb515pB"
}
}
$ dx run applet-GgGX740071xJY20Gjkp0JYXB -f python_wc/job_input.json \
-y --watch \
--destination project-GXY0PK0071xJpG156BFyXpJF:/output/python_wc/
Using input JSON:
{
"input_file": {
"$dnanexus_link": "file-GgGX7Y8071x46B02JGb515pB"
}
}
Calling applet-GgGX740071xJY20Gjkp0JYXB with output destination
project-GXY0PK0071xJpG156BFyXpJF:/output/python_wc
Job ID: job-GgGX8P0071x1yfFPkJ8662gQ
Job Log
-------
Watching job job-GgGX8P0071x1yfFPkJ8662gQ. Press Ctrl+C to stop watching.
* Python implementation of wc (python_wc:main) (running) job-GgGX8P0071x1yfFPkJ8662gQ
kyclark 2024-02-23 16:03:24 (running for 0:01:39)
2024-02-23 16:11:36 Python implementation of wc INFO Logging initialized (priority)
2024-02-23 16:11:36 Python implementation of wc INFO Logging initialized (bulk)
2024-02-23 16:11:40 Python implementation of wc INFO Setting SSH public key
2024-02-23 16:11:42 Python implementation of wc STDOUT dxpy/0.369.0 (Linux-5.15.0-1053-aws-x86_64-with-glibc2.29) Python/3.8.10
2024-02-23 16:11:43 Python implementation of wc STDOUT Invoking main with {'input_file': {'$dnanexus_link': 'file-GgGX7Y8071x46B02JGb515pB'}}
* Python implementation of wc (python_wc:main) (done) job-GgGX8P0071x1yfFPkJ8662gQ
kyclark 2024-02-23 16:03:24 (runtime 0:01:36)
Output: outfile = file-GgGXGFj0FbZxjvk1jZPJPkG2
$ dx cat file-GgGXGFj0FbZxjvk1jZPJPkG2
8596 86049 513778 input_file.txt
$ dx cat file-GgGX7Y8071x46B02JGb515pB | wc
8596 86049 513778
$ python3 src/python_wc.py
Invoking main with {'input_file': {'$dnanexus_link': 'file-GgGX7Y8071x46B02JGb515pB'}}
$ cat output.txt
8596 86049 513778 input_file.txt
version 1.0
task fastq_trimmer {
input {
File input_file
Int quality_score = 30
}
String basename = basename(input_file)
command <<<
fastq_quality_trimmer -Q 33 -t ~{quality_score} \
-i ~{input_file} -o ~{basename}.filtered.fastq
>>>
output {
File output_file = "~{basename}.filtered.fastq"
}
runtime {
docker: "biocontainers/fastxtools:v0.0.14_cv2"
}
}
$ java -jar ~/womtool.jar validate fastq_trimmer.wdl
Success!
$ java -jar ~/dxCompiler.jar compile fastq_trimmer.wdl
[warning] Project is unspecified...using currently selected project project-GJ2k24j0vx804FPyBbxqpQBk
applet-GJ2pgv80vx84zJ4XJF6GPXz7
usage: dx run applet-GJ2pgv80vx84zJ4XJF6GPXz7 [-iINPUT_NAME=VALUE ...]
Applet: fastq_trimmer
Inputs:
input_file: -iinput_file=(file)
quality_score: [-iquality_score=(int, default=30)]
Reserved for dxCompiler
overrides___: [-ioverrides___=(hash)]
overrides______dxfiles: [-ioverrides______dxfiles=(file) [-ioverrides______dxfiles=... [...]]]
Outputs:
output_file: output_file (file)
$ cat inputs.json
{
"input_file": {
"$dnanexus_link": "file-GJ2k2V80vx88z3zyJbVXZj3G"
},
"quality_score": 35
}
$ dx run applet-GJ2pgv80vx84zJ4XJF6GPXz7 -f inputs.json -y --watch
Using input JSON:
{
"input_file": {
"$dnanexus_link": "file-GJ2k2V80vx88z3zyJbVXZj3G"
},
"quality_score": 35
}
Calling applet-GJ2pgv80vx84zJ4XJF6GPXz7 with output destination
project-GJ2k24j0vx804FPyBbxqpQBk:/
Job ID: job-GJ2ppvQ0vx88k8bv9pvGyjGX
Job Log
-------
Watching job job-GJ2ppvQ0vx88k8bv9pvGyjGX. Press Ctrl+C to stop watching.
$ dx describe job-GJ2ppvQ0vx88k8bv9pvGyjGX
Result 1:
ID job-GJ2ppvQ0vx88k8bv9pvGyjGX
Class job
Job name fastq_trimmer
Executable name fastq_trimmer
Project context project-GJ2k24j0vx804FPyBbxqpQBk
Region aws:us-east-1
Billed to org-sos
Workspace container-GJ2ppx80773k09b8F6qKGJBb
Applet applet-GJ2pgv80vx84zJ4XJF6GPXz7
Instance Type mem1_ssd1_v2_x2
Priority high
State done
Root execution job-GJ2ppvQ0vx88k8bv9pvGyjGX
Origin job job-GJ2ppvQ0vx88k8bv9pvGyjGX
Parent job -
Function main
Input input_file = file-GJ2k2V80vx88z3zyJbVXZj3G
quality_score = 35
Output output_file = file-GJ2pv300773ypy03Jg2vYZ9f
...
$ dx download file-GJ2pv300773ypy03Jg2vYZ9f
[===========================================================>]
Completed 14,357,774 of 14,357,774 bytes (100%) ~/fastq_trimmer_wdl/small-celegans-sample.fastq.filtered.fastq
$ wc -l small-celegans-sample.fastq.filtered.fastq
98624 small-celegans-sample.fastq.filtered.fastq
WDL = fastq_trimmer.wdl
PROJECT_ID = project-GJ2k24j0vx804FPyBbxqpQBk
DXCOMPILER = java -jar ~/dxCompiler.jar
CROMWELL = java -jar ~/cromwell.jar
WOMTOOL = java -jar ~/womtool.jar
WORKFLOW_ID = applet-GJ2pgv80vx84zJ4XJF6GPXz7
validate:
$(WOMTOOL) validate $(WDL)
check:
miniwdl check $(WDL)
compile:
$(DXCOMPILER) compile $(WDL) \
-archive \
-folder /workflows \
-project $(PROJECT_ID)
run:
dx run $(WORKFLOW_ID) \
-f inputs.json \
--destination $(PROJECT_ID):/output \
-y --watch "runSpec": {
"timeoutPolicy": {
"*": {
"hours": 1
}
},
"interpreter": "python3",
"file": "src/python_fastq_trimmer.py",
"distribution": "Ubuntu",
"release": "20.04",
"version": "0"
},
#!/usr/bin/env python3
import dxpy
import os
import sys
from subprocess import getstatusoutput
@dxpy.entry_point("main")
def main(input_file, quality_score): # 1
input_file = dxpy.DXFile(input_file)
desc = input_file.describe() # 2
local_file = desc.get("name", input_file.get_id()) # 3
dxpy.download_dxfile(input_file.get_id(), local_file) # 4
basename, ext = os.path.splitext(local_file) # 5
outfile = f"{basename}.filtered{ext}" # 6
cmd = ( # 7
f"fastq_quality_trimmer -Q 33 -t {quality_score} "
f"-i {local_file} -o {outfile}"
)
print(cmd) # 8
rv, out = getstatusoutput(cmd) # 9
if rv != 0:
sys.exit(out)
dx_output_file = dxpy.upload_local_file(outfile) # 10
return {"output_file": dxpy.dxlink(dx_output_file)}
dxpy.run()
$ dx run applet-GgKQ5qQ071x5yX7fgbq3PkXB \
> -f python_fastq_trimmer/job_input.json -y --watch \
> --destination project-GXY0PK0071xJpG156BFyXpJF:/output/python_fastq_trimmer/
Using input JSON:
{
"input_file": {
"$dnanexus_link": "file-FvQGZb00bvyQXzG3250XGbgz"
},
"quality_score": 28
}
Calling applet-GgKQ5qQ071x5yX7fgbq3PkXB with output destination
project-GXY0PK0071xJpG156BFyXpJF:/output/python_fastq_trimmer
Job ID: job-GgKQ6x0071x6kf34P5xy2q2b
Job Log
-------
Watching job job-GgKQ6x0071x6kf34P5xy2q2b. Press Ctrl+C to stop watching.
* Python version of fastq_trimmer (python_fastq_trimmer:main) (running)
* job-GgKQ6x0071x6kf34P5xy2q2b
kyclark 2024-02-26 14:32:36 (running for 0:00:21)
2024-02-26 14:33:17 Python version of fastq_trimmer INFO Logging initialized
(priority)
2024-02-26 14:33:17 Python version of fastq_trimmer INFO Logging initialized
(bulk)
2024-02-26 14:33:21 Python version of fastq_trimmer INFO Downloading bundled
file resources.tar.gz
2024-02-26 14:33:22 Python version of fastq_trimmer STDOUT >>> Unpacking
resources.tar.gz to /
2024-02-26 14:33:22 Python version of fastq_trimmer STDERR tar: Removing
leading `/' from member names
2024-02-26 14:33:22 Python version of fastq_trimmer INFO Setting SSH public key
2024-02-26 14:33:23 Python version of fastq_trimmer STDOUT dxpy/0.369.0
(Linux-5.15.0-1053-aws-x86_64-with-glibc2.29) Python/3.8.10
2024-02-26 14:33:23 Python version of fastq_trimmer STDOUT Invoking main with
{'input_file': {'$dnanexus_link': 'file-FvQGZb00bvyQXzG3250XGbgz'},
'quality_score': 28}
2024-02-26 14:33:24 Python version of fastq_trimmer STDOUT
fastq_quality_trimmer -Q 33 -t 28 -i small-celegans-sample.fastq -o
small-celegans-sample.filtered.fastq
* Python version of fastq_trimmer (python_fastq_trimmer:main) (done)
* job-GgKQ6x0071x6kf34P5xy2q2b
kyclark 2024-02-26 14:32:36 (runtime 0:00:20)
Output: output_file = file-GgKQ79j0B2FQjGbk0qX6j64B
$ dx head file-GgKQ79j0B2FQjGbk0qX6j64B
@SRR070372.1 FV5358E02GLGSF length=78
TTTTTTTTTTTTTTTTTTTTTTTTTTTNTTTNTTTNTTTNTTTATTTATTTATTTATTATTATATATATATA
+SRR070372.1 FV5358E02GLGSF length=78
...000//////999999<<<=<<666!602!777!922!688:669A9=<=122569AAA?>@BBBBAA?=
@SRR070372.2 FV5358E02FQJUJ length=177
TTTCTTGTAATTTGTTGGAATACGAGAACATCGTCAATAATATATCGTATGAATTGAACCACACGGCACATATTTGAACTTGTTCGTGAAATTTAGCGAACCTGGCAGGACTCGAACCTCCAATCTTCGGATCCGAAGTCCGACGCCCCCGCGTCGGATGCGTTGTTACCACTGCTT
+SRR070372.2 FV5358E02FQJUJ length=177
222@99912088>C<?7779@<GIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIC;6666IIIIIIIIIIII;;;HHIIE>944=>=;22499;CIIIIIIIIIIIIHHHIIIIIIIIIIIIIIIH?;;;?IIEEEEEEEEIIII77777I7EEIIEEHHHHHIIIIIIIIIIIIII
@SRR070372.3 FV5358E02GYL4S length=70
TTGGTATCATTGATATTCATTCTGGAGAACGATGGAACATACAAGAATTGTGTTAAGACCTGCAT
$ dx cat file-GgKQ79j0B2FQjGbk0qX6j64B | wc -l
99952
$ dx cat file-FvQGZb00bvyQXzG3250XGbgz | wc -l
100000
The path will include the tag from the Docker Repository.
Use up to date Docker Images from reliable sources
Next, save the Docker Image:
-o : the output file. The file name needs to end with .tar.gz
The image will be referenced with the path, including tags
Finally, upload the saved image to the project:
Add --path project-ID:/ to the dx upload command to ensure that it is added to the Cloud Workspace Container.
When finished uploading, you can test the Docker image from the Cloud Workstation using docker run (see the example commands after this list),
or terminate the Cloud Workstation job, and then proceed to building the applet.
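As a concrete example, the full sequence from the Cloud Workstation might look like the following (using the samtools image from this walkthrough; substitute your own image name, tag, and project ID):
docker pull biocontainers/samtools:v1.9-4-deb_cv1
docker save -o samtools.tar.gz biocontainers/samtools:v1.9-4-deb_cv1
dx upload samtools.tar.gz --path project-ID:/
docker run -it biocontainers/samtools:v1.9-4-deb_cv1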
We will use dx-app-wizard to create a skeleton applet structure with these files:
First, give the applet a name. The prompt shows that only letters, numbers, a dot, underscore, and a dash can be used. As stated earlier, this applet name will also be the name of the directory. Use samtools_count_docker_bundle:
Next is the title. Note that the prompt includes empty square brackets ([]), which show the default value used if Enter is pressed. Because the title is not required, the default is an empty string; add an informational title, "Samtools Count":
Likewise, the summary field is not required:
The version is also optional, and press Enter to take the default:
There is one input for this applet, which is a BAM file.
Use the parameters for the input section:
name: bam
label: BAM file
class: file
optional: false
When prompted for the first input, enter the following:
The name of the input will be used as a variable in the bash code, so use only letters, numbers, and underscores as in bam or bam_file.
The label is optional, as noted by the empty square brackets.
The types include primitives like integers, floating-point numbers, and strings, as well as arrays of primitive types.
This is a required input. If an input is optional, provide a default value.
When prompted for the second input, press Enter:
There is one output for this applet, which is a counts file.
Use the parameters for the output section:
name: counts
label: counts file
class: file
When prompted for the first output name, enter the following:
This name will also become a bash variable, so best practice is to use letters, numbers, and underscores.
The label is optional.
The class must be from the preceding list. To be reminded of the choices, press the Tab key twice.
When prompted for the second output, press Enter:
Here are the final settings to complete the wizard:
Timeout Policy: 48h
Programming language: bash
Access to internet: No (default)
Access to parent project: No (default)
Instance Type: mem1_ssd1_v2_x4 (default)
Applets are required to set a maximum time for running to prevent a job from running an excessively long time. While some applets may legitimately need days to run, most probably need something in the range of 12-48 hours. As noted in the prompt, use m, h, or d to specify minutes, hours, or days, respectively:
For the template language, select from bash or Python for the program that is executed when the applet starts. The applet code can execute any program available in the execution environment, including custom programs written in any language. Choose bash:
Next, determine if the applet has access to the internet and/or the parent project. Unless the applet specifically needs access, such as to download a file at runtime, it's best to answer no:
Lastly, specify a default instance type. The prompt includes an abbreviated list of instance types. The final number indicates the number of cores, e.g., _x4 indicates 4 cores. The greater the number of cores, the more available memory and disk space. In this case, a small 4-core instance is sufficient:
The user is always free to override the instance type using the --instance-type option to dx run.
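For example, a hypothetical run of the finished applet on a larger 8-core instance might look like this (the applet ID and input file ID are placeholders):
dx run applet-xxxx -ibam=file-xxxx --instance-type mem1_ssd1_v2_x8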
The final output from dx-app-wizard is a summary of the files that are created:
Readme.developer.md : This file should contain applet implementation details.
Readme.md: This file should contain user help.
dxapp.json: The answers from dx-app-wizard are used to create the app metadata.
resources/ : The resources directory is for any additional files you want available on the runtime instance.
src/ : The src (pronounced "source") is a conventional place for source code, but it's not a requirement that code lives in this directory.
src/samtools_count.sh : This is the bash script that will be executed when the applet is run.
test/ The test directory is empty and will not be discussed in this section.
The contents of the resources directory will be placed into the root directory of the runtime instance. For instance, if there is a file resources/my_tool, then it will be available on the runtime instance as /my_tool. In the bash code, reference the full path (/my_tool) or extend the $PATH variable to include /. A better practice is to create the directory structure resources/usr/local/bin/, so the file ends up at /usr/local/bin/my_tool, as /usr/local/bin is normally part of $PATH.
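As a sketch, assuming a helper script or binary named my_tool (the name used above for illustration), the recommended layout can be created like this:
mkdir -p samtools_count_docker_bundle/resources/usr/local/bin
cp my_tool samtools_count_docker_bundle/resources/usr/local/bin/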
Dxapp.json
This is where the formatting from the dx-app-wizard is listed in a .json file. If needed, change the settings for the output, input, version, etc within the json file.
The first section is the metadata, as shown below:
The next section(s) are Inputs and Outputs, shown below:
Finally, the last section is the Additional Settings, shown below:
Adding A Docker Image into the Resources Folder
Add your Docker Image to the resources folder.
dx download the samtools.tar.gz
mv samtools.tar.gz to the samtools_count_docker_bundle/resources/ folder
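For example, assuming the image archive was uploaded to the root of your project (project-ID is a placeholder):
dx download project-ID:/samtools.tar.gz
mv samtools.tar.gz samtools_count_docker_bundle/resources/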
Samtools_docker.sh
Update the following .sh code file for this applet:
#!/bin/bash is the “shebang” command to show that it is a bash script
set -exuo pipefail is the pragma to show each command as it is executed and to halt on undefined variables or failed system calls
Within the “main” section, there are code lines that:
Echo the value of the input, “bam”, using the name $bam, which is part of the input Spec
Download the input file onto the job instance, with the output being the name of the bam file (ex: ___.bam)
The first Docker command, which loads the saved Docker image, samtools.tar.gz (which is in the resources folder)
Assigning a counts_id variable for the name of the counts file output for samtools
The second Docker Command
Docker run to run the Docker Image
-v /home/dnanexus:/home/dnanexus to mount the volume
The name of the Docker Image, including the tag.
Assigning a variable (upload) for uploading the counts file back to the project
Using the upload variable AND the output spec in the json file for the dx-jobutil-add-output command
Once you have added the Docker Image to the resources folder and edited the .sh and .json files, use the following command to create your applet in the project of your choice:
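A minimal sketch of that build step, run from the directory above the applet directory (optionally add --destination to target a specific project and folder):
dx build samtools_count_docker_bundle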
Then, proceed to test your applet!
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
We'll call our new applet python_cnvkit. If you want to start from dx-app-wizard, use the following specs for the inputs and outputs:
bam_tumor | array:file | No | NA
reference | file | No | NA
The output specs are as follows:
cns | array:file
cns_filtered | array:file
plot | array:file
You can also copy the bash applet directory and update the runSpec in dxapp.json to run a Python script and use the CNVKit asset from before:
Here is the input.json:
Update src/python_cnvkit.py to the following:
Use a Python list comprehension to generate a list of file IDs for the tumor BAM files.
Download the reference file.
Initialize a list to hold the downloaded BAM paths.
Download each BAM file into a directory and append the path to the bam_files list.
Create, print, and run the command to execute CNVkit.
Find all the files created in the output directory. os.listdir returns only the filenames, so join each with the directory name.
For each of the output file categories, filter the output files and upload the output files matching the expected extension.
Compile the given regular expression.
Create a DX file ID link for each uploaded file.
Filter the given files for those matching the regex.
NOTE: The regex (?<!\.call)\.cns$ uses a negative lookbehind to ensure that .call does not precede .cns.
Here is the output from the job:
You used a for loop to download multiple input BAM files into a local directory.
You used regular expressions to classify the output files into the three output labels.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
The Data Profiler app saves significant time by generating consistent and comprehensive reports on data quality. It supports informed decision-making by allowing experts to fully understand the data before downstream analysis. From data collection and cleaning to feature engineering, continuously profiling data to understand its evolution and to maintain consistent quality throughout the transformation process helps identify potential issues early, enabling adjustments that optimize analysis and performance.
This tool quickly analyzes and visualizes large datasets from CSV, Parquet, or DNAnexus Apollo Dataset (or Cohort) inputs. The point-and-click solution efficiently provides summary statistics and visualizations, enabling a comprehensive understanding of the data. It also highlights data inconsistencies and complexities (e.g., missing and imbalanced data) in a logical and organized manner, guiding you through the structure and content of your data.
There are two ways to run the application:
Direct Access: Go to this link to open the app.
Platform Navigation: Click on the top navigation bar, then select Tools, proceed to the tool library, search for the “Data Profiler” app, select it, then select Run within the documentation page to start the app.
To run the app, you need to provide the required input files, which are .csv or .parquet files, or a DNAnexus Apollo Dataset (or Cohort).
If you run the app with .csv files or .parquet files, there is an optional input for the Data Dictionary. This is the same Data Dictionary used by Data Model Loader to generate the DNAnexus Apollo Dataset.
Input name | Mandatory/Optional | Input type/format | Description
input_files | Optional | A list of CSV, TSV, TXT, or Parquet files | This is the data that will be profiled by this application. Each file is a table in your dataset. Only one of the following two options should be provided: input_files and dx_record.
dx_record | Optional | A DNAnexus Apollo Dataset (or Cohort) | The data in this Dataset (or Cohort) will be profiled by this application.
data_dictionary | Optional | A CSV file | This file indicates the relationship between the tables in input_files. If not provided, the table relationship will be inferred in the job.
Tables for Inputs
For this example, there are 2 tables in your dataset:
patients.csv: a table with patient IDs and other clinical information of the patient
encounters.csv: a table of encounters (i.e., hospital visits) of all patients in patients.csv
patients.csv
patient_id | name
P0001 | John Doe
P0002 | Jane Roe
encounters.csv
encounter_id | patient_id
E0001 | P0001
E0002 | P0001
E0003 | P0002
E0004 | P0002
In this example dataset, there are 2 patients in patients.csv, and each patient visited the hospital twice.
Data Dictionary
Even though data_dictionary is optional, it is crucial for cross-table functions in Data Profiler. We highly recommend creating one for your dataset.
The data_dictionary is a CSV file that tells Data Profiler how to connect patients.csv and encounters.csv. Given this example, the linked column between these tables is patient_id. The data_dictionary can be as simple as:
entity | name | type | primary_key_type | referenced_entity_field | relationship
patients | patient_id | string | | |
encounters | encounter_id | string | | |
encounters | patient_id | string | | patients:patient_id | many_to_one
There are more columns in the data_dictionary that are not mentioned in this example. However, those columns are not required. If you are interested in the full form of data_dictionary or the meaning of each column, please visit this documentation.
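As a sketch, the same dictionary written out as a raw CSV (with the unused columns simply left empty) could be created like this:
cat > data_dictionary.csv <<'EOF'
entity,name,type,primary_key_type,referenced_entity_field,relationship
patients,patient_id,string,,,
encounters,encounter_id,string,,,
encounters,patient_id,string,,patients:patient_id,many_to_one
EOF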
There is no need to specify anything in the OUTPUTS section. Once your inputs are ready, click Start Analysis to begin.
In the Review & Start modal, you can either customize the job settings before running the applet or leave them at their default values. The settings you can modify include:
Job Name
Output Location
Priority
Spending Limit
Instance Type
Once you’ve made your adjustments or are satisfied with the default settings, click Launch Analysis to start the job.
After launching the analysis, you will be redirected to the Monitor screen. From there, click the job name to view the job details.
It may take a few minutes for the applet to be ready. To check the status, click View Log and wait for the message indicating that the applet is ready. Once you see the message, click Open Worker URL to launch the app.
The Data Profiler is an HTTPS application on the DNAnexus Platform, which means it should be accessed via the Job URL. It typically takes a few minutes for the web interface to be ready. If you encounter any issues while visiting the Job URL, you can check the job logs for the following message:
Logs from a job instance of Data Profiler indicating the web interface is ready
If this line appears in your job logs, it confirms that the Data Profiler is ready to be accessed through the Job URL.
If you attempt to click the button before the URL is ready, you may encounter a “502 Bad Gateway” error. This is not a problem— it simply means you need to wait a bit longer before the environment is fully prepared.
If you run Data Profiler with a DNAnexus Apollo Dataset (or Cohort), you will be able to select the specific data fields to profile. If you want to profile the whole Dataset, select all data fields and start the job by clicking on the “Start profiling” button.
The table to select columns (data fields) to profile
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select “Contact Support”
Fill in the Subject and Message to submit a support ticket.
You can also set a banner for the home page
If you have questions about how to use a json file, please view this section
In your home.json file, you have to have this as the beginning of the json:
After that, you can customize exactly what you want.
There can be as many of these as you would like
You can also add in tables, images, and footers
EXAMPLE: Code for Images (not banner image):
EXAMPLE: Code for Tables:
EXAMPLE: Code for footer:
Please note that when you are done with your JSON, ensure it is in the correct format.
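One quick way to check that home.json is valid JSON, assuming Python 3 is available locally, is:
python3 -m json.tool home.json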
Please email [email protected] to create a support ticket if there are technical issues.
To get started, you will build a native bash applet that will execute the venerable wc (word count) Unix command-line program on a file. In this example, you will:
Use the dx-app-wizard to create the skeleton of a native bash applet
Define the inputs and outputs of an applet
Use dx build to build the applet
Import data from a URL
Use dx run to run the applet
The wc command takes one or more files as input. So that we have the same input file, please execute the following command to fetch the URL from Project Gutenberg and write the contents to the local file scarlet.txt:
Or use curl:
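These are the same commands shown in the transcript later in this section; either one works:
wget -O scarlet.txt https://www.gutenberg.org/cache/epub/33/pg33.txt
curl -o scarlet.txt https://www.gutenberg.org/cache/epub/33/pg33.txt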
By default, wc will print the three columns showing the number of lines, words, and characters of text, in that order, followed by the name of the file:
The output from your version of wc may differ slightly as there are several implementations of the program. For instance, the preceding output is on macOS, which is the BSD version, but the applet will run on Ubuntu Linux using the GNU version. Both programs work essentially the same.
The goal of this applet will be to accept a single file as input and capture the standard out (aka STDOUT) of wc to report the number of lines, words, and characters in the file.
Next, you will create an applet that will accept this file as input, transfer it to a virtual machine, run wc on the file, and return the preceding output as a new file. Run the dx-app-wizard to interactively answer questions about the inputs, outputs, and runtime requirements. Start by executing the program with the -h|--help flag to read the documentation:
As shown in the preceding usage, the name of the applet may be provided as an argument. For instance, you can run dx-app-wizard wc to answer the first question, which is the name of the applet. Note the naming conventions for the applet name, which you should also follow for naming the input and output variables:
Because the name was provided as an argument, the prompt shows [wc]. All the prompts will show a default value that will be used if you press the Enter key. If you wish to override this value, type a new name; otherwise, press Enter.
Next, you will be prompted for a title. The empty brackets ([]) indicate this is optional, but I will provide "Word Count":
Likewise, the summary is optional, but I will provide one:
Indicate the version with major, minor, and patch release:
The input specification follows. Use the name input_file for the first input name and whatever label you like. For the class, choose file to indicate that the user must supply a valid file, and specify that this input is not optional:
As this is the only input, press Enter when prompted for a second input and move to the output specification. To start, call the output outfile and use the class of file:
There is no other output for now, so press Enter to move on to the Timeout Policy. You may choose any amount of time you like such as "1h" to indicate 1 hour:
Next, you will choose whether to use bash or Python as the primary language of the applet. Choose bash:
Choosing bash means that your app will execute a bash script that will use commands from the dxpy module to do things like download and upload files as well as execute any command on the runtime instance, such as custom programs you write in Python, R, C, etc. Choosing Python here means that a Python script will be executed, and it can use the same Python module to do everything the bash script does. This tutorial will only demonstrate bash apps. There is no advantage one language has over the other. You should choose whichever suits your tastes.
During runtime, some apps may need to fetch resources from the internet or from the parent project. Neither of these will apply to this applet, so answer "no" for the next two questions:
Lastly, you will choose a default instance type on which the applet will run. I usually start with the default value, which is a fairly modest machine. If an applet proves it needs more resources, refer to the instance type documentation to choose something else:
The wizard will finish with a listing of the files it has created:
As noted, you will find the following structure in the directory wc:
A directory for tests, mostly used internally by DNAnexus.
A directory for assets like files or binaries you would like copied to the runtime instance.
A JSON file describing the metadata for the applet.
A documentation stub you may wish to update.
In the preceding step, the applet's inputs, outputs, and system requirements were written to the file dxapp.json, which is in JSON (JavaScript Object Notation) format. Open this file to inspect the contents, which begins with the basic metadata about the app:
The inputSpec section shows that this applet takes a single argument of the type file. Update the patterns to include .txt:
The outputSpec shows that the applet will return a file:
The runSpec describes the runtime for the applet:
The default VM is Ubuntu 20.04, which includes Python v3 and R v3. You may also indicate Ubuntu 16.04, which has Python v2.
If you need Ubuntu 16.04 with Python v3, indicate version 1 here; otherwise, leave this 0.
The author has more success installing Python v2 on Ubuntu 20.04 rather than running an older Linux distro.
Finally, the regionalOptions describe the system requirements:
You may use a text editor to alter this file at any time, after which you will need to rebuild the applet.
As indicated in runSpec, the applet will execute the bash script src/wc.sh at runtime. The app wizard created a template that shows one method for downloading the input file and uploading the output file. Here is a modified version that removes most of the comments for the sake of brevity and adds the applet's business logic in the middle:
I've added this pragma to show each command as it's executed and to halt on undefined variables or failed system calls.
This will download the input file to a local file called input_file on the running instance.
Execute wc on input_file and redirect standard out to the file output.
The local variables $input_file and $output match the names used in the inputSpec and outputSpec. They will only exist at runtime.
Applets and data must live inside a project, so create a new one either using the web interface or the command line by executing dx new project:
Next, you will add the scarlet.txt file to the project. There are several ways you can do this. From the web interface, you can click the "Add" button, which will show two relevant options:
"Upload Data": This will allow you to upload a file from your local computer. You can drag and drop the file into the dialog box or use the file browser to select the file.
"Add Data From Server": This will launch an app that can import files accessible by a URL such as from a web address or FTP server. You should use the Project Gutenberg URL from earlier.
You can also use the dx upload command. If you created the project using the web interface, you will first need to run dx select to select your project:
Note the file's ID, which we will use later for the applet's input. If you use the web interface to upload, you can click the information "I" in the circle to see the file's metadata.
From the command line, you can use dx ls with the -l|--long option to see the file ID:
It's impossible to debug this program locally, so next you will build the applet and run it. If you are in the wc directory, run dx build to build the applet; if you are in the directory above, run dx build wc to indicate the directory that contains the applet. Subsequent builds will require the use of the -f|--overwrite or -a|--archive flag to indicate what to do with the previous version. For consistency's sake, I always run with the -f flag:
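For example, from inside the wc directory:
dx build -f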
From the web interface, you can now view a web form that will allow you to execute the applet.
You follow the same process that is listed in the Overview of the Platform section.
You can also run the applet from the command line using the applet's ID. To begin, use dx run with the -h|--help flag to see the inputs and outputs of the applet:
Run the same command without the help flag to enter an interactive session where you can indicate the input file using the file's ID noted earlier:
You may also specify the file on the command line:
Notice in both instances, the input is formatted as a JSON document for submission. Copy that JSON into a file with the following contents:
Use this file as the -f|--file input for the applet along with the -y flag to indicate you want to proceed without further confirmation and the --watch flag to enter into a watch of the applet's progress:
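For example, using the applet ID printed by dx build (yours will differ):
dx run applet-GGyGVP00K9Z4Z6VgBgkk0b06 -f inputs.json -y --watch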
The end of the job's output should look like the following:
Run dx describe on the indicated output file ID to see the metadata about the file. Then execute dx cat to see the contents of the file, which should be the same results as when the program ran locally:
In this chapter, you did the following:
Learned the structure of a native bash applet and how to use the wizard to create a new app
Built an app and ran it from the command line and the web interface
Inspected the output of an applet
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
In this example, you will learn:
How to accept a BAM file as a workflow input
Break the BAM into slices by chromosome
Distribute the slices in parallel to count the number of alignments in each
To begin, create a new directory called view_and_count and a workflow.wdl file.
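For example:
mkdir view_and_count
cd view_and_count
touch workflow.wdl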
Here is the workflow definition you should add:
The name of this workflow is bam_chrom_counter.
The workflow accepts a single, required File input that will be called bam as it is expected to be a BAM file.
Use a non-input declaration to define a String value naming the Docker image that contains Samtools.
Following is the slice_bam task, which uses samtools to index the input BAM file and break it into separate files for each of the 22 human chromosomes:
The inputs to this task are the BAM file and the name of the Docker image.
The command block uses triple-angle brackets because it must use the dollar sign ($) in shell code.
Use samtools index on the input BAM file for fast random access to the alignments.
The $(seq 22) command generates the chromosome numbers 1 through 22 used in the bash for loop.
The count_bam task is written to handle just one BAM slice:
This BAM input will be a slice of alignments for a given region. Naming this bam does not interfere with the bam variable in the workflow or any other task.
Use the samtools view command with -c|--count to count the number of alignments in the given file.
The output of this task uses the read_int function to read the STDOUT from the command as an integer value.
At this point, I like to use miniwdl to check the syntax:
As no errors are reported, I will compile this onto the DNAnexus platform:
Finally, I will run this workflow using a sample BAM file:
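For example, using the workflow and file IDs from the transcript later in this section (yours will differ):
dx run workflow-GFqF27j07GyZ33JX4vzqgK32 -istage-common.bam=file-G8V38KQ0zQ713kZGF6xQQvjJ -y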
Return to the DNAnexus website to monitor the progress of the analysis.
As the number of tasks increases, workflow definitions can get quite long. You can shorten the workflow.wdl by placing each task in a separate file, which also makes it easier to reuse a task in a separate workflow. To do this, create a subdirectory called tasks, and then create a file called tasks/slice_bam.wdl with the following contents:
Also create the file tasks/count_bam.wdl with the following contents:
Both of the preceding tasks are identical to the original definitions, but note that the files include a version that matches the version of the workflow. Change workflow.wdl as follows:
Use import to include WDL code from a file or URI. Note the use of the as clause to alias the imports using a different name.
Call task_slice_bam.slice_bam from the imported file using as to give it the same name as in the original workflow.
Do the same with task_count_bam.count_bam.
Use miniwdl to check your syntax, then use dxCompiler to create an app.
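For example (the jar location and project ID are placeholders; adjust to your setup):
miniwdl check workflow.wdl
java -jar ~/dxCompiler.jar compile workflow.wdl -archive -folder /workflows -project project-ID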
In this lesson, you learned how to:
Accept a file as a workflow input
Define a non-input declaration
Use scatter to run tasks in parallel
Use the output from one task as the input to another task
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
You can write the wc applet using Workflow Description Language (WDL), which is a high-level way to define and chain tasks. You will start by defining a single task, which compiles to an applet on the DNAnexus platform.
In this example, you will:
Write the wc applet using WDL
In the bash applet, the inputs, outputs, and runtime specifications are defined in the dxapp.json file, and the code that runs lives in a separate file. WDL combines all of this into a single file. Create a new directory for your work, and then add the following to a file called wc.wdl:
There are several versions of WDL, and this indicates the file will use version 1.0.
A task in WDL will compile to an applet in DNAnexus.
The input block equates to the inputSpec from the previous chapter. Each input value is declared with a type. Here the input is a File.
First, ensure you have a working Java compiler and have installed all the Java Jar files as described in Chapter 1. Use WOMtool to validate the WDL syntax:
If you installed the Python miniwdl program, you can also use it to check the syntax. The output on success is something like a parse tree:
To demonstrate the output on error, I'll change the word File to Fiel:
Here is the equivalent error from WOMtool:
The two tools are written in different languages (Java and Python) and have different stringencies of parsing and different ways of reporting errors. You may find it helpful to use both to track down errors.
First, use dx pwd to check if you are in your wc project; if not, use dx select to change. Now you can use the dxCompiler jar file you downloaded in Chapter 1 to compile the WDL into an applet:
Run the new applet from the CLI with the help flag to inspect the usage:
Whether you use bash or WDL to write an applet, the compiled result works the same for the user.
If you look in the web interface, you should see a new wc_wdl object in the project as shown in Figure 1.
Click on the applet to launch the user interface as shown in Figure 2. Select an input file and launch the applet.
As with the bash version, you can launch the applet using the command line arguments:
The output from the job will look different, but the result will be the same. You can use dx describe with the --json option to get a JSON document describing the entire job and pipe this to the jq tool to extract the output section:
The dx cat command allows you to quickly see the contents of the output file without having to download it to your computer:
This is the same output as from the previous chapter.
Depending on your comfort level with WDL, you may or may not find this version simpler than the bash version. The result is the same no matter how you write the applet, so it's a matter of taste as to which you should select.
In this chapter, you learned how to:
Write a WDL task
Use WOMtool and miniwdl to validate WDL syntax
Compile a WDL task into an applet
Use the JSON output from dx describe and jq
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
dx run app-cloud_workstation --instance-type mem1_ssd2_v2_x72 --ssh -y
docker pull biocontainers/samtools:v1.9-4-deb_cv1
docker save -o samtools.tar.gz biocontainers/samtools:v1.9-4-deb_cv1
dx upload samtools.tar.gz --path project-ID:/
docker run -it biocontainers/samtools:v1.9-4-deb_cv1
dx-app-wizard
DNAnexus App Wizard, API v1.0.0
Basic Metadata
Please enter basic metadata fields that will be used to describe your app. Optional fields are denoted by options with square brackets. At the end of this wizard, the files necessary for building your app will be generated from the answers you provide.
The name of your app must be unique on the DNAnexus platform. After creating your app for the first time, you will be able to publish new versions using the same app name. App names are restricted to alphanumeric characters (a-z, A-Z, 0-9), and the characters ".", "_", and "-".
App Name: samtools_count_docker_bundle
The title, if provided, is what is shown as the name of your app on the website. It can be any valid UTF-8 string.
Title []: Samtools Count
The summary of your app is a short phrase or one-line description of what your app does. It can be any UTF-8 human-readable string.
Summary []: Count SAM/BAM alignments
You can publish multiple versions of your app, and the version of your app is a string with which to tag a particular version. We encourage the use of Semantic Versioning for labeling your apps (see http://semver.org/ for more details).
Version [0.0.1]:
Input Specification
You will now be prompted for each input parameter to your app. Each parameter should have a unique name that uses only the underscore "_" and alphanumeric characters, and does not start with a number.
1st input name (<ENTER> to finish): bam
Label (optional human-readable name) []: BAM File
Your input parameter must be of one of the following classes:
applet array:file array:record file int
array:applet array:float array:string float record
array:boolean array:int boolean hash string
Choose a class (<TAB> twice for choices): file
This is an optional parameter [y/n]: n
2nd input name (<ENTER> to finish):
Output Specification
You will now be prompted for each output parameter of your app. Each parameter should have a unique name that uses only the underscore "_" and alphanumeric characters, and does not start with a number.
1st output name (<ENTER> to finish): counts
Label (optional human-readable name) []: Counts File
Choose a class (<TAB> twice for choices): file
2nd output name (<ENTER> to finish):
Timeout Policy
Set a timeout policy for your app. Any single entry point of the app that runs longer than the specified timeout will fail with a TimeoutExceeded error. Enter an int greater than 0 with a single-letter suffix (m=minutes,h=hours, d=days) (e.g. "48h").
Timeout policy [48h]:
Template Options
You can write your app in any programming language, but we provide templates for the following supported languages: Python, bash
Programming language: bash
Access Permissions
If you request these extra permissions for your app, users will see this fact when launching your app, and certain other restrictions will apply. For more information, see https://documentation.dnanexus.com/developer/apps/app-permissions.
Access to the Internet (other than accessing the DNAnexus API).
Will this app need access to the Internet? [y/N]: n
Direct access to the parent project. This is not needed if your app specifies outputs,which will be copied into the project after it's done running.
Will this app need access to the parent project? [y/N]: n
Default instance type: The instance type you select here will apply to all entry points in your app unless you override it. See https://documentation.dnanexus.com/developer/api/running-analyses/instance-types for more information.
Choose an instance type for your app [mem1_ssd1_v2_x4]:
*** Generating DNAnexus App Template... ***
Your app specification has been written to the dxapp.json file. You can specify more app options by editing this file directly (see https://documentation.dnanexus.com/developer for complete documentation).
Created files:
samtools_count_docker_bundle/Readme.developer.md
samtools_count_docker_bundle/Readme.md
samtools_count_docker_bundle/dxapp.json
samtools_count_docker_bundle/resources/
samtools_count_docker_bundle/src/
samtools_count_docker_bundle/src/samtools_count.sh
samtools_count_docker_bundle/test/
App directory created! See https://documentation.dnanexus.com/developer for tutorials on how to modify these files, or run "dx build samtools_count" or "dx build --create-app samtools_count_docker_bundle" while logged in with dx.
Running the DNAnexus build utility will create an executable on the DNAnexus platform. Any files found in the resources directory will be uploaded so that they will be present in the root directory when the executable is run.
{
"name": "samtools_count_docker_bundle",
"title": "Samtools Count",
"summary": " Count SAM/BAM alignments",
"dxapi": "1.0.0",
"version": "0.0.1",
"inputSpec": [
{
"name": "bam",
"label": "BAM file",
"class": "file",
"optional": false,
"patterns": [
"*.bam"
],
"help": ""
}
],
"outputSpec": [
{
"name": "counts",
"label": "counts file",
"class": "file",
"patterns": [
"*"
],
"help": ""
}
],
"runSpec": {
"timeoutPolicy": {
"*": {
"hours": 3
}
},
"interpreter": "bash",
"file": "src/samtools_docker.sh",
"distribution": "Ubuntu",
"release": "24.04",
"version": "0"
},
"regionalOptions": {
"aws:us-east-1": {
"systemRequirements": {
"*": {
"instanceType": "mem1_ssd1_v2_x4"
}
}
}
}
}
#!/bin/bash
set -exuo pipefail
main() {
echo "Value of bam: '$bam'"
dx download "$bam" -o "$bam_name"
docker load < "/samtools.tar.gz"
counts_id=${bam_prefix}.counts.txt
docker run -v /home/dnanexus:/home/dnanexus \
biocontainers/samtools:v1.9-4-deb_cv1 samtools view -c "/home/dnanexus/${bam_name}" > "/home/dnanexus/${counts_id}"
upload=$(dx upload "$counts_id" --brief)
dx-jobutil-add-output counts "$upload" --class=file
}
dx build samtools_count_docker_bundle
"runSpec": {
"timeoutPolicy": {
"*": {
"hours": 48
}
},
"interpreter": "python3",
"file": "src/python_cnvkit.py",
"distribution": "Ubuntu",
"release": "20.04",
"version": "0",
"assetDepends": [{"id": "record-GgP33b00BppJKpyyFxGpZJYf"}]
}
{
"bam_tumor": [
{
"$dnanexus_link": "file-GFxXjV006kZVQPb20G85VXBp"
}
],
"reference": {
"$dnanexus_link": "file-GFxXvpj06kZfP0QVKq2p2FGF"
}
}
#!/usr/bin/env python
import os
import dxpy
import re
import sys
from typing import List
from subprocess import getstatusoutput
@dxpy.entry_point("main")
def main(bam_tumor, reference):
bam_tumor = [dxpy.DXFile(item) for item in bam_tumor] # 1
reference = dxpy.DXFile(reference) # 2
reference_name = reference.describe().get("name", "reference.cnn")
dxpy.download_dxfile(reference.get_id(), reference_name)
bam_dir = "bams"
os.makedirs(bam_dir)
bam_files = [] # 3
for file in bam_tumor:
desc = file.describe()
file_id = file.get_id()
path = os.path.join(bam_dir, desc.get("name", file_id))
dxpy.download_dxfile(file_id, path) # 4
bam_files.append(path)
out_dir = "cnvkit-out"
cmd = (
f"cnvkit.py batch {' '.join(bam_files)} "
f"-r {reference_name} "
f"-p $(expr $(nproc) - 1) "
f"-d {out_dir} --scatter"
)
print(cmd)
rv, out = getstatusoutput(cmd) # 5
if rv != 0:
sys.exit(out)
out_files = [os.path.join(out_dir, file) for file in os.listdir(out_dir)] # 6
print(f'out_files = {",".join(out_files)}')
return {
"cns": upload("\.call\.cns$", out_files), # 7
"cns_filtered": upload("(?<!\.call)\.cns$", out_files),
"plot": upload("-scatter.png$", out_files),
}
def upload(pattern: str, paths: List[str]) -> List[str]:
"""Upload files matching a pattern and return DX link"""
regex = re.compile(pattern) # 8
return [
dxpy.dxlink(dxpy.upload_local_file(file)) # 9
for file in filter(regex.search, paths) # 10
]
dxpy.run()
Job Log
-------
Watching job job-GgP7Z30071x73vpBzXK1jk7X. Press Ctrl+C to stop watching.
* CNVKit (python_cnvkit:main) (running) job-GgP7Z30071x73vpBzXK1jk7X
kyclark 2024-02-27 17:10:52 (running for 0:01:57)
2024-02-27 17:13:28 CNVKit INFO Logging initialized (priority)
2024-02-27 17:13:28 CNVKit INFO Logging initialized (bulk)
2024-02-27 17:13:34 CNVKit INFO Downloading bundled file cnvkit_asset.tar.gz
2024-02-27 17:14:02 CNVKit STDOUT >>> Unpacking cnvkit_asset.tar.gz to /
2024-02-27 17:14:02 CNVKit STDERR tar: Removing leading `/' from member names
2024-02-27 17:15:36 CNVKit INFO Setting SSH public key
2024-02-27 17:15:39 CNVKit STDOUT dxpy/0.369.0
(Linux-5.15.0-1053-aws-x86_64-with-glibc2.29) Python/3.8.10
2024-02-27 17:15:40 CNVKit STDOUT Invoking main with {'bam_tumor':
[{'$dnanexus_link': 'file-GFxXjV006kZVQPb20G85VXBp'}], 'reference':
{'$dnanexus_link': 'file-GFxXvpj06kZfP0QVKq2p2FGF'}}
2024-02-27 17:16:16 CNVKit STDOUT Running "cnvkit.py batch
bams/HCC1187_1x_tumor_markdup.bam -r reference.cnn -p $(expr $(nproc) - 1) -d
cnvkit-out --scatter"
2024-02-27 17:19:57 CNVKit STDOUT out_files = {",".join(out_files)}
* CNVKit (python_cnvkit:main) (done) job-GgP7Z30071x73vpBzXK1jk7X
kyclark 2024-02-27 17:10:52 (runtime 0:07:54)
Output: cns = [ file-GgP7jF80K7VPVpkkkzyqBK2Q ]
cns_filtered = [ file-GgP7jF80K7V7q1jJVPYJj0pg,
file-GgP7jFQ0K7VFfb7BJ3YbYy60 ]
plot = [ file-GgP7jFQ0K7V115GPfGYB2j6b ]
{
"order": ["banner_image", "template_projects", "academy_links", "dnanexus_links"],
"components": {
"banner_image": {
"type": "image",
"id": "banner_image",
"src": "#banner_image.png"
},
"template_projects": {
"type": "project",
"id": "template_projects",
"title": "Template Projects",
"query": {
"tags": "Template Course",
"limit": 5
},
"columns":[
{
"property": "name",
"label": "Name"
},
{
"property": "level",
"formatter": "capitalize",
"label": "Access"
}
],
"viewMore": "/communities/academy_curriculum/projects",
"minWidth": "400px"
},
"academy_links": {
"type": "link",
"id": "academy_links",
"title": "DNAnexus Academy Links",
"links": [
{
"name": "Academy Documentation",
"href": "https://academy.dnanexus.com"
}
],
"minWidth": "400px"
},
"dnanexus_links": {
"type": "link",
"id": "dnanexus_links",
"title": "DNAnexus Links",
"links": [
{
"name": "DNAnexus Website",
"href": "https://www.dnanexus.com"
},
{
"name": "DNAnexus Documentation",
"href": "https://documentation.dnanexus.com"
}
],
"minWidth": "400px"
}
}
}
{
"order": [ #LIST #HERE ],
"components": {
#FILL WITH SECTIONS HERE
}
}
"banner_image": {
"type": "image",
"id": "banner_image", #keep the ids lower case and with no spaces
"src": "#banner_image.png" #you will need an image when you upload; change this name to whatever you want to call it, but leave the # in front of it
},
"template_projects": {
"type": "project",
"id": "template_projects", #keep the ids lower case and with no spaces
"title": "Template Projects", #this is what will show up on the portal as the name
"query": {
"tags": "Template Course", #this is the tag for my template course projects
"limit": 5 #this is how many of the courses I want to show up
},
"columns":[ #these are the columns you want viewable as part of your table. I picked name and access level.
{
"property": "name",
"label": "Name"
},
{
"property": "level",
"formatter": "capitalize",
"label": "Access"
}
],
"viewMore": "/communities/academy_curriculum/projects", #this sets the parameter for a list of the rest of the projects with the tag that I have selected.
"minWidth": "400px" #this sets the width on the portal home page for this section. If you want them to take up the whole page, you do not have to have this. I set it to 400 so that I could add multiple columns. If you do not set this, you will have these as rows, one table after another.
},
"academy_links": {
"type": "link",
"id": "academy_links", #keep the ids lower case and with no spaces
"title": "DNAnexus Academy Links", #title that shows up on the home page
"links": [
{
"name": "Academy Documentation", #name that shows up for the link
"href": "https://academy.dnanexus.com" #link I want used
}
],
"minWidth": "400px" #this sets the width on the portal home page for this section. If you want them to take up the whole page, you do not have to have this. I set it to 400 so that I could add multiple columns. If you do not set this, you will have these as rows, one table after another.
},
"dnanexus_links": {
"type": "link",
"id": "dnanexus_links", #keep the ids lower case and with no spaces
"title": "DNAnexus Links", #title that shows up for the home page
"links": [
{
"name": "DNAnexus Website", #name that shows up for the link
"href": "https://www.dnanexus.com" #link I want used
},
{
"name": "DNAnexus Documentation", #name that shows up for the link
"href": "https://documentation.dnanexus.com" #link I want used
}
],
"minWidth": "400px" #this sets the width on the portal home page for this section. If you want them to take up the whole page, you do not have to have this. I set it to 400 so that I could add multiple columns. If you do not set this, you will have these as rows, one table after another.
}
}
"example_image": {
"type": "image",
"id": "example-image", #id for order purposes
"src": "https://example.com/image.png", #you can set the source for this as a public link or with a "#" if you have the image locally.
"alt": "Alt text" #text
},
"table-example": {
"type": "markdown", #format for the table
"id": "table_example", #id for the order of content
"title": "Table Example",
"content": "LIST MARKDOWN CONTENT HERE FOR TABLE", #this will need to be your code for a table
"minWidth": "100px"
},
"footer": {
"name": "DNAnexus Help",
"href": "https://www.dnanexus.com/help"
},
"minWidth": "300px"
The samtools command that is being run in the applet, including the location of the output file as /home/dnanexus/${counts_id}

Another documentation stub.
A directory to place source code for the applet.
The bash script template to execute the applet.
This command will link the output file as an output of the applet.
The first call will be to the slice_bam task that will break the BAM into one file per chromosome. The input for this task is the workflow's BAM file.
The scatter directive in WDL causes the actions in the block to be executed in parallel, which can lead to significant performance gains. Here, the each slice file returned from the slice_bam task will be used as the input to the count_bam task.
The workflow defines two outputs: a BAM index file and an array of integer values representing the number of alignments in each of the BAM slices.
The samtools view command will output the alignments in BAM format for a region like "chr1" and place the output into the file slices/1.bam. Note the mix of ~ for WDL variables and $ for bash variables.
The runtime block allows you to define a Docker image that contains an installation of Samtools.
The output of this task is the BAM index, which is the given BAM file plus the suffix .bai, and the sliced alignment files.
The slices will be one or more files as indicated by Array[File], and they will be found using the glob function to look in the slices directory for all files with the extension .bam.
Mix ~ and $ in command blocks to dereference WDL and shell variables
Import WDL from external sources such as local files or remote URIs.
The command block contains the bash code that will be executed at runtime.
The output block equates to the outputSpec from the previous chapter. As with inputs, each output must declare a type.
The runtime block equates to the runSpec from the previous chapter. Here, you define that the task will use a Docker image of Ubuntu Linux 20.04.
Use dx cat to see the contents of a file on the DNAnexus platform


$ wget -O scarlet.txt https://www.gutenberg.org/cache/epub/33/pg33.txt
$ curl -o scarlet.txt https://www.gutenberg.org/cache/epub/33/pg33.txt
$ wc scarlet.txt
8590 86055 513523 scarlet.txt
$ dx-app-wizard -h
usage: dx-app-wizard [-h] [--json-file JSON_FILE] [--language LANGUAGE]
[--template {basic,parallelized,scatter-process-gather}]
[name]
Create a source code directory for a DNAnexus app. You will be prompted for
various metadata for the app as well as for its input and output
specifications.
positional arguments:
name Name of your app
optional arguments:
-h, --help show this help message and exit
--json-file JSON_FILE
Use the metadata and IO spec found in the given file
--language LANGUAGE Programming language of your app
--template {basic,parallelized,scatter-process-gather}
Execution pattern of your app
$ dx-app-wizard wc
DNAnexus App Wizard, API v1.0.0
Basic Metadata
Please enter basic metadata fields that will be used to describe your app.
Optional fields are denoted by options with square brackets. At the end of
this wizard, the files necessary for building your app will be generated from
the answers you provide.
The name of your app must be unique on the DNAnexus platform. After
creating your app for the first time, you will be able to publish new versions
using the same app name. App names are restricted to alphanumeric characters
(a-z, A-Z, 0-9), and the characters ".", "_", and "-".
App Name [wc]:
The title, if provided, is what is shown as the name of your app on
the website. It can be any valid UTF-8 string.
Title []: Word Count
The summary of your app is a short phrase or one-line description of
what your app does. It can be any UTF-8 human-readable string.
Summary []: Find the number of lines, words, and characters in a file
You can publish multiple versions of your app, and the version of your
app is a string with which to tag a particular version. We encourage the use
of Semantic Versioning for labeling your apps (see http://semver.org/ for more
details).
Version [0.0.1]: 0.1.0
Input Specification
You will now be prompted for each input parameter to your app. Each parameter
should have a unique name that uses only the underscore "_" and alphanumeric
characters, and does not start with a number.
1st input name (<ENTER> to finish): input_file
Label (optional human-readable name) []: Input file
Your input parameter must be of one of the following classes:
applet array:file array:record file int
array:applet array:float array:string float record
array:boolean array:int boolean hash string
Choose a class (<TAB> twice for choices): file
This is an optional parameter [y/n]: n
Output Specification
You will now be prompted for each output parameter of your app. Each
parameter should have a unique name that uses only the underscore "_" and
alphanumeric characters, and does not start with a number.
1st output name (<ENTER> to finish): output
Label (optional human-readable name) []: Output file
Choose a class (<TAB> twice for choices): file
Timeout Policy
Set a timeout policy for your app. Any single entry point of the app
that runs longer than the specified timeout will fail with a TimeoutExceeded
error. Enter an int greater than 0 with a single-letter suffix (m=minutes,
h=hours, d=days) (e.g. "48h").
Timeout policy [48h]: 1h
Template Options
You can write your app in any programming language, but we provide
templates for the following supported languages: Python, bash
Programming language: bash
Access to the Internet (other than accessing the DNAnexus API).
Will this app need access to the Internet? [y/N]: n
Direct access to the parent project. This is not needed if your app
specifies outputs, which will be copied into the project after it's done
running.
Will this app need access to the parent project? [y/N]: n
Default instance type: The instance type you select here will apply to
all entry points in your app unless you override it. See https://documenta
tion.dnanexus.com/developer/api/running-analyses/instance-types for more
information.
Choose an instance type for your app [mem1_ssd1_v2_x4]:
*** Generating DNAnexus App Template... ***
Your app specification has been written to the dxapp.json file. You can
specify more app options by editing this file directly (see
https://documentation.dnanexus.com/developer for complete documentation).
Created files:
wc/Readme.developer.md
wc/Readme.md
wc/dxapp.json
wc/resources/
wc/src/
wc/src/wc.sh
wc/test/
App directory created! See https://documentation.dnanexus.com/developer for
tutorials on how to modify these files, or run "dx build wc" or "dx build
--create-app wc" while logged in with dx.
Running the DNAnexus build utility will create an executable on the DNAnexus
platform. Any files found in the resources directory will be uploaded
so that they will be present in the root directory when the executable is run.
$ find wc
wc
wc/test # 1
wc/resources #2
wc/dxapp.json # 3
wc/Readme.md # 4
wc/Readme.developer.md # 5
wc/src # 6
wc/src/wc.sh # 7
{
"name": "wc",
"title": "Word Count",
"summary": "Find the number of lines, words, and characters in a file",
"dxapi": "1.0.0",
"version": "0.1.0",
"inputSpec": [
{
"name": "input_file",
"label": "Input file",
"class": "file",
"optional": false,
"patterns": [
"*.txt"
],
"help": ""
}
],
"outputSpec": [
{
"name": "output",
"label": "Output",
"class": "file",
"patterns": [
"*"
],
"help": ""
}
],
"runSpec": {
"timeoutPolicy": {
"*": {
"hours": 1
}
},
"interpreter": "bash",
"file": "src/wc.sh",
"distribution": "Ubuntu",
"release": "20.04",
"version": "0"
},
"regionalOptions": {
"aws:us-east-1": {
"systemRequirements": {
"*": {
"instanceType": "mem1_ssd1_v2_x4"
}
}
}
}
}
#!/bin/bash
set -exo pipefail
main() {
echo "Value of input_file: '$input_file'"
dx download "$input_file" -o input_file
wc input_file > output.txt
output_id=$(dx upload output.txt --brief)
dx-jobutil-add-output output "$output_id" --class=file
}
$ dx new project wc
Created new project called "wc" (project-GGyG8K80K9ZKzkX812yY893V)
Switch to new project now? [y/N]: y
$ dx select project-GGyG8K80K9ZKzkX812yY893V
Selected project project-GGyG8K80K9ZKzkX812yY893V
$ dx upload scarlet.txt
[===========================================================>]
Uploaded 513,523 of 513,523 bytes (100%) scarlet.txt
ID file-GGyG8z00K9Z9GQ9jG4qB4gpX
Class file
Project project-GGyG8K80K9ZKzkX812yY893V
Folder /
Name scarlet.txt
State closing
Visibility visible
Types -
Properties -
Tags -
Outgoing links -
Created Tue Oct 4 16:40:44 2022
Created by kyclark
Last modified Tue Oct 4 16:40:47 2022
Media type
archivalState "live"
cloudAccount "cloudaccount-dnanexus"
$ dx ls -l
Project: wc (project-GGyG8K80K9ZKzkX812yY893V)
Folder : /
State Last modified Size Name (ID)
closed 2022-10-04 16:40:48 501.49 KB scarlet.txt (file-GGyG8z00K9Z9GQ9jG4qB4gpX)
$ dx build -f
{"id": "applet-GGyGVP00K9Z4Z6VgBgkk0b06"}
$ dx run applet-GGyGVP00K9Z4Z6VgBgkk0b06 -h
usage: dx run applet-GGyGVP00K9Z4Z6VgBgkk0b06 [-iINPUT_NAME=VALUE ...]
Applet: Word Count
Find the number of lines, words, and characters in a file
Inputs:
Input file: -iinput_file=(file)
Outputs:
Output: output (file)
$ dx run applet-GGyGVP00K9Z4Z6VgBgkk0b06
Entering interactive mode for input selection.
Input: Input file (input_file)
Class: file
Enter file ID or path (<TAB> twice for compatible files in current directory,
'?' for more options)
input_file: file-GGyG8z00K9Z9GQ9jG4qB4gpX
Using input JSON:
{
"input_file": {
"$dnanexus_link": "file-GGyG8z00K9Z9GQ9jG4qB4gpX"
}
}
Confirm running the executable with this input [Y/n]: n
$ dx run applet-GGyGVP00K9Z4Z6VgBgkk0b06 -iinput_file=file-GGyG8z00K9Z9GQ9jG4qB4gpX
Using input JSON:
{
"input_file": {
"$dnanexus_link": "file-GGyG8z00K9Z9GQ9jG4qB4gpX"
}
}
Confirm running the executable with this input [Y/n]: n
$ cat inputs.json
{
"input_file": {
"$dnanexus_link": "file-GGyG8z00K9Z9GQ9jG4qB4gpX"
}
}
$ dx run applet-GGyGVP00K9Z4Z6VgBgkk0b06 -f inputs.json -y --watch
Using input JSON:
{
"input_file": {
"$dnanexus_link": "file-GGyG8z00K9Z9GQ9jG4qB4gpX"
}
}
Calling applet-GGyGVP00K9Z4Z6VgBgkk0b06 with output destination
project-GGyG8K80K9ZKzkX812yY893V:/
Job ID: job-GGyGZPQ0K9Z7PXybBp52P3xF
Job Log
-------
Watching job job-GGyGZPQ0K9Z7PXybBp52P3xF. Press Ctrl+C to stop watching.
2022-10-04 17:08:36 Word Count STDERR + wc input_file
2022-10-04 17:08:36 Word Count STDERR ++ dx upload output --brief
2022-10-04 17:08:37 Word Count STDERR + output=file-GGyGf100qZbvFjb3GqfG6kzj
2022-10-04 17:08:37 Word Count STDERR + dx-jobutil-add-output output
file-GGyGf100qZbvFjb3GqfG6kzj --class=file
$ dx cat file-GGyGf100qZbvFjb3GqfG6kzj
8590 86055 513523 input_file
version 1.0
workflow bam_chrom_counter {
input {
File bam
}
String docker_img = "quay.io/biocontainers/samtools:1.12--hd5e65b6_0"
call slice_bam {
input : bam = bam,
docker_img = docker_img
}
scatter (slice in slice_bam.slices) {
call count_bam {
input: bam = slice,
docker_img = docker_img
}
}
output {
File bai = slice_bam.bai
Array[Int] count = count_bam.count
}
}
task slice_bam {
input {
File bam
String docker_img
}
command <<<
set -ex
samtools index "~{bam}"
mkdir slices
for i in $(seq 22); do
samtools view -b -o "slices/$i.bam" "~{bam}" "chr${i}"
done
>>>
runtime {
docker: docker_img
}
output {
File bai = "~{bam}.bai"
Array[File] slices = glob("slices/*.bam")
}
}task count_bam {
input {
File bam
String docker_img
}
command <<<
samtools view -c "~{bam}"
>>>
runtime {
docker: docker_img
}
output {
Int count = read_int(stdout())
}
}$ miniwdl check workflow.wdl
workflow.wdl
workflow bam_chrom_counter
call slice_bam
scatter slice
call count_bam
task count_bam
task slice_bam$ java -jar ~/dxCompiler-2.10.2.jar compile workflow.wdl \
-archive \
-folder /workflows \
-project project-GFPQvY007GyyXgXGP7x9zbGb
workflow-GFqF27j07GyZ33JX4vzqgK32$ dx run workflow-GFqF27j07GyZ33JX4vzqgK32 \
> -istage-common.bam=file-G8V38KQ0zQ713kZGF6xQQvjJ -y
Using input JSON:
{
"stage-common.bam": {
"$dnanexus_link": "file-G8V38KQ0zQ713kZGF6xQQvjJ"
}
}
Calling workflow-GFqF27j07GyZ33JX4vzqgK32 with output destination
project-GFPQvY007GyyXgXGP7x9zbGb:/
Analysis ID: analysis-GFqF7Zj07GyZQ957Jy822gQYversion 1.0
task slice_bam {
input {
File bam
String docker_img
}
command <<<
set -ex
samtools index "~{bam}"
mkdir slices
for i in $(seq 22); do
samtools view -b -o "slices/$i.bam" "~{bam}" "chr${i}"
done
>>>
runtime {
docker: docker_img
}
output {
File bai = "~{bam}.bai"
Array[File] slices = glob("slices/*.bam")
}
}version 1.0
task count_bam {
input {
File bam
String docker_img
}
command <<<
samtools view -c "~{bam}"
>>>
runtime {
docker: docker_img
}
output {
Int count = read_int(stdout())
}
}version 1.0
import "./tasks/slice_bam.wdl" as task_slice_bam
import "./tasks/count_bam.wdl" as task_count_bam
workflow bam_chrom_counter {
input {
File bam
}
String docker_img = "quay.io/biocontainers/samtools:1.12--hd5e65b6_0"
call task_slice_bam.slice_bam as slice_bam {
input : bam = bam,
docker_img = docker_img
}
scatter (slice in slice_bam.slices) {
call task_count_bam.count_bam as count_bam {
input: bam = slice,
docker_img = docker_img
}
}
output {
File bai = slice_bam.bai
Array[Int] count = count_bam.count
}
}version 1.0
task wc_wdl {
input {
File input_file
}
command {
wc ~{input_file} > wc.txt
}
output {
File outfile = "wc.txt"
}
runtime {
docker: "ubuntu:20.04"
}
}$ java -jar ~/womtool.jar validate wc.wdl
Success!$ miniwdl check wc.wdl
wc.wdl
task wc$ miniwdl check wc.wdl
(wc.wdl Ln 13 Col 9) Unknown type Fiel
Fiel outfile = "wc.txt"
^^^^^^^^^^^^^^^^^^^^^^^java -jar ~/womtool.jar validate wc.wdl
Failed to process task definition 'wc' (reason 1 of 1):
No struct definition for 'Fiel' found in available structs: []
make: *** [validate] Error 1$ java -jar ~/dxCompiler.jar compile wc.wdl
[warning] Project is unspecified...using currently selected project
project-GGyG8K80K9ZKzkX812yY893V
applet-GJ3PxPj0K9Z68x1Y5zK4236B$ dx run applet-GJ3PxPj0K9Z68x1Y5zK4236B -h
usage: dx run applet-GJ3PxPj0K9Z68x1Y5zK4236B [-iINPUT_NAME=VALUE ...]
Applet: wc_wdl
Inputs:
input_file: -iinput_file=(file)
Reserved for dxCompiler
overrides___: [-ioverrides___=(hash)]
overrides______dxfiles: [-ioverrides______dxfiles=(file)
[-ioverrides______dxfiles=... [...]]]
Outputs:
outfile: outfile (file)$ dx run applet-GJ3PxPj0K9Z68x1Y5zK4236B \
> -iinput_file=file-GGyG8z00K9Z9GQ9jG4qB4gpX -y --watch
Using input JSON:
{
"input_file": {
"$dnanexus_link": "file-GGyG8z00K9Z9GQ9jG4qB4gpX"
}
}
Calling applet-GJ3PxPj0K9Z68x1Y5zK4236B with output destination
project-GGyG8K80K9ZKzkX812yY893V:/
Job ID: job-GJ3Q0V80K9Z54K2X9Bzf2v0B
Job Log
-------
Watching job job-GJ3Q0V80K9Z54K2X9Bzf2v0B. Press Ctrl+C to stop watching.$ dx describe job-GJ3Q0V80K9Z54K2X9Bzf2v0B --json | jq .output
{
"outfile": {
"$dnanexus_link": "file-GJ3Q10Q0b0qvyB6fG7pgx0bX"
}
}$ dx cat file-GJ3Q10Q0b0qvyB6fG7pgx0bX
8590 86055 513523 /home/dnanexus/inputs/input1217954139984307828/scarlet.txt
The cloud_workstation app provides a Linux (Ubuntu) terminal running in the cloud, which is the same base execution environment for all DNAnexus apps. This is used most often for testing application code and building Docker images. I especially favor the cloud workstation whenever I need to work with large data files that I don't wish to copy to my local disk (laptop), as the transfer speeds are internal to AWS rather than over the open internet. If you have previously been limited to HPC environments where sysadmins determine what software may or may not be installed, you will find that you have sudo privileges to install any software you like, via apt, downloading pre-built binaries, or building from source code.
In order to run the cloud workstation, you will need to set up an SSH key pair. You can do this by running the following command:
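$ dx ssh_config
This walks you through generating or selecting the SSH key pair that dx ssh will use to connect to jobs.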
Here is the start of the usage for the app:
As noted in the following usage, the default timeout is one hour, but can be changed if you need to.
In the preceding command, I also use the following flags from dx run (the full command is shown again after this list):
-imax_session_length="2h": changes the max session length to 2 hours
-y|--yes: Do not ask for confirmation before launching job
--ssh: Configure the job to allow SSH access and connect to it after launching. Defaults --priority to high.
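Putting these flags together, the command referenced above looks like the following:
$ dx run -imax_session_length="2h" app-cloud_workstation --ssh -y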
By default, this app will choose an 8-core instance type such as "mem1_ssd1_v2_x8" (16 GB RAM, 200 GB disk) for AWS us-east-1. This is usually adequate for my needs, but if I need more memory or disk space, I can specify any valid instance type with the --instance-type argument:
This is actually an argument to dx run, not to the cloud workstation app. You can use this argument with any app to override the default instance type chosen by the app developer.
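For example, to request a much larger 72-core instance:
$ dx run app-cloud_workstation --instance-type mem1_ssd2_v2_x72 --ssh -y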
The app produces no outputs. In the following sections, I want to focus on the inputs.
As noted in the following usage, the default timeout is one hour.
You can set the session to a different length with the following command, which sets the limit to 2 hours:
When on the workstation, you can find how much time is left using dx-get-timeout:
If you would like to extend the time left, use dx-set-timeout with the same values shown previously for session length. For example, you can set the timeout back to 2 hours and verify that you now have 2 hours left:
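On the workstation, that exchange looks like the following:
$ dx-set-timeout 2h
$ dx-get-timeout
0 days 1 hours 59 minutes 57 seconds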
You can initiate the app with any files you want copied to the instance:
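$ dx run app-cloud_workstation -ifids=file-xxxx --ssh -y
Here file-xxxx is a placeholder for the ID of a file you want copied to the workstation; you can repeat -ifids= to copy several files.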
One of the main use cases for the cloud workstation is working with large files, and I will mostly use dx download on the instance to download what I want. An especially important case is when I want to download a file to STDOUT rather than to a local file, in which case I would not want to initiate the app using this input. For example, when dealing with a tarball of an entire Illumina BCL run directory, I would prefer to download to STDOUT and pipe this into tar:
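$ dx download file-XXXX -o - | tar xv
Here file-XXXX stands in for the tarball's file ID, and -o - writes the download to STDOUT so that tar can read it from the pipe.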
The alternative would require at least twice the disk space (to download the tarball and then expand the contents).
You can save the state of a workstation---called a "snapshot"---and start a new workstation using that saved state:
For instance, you may go through a lengthy build of various packages to create the environment you need to run some application that will be lost when the workstation stops.
To demonstrate, I will show that the Python module "pandas" is not installed by default:
I use python3 -m pip install pandas to install the module, then dx-create-snapshot to save the state of the machine, which shows:
I can use the file ID of the snapshot to reconstitute my environment:
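$ dx run app-cloud_workstation -isnapshot=file-GXfygVj071xGjVfg1KQ9B7PP -y --ssh
(Substitute your own snapshot file ID.)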
Now I find that "pandas" does exist on the image:
You can use a snapshot file ID as an asset for native applets.
By default, this app will choose an 8-core instance type such as "mem1_ssd1_v2_x8" (16 GB RAM, 200 GB disk) for AWS us-east-1. This is usually adequate for my needs, but if I need more memory or disk space, I can specify any valid instance type with the --instance-type argument:
This is actually an argument to dx run, not to the cloud workstation app. You can use this argument with any app to override the default instance type chosen by the app developer.
When the app secures an instance, you will be greeted by the following messages. The first shows the job ID, instance type, project ID, and the workspace container:
The next part explains that you are running the terminal multiplexer:
This means that the first time you press Ctrl-A (which normally jumps to the beginning of the line in a terminal), Byobu will show the following configuration screen, prompting you to choose whether to use Screen or Emacs mode:
If you choose Screen mode, then Byobu will emulate GNU Screen keybindings, such as:
Ctrl-A, N: Next window
Ctrl-A, C: Create window
Ctrl-A, ": show list of windows
The next message is perhaps the most important:
This means that if you lose your connection to the workstation, the job will still continue running until you manually terminate it or the maximum session length is reached. For instance, you may lose your internet connection or accidentally close your terminal application. Also, your connection will be lost after an extended period of inactivity. To reconnect, use dx find jobs to find the job ID of the cloud workstation, and then use dx ssh <job-id> to pick up the Byobu session with all your work and windows in the same state.
Next, the message recommends you press F1 to read more about Byobu and how to switch screens:
Finally, the message reminds you that you have sudo privileges to install anything you like. The dx-toolkit is also installed, so you can run all dx commands:
The preceding tip to use htop is especially useful. When developing application code, I will typically choose an instance type I estimate is appropriate for the task. I will download sample input files, install all the required software, run the commands needed for the app, then open a new screen (Ctrl-A, C) and run htop there to see resource usage.
This tip is also useful once you learn to build and run apps. You can shell into a running job using dx ssh <job-id> and connect to Byobu. To see how the system is performing in real time for a given input, use Ctrl-A, C to open a new screen and run htop there.
The cloud workstation comes with several programming languages installed:
bash 5.x
Python 3.x
R 3.x
Perl 5.x
Note that you are not your DNAnexus username on the workstation but rather the dnanexus user:
This is not to be confused with your DNAnexus ID:
Like any job, a cloud workstation must be run in the context of a DNAnexus project; however, if I execute dx ls on the workstation, I will not see the contents of the project. This is because a containing workspace is created for the job, which I can see as the "Current workspace" value in dx env:
I can see more details by searching the workstation's environment for all the variables starting with DX:
The $DX_PROJECT_CONTEXT_ID variable contains the project ID:
I can use this variable to see the parent project:
Any files left on the workstation after termination will be permanently destroyed. If I use dx upload to save my work, it will go into the job's container workspace, not the parent project. To resolve this, I use the $DX_PROJECT_CONTEXT_ID variable to upload an output file to a results folder in the parent project:
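$ dx upload output.txt --path $DX_PROJECT_CONTEXT_ID:/results
Here output.txt and the results folder are only example names; substitute your own file and destination.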
Alternatively, I can unset the DX_WORKSPACE_ID variable and change directories into the $DX_PROJECT_CONTEXT_ID:
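$ unset DX_WORKSPACE_ID && dx cd $DX_PROJECT_CONTEXT_ID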
After the preceding command, dx ls and dx upload will reference the parent project rather than the container workspace.
The ttyd app runs a similar Linux terminal in the browser. Here are some differences to note:
You will enter as the root user.
Commands like dx ls and dx upload will default to the project, not a container workspace.
There is no maximum session length, so ttyd runs until manually terminated. This can be costly if you forget to shut down the terminal.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Here we will import the nf-core Sarek pipeline from GitHub to demonstrate the functionality, but you can import any Nextflow pipeline from GitHub, not just nf-core ones!
Go to a DNAnexus project. Click Add and, in the drop-down menu, select 'Import Pipeline/Workflow'.
Next enter the required information (see below) and click 'Start Import'
The GitHub URL is the URL of the Sarek GitHub repo itself (not the URL shown under 'Clone' in the repo).
Make sure there is no slash after 'sarek' in the URL, as it will cause the importer to fail.
Choose your folder in the USERS folder to output the applet to.
To see the possible releases to use, click 'Tags' in the GitHub project. If you leave this part blank, it will use the 'main' branch for that repo.
Click the 'Monitor' tab in your project to see the running/finished import job
You should see your applet in the output folder that you specified in your project.
You can see the version of dxpy that it was built with by looking at the job log for the import job
To do this click 'View Log' on the right hand side of the screen
The job log shows that the version of dxpy used here is dxpy v0.369.0
We will run the test profile for Sarek, which should take 40 minutes to 1 hour to run. The test profile inputs are the Nextflow outdir and -profile test,docker.
Click one of the sarek applets that you created
Choose the platform output location for your results.
Click on 'Output to' then make a folder or choose an existing folder. I choose the outputs folder.
Click 'Next'
Output directory considerations
Specify the nextflow output directory.
This is a directory local to the machine that Nextflow will be running on, not a DNAnexus path.
The outdir path must start with ./ or have no slashes in front of it so that the executor will be able to make this folder where it is running on the head node. For example, ./results and results are both valid, but /results or paths like dx://project-xx:/results will not produce output in your project. Once the DNAnexus Nextflow executor detects that all files have been written to this folder (and thus all subjobs have completed), it will copy this folder to the specified job destination on the platform. In the event that the pipeline fails before completion, this folder will not be written to the project.
Here I have chosen to place the nextflow output files in a directory on the head node of the run named ./test. This creates an outdir called test.
Thus once this job completes, my results will be in dx://project-xxx:/outputs/test
More details about this are found in our Documentation.
Where test is the folder that was copied from the head node of the Nextflow run to the destination that I specified for it on platform.
Scroll down and in 'Nextflow Options', 'Nextflow Run Options'
type -profile test,docker
You must use Docker for all Nextflow pipelines run on DNAnexus. Every nf-core pipeline has a Docker profile in its nextflow.config file. You need to specify -profile docker in the Nextflow run options ('Nextflow Run Options' in the UI, -inextflow_run_opts in the CLI) to get it to use Docker containers for each process.
Then click 'Start Analysis'. You will be brought to this screen
Go to the Monitor tab to see your running job.
Note! The estimated cost per hour is the cost to run the head node only! Each of the Nextflow processes (subjobs) will run on its own instance with its own cost.
Select a project to build the applet in
and choose the number associated with your project.
Or select your project using its name or project ID
Replace the folder name with your folder name
This will place the sarek applet in a folder called sarek_v3.4.0_cli_import in the /USERS/FOLDERNAME folder in the project.
You can see the job running/completed in the Monitor tab of your project.
If you are using a private github repository, you can supply a git credentials file to dx build using the --git-credentials option. The git credentials file has the following format.
It must be stored in a project on the platform. For more information on this file, see the documentation.
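A sketch of a build command using a credentials file stored in your project (the repository URL, credentials file path, and destination are placeholders):
$ dx build --nextflow \
    --repository https://github.com/ORG/private-repo \
    --git-credentials project-ID:/path/to/git_credentials \
    --destination project-ID:/USERS/FOLDERNAME/my_pipeline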
Build the Nextflow pipeline from a folder on your local machine
This approach is useful for building your own Nextflow pipelines into applets and for pipelines that are not in a GitHub repository.
It is also useful if you need to alter something from a public repo locally (e.g. change some code in a file to fix a bug without fixing it in the public repo) and want to build using the locally updated directory instead of the git repo.
Additionally, if you want to use the most up-to-date dxpy version, you will need to use this approach. Sometimes the workers executing the remote repository builds can be a version or two behind the latest release of dxpy. You may want to use the latest version of dxpy if, for instance, the Nextflow executor bundled with an older dxpy version has a bug that you want to avoid.
For example, running dx --version shows that I am using dx v0.370.2, which is what will be used for the applet we build with this approach.
However, we saw that the UI and CLI import jobs used dxpy v0.369.0.
Clone the git repository
Once you have selected the project to build in using dx select, then build using the --nextflow flag
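For example, for Sarek 3.4.0 (replace project-ID and FOLDERNAME with your own):
$ git clone --branch 3.4.0 https://github.com/nf-core/sarek.git
$ mv sarek sarek_v3.4.0_cli
$ dx build --nextflow sarek_v3.4.0_cli --destination project-ID:/USERS/FOLDERNAME/sarek_v3.4.0_cli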
You should see an applet ID if it has built successfully.
Note that this approach does not generate a job log, and it will use the version of dxpy on your local machine. So if using dxpy v0.370.2, the applet will be packaged with this version of dxpy and its corresponding version of Nextflow (23.10.0 in this case).
To see the help command for the applet:
Use dx run <applet-name/applet-ID> -h
or use its applet ID (useful when there are multiple versions of the applet with the same name, as each version will have its own ID). Also, you can run an applet by its ID from anywhere in the project, but if using its name you must dx cd to its folder before using it.
Excerpt of the help command
Run command
To run this, copy the command to your terminal and replace 'USERS/FOLDERNAME' with your folder name
Then press Enter.
You should see
Type y to proceed.
You can also add '-y' to the run command to get it to run without prompting e.g.,
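$ dx run sarek_v3.4.0_ui -ioutdir='./test_run_cli' -inextflow_run_opts='-profile test,docker' --destination 'project-ID:/USERS/FOLDERNAME' -y
(Replace the applet name, project-ID, and FOLDERNAME with your own values.)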
You can track the progress of your job using the 'Monitor' tab of your project in the UI
Once the run successfully completes, your results will be in your specified output location, where test_run_cli is the folder on the head node of the Nextflow run that is copied to the 'outputs' folder in your project on the platform.
Note that because destination is a DNAnexus option and not a Nextflow one, it starts with '--' and does not take an '=' after it.
By default, the DNAnexus executor will only run 5 subjobs in parallel. You can change this by passing the -queue-size flag in nextflow_run_opts with the number you require. There is a limit of 100 concurrent subjobs per user per project for most users, but you can give any number up to 1000 before it produces an error, as noted in the documentation. For example, if you know that you are passing 20 files to a run and that only a few of the processes can be run on all 20 files at a time, you could set the queue size to 60.
Let's change it to 20 for our nf-core Sarek run. The command would then be:
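$ dx run sarek_v3.4.0_ui -ioutdir='./test_run_cli_qs' -inextflow_run_opts='-profile test,docker -queue-size 20' --destination 'project-ID:/USERS/FOLDERNAME'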
You can also set the queue size when building your own applets in the nextflow.config. To change the default from 5 to 20 for your applet at build time, add this line to your nextflow.config
or (equivalent)
However, you can change the queue size at runtime, regardless of whether it is set in your nextflow.config, by passing -queue-size X (where X is a number between 1 and 1000) in the Nextflow run options.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Some of the links on these pages will take the user to pages that are maintained by third parties. The accuracy and IP rights of the information on these third-party pages are the responsibility of those third parties.
dx ssh_config$ dx run cloud_workstation -h
usage: dx run cloud_workstation [-iINPUT_NAME=VALUE ...]
App: Cloud Workstation
Version: 2.2.1 (published)
This app sets up a cloud workstation which you can access by running the
applet with the --ssh or --allow-ssh flags
See the app page for more information:
https://platform.dnanexus.com/app/cloud_workstationMaximum Session Length (suffixes allowed: s, m, h, d, w, M, y):
[-imax_session_length=(string, default="1h")]
The maximum length of time to keep the workstation running.
Value should include units of either s, m, h, d, w, M, y for
seconds, minutes, hours, days, weeks, months, or years
respectively.$ dx run -imax_session_length="2h" app-cloud_workstation --ssh -yCtrl-A, K: Kill/delete window$ dx run app-cloud_workstation --instance-type mem1_ssd2_v2_x72 --ssh -yMaximum Session Length (suffixes allowed: s, m, h, d, w, M, y):
[-imax_session_length=(string, default="1h")]
The maximum length of time to keep the workstation running.
Value should include units of either s, m, h, d, w, M, y for
seconds, minutes, hours, days, weeks, months, or years
respectively.$ dx run -imax_session_length="2h" app-cloud_workstation --ssh -ydnanexus@job-GXfvYxj071x5P87Fxx6f5k47:~$ dx-get-timeout
0 days 1 hours 42 minutes 50 secondsdnanexus@job-GXfvYxj071x5P87Fxx6f5k47:~$ dx-set-timeout 1d
dnanexus@job-GXfvYxj071x5P87Fxx6f5k47:~$ dx-get-timeout
0 days 1 hours 59 minutes 57 secondsFiles: [-ifids=(file) [-ifids=... [...]]]
An optional list of files to download to the cloud workstation
on startup.$ dx download file-XXXX -o - | tar xvSnapshot: [-isnapshot=(file)]
An optional snapshot file to restore the workstation environment.dnanexus@job-GXfvYxj071x5P87Fxx6f5k47:~$ python3
Python 3.8.10 (default, May 26 2023, 14:05:08)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pandas'Created snapshot: project-GXY0PK0071xJpG156BFyXpJF:July_11_2023_23_54.snapshot
(file-GXfygVj071xGjVfg1KQ9B7PP)$ dx run app-cloud_workstation -isnapshot=file-GXfygVj071xGjVfg1KQ9B7PP -y --sshdnanexus@job-GXfyj58071xB4VJ9X0yk75k3:~$ python3
Python 3.8.10 (default, May 26 2023, 14:05:08)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> help(pd.read_csv)$ dx run app-cloud_workstation --instance-type mem1_ssd2_v2_x72 --ssh -yWelcome to DNAnexus!
This is the DNAnexus Execution Environment, running job-GXfvYxj071x5P87Fxx6f5k47.
Job: Cloud Workstation
App: cloud_workstation:main
Instance type: mem1_ssd1_v2_x8
Project: kyclark_test (project-GXY0PK0071xJpG156BFyXpJF)
Workspace: container-GXfvYyj0p4QgFgP4zZyBFv7Y
Running since: Tue Jul 11 21:31:40 UTC 2023
Running for: 0:01:37
The public address of this instance is ec2-3-90-239-144.compute-1.amazonaws.com.You are running byobu, a terminal session manager.Configure Byobu's ctrl-a behavior...
When you press ctrl-a in Byobu, do you want it to operate in:
(1) Screen mode (GNU Screen's default escape sequence)
(2) Emacs mode (go to beginning of line)
Note that:
- F12 also operates as an escape in Byobu
- You can press F9 and choose your escape character
- You can run 'byobu-ctrl-a' at any time to change your selection
Select [1 or 2]:If you get disconnected from this instance, you can log in again;
your work will be saved as long as the job is running.For more information on byobu, press F1.
The job is running in terminal 1. To switch to it, use the F4 key
(fn+F4 on Macs; press F4 again to switch back to this terminal).Use sudo to run administrative commands.
From this window, you can:
- Use the DNAnexus API with dx
- Monitor processes on the worker with htop
- Install packages with apt-get install or pip3 install
- Use this instance as a general-purpose Linux workstation
OS version: Ubuntu 20.04.6 LTS (GNU/Linux 5.15.0-1031-aws x86_64)$ whoami
dnanexus$ dx whoami
kyclark$ dx env
Auth token used 4Gv26bY2YJ6gJjxGkV6Qg62B51X1VF7kq3gPZp2V
API server protocol http
API server host 10.0.3.1
API server port 8124
Current workspace container-GXfvYyj0p4QgFgP4zZyBFv7Y
Current folder None
Current user None$ env | grep DX
DX_APISERVER_PROTOCOL=http
DX_JOB_ID=job-GXfvYxj071x5P87Fxx6f5k47
DX_APISERVER_HOST=10.0.3.1
DX_WATCH_PORT=8090
DX_WORKSPACE_ID=container-GXfvYyj0p4QgFgP4zZyBFv7Y
DX_PROJECT_CACHE_ID=container-GXfvYxj071x5P87Fxx6f5k48
DX_SNAPSHOT_FILE=null
DX_SECURITY_CONTEXT={"auth_token_type": "Bearer", "auth_token": "4Gv26bY2YJ6gJjxGkV6Qg62B51X1VF7kq3gPZp2V"}
DX_RESOURCES_ID=container-GKyz0G00FY38jv564gjXxb46
DX_THRIFT_URI=query.us-east-1.apollo.dnanexus.com:10000
DX_APISERVER_PORT=8124
DX_DXDA_DOWNLOAD_URI=http://10.0.3.1:8090/F/D2PRJ/
DX_PROJECT_CONTEXT_ID=project-GXY0PK0071xJpG156BFyXpJF
DX_RUN_DETACH=1$ echo $DX_PROJECT_CONTEXT_ID
project-GXY0PK0071xJpG156BFyXpJF
$ dx ls $DX_PROJECT_CONTEXT_ID:/
$ dx upload output.txt --path $DX_PROJECT_CONTEXT_ID:/results
$ unset DX_WORKSPACE_ID && dx cd $DX_PROJECT_CONTEXT_ID
Click 'Launch Analysis'.


https://github.com/nf-core/sarekdx select # press enterdx select project-ID
#or
dx select my_project_namedx build --nextflow --repository https://github.com/nf-core/sarek --repository-tag 3.4.0 --destination project-ID:/USERS/FOLDERNAME/sarek_v3.4.0_cli_importproviders {
github {
user = 'username'
password = 'ghp_xxxx'
}
}dx --version
#dx v0.370.2git clone --branch 3.4.0 https://github.com/nf-core/sarek.git
# Here I change the folder name to something with the version in it to help me keep track of different versions of sarek
mv sarek sarek_v3.4.0_clidx build --nextflow sarek_v3.4.0_cli --destination project-ID:/USERS/FOLDERNAME/sarek_v3.4.0_cliapplet-xxxdx run sarek_v3.4.0_ui -h dx run applet-ID -husage: dx run sarek_v3.4.0_ui [-iINPUT_NAME=VALUE ...]
Applet: sarek
sarek
Inputs:
outdir: [-ioutdir=(string)]
(Nextflow pipeline required)
step: [-istep=(string)]
(Nextflow pipeline required) Default value:mapping The pipeline starts
from this step and then runs through the possible subsequent steps.
input: [-iinput=(file)]
(Nextflow pipeline optional) A design file with information about the
samples in your experiment. Use this parameter to specify the location
of the input files. It has to be a comma-separated file with a header
row. See [usage docs](https://nf-co.re/sarek/usage#input). If no
input file is specified, sarek will attempt to locate one in the
`{outdir}` directory. If no input should be supplied, i.e. when --step
is supplied or --build_from_index, then set --input false
...dx run sarek_v3.4.0_ui -ioutdir='./test_run_cli' -inextflow_run_opts='-profile test,docker' --destination 'project-ID:/USERS/FOLDERNAME'
Using input JSON:
{
"outdir": "./test_run_cli",
"nextflow_run_opts": "-profile test,docker"
}
Confirm running the executable with this input [Y/n]:dx run sarek_v3.4.0_ui -ioutdir='./test_run_cli' -inextflow_run_opts='-profile test,docker' --destination 'project-ID:/USERS/FOLDERNAME' -ydx run sarek_v3.4.0_ui -ioutdir='./test_run_cli_qs' -inextflow_run_opts='-profile test,docker -queue-size 20' --destination 'project-ID:/USERS/FOLDERNAME'executor.queueSize = 20 executor {
queueSize = 20
}












In this applet, I'll show how to count the number of reads in a SAM or BAM file using samtools. The SAM (Sequence Alignment Map) format is a tab-delimited text description of sequence alignments, and the BAM format is the same data stored in binary for better compression. As the SAM format uses a line break to delineate each record, counting the alignments could be as simple as using wc -l; however, the BAM format requires a program like samtools to read the input file, so I'll show how to install this into the applet's execution environment.
A minimal native applet requires just two files that exist in a directory with the same name as the applet:
dxapp.json: a JSON-formatted file describing the applet's metadata
a bash or Python program to execute
I'll use dx-app-wizard to create a skeleton applet structure with these files:
First, I must give my applet a name. The prompt shows that I must use only letters, numbers, a dot, underscore, and a dash. As stated earlier, this applet name will also be the name of the directory, and I'll use samtools_count:
Next, I'm asked for the title. Note that the prompt includes empty square brackets ([]), which contain the default value if I press Enter. As title is not required, it contains the empty string, but I will provide an informative title:
Likewise, the summary field is not required:
The version is also optional, and I will press Enter to take the default:
This applet requires a single input, as shown in Table 1.
When prompted for the first input, I'll enter the following:
The name of the input will be used as a variable in the bash code, so I will use only letters, numbers, and underscores as in bam or bam_file.
The label is optional, as noted by the empty square brackets.
The types include primitives like integers, floating-point numbers, and strings, as well as arrays of primitive types.
When prompted for the second input, press Enter:
As shown in Table 2, the applet will produce a single output file containing the number of alignments:
When prompted for the first output name, I enter the following:
This name will also become a bash variable, so best practice is to use letters, numbers, and underscores.
The label is optional.
The class must be from the preceding list. To be reminded of the choices, press the Tab key twice.
When prompted for the second output, press Enter:
Here are the final settings I'll use to complete the wizard:
Applets are required to set a maximum run time to prevent a job from running for an excessively long time. While some applets may legitimately need days to run, most probably need something in the range of 12-48 hours. As noted in the prompt, I can use m, h, or d to specify minutes, hours, or days, respectively:
For the template language, I must select from bash or Python for the program that is executed when the applet starts. The applet code can execute any program available in the execution environment, including custom programs written in any language. I will choose bash:
Next, I determine if the applet has access to the internet and/or the parent project. Unless the applet specifically needs access, such as to download a file at runtime, it's best to answer no:
Lastly, I must specify a default instance type. The prompt includes an abbreviated list of available instance types. The final number indicates the number of cores, e.g., _x4 indicates 4 cores. The greater the number of cores, the more available memory and disk space. In this case, a small 4-core instance is sufficient:
The user is always free to override the instance type using the --instance-type option to dx run.
The final output from dx-app-wizard is a summary of the files that are created:
This file should contain applet implementation details.
This file should contain user help.
The answers from dx-app-wizard are used to create the app metadata.
The resources directory is for any additional files you want available on the runtime instance.
The contents of the resources directory will be placed into the root directory of the runtime instance. For instance, if you create a file resources/my_tool, then it will be available on the runtime instance as /my_tool. You would either need to reference the full path (/my_tool) or expand the $PATH variable to include /. Best practice is to create the directory structure resources/usr/local/bin/, and then the file will be at /usr/local/bin/my_tool, as /usr/local/bin is normally part of $PATH.
Let's look at the dxapp.json that was generated by dx-app-wizard. Note that this is a simple text file that you can edit at any time:
The inputSpec has a section for patterns where I will add a few Unix file globs to indicate acceptable file suffixes:
The outputSpec needs no update:
The runSpec contains the timeout along with the indication to use bash to run src/samtools_count.sh. If you ever wanted to change the name or location of the run script, update this section:
Finally, the regionalOptions indicates the default runtime instance.
In the preceding runSpec, note that the applet will run on Ubuntu 20.04. This instance will include dx-toolkit and several programming languages, including bash, Python 3.x, Perl 5.x, and R 3.x. Anything else needed by the applet must be installed. Edit the runSpec to include the following execDepends to install samtools at runtime using the apt package manager:
The package_manager may be one of the following:
apt (Ubuntu)
pip (Python)
gem (Ruby)
Some caveats:
This runs apt install on every execution, which is fine for fast installs. Some packages may take 5-15 minutes to install, in which case you will pay for those extra minutes on every run.
It installs whatever version is current in the package manager, which may be old. For instance, apt installs samtools v1.10 as of this writing, while the current release is v1.17.
Your applet could break if the package manager updates to a newer version of the program that changes its behavior.
An alternative is to build an asset that the applet uses. Assets have many advantages, including:
You build the asset once
Runtime installation is a quick decompression of a tarball
Assets are static and cannot break your code
Create a new folder with the name of your asset.
Then, create the file dxasset.json in the folder with the following contents:
When I execute dx build_asset in the folder, a new job will run to build the asset:
As noted, the record ID of the asset can now be used in an assetDepends section, which should replace the execDepends:
Execute dx build_asset inside this directory to build the asset into the selected project. (You can also use the --destination option to specify where to place the asset file, which will be a tarball.)
The build process will create a new job to build the asset.
The default src/samtools_count.sh contains many lines of comments to guide you in writing your application code. Update the file to the following:
This is the colloquially named "shebang" line that indicates this is a bash script.
Although it's not a requirement that app code be contained in a main() function, it is best practice.
The original template uses echo to show you the runtime value of the inputs.
Remember that the $bam variable matches the name of the input in dxapp.json. If you ever wish to change this, be sure to update both the script and the JSON.
Run dx build to create the applet on the DNAnexus platform.
If you have previously built the applet, you will be prompted to use the -f|--overwrite or -a|--archive flag:
Out of habit, I always use -f to force the build:
Without the -d|--destination option, the applet will be placed into the root directory of the project. I like to make an apps folder to hold my applets:
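$ dx mkdir apps
$ dx build -d /apps/ -f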
TIP: Best practice is to create folders for applets, resources, assets, etc.
I'd like to discuss this code a little more. In bash, the echo command will print to the console. As in any language, this is a great way to see what's happening when your code is running. In the following line, the $bam variable will only have a value at runtime, so you will not be able to run this script locally:
When I execute this code, I see output like the following:
That means that the following line:
Will execute the following command at runtime:
Take a look at the usage for dx download to remind yourself that the -o option here is directing that the output file name be input.bam:
The next line of code executes samtools view with the -c flag. Execute samtools view -h to read the documentation:
I often use a cloud workstation to work through app building. It's the same execution environment (Ubuntu Linux), so I will install any programs I need there, download sample input files, run commands and verify the behavior and output of the tools, etc.
If I download the input file NA12878.bam (file-FpQKQk00FgkGV3Vb3jJ8xqGV), I can run the following command to see that there are 60,777 alignments:
I can use Unix output redirection with > to place the output into the file counts.txt and cat to verify the output:
Therefore, the following line of code from the bash script places the count of the input BAM file into counts.txt:
Next, I upload the counts.txt file to the platform using the --brief option that will only show the new file ID:
In bash, I can use either backticks (``) or $() to capture the results from a command, so the following line captures the file ID into the variable counts_id:
I add this new file ID as an output from the job using dx-jobutil-add-output:
Here is the last command of the script that sets the counts output variable defined in the dxapp.json to the new $counts_id value:
In the preceding applet, the output filename is always counts.txt. It would be better for each output file to use the name of the input BAM. When I define the bam input, I get four variables:
bam: the input file ID
bam_path: the default path to the downloaded input file
bam_name: the filename, also the output of basename($bam_path)
The default patterns value for a file input in dxapp.json is ["*"]. This matches the entire input filename, causing bam_prefix to be the empty string.
TIP: Always be sure to set patterns to the expected file extensions.
Given an input file of NA12878.bam, the following code will create an output file called NA12878.txt:
Print out the additional variables.
Download the input file using its original filename. The -o option here is superfluous, as the default behavior is to download the file to its filename. In the preceding example, I saved it to the filename input.bam.
Define the variable outfile to use the root of the input filename.
When I run this code, I can see the values of the other input file variables:
The bam_path value is the default path to write the bam file if I were to use dx-download-all-inputs. In this case, I used dx download with the -o option to write it to a file in the current working directory, so there is no file at that path.
There are two ways to download the input files: one at a time or all at once. So far, I've shown the first way using dx download. The second way uses dx-download-all-inputs to download all the input files to the directory /home/dnanexus/in. This will contain a directory for each file input, so the bam input file will be placed into /home/dnanexus/in/bam as shown for the $bam_path in the preceding section. If the input is an array:file, there will be additional numbered subdirectories for each of the runtime values.
Following is the usage:
I can change my code to use this:
Download the input file to the default location.
Use the $bam_prefix variable (e.g., NA12878) to create the outfile.
Use the $bam_path variable to execute samtools with the path to the in directory.
TIP: Using dx-download-all-inputs --parallel is best practice to download all input files as fast as possible.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
The src directory (pronounced "source") is a conventional place for source code, but it's not a requirement that code lives in this directory.
This is the bash script that will be executed when the applet is run.
The test directory is empty and will not be discussed in this section.
cpan (Perl)
cran (R)
Execute samtools to count the alignments in the input file.
Upload the results file and save the new file ID.
Add the new file ID to the job's output.
bam_prefix: the filename minus any file extension defined in the patterns of the dxapp.json
Run samtools, writing the count to the preferred output filename.
Upload the output file.
Table 1. Applet input
Name: bam
Label: BAM File
Class: file
Optional: No
Default: NA
Table 2. Applet output
Name: counts
Label: Counts File
Class: file
Wizard settings
Timeout Policy: 48h
Programming language: bash
Access to internet: No (default)
Access to parent project: No (default)
Instance Type: mem1_ssd1_v2_x4 (default)
$ dx-app-wizard
DNAnexus App Wizard, API v1.0.0
Basic Metadata
Please enter basic metadata fields that will be used to describe your app.
Optional fields are denoted by options with square brackets. At the end of
this wizard, the files necessary for building your app will be generated from
the answers you provide.The name of your app must be unique on the DNAnexus platform. After
creating your app for the first time, you will be able to publish new versions
using the same app name. App names are restricted to alphanumeric characters
(a-z, A-Z, 0-9), and the characters ".", "_", and "-".
App Name: samtools_countThe title, if provided, is what is shown as the name of your app on
the website. It can be any valid UTF-8 string.
Title []: Samtools CountThe summary of your app is a short phrase or one-line description of
what your app does. It can be any UTF-8 human-readable string.
Summary []: Count SAM/BAM alignmentsYou can publish multiple versions of your app, and the version of your
app is a string with which to tag a particular version. We encourage the use
of Semantic Versioning for labeling your apps (see http://semver.org/ for more
details).
Version [0.0.1]:Input Specification
You will now be prompted for each input parameter to your app. Each parameter
should have a unique name that uses only the underscore "_" and alphanumeric
characters, and does not start with a number.
1st input name (<ENTER> to finish): bam
Label (optional human-readable name) []: BAM File
Your input parameter must be of one of the following classes:
applet array:file array:record file int
array:applet array:float array:string float record
array:boolean array:int boolean hash string
Choose a class (<TAB> twice for choices): file
This is an optional parameter [y/n]: n 2nd input name (<ENTER> to finish):Output Specification
You will now be prompted for each output parameter of your app. Each
parameter should have a unique name that uses only the underscore "_" and
alphanumeric characters, and does not start with a number.
1st output name (<ENTER> to finish): counts
Label (optional human-readable name) []: Counts File
Choose a class (<TAB> twice for choices): file 2nd output name (<ENTER> to finish):Timeout Policy
Set a timeout policy for your app. Any single entry point of the app
that runs longer than the specified timeout will fail with a TimeoutExceeded
error. Enter an int greater than 0 with a single-letter suffix (m=minutes,
h=hours, d=days) (e.g. "48h").
Timeout policy [48h]:Template Options
You can write your app in any programming language, but we provide
templates for the following supported languages: Python, bash
Programming language: bashAccess Permissions
If you request these extra permissions for your app, users will see this fact
when launching your app, and certain other restrictions will apply. For more
information, see
https://documentation.dnanexus.com/developer/apps/app-permissions.
Access to the Internet (other than accessing the DNAnexus API).
Will this app need access to the Internet? [y/N]: n
Direct access to the parent project. This is not needed if your app
specifies outputs, which will be copied into the project after it's done
running.
Will this app need access to the parent project? [y/N]: nDefault instance type: The instance type you select here will apply to
all entry points in your app unless you override it. See https://documenta
tion.dnanexus.com/developer/api/running-analyses/instance-types for more
information.
Choose an instance type for your app [mem1_ssd1_v2_x4]:*** Generating DNAnexus App Template... ***
Your app specification has been written to the dxapp.json file. You can
specify more app options by editing this file directly (see
https://documentation.dnanexus.com/developer for complete documentation).
Created files:
samtools_count/Readme.developer.md # 1
samtools_count/Readme.md # 2
samtools_count/dxapp.json # 3
samtools_count/resources/ # 4
samtools_count/src/ # 5
samtools_count/src/samtools_count.sh # 6
samtools_count/test/ # 7
App directory created! See https://documentation.dnanexus.com/developer for
tutorials on how to modify these files, or run "dx build samtools_count" or
"dx build --create-app samtools_count" while logged in with dx.
Running the DNAnexus build utility will create an executable on the DNAnexus
platform. Any files found in the resources directory will be uploaded
so that they will be present in the root directory when the executable is run.{
"name": "samtools_count",
"title": "Samtools Count",
"summary": "Count SAM/BAM alignments",
"dxapi": "1.0.0",
"version": "0.0.1", "inputSpec": [
{
"name": "bam",
"label": "BAM File",
"class": "file",
"optional": false,
"patterns": [
"*.bam"
],
"help": ""
}
], "outputSpec": [
{
"name": "counts",
"label": "Counts File",
"class": "file",
"patterns": [
"*"
],
"help": ""
}
], "runSpec": {
"timeoutPolicy": {
"*": {
"hours": 48
}
},
"interpreter": "bash",
"file": "src/samtools_count.sh",
"distribution": "Ubuntu",
"release": "20.04",
"version": "0"
}, "regionalOptions": {
"aws:us-east-1": {
"systemRequirements": {
"*": {
"instanceType": "mem1_ssd1_v2_x4"
}
}
}
}
}{
...
"runSpec": {
"execDepends": [
{
"name": "samtools",
"package_manager": "apt"
}
],
...
}
}{
"name": "samtools",
"title": "samtools asset",
"description": "samtools asset",
"version": "1.10",
"distribution": "Ubuntu",
"release": "20.04",
"execDepends": [
{
"name": "samtools",
"package_manager": "apt"
}
]
}$ dx build_asset
...
* samtools (create_asset_focal:main) (done) job-GXjx8yj071x69xBVz90Zypx1
kyclark 2023-07-14 16:04:27 (runtime 0:02:05)
Output: asset_bundle = record-GXjx9V008bgjZqj82f5ybf16
Asset bundle 'record-GXjx9V008bgjZqj82f5ybf16' is built and can now be used
in your app/applet's dxapp.json{
...
"runSpec": {
"assetDepends": [
{ "id": "record-GXjx9V008bgjZqj82f5ybf16" }
],
...
}
}#!/bin/bash
main() {
echo "Value of bam: '$bam'"
dx download "$bam" -o input.bam
samtools view -c input.bam > counts.txt
counts_id=$(dx upload counts.txt --brief)
dx-jobutil-add-output counts "$counts_id" --class=file
}$ dx build
{"id": "applet-GXqG4Z8071x9p1FZ81K5BjGQ"}$ dx build
Error: ('An applet already exists at /samtools_count (id
applet-GXqG4Z8071x9p1FZ81K5BjGQ) and neither -f/--overwrite
nor -a/--archive were given.',)$ dx build -f
INFO:dxpy:Deleting applet(s) applet-GXqG4Z8071x9p1FZ81K5BjGQ
{"id": "applet-GXqG5P0071xF2j1F03qv7Zz6"}$ dx mkdir apps
$ dx build -d /apps/ -f
{"id": "applet-GXqG7bQ071xKQq3JkbVjGbGv"}echo "Value of bam: '$bam'"2023-07-17 12:42:23 Samtools Count STDOUT Value of bam:
'{"$dnanexus_link": "file-FpQKQk00FgkGV3Vb3jJ8xqGV"}'dx download "$bam" -o input.bamdx download '{"$dnanexus_link": "file-FpQKQk00FgkGV3Vb3jJ8xqGV"}' -o input.bam-o OUTPUT, --output OUTPUT Local filename or directory to be used
("-" indicates stdout output); if not supplied or
a directory is given, the object's name on the
platform will be used, along with any applicable
extensions-c, --count Print only the count of matching records$ samtools view -c NA12878.bam
60777$ samtools view -c NA12878.bam > counts.txt
$ cat counts.txt
60777samtools view -c input.bam > counts.txt$ dx upload counts.txt --brief
file-GXpvky0071x6jg2ZVV3fJ5xp$ counts_id=$(dx upload counts.txt --brief)
$ echo $counts_id
file-GXqFf60071x6p2fbKYzVv9pp$ dx-jobutil-add-output -h
usage: dx-jobutil-add-output [-h] [--class [CLASSNAME]] [--array] name value
Reads and modifies job_output.json in your home directory to be a JSON hash
with key *name* and value *value*.
If --class is not provided or is set to "auto", auto-detection of the
output format will occur. In particular, it will treat it as a number,
hash, or boolean if it can be successfully parsed as such. If it is a
string which matches the pattern for a data object ID, it will encapsulate
it in a DNAnexus link hash; otherwise it is treated as a simple string.dx-jobutil-add-output counts "$counts_id" --class=file#!/bin/bash
main() {
echo "Value of bam : '$bam'" # 1
echo "Value of bam_path : '$bam_path'"
echo "Value of bam_name : '$bam_name'"
echo "Value of bam_prefix: '$bam_prefix'"
dx download "$bam" -o "$bam_name" # 2
outfile="$bam_prefix.txt" # 3
samtools view -c "$bam_name" > "$outfile" # 4
counts_id=$(dx upload "$outfile" --brief) # 5
dx-jobutil-add-output counts "$counts_id" --class=file # 6
}Value of bam : '{"$dnanexus_link": "file-FpQKQk00FgkGV3Vb3jJ8xqGV"}'
Value of bam_path : '/home/dnanexus/in/bam/NA12878.bam'
Value of bam_name : 'NA12878.bam'
Value of bam_prefix: 'NA12878'$ dx-download-all-inputs -h
usage: dx-download-all-inputs [-h] [--except EXCLUDE]
[--parallel] [--sequential]
Note: this is a utility for use by bash apps running in the DNAnexus Platform.
Downloads all files that were supplied as inputs to the app. By
convention, if an input parameter "FOO" has value
{"$dnanexus_link": "file-xxxx"}
and filename INPUT.TXT, then the linked file will be downloaded into the
path:
$HOME/in/FOO/INPUT.TXT
If an input is an array of files, then all files will be placed into
numbered subdirectories under a parent directory named for the input. For
example, if the input key is FOO, and the inputs are {A, B, C}.vcf then,
the directory structure will be:
$HOME/in/FOO/0/A.vcf
1/B.vcf
2/C.vcf
Zero padding is used to ensure argument order. For example, if there are 12
input files {A, B, C, D, E, F, G, H, I, J, K, L}.txt, the directory
structure will be:
$HOME/in/FOO/00/A.vcf
...
11/L.vcf
This allows using shell globbing (FOO/*/*.vcf) to get all the files in the
input order.
options:
-h, --help show this help message and exit
--except EXCLUDE Do not download the input with this name. (May be used
multiple times.)
--parallel Download the files in parallel
--sequential Download the files sequentially#!/bin/bash
main() {
echo "Value of bam : '$bam'"
echo "Value of bam_path : '$bam_path'"
echo "Value of bam_name : '$bam_name'"
echo "Value of bam_prefix: '$bam_prefix'"
dx-download-all-inputs # 1
outfile="$bam_prefix.txt" # 2
samtools view -c "$bam_path" > "$outfile"
counts_id=$(dx upload "$outfile" --brief)
dx-jobutil-add-output counts "$counts_id" --class=file
}
To begin, you'll create a bash app to run CNVkit, which will find "genome-wide copy number from high-throughput sequencing." Create a local directory to hold your work, and consider putting the contents into a source code repository like Git.
In this example, you will:
Use various package managers to install dependencies
Build an asset
Learn to use dx-download-all-inputs and dx-upload-all-outputs
From the web interface, select "Projects → All Projects" to see your project list. Click the "New Project" button to create a new project called "CNVkit." Alternatively, use dx new project to do this from the command line. However you choose to create a project, be sure this has been selected by running dx pwd to check your current working directory and using dx select to select the project, if needed.
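From the command line, that might look like the following sketch (the project name is just an example):
$ dx new project CNVkit
$ dx select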
Inside your working directory, run the command dx-app-wizard cnvkit_bash to launch the app wizard. Optionally provide a title, summary, and version at the prompts.
The app will accept two inputs:
One or more BAM files of the tumor samples: Give this input the name bam_tumor with the label "BAM Tumor Files." For the class, choose array:file, and indicate that this is not an optional parameter.
A reference file: Give this input the name reference with the label "Reference." For the class, choose file, and indicate that this is not an optional parameter.
When prompted for the third input, press Enter to end the inputs.
Define three outputs, each of the type array:file with the following names and whatever labels you feel are appropriate:
cns
cns_filtered
plot
Press Enter when prompted for the fourth output to indicate you are finished.
Press Enter to accept the default value for the timeout policy.
Type bash for the programming language.
Type y to indicate that the app will need internet access.
Type n to indicate that the app will not need access to the parent project.
You should see a message saying the app's template was created in a directory whose name matches the app's name. For instance, I have the following:
This is a JSON file containing metadata that will be used to create the app on the DNAnexus platform.
A stub for user documentation.
A stub for developer documentation.
A template bash script for the app's functionality.
The dxapp.json file that was created by the wizard should look like the following:
See the documentation for a more complete understanding of all the possible fields and their implications.
CNVkit has dependencies on both Python and R modules that must be installed before running. In the dxapp.json, you can specify dependencies that can be installed with the following package managers:
apt (Ubuntu)
pip (Python)
cpan (Perl)
The Python module cnvkit can be installed via pip, but the software also requires an R module called DNAcopy that must be installed using BiocManager, which must first be installed using cran. This means you'll have to manually install the DNAcopy module when the app starts.
To add these runtime dependencies, use a text editor to update the runSpec and add the following execDepends section that will install the Python cnvkit and R BiocManager modules before the app is executed:
In the inputSpec, change the patterns to match the expected file extensions:
bam_tumor: *.bam
reference: *.cnn
Your dxapp.json should now look like the following:
The default bash code generated by the wizard starts with a generous header of comments that you may or may not wish to keep. The default code prints the values of the input variables, then downloads the input files individually. The app code belongs in the middle, after downloading the inputs and before uploading the outputs:
Replace src/cnvkit_bash.sh with the following code:
Rather than downloading the inputs individually as in the original template, this version downloads all the inputs in parallel with the following command:
This will create an in directory with subdirectories named according to the input names. Note that the bam_tumor input is an array of files, so this directory will contain numbered subdirectories starting at 0 for each input file:
Similarly, the preceding code uses dx-upload-all-outputs, which expects an out directory with subdirectories named according to each of the output specifications.
Use dx pwd to ensure you are in the correct project and dx select to change projects, if necessary. If you are inside the bash source directory where the dxapp.json file exists, you can run dx build -f. If you are in the parent directory, run dx build -f cnvkit_bash. Here is a sample output from successfully compiling the app:
The -f|--overwrite flag indicates you wish to overwrite any previous version of the applet. You may also want to use the -a|--archive flag to move any previous versions to an archived location. You won't need either of these flags the first time you compile, but subsequent builds will require that you indicate how to handle previous versions of the applet. Run dx build --help to learn more about build options.
Download this BAM file and add it to the inputs directory
Indicate an output directory, click the Run button, and then click the "View Log" to watch the job's progress.
You can also run the applet on the command line with the -h|--help flag to verify the inputs and outputs:
Select the input files on the web interface to note the file IDs that can be used to execute the app from the command line as follows:
You should see output from the preceding command that includes a JSON document with the inputs:
Note that you can place this JSON into a file and launch the applet with the inputs specified with the -f|--input-json-file option, as follows. Use dx run -h to learn about other command-line options:
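$ dx run cnvkit_bash -f inputs.json -y
Here inputs.json is a hypothetical filename for the JSON document you saved with the inputs.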
Note the job ID from dx run, and use dx watch to watch the job to completion and dx describe to view the job's metadata. Alternatively, you can use the web platform to launch the job, using the file selector to specify each of the inputs, then use the "Monitor" view to check the job's status and view the output reference file when the job completes.
You'll notice the applet takes quite a while to run (around 14 minutes for me) because of the module installations. You can build an asset for these installations and use this in dxapp.json. Create a directory called cnvkit_asset with the following file dxasset.json:
Also create a Makefile with the following contents:
Run dx build_asset to create the asset. This will launch a job that will report the asset ID at the end:
Update the runSpec in dxapp.json to the following:
Use dx build -f and note the new app's ID. Create a JSON input as follows:
Launch the new app from the CLI with the following command:
Use dx watch with the new job ID to see how the run now uses the asset to run faster. I see about a 10-minute difference with the asset.
You learned more ways to include app dependencies using package managers and a Makefile as well as by building an asset. The first strategy happens at runtime while the latter builds all the dependencies before the applet is run, making the runtime much faster.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Press Enter to accept the default value for the instance type or select one from the list shown.
cran (R)
gem (Ruby)
$ find cnvkit_bash -type f
cnvkit_bash/dxapp.json
cnvkit_bash/Readme.md
cnvkit_bash/Readme.developer.md
cnvkit_bash/src/cnvkit_bash.sh
{
"name": "cnvkit_bash",
"title": "cnvkit_bash",
"summary": "cnvkit_bash",
"dxapi": "1.0.0",
"version": "0.0.1",
"inputSpec": [
{
"name": "bam_tumor",
"label": "BAM Tumor Files",
"class": "array:file",
"optional": false,
"patterns": [
"*"
],
"help": ""
},
{
"name": "reference",
"label": "Reference",
"class": "file",
"optional": false,
"patterns": [
"*"
],
"help": ""
}
],
"outputSpec": [
{
"name": "cns",
"label": "CNS",
"class": "array:file",
"patterns": [
"*"
],
"help": ""
},
{
"name": "cns_filtered",
"label": "CNS Filtered",
"class": "array:file",
"patterns": [
"*"
],
"help": ""
},
{
"name": "plot",
"label": "Plot",
"class": "array:file",
"patterns": [
"*"
],
"help": ""
}
],
"runSpec": {
"timeoutPolicy": {
"*": {
"hours": 48
}
},
"interpreter": "bash",
"file": "src/cnvkit_bash.sh",
"distribution": "Ubuntu",
"release": "20.04",
"version": "0"
},
"access": {
"network": [
"*"
]
},
"regionalOptions": {
"aws:us-east-1": {
"systemRequirements": {
"*": {
"instanceType": "mem1_ssd1_v2_x4"
}
}
}
}
}"runSpec": {
"interpreter": "bash",
"file": "src/cnvkit_bash.sh",
"distribution": "Ubuntu",
"release": "20.04",
"version": "0",
"execDepends": [
{
"name": "cnvkit",
"package_manager": "pip"
},
{
"name": "BiocManager",
"package_manager": "cran"
}
],
"timeoutPolicy": {
"*": {
"hours": 48
}
}
}
{
"name": "cnvkit_bash",
"title": "cnvkit_bash",
"summary": "cnvkit_bash",
"dxapi": "1.0.0",
"version": "0.0.1",
"inputSpec": [
{
"name": "bam_tumor",
"label": "BAM Tumor Files",
"class": "array:file",
"optional": false,
"patterns": [
"*.bam"
],
"help": ""
},
{
"name": "reference",
"label": "Reference",
"class": "file",
"optional": false,
"patterns": [
"*.cnn"
],
"help": ""
}
],
"outputSpec": [
{
"name": "cns",
"label": "CNS",
class": "array:file",
"patterns": [
"*"
],
"help": ""
},
{
"name": "cns_filtered",
"label": "CNS Filtered",
"class": "array:file",
"patterns": [
"*"
],
"help": ""
},
{
"name": "plot",
"label": "Plot",
"class": "array:file",
"patterns": [
"*"
],
"help": ""
}
],
"runSpec": {
"timeoutPolicy": {
"*": {
"hours": 48
}
},
"execDepends": [
{
"name": "cnvkit",
"package_manager": "pip"
},
{
"name": "BiocManager",
"package_manager": "cran"
}
],
"interpreter": "bash",
"file": "src/cnvkit_bash.sh",
"distribution": "Ubuntu",
"release": "20.04",
"version": "0"
},
"access": {
"network": [
"*"
]
},
"regionalOptions": {
"aws:us-east-1": {
"systemRequirements": {
"*": {
"instanceType": "mem1_ssd1_v2_x4"
}
}
}
}
}
main() {
echo "Value of bam_tumor: '${bam_tumor[@]}'"
echo "Value of reference: '$reference'"
# The following line(s) use the dx command-line tool to download your file
# inputs to the local file system using variable names for the filenames. To
# recover the original filenames, you can use the output of "dx describe
# "$variable" --name".
dx download "$reference" -o reference
for i in ${!bam_tumor[@]}
do
dx download "${bam_tumor[$i]}" -o bam_tumor-$i
done
>>>>> Here is where the app code belongs <<<<<
# The following line(s) use the dx command-line tool to upload your file
# outputs after you have created them on the local file system. It assumes
# that you have used the output field name for the filename for each output,
# but you can change that behavior to suit your needs. Run "dx upload -h"
# to see more options to set metadata.
cns=$(dx upload cns --brief)
cns_filtered=$(dx upload cns_filtered --brief)
plot=$(dx upload plot --brief)
# The following line(s) use the utility dx-jobutil-add-output to format and
# add output variables to your job's output as appropriate for the output
# class. Run "dx-jobutil-add-output -h" for more information on what it
# does.
dx-jobutil-add-output cns "$cns" --class=file
dx-jobutil-add-output cns_filtered "$cns_filtered" --class=file
dx-jobutil-add-output plot "$plot" --class=file
}
#!/bin/bash
# Set pragmas to print commands and fail on errors
set -exuo pipefail
# Install required R module
Rscript -e "BiocManager::install('DNAcopy')"
# Verify the value of inputs
echo "Value of bam_tumor: '${bam_tumor[@]}'"
echo "Value of reference: '$reference'"
# Place all inputs into the "in" directory
dx-download-all-inputs --parallel
# Use "_path" versions of inputs for file paths
cnvkit.py batch \
${bam_tumor_path[@]} \
-r ${reference_path} \
-p $(expr $(nproc) - 1) \
-d cnvkit-out/ \
--scatter
# Make out directories for each output spec
mkdir -p ~/out/cns/ ~/out/cns_filtered/ ~/out/plot/
# Move CNVkit outputs to the "out" directory for upload
mv cnvkit-out/*.call.cns ~/out/cns_filtered/
mv cnvkit-out/*.cns ~/out/cns/
mv cnvkit-out/*-scatter.png ~/out/plot/
# Upload and annotate all output files
dx-upload-all-outputs --parallel
dx-download-all-inputs --parallel
in/bam_files/0/...
in/bam_files/1/...
in/reference/...
$ dx build -f
{"id": "applet-GFyV3kj0VGFkV8k04f3K11QY"}$ dx run applet-GFyV3kj0VGFkV8k04f3K11QY -h
usage: dx run applet-GFyV2G8054JBQXY64g4F7ZKk [-iINPUT_NAME=VALUE ...]
Applet: cnvkit_bash
cnvkit_bash
Inputs:
BAM Tumor Files: -ibam_tumor=(file) [-ibam_tumor=... [...]]
Reference: -ireference=(file)
Outputs:
CNS: cns (array:file)
CNS Filtered: cns_filtered (array:file)
Plot: plot (array:file)
$ dx run -y --watch applet-GFyV3kj0VGFkV8k04f3K11QY \
-ibam_tumor=file-GFxXjV006kZVQPb20G85VXBp \
-ireference=file-GFxXvpj06kZfP0QVKq2p2FGF \
--destination /outputs
Using input JSON:
{
"bam_tumor": [
{
"$dnanexus_link": "file-GFxXjV006kZVQPb20G85VXBp"
}
],
"reference": {
"$dnanexus_link": "file-GFxXvpj06kZfP0QVKq2p2FGF"
}
}
$ dx run -y --watch applet-GFyV3kj0VGFkV8k04f3K11QY \
-f cnvkit_bash/inputs.json \
--destination /outputs
{
"name": "cnvkit_asset",
"title": "cnvkit_asset",
"description": "cnvkit_asset",
"version": "0.0.1",
"distribution": "Ubuntu",
"release": "20.04",
"execDepends": [
{
"name": "cnvkit",
"package_manager": "pip"
},
{
"name": "BiocManager",
"package_manager": "cran"
}
]
}
SHELL=/bin/bash -exuo pipefail
all:
sudo Rscript -e "BiocManager::install('DNAcopy')"
Asset bundle 'record-GFyVY000X1ZK3yGg4qv32GXv' is built and can now be used
in your app/applet's dxapp.json
"runSpec": {
"timeoutPolicy": {
"*": {
"hours": 48
}
},
"assetDepends": [{"id": "record-GFyVY000X1ZK3yGg4qv32GXv"}],
"interpreter": "bash",
"file": "src/cnvkit_bash.sh",
"distribution": "Ubuntu",
"release": "20.04",
"version": "0"
},
$ cat inputs.json
{
"bam_tumor": [
{
"$dnanexus_link": "file-GFxXjV006kZVQPb20G85VXBp"
}
],
"reference": {
"$dnanexus_link": "file-GFxXvpj06kZfP0QVKq2p2FGF"
}
}
$ dx run applet-GFyVppQ0VGFxvvx44j43YyPz -f inputs.json -y
To begin, you'll code a "Hello, World!" workflow that captures the output of a command into a file. WDL syntax may look familiar if you know any C-family language like Java or Perl. For example, keywords like workflow and task are used to define blocks of statements contained inside matched curly braces ({}), and variables are defined using a data type like String or File.
In this example, you will:
Write a simple workflow in WDL
Learn two ways to capture the standard out (STDOUT) of a command block
To see this in action, make a hello directory for your work, and inside that create the file workflow.wdl with the following contents:
The version 1.0 statement states that the following WDL follows the 1.0 specification.
The workflow keyword defines a workflow name. The contents of the workflow are enclosed in matched curly braces.
The input block describes the parameters for the workflow.
WDL defines several data types you can use to describe an input value. This workflow requires a String.
WDL is not whitespace dependent, so indentation is based on your preference.
In the Setup section, you should have installed the miniwdl tool, which is useful for checking the syntax of your WDL. The following command shows the output when there are no problems:
Introduce an error in your WDL to see how the output changes. For instance, change the version to 2.0 and observe the error message:
Or change the call to write_greetings:
Cromwell will also find this error, but the message will be buried in literally thousands of lines of output.
Note that miniwdl uses a different parser than dxCompiler, and each has slightly different ideas of what constitutes valid syntax. For example, miniwdl requires commas in between input items but dxCompiler does not. In spite of their differences, I appreciate the concise reporting of errors that miniwdl provides.
To execute this workflow locally using Cromwell, you must first create a JSON file to define the input name. Create a file called inputs.json with the following contents if you'd like to extend salutations to my friend Geoffrey:
Next, run the following command to execute the workflow:
The output will be copious and should include an indication that the command was successful and the output landed in a file in the cromwell-executions directory that was created:
You can use the cat (concatenate) command to see the contents of the file. Be sure to change the file path to the one created by your execution:
Here is another way to write the command block and capture STDOUT to a named file:
The command block here uses triple angle brackets to enclose the shell commands.
The variable must be interpolated with ~{} because of the triple angle brackets. The Unix redirect operator > is used to send the STDOUT from echo into the file out.txt.
If you execute this version, the output should show that the file out.txt was created instead of the file stdout:
I can use cat again to verify that the same file was created:
Now that you have verified that the workflow runs correctly on your local machine, it's time to compile this onto the DNAnexus platform. First, create a project in your organization and take note of the project ID. I'll demonstrate using the dx command-line interface to create a project called Workflow Test:
All the dx commands will print help documentation if you supply the -h or --help flags. For instance, run dx new project --help.
You can also use the web interface, in which case you should use dx select to switch to the project. Next, use dxCompiler to compile the workflow into a workflows directory in the new project. In the following command, the dxCompiler prints the new workflow ID upon success:
Use the web interface to inspect the new workflow as shown in Figure 1. Click on the info button (an "i" in a circle to the right of the "Run" button) to verify the workflow ID is the same as you see on the command line.
Use the "Run" button in the web interface to launch the applet as shown in Figure 2. As shown in Figue 2, I indicate the applet's outputs should written to the outputs directory.
Click on the "Analysis Inputs" view to specify a name for the greeting. In Figure 3, you see I have selected the name "Jonas."
Click "Start Analysis" to start the workflow. The web interface will show the progress of running the applet as shown in Figure 4.
Figure 5 shows check marks next to each step that has been completed. Click the button to show inputs and outputs, then click on the link to the output file, which may be stdout or out.txt depending on the version of the workflow you compiled.
Click on the output file name to view the contents of the file as shown in Figure 6.
Click on the "Monitor" view to see how long the job lasted and cost as shown in Figure 7.
You can also use the dx CLI to run the applet as shown in the following interactive session:
You can also specify the input JSON on the command line as a string or a file. In the following command, I provide the JSON as a string. Also note the use of the -y (yes) flag to have the workflow run without confirmation:
You can also place the JSON into a file like so:
You can execute the workflow with this JSON file as follows:
You may also run the workflow with the -h|--help flag to see how to pass the arguments on the command line:
For instance, you can also launch the app using the following command to greet "Keith":
However you choose to launch the workflow, the new run should be displayed in the "Monitor" view of the web interface. As shown in Figure 8, the new run finished in under 1 minute.
To find out more about the latest run, click on job's name in the run table. As shown in Figure 9, the platform will reuse files from the first run as it sees that nothing has changed. This is called "smart reuse," and you can disable this feature if you like.
You can also use the CLI to view the results of the run with the dx describe command:
Notice in the preceding output that the Output lists file-GFbPkBj0XFYgB7Vj4pF8XXBQ. You can cat the contents of this file with the CLI:
Alternately, you can download the file:
The preceding command should create a new local file called stdout or out.txt, depending on the version of the workflow you compiled. Use the cat command again to verify the contents:
You can create command-line shortcuts for all the steps of checking and building your workflow by recording them as targets in a Makefile as follows:
GNU make (or a similar Make program, which you may need to install) can turn the command make local into the listed Cromwell command to run one of the workflow versions. Makefiles are a handy way to document your work and automate your efforts.
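For example, with the Makefile shown later in place, the individual steps reduce to short commands (a sketch of typical usage with the targets defined in that file):
$ make check
$ make local
$ make app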
You should now be able to do the following:
Write a valid WDL workflow that accepts a string input and interpolates that string in a bash command.
Capture the standard output of a command block either using the stdout() WDL directive or by redirecting the output of a Unix command to a named file.
Define a File type output from a task
In the next section, you'll learn how to accept a file input and launch parallel processes to speed execution of large tasks.
In this chapter, you learned some more WDL functions.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Building and running Nextflow pipelines on DNAnexus.
A Nextflow pipeline script is structured as a folder with Nextflow scripts with optional configuration files and subfolders. Below are the basic elements of the folder structure when building a Nextflow executable:
(Required) A major Nextflow file with the extension .nf containing the pipeline. The default filename is main.nf. A different filename can be specified in the nextflow.config file using manifest.mainScript = 'myfile.nf'
(Optional, recommended) A nextflow.config file.
Create the code for fastqc-nf
We are going to add each file into a folder called fastqc-nf
This is a very simple applet containing only one process which runs FASTQC on files specified using an input samplesheet or from a folder in a project on platform.
It has only three files:
main.nf : The pipeline script file
nextflow.config : Contains config info and sets params
nextflow_schema.json : Specifies the information used by the UI/CLI run command to serve the nextflow params to the user on DNAnexus
The main.nf file
Let's look at the main.nf file. As a reminder, this file can have a different name, specified in the nextflow.config file using manifest.mainScript = 'myfile.nf' if needed.
main.nf
DNAnexus expects Nextflow pipelines to use the Nextflow DSL2 standard. If you have learned Nextflow after December 2022 (when Nextflow version 22.12.0 was released) you are using DSL2.
"In Nextflow version 22.03.0-edge, DSL2 became the default DSL version. In version 22.12.0-edge, DSL1 support was removed, and the Nextflow documentation was updated to use DSL2 by default."
Each process must use a Docker container to define the software environment for the process. See the documentation for more information on using Docker containers in Nextflow processes. Here I am using a public Docker image on quay.io. This is the same Docker container used by the nf-core fastqc module. You might notice that the container line in the nf-core fastqc module is missing 'quay.io'. This is because this part is expected to be given in the nextflow.config using docker.registry = 'quay.io'.
An example of using publishDir multiple times in one process to send outputs to subfolders
Only the 'copy' mode of publishDir is supported on DNAnexus. If you do not specify a mode, then the DNAnexus executor will use copy by default so both of the publishDir lines in the example above are valid.
Assuming at runtime you assign outdir the value of './results', this example places all output files ending in .html in ./results/fastqc/html and all output files ending in .zip in ./results/fastqc/zip on the head node of the Nextflow run.
The entire outdir, with its subfolder structure intact, will be copied to the platform location specified by '--destination' in the CLI or 'Output to' in the UI once all subjobs have completed.
Only relative paths are allowed for publishDir on DNAnexus and thus params.outdir (since this is where files are published to)
As a general rule, do not attempt to access files in the publishDir directories from within a Nextflow script, as this is bad practice for many reasons. Use channels to pass files between processes.
In this example applet, I have placed the process and workflow parts in the main.nf script. For larger multi-process applets, you can place your processes in modules/workflows/subworkflows and import them into the main script as done in nfcore pipelines.
The nextflow.config file
Full File:
Explanation of Each Section:
Enable docker by default for this pipeline
Define the input parameters. You can also do this in the main.nf script but by convention nfcore pipelines do it in the nextflow.config. There are three params in this workflow, 'samplesheet' which is a file input, 'reads_dir' which is a directory path and 'outdir' which is a string defining the name of the output folder.
Here I have assigned samplesheet and reads_dir the value of null. Thus if the user does not provide a samplesheet or a reads_dir to the pipeline at runtime, the pipeline will fail. For items such as the samplesheet that should always or nearly always change at runtime, it is valuable to assign them a null value instead of a default so that a user does not accidentally run the pipeline with a default samplesheet thinking they have used a different one.
Here outdir is assigned a default of './results'. Thus, if a user does not specify a string for outdir at runtime, it will use './results'. If a user does specify an outdir, it will use the user specified one instead.
A common setting to make a process fail quickly and loudly when it encounters an issue.
Error Strategy: I have not defined an error strategy in the nextflow.config file, so the default strategy (for both the local Nextflow executor and the DNAnexus executor) is 'terminate'. See the documentation for more detailed information on choosing an errorStrategy.
Queue size: I have also not defined the queueSize, so when this applet is run, a maximum of 5 subjobs will run in parallel at any one time, unless you pass the -queue-size flag to the nextflow_run_opts option for the applet.
The nextflow_schema.json file is needed to reflect the Nextflow params (--samplesheet, --reads_dir and --outdir in this case) as DNAnexus applet inputs in the CLI and UI. If it is not present, you will not get the -isamplesheet, -ireads_dir and -ioutdir options for your applet inputs. You can also use it to do parameter validation at runtime using plugins.
nextflow_schema.json
Once you have written your script and know your parameters, you can make the schema quite quickly using the nf-core schema builder website. Note: do not put sensitive information into this builder, as information in it is stored by nf-core for 2 weeks.
There is also the option of using the nf-core schema tools on your computer to create it. You may need to manually add a format of either file-path or directory-path to some parameters if the tool doesn't do it for you.
Here we will explain how to use the schema builder website.
In the New Schema section, click the blue Submit button to start.
Near the top of the page, click the 'Add group' button. You need at least one group in your schema file to have it function on platform. All parameters must be placed into a group (you can do this by dragging and dropping them into the group). For example you might have one group called Inputs for all your input parameters and a group called Output for your output parameters with the appropriate parameters placed into the correct groups. Click required for every non optional parameter.
The default type of input is a string input. For file and directory path input parameters, click the little wheel to the right
To remove an input parameter for the pipeline from the UI and CLI, you can delete it from the nextflow_schema.json file, or place it in a section of the nextflow_schema.json file that is not referenced in the allOf section at the bottom of the json file.
You can also remove entire sections by removing their reference from the allOf section without deleting them from the file.
Ensure that you are in the project that you want to build the applet in, using dx pwd or dx env. Use dx select to switch to the correct project if required.
Assuming you have the folder called fastqc-nf with these contents (main.nf is required at a minimum):
Build applet - the applet will build in the root of your project
If you are in the fastqc-nf folder on your machine, you will need to cd .. back up a level for the command below to work.
or build using --destination to set the project level folder for the applet
or, to build in the root of the project and just change the applet name to test-fastqc-nf, run
You should see an output like the one below but with a different applet ID.
Use -a with dx build to archive previous versions of your applet and -f to force overwrite previous applet versions. The archived versions are placed in a folder called .Applet_archive in the root of the project.
You can see the build help using dx build -h or dx build --help
In the DNAnexus UI:
file-path will be rendered as a file-picker which enables loading of a file object by selecting it in the UI (can only select one file)
directory-path will be rendered as a string and will appear in the UI as a text box input. You can point to a directory by typing a string path such as dx://<project-id>:/test/ in the box or multiple files in a path such as dx://<project-id>:/test/*_R{1,2}.fastq.gz
string
Here is part of the fastqc-nf run setup screen
Notice how samplesheet has 'Select File' and a file icon but outdir and reads_dir have text input boxes.
This is because samplesheet was given 'file-path' in the nextflow_schema.json, but outdir and reads_dir were given 'directory-path', which renders as a string input, hence the text box.
In the DNAnexus CLI:
Run the applet with -h to see the input parameters for the applet
Excerpt of output from command above
string will appear as class string e.g., for param outdir
The default here is what we specified as the default in nextflow_schema.json. It cannot 'see' the default that we set in the nextflow.config so make sure they match when building the json.
directory-path will appear as class (string) e.g., for param reads_dir
See for more information on options for nextflow_schema.json on DNAnexus.
When placing a path to a file on the DNAnexus platform in a samplesheet, use the format dx://project-xxx:/path/to/file
Here is an example of a samplesheet with one sample (format of samplesheet is determined by you - this is just for illustration purposes)
In your project on the platform, click the fastqc-nf applet.
In the run applet screen, click 'Output to' and choose your output location.
Click 'Next'
At the setup screen, either input a samplesheet or write the path for reads_dir. In the image below, I have used the reads_dir param. Replace 'project-xxx' and '/path/to/reads' with your project ID and the folder name that your reads are in.
Review the rest of the inputs and change anything that you want, e.g., turn on 'preserve_cache'.
Click 'Start Analysis'
Review the name, output location etc
Click 'Launch Analysis'
Running the fastqc applet with the reads_dir as input
I am turning on preserve_cache and using -inextflow_run_opts in the command below to demonstrate how to add them to the command, but neither is required here.
Note that the *_{1,2}.fastq.gz is needed here for Channel.fromFilePairs to correctly pair up related files
I do not need -profile docker in -inextflow_run_opts
Running the fastqc applet with the samplesheet as input
Notice the different way that the path to the samplesheet is specified compared to the reads_dir in the previous example. You can read more about how this works in the documentation.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Some of the links on these pages will take the user to pages that are maintained by third parties. The accuracy and IP rights of the information on these third-party pages are the responsibility of those third parties.
The call keyword will execute the task named write_greeting. This is similar to executing a function in code.
The input keyword allows you to pass arguments to the task. The workflow's name input will be passed as greet_name to the task.
The task keyword defines a task called write_greeting.
The task also defines an input block with a parameter greet_name. It would be fine to call this name because it would not conflict with the workflow's name.
The command keyword defines a block of shell commands to execute. The block may be denoted with matched curly braces ({}) or triple angle brackets (<<</>>>).
The shell command echo prints a salutation to standard out, AKA STDOUT, which is the normal place for output from a command-line program to appear. The variable greet_name is interpolated in the string by surrounding it with ~{} or ${} because the command block uses curly braces. When using triple angle brackets, only the first syntax is valid.
outfile is set to the file out.txt, which is created by the command block.
Check the syntax of a WDL file using miniwdl.
Execute a WDL workflow on your local computer using Cromwell with the inputs defined in a JSON file.
Create a new project to contain a workflow.
Compile a WDL workflow into a DNAnexus applet using the dxCompiler.
Run an applet using the web interface or the CLI.
Inspect the file output of an applet using the web interface or the CLI.
Download a file from the platform.
Use a Makefile to document and automate the steps for building and running a workflow.









nextflow_schema.json
(Optional) Subfolders and other configuration files. Subfolders and other configuration files can be referenced by the major Nextflow file or nextflow.config via the include or includeConfig keyword. Ensure that all referenced subfolders and files exist under the pipeline script folder at the time of building or importing the pipeline.
(Optional) A bin folder containing scripts required by the pipeline can also be used and this will be added to the PATH environment variable by nextflow - for more info see the nextflow documentation on custom scripts and tools
For other files/folders such as assets, an nf-core flavored folder structure is encouraged but not required
You should define the cpus, memory, disk (at least one of these 3), or you can use machineType and the name of the exact DNAnexus instance that you want to use for this process.
For example machineType 'mem2_ssd1_v2_x2'
If you do not specify the resources required for a process, it will by default use the mem2_ssd1_v2_x4 instance type (this is the same machine type used for the head node) and processes that require more memory than this will fail.
You should use the publishDir directive to capture the output files that you want to publish from each process. It is generally advisable to publish your output files to an output directory defined by params.outdir (the naming doesn't matter as long as it's consistent within your pipeline). You can have as many subfolders of your outdir as needed, and you can use the publishDir directive multiple times in the same process to send different output files to different subfolders.
At the bottom of the popup in the Format section, for a file input, choose File path, or for a directory path choose Directory path. Having these two correct is important for how you specify the inputs on the platform.
When you are finished building your schema file, click 'Finished', then 'Copy pipeline schema' and paste the information into a file called nextflow_schema.json in the same directory as your applet main.nf and nextflow.config files.
If you note the Schema cache ID then you can type that into the website to pull up and edit that file within 14 days.
When (string) is given for a parameter (used for folder paths and strings; the input is of the 'string' class), use dx://project-XXXXX:/path/to/folder, e.g., dx run fastqc-nf -ireads_dir=dx://project-GgYbKGQ0QFpxF6qkPK4KxQ6Q:/FASTQ/*_{1,2}.fastq.gz
file-path will appear as class file, e.g., for param samplesheet:
When (file) is given for a parameter (i.e., the input is of the 'file' class), use project-XXXXX:/path/to/file, e.g., dx run fastqc-nf -isamplesheet=project-XXXXX:/samplesheet-example.csv ....
nextflow.config
--name names the job







version 1.0
workflow hello_world {
input {
String name
}
call write_greeting {
input: greet_name = name
}
}
task write_greeting {
input {
String greet_name
}
command {
echo 'Hello, ${greet_name}!'
}
output {
File outfile = stdout()
}
}
$ miniwdl check workflow.wdl
workflow.wdl
workflow hello_world
call write_greeting
task write_greeting
$ miniwdl check workflow.wdl
(workflow.wdl Ln 0 Col 0) unknown WDL version 2.0; choices:
draft-2, 1.0, development, 1.1
$ miniwdl check workflow.wdl
(workflow.wdl Ln 8 Col 5) No such task/workflow: write_greetings
call write_greetings {
^^^^^^^^^^^^^^^^^^^^^^
{ "hello_world.name": "Geoffrey" }
$ java -jar ~/cromwell-82.jar run --inputs inputs.json workflow.wdl
{
"hello_world.write_greeting.outfile":
"/Users/[email protected]/work/srna/wdl_tutorial/hello/
cromwell-executions/hello_world/7f02fe78-4aff-4e01-95da-c9b6e021773d/
call-write_greeting/execution/stdout"
}
$ cat cromwell-executions/hello_world/7f02fe78-4aff-4e01-95da-c9b6e021773d/call-write_greeting/execution/stdout
Hello, Geoffrey!
version 1.0
workflow hello_world {
input {
String name
}
call write_greeting {
input: greet_name = name
}
}
task write_greeting {
input {
String greet_name
}
command <<<
echo 'Hello, ~{greet_name}!' > out.txt
>>>
output {
File outfile = "out.txt"
}
}
{
"outputs": {
"hello_world.write_greeting.outfile":
"/Users/[email protected]/work/srna/wdl_tutorial/hello/
cromwell-executions/hello_world/1dd3abd8-be70-418b-9a31-b4ea9d5add99/
call-write_greeting/execution/out.txt"
},
"id": "1dd3abd8-be70-418b-9a31-b4ea9d5add99"
}
$ cat cromwell-executions/hello_world/1dd3abd8-be70-418b-9a31-b4ea9d5add99/
call-write_greeting/execution/out.txt
Hello, Geoffrey!
$ dx new project "Workflow Test"
Created new project called "Workflow Test" (project-GFbKy7Q0ff1k3fGq48ZFZ45p)
Switch to new project now? [y/N]: y
$ java -jar ~/dxCompiler-2.10.2.jar compile workflow.wdl -folder /workflows \
> -project project-GFbKy7Q0ff1k3fGq48ZFZ45p
workflow-GFbP9480ff1zVQPG48zXpfzb
$ dx run workflow-GFbP9480ff1zVQPG48zXpfzb
Entering interactive mode for input selection.
Input: stage-common.name (stage-common.name)
Class: string
Enter string value ('?' for more options)
stage-common.name: Ronald
Select an optional parameter to set by its # (^D or <ENTER> to finish):
[0] stage-common.overrides___ (stage-common.overrides___)
[1] stage-common.overrides______dxfiles (stage-common.overrides______dxfiles)
[2] stage-0.greet_name (stage-0.greet_name) [default={"$dnanexus_link": {"outputField": "name", "stage": "stage-common"}}]
[3] stage-0.overrides___ (stage-0.overrides___)
[4] stage-0.overrides______dxfiles (stage-0.overrides______dxfiles)
[5] stage-outputs.overrides___ (stage-outputs.overrides___)
[6] stage-outputs.overrides______dxfiles (stage-outputs.overrides______dxfiles)
Optional param #:
The following 1 stage(s) will reuse results from a previous analysis:
Stage 2: outputs (job-GFbPJx80ff1gYQy5Fg3pK3GY)
Using input JSON:
{
"stage-common.name": "Ronald"
}
Confirm running the executable with this input [Y/n]: y
Calling workflow-GFbP9480ff1zVQPG48zXpfzb with output destination
project-GFbKy7Q0ff1k3fGq48ZFZ45p:/
Analysis ID: analysis-GFbPjVj0ff1ZypqJ8vQj8kjZ
$ dx run workflow-GFbP9480ff1zVQPG48zXpfzb -j '{"stage-common.name": "Ronald"}'
-y
The following 3 stage(s) will reuse results from a previous analysis:
Stage 0: common (job-GFbPjVj0ff1ZypqJ8vQj8kjf)
Stage 1: write_greeting (job-GFbPjVj0ff1ZypqJ8vQj8kjg)
Stage 2: outputs (job-GFbPJx80ff1gYQy5Fg3pK3GY)
Using input JSON:
{
"stage-common.name": "Ronald"
}
Calling workflow-GFbP9480ff1zVQPG48zXpfzb with output destination
project-GFbKy7Q0ff1k3fGq48ZFZ45p:/
Analysis ID: analysis-GFbPkFj0ff1k3fGq48ZFZ5Jy
$ cat app_inputs.json
{"stage-common.name": "Ronald"}$ dx run -f app_inputs.json workflow-GFbP9480ff1zVQPG48zXpfzb$ dx run workflow-GFbP9480ff1zVQPG48zXpfzb -h
usage: dx run workflow-GFbP9480ff1zVQPG48zXpfzb [-iINPUT_NAME=VALUE ...]
Workflow: hello_world
Inputs:
stage-common
stage-common.name: -istage-common.name=(string)
stage-common: Reserved for dxCompiler
stage-common.overrides___: [-istage-common.overrides___=(hash)]
stage-common.overrides______dxfiles: [-istage-common.overrides______dxfiles=(>
stage-0
stage-0.greet_name: [-istage-0.greet_name=(string, default={"$dnanexus_link":>
stage-0: Reserved for dxCompiler
stage-0.overrides___: [-istage-0.overrides___=(hash)]
stage-0.overrides______dxfiles: [-istage-0.overrides______dxfiles=(file) [-is>
stage-outputs: Reserved for dxCompiler
stage-outputs.overrides___: [-istage-outputs.overrides___=(hash)]
stage-outputs.overrides______dxfiles: [-istage-outputs.overrides______dxfiles>
Outputs:
stage-common.name: stage-common.name (string)
stage-0.outfile: stage-0.outfile (file)
$ dx run workflow-GFbP9480ff1zVQPG48zXpfzb -istage-common.name=Keith
Result 1:
ID analysis-GFbPjVj0ff1ZypqJ8vQj8kjZ
Class analysis
Job name hello_world
Executable name hello_world
Project context project-GFbKy7Q0ff1k3fGq48ZFZ45p
Billed to org-sos
Workspace container-GFbPjVj0ff1ZypqJ8vQj8kjb
Workflow workflow-GFbP9480ff1zVQPG48zXpfzb
Priority normal
State done
Root execution analysis-GFbPjVj0ff1ZypqJ8vQj8kjZ
Parent job -
Stage 0 common (stage-common)
Executable applet-GFbP93j0ff1py9y87vzB2QQJ
Execution job-GFbPjVj0ff1ZypqJ8vQj8kjf (done)
Stage 1 write_greeting (stage-0)
Executable applet-GFbP9380ff1XzVKkG9kyVg64
Execution job-GFbPjVj0ff1ZypqJ8vQj8kjg (done)
Stage 2 outputs (stage-outputs)
Executable applet-GFbP9400ff1pK6v113KJQF9g
Execution [job-GFbPJx80ff1gYQy5Fg3pK3GY] (done)
Cached from analysis-GFbPJx80ff1gYQy5Fg3pK3GP
Input stage-common.name = "Ronald"
[stage-0.greet_name = {"$dnanexus_link": {"analysis":
"analysis-GFbPjVj0ff1ZypqJ8vQj8kjZ", "stage":
"stage-common", "field": "name", "wasInternal": true}}]
Output stage-common.name = "Ronald"
stage-0.outfile = file-GFbPkBj0XFYgB7Vj4pF8XXBQ
Output folder /
Launched by kyclark
Created Wed Aug 3 15:52:55 2022
Finished Wed Aug 3 15:54:51 2022 (Wall-clock time: 0:01:55)
Last modified Wed Aug 3 15:54:54 2022
Depends on -
Tags -
Properties -
Total Price $0.00
detachedFrom null
rank 0
priceComputedAt 1659567291327
currency {"dxCode": 0, "code": "USD", "symbol": "$",
"symbolPosition": "left",
"decimalSymbol": ".",
"groupingSymbol": ","}
totalEgress {"regionLocalEgress": 0, "internetEgress": 0,
"interRegionEgress": 0}
egressComputedAt 1659567291327
costLimit null
$ dx cat file-GFbPkBj0XFYgB7Vj4pF8XXBQ
Hello, Ronald!
$ dx download file-GFbPkBj0XFYgB7Vj4pF8XXBQ
[===========================================================>] Completed 15
of 15 bytes (100%) /Users/[email protected]/work/srna/wdl_tutorial/stdout
$ cat stdout
Hello, Ronald!
WORKFLOW = workflow.wdl
PROJECT_ID = project-GFPQvY007GyyXgXGP7x9zbGb
DXCOMPILER = java -jar ~/dxCompiler-2.10.2.jar
CROMWELL = java -jar ~/cromwell-82.jar
check:
miniwdl check $(WORKFLOW)
local:
$(CROMWELL) run --inputs inputs.json $(WORKFLOW)
local2:
$(CROMWELL) run workflow2.wdl
app:
$(DXCOMPILER) compile $(WORKFLOW) \
-archive \
-folder /workflows \
-project $(PROJECT_ID)
clean:
rm -rf cromwell-workflow-logs cromwell-executions
samplesheet: [-isamplesheet=(file)]
(Nextflow pipeline required)
// Use newest nextflow dsl - not required to add this line - only dsl2 is supported on DNAnexus
nextflow.enable.dsl = 2
log.info """\
===================================
F A S T Q C - E X A M P L E
===================================
samplesheet : ${params.samplesheet}
reads_dir : ${params.reads_dir}
outdir : ${params.outdir}
"""
.stripIndent()
process FASTQC {
tag "FastQC - ${sample_id}"
container 'quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0'
cpus 2
memory { 4.GB * task.attempt }
publishDir "${params.outdir}", pattern: "*", mode:'copy'
input:
tuple val(sample_id), path(reads)
output:
path "*"
script:
"""
fastqc --threads ${task.cpus} $reads
"""
}
/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
MAIN WORKFLOW
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*/
workflow {
if (params.samplesheet != null && params.reads_dir == null) {
reads_ch = Channel
.fromPath(params.samplesheet)
.splitCsv()
.map { row -> tuple(row[0], row[1], row[2]) }
reads_ch.view()
FASTQC(reads_ch)
} else if (params.samplesheet == null && params.reads_dir != null) {
reads_ch = Channel.fromFilePairs(params.reads_dir)
reads_ch.view()
FASTQC(reads_ch)
} else {
error "Either samplesheet or reads_dir should be provided, not both"
}
}
workflow.onComplete {
log.info ( workflow.success ? "\nworkflow is done!\n" : "Oops .. something went wrong" )
}
process foo {
publishDir "${params.outdir}/fastqc/html", pattern "*.html", mode:'copy'
publishDir "${params.outdir}/fastqc/zip", pattern "*.zip"
..
}
// Default parameters
docker {
enabled = true
}
params {
samplesheet = null
reads_dir = null
outdir = "./results"
}
// Processes should always fail if any pipe element has a non-zero exit code.
process.shell = ['/bin/bash', '-euo', 'pipefail']
docker {
enabled = true
}
params {
samplesheet = null
reads_dir = null
outdir = "./results"
}
// Processes should always fail if any pipe element has a non-zero exit code.
process.shell = ['/bin/bash', '-euo', 'pipefail']
{
"$schema": "http://json-schema.org/draft-07/schema",
"$id": "https://raw.githubusercontent.com/YOUR_PIPELINE/master/nextflow_schema.json",
"title": "Nextflow pipeline parameters",
"description": "This pipeline uses Nextflow and processes some kind of data. The JSON Schema was built using the nf-core pipeline schema builder.",
"type": "object",
"definitions": {
"inputs": {
"title": "Inputs",
"type": "object",
"description": "",
"default": "",
"properties": {
"samplesheet": {
"type": "string",
"description": "Input samplesheet in CSV format",
"format": "file-path"
},
"reads_dir": {
"type": "string",
"description": "Reads directory for file pairs with wildcard",
"format": "directory-path"
},
"outdir": {
"type": "string",
"format": "directory-path",
"description": "Local path to output directory",
"default": "./results"
}
}
}
},
"allOf": [
{
"$ref": "#/definitions/inputs"
}
]
}
#select project
dx select project-ID
main.nf
nextflow.config
nextflow_schema.json
dx build --nextflow fastqc-nf
dx build -a --nextflow fastqc-nf --destination project-XXXXX:/TEST/fastqc-nf
dx build -a --nextflow fastqc-nf --destination project-XXXXX:/test-fastqc-nf
{"id": "applet-ID"}
dx run fastqc-nf -h
usage: dx run fastqc-nf [-iINPUT_NAME=VALUE ...]
Applet: fastqc-nf
fastqc-nf
Inputs:
outdir: [-ioutdir=(string)]
(Nextflow pipeline required) Default value:./results
reads_dir: [-ireads_dir=(string)]
(Nextflow pipeline required)
samplesheet: [-isamplesheet=(file)]
(Nextflow pipeline required)
....
outdir: [-ioutdir=(string)]
(Nextflow pipeline required) Default value:./results
sample_name,fastq_1,fastq_2
sampleA,dx://project-xxx:/path/to/sampleA_r1.fastq.gz,dx://project-xxx:/path/to/sampleA_r2.fastq.gz
dx run fastqc-nf \
-ireads_dir="dx://project-ID:/FASTQ/*_{1,2}.fastq.gz" \
-ioutdir="./fastqc-out-rd" \
-ipreserve_cache=true \
-inextflow_run_opts='-queue-size 10' \
--destination "project-ID:/USERS/FOLDERNAME" \
--name fastqc-nf-with-reads-dir \
-y
dx run fastqc-nf -isamplesheet="project-ID:/samplesheet-example.csv" \
-ioutdir="./fastqc-out-sh" \
--destination "project-ID:/USERS/FILENAME" \
--name fastqc-nf-with-samplesheet \
-y
reads_dir: [-ireads_dir=(string)]
(Nextflow pipeline required)
Users of the platform like to interact with it in a variety of ways (shown below), but this section is dedicated to those that want to learn how to interact with it using the command line, or CLI.
The CLI interacts with the platform in the following way:
The CLI (command line interface) is run locally on your own machine.
On your local machine, you will download the SDK (software development kit), which we also call dx-toolkit. Information on downloading it and other requirements is found in the Getting Started Guide. Once set up, this allows you to log into the platform and explore your data/ projects, create apps and workflows, and launch analyses.
API (application programming interface) servers are used to interact with the Platform using HTTP requests. The arguments for this request are fields in a JSON file. If you want more details on this structure, you can refer to the API documentation.
Please ensure that you are running Python 3 before starting this install.
To install:
To upgrade dxpy
Further details can be found in our documentation if you need them.
The dx command will be your most used utility for interacting with the DNAnexus platform. You can run the command with no arguments or with the -h or --help flags to see the usage:
Sometimes the usage may occupy your entire terminal, in which case you may see (END) to show that you are at the end of the documentation. Press q to quit the usage, or use the universal Ctrl-C to send an interrupt signal to the process to kill it.
Run dx help to read about the categories of commands you can run:
Let's start by using dx login to gain access to the DNAnexus platform from the command line. All dx commands will respond to -h|--help, so run the command with one of these flags to read the usage:
The help documentation is often called the usage because that is often the first word of the output. In the previous output, notice that all the arguments are enclosed in square brackets, e.g., [--token TOKEN]. This is a common convention in Unix documentation to indicate that the argument is optional. The lack of such square brackets means the argument is required.
Some of the arguments require a value to follow. For example, --token TOKEN means the argument --token must be followed by the string value for the token. Arguments like --save are known as flags. They are either present or not and often represent a Boolean value, usually "True" when present and "False" when absent.
The most basic usage for login is to enter your username and password when prompted:
You may also generate a token in the web UI for use on the command line:
Information on setting up tokens can be found in our documentation.
Use dx logout to log out of the platform. This invalidates a token.
If you are ever in doubt of your username, use dx whoami to see your identity.
When you ssh into a cloud workstation, you will be your normal DNAnexus user.
When running the ttyd app to access a cloud workstation through the UI, you will be the privileged Unix user root.
When you ssh into a running job, you will be the user dnanexus.
A project is the smallest unit of sharing in DNAnexus, and you must always work in the context of a project. Upon login, you will be prompted to select a project. To change projects, use dx select. Use -h|--help to view the usage:
When run with no options, you will be presented a list of your projects and privilege:
Press Enter to choose the first project, or select a number 0-9 to choose a project or m for "more" options. You can also provide a project name or ID as the first argument:
Use the --level option to specify only projects where you have a particular permission. For instance, dx select --level ADMINISTER will show only projects where you are an administrator.
Normally, projects are private to your organization, but the --public option will display the public projects that DNAnexus uses to share common resources like sequence files or indexes for reference genomes:
Press Ctrl-C to exit the program without making a selection.
If you are ever in doubt as to your current project, run dx pwd (print working directory):
Alternatively, you can run dx env to see your current environment:
If I wanted to share some data with a collaborator, I would use dx new project to create a new project to hold select data and apps. Following is the usage:
I will use this command to create a new project in the AWS US-East-1 region. See the documentation for a list of available regions. The command displays the new project ID and prompts to switch into the new project:
Next, I would use dx invite <user-id> to invite users to the project. Start with the usage to see how to call the command:
The usage shows that this command includes three positional arguments, the first of which (invitee) is required and the other two (project, permissions) are optional. Your currently selected project is the default project, and "VIEW" is the default permission. If you wish to indicate some permission other than "VIEW," you must specify the project first.
Use dx uninvite <user-id> to revoke a user's access to a project:
Earlier, I introduced dx pwd to print working directory to find my currently selected project.
Notice that the output shows the project name and the directory /, which is the root directory of the project:
The command dx ls will list the contents of a directory. Notice in the usage that the directory name is optional, in which case it will use the current working directory:
There is nothing to list because I just created this project, so I'll add some data next.
I will use the command dx cp to copy a small file from one of the public projects into my project. I'll start with the usage:
The usage shows source [source …], which is another Unix convention to indicate that the argument may be repeated. This means you can indicate several source files or directories to be copied to the final destination.
I'll copy the file hs38DH.dict from the project "Reference Genome Files: AWS US (East)" into the root directory of my new project. The command will only produce output on error:
I must specify the source file using the project and file ID. When you refer to files inside your current project, it's only necessary to use the file ID.
Now I can list the one file:
Often you'll want to use the file ID, which you can view using the -l|--long flag to see the long listing that includes more metadata:
I've decided I want to create a data directory to hold files such as this, so I will use dx mkdir data. The command will produce no output on success. A new listing shows data/ where the trailing slash indicates this is a directory:
To move the hs38DH.dict into the data directory, I can either use the file name or ID:
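For illustration, a sketch of either form (the file ID here is a placeholder; use the ID from your own long listing):
$ dx mv hs38DH.dict data/
$ dx mv file-xxxx data/   # hypothetical file ID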
A new listing shows that the file is no longer in the root directory:
I can specify the data directory to view the contents:
Alternatively, I can use dx cd data to change directories. The command dx pwd will verify that I'm in the new folder:
If I execute dx ls now, I'll see the contents of the data directory:
Return to the root directory of the project by running dx cd or dx cd /.
Another way to inspect the structure of a project is using dx tree:
With no options, you will see a tree structure of the project:
This command will also show the long listing with -l|--long:
I want to create a local file on my computer and add it to the project. I'll use the echo command to redirect some text into a file:
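A minimal sketch of creating the local file, assuming the contents are simply "hello":
$ echo "hello" > hello.txt
$ cat hello.txt
hello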
I'll use the dx upload command. The usage shows that filename is required and may be repeated.
There are many options to the command, and here are a few to highlight:
--brief: Display a brief version of the return value; for most commands, prints a DNAnexus ID per line
-r, --recursive: Upload directories recursively
--path [PATH], --destination [PATH]: DNAnexus path to upload file(s) to (default uses current project and folder if not provided)
Run dx upload hello.txt and see that the new file exists in the root directory of your current project:
You can also upload data using the UI. Under the "Add" menu, you will find the following:
Upload Data: Use your browser to add files to the project. This is the same as using dx upload.
Copy Data From Project: Add data from existing projects on the platform. This is the same as dx cp.
Add Data From Server: Add data from any publicly accessible URL such as an HTTP or FTP site. This is the same as running the app.
In addition, we offer an app.
I would like to check the new file on the platform. The dx cat command will, like the Unix cat concatenate command, print the entire contents of a file to the console:
I can use this to verify that the file was correctly uploaded:
You might expect the following command to upload hello.txt into the data directory:
Unfortunately, this will create a file called data alongside a directory called data:
I can verify that the data file contains "hello":
Note this important part of upload's usage:
This brings up an interesting point that file names are not unique on the DNAnexus platform. The only unique identifier is the file ID, and so this is always the best way to refer to a file. To rectify the duplication, I will get the file ID:
I can remove the file using dx rm file-GXZB2180fF65j2G1197pP7By.
If I dx upload hello.txt file again, I will not overwrite the existing file. Rather, another copy of the file will be created with a new file ID:
The concept of immutability was covered in "Course 101 Overview of the DNAnexus Platform User Interface": Remember the crucially important fact that data objects on the DNAnexus platform are immutable. They can only be created (e.g., by uploading them) or removed, but they can never be overwritten. A given object ID always points to the same collection of bits, which leads to downstream benefits like reusing the outputs of jobs that share the same executable and input IDs.
I cannot remove the file by filename as it's not unique, so I'm prompted to select which file I want:
I used dx cat hello.txt to read the contents of the entire file because I knew the file had only one line. It's far safer to use dx head to look at just the first few lines (the default is 10):
For instance, I can peek at the data/hs38DH.dict file:
Another option to check the file is to download it:
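A sketch using dx download with the -o option to choose a local filename (using the file ID avoids ambiguity when the name is duplicated; the local filename here is just an example):
$ dx download file-GXZB1v80fF6BXJ8p7PvZPy1v -o hello_check.txt
$ cat hello_check.txt
hello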
Every data object on the platform has a unique identifier prefixed with the type of object such as "file-," "record-," or "applet-." Earlier, I saw that hello.txt has the ID file-GXZB1v80fF6BXJ8p7PvZPy1v. I can use the dx describe command to view the metadata:
I could use the filename, if it's unique, but it's always best practice to use the file ID:
As shown in the usage, the --delim option causes the output table to use whatever delimiter you want between the columns. This could be useful if you wish to parse the output programmatically. The tab character is the default delimiter, but I can use a comma like so:
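For illustration, a sketch of the comma-delimited form with the file ID noted above:
$ dx describe file-GXZB1v80fF6BXJ8p7PvZPy1v --delim ,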
The --json flag returns the same data in JavaScript Object Notation (JSON), which we'll discuss in a later chapter:
I can use dx describe to view the metadata associated with any object identifier on the platform. For instance, I'll use head to view the first few lines of the project's metadata:
Find another entity ID, such as your billing org, to use with the command.
I can use dx mv to move a file or directory within a project:
For instance, I can rename hello.txt to goodbye.txt with the command dx mv hello.txt goodbye.txt. The file ID remains the same:
I can also move goodbye.txt to the data directory and rename it back to hello.txt. Again, the file ID remains the same because I have only changed some of the file's metadata:
As noted in the preceding usage, I should use dx cp to copy data from one project to another. If I attempt to copy a file within a project, I will get an error:
The only way to make an actual copy of a file is to upload it again as I did earlier when I added the hello.txt file a second time.
Data objects on the platform exist as bits in AWS or Azure storage, and the associated metadata is stored in a DNAnexus database. If two projects are in the same region such as AWS US-East-1, then dx cp doesn't actually copy the bits but rather creates a new database entry pointing to the object. This means you don't pay for additional storage. Copying between regions, however, does make a physical copy of the bits and will cost money for data egress and storage. When in doubt, use dx describe <project-id> to see a project's "Region" attribute or check the "Settings" in the project view UI.
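For example, a quick sketch of checking a project's region from the CLI (the project ID is a placeholder):
$ dx describe project-xxxx | grep -i region   # hypothetical project ID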
The dx find command will help you search for entities including:
apps
globalworkflows
jobs
data
I can use the dx find data command to search data objects such as files and applets. I'll display the first part of the usage as it's rather long:
Run the command in the current project to see the two files:
I can use the --name option to look for a file by name:
I can also specify a Unix file glob pattern, such as all files that begin with h:
Or all files that end with .dict. Note in this example that the asterisk is escaped with a backslash to prevent my shell from expanding it locally, as I want the literal star to be given as the argument:
The --brief flag will return only the file ID:
This is useful, for instance, for downloading a file:
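A sketch combining the two commands with shell command substitution (the quotes keep the glob from expanding locally):
$ dx download $(dx find data --name "*.dict" --brief)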
The --json flag will return the results in JSON format. In the JSON chapter, you will learn how to parse these results for more advanced querying and data manipulation:
The --class option accepts the following values:
applet
database
file
record
The --state options accepts the following values:
open: A file that is currently being uploaded
closing: A file that is done uploading but is still being validated
closed: A file that is uploaded and validated
There are many more options for finding data and other entities on the platform that will be covered in later chapters.
It's time to run an app, but which one? I'd like to have a FASTQ file to work with, so I'll start by using the SRA FASTQ Importer. I can never quite remember the name of the app, so I'll search for it using a wildcard:
The "x" in the first column indicates this is an app supported by DNAnexus.
I can find information about the inputs and outputs to the app using either of these commands:
dx describe sra_fastq_importer
dx run sra_fastq_importer -h
I prefer the output from the second command:
Looking at the usage for the app, I see that only the -iaccession argument is required as all the others are shown enclosed with square brackets, e.g., [-ingc_key=(file)]. I can run the app with the SRA accession SRR070372 (C. elegans), answering "yes" to both launching and watching the app:
The equal sign in -iaccession=SRR070372 is required.
The output of watching is the same as you would see from the UI if you click the "MONITOR" tab in the project view and then "View Log" while the app is running. The end of the watch shows the app ran successfully and that a new file was created in my project:
I can find the size of the file with dx ls:
Now I'd like to feed this into FastQC. I'll search for the app by name just to be sure, and, yes, it's called "fastqc":
Again, I use either dx describe or dx run with -h to see what inputs the app requires.
I will use the new file's ID as the input to FastQC, and I'll run it using the additional flags -y to confirm launching and --watch to immediately start watching the job:
Notice that the confirmation shows "Using input JSON". If you like, you can save that to a file called, for example, input.json:
I can then launch the job using the -f|--input-json-file argument along with the --brief flag to show only the resulting job ID:
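A minimal sketch of that launch (input.json is the file saved above; the job ID returned is the one referenced below and will differ for you):
$ dx run fastqc -f input.json -y --brief
job-GXf930j071xJfYqfJ2kkvk8v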
Since the output will be the same, I can kill the job using dx terminate job-GXf930j071xJfYqfJ2kkvk8v.
The end of the watch shows that the job finishes successfully:
I would like to get a feel for the output, so I'll use dx head on the stats_txt output file ID:
You are now able to:
List the advantages of interacting with the platform via the command line interface
List the functions of the SDK and the API
Describe the purpose of the dx-toolkit
Apply frequently used dx-toolkit commands to execute common use cases, applicable to a broad audience of users
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Import From AWS S3: Add data from an S3 bucket. This is the same as running the AWS S3 Importer app.
orgs
org members
org projects
org apps
workflow
any: any of the above

pip3 install dxpy
pip3 install --upgrade dxpy

usage: dx [-h] [--version] command ...
DNAnexus Command-Line Client, API v1.0.0, client v0.346.0
dx is a command-line client for interacting with the DNAnexus platform. You
can log in, navigate, upload, organize and share your data, launch analyses,
and more. For a quick tour of what the tool can do, see
https://documentation.dnanexus.com/getting-started/tutorials/cli-quickstart#q>
For a breakdown of dx commands by category, run "dx help".
dx exits with exit code 3 if invalid input is provided or an invalid operation
is requested, and exit code 1 if an internal error is encountered. The latter
usually indicate bugs in dx; please report them at
https://github.com/dnanexus/dx-toolkit/issues
options:
-h, --help show this help message and exit
--env-help Display help message for overriding environment
variables
--version show program's version number and exit$ dx help
usage: dx help [-h] [command_or_category] [subcommand]
Displays the help message for the given command (and subcommand if given), or
displays the list of all commands in the given category.
CATEGORIES
all All commands
session Manage your login session
fs Navigate and organize your projects and files
data View, download, and upload data
metadata View and modify metadata for projects, data, and executions
workflow View and modify workflows
exec Manage and run apps, applets, and workflows
org Administer and operate on orgs
other Miscellaneous advanced utilities$ dx login -h
usage: dx login [-h] [--env-help] [--token TOKEN] [--noprojects] [--save]
[--timeout TIMEOUT]
Log in interactively and acquire credentials. Use "--token" to log in with an
existing API token.
options:
-h, --help show this help message and exit
--env-help Display help message for overriding environment variables
--token TOKEN Authentication token to use
--noprojects Do not print available projects
--save Save token and other environment variables for future
sessions
--timeout TIMEOUT Timeout for this login token (in seconds, or use suffix
s, m, h, d, w, M, y)$ dx login
Acquiring credentials from https://auth.dnanexus.com
Username: XXXXXXXX
Password: XXXXXXXX

$ dx login --token xxxxxxxxxxx

$ dx select -h
usage: dx select [-h] [--env-help] [--name NAME]
[--level {VIEW,UPLOAD,CONTRIBUTE,ADMINISTER}] [--public]
[project]
Interactively list and select a project to switch to. By default, only lists
projects for which you have at least CONTRIBUTE permissions. Use --public to
see the list of public projects.
positional arguments:
project Name or ID of a project to switch to; if not provided
a list will be provided for you
options:
-h, --help show this help message and exit
--env-help Display help message for overriding environment
variables
--name NAME Name of the project (wildcard patterns supported)
--level {VIEW,UPLOAD,CONTRIBUTE,ADMINISTER}
Minimum level of permissions expected
--public Include ONLY public projects (will automatically set
--level to VIEW)$ dx select
Note: Use dx select --level VIEW or dx select --public to
select from projects for which you only have VIEW permissions.
Available projects (CONTRIBUTE or higher):
0) App Dev (ADMINISTER)
1) Methylation (ADMINISTER)
2) Genomes (ADMINISTER)
3) WTS (ADMINISTER)
4) WGS (ADMINISTER)
5) Exome (ADMINISTER)
6) QC (ADMINISTER)
7) Collaborators (ADMINISTER)
8) Pipeline Dev (ADMINISTER)
9) WDL Test (ADMINISTER)
m) More options not shown...
Pick a numbered choice or "m" for more options [0]:$ dx select project-XXXXXXXXXXXXXXXXXXXXXXXX
$ dx select "Pipeline Dev"$ dx select --public
Available public projects:
0) Reference Genome Files: Azure US (West) (VIEW)
1) App_Assets_Europe(London)_Internal (VIEW)
2) Reference Genome Files: Azure Amsterdam (VIEW)
3) Reference Genome Files: AWS Germany (VIEW)
4) Reference Genome Files: AWS US (East) (VIEW)
5) Reference Genome Files: AWS Europe (London) (VIEW)
6) App and Applet Assets Azure (VIEW)
7) dxCompiler_Europe_London (VIEW)
8) dxCompiler_Sydney (VIEW)
9) dxCompiler_Berlin (VIEW)
m) More options not shown...
Pick a numbered choice or "m" for more options:$ dx pwd
Pipeline Dev:/$ dx env
Auth token used XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
API server protocol https
API server host api.dnanexus.com
API server port 443
Current workspace project-XXXXXXXXXXXXXXXXXXXXXXXX
Current workspace name "Pipeline Dev"
Current folder /
Current user test_user$ dx new project -h
usage: dx new project [-h] [--brief | --verbose] [--env-help]
[--region REGION] [-s] [--bill-to BILL_TO] [--phi]
[--database-ui-view-only]
[name]
Create a new project
positional arguments:
name Name of the new project
options:
-h, --help show this help message and exit
--brief Display a brief version of the return value; for most
commands, prints a DNAnexus ID per line
--verbose If available, displays extra verbose output
--env-help Display help message for overriding environment
variables
--region REGION Region affinity of the new project
-s, --select Select the new project as current after creating
--bill-to BILL_TO ID of the user or org to which the project will be
billed. The default value is the billTo of the
requesting user.
--phi Add PHI protection to project
--database-ui-view-only
Viewers on the project cannot access database data
directly$ dx new project --region aws:us-east-1 demo_project
Created new project called "demo_project" (project-GXZ90x00fF6F4fy1K20x4gv9)
Switch to new project now? [y/N]: y$ dx invite -h
usage: dx invite [-h] [--env-help] [--no-email]
invitee [project] [{VIEW,UPLOAD,CONTRIBUTE,ADMINISTER}]
Invite a DNAnexus entity to a project. If the invitee is not recognized as a
DNAnexus ID, it will be treated as a username, i.e. "dx invite alice : VIEW"
is equivalent to inviting the user with user ID "user-alice" to view your
current default project.
positional arguments:
invitee Entity to invite
project Project to invite the invitee to
{VIEW,UPLOAD,CONTRIBUTE,ADMINISTER}
Permissions level the new member should have
options:
-h, --help show this help message and exit
--env-help Display help message for overriding environment
variables
--no-email Disable email notifications to invitee$ dx uninvite -h
usage: dx uninvite [-h] [--env-help] entity [project]
Revoke others' permissions on a project you administer. If the entity is not
recognized as a DNAnexus ID, it will be treated as a username, i.e. "dx
uninvite alice :" is equivalent to revoking the permissions of the user with
user ID "user-alice" to your current default project.
positional arguments:
entity Entity to uninvite
project Project to revoke permissions from
options:
-h, --help show this help message and exit
--env-help Display help message for overriding environment variables$ dx pwd -h
usage: dx pwd [-h] [--env-help]
Print current working directory
options:
-h, --help show this help message and exit
--env-help Display help message for overriding environment variables$ dx pwd
demo_project:/$ dx ls -h
usage: dx ls [-h] [--color {off,on,auto}] [--delimiter [DELIMITER]]
[--env-help] [--brief | --verbose] [-a] [-l] [--obj] [--folders]
[--full]
[path]
List folders and/or objects in a folder
positional arguments:
path Folder (possibly in another project) to list the
contents of, default is the current directory in the
current project. Syntax: projectID:/folder/path

usage: dx cp [-h] [--env-help] [-a] source [source ...] destination
Copy objects and/or folders between different projects. Folders will
automatically be copied recursively. To specify which project to use as a
source or destination, prepend the path or ID of the object/folder with the
project ID or name and a colon.
EXAMPLES
The first example copies a file in a project called "FirstProj" to the
current directory of the current project. The second example copies the
object named "reads.fq.gz" in the current directory to the folder
/folder/path in the project with ID "project-B0VK6F6gpqG6z7JGkbqQ000Q",
and finally renaming it to "newname.fq.gz".
$ dx cp FirstProj:file-B0XBQFygpqGK8ZPjbk0Q000q .
$ dx cp reads.fq.gz project-B0VK6F6gpqG6z7JGkbqQ000Q:/folder/path/newname.fq.>
positional arguments:
source Objects and/or folder names to copy
destination Folder into which to copy the sources or new pathname (if only
one source is provided). Must be in a different
project/container than all source paths.
options:
-h, --help show this help message and exit
--env-help Display help message for overriding environment
variables
-a, --all Apply to all results with the same name without
prompting

$ dx cp project-BQpp3Y804Y0xbyG4GJPQ01xv:file-GFz5xf00Bqx2j79G4q4F5jXV /

$ dx ls
hs38DH.dict$ dx ls -l
Project: demo_project (project-GXZ90x00fF6F4fy1K20x4gv9)
Folder : /
State Last modified Size Name (ID)
closed 2023-07-07 16:11:56 334.68 KB hs38DH.dict (file-GFz5xf00Bqx2j79G4q4F5jXV)$ dx ls
data/
hs38DH.dict

$ dx mv file-GFz5xf00Bqx2j79G4q4F5jXV data
$ dx mv hs38DH.dict data

$ dx ls
data/$ dx ls data
hs38DH.dict$ dx pwd
demo_project:/data$ dx ls -l
Project: demo_project (project-GXZ90x00fF6F4fy1K20x4gv9)
Folder : /data
State Last modified Size Name (ID)
closed 2023-07-07 16:11:56 334.68 KB hs38DH.dict (file-GFz5xf00Bqx2j79G4q4F5jXV)$ dx tree -h
usage: dx tree [-h] [--color {off,on,auto}] [--env-help] [-a] [-l] [path]
List folders and objects in a tree
positional arguments:
path Folder (possibly in another project) to list the
contents of, default is the current directory in the
current project. Syntax: projectID:/folder/path
options:
-h, --help show this help message and exit
--color {off,on,auto}
Set when color is used (color=auto is used when stdout
is a TTY)
--env-help Display help message for overriding environment
variables
-a, --all show hidden files
-l, --long use a long listing format$ dx tree
.
└─ data
└─ hs38DH.dict$ dx tree -l
.
└─ data
└─ closed 2023-07-07 16:11:56 334.68 KB hs38DH.dict
(file-GFz5xf00Bqx2j79G4q4F5jXV)

$ echo hello > hello.txt

$ dx upload -h
usage: dx upload [-h] [--visibility {hidden,visible}] [--property KEY=VALUE]
[--type TYPE] [--tag TAG] [--details DETAILS] [-p]
[--brief | --verbose] [--env-help] [--path [PATH]] [-r]
[--wait] [--no-progress] [--buffer-size WRITE_BUFFER_SIZE]
[--singlethread]
filename [filename ...]
Upload local file(s) or directory. If "-" is provided, stdin will be used
instead. By default, the filename will be used as its new name. If
--path/--destination is provided with a path ending in a slash, the filename
will be used, and the folder path will be used as a destination. If it does not
end in a slash, then it will be used as the final name.
positional arguments:
filename Local file or directory to upload ("-" indicates stdin
input); provide multiple times to upload multiple files
or directories$ dx ls
data/
hello.txt$ dx cat -h
usage: dx cat [-h] [--env-help] [--unicode] path [path ...]
positional arguments:
path File ID or name(s) to print to stdout
options:
-h, --help show this help message and exit
--env-help Display help message for overriding environment variables
--unicode Display the characters as text/unicode when writing to stdout$ dx cat hello.txt
hello

$ dx upload hello.txt --path data

$ dx ls
data/
data
hello.txt

$ dx cat data
hello

If --path/--destination is provided with a path ending in a slash, the
filename will be used, and the folder path will be used as a destination.
If it does not end in a slash, then it will be used as the final name.

$ dx ls -l
Project: demo_project (project-GXZ90x00fF6F4fy1K20x4gv9)
Folder : /
data/
State Last modified Size Name (ID)
closed 2023-07-07 16:34:31 6 bytes data (file-GXZB2180fF65j2G1197pP7By)
closed 2023-07-07 16:34:10 6 bytes hello.txt (file-GXZB1v80fF6BXJ8p7PvZPy1v)$ dx ls -l
Project: demo_project (project-GXZ90x00fF6F4fy1K20x4gv9)
Folder : /
data/
State Last modified Size Name (ID)
closed 2023-07-07 17:01:20 6 bytes hello.txt (file-GXZBKYQ0fF6Pf2ZKPBF7G7j9)
closed 2023-07-07 16:34:10 6 bytes hello.txt (file-GXZB1v80fF6BXJ8p7PvZPy1v)$ dx rm hello.txt
The given path "hello.txt" resolves to the following data objects:
0) closed 2023-07-07 17:01:20 6 bytes hello.txt (file-GXZBKYQ0fF6Pf2ZKPBF7G7j9)
1) closed 2023-07-07 16:34:10 6 bytes hello.txt (file-GXZB1v80fF6BXJ8p7PvZPy1v)
Pick a numbered choice or "*" for all: 0$ dx head -h
usage: dx head [-h] [--color {off,on,auto}] [--env-help] [-n N] path
Print the first part of a file. By default, prints the first 10 lines.
positional arguments:
path File ID or name to access
options:
-h, --help show this help message and exit
--color {off,on,auto}
Set when color is used (color=auto is used when stdout
is a TTY)
--env-help Display help message for overriding environment
variables
-n N, --lines N Print the first N lines (default 10)$ dx head data/hs38DH.dict
@HD VN:1.6
@SQ SN:chr1 LN:248956422 M5:6aef897c3d6ff0c78aff06ac189178dd UR:file:/home/hs38DH.fa.gz
@SQ SN:chr2 LN:242193529 M5:f98db672eb0993dcfdabafe2a882905c UR:file:/home/hs38DH.fa.gz
@SQ SN:chr3 LN:198295559 M5:76635a41ea913a405ded820447d067b0 UR:file:/home/hs38DH.fa.gz
@SQ SN:chr4 LN:190214555 M5:3210fecf1eb92d5489da4346b3fddc6e UR:file:/home/hs38DH.fa.gz
@SQ SN:chr5 LN:181538259 M5:a811b3dc9fe66af729dc0dddf7fa4f13 UR:file:/home/hs38DH.fa.gz
@SQ SN:chr6 LN:170805979 M5:5691468a67c7e7a7b5f2a3a683792c29 UR:file:/home/hs38DH.fa.gz
@SQ SN:chr7 LN:159345973 M5:cc044cc2256a1141212660fb07b6171e UR:file:/home/hs38DH.fa.gz
@SQ SN:chr8 LN:145138636 M5:c67955b5f7815a9a1edfaa15893d3616 UR:file:/home/hs38DH.fa.gz
@SQ SN:chr9 LN:138394717 M5:6c198acf68b5af7b9d676dfdd531b5de UR:file:/home/hs38DH.fa.gz$ dx download file-GFz5xf00Bqx2j79G4q4F5jXV
[===========================================================>]
Downloaded 342,714
[===========================================================>]
Completed 342,714 of 342,714 bytes (100%) /Users/[email protected]/work/academy/hs38DH.dict$ dx describe -h
usage: dx describe [-h] [--json] [--color {off,on,auto}]
[--delimiter [DELIMITER]] [--env-help] [--details]
[--verbose] [--name] [--multi]
path
Describe a DNAnexus entity. Use this command to describe data objects by name
or ID, jobs, apps, users, organizations, etc. If using the "--json" flag, it
will thrown an error if more than one match is found (but if you would like a
JSON array of the describe hashes of all matches, then provide the "--multi"
flag). Otherwise, it will always display all results it finds.
NOTES:
- The project found in the path is used as a HINT when you are using an object ID;
you may still get a result if you have access to a copy of the object in some
other project, but if it exists in the specified project, its description will
be returned.
- When describing apps or applets, options marked as advanced inputs will be
hidden unless --verbose is provided
positional arguments:
path Object ID or path to an object (possibly in another
project) to describe.
options:
-h, --help show this help message and exit
--json Display return value in JSON
--color {off,on,auto}
Set when color is used (color=auto is used when stdout
is a TTY)
--delimiter [DELIMITER], --delim [DELIMITER]
Always use exactly one of DELIMITER to separate fields
to be printed; if no delimiter is provided with this
flag, TAB will be used
--env-help Display help message for overriding environment
variables
--details Include details of data objects
--verbose Include additional metadata
--name Only print the matching names, one per line
--multi If the flag --json is also provided, then returns a JSON
array of describe hashes of all matching results$ dx describe file-GXZB1v80fF6BXJ8p7PvZPy1v
Result 1:
ID file-GXZB1v80fF6BXJ8p7PvZPy1v
Class file
Project project-GXZ90x00fF6F4fy1K20x4gv9
Folder /
Name hello.txt
State closed
Visibility visible
Types -
Properties -
Tags -
Outgoing links -
Created Fri Jul 7 16:34:09 2023
Created by kyclark
Last modified Fri Jul 7 16:34:10 2023
Media type text/plain
archivalState "live"
Size 6 bytes
cloudAccount "cloudaccount-dnanexus"$ dx describe file-GXZB1v80fF6BXJ8p7PvZPy1v --delim ,
Result 1:
ID,file-GXZB1v80fF6BXJ8p7PvZPy1v
Class,file
Project,project-GXZ90x00fF6F4fy1K20x4gv9
Folder,/
Name,hello.txt
State,closed
Visibility,visible
Types,-
Properties,-
Tags,-
Outgoing links,-
Created,Fri Jul 7 16:34:09 2023
Created by,kyclark
Last modified,Fri Jul 7 16:34:10 2023
Media type,text/plain
archivalState,"live"
Size,6 bytes
cloudAccount,"cloudaccount-dnanexus"$ dx describe file-GXZB1v80fF6BXJ8p7PvZPy1v --json
{
"id": "file-GXZB1v80fF6BXJ8p7PvZPy1v",
"project": "project-GXZ90x00fF6F4fy1K20x4gv9",
"class": "file",
"sponsored": false,
"name": "hello.txt",
"types": [],
"state": "closed",
"hidden": false,
"links": [],
"folder": "/",
"tags": [],
"created": 1688772849000,
"modified": 1688772850572,
"createdBy": {
"user": "user-kyclark"
},
"properties": {},
"details": {},
"media": "text/plain",
"archivalState": "live",
"size": 6,
"cloudAccount": "cloudaccount-dnanexus"
}$ dx describe project-GXZ90x00fF6F4fy1K20x4gv9 | head
Result 1:
ID project-GXZ90x00fF6F4fy1K20x4gv9
Class project
Name demo_project
Summary
Billed to org-sos
Access level ADMINISTER
Region aws:us-east-1
Protected false
Restricted false$ dx mv -h
usage: dx mv [-h] [--env-help] [-a] source [source ...] destination
Move or rename data objects and/or folders inside a single project. To copy
data between different projects, use 'dx cp' instead.
positional arguments:
source Objects and/or folder names to move
destination Folder into which to move the sources or new pathname (if only
one source is provided). Must be in the same project/container
as all source paths.
options:
-h, --help show this help message and exit
--env-help Display help message for overriding environment
variables
-a, --all Apply to all results with the same name without
prompting$ dx ls -l
Project: demo_project (project-GXZ90x00fF6F4fy1K20x4gv9)
Folder : /
data/
State Last modified Size Name (ID)
closed 2023-07-10 10:11:31 6 bytes goodbye.txt (file-GXZB1v80fF6BXJ8p7PvZPy1v)$ dx mv file-GXZB1v80fF6BXJ8p7PvZPy1v data/hello.txt
$ dx tree -l
.
└── data
├── closed 2023-07-10 10:13:31 6 bytes hello.txt (file-GXZB1v80fF6BXJ8p7PvZPy1v)
└── closed 2023-07-07 16:11:56 334.68 KB hs38DH.dict (file-GFz5xf00Bqx2j79G4q4F5jXV)$ dx cp hello.txt data/hello_copy.txt
dxpy.exceptions.DXCLIError: A source path and the destination path resolved
to the same project or container. Please specify different source and
destination containers, e.g.
dx cp source-project:source-id-or-path dest-project:dest-path

usage: dx find data [-h] [--brief | --verbose] [--json]
[--color {off,on,auto}] [--delimiter [DELIMITER]]
[--env-help] [--property KEY[=VALUE]] [--tag TAG]
[--class {record,file,applet,workflow,database}]
[--state {open,closing,closed,any}]
[--visibility {hidden,visible,either}] [--name NAME]
[--type TYPE] [--link LINK] [--all-projects]
[--path PROJECT:FOLDER] [--norecurse]
[--created-after CREATED_AFTER]
[--created-before CREATED_BEFORE] [--mod-after MOD_AFTER]
[--mod-before MOD_BEFORE] [--region REGION]
Finds data objects subject to the given search parameters. By default,
restricts the search to the current project if set. To search over all
projects (excluding public projects), use --all-projects (overrides --path and
--norecurse).$ dx find data
closed 2023-07-10 10:13:31 6 bytes /data/hello.txt (file-GXZB1v80fF6BXJ8p7PvZPy1v)
closed 2023-07-07 16:11:56 334.68 KB /data/hs38DH.dict (file-GFz5xf00Bqx2j79G4q4F5jXV)$ dx find data --name hs38DH.dict
closed 2023-07-07 16:11:56 334.68 KB /data/hs38DH.dict (file-GFz5xf00Bqx2j79G4q4F5jXV)$ dx find data --name "h*"
closed 2023-07-10 10:13:31 6 bytes /data/hello.txt (file-GXZB1v80fF6BXJ8p7PvZPy1v)
closed 2023-07-07 16:11:56 334.68 KB /data/hs38DH.dict (file-GFz5xf00Bqx2j79G4q4F5jXV)$ dx find data --name \*.dict
closed 2023-07-07 16:11:56 334.68 KB /data/hs38DH.dict (file-GFz5xf00Bqx2j79G4q4F5jXV)$ dx find data --name \*.dict --brief
project-GXZ90x00fF6F4fy1K20x4gv9:file-GFz5xf00Bqx2j79G4q4F5jXV$ dx download $(dx find data --name \*.dict --brief)
[=======================>] Completed 342,714 of 342,714 bytes (100%)
/Users/[email protected]/work/academy/hs38DH.dict$ dx find data --name \*.dict --json
[
{
"project": "project-GXZ90x00fF6F4fy1K20x4gv9",
"id": "file-GFz5xf00Bqx2j79G4q4F5jXV",
"describe": {
"id": "file-GFz5xf00Bqx2j79G4q4F5jXV",
"project": "project-GXZ90x00fF6F4fy1K20x4gv9",
"class": "file",
"name": "hs38DH.dict",
"state": "closed",
"folder": "/data",
"modified": 1688771516882,
"size": 342714
}
}
]$ dx find apps --name "sra*"
x SRA FASTQ Importer (sra_fastq_importer), v4.0.0$ dx run sra_fastq_importer -h
usage: dx run sra_fastq_importer [-iINPUT_NAME=VALUE ...]
App: SRA FASTQ Importer
Version: 4.0.0 (published)
Download SE or PE reads in FASTQ or FASTA format from SRA using SRR accessions
See the app page for more information:
https://platform.dnanexus.com/app/sra_fastq_importer
Inputs:
dbGaP Repository key: [-ingc_key=(file)]
(Optional) Security token required for configuring NCBI SRA toolkit and decryption tools.
SRR Accession: -iaccession=(string)
Single SRR accession to fetch.$ dx run sra_fastq_importer -iaccession=SRR070372
Using input JSON:
{
"accession": "SRR070372"
}
Confirm running the executable with this input [Y/n]: y
Calling app-G49BFZ093qKvjFYgF8fyv6Z7 with output destination project-GXY0PK0071xJpG156BFyXpJF:/
Job ID: job-GXf8Qg8071xBJJg417YVYJX3
Watch launched job now? [Y/n] y

* SRA FASTQ Importer (sra_fastq_importer:main) (done)
job-GXf8Qg8071xBJJg417YVYJX3
kyclark 2023-07-10 15:38:21 (runtime 0:02:36)
Output: single_reads_fastq = [ file-GXf8VgQ09bzK5q1XV5z1gx7j ]$ dx ls -l file-GXf8VgQ09bzK5q1XV5z1gx7j
closed 2023-07-10 15:41:38 206.59 MB SRR070372.fastq.gz (file-GXf8VgQ09bzK5q1XV5z1gx7j)$ dx find apps --name fastqc
x FastQC Reads Quality Control (fastqc), v3.0.3

usage: dx run fastqc [-iINPUT_NAME=VALUE ...]
App: FastQC Reads Quality Control
Version: 3.0.3 (published)
Generates a QC report on reads data
See the app page for more information:
https://platform.dnanexus.com/app/fastqc
Inputs:
Reads: -ireads=(file)
A file containing the reads to be checked. Accepted formats are
gzipped-FASTQ and BAM.$ dx run fastqc -ireads=file-GXf8P880FjgZGJQqx8Bf30YK -y --watch
Using input JSON:
{
"reads": {
"$dnanexus_link": "file-GXf8P880FjgZGJQqx8Bf30YK"
}
}
Calling app-G81jg5j9jP7qxb310vg2xQkX with output destination project-GXY0PK0071xJpG156BFyXpJF:/
Job ID: job-GXf8fJQ071x00P5bQzQ62gjY$ cat input.json
{
"reads": {
"$dnanexus_link": "file-GXf8P880FjgZGJQqx8Bf30YK"
}
}$ dx run fastqc -f input.json -y --brief
job-GXf930j071xJfYqfJ2kkvk8v

* FastQC Reads Quality Control (fastqc:main) (done) job-GXf8fgj071x3KV4qyyKGZQVY
kyclark 2023-07-10 15:51:11 (runtime 0:02:01)
Output: report_html = file-GXf8gbQ06GxZ38zFXB46XYYj
stats_txt = file-GXf8gbj06Gxy9F8P66pJG7J3$ dx head file-GXf8gbj06Gxy9F8P66pJG7J3
##FastQC 0.11.9
>>Basic Statistics pass
#Measure Value
Filename SRR070372.fastq.gz
File type Conventional base calls
Encoding Sanger / Illumina 1.9
Total Sequences 498843
Sequences flagged as poor quality 0
Sequence length 48-2044
%GC 39