Academy Documentation

Usage of Academy Documentation

Welcome to DNAnexus Academy's online guidebook! This resource is designed for educational purposes to provide you with a foundational understanding of how to utilize DNAnexus for performing analyses. Please note that this guide does not aim to instruct you on every aspect of using the platform, nor does it suggest that this is the only method for leveraging DNAnexus solutions. Instead, it serves as an instructional tool with examples designed to help you begin your journey.

Included in this documentation are guides to assist with your projects, including videos and explanations of the terms and concepts that we think are important for your understanding. There are also walk-through examples to get you comfortable on the platform.

As-Is Software Disclaimer: The content in this repository is delivered “As-Is”. Notwithstanding anything to the contrary, DNAnexus will have no warranty, support, liability, or other obligations with respect to materials provided hereunder.

Getting Started

For Scientists

If you are new to the DNAnexus platform and computational biology/ bioinformatics, these sections are recommended for you:

  • Background Information

  • General Information

  • Cloud Computing for Scientists

  • Overview of the Platform

  • For Titan Users

  • For Apollo Users

Cloud Computing

Background Information

Welcome to DNAnexus!

Before you go through the information here, there is some background information that we think will be useful for you to have.

Some users of the platform have limited coding experience. As bioinformaticians and computational biologists, we are members of a community that wants to help alleviate that stress. On this page, we have attached some helpful links and tutorials that will hopefully make the world of computational biology a bit less intimidating. This is not a partnership or affiliation, but rather a list of what we found useful when we were learning ourselves.

Additionally, users may need resources on the different types of sequencing and their impacts, and we have included some here for the ever-evolving field of genetics/ genomics. Again, these do not endorse any particular company, lab, or resource, but instead serve as a general guide to help fill in the gaps.

Programming Languages

Bash

  • Ten Simple Rules for Getting Started with Command-Line Bioinformatics

  • The Unix Shell

  • Unix Command Cheatsheet

  • Bash for Bioinformatics

Python

  • Programming with Python

  • Plotting and Programming in Python

  • Learn Python

R

  • Ten Simple Rules for Teaching Yourself R

  • Programming with R

  • R for Reproducible Scientific Analysis

  • Swirl for R

Reproducible Research

  • The Five Pillars of Computational Reproducibility: Bioinformatics and Beyond

  • Version Control with Git

Biology Concepts

  • Cancer Biology Medicine: Next Generation Sequencing and its Clinical Application

  • A Beginner's Guide to Analysis of RNA Sequencing Data

Cohort Browser

Building Applets

Billing Access and Orgs

General Information

Instance Type Overview

Naming

AWS naming of instance types is broken down here:
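As a rough sketch of the convention, using mem2_ssd1_v2_x16 (one of the examples used later on this page) and assuming the standard DNAnexus naming for AWS instance types:

    # mem2 _ ssd1 _ v2 _ x16
    # mem2 = memory class (how much memory per core)
    # ssd1 = storage class (type/amount of local SSD disk per core)
    # v2   = version of the instance type
    # x16  = number of cores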

Germline Data Exploration

Please note, the data present in this page is synthetic data, and is intended for training purposes only. Information about the data present in this documentation is listed here.

When germline variant data is present in your data ingestion for the cohort, the Germline Variants tab will appear in the Cohort Browser. The goal of viewing data within the Germline Variants tab is to view germline mutations in genes or genomic regions of interest.

Features of the Germline Variants Tab

Phenotypic Filtering

To filter with phenotypic data, you can filter from the tiles that you added in the “Overview” tab, or through the “+ Add Filter” button in the Cohort Banner. These filters allow for assessing the impact of phenotypic/ clinical data and the creation of cohorts.

How to Filter a Dataset with Phenotypic Filters

1. In the Cohort section, select the “+ Add Filter” button

2. Search or select your characteristic. Ex: Diagnoses, Tumor Details > Tumor Disease Anatomic Site

For Titan Users

If you are a Titan user, these sections are recommended for you:

Any background information that could be necessary is listed in the For HPC or For Scientists pages to get you started there as well.

Overview of the Platform
Billing Access and Orgs
Command Line Interface (CLI)
Choosing an Instance Type
Question → Focus on this

  • Does the software utilize multiple cores? → mem2_ssd1_v2_x16 (the core count, x16)

  • Is the software GPU optimized? → mem2_ssd1_gpu_x32 (the gpu designation)

  • How much memory does the software use (per core)? → mem2_ssd1_v2_x16 (the memory class, mem2)

  • How much disk space is needed for the software (per core)? → mem2_ssd1_v2_x16 (the storage class, ssd1)

  • Always use version 2 of an instance type! → mem2_ssd1_v2_x16 (the version, v2)

Instance Classes and Cores

Each class (like mem1) is scaled so that each core in an instance has access to the same amount of memory/disk space:

  • Example: mem1_ssd1_v2_x2:

    • 4 GB total memory / 2 cores = 2 GB per core

  • Example: mem1_ssd1_v2_x8:

    • 16 GB total memory / 8 cores = 2 GB per core

Choosing a Good Instance Type

  • Scale the instance type according to usage statistics and dataset size (see the example below):

    • If a job does not utilize all of its resources, use a smaller instance type.

    • If a job runs out of memory, or is slow, consider using a larger instance type.
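As a minimal sketch of overriding the default from the CLI (my-applet is a hypothetical applet name; --instance-type is the dx run flag for requesting a specific instance type):

    # request a larger instance type for this run of a (hypothetical) applet
    dx run my-applet --instance-type mem2_ssd1_v2_x16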

Multistep Workflows

  • Each stage of a workflow is run by a different set of workers

  • Each stage can be customized in terms of instance type

Resources

Instance Types Documentation

Full Documentation

To create a support ticket if there are technical issues:

  1. Go to the Help header (same section where Projects and Tools are) inside the platform

  2. Select "Contact Support"

  3. Fill in the Subject and Message to submit a support ticket.

For HPC Users

If you are an HPC user new to the DNAnexus platform, these sections are recommended for you:

  • Background Information

  • General Information

  • For HPC Users

  • Overview of the Platform

  • Command Line Interface (CLI)

  • JSON

  • For Titan Users

  • For Apollo Users

For Experienced Users

If you are an experienced user new to the DNAnexus platform, these sections are recommended for you:

  • For Titan Users

  • For Apollo Users

  • JSON

  • Docker

JSON

For Apollo Users

If you are an Apollo user, these sections are recommended for you:

  • Overview of the Platform

  • Billing Access and Orgs

  • Command Line Interface (CLI)

  • Cohort Browser

  • JupyterLab

Any background information that could be necessary is listed in the For HPC or For Scientists pages to get you started there as well.

WDL

In this section, you will build the same applet examples from bash and Python as tasks, and then graduate to building workflows by chaining tasks together.

Interactive Cloud Computing

Command Line Interface (CLI)

Data Profiler

Building Workflows

Workflows are a set of 2 or more apps that are linked together by dependencies, i.e., the output of one app/ applet is the input to another app/ applet. A workflow allows these apps to be run after their dependencies are met without having to submit another job (unless there is an error).

We support the following options for building workflows:

  • Native (GUI)

  • WDL

  • Nextflow

In order to kill a job/ workflow/ app/ applet, you will need to terminate the job/ analysis. Please use dx terminate or terminate it in the Monitor tab in the UI.
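For example, from the CLI (with a placeholder job ID):

    # terminate a running job or analysis by its ID
    dx terminate job-xxxx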

AI/ ML Accelerator

Docker

Overview of the Platform

Python

All the same examples from bash now in Python.

How to Create Cohorts

Portals

Overview of the Germline Variants Tab

Within the Germline Variants Tab, there are the following sections: the search bar for genes by gene symbol and genomic ranges, the Allele Frequency Lollipop Plot, and the Allele Table with the germline mutations that are present in the lollipop plot above. The tables and figures of the Germline Variants Tab are highlighted in the figure below:

Lollipop Allele Plot

The first figure that is shown on the tab is the lollipop plot. The x axis is the position of the mutation, and the y axis is the allele frequency. You can search for the genomic range by Gene Symbol, Genomic Range, or rsID. The lollipop plot and allele table will be updated once you search for the new genomic range.

Allele Table

The second figure that is shown on the tab is the Allele Table. The columns available are the location (defined by chromosome and position), rsID, Reference and Alternate nucleotide, Type of Mutation, Consequence, Cohort AF (Allele Frequency), Population AF, and GnomAD AF. You can search for the genomic range by Gene Symbol, Genomic Range, or rsID. The lollipop plot (described above) and allele table will be updated once you search for the new genomic range.

3. Click Add Cohort Filter

4. Make sure "Is Any of" is selected, then click on the empty field

5. Select details for the characteristic. Ex: selecting Ovary

6. Your cohort panel will then look like this:

7. Repeat these steps as needed to create your cohort

And/ Or Functionality

In this example, we are going to create 2 different filters. One will be a filter where the tumor disease anatomic site is the ovary, and another where the site is the breast.

If in the cohort filter we select the tumor disease anatomic site “is ovary” AND tumor disease anatomic site “is breast”, then we have zero patients.

This is seen in the figure below:

Instead, we would need to change this to “OR” by pressing the “AND With” portion of the filter.

Now, we have a filter that has the tumor disease anatomic site as the ovary or the breast.

Supplemental Video for the Overview Tab + Filtering

Phenotypic Data Exploration

Please note, the data present in this page is intended for training purposes only. Information about the data present in this documentation is listed here.

The Overview tab is dedicated to the phenotypic data that has been ingested in your dataset. The phenotypic data can be displayed using tiles, and these tiles will have different tables or figures based on their data type.

Adding Tiles

Simple Tiles

  1. Open Cohort Browser

  2. Select "+ Add Tile" on the top right corner

  3. Find the characteristic you want as a tile and select "Add Tile"

  4. Repeat until you have added all of the tiles that you want (up to 15)

2D Plots

  • Used for more advanced comparisons

  • Add comparisons by selecting the first filter, then selecting the "+" sign for a secondary field

  • Then, edit the data field details

Here is the overview of the 2D plots that are available based on data types:

Steps to Create 2D plots

  1. Open Cohort Browser

  2. Select Add Tile on the top right corner

  3. Find the characteristic you want to start with and select it, such as biological sex. This is the same step as adding a regular tile, but you will NOT select Add Tile.

  4. Instead, add a secondary field by selecting the "+" sign next to the second characteristic you want to view.

  5. You will then have options to change the graph with those parameters.

  6. Then, select the Add Tile button on the bottom right below the new graph. This will add it to the Cohort Browser.

Limits on the Cohort Browser

  • Limited to 15 tiles overall in dashboard

  • Limited to 30 columns in Data Preview

  • Add 1-2 tiles at a time, wait for them to refresh before adding more tiles.

Billing and Pricing

Billing

Definition

Billing occurs monthly based on your use of the platform. These invoices are received at the end of the month.

The relationship between DNAnexus and billing is highlighted here:

Billing and Charges

What are the charges?

  • Regions and Pricing can be referred to as the "Rate Card"

  • These are negotiated at the time of signing

  • This is the area of expertise of the DNAnexus Sales Account Director. For further details on this, please refer to them.

  • The rate card can also be useful for everyone else when deciding which instances to run on the platform.

Errors and Billing

  • Job Errors happen

    • Some of which are charged to you

    • Some of which are not

  • Error details are found in our documentation

Organizations and Billing

Example: Orgs and Billing

  • Orgs can be used to consolidate and simplify billing.

  • An org can be associated with a billing account. This allows all users of the org to bill projects and apps to the org billing account.

  • Having a project bill to an org is useful if, for example, you have users within a group or within a particular lab who are working with a shared budget, where each member needs the ability to work independently within their own project.

  • By associating a billing account with an org, this allows groups with a shared budget to consolidate all platform activities onto one invoice.

Resources

To create a support ticket if there are technical issues:

  1. Go to the Help header (same section where Projects and Tools are) inside the platform

  2. Select "Contact Support"

  3. Fill in the Subject and Message to submit a support ticket.

Cloud Computing for Scientists

Basic Concept and Terminology

Key Players in Understanding Cloud Computing

  • Your Computer: When we utilize cloud resources, we as users request them from our own computer using commands from the dx toolkit.

  • DNAnexus platform: The platform has many working pieces, but we can treat it as one entity here. Our request gets sent to the platform, and given availability, it will grant access to a temporary DNAnexus Worker.

  • DNAnexus Worker: This temporary worker is the third key player and is where we do our computation. We'll see that it starts out as a blank slate.

Specific Terms Outside of Key Players

  • A project contains files, executables, and logs associated with analyses, securely stored on the platform

  • The executables on the platform are referred to as apps. Apps are executables that can be run on the DNAnexus platform. Most importantly, they need to contain a software environment to run the executable.

  • A software environment in general is everything needed to run software on a brand new computer. This includes the software itself that you are needing as well as any dependencies that are needed to run the software. Some examples of dependencies are languages (such as R) that are needed to execute the software.

Project Storage vs Workers

Project storage is permanent, but the workers are temporary. This means that you have to relay information back and forth as shown in the figure below.

The key concept with cloud computing: project storage can be considered as permanent on the platform. Note that workers are temporary. Because workers are temporary, we need to transfer the files we want to process to them. When we are done, we need to transfer any output files back to the project storage. If we don't do this, the files will be lost when we lose access to the worker.
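As a minimal sketch of this back-and-forth, run from a worker (the file ID and output name are placeholders), using the standard dx download and dx upload commands:

    # bring an input file from project storage onto the worker
    dx download file-xxxx
    # ...run the computation on the worker...
    # send any generated outputs back to project storage before the worker goes away
    dx upload results.txt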

Local vs Cloud Analysis

Local Machines

  • On your local computer, everything is on your machine.

    • This includes your data and scripts, as well as your software environment and dependencies, which are all downloaded locally.

    • The results and intermediate steps are also generated and saved on your machine.

  • You own it and you control it.

Cloud Computing

  • In comparison, cloud computing adds layers into analysis to increase computational power and storage.

  • This relationship and the layers involved are in the figure below:

  • Let's contrast this with the process of processing a file on the DNAnexus platform.

Key Differences

  • The first difference is that we need to request a worker and we only have temporary access to it. We need to bring everything to the worker, including the software environment.

  • The second key difference is that we need to bring our files and scripts from project storage to the worker.

Common Challenges with Cloud Computing

Challenge 1: Requesting Enough Resources

  • Our first barrier is requesting an appropriate worker that can do our computational job.

  • For example, our app may require more memory, or if it is optimized for working on multiple CPUs, more CPUs.

  • We need to understand how big our files are and the computing requirements of our software to do this.

Challenge 2: Installing Dependencies

  • Our second barrier is installing the software environment on the worker, such as R.

  • Because we are starting from scratch on a worker, we will need ways to reproducibly install the software environment on the worker.

  • We'll see that this is one of the roles of Apps. As part of their job, they will install the appropriate software environment.

Resolution for Challenge 1 and 2:

  • There is some good news. If we are running apps, they will handle both of these barriers.

  • Number one, all apps have a default instance type to use. We'll see that we can tailor this.

  • Secondly, Apps install the required software environment on their workers.

Challenge 3: Transferring Files

  • Our third barrier is getting our files onto the worker from project storage, and then doing computations with them on the worker. The last barrier we'll talk about is getting the file outputs we've generated from the worker back into the project storage.

  • Cloud computing has a nestedness to it and transferring files back and forth can make learning it difficult.

  • Having a mental model of how cloud computing works can help us overcome these barriers.

Resolution for Challenge 3:

  • Cloud computing is indirect, and you need to think 2 steps ahead.

  • Here is the visual for thinking about the steps for file management:

Solution for Challenges: Apps

  • Apps help you address installing software on the worker

  • Prebuilt software environment that is installed onto the temporary worker

  • Can build our own apps

  • Apps serve to (at minimum):

Resources

To create a support ticket if there are technical issues:

  1. Go to the Help header (same section where Projects and Tools are) inside the platform

  2. Select "Contact Support"

  3. Fill in the Subject and Message to submit a support ticket.

Gene Expression Filtering

To filter with gene expression data, you can add a filter based on the tiles created in the Gene Expression tab or use the “+ Add Filter” button in the Cohort Banner.

These filters allow for:

  • Assessing impact of genes/ features and their expression levels

  • Building Cohorts based on Gene Expression Level

You Can Filter By:

  • Gene Symbol or Ensembl ID with Expression Level

Adding Gene Expression Filters

  1. Add in your dataset

  2. Select "+ Add Filter"

  3. Select Assays and then under Gene Expression, select “Features/ Expression”

  4. Select the genes that you want as well as the expression range. Please note, for the Gene/ Feature value, you can select by Gene Symbol or the Ensembl ID.

Supplemental Video for Gene Expression Tab and Filtering

Overview of the Cohort Browser

Please note: in order to use Cohort Browser on the Platform, an Apollo License is needed.

A Tour of the Cohort Browser

Purpose of the Cohort Browser

The Cohort Browser is used for browsing and visualizing data and creating cohorts. These cohorts can then be shared in a project space with your collaborators.

Metadata

Metadata keeps your data objects and projects organized. All objects that are uploaded or created will have associated metadata with fields such as Name, ID, Path, Status, Class, File Size, Created by, Created, and Modified.

Data Objects

Object Classes include:

  • Data files

Setting Up a Project

Projects have a series of features designed to facilitate collaboration, help project members coordinate and organize their work, and ensure appropriate control over both data and tools.

Creating a Project

All work takes place in the context of a project. Projects allow a defined set of users and orgs to:

  • Access specific data

Native Workflows

What is a Workflow?

  • The individual apps can be easily combined into pipelines, which are referred to as workflows on the DNAnexus platform.

  • These apps are linked together by dependencies and can hand off their outputs to other apps as they complete.

Somatic Data Exploration

Please note, the data present is intended for training purposes only. Information about the data present in this documentation is listed here.

When somatic variant data is present in your data ingestion for the cohort, the Somatic Variants tab will appear in the Cohort Browser. The goal of viewing data within the Somatic Variants tab is to view somatic mutations present in your data, and to explore variants and events for certain genomic regions. You can also compare these values within 2 different cohorts, as long as they have the same underlying database.

Features of the Somatic Variants Tab

Nextflow Setup

In order for Nextflow to run correctly on the platform, please do the following:

  1. Install dxpy/ dx-toolkit. Details on how to do this are in the Command Line Interface Section under Introduction to the CLI.

    1. As Nextflow on DNAnexus is being updated with bugfixes and improvements on a regular basis, we recommend updating dxpy to the latest version prior to building your Nextflow applet.

    2. You can upgrade dxpy by using the following command:

Somatic Data Filtering

To filter with somatic data, use the “+ Add Filter” button in the Cohort Banner.

These filters allow for:

  • Assessing impact of ingested somatic variants in cohorts

Utilizing a Snapshot

Since the JupyterLab jobs are hosted on a temporary worker, you will either need to download the software packages every time you start a job, or to save a snapshot of the software environment.

What is a Snapshot?

  • A snapshot saves the current software environment in the JupyterLab environment.

Tips and Tricks for JupyterLab

DX Notebooks Naming

All notebooks saved onto the platform will have a DX prefix in front of them. Here is an example:

Updating Your Portal

To upload your files, you will need to do the following:

  • Create a folder with the org name for the portal. It will be org-NAME OF COMMUNITY

  • Make sure all of your json files are in the folder

  • Make sure all of your assets/ images are in the folder.

Bash

In this section, we will build several native bash applets that will increase in complexity:

  1. An applet that takes an input file, runs a single Unix command, and returns the result as a file.

  2. An applet that includes a binary executable file in the resources directory.

  3. An applet that installs the dependency cnvkit at runtime and then as an asset.

  4. An applet that runs samtools.

  • In order to kill a job/ workflow/ app/applet you will need to terminate the job/ analysis. Please use dx terminate or terminate in the Monitor tab in the UI.

    This is great, but limited by how much storage and computational power that you have on your local machine.

  • This is highlighted in the figure below:

  • We'll start with our computer, the DNAnexus platform, and a file from project storage.
  • We first start out by using the dx run command, requesting to run an app on a file in project storage. This request is then sent to the platform, and an appropriate worker from the pool of workers is made available.

  • When the worker is available, we can transfer a file from the project to the worker.

  • The platform handles installing the app and its software environment to the worker as well.

  • Once our app is ready and our file is set, we can run the computation on the worker.

  • Any files that we generate must be transferred back into project storage.

  • Request a worker (Challenge 1)

  • Configure the worker's environment (Challenge 2)

  • Establish data transfer (Challenge 3)

  • Running apps are covered throughout the rest of the documentation.

  • Full Documentation

    Nextflow

    JupyterLab

    has to be .json, .png, or .jpg

    Then,

    • Ensure that you have md5 and jq downloaded

    • Ensure that you have the manage_community_assets.sh script (this is already provided to you when you have a license for the portal)

    Finally,

    Run one of the following lines of code

    To upload or update the portal assets:

    to delete the portal assets:

    Remember to clear your browser cache after updating the portal assets.

    Resources

    Portal Documentation

    Full Documentation

    Please email [email protected] to create a support ticket if there are technical issues.

    bash manage_community_assets.sh path/to/org-org_name 2
    bash manage_community_assets.sh path/to/org-org_name 1

    Here is an image of what a rate card looks like, and what each of the sections means. The details of the rate card are subject to change.

  • If you cannot access the rate card or are not an org admin, please see Appendix A of your order form.

  • When a user makes a project billable to an account, the user assigns ownership of that project's charges to the account.

  • The org admins, in this case admins A and D, have the ability to oversee and discover all projects that are billed to the org and to revoke permissions to a project billed to the org.

  • Documentation
    Billing and Account management Documentation
    Full Documentation

    Apps and Applets

  • Workflows

  • Jobs

  • Analyses

  • Records

  • Each object receives its own unique ID

    • These can be file IDs or job IDs

    • These NEVER change

    • The same file can be uploaded multiple times into a project; different objects will be created. The platform DOES NOT overwrite a file. Instead, it creates a new file ID every time you upload it.

    • Metadata is essential to keep track of these files and their properties, since we cannot change the file ID.

    Data objects can have 2 different types of custom metadata that can be added at any point (see the CLI example below). They are:

    • Tags: which are words to describe the file format, genome, etc.

      • Examples: fastq, control, bam, vcf

    • Properties: key/ value pairs that can be used to describe the file

      • Example: sample_id = 001
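As a minimal sketch of adding this metadata from the CLI (the file ID, tag, and property value are placeholders), using the dx tag and dx set_properties commands:

    # add a tag to a file
    dx tag file-xxxx fastq
    # add a property as a key/value pair
    dx set_properties file-xxxx sample_id=001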

    Viewing Metadata

    Go to your project folder and find the file information. Identify the columns for the name, type/ class, and tags

    You can do this in the overview of each, without selecting the file:

    Or, you can do this by selecting each file individually to see a detailed view:

    Organization

    You can sort and organize data based on pre-established and custom metadata by selecting the “column” icon in the top right. Columns can also be sorted by hovering over the title.

    Filtering Metadata

    You can filter by the metadata present in the project space. The options are drop-down menus above the overview of the metadata headings.

    Data Operations

    When viewing details of a particular data object, you will have a section for the data operations of a file. These include archive, copy, delete, and download. These operations will vary based on the access permission that you have for a given project. You can see the data operations available in the image below:

    Browse, explore, and analyze this data

    Once you have access to the platform and an org that allows for billable activity, you can start working by creating a new project in the UI.

    Setting Up Your Project Space

    1. Navigate to the Projects list page, by selecting Projects in the UI from the main menu, then clicking All Projects.

    2. Click the New Project button (highlighted in gold).

    1. The New Project wizard will open in a modal window.

    • In the Project Name field (highlighted in light blue), enter a name for your project.

    • In the More Info section (highlighted in gold), add in the optional fields for sorting projects, such as

      • Tags

      • Properties

      • Project Summary

      • Project Description.

    • In the Billing Section (highlighted in navy), select the billed to org and Region.

      • In the Billed To field, choose an account to which project billable activities should be charged.

      • In the Region field, select your region if it is not already selected.

    • In the Usage Limits section (highlighted in chartreuse), select the optional compute usage limit and the egress limit. Please note, if you do not have this option and would like to, please contact our sales team at [email protected] or a member of our Success Team.

      • Compute Usage Limits are the monthly compute usage limit for a given project. This value is in USD ($).

      • Egress Usage Limits are the monthly egress limits for a given project. This value is in bytes.

    • In the Access section (highlighted in black), specify which users will be able to conduct data-related operations within the project.

      • Copy Access will limit who can copy data into other projects, or who can use the data as inputs in other projects. The options are All Members or No One.

      • Delete Access will limit who can delete the project. The options are Contributors and Admins or Admins Only.

      • Download Access will limit who can download data from the project. The options are All Members or No One.

    Apps in a workflow will always begin executing as soon as their inputs are satisfied and if possible they will run independently.

  • Workflows can be created by clicking on the Add button and selecting New Workflow.

  • This is what it will look like once you select "New Workflow"

    Running and Monitoring a Workflow

    Set Up and Running the Workflow

    • Add the apps that you want for the workflow and order them where the dependencies are generated first

    • After that, add in the necessary requirements. They are featured below:

    • Select Start Analysis

    • You will have a "pre-flight" check to make sure everything that is needed is there. Once that is complete, select Start Analysis again and it will start to run.

    Monitoring a Workflow

    • You will be redirected once you have started the analysis.

    • The monitor has panels to show what is running, how long it took to complete, and the order they were done in.

    • You can view the information in order to see the details of the workflow.

    Supplemental Information

    Building a Workflow in the GUI

    Monitoring An App/ Workflow

    Resources

    User Interface QuickStart Guide

    Tool Library List

    Full Documentation

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

    Overview of the Somatic Variants Tab

    Within the Somatic Variants Tab, there are the following sections: the Variant Frequency Matrix for the Cohort, a gene Lollipop Plot with search bar by Gene Symbol or Genomic Range, and the Variants and Events Table with the somatic mutations that are present in the lollipop plot above. The tables and figures of the Somatic Variants Tab are highlighted in the figure below:

    Variant Frequency Matrix (Oncoplot)

    A variant frequency matrix has the following features:

    • Genes are sorted (rows of the plot) in descending order of percent of affected samples.

    • Samples are sorted (columns of the plot) by the greatest number of mutated genes across all genes, independent of top mutated genes, in descending order.

    • Each Variant Frequency Matrix has a color scheme by consequence.

    • These features will also work while comparing cohorts.

    • You can also hover over the patient tiles individually for more information.

    There are several options to view these Somatic Variant Frequency Matrices. You can see an overview of all of the somatic mutations, or a particular mutation type, such as Single Nucleotide Variants and Insertions/ Deletions (SNV and Indel), Structural Variants (SV), Copy Number Variants (CNV), and Fusions.

    The first figure gives an overview of the top genes that are mutated in “All” categories, as shown below:

    You can select the individual Variant Frequency Matrices in the drop down menu next to the heading “Variant Frequency Matrix”.

    The options of the matrices are shown in the figure below. The options are SNV and Indel (top left), SV (top right), CNV (bottom left), and Fusions (bottom right).

    Lollipop Plot

    A Lollipop Plot has the following features:

    • Only one gene / canonical protein can be viewed at a time

    • Each lollipop will be color coded by consequence

    • You are able to navigate to a particular Gene Symbol or Genomic range utilizing the search bar

    • You can select (click) a single amino acid change (one lollipop) to quickly filter the somatic variants table

    • Features also work while comparing cohorts

    • You can also hover over the patient tiles individually for more information

    Variant Data Table

    This is a tabular version of the data that you see in the Lollipop plot. You can quickly filter this data while using the lollipop plot (described above) or by filtering on any of the column headers in the table.


    The version of dxpy that you use controls the version of the DNAnexus Nextflow executor and thus the version of Nextflow that is used for executing your pipeline.

  • Nextflow and dxpy versions

    1. Most nf-core pipelines require versions of Nextflow starting with '23', so you will need to use a recent version of dxpy

    2. dxpy versions >= v0.368.1 use Nextflow version 23.10.0

    3. dxpy versions >= 0.343.0 and < 0.368.1 use Nextflow version 22.10.7

    4. For example, a Nextflow applet built using v0.370.2 will have Nextflow version 23.10.0 bundled with it, and it will use this version of Nextflow and v0.370.2 of dxpy for executing the Nextflow pipeline on the platform. You can check your installed version as shown below.
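A minimal check of which dxpy client version is installed locally:

    # print the installed dx/dxpy client version
    dx --version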

  • Resources

    Full Documentation

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

    Building Cohorts based on Somatic Variants
  • Exploratory Data Analysis

  • You Can Filter By:

    • Gene Symbol or Genomic Range

    • Variant effect

    • Variant type

    • HGVS Notation

    • Variant IDs

    Adding Somatic Filters

    1. Add in your dataset

    2. Select "+ Add Filter"

    3. Select Assays and then under Variant (Somatic), select “Genes/ Effects”

    4. Select the genes/ impact/ variants that you want. Please note that the Genes/ Genomic Ranges will accept only Gene Symbols or genomic ranges.

    Supplemental Video for the Somatic Variant Tab + Filtering

    The purpose of the snapshot is to provide a reproducible environment for the JupyterLab jobs.
  • It sets up the environment every time you utilize the snapshot. You do not need to manage dependencies every time you open a JupyterLab job if you utilize a snapshot.

  • Snapshots are saved in the .Notebook_Snapshots/ folder in the project space, and they have a .tar.gz file ending.

  • Creating a Snapshot

    Utilizing a Snapshot

    Snapshots are used in the input section when setting up the JupyterLab Job.

    The input is highlighted in the figure below:

    Snapshot Best Practices

    • Don't save data in your snapshot - it uses storage space and impacts costs.

    • Snapshots can be large and take up storage space.

    • Make sure to rename the snapshot according to your organization's naming conventions so that you can remember what it refers to when returning to the project in the future.

    Storage Locations in JupyterLab

    There is both worker related/ JupyterLab storage, as well as what is present in the Project storage. This is annotated in the figure below:

    Code Blocks

    1. When you are running code blocks, remember that in JupyterLab you can run them out of order. This means that you need to pay attention to the numbers on the side of the code blocks for the order. This is highlighted in gold below:

    2. If you choose to write in Python or R primarily, you can use the following at the top of your code block to "switch" to bash scripting. Example below
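One common way to do this in a Python kernel is the %%bash cell magic (shown here as an assumption about the missing example; dx ls is just an illustrative command):

    %%bash
    # the rest of this cell is executed as a bash script
    dx ls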

    Gene Expression Data Exploration

    Please note, the data present is intended for training purposes only. Information about the data present in this documentation is listed here.

    When gene expression data is present in your data ingestion for the cohort, the Gene Expression tab will appear in the Cohort Browser. The goal of viewing data within the Gene Expression tab is to view gene expression values in your cohort, and to compare between 2 cohorts within the same database.

    Features of the Gene Expression Tab

    Overview of the Gene Expression Tab

    Within the Gene Expression Tab, there are the following sections: plots for Gene Expression, where you can search for genes by Gene Symbol or Ensembl ID, and an Expression per Feature table. The tables and figures of the Gene Expression Tab are highlighted in the figure below:

    Gene Expression Plots

    To view gene expression for a specific gene, type the gene symbol or Ensembl ID into the search bar for the charts labelled “Expression Level”. There are 3 options for the plots: Expression Level with a box plot, Expression Level with a histogram, and a Feature Correlation scatter plot between 2 genes. More than 3 tiles can be added with the “Add Tile” button, and typing in the Gene Symbol or Ensembl ID.

    Expression Level Box Chart

    For the box plot, you can see the distribution of the expression level for a given gene by typing the gene symbol or Ensembl ID into the search bar. You can view the detailed distribution as a violin plot, or as a box plot. The x axis is the distribution of gene expression levels in the cohort and the y axis is the Expression Level. The options to view the detailed distribution are part of the Chart Settings. The box chart with the violin plot (detailed distribution) is shown below:

    Expression Level Histogram

    For the histogram, you can see the distribution of the expression level for a given gene by typing in the gene symbol or Ensembl ID to the search bar. You can see the histogram with or without the display statistics. The x axis is the distribution of gene expression levels in the cohort and the y axis is the Expression Level. The options to view the detailed distribution are part of the Chart Settings. The histogram with the display statistics settings are shown below:

    Feature Correlation

    For the feature correlation, you can see the expression level for a given gene for the x and y axis by typing in the gene symbol or Ensembl ID to the search bar. You can see the feature correlation with or without the display statistics. The x axis is the gene expression level for one gene and the y axis is the gene expression level for another gene. The options to view the detailed distribution are part of the Chart Settings. The feature correlation with the display statistics settings are shown below:

    Introduction

    Building Applets

    DNAnexus apps and applets are ways to package executable code. The biggest difference between apps and applets is their visibility. Apps such as you find in the Tool Library are globally available and maintained by DNAnexus and partners like Nvidia and Sentieon. Applets are private to an organization and exist as data objects in a project. They can be shared across projects and promoted to generally available apps. Native DNAnexus applets are built using dx build to create an executable for bash or Python code, which in turn may execute any program installed on the instance.

    Later, we will discuss how to build a workflow, which is a combination of two or more apps/applets. We will build native workflows using the GUI and languages like WDL (Workflow Description Language) and Nextflow combined with Docker images.

    Development Cycle

    As shown in following figure, the development cycle is to write code locally, use dx build to create a native applet on the platform, and then dx run to run the applet. You can view the execution logs with dx watch, then make changes to your code to build and run again.

    Installing Required Software

    To install the Python modules required for this tutorial, run the following command:

    You may be prompted to expand PATH with the installation directory, such as ~/.local/bin:

    Next, ensure you have a recent version of Java. For this tutorial, I'm using the following:

    If you want to use Cromwell to execute WDL locally, you should download the Cromwell Jar file. This tutorial assumes you will place this file in your home directory using the following commands.

    I suggest you use the link command (ln) to create a symlink to the filename cromwell.jar so that upgrading in the future will not break your commands:

    WOMtool (Workflow Object Model) is also quite useful, and I suggest you similarly download it and link it to womtool.jar:

    You will use the DNAnexus dxCompiler to build WDL applications on the platform. Find a link to the latest Jar file under the releases of the Git repository. For example, the following commands will download dxCompiler-2.10.5.jar to your home directory and symlink it to dxCompiler.jar:

    Some tools may attempt to use the shellcheck tool to validate any shell code in your WDL. To install shellcheck on Ubuntu, run the following:

    On macOS, you can use Homebrew to install the program:

    The dx CLI

    If the dxpy module installed properly, you should be able to run dx on the command line. For instance, run dx all to see a list of valid commands:

    To get started, do the following:

    • Run dx login to identify yourself to the DNAnexus platform. Enter your username and password. You can also set up a token to log in. Information on setting up tokens can be found in the Using Tokens section of our Documentation.

    • You may also be prompted to select a project. If not, you should use dx select to select a project that will contain your work.

    • If you do not see a project you wish to use for your work, run dx new project to create one from the command line, or click "New Project" in the web interface.

    Note that each subcommand will respond to the flags -h|--help to display the usage documentation. For instance, dx new can create several object types, which you can discover by reading the documentation:
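For example, a minimal way to list those object types and their usage:

    # show the subcommands and usage for dx new
    dx new -h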

    You should now be prepared to develop DNAnexus apps and workflows.

    Resources

    Please email to create a support ticket if there are technical issues.

    Combining Different Filtering Types

    To filter with gene expression data, you can add a filter based on the tiles created in the Gene Expression tab or use the “+ Add Filter” button in the Cohort Banner.

    These filters allow for:

    • Creating a more complex cohort with Phenotypic and genomic filtering

    You Can Filter By Combinations with the following data:

    • Phenotype/ Clinical data

    • Germline Variants

    • Somatic Variants

    • Gene Expression Changes

    Adding Multiple Filters

    1. Add in your dataset

    2. Select "Add Filter"

    3. Choose the filtering that you are interested in. Details for Phenotype Filtering, Germline Filtering, Somatic Filtering, and Gene Expression are available in previous sections of the documentation. (These will have links).

    4. Once the initial filter is complete, select “Add Additional Criteria” next to the filter, as shown below:

    5. Repeat the process for the next cohort filter that you need.

    Supplemental Videos for the Cohort Browser

    Combining Cohort Filters: Phenotypic and Germline Variants

    Combining Cohort Filters: Phenotypic and Somatic Variants

    Combining Cohort Filters: Phenotypic and Gene Expression

    TTYD

    Starting a TTYD Instance

    You can start a TTYD job the same way as you would any other job in the UI.

    1. Select Start Analysis in the top right corner in the project space.

    2. Select the app called “ttyd”.

    3. Select Next and then Start Analysis.

    4. As the last step before launching the tool, you can review and confirm various runtime settings. Click on Launch Analysis. The job will be launched and you will be redirected to the Monitor tab in a few seconds.

    5. In the Monitor tab, select the name of the ttyd job to view more details.

    6. Once the state of the job switches to “Running”, you will be able to enter the ttyd with the “Open Worker URL” link in the top heading of the details page. If the page to which you get redirected says “502 Bad Gateway”, the worker is not yet fully initialized. Close the page, give it a few more minutes and try to open the worker URL again.

    7. This will open a terminal in your browser that will give you access to the files in the DNAnexus project in which the app is running by mounting it in a read-only mode in the /mnt/project directory of the worker execution environment (see the example after these steps).

    8. Once you are done with your work in ttyd, don’t forget to terminate the job by clicking the red button Terminate in the job’s details page.
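As a minimal sketch from inside the ttyd terminal (the file name is a placeholder), using the read-only /mnt/project mount described above:

    # list the project's files through the read-only mount
    ls /mnt/project
    # copy a file out of the mount to the worker's local disk before working on it
    cp /mnt/project/my_input.txt .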

    Resources

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select “Contact Support”

    3. Fill in the Subject and Message to submit a support ticket.

    TTYD vs Cloud Workstation

    TTYD vs Cloud Workstation

    • Purpose

      • TTYD: To have terminal access in your web browser

      • Cloud Workstation: Sets up a virtual workstation that lets you access and work with data stored on the DNAnexus Platform

    • Time Limits

      • TTYD: Have to manually terminate/ does not have an input for a time limit

      • Cloud Workstation: Time limit is an input.

    • Snapshots

      • TTYD: None

      • Cloud Workstation: Can save snapshots

    • SSH

      • TTYD: Does not need SSH access

      • Cloud Workstation: Does need SSH access

    • Common Uses

      • TTYD: CLI operations and to launch https apps within the web browser

      • Cloud Workstation: Used for analysis of platform data and testing applets, since the environment is what is opened when launching an app or applet

    Resources

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select “Contact Support”

    3. Fill in the Subject and Message to submit a support ticket.

    Germline Data Filtering

    To filter with germline data, use the “+ Add Filter” button in the Cohort Banner.

    These filters allow for:

    • Assessing impact of ingested variants in cohorts

    • Note: only non-ref variants are represented in the genomic data

    • Building Cohorts based on Variants

    • Develop basket studies based on your population

    • Exploratory Data Analysis before GWAS

    • Ask questions about co-occurrence with other mutations

    You Can Filter By:

    • Gene Symbol or region

    • Variant effect

    • Variant type

    • Variant ID

    Adding Germline Filters

    1. Add in your dataset

    2. Select "+ Add Filter"

    3. Select Assays and then under Genome Sequencing, select “Genes/ Effects”

    4. Then, select the genes/ impact/ variants that you want. Please note, the filtering for the Genes/ Genomic regions is by Gene Symbol or Genomic range.

    Supplemental Video: Germline Data Tab + Filtering

    Overview of JSON files for Portals

    Disclaimer: Portals require a license. These documents are to get you started with your portals. By no means is this the only way to make your portal, nor is this the only way to edit a json file.

    Each section of a portal has a different json file.

    Here is a visual of which json file edits which section of a portal:

    navigation.json

    This section defines the following:

    • navigation/ header bar

    • items that are in the header after the logo that are also not included in the branding.json

    • You can also add/ delete navigation items

    branding.json

    This section defines the following:

    • logo

    • colors

    • if you want a login page

    • a home URL attached to the logo

    home.json

    • This controls the home page for the community portal

    • You can specify the following:

      • order of the sections

      • components

    Resources

    Please email to create a support ticket if there are technical issues.

    Introduction

    Resources for JSON

    • What are JSON files and how do you use them?

    • What is a JSON file?

    JSON File Basics

    • JavaScript Object Notation

    • Common format for communicating with Application Program Interface (API)

    • Used to access DNAnexus API servers

    Why Should You Learn JSON?

    • Reading and modifying JSON is at the heart of building and running apps

    • Understanding JSON responses from the API will help you debug jobs

    • Automation and Batch submissions: running the same app on multiple files

      • Find which jobs have failed and why

    Elements and Reading JSON

    A valid JSON document is enclosed in one of two data structures, either a list of values contained in square brackets:

    Or an object composed of key/value pairs contained in curly brackets:

    Example:
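A minimal sketch of each form (the values are placeholders drawn from terms used elsewhere in this documentation):

    ["fastq", "bam", "vcf"]

    {"name": "samtools", "threads": 4}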

    A JSON value may be any of the following:

    • double-quoted string, e.g., "samtools" or "file-G4x7GX80VBzQy64k4jzgjqgY"

    • integer, e.g. 19 or -4

    • float, e.g., 3.14 or 6.67384e-11

    Lists

    • Lists are enclosed in square brackets [ ]

    • Similar to Python syntax

    • Used for multiple values separated by commas

    Example:
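A minimal sketch of a list (the sample names are placeholders):

    ["sample_001", "sample_002", "sample_003"]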

    Objects

    • An object starts and ends with curly braces

    • An object contains key/value pairs

    • Keys must be quoted strings

    • Values may be any JSON value, including another object

    Example:
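A minimal sketch of an object, including a nested list and a nested object (all names and values are placeholders):

    {
      "name": "sample_001",
      "tags": ["fastq", "control"],
      "properties": {"sample_id": "001"}
    }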

    Resources

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

    Useful Information

    You also do not need to define an executor like you might for some other cloud Nextflow executors. By default, the executor is 'local'. However, if you are, for instance, going to be running Nextflow in multiple locations and want different settings based on location, you could set a DNAnexus profile in your nextflow.config which explicitly defines the executor and things like the default queueSize.

    Here is an example DNAnexus executor profile which also enables docker.

    When running on DNAnexus, you would then give '-profile dnanexus' to 'nextflow_run_opts' in the UI; in the CLI it would be -inextflow_run_opts='-profile dnanexus'.

    You could also create a test profile for testing on your own servers/cloud workstation and on DNAnexus.

    Tips and Tricks

    • If a pipeline contains inputs from external sources (such as S3, FTP, HTTPS), those files are staged on the head-node and may run out of storage space (inputs sourced from DNAnexus are not staged in this way).

      • The instance size of the head-node can be customized in "Applet Settings" on the UI or with the --instance-type flag on the CLI

    • 20 sessions can be cached per project

      • The number of times any of those sessions can be resumed is unlimited

    Resources

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

    Some of the links on these pages will take the user to pages that are maintained by third parties. The accuracy and IP rights of the information on these third-party pages are the responsibility of those third parties.

    Adding Users to a Project

    You can collaborate on the DNAnexus Platform by giving project access to other users. Project access can be revoked at any time by a project administrator.

    Adding Project Members

    Once you've created a project, you can add members by doing the following:

    1. From the project's Manage screen, click the Share Project button - the "two people" icon - in the top right corner of the project page.

    Running a Spark JupyterLab Notebook

    Use Cases for Spark JupyterLab Instances

    • Utilization of load_cohort. It requires running SQL on Spark and has a specialized functionality that we only support via dxdata (python).

    • Complex interactions with records/Spark must be done via Python.

    Running Docker with Swiss Army Knife

    Here is the overview of Swiss Army Knife:

    To use a GATK image (made in previous sections) with the Swiss Army Knife tool, you will do the following:

    Command Line

    From there, this will be what the command line prompts will be:
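As a rough sketch of such a command (the image_file/in/cmd input names and the file IDs are assumptions here; confirm the exact inputs with dx run app-swiss-army-knife -h):

    # run a GATK command inside the saved Docker image via Swiss Army Knife
    dx run app-swiss-army-knife \
      -iimage_file=file-xxxx \
      -iin=file-yyyy \
      -icmd="gatk --version"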

    From there, you will see the job log of the Swiss Army Knife App.

    Resources To Learn Nextflow

    Nextflow has resources depending on where your skill level is with using their workflow language.

    The general page for these resources can be found at Nextflow Training and Nextflow Learning Options.

    Nextflow also has a general Documentation Page and a Nextflow Maintained Blog.

    New to Nextflow Users

    Creating Docker Snapshots

    This is a walk-through of how to add an existing Docker image to the platform and save it as a snapshot file on the platform.

    To get started with this, you will either need to 1) open a ttyd or 2) have Docker installed and use your local terminal with the dx-toolkit installed as well.

    Overview of How to Use the Docker Image to Snapshot File
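A minimal sketch of that flow (the GATK image name/tag and the output filename are illustrative; any image you can pull locally works the same way):

    # pull the image locally, save it as a gzipped tarball, then upload it to the project
    docker pull broadinstitute/gatk:4.4.0.0
    docker save broadinstitute/gatk:4.4.0.0 | gzip > gatk_4.4.0.0.tar.gz
    dx upload gatk_4.4.0.0.tar.gz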

    Using Docker

    Why Docker?

    • Portability - Run code and scripts on any machine that has Docker installed

      • Avoids installation headaches on multiple machines (dependencies are installed with software)

    Navigation JSON File

    Disclaimer: Portals require a license. These documents are to get you started with your portals. By no means is this the only way to make your portal, nor is this the only way to edit a json file.

    Overview of the navigation.json file

    • This .json file personalizes the banner that you use to navigate to different sections.

    Accessing Data Profiler in ML JupyterLab

    If you also have access to the ML JupyterLab (another solution in the AI/ML Accelerator Package), Data Profiler can be seamlessly opened in the JupyterLab environment, offering an intuitive and interactive tool for profiling multiple datasets directly within one workspace.

    To get started, simply open an ML JupyterLab notebook, load the dataset, and profile it.

    Profiling the Dataset

    The integrated version of Data Profiler in ML JupyterLab (dxprofiler) offers four methods for loading your datasets to profile the data:

    pip3 install --upgrade dxpy
    profiles {
    
        dnanexus {
            executor {
                name = 'local'
                queueSize = 50
            }
            docker {
                enabled = true
            }
        }
    
        cluster {
            executor {
                name = 'sge'
                memory = '20GB'
            }
        }
    }
    Nextflow support was initially introduced at dx v0.330.0 and uses a version of Nextflow older than 22.10.7


    Full Documentation
  • Sessions can be deleted to allow more, or development/running can be migrated to another project, which will have its own 20-session limit.

  • Private S3 can be referenced by adding AWS scope to configs: https://www.nextflow.io/docs/latest/amazons3.html?#aws-access-and-secret-keys

  • Full Documentation
    support URL
  • descriptions

  • text

  • tables

  • images

  • reference material/ links (shown above)

  • links to DNAnexus projects (shown above)

  • featured tools

  • Portal Documentation
    Full Documentation
    [email protected]
  • Nextflow Foundational Training Videos

  • Advanced Nextflow Users

    • Nextflow Training

    • Nextflow's Amazon S3 Storage Documentation

    • Nextflow's Amazon Cloud Documentation

    Nf-Core

    • Nf Core Documentation

    • Pipelines

    • Homepage

    Resources

    Full Documentation

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

Some of the links on these pages will take the user to pages that are maintained by third parties. The accuracy and IP rights of the information on these third-party pages are the responsibility of those third parties.

    Nextflow Training
    Nextflow Learning Options
    Documentation Page
    Nextflow Maintained Blog
    Nextflow Training

    Finally, run dx ssh_config to set up SSH keys for connecting to cloud instances.

    Cromwell
    WOMtool
    dxCompiler
    Git repository
    shellcheck
    Homebrew
    a list of valid commands
    dx login
    Using Tokens
    dx select
    Full Documentation
    [email protected]

    Run the failed jobs again

    boolean, e.g., true or false

  • null

  • object

  • Full Documentation
    Resources

    Full Documentation

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

    Swiss Army Knife
    Loading the dataset by specifying a path to the local folder (in the ML JupyterLab job) which contains the .csv or .parquet files.
  • Loading the dataset by a list of .csv or .parquet files.

  • Loading the dataset by Pandas dataframes ('patient_df' and 'clinical_df')

  • Loading the dataset by a record object (DNAnexus Dataset or Cohort). "project-xxxx:record-yyyy" is the ID of your Apollo Dataset (or Cohort) on the DNAnexus platform.

  • Open the Data Profiler GUI

    Once you finish profiling the dataset, here is the command to open the Data Profiler GUI:
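dataset.visualize()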

    Resources

    Full Documentation

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

    python3 -m pip install dxpy miniwdl
    PATH=~/.local/bin:$PATH
    $ javac -version
    javac 18
    cd ~
    wget https://github.com/broadinstitute/cromwell/releases/download/84/cromwell-84.jar
    ln -s cromwell-84.jar cromwell.jar
    cd ~
    wget https://github.com/broadinstitute/cromwell/releases/download/84/womtool-84.jar
    ln -s womtool-84.jar womtool.jar
    cd ~
    wget https://github.com/dnanexus/dxCompiler/releases/download/2.10.5/dxCompiler-2.10.5.jar
    ln -s dxCompiler-2.10.5.jar dxCompiler.jar
    sudo apt install shellcheck
    brew install shellcheck
    $ dx all
    usage: dx [-h] [--version] command ...
    
    DNAnexus Command-Line Client, API v1.0.0, client v0.320.0
    
    dx is a command-line client for interacting with the DNAnexus platform.  You
    can log in, navigate, upload, organize and share your data, launch analyses,
    and more.  For a quick tour of what the tool can do, see
    
      https://documentation.dnanexus.com/getting-started/tutorials/cli-quickstart#quickstart-for-cli
    
    For a breakdown of dx commands by category, run "dx help".
    
    dx exits with exit code 3 if invalid input is provided or an invalid operation
    is requested, and exit code 1 if an internal error is encountered.  The latter
    usually indicate bugs in dx; please report them at
    
      https://github.com/dnanexus/dx-toolkit/issues
    
    optional arguments:
      -h, --help  show this help message and exit
      --env-help  Display help message for overriding environment
                  variables
      --version   show program's version number and exit
    
    dx: error: argument command: invalid choice: all
    (choose from login, logout, exit, whoami, env, setenv, clearenv, invite,
    uninvite, ls, tree, pwd, select, cd, cp, mv, mkdir, rmdir, rm, describe,
    upload, download, make_download_url, cat, head, build, build_asset, add, list,
    remove, update, install, uninstall, run, watch, ssh_config, ssh, terminate,
    rmproject, new, get_details, set_details, set_visibility, add_types,
    remove_types, tag, untag, rename, set_properties, unset_properties, close,
    wait, get, find, api, upgrade, generate_batch_inputs,
    publish, archive, unarchive, help)
    $ dx new -h
    usage: dx new [-h] class ...
    
    Use this command with one of the available subcommands (classes) to create a
    new project or data object from scratch. Not all data types are supported. See
    'dx upload' for files and 'dx build' for applets.
    
    positional arguments:
      class
        user      Create a new user account
        org       Create new non-billable org
        project   Create a new project
        record    Create a new record
        workflow  Create a new workflow
    
    optional arguments:
      -h, --help  show this help message and exit
    [
        {
            "project": "project-Gg2QQx002Q7yY4kFQF7GKYPV",
            "id": "applet-G1951vj0YyjJjbvGJ9FZB967",
            "describe": {
                "id": "applet-G1951vj0YyjJjbvGJ9FZB967",
                "project": "project-Gg2QQx002Q7yY4kFQF7GKYPV"
            }
        },
        {
            "project": "project-Gg2QQx002Q7yY4kFQF7GKYPV",
            "id": "file-GGy7Pbj0Xf47XZk125k22g9v",
            "describe": {
                "id": "file-GGy7Pbj0Xf47XZk125k22g9v",
                "project": "project-Gg2QQx002Q7yY4kFQF7GKYPV"
            }
        }
    ]
    {
       "report_html": {
           "dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY"
       },
       "stats_txt": {
           "dnanexus_link": "file-G4x7GXQ0VBzZxFxz4fqV120B"
       }
    }
    {
        "dnanexus-link": [
            "file-G4x7GXQ0VBzZxFxz4fqV120B", "file-G4x7GX80VBzQy64k4jzgjqgY"
        ]
    }
    { 
        "report_html": {
            "dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY"
        }
    }
dx run app-swiss-army-knife -iimage_file="gatk.tar.gz" -iin="data/mock.vcf" -icmd="gatk SelectVariants -V mock.vcf -O selected.snp.vcf --select-type-to-include SNP"
    import dxprofiler
    dataset = dxprofiler.profile_files(path_to_csv_or_parquet=['/path/to/table1.csv', '/path/to/table2.csv'], data_dictionary=None)
    import dxprofiler
    dataset = dxprofiler.profile_dfs(dataframes={'patient_df': patient, 'clinical_df': clinical}, data_dictionary=None)
    import dxprofiler
    dataset = dxprofiler.profile_files(path_to_csv_or_parquet='/path/to/tables/', data_dictionary=None)
    import dxprofiler
    
    dataset = dxprofiler.profile_cohort_record(record_id="project-xxxx:record-yyyy")
    dataset.visualize()
1. Type the username or the email address of an existing Platform user, or the ID of an org whose members you want to add to the project.

2. In the Access pulldown, choose the type of access the user or org will have to the project.

3. If you don't want the user to receive an email notification on being added to the project, set the Email Notification toggle to "Off."

4. Click the Add User button.

5. Repeat Steps 1-4 for each user you want to add to the project.

6. Click Done when you're finished adding members.

    Removing Project Members

    To remove a user or org from a project to which you have ADMINISTER access:

    1. On the project's Manage screen, click the Share Project button - the "two people" icon - in the top right corner of the page. A modal window will open, showing a list of project members.

    2. Find the row showing the user you want to remove from the project.

    3. Move your mouse over that row, then click the Remove from Members button at the right end of the row.

    Project Access Levels

    Access Level

    Description

    VIEW

    Allows users to browse and visualize data stored in the project, download data to a local computer, and copy data to other projects.

    UPLOAD

    Gives users VIEW access, plus the ability to create new folders and data objects, modify the metadata of open data objects, and close data objects.

    CONTRIBUTE

    Gives users UPLOAD access, plus the ability to run executions directly in the project.

    ADMINISTER

    Gives users CONTRIBUTE access, plus the power to change project permissions and policies, including giving other users access, revoking access, transferring project ownership, and deleting the project.

    Spark JupyterLab is ideal for extracting and interacting with the dataset or cohort.

  • Spark JupyterLab is NOT meant for downstream analysis.

  • General “Recipe” for Utilizing Spark JupyterLab Notebooks

1. Create a DX JupyterLab Notebook so that it will automatically save onto the Trusted Research Environment. You can do so by selecting one of these 2 options:

      a. Option 1 is from the Launcher:

      b. Option 2 is from the DNAnexus Tab:

    2. Start writing your JupyterLab Notebook. Select which kernel you are going to use (options will vary depending on the Image you selected in set up).

    3. Download packages and save the software environment as a snapshot.

      a. Download Packages

      b. Save the Snapshot of the environment

    4. Start writing your code (see the sketch after this list).

      a. Import Packages using import (at minimum, you will need dxdata and pyspark)

      b. Load the dataset with dx extract_dataset

      c. Initialize Spark

      d. Retrieve data and cohorts that you are interested in

      e. Upload Results back to Project Space

    5. Save your DX JupyterLab Notebook
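A minimal sketch of step 4, based on the code snippets for this section (replace dataset_id and the upload path with your own values; the delimiter value is an example):

    # a. Import packages
    import dxdata
    import pprint
    import pyspark
    from pyspark.sql import functions as F

    # b. Load the dataset dictionaries with dx extract_dataset (run in a %%bash cell)
    #    dx extract_dataset dataset_id -ddd --delimiter ","

    # c. Initialize Spark
    sc = pyspark.SparkContext()
    spark = pyspark.sql.SparkSession(sc)

    # d. Retrieve the data and cohorts you are interested in (dxdata / Spark SQL queries go here)

    # e. Upload results back to the project space (run in a %%bash cell)
    #    dx upload FILE --destination /your/path/for/results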

    Opening Notebooks from Project Storage

    • Notebooks can also be directly opened from project storage

    • When you save in JupyterLab, the notebook gets uploaded to the platform as a new file. This goes back to the concept of immutability.

    • Old version of notebook goes into .Notebook_archive/ folder in project.

    Example of the Code in Action
    1. You have to pull the Docker image from the registry to the platform. For this example, the code is
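docker pull broadinstitute/gatk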

    That results in this view:

Notice that each of the "layers" of the image (on the left-hand side) will show "Extracting" and then "Pull complete". This takes a few minutes depending on the size of the Docker image.

2. Now you have to save this Docker image to a file. For this example, the code is:
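docker save broadinstitute/gatk -o gatk.tar.gz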

    This again takes time depending on the size of your docker file.

3. Now you will need to upload this image file to the platform. For this example, the code is:
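dx upload gatk.tar.gz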

    The last 2 steps have the following output:

    It should then be in the project space that you have chosen. You can also check this in the GUI.

    Example:

    Resources

    Full Documentation

    Using Docker Images

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

  • Make it easy to run batch jobs on multiple instances

  • Reproducibility - be able to run code and generate the same outputs given a set of input files

    • Tie all software to specific versions

    • Utilize Docker images with multiple bioinformatics software installed

    • Examples: Rocker Project, GATK4

  • Docker Terms

    • Docker Registries

      • Collection of repositories that hold container images

      • docker pull: pulls the images from a registry to a container on our machine

      • docker commit: saves changes made in a container as a new image (which can then be pushed back to a registry)

    Snapshots vs Images on the Platform

    There are hard limits for using Docker Images.

    • DockerHub and other registries have a pull limit of 200 pulls/user/day

    • Saving a snapshot file to your project lets you scale without these limits

    • Especially helpful in batch processing

    Docker and Security

    • Use images from trusted vendors whenever possible

      • Examples: Official Ubuntu Image, Amazon Linux Image, Biocontainers

      • Avoid "kitchen sink" images - hard to manage vulnerabilities

    • In general: pay attention to possible vulnerabilities and whether they affect your containers

    • Use dockerfiles to uninstall/patch possible vulnerabilities in images

    Resources

    Full Documentation

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

    Examples: projects in the project tab, different tools you want immediate access to in the tool section.

    If you have questions about how to use a json file, please view this section

    Overview of the Sections of a portal and matching json files:

    Example of the navigation.json file

    The navigation.json file for this example is blank. These are the default items in the header

    Sections in your navigation.json file

    This file must be at least accessible to community members.

    This file is optional. It allows you to edit the feature list of projects. _projects, _tools, and supportURL are all optional.

    They can be

    • null, which will remove the item from the header

    • an array of objects

      • Text for the text of the new menu item,

      • url as the destination

      • newTab for if the link should open up a new tab

    • If there is another entry, it indicates that a new navigation item needs to be added.

      • They can be objects with a url and optional parameters; with this method, the entry becomes a single link in the navigation (newTab controls whether it opens in a new tab)

      • They can also be array of objects with text, url, and newTab (which will give it a dropdown menu with listed items)

    Examples of items to add to a navigation.json:
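A sketch based on the example in the code samples for this page (JSON does not allow comments, so the explanations are given here instead): "_projects": null removes the default Projects list, "_tools" adds custom entries to the Tools menu (with newTab controlling whether a link opens in a new tab), "_help": null removes the Help item, "A New Menu" adds a dropdown menu with its listed items, and "A New Link" adds a single navigation link that opens in a new tab.

{
 "_projects": null,
 "_tools": [
  {"text": "Custom Menu Item", "url": "http://example.com"},
  {"text": "Opens in New Tab", "url": "http://example.com", "newTab": true}
 ],
 "_help": null,
 "A New Menu": [
  {"text": "New Menu Item", "url": "http://example.com"}
 ],
 "A New Link": {"url": "http://example.com", "newTab": true}
}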

    Resources

    Portal Documentation

    Full Documentation

    Please email [email protected] to create a support ticket if there are technical issues.

    Full Documentation

    Orgs and Account Management

    Account Management

    Glossary of Terms

    User

    A single person that is utilizing the platform

    Org

    Collection of users. They are either admin or member level

    Permissions

    Gives the user the ability to work on the platform within the scope of what they are needing to access

    Project

A means of enabling users to collaborate by providing them with shared access to specific data and tools.

    Within a project space, there are different permission/access levels. They are:

VIEW (most restrictive): view the project, and move and copy data across projects.

    UPLOAD: VIEW access, plus create folders and modify metadata.

    CONTRIBUTE: UPLOAD access, plus run executions.

    ADMINISTER: CONTRIBUTE access, plus change permissions for users, project ownership, and deletion.

    Defining Members of an Org and their Relationship

    • is used to represent a group of users

    • Can be used to simplify the sharing of projects, apps, and billing

    • Have members and admins

    • Control the access to billable activities, shared apps, and shared projects

    Why Add Users to an Org?

    • Allows the access to the shared apps

      • This is for what the org is an authorized user for

      • If the org cannot use the app, the member cannot either

    • Allows members to see the price column in the UI Monitor tab and on the command line

    Project Overrides for a User

    • By default, when a project is created, the settings tab shows the following:

    • The owner of the project can change these

    • You may want to restrict them depending on your org policy

    • Copy access

    Sharing in an Org

    The org allows for the sharing

    • of the same resources

      • Control the access as stated above

      • Org admins can remove and add users

    • to users performing similar functions

    Example 1: Orgs and Sharing

    • Sharing projects and apps within orgs allows a group of users performing similar functions to be given the same level of access to shared resources.

    • In this example, there is the org administrator, admin A, who provides VIEW access to the project resources to the org. Additionally, admin A adds users B and C to the org, and also adds admin D to the org.

    • Admin D then provides UPLOAD permissions to the project raw data, and makes the org an authorized user of the QC app. So in addition to being a convenient way to share projects and data, the org also helps provide access to apps.

    Multiple Orgs

    You can have multiple people in multiple orgs.

    Example 2: Multiple Orgs

    • Members who are working on two separate projects, and they need access to different data/ apps which have different budgets.

    • A user may need to create and work on projects that are billed to two separate teams or groups. This is where creating multiple orgs comes in handy.

    • Admin D is admin of both org and org-new because admin D needs to work within both of these orgs.

    • Admin D adds user E to both org and org-new, and adds user F only to org-new, because user F only needs to work within org-new.

    Resources

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

    Tool Library and App Introduction

    Before you begin, review the overview documentation and log onto the DNAnexus Platform

    Tool Library

The tool library is a set of ready-to-use apps and workflows that are maintained by DNAnexus.

    There are different categories and you can search by name of the tool.

    Steps to finding Tool Documentation

    1. Navigate to the Tool Library

    2. In the Any Name search box, start entering "FASTQC...."

    3. Click on the tool name, and you will be at the info tab of the tool.

    4. Select the Version: If you want the same version that is loaded automatically, this is all that you will need to do. If you want a different Version, select the Versions tab and select which version you want.

    You can also select "Run" to run the app

    Tool Runner Options

    There are 2 options for running the tool. First, select "Run" where you find the tool documentation.

    Then, there are 2 different UIs for setting up the app to run:

    The guided set up, which is what you normally start with

    Or, the I/O graph

    Tool Useful Features

    In the Stage settings tab, you can set the version of the app you want to use, instance type and specify the output folder. By specifying the instance type, you will set the computational resources of the machine on which the analysis will be run. For example, if your input data is large, you will choose an instance type with more storage space available.

Required inputs are indicated by asterisks; other inputs are optional.

    It is point and click.

    Can select your instance here.

Batch analysis can be enabled here. At this time, the feature applies to a batch of inputs, and the output is aggregated into one output file (e.g., 10 inputs result in 1 output).

    Running and Monitoring an App

    Set Up

    • Once you have selected the app you want to use and read the documentation (if applicable), you will use the guided setup to run the app in the UI.

    • Set the Output folder

    • Set the inputs. In the example of FASTQC, it is one FASTQ file

    • Launch the app using the start analysis button in the upper right
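For reference, the same launch can be scripted from the command line. This is only a sketch; the input name for the FASTQC app is an assumption here, so check dx run app-fastqc -h for the exact input names:

# Launch FASTQC on a single FASTQ file (input name 'reads' is assumed)
dx run app-fastqc -ireads="my_sample.fastq.gz" --destination /users/me/fastqc_output -y

# Follow the job log from the terminal
dx watch <job-id>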

    Monitor An App

    • You will automatically be redirected to the monitor page

    • When the job is completed, you will have buttons to access the inputs (such as a FASTQ file) and outputs (such as an HTML file).

    • Here is the view when the app is completed:

    Supplemental Information

    Using Apps in the GUI

    Batch Processing in the GUI

    Monitoring An App/ Workflow

    Resources

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

    Introduction to JupyterLab

    New to JupyterLab?

    If you have never used a JupyterLab notebook before, please view this information:

    • Jupyter Notebook Documentation

    Introduction

    We can interact with the platform in several different ways and install software packages in these different environments depending on what we are wanting to use and how we want to use it. As shown in the diagram below, we will be explaining Jupyter Lab Python/R/Stata and Spark JupyterLab Python/R:

    Why JupyterLab?

    Data Scientists’ tasks can be interactive. Options for interactive analysis in JupyterLab are:

    • Notebook-based Analysis

    • Exploratory Data Analysis (EDA)

    • Data Preprocessing/ Cleaning

    • Implementing New Machine Learning(ML)/ Model

    Requesting an Instance

    Use Single DXJupyter Instance if:

    • The work can be done on a single machine instance

    • Main Use Cases:

    • Python/R

    • Image Processing

    Use Spark Cluster DXJupyter If:

    • Working with very large datasets that will not fit in memory on a single instance

    • Using the Cohort Browser and querying a large ingested dataset

    • Needing to use Spark based tools such as dxdata, HAIL or GLOW

    Starting a JupyterLab Job

1. Select JupyterLab with Python, R, Stata, ML, Image Processing, or JupyterLab with Spark from the Tool Library, or select "Start Analysis" from the project space and select JupyterLab from the tool list. Once selected, press "Run Selected".

    2. Select the output location, and change the job name if desired.

    3. Then, select the inputs you intend on using:

      a. Snapshot file (not required; how to create a snapshot is in the Utilizing Snapshot section)

      b. Input files (not required; can be done in the notebook analysis)

      c. Stata settings file (license required for Stata)

    4. Then, press "Start Analysis" in the far right corner.

    5. Next, confirm the following parameters:

      a. Job Name

      b. Output Folder

      c. Priority (defaults to normal, can be set to high)

    6. Then, press "Launch Analysis".

    7. When redirected to the monitor tab, select the job name.

    8. It will redirect you to the details of the JupyterLab job. Wait for the job to start running, and for the worker URL to appear.

    9. Press "Open Worker URL" and the JupyterLab home page will appear.

      Note: Sometimes, the job is still initializing, so if you press Open Worker URL immediately, it may show a 502 error message. This is okay, and the page will load once the job has finished initializing.

    Running instances may take several minutes to load as the allocations become available.
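The same job can also be started from the command line. A minimal sketch, assuming the public DXJupyterLab app and the input names shown in the UI (the feature and duration values here are example assumptions; check dx run app-dxjupyterlab -h for the exact inputs in your region):

# Launch a single-instance JupyterLab job (feature and duration are example values)
dx run app-dxjupyterlab -ifeature=PYTHON_R -iduration=240 --destination /users/me/jupyter -y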

    Error Strategies for Nextflow

    Nextflow's errorStrategy directive allows you to define how the error condition is managed by the Nextflow executor at the process level.

    There are 4 possible strategies:

errorStrategy
    Description

    terminate (default)

    terminate all subjobs as soon as any subjob has an error

    finish

    when any subjob has an error, do not start any additional subjobs and wait for existing jobs to finish before exiting

    ignore

    pretend you didn't see it: just report a message that the subjob had an error, but continue all other subjobs

    retry

    when a subjob returns an error, retry that subjob

The DNAnexus Nextflow documentation has a very detailed description of what happens for each errorStrategy.

    Generally the errorStrategy is defined in either the base.config (which is referenced using includeConfig in the nextflow.config file) or in the nextflow.config file.

In nf-core pipelines, the default errorStrategy is usually defined in base.config and is set to 'finish', except for error codes in a specific numeric range, which are retried.

The code below is from the sarek base.config.
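    // memory errors which should be retried. otherwise error out
    errorStrategy = { task.exitStatus in ((130..145) + 104) ? 'retry' : 'finish' }
    maxRetries    = 1
    maxErrors     = '-1'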

The maxRetries directive allows you to define the maximum number of times the exact same subjob can be re-submitted in case of failure, and the maxErrors directive allows you to specify the maximum number of times a process (across all subjobs of that process) can fail when using the retry error strategy.

In the code above, if the exit status of the subjob (task) is within 130 to 145, inclusive, or is equal to 104, then it will retry that subjob once (maxRetries = 1). If other subjobs of the same process have the same issue, they will also be retried once (maxErrors = '-1' disables the limit on how many times a process can fail, so if every subjob executed for a particular process failed, each would be retried the number of times set in maxRetries). Otherwise, the finish errorStrategy is applied: no new subjobs are started, but other running non-errored subjobs are allowed to complete.

    For example, imagine you have a fastqc process that takes in one file at a time from a channel with 3 files (file_A, file_B, file_C)

    The process is as below and is run for each file in parallel

    • fastqc(file_A)

    • fastqc(file_B)

    • fastqc(file_C)

    If the subjob with file_A and the subjob with file_C fail first with errors in range 130-145 or with a 104 error, they can each be retried once if maxRetries =1 .

Now imagine that you set maxErrors = 2. In this case, there are 3 instances of the process but only 2 errors are allowed across all instances of the process. Thus, it will only retry 2 of the subjobs, e.g., fastqc(file_A) and fastqc(file_C).

    If fastqc(file_B) encounters an error at any point, it won't be retried and then the whole job will go to the finish errorStrategy.

Thus, disabling the maxErrors directive by setting it to '-1' allows all failing subjobs with the specified error codes to be retried X times, with X set by maxRetries.

    Debugging Checklist for Errors

    • Check what version of dxpy was used to build the Nextflow pipeline and make sure it is the newest

    • Look at the head-node log (hopefully it was run with "debug mode" set to false, because when true, the log gets injected with details that aren't always useful and can make it hard to find errors)

      • Look for the process (sub-job) which caused the error, there will be a record of the error log from that process, though it may be truncated

    Resources

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

Some of the links on these pages will take the user to pages that are maintained by third parties. The accuracy and IP rights of the information on these third-party pages are the responsibility of those third parties.

    Cohort Combine

    Please note: in order to use Cohort Browser on the Platform, an Apollo License is needed.

    Cohort Combine Logic

    Cohort combine logic allows you to combine existing cohorts with Boolean Logic operations

    Overview

    Here is a summary of the functions for Cohort Combine

    Rules and Boundaries

    • All cohorts must be from the same dataset.

    • All cohorts must be saved before being combined.

    • A cohort that is a result of combine cannot be combined a second time.

    • Cohorts from different projects can be combined if they use the same underlying database.

    How To:

1. Add your cohorts into the cohort browser by selecting "Load Saved Cohort" if the cohort has already been created and saved into the project, or "New Cohort" if a new cohort needs to be created. You can select up to 10 cohorts to load into the side menu.

    2. Pick your cohort and add it to the browser. It will look like this.

    3. At the bottom of the cohort tab, select "Combine Cohorts".

    4. You will then have the following screen to combine. Pick your cohort combine logic, then select combine.

    Examples of Cohort Combine Functions

    Intersection

    Overview:

    Example:

    Union

    Overview:

    Example:

    Subtraction

    Important note: the order of the cohorts matters in this.

    Overview:

    Example 1:

    Example 2:

    Unique

    Overview:

    Example:

    Complement

    Important notes:

    • Cohort must be saved before creating its complement (same rule as previous)

    • A combined cohort (Intersection, Union, Subtraction, Unique) can be used to create a complement.

    • A cohort created as a complement cannot be further used for combine / complement.

    Overview:

    Example:

    Publishing Applets to Apps

    Why would you transition from an applet to an app?

    Applet

    App

    Purpose

    Early development, experiment with analyses

The applet is stable, ready to use, and possibly ready to be moved to a wider audience

    Differences between an App and Applet

    Checklist of items to keep in mind:

    When publishing an app, the following items are needed:

    1. A working applet that you have tested

    2. A name that is unique. Generally, the recommendation is to have an abbreviation for your org as part of the name. Example: If the org is named “academy_demos” and the app is for fastqc, then the name of the app could be “academy-fastqc”, “academy_demo-fastqc”, or “academydemo-fastqc”.

    3. Documentation to add to a README.md for users to understand what your app does

    4. Developer notes for you to keep track of version information and added to the Developer README.md

    To Publish the App

1. Use dx get applet-name to retrieve the most recent version of your applet's source code

    2. Make your changes to the dxapp.json

    3. Then use dx build app_name --publish --app
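Putting these steps together, a typical flow looks roughly like this (the applet ID and source directory name are placeholders):

# 1. Fetch the latest applet source into a local directory
dx get applet-xxxx

# 2. Edit dxapp.json in that directory (name, version, authorizedUsers, openSource, etc.)

# 3. Build and publish the app from the source directory
dx build --app --publish my_app_source_dir/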

    Helpful Trick

    • Forget to add users or need to add more users? Use:
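A sketch of the command, using the example org and app names from the checklist above (replace with your own app name and user/org IDs):

# Grant additional users or orgs permission to run the published app
dx add users academy-fastqc user-jsmith org-academy_demos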

    Resources

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select “Contact Support”

    3. Fill in the Subject and Message to submit a support ticket.

    Overview of the Platform User Interface

    Before you begin, set up a DNAnexus Platform account here: https://platform.dnanexus.com/login

    Basic Structure

    The Platform

    There are several ways to interact with the platform. All of these will be covered in future lessons/ courses/ documentation.

    We are going to be focused on the user interface (highlighted in green), also known as UI.

    Projects

This information can also be found in the Key Concepts section of the documentation for the Platform.

    First, what is a project?

    • It is a collaborative workspace

    • The smallest unit of sharing on the platform

    • A place to store objects that are made on the platform

      • Examples of these objects can be files, applets, and workflows

    Folder Usage

    • The user folder is the storage area for your output files

    • You can add more folders into your user folder for organization (maybe one for data, one for each project, etc.; this is however you and your organization/company want to do it)

    Status of a Data Object

    Data can be in one of 3 states

    • Open: initial, empty state, awaits upload

    • Closing: uploading, not instantaneous

    • Closed: Finalization completed, available for next steps

    Creating Folders

1. Log into the DNAnexus platform.

    2. When you login, you will see a list of projects that you are a part of.

    3. Navigating to a project

      1. We have prebuilt projects for you

    Copying Files

    • Copying means from one project to another project

    • You cannot copy a file within the same project, because it already exists there under the same file ID.

    Resources

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

    Running a JupyterLab Notebook

    Use Cases for a Single JupyterLab Instance

    • Running R in a regular notebook and performing downstream analysis

    • If directly interacting with the database/ dataset, it is recommended that you either 1) use Python and/ or 2) use Spark for extracting the data that is relevant for the downstream analysis

    General “Recipe” for Utilizing Single Instance JupyterLab Notebooks

1. Create a DX JupyterLab Notebook so that it will automatically save onto the Trusted Research Environment. You can do so by selecting one of these 2 options:

      a. Option 1 is from the Launcher:

      b. Option 2 is from the DNAnexus Tab:

    2. Start writing your JupyterLab Notebook. Select which kernel you are going to use (options will vary depending on the Image you selected in set up).

    3. Download packages and save the software environment as a snapshot.

      a. Download Packages

      b. Save the Snapshot of the environment
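A minimal sketch of the package-installation step (the package names here are just examples; the snapshot itself is then saved as described in the Utilizing Snapshot section):

# Python packages (in a notebook cell or terminal)
pip install pandas matplotlib

# R packages (from an R kernel)
# install.packages("ggplot2")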

    Opening Notebooks from Project Storage

    • Notebooks can also be directly opened from project storage

    • When you save in JupyterLab, the notebook gets uploaded to the platform as a new file. This goes back to the concept of immutability.

    • The old version of notebook goes into .Notebook_archive/ folder in project.

    Explorer Mode

    A license is required to access the Data Profiler on the DNAnexus Platform. For more information, please contact DNAnexus Sales (via [email protected]).

    A Note on Data:

    The data used in this section of Academy documentation can be found here to download: https://synthea.mitre.org/downloads

    The citation for this synthetic dataset is:

    Walonoski J, Klaus S, Granger E, Hall D, Gregorowicz A, Neyarapally G, Watson A, Eastman J. Synthea™ Novel coronavirus (COVID-19) model and synthetic data set. Intelligence-Based Medicine. 2020 Nov;1:100007. https://doi.org/10.1016/j.ibmed.2020.100007

    PygWalker and Explorer Mode

    PygWalker shows a sample of the dataset in a table format

PygWalker simplifies data analysis and visualization by transforming pandas dataframes into an interactive interface for easy exploration. It is available within the table-level view of the application. To use it, simply click the Go to Explorer Mode button to access the raw data view. You can learn more about its features by referring to the documentation or watching demo videos.*

    A custom plot created with PygWalker

    *DNAnexus is not responsible for the accuracy or updating of any 3rd party content or applications*

    Resources

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

    Job Failures

    If your nextflow run fails, the nextflow job log is written to your project Output location (CLI flag --destination) that you set for the applet at runtime.

    However, on failure, your results files in params.outdir are not written to the project, unless you are using the 'ignore' error strategy.

To guard against long-running or expensive (or both!) runs that give you no output when they fail, you need to think carefully about what should happen when your job fails and whether you need the ability to resume it. Resuming means that successfully completed processes won't be run again, saving you the cost and time of re-running already completed work.

    Nextflow has a resume feature that enables failed runs to be resumed, which can be used on DNAnexus.

    Caching the Nextflow workDir

    Table Level Screen

A license is required to access the Data Profiler on the DNAnexus Platform. For more information, please contact DNAnexus Sales (via [email protected]).

    A Note on Data:

The data used in this section of Academy documentation can be found here to download: https://synthea.mitre.org/downloads

    The citation for this synthetic dataset is:

Walonoski J, Klaus S, Granger E, Hall D, Gregorowicz A, Neyarapally G, Watson A, Eastman J. Synthea™ Novel coronavirus (COVID-19) model and synthetic data set. Intelligence-Based Medicine. 2020 Nov;1:100007. https://doi.org/10.1016/j.ibmed.2020.100007

    Branding JSON File

    Disclaimer: Portals require a license. These documents are to get you started with your portals. By no means is this the only way to make your portal, nor is this the only way to edit a json file.

    Overview of the branding.json file

    • this .json file will edit the branding portion of the portal.

    Cloud Computing for HPC Users

    HPC vs the DNAnexus Platform

    Component
    HPC
    DNAnexus Platform
    pip install ___ #python
    docker pull broadinstitute/gatk 
    docker save broadinstitute/gatk -o gatk.tar.gz
    dx upload gatk.tar.gz 
    {
    
    }
    {
     "_projects": null, #deletes the current list of projects 
     "_tools": [
      {"text": "Custom Menu Item", "url": "http://example.com"}, #creating a new item within tools 
      {"text": "Opens in New Tab", "url": "http://example.com", "newTab": true} #creating a new tab in tools 
     ],
     "_help": null, #removes help 
     "A New Menu": [
      {"text": "New Menu Item", "url": "http://example.com"}, #new menu 
     ],
     "A New Link": {"url": "http://example.com", "newTab": true} #new link 
    }
    Building Workflows
    ML
  • Stata

  • Update the Duration if desired

  • Add Commands to run in the JupyterLab environment (optional)

  • Finally, update the Feature. For a full list of packages in each feature, please look in the Preinstalled Packages List. The options are

    • Python_R

    • ML

    • IMAGE_PROCESSING

    • STATA

    • MONAI_ML

  • Spending Limit (optional)

  • Instance Type (change the default value if needed)

  • Try Jupyter
    Jupyter Architecture
    Content Community

Cohort combine operations are very complex queries.

  • Beware of performance delays and timeouts as the query gets more complex.

  • Use extra caution when:

  • Combining cohorts with genomic filters

  • Combining cohorts with complicated filters

  • Combining cohorts based on very large datasets

  • billable activities

  • shared apps

  • shared projects

  • are either allowed or not allowed to access

    • billable activities

    • shared apps

    • shared projects

    • is a single user on the platform

    • can be an org admin or an org member

    • they can also just be added to a project and NOT be a member of an org, but then they will not see pricing or have access to org-specific options.

    • holds one of 4 types of permissions to a project

  • could be to limit how the data is handled

  • Can be changed from all members to no one

  • Delete Access

    • Limit how the data is handled

    • Can be changed from Contributors and Admins to Admins only

  • Download Access

    • Limit who can see the data (this would allow accessing the data outside of the platform)

    • Can be changed from all members to no one

  • Org admins can define projects and project access

  • Introduce apps and app access

  • Finally, it's very easy to revoke permissions within an org. Say, for example, user C moves off the project or moves to a different institution. One of the two admins, admin A or admin D, can remove user C from the org.

    Looking at the permissions associated with each of these users: admin A and users B and C have access only to org, whereas admin D and user E have access to both org and org-new, and user F only has access to org-new.

  • Orgs are flexible tools for representing groups of users that can be used to simplify resource sharing, consolidate billing, and associate platform work with real-world billing structures.

  • Organization Membership Guide
    Orgs Documentation
    Org Management
    Full Documentation
    Look at the failed sub-job log
  • Look at the raw code

  • Look at the cached work directories

    • .command.run runs to setup the runtime environment

      • Including staging file

      • Setting up Docker

    • .command.sh is the translated script block of the process

      • Translated because input channels are rendered as actual locations

    • .command.log, .command.out etc are all logs

  • Look at logs with "debug mode" as true

  • when a subprocess returns an error, retry the process

    very detailed description of what happens for each errorStrategy
    sarek base.config
    See this github issue for more of an explanation
    Full Documentation

    None is present at the applet creation

    Each time the app is built, it must be given a new version.

A default spending account set up for yourself as the app author. Published apps require storage for their resources and assets, and that storage is billed monthly to the billing account of the original author of each app. You can set multiple authors, but the billing is tied to the original author.

  • Decide if you want the app to be open source. In dxapp.json, add a key called "openSource" with a boolean value of true or false.

  • A consistent version policy for your meaningful updates. DNAnexus suggests Semantic Versioning.

  • Add authorized users. In dxapp.json, add a key called "authorizedUsers" with a value being an array of strings, corresponding to the entities allowed to run the app. Entities are encoded as user-username for single users, or org-orgname for organizations.

  • Perks of Each

    Easy to collaborate, members of the project can edit the code, and publish

Once published, an app version cannot be modified (version control is enforced), and apps can carry assets in their own private container.

    Goal

Wrapping an executable into an application for increased efficiency in usage, plus the ability to edit code efficiently

    Wrapping an executable into an application for increased efficiency in usage, plus enhanced reproducibility and minimized risk

    Applets

    Apps

    Location

    in projects

    in the Tool Library, if you are the developer or an authorized user

    Naming Structure

    project:/applet_ID

    project:/folder/name

    app-name

    Can they be shared?

    Through projects, as a data object

    App developer manages a list of users authorized to access the app

    Updating

    Deleting the previous applet with the same name, and creating a new one

    New version per release

    Transitioning from Applets to Apps
    App Metadata

    Versioning

    Start writing your code.
a. Load Packages

    b. Download or Access data files to the JupyterLab environment

    c. Import the data

    d. Then, perform the analysis for your data

    e. Upload results back to Project Space

  • Save your DX Jupyterlab Notebook

  • import dxdata
    import pprint
    import pyspark
    from pyspark.sql import functions as F
    dx extract_dataset dataset_id -ddd --delimiter 
    sc = pyspark.SparkContext()
    spark = pyspark.sql.SparkSession(sc)
    %%bash 
    dx upload FILE --destination /your/path/for/results
        // memory errors which should be retried. otherwise error out
        errorStrategy = { task.exitStatus in ((130..145) + 104) ? 'retry' : 'finish' }
        maxRetries    = 1
        maxErrors     = '-1'
dx add users APP_NAME_OR_ID USER_OR_ORG
    pip install ___ #python
    install.packages() #R
    import ____ #python
    library() #R
    %%bash 
    #option 1: dx download 
    dx download "PATH TO FILE"
    
    #option 2: dx fuse 
    data = pd.read_csv("/mnt/project/PATH.csv")
    import ___ as pd 
    NAME = pd.read_csv("PATH.csv")
    %%bash 
    dx upload FILE --destination /your/path/for/results

    Notice, you will automatically return to the Info tab for that version.

    You will have a review step. This is to review the content as well as add additional parameters such as a spending limit.

    Batch analysis
    User Interface QuickStart Guide
    Tool Library List
    Full Documentation

    A place to contain details of running jobs/ analyses and their results

    In your project space, select "DNAnexus Academy 101"

  • Navigate to the users folder and use Add > New Folder

  • Project Section of the Documentation
    DNAnexus platform
    Key Concepts
    User Interface QuickStart Guide
    Full Documentation
    To be able to resume a run that failed you need to set preserve_cache to true for the initial run. This will cache the nextflow workDir of the run in your project on platform in a folder called .nextflow_cache_db/<session_id>/.

    The session ID is a unique ID given to each (non-resumed) Nextflow run. Resumed Nextflow runs will share the same session ID as the run that they are resuming since they are using the same cache.

    The cache is the nextflow workDir which is where nextflow stores each tasks files during runs. By default when you run a nextflow applet, preserve_cache is set to false. In this state, if the applet fails you will not have the ability to resume the run and you are not able to see the contents of the work directory in your project.

    To turn on preserve_cache for a run add -ipreserve_cache=true to your run command.

    In the UI, scroll to the bottom of the Nextflow run setup screen

    So if you are running a job and think there is a chance that you might want to resume it if it fails, then turn on preserve_cache.

    Note that if you terminate a job manually i.e., using the terminate button in the UI or with dx terminate the cache will not be preserved and you will not be able to resume the run even if preserve_cache was set to true for the run. The same applies if a job is terminated due to a job cost limit being exceeded. Essentially, if it is not the DNAnexus executor terminating the run, then the cache is not preserved and so resuming the run is not possible.

    Cache limits

You can store up to 20 caches in a project, and a cache will be stored for a maximum of 6 months. Once that limit has been reached, you will get a failure if you try to run another job with preserve_cache switched on. In practice, you should regularly delete your cache folders once you have had successful runs and no longer need them, to save on storage costs.

    Resuming a run

    You can make changes to the Nextflow applet, dx build it again and/or make changes to the run inputs before resuming a run.

    When you resume a run in the CLI using the session ID, the run will resume from what is cached for the session id on the project.

    Only one Nextflow job with the same session ID can run at any time.

When resume is assigned 'true' or 'last', the run will determine the session ID that corresponds to the latest valid execution in the current project and resume the run from it.

To set up the sarek command to preserve the cache, use a command like the one below.
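A sketch only; the applet/app ID, destination, and the pipeline's own inputs are placeholders, while preserve_cache is the input described above:

# add the pipeline's own -i inputs (e.g. the input samplesheet and genome) as needed
dx run applet-xxxx \
    -ipreserve_cache=true \
    --destination project-xxxx:/sarek_results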

    To resume a sarek run and preserve updates to the cache from the new run (which will allow further resumes in case this resumed run fails) use the code below:
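A sketch of the resume command (the session ID comes from the failed run's properties, as described below; other placeholders are as in the previous sketch):

dx run applet-xxxx \
    -ipreserve_cache=true \
    -iresume=<session-id> \
    --destination project-xxxx:/sarek_results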

    To get the session-id of a run, click the run in the monitor tab of your project and scroll down to the bottom of the page. On the bottom right you should see the session ID in the 'Properties' section

    If you know your job ID, you can also use that to get the session ID on the CLI using
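dx describe job-xxxx    # the session ID appears in the job's Properties section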

    Debugging Checklist for Errors

    • Check what version of dxpy was used to build the Nextflow pipeline and make sure it is the newest

    • Look at the head-node log (hopefully it was run with "debug mode" set to false, because when true, the log gets injected with details that aren't always useful and can make it hard to find errors)

      • Look for the process (sub-job) which caused the error, there will be a record of the error log from that process, though it may be truncated

    • Look at the failed sub-job log

    • Look at the raw code

    • Look at the cached work directories

      • .command.run runs to setup the runtime environment

        • Including staging file

        • Setting up Docker

    • Look at logs with "debug mode" as true

    Resources

    Full Documentation

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

    Table Level Screen

    The Table-level screen appears when the user selects one particular table in the Navigator.

    Table-level Screen of a table in Data Profiler

    Table Overview

    Overview details on the header of the Table-level screen

    On the header of the Table-level screen, the user can find overall statistics on the selected table, that include:

    • Table size: number of rows and columns of the table

    • Missing rate: the rate of empty cells in the table

    • Duplicate rate: the rate of duplication of an entire row in the table

    Composition of Column Types

    Pie chart of Column types on the header of the Table-level screen

    The pie chart shows the composition of column types in the table. The size of each part of the pie is determined by the number of columns of that type. The user can also hover on the chart to get the count value.

    Table-level charts

    Table-level screen has a Controller section that configures the visualization in the Chart area

    The main function of the Table-level Screen is the Chart Area, which is controlled by a Controller in the top right corner of the screen. There are 2 main types of visualizations: Completeness and Column Profiles.

    Completeness

    Completeness is the default mode of the Table-level screen. It aims to provide an overview on the count/rate of non-null values in a table. Completeness has 2 options: One-way view and Two-way view

    One-way View: Bar chart

    One-way view in Table-level screen

    One-way view is a stacked bar chart that displays the percentage of missing values, non-duplicates, and duplicates for each column in the table. You can click on the Legend/Key to show or hide specific statistics on the chart. Hover over each column to view detailed statistics.

    Two-way View: Heat map

    Two-way view in Table-level screen

Two-way view is a heat map showing data completeness for all columns in the table. The Y-axis of the heatmap is the columns of the table. The X-axis of the heatmap is the unique values of the group-by column. The value of the heatmap shows how many entities of the table (a count in Raw count mode, or a percentage in Percentage mode) have non-null values in the column (y-axis) with respect to the value of the group-by column (x-axis). The user can choose another column as the grouping factor; each unique value in this Group-by column becomes a column in the heat map. Only categorical columns which have a maximum of 30 unique values will show up as options.

    The Controller of Two-way view

    The numbers in the heat map can be configured in two ways:

    • Raw count displays the exact number of values available in each column.

    • Percentage shows the completeness statistic as a percentage. The completeness statistic ranges from 0 to 100, where 0 means the data is completely missing, and 100 indicates that the data is 100% complete.

    Two-way View: Heat map, cross-table analysis

The user can also join the current table with another table using the Join with table option. By joining with another table, the user can use a column from that table as the Group-by column.

    FAQs

    Question: Can I use the Two-way View to check how many female patients have sequencing data?

Answer: Yes. Assume that your question involves two metadata fields: patient_sex (from the patient table) and sequencing_run_id (from the sequencing table), and that the patient and sequencing tables are join-able by patient_id. If that is the case, you can open the patient table with the Two-way View, join it with the sequencing table, and choose patient_sex as the Group-by column. On the sequencing.sequencing_run_id row, you can see the completeness rate broken down by each sex in patient_sex.

    The heatmap options controller when doing cross-table analysis. We are joining "patients" table into the "observations" table

Completeness heatmap in the case of cross-table analysis. In this example, the main table is "patients" and the joined table is "observations". This heatmap shows how many patients have available data (non-null values) in each field, broken down by patient race: white, black, asian, native, or other.

    Column Profiles

    Column Profiles mode shows each column as a tile. The chart type depends on the type of the column.

    This screen provides detailed statistics and distribution charts for the columns in the table. For all column types, it displays the missing rate and the duplication rate.

    For columns containing string data, it shows the number of unique values and the value frequency, which is represented in a distribution chart.

    For columns containing float data, the screen provides information about the variance, standard deviation, and the value range frequency, which is displayed in a distribution chart. Additionally, a box plot is shown, illustrating the maximum value, Q3 (upper quartile), median, Q1 (lower quartile), and the minimum value.

    For columns containing datetime data, the screen displays the variance, standard deviation, and value range frequency on a distribution chart. A box plot is also provided, showing the maximum value, Q3 (upper quartile), median, Q1 (lower quartile), and the minimum value.

    Resources

    Full Documentation

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

    [email protected]
    https://synthea.mitre.org/downloads
    https://doi.org/10.1016/j.ibmed.2020.100007

    You can add different sections, links, projects, etc into the json file

    If you have questions about how to use a json file, please view this section

    Overview of the Sections of a portal and matching json files:

    Example of the branding.json file

    This is the file to create what you see above

    Other Sections in your home.json file

    Header:

    Other parameters to the header section

    Login (optional):

    Register (optional):

    Other Parameters (optional):

    Resources

    Portal Documentation

    Full Documentation

    Please email [email protected] to create a support ticket if there are technical issues.

Portable Batch System (PBS) or SLURM

    dx-toolkit

    Worker

    Requested from pool of machines in private cluster

    requested from pool of machines in AWS/ Azure

    Shared Storage

    Shared file system for all nodes (Lustre, GPFS, etc)

    Project storage (Amazon S3/ Azure storage)

    Worker File I/O

    Handled by Shared file system

needs to be transferred to and from project storage by commands on the worker

    Key Players with an HPC

    • With an HPC, there is a collection of specialized hardware, including mainframe computers, as well as a distributed processing software framework so that the incredibly large computer system can handle massive amounts of data and processing at high speeds.

    • The goal of an HPC is to have the files on the hardware and to also do the analysis on it. In this way, it is similar to a local computer, but with more specialty hardware and software to have more data and processing power.

    • Your computer: this communicates with the HPC cluster for resources

    • HPC Cluster

      • Shared Storage: common area for where files are stored. You may have directories branching out by users or in another format

      • Head Node: manages the workers and the shared storage

      • HPC Worker: is where we do our computation and is part of the HPC cluster.

    • These work together to increase processing power and to have jobs and queues so that when the amount of workers that are needed are available, the jobs can run.

    Key Players in Cloud Computing

    • In comparison, cloud computing adds layers into analysis to increase computational power and storage.

    • This relationship and the layers involved are in the figure below:

    • Let's contrast this with processing a file on the DNAnexus platform.

      • We'll start with our computer, the DNAnexus platform, and a file from project storage.

      • We first use the dx run command, requesting to run an app on a file in project storage. This request is then sent to the platform, and an appropriate worker from the pool of workers is made available.

      • When the worker is available, we can transfer a file from the project to the worker.

      • The platform handles installing the app and its software environment to the worker as well.

      • Once our app is ready and our file is set, we can run the computation on the worker.

      • Any files that we generate must be transferred back into project storage.

    Key Differences

    • HPC jobs are limited by how many workers are physically present on the HPC.

    • Cloud computing requests workers on demand from a much larger shared pool, so jobs typically spend less time waiting in a queue.

    Transferring Files

    • One common barrier is getting our files onto the worker from project storage, and then doing computations with them on the worker. The last barrier we'll review is getting the file outputs we've generated from the worker back into the project storage.

    • Cloud computing is nested (your computer talks to the platform, which in turn talks to workers), and transferring files between these layers can make it difficult to learn.

    • A mental model of how cloud computing works can help us overcome these barriers.

    Resolution:

    • Cloud computing is indirect, and you need to think 2 steps ahead.

    • Here is the visual for thinking about the steps for file management:
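    To make those steps concrete, here is a minimal sketch of the three file-management hops from the command line. The file names, the example app, and the output file name are placeholders; substitute your own files and app.

    # 1. Move the input from your computer into project storage
    $ dx upload reads.fastq.gz

    # 2. Request a worker and run an app on the stored file; the platform
    #    transfers the file to the worker, and the app's outputs are
    #    transferred back into project storage when the job finishes
    $ dx run app-fastqc -ireads=reads.fastq.gz -y --wait

    # 3. Copy a result from project storage back to your computer (if needed)
    $ dx download reads_fastqc_report.html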

    Running apps

    Creating apps and running them is covered later in the documentation.

    Apps serve to (at minimum):

    1. Request an EC2/Azure worker

    2. Configure the worker's environment

    3. Establish data transfer

    Why do this with DNAnexus?

    • Highly secure platform with built-in compliance infrastructure

    • Fully configurable platform

      • Users can run anything from single scripts to fully automated, production-level workflows

    • Data transfer designed to be fast and efficient

      • Read and analyze massive files directly using dxfuse

    • Instances are configured for you via apps

      • Variety of ways to configure your own environments

    • Access to the wealth of AWS/Azure resources

      • Largest Azure instances: ~4 TB RAM

      • Largest AWS instances: ~2 TB RAM
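    For example, the default instance type chosen by an app can be overridden at launch time with the --instance-type flag. The app name, input, and instance type below are placeholders; check dx run --help for the full list of options.

    $ dx run app-fastqc -ireads=file-xxxx \
        --instance-type mem1_ssd1_v2_x8 \
        -y --watch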

    Equivalent Commands

| Task | dx-toolkit | PBS | SLURM |
| --- | --- | --- | --- |
| Run Job | `dx run <app-id> <script>` | `qsub <script>` | `sbatch <script>` |
| Monitor Job | `dx find jobs` | `qstat` | `squeue` |
| Kill Job | `dx terminate <jobid>` | `qdel <jobid>` | `scancel <jobid>` |

    Practical Approaches

    • Single Job

      • Use `dx run` on the CLI directly

      • Use `dx run` in a shell script

    • Batch Processing

      • Use a shell script to use `dx run` on multiple files

      • Use dxFUSE to directly access files (read only)

      • Use dx generate_batch_inputs / dx run --batch-tsv
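    As a sketch of the shell-script approach (the file pattern, app, and input name are placeholders), a loop over dx find data can launch one job per matching file:

    # Launch one job per FASTQ file found in the current project
    $ dx find data --name "*.fastq.gz" --brief | while read -r file_id; do
          dx run app-fastqc -ireads="$file_id" -y --brief
      done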

    Batch Processing Comparisons

| Component | HPC Recipe | Cloud Recipe |
| --- | --- | --- |
| 1 | List files | List files |
| 2 | Request 1 worker/file | Use a loop for each file: 1) use dx run, 2) transfer the file, and 3) run commands |
| 3 | Use array IDs to process 1 file/worker | |
| 4 | Submit job to head node | |

    Resources

    Full Documentation

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.


    Advanced CLI

    Batch Runs in CLI

    Overview of DX Commands

    There are about 100 dx commands, which you can find by executing dx help all:

    • add: Add one or more items to a list

    • add developers: Add developers for an app

    • add member: Grant a user membership to an org

    Review

    You are now able to:

    • Describe how to use metadata and the dx find data command on the CLI

    • Create and use batch file processing using the CLI

    • Describe the use cases that warrant the Cloud Workstation

    Resources

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

    Example 4: cnvkit

    There is an existing public Docker image available for CNVkit ("etal/cnvkit:latest"), so another option is to build a WDL version that will download and use this image at runtime rather than installing the Python and R modules ourselves.

    In this example, you will:

    • Use WDL and Docker to build the CNVkit workflow

    Getting Started

    To start, create a new directory called cnvkit_wdl parallel to the bash directory. Inside this new directory, create the file workflow.wdl with the following contents:

    Next, ensure you have a working Java installation and then download the latest dxCompiler JAR file. You can use the following command to place the 2.10.3 release into your home directory:

    Use dxCompiler to turn workflow.wdl into an applet equivalent to the bash version. In the following command, the workflow and all related applets will be placed into a workflows directory in the given project to keep all this neatly contained. The project ID project-GFf2Bq8054J0v8kY8zJ1FGQF is the caris_cnvkit project, so change this if you wish to place the workflow into a different project. Note the use of the -archive option to archive any existing version of the applet and allow the new version to take precedence, and the -reorg option to reorganize the output files. As shown in the following command, successful compilation will result in printing the new workflow's ID:

    Run the new workflow with the -h|--help flag to verify the inputs:

    As with the bash version, you can launch the workflow from the CLI as follows:

    The resulting output will show the JSON you can alternatively use to launch the job:

    Following is the command you can use to launch the workflow from the CLI with the JSON file:

    As before, you can use the web interface to monitor the progress of the workflow and inspect the outputs.

    Saving a Docker Image

    Run the following command to start a new cloud workstation:

    From the cloud workstation, pull the CNVkit Docker image:

    Save and compress the image to a file:

    Add the tarball to the project:

    Update the WDL to use the tarball:

    Build the app and run it.

    Review

    In this chapter, you learned another strategy for packaging an applet's dependencies using Docker and then running the applet's code inside the Docker image using WDL.

    Resources

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

    Example 2: fastq_quality_trimmer

    In this chapter, you'll learn to create an applet that uses the fastq_quality_trimmer executable from the FASTX-Toolkit collection of command-line tools for processing short-read FASTA and FASTQ files. You'll use the applet to run FastQTrimmer on a FASTQ file, creating a trimmed reads file that you can then use for further analysis.

    You will learn the following:

    • How to accept an optional integer argument from the user

    • How to add resource files to an applet such as a binary executable that can be used in your applet code

    Starting the Applet

    Run dx-app-wizard mytrimmer to create the mytrimmer applet. You have already provided the applet name on the command line, so you can press Enter when prompted for it. You can also add a title, summary, and version if you would like.

    Start the input specification with the input FASTQ:

    Next, indicate an optional integer for the quality score:

    Press Enter to skip a third input and move to the output specification, which should define a single output file:

    Press enter to exit the output section.

    Set a timeout policy if you would like.

    Answer the remaining questions to create a bash applet. The applet does not need access to the internet or parent project, and you can choose the default instance type.

    Open the mytrimmer/dxapp.json in a text editor to view the inputSpec:

    To make input file selection more convenient for the user, edit the patterns for the file extensions of the input_file to be those commonly used for FASTQ files:

    These patterns are used in the web interface to filter files for the user, but it's not a requirement that the input files match these patterns. The file filter can be turned off by the user, so these patterns are merely suggestions.

    Adding a Binary Resource

    Next, you will add a binary executable file from the FASTX toolkit. Download and unpack the FASTX toolkit binaries:

    Then build the executable by running make against the included Makefile. This will create your executable.

    The files (FASTX.zip) are also available to download and unpack:

    Create the directory resources/usr/bin inside the mytrimmer directory:

    When the app is bundled, the directory structure in the resources directory will be compressed and unpacked as is on the instance, so you should create a directory that is in the standard $PATH such as /usr/bin or /usr/local/bin.

    This applet only requires the fastq_quality_trimmer binary, so copy it to the preceding directory:

    You should remove the downloaded binary artefacts as they are no longer needed.

    Writing the Applet

    Update mytrimmer/src/mytrimmer.sh with the following code:

    • The variables $input_file and $input_file_name are based on the inputSpec name input_file. The first is a record-like string {"$dnanexus_link": "file-GJ2k2V80vx88z3zyJbVXZj3G"}, while the latter is the filename small-celegans-sample.fastq.

    • The variable $input_file_prefix is the name of the input file without the file extension, so small-celegans-sample, which is used to create the output filename small-celegans-sample.filtered.fastq. See the documentation.

    • Run fastq_quality_trimmer using the given $quality_score and write to the output filename. The -Q option is an undocumented option to indicate that the scores are in phred 33.

    • Upload the output file, which returns another record-like string describing the newly created file.

    • Add the newly uploaded record as a file output of the job.

    You don't need to indicate the full path to fastq_quality_trimmer because it will exist in the directory /usr/bin, which is in the standard $PATH.

    Creating a Project for the Data and Applet

    Add the sample FASTQ file to the project either by using the URL importer as shown in Figure 6, or download the file to your computer and upload through the web interface or using dx upload:

    Use dx build to build the applet:

    Run the applet with the -h|--help flag from the CLI to see the usage:

    Run the applet using the file ID of the FASTQ file you uploaded:

    The job's output should end with something like the following:

    You can select the output file and view the results.

    You can download the output file and check that the filtering actually removed some of the input sequences by using wc to count the original file and the result:

    Run the applet with a higher quality score and verify that the result includes even fewer sequences.
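    For example, reusing the applet and file IDs from this chapter's example run, you could pass a stricter threshold than the default of 30:

    $ dx run applet-GJ2k5780vx804FPyBbxqpQQ0 \
        -iinput_file=file-GJ2k2V80vx88z3zyJbVXZj3G \
        -iquality_score=35 -y --watch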

    Review

    In this chapter, you learned how to do the following:

    • Indicate an optional argument with a default value

    • Add a binary executable to a project in the resources directory and use that binary in your applet

    • Use variations on the input file variables to get the full filename or the filename prefix without the extension

    Resources

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

    Utilizing Data Profiler Navigator

    A license is required to access the Data Profiler on the DNAnexus Platform. For more information, please contact DNAnexus Sales (via [email protected]).

    A Note on Data:

    The data used in this section of Academy documentation can be found here to download: https://synthea.mitre.org/downloads

    The citation for this synthetic dataset is:

    Walonoski J, Klaus S, Granger E, Hall D, Gregorowicz A, Neyarapally G, Watson A, Eastman J. Synthea™ Novel coronavirus (COVID-19) model and synthetic data set. Intelligence-Based Medicine. 2020 Nov;1:100007. https://doi.org/10.1016/j.ibmed.2020.100007

    How to Navigate in Data Profiler

    Overall about Navigation

    Data Profiler helps the user explore different levels of a dataset. There are 3 levels of a dataset in Data Profiler:

    • Dataset level: Show relationships between tables in the dataset and overview of all tables, columns in the dataset

    • Table level: Show statistics of one particular table. It can also join with another table to create a joint profile.

    • Column level: Show statistics of one particular column of a table. It can also combine with other columns in the same table to create a joint profile.

    To navigate between these 3 levels, the user can select from a navigator on the left side of the application. Once an option of the navigator is selected, the content of the main interface will change accordingly.

    The user interface of Data Profiler consists of a navigator (left, highlighted in blue), which controls the content of the main section (right, highlighted in green).

    Navigator

    Navigator controls the content on the main section of Data Profiler. The main component of the Navigator is a hierarchical structure of the dataset, called Data Hierarchy

    All Tables

    The top level of a Data Hierarchy is All Tables, indicating the dataset level. This level is selected by default.

    Under All Tables are individual tables in the dataset. Each table has a number on the far right indicating the number of columns in the table.

    Data Hierarchy

    Once a table is selected, the Data Hierarchy will show all columns from that table. Each column has a colored tag indicating the column type.

    Searching for Columns

    Above the Data Hierarchy, the user can search for one or more columns. The Data Hierarchy will show tables that have at least one of the column names in the search list (OR logic).

    Explorer Mode

    At the bottom of the Navigator, the user can switch to an Explorer Mode to create charts on their own. The functionality of this mode is discussed in another section of this document.

    The 📜 button shows the Inference Logs Screen, which shows details on the profiling process. This feature is in development.

    Column Types

    The type of a column in Data Profiler can be specified in a data_dictionary. If that information is not available, Data Profiler will infer the column type based on the content of the column.

    In Data Profiler, there are 4 column types. These types are consistent with the data types used by the Data Model Loader on the DNAnexus platform.

    Null (or empty) values are allowed in all column types and they do not affect how a column type is determined.

    FAQs about Columns

    • In my data_dictionary, the type of column A is “integer”. After loading with Data Profiler, the application says column A is a “string” column. What happened?

    • There is at least one non-null arbitrary value in column A that cannot be cast to an integer. Therefore, the Data Profiler falls back to “string”.

    Resources

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select “Contact Support”

    3. Fill in the Subject and Message to submit a support ticket.

    Overview of Nextflow

    How Nextflow Works Locally

    Nextflow pipelines are composed of processes: for example, a task such as fastqc would be one process, read trimming would be another, and so on. Processes pass files between them using channels (queues), so every process usually has an input and an output channel. Nextflow is implicitly parallel: if it can run something in parallel, it will! There is no need to loop over channels.

    For example, you could have a script with fastqc and read_trimming processes that both take in a fastq reads channel. As these two processes have no links between them, they will run at the same time.

    The Nextflow workflow file is called main.nf.

    Let's think about a quick workflow that takes in some single-end fastq files, runs fastqc on them, then trims them, runs fastqc again, and finally runs multiqc on the fastqc outputs.

    An example of code that would achieve the workflow in the image (not showing what each process script looks like here)

    An example local run (not on or interacting with DNAnexus) would look like the command below. This assumes you have Nextflow on your own local machine, which is not required for DNAnexus

    As we gave --fastq_dir a default, if your inputs match that default you could just run

    How Nextflow works on DNAnexus

    DNAnexus has developed a version of the Nextflow executor that can orchestrate Nextflow runs on the DNAnexus platform.

    Once you kick off a Nextflow run, a Nextflow 'head node' is spun up. This stays on for the duration of the run, and it spins up and controls the subjobs (each instance of a process).

    Head Node

    • orchestrates subjobs

    • contains the Nextflow output directory which is usually specified by params.outdir in nfcore pipelines

    • copies the output directory to the DNAnexus project once all subjobs have completed (--destination)

    Subjobs

    • one for every instance of a process

    • each subjob is one virtual machine (instance) e.g., fastqc_process(fileA) is run on one machine and fastqc_process(fileB) is run on a different machine

    • Uses a Docker image for the process environment

    • Required files pulled onto machine and outputs sent back to head node once subjob completed

    Work Directory

    • Nextflow uses a 'work' directory (workDir) for executing tasks. Each instance of a process gets its own folder in the work directory, and this directory stores task execution info, intermediate files, etc.

      • Task execution status, temp files, and the stdout/stderr logs are sent to the work directory: .command.sh is the translated script block of the process (translated because input channels are rendered as actual file locations), while .command.log, .command.out, etc. are all logs.

    • Depending on whether or not you choose to cache your work directory, you will be able to see this work directory on the platform during/after your Nextflow run.

    • Otherwise, the work directory exists in a temporary workspace and it will be destroyed once a run has completed.

    Note about Batch Processing

    You may have learned about batching some inputs for WDL workflows previously. You do not need to do this for Nextflow applets; all parallelisation is done automatically by Nextflow.

    Resources

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

    Some of the links on these pages will take the user to pages that are maintained by third parties. The accuracy and IP rights of the information on those third-party pages are the responsibility of those third parties.

    Column Level Screen

    A license is required to access the Data Profiler on the DNAnexus Platform. For more information, please contact DNAnexus Sales (via [email protected]).

    A Note on Data:

    The data used in this section of Academy documentation can be found here to download: https://synthea.mitre.org/downloads

    The citation for this synthetic dataset is:

    Walonoski J, Klaus S, Granger E, Hall D, Gregorowicz A, Neyarapally G, Watson A, Eastman J. Synthea™ Novel coronavirus (COVID-19) model and synthetic data set. Intelligence-Based Medicine. 2020 Nov;1:100007. https://doi.org/10.1016/j.ibmed.2020.100007

    String Column

    Column-level screen shows a string column

    For columns containing string data, the data profiler will display several statistics and charts to help analyze the data.

    The statistics include:

    • The missing rate, expressed as a percentage of the missing values in the column.

    • The number of unique values present in the column.

    The charts provided include:

    • Top Records Bar Chart: This chart displays the top values that occur most frequently in the column. You can select how many top records to display using a dropdown list. By hovering over the bars, you can see the exact count of records for each value.

    • Character Length Distribution Chart: This chart shows how the lengths of the strings are distributed. By hovering over different parts of the chart, you can view the range of character lengths and how frequently each range occurs. In addition, the average length of the strings in the column and the standard deviation (which measures the amount of variation in the string lengths) are also reported.

    • Boxplot: The boxplot provides a visual summary of the data in terms of its distribution, showing the maximum value, Q3 (upper quartile), median, Q1 (lower quartile), and the minimum value.

    • Grouping Frequency Chart: This chart displays how often unique values in the current column occur when grouped with values from another column. You can choose the column to group by using a dropdown list.

    Float & Integer

    Column-level screen shows a float column

    For columns containing float data, the data profiler provides several statistics and charts to help analyze the data.

    The statistics include:

    • The missing rate, displayed as a percentage of missing values.

    • The standard deviation, which measures the spread of the data values.

    • The Interquartile range, which measures the difference between the 75th and 25th percentiles of the data.

    The charts provided include:

    • Distribution Chart: This chart displays the distribution of values in the column. You can hover over the chart to view the range of values and their frequencies.

    • Boxplot: The boxplot visually represents the distribution of the data, showing the maximum value, Q3 (upper quartile), median, Q1 (lower quartile), and the minimum value.

    • Grouping Frequency Chart (Two way plot): This chart shows the frequency of unique values in the current column, grouped with values from another column. You can select the column for grouping from a dropdown list.

    Datetime

    Column-level screen shows a datetime column

    For columns containing datetime data, the data profiler provides several statistics and charts for in-depth analysis.

    The statistics include:

    • The missing rate, displayed as a percentage of missing values.

    • The standard deviation, measuring the dispersion of the datetime values.

    • The Mode, showing the mode/format of the datetime data in the column.

    The charts provided include:

    • Distribution Chart: This chart shows the distribution of datetime values in the column. You can hover over the chart to view the range of values and their frequencies.

    • Boxplot: The boxplot visually represents the distribution of the datetime data, displaying the maximum value, Q3 (upper quartile), median, Q1 (lower quartile), and the minimum value.

    • Radar Chart: This chart displays the frequency of values, grouped by year, month, or day. You can change the grouping option using the dropdown at the top.

    • Grouping Frequency Chart (Two Way Plot): This chart shows the frequency of unique datetime values in the current column, grouped with values from another column. You can select the column for grouping from a dropdown list.

    Pairwise plot between columns

    Even though each column type has a different layout on the Column-level Screen, Pairwise plot between columns is a common component.

    The user can create a plot between the current column and any other column from the same table. However, not all columns are available for this feature. Data Profiler will show columns that satisfy the following conditions:

    • Not a string column

    • If it is a string column:

      • Not a primary key

      • The number of unique values is no larger than 30

    Resources

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

    JSON on the Platform

    Be sure to install jq.

    Background

    JavaScript Object Notation (JSON) is a data exchange format designed to be easy for humans and machines to read. You will encounter JSON in several places on the DNAnexus platform, such as when you create and edit native applets and workflows. As shown in Figure 1, JSON is used to communicate with the DNAnexus Application Programming Interface (API). Understanding the responses from the API will help you debug applets, find failed jobs, and relaunch analyses.

    Dataset Level Screen

    A license is required to access the Data Profiler on the DNAnexus Platform. For more information, please contact DNAnexus Sales (via ).

    A Note on Data:

    The data used in this section of Academy documentation can be found here to download:

    The citation for this synthetic dataset is:

    Walonoski J, Klaus S, Granger E, Hall D, Gregorowicz A, Neyarapally G, Watson A, Eastman J. Synthea™ Novel coronavirus (COVID-19) model and synthetic data set. Intelligence-Based Medicine. 2020 Nov;1:100007.

    Introduction to Datasets

    Please note: in order to use the Cohort Browser on the Platform, an Apollo License is required.

    Disclaimers for the Training Dataset

    dx run applet-xxxx -ipreserve_cache=true
    dx run applet-xxxx -iresume='session-id'
    dx run applet-xxxx -iresume='last'
    dx run applet-xxxx -iresume=true
    dx run sarek_v3.4.0_ui -ioutdir='./test_run_cli_qs_ch' -ipreserve_cache=true -inextflow_run_opts='-profile test,docker -queue-size 20' --destination 'project-ID:/USERS/FOLDERNAME' 
    dx run sarek_v3.4.0_ui -ioutdir='./test_run_cli_qs_ch' -ipreserve_cache=true -iresume='last' -inextflow_run_opts='-profile test,docker -queue-size 20' --destination 'project-ID:/USERS/FOLDERNAME' 
    dx describe job-ID --json | jq -r .properties.nextflow_session_id
    #ID
    {
     "header": {
    "logo": "#logo_header.png", 
       "logoOpensNewTab": true, 
       "hideCommunitySwitch": true,
       "colors": {
         "background": "#EEEEEE", 
         "border": "#EEEEEE",
         "text": "#000000"  
       }
    }, 
    "homeURL": "http://academy.dnanexus.com" 
    }
    {
     "header": {
       "logo": "#logo_header.png",  #image for the logo; has to be an appropriate size. min 15x15px, max 50x30px
       "logoOpensNewTab": true,  #opens new tab if you select the logo 
       "hideCommunitySwitch": true,
       "colors": {
         "background": "#123456", #background color for the header 
         "border": "#123456", #border color for the header
         "text": "#123456", #text color
       }
       }
     "header": {
       "colors": {
         "hoverBackground": "#123456", #hover background color 
         "userColors": ["#123456", "#234567", "#345678"], #user colors 
         "button": {"success": {"border-color": "green", "background":
          "pink", "hover": {"background": "dusk"}}} #setting colors for buttons or hover selections
       }
    "login": {
       "logo": "#logo_login.png", #image for login 
       "text": "# ADD TEXT IN MARKDOWN FORMAT HERE.",
       "colors": {
         "loginButton": "#123456" #set color for login button here 
       }
    "register": {
       "disable": true,
       "logo": "#logo_register.png", #image for registering 
       "text": "#ADD TEXT IN MARKDOWN FORMAT HERE.",
       "agreeToText": "Plain text you need to agree to before registering", #plain text, string 
       "region": "aws:us-east-1",
       "colors": {
         "registerButton": "#123456" #color for register button 
       }
    "homeURL": "http://example.com", #url for logo 
     "supportURL": "http://example.com/support", #support URL 
     "hideCommunitySwitch": true,
     "description": "A short description of two or three lines for the community selector" #description for the community 
    # Generate batch file by regex
    
    $ dx generate_batch_inputs -iinput_fwd='(.*)_R1_001.fastq.gz' -iinput_rev='(.*)_R2_001.fastq.gz'
    
    # Show the local file
    $ cat dx_batch.0000.tsv
    
    # Use the local batch file
    $ dx run fastp --batch-tsv dx_batch.0000.tsv -iadapter_fa=/data/adapters.fa -iprefix='Sample1'
  • add stage: Add a stage to a workflow
  • add users: Add authorized users for an app

  • add_types: Add types to a data object

  • api: Call an API method

  • archive: Requests for the specified set of files or for the files in a single specified folder in one project to be archived on the platform

  • build: Create a new applet/app, or a workflow

  • build_asset: Build an asset bundle

  • cat: Print file(s) to stdout

  • cd: Change the current working directory

  • clearenv: Clears all environment variables set by dx

  • close: Close data object(s)

  • cp: Copy objects and/or folders between different projects

  • describe: Describe a remote object

  • download: Download file(s)

  • env: Print all environment variables in use

  • exit: Exit out of the interactive shell

  • extract_dataset: Retrieves the data or generates SQL to retrieve the data from a dataset or cohort for a set of entity.fields. Additionally, the dataset's dictionary can be extracted independently or in conjunction with data. Listing options enable enumeration of the entities and their respective fields in the dataset.

  • find analyses: List analyses in the current project

  • find apps: List available apps

  • find data: List data objects in the current project

  • find executions: List executions (jobs and analyses) in the current project

  • find globalworkflows: List available global workflows

  • find jobs: List jobs in the current project


  • find org apps: List apps billed to the specified org

  • find org members: List members in the specified org

  • find org projects: List projects billed to the specified org

  • find orgs: List orgs

  • find projects: List projects

  • generate_batch_inputs: Generate a batch plan (one or more TSV files) for batch execution

  • get: Download records, apps, applets, workflows, files, and databases

  • get_details: Get details of a data object (cf details)

  • head: Print part of a file

  • help: Display help messages and dx commands by category

  • install: Install an app

  • invite: Invite another user to a project or make it public

  • list database: List entities associated with a specific database

  • list database files: lists database files associated with a specific database.

  • list developers: List developers for an app

  • list stages: List the stages in a workflow

  • list users: List authorized users for an app

  • login: Log in (interactively or with an existing API token)

  • logout: Log out and remove credentials

  • ls: List folders and/or objects in a folder

  • make_download_url: Create a file download link for sharing

  • mkdir: Create a new folder

  • mv: Move or rename objects and/or folders inside a project

  • new org: Create new non-billable org

  • new project: Create a new project

  • new record: Create a new record

  • new user: Create a new user account

  • new workflow: Create a new workflow

  • publish: Publish an app or a global workflow

  • pwd: Print current working directory

  • remove developers: Remove developers for an app

  • remove member: Revoke the org membership of a user

  • remove stage: Remove a stage from a workflow

  • remove users: Remove authorized users for an app

  • remove_types: Remove types from a data object

  • rename: Rename a project or data object

  • rm: Remove data objects and folders

  • rmdir: Remove a folder

  • rmproject: Delete a project

  • run: Run an applet, app, or workflow

  • select: List and select a project to switch to

  • set_details: Set details on a data object

  • set_properties: Set properties of a project, data object, or execution

  • set_visibility: Set visibility on a data object

  • setenv: Sets environment variables for the session

  • ssh: Connect to a running job via SSH

  • ssh_config: Configure SSH keys for your DNAnexus account

  • tag: Tag a project, data object, or execution

  • terminate: Terminate jobs or analyses

  • tree: List folders and objects in a tree

  • unarchive: Requests for the specified set of files or for the files in a single specified folder in one project to be unarchived on the platform.

  • uninstall: Uninstall an app

  • uninvite: Revoke others' permissions on a project you administer

  • unset_properties: Unset properties of a project, data object, or execution

  • untag: Untag a project, data object, or execution

  • update member: Update the membership of a user in an org

  • update org: Update information about an org

  • update project: Updates a specified project with the specified options

  • update stage: Update the metadata for a stage in a workflow

  • update workflow: Update the metadata for a workflow

  • upgrade: Upgrade dx-toolkit (the DNAnexus SDK and this program)

  • upload: Upload file(s) or directory

  • wait: Wait for data object(s) to close or job(s) to finish

  • watch: Watch logs of a job and its subjobs

  • whoami: Print the username of the current user

  • Full Documentation


    JSON Examples

    Here is an example of objects nested inside other objects, describing the output of the FastQC app, which creates two files as outputs: an HTML report and a text file containing statistics on the input FASTQ:

    In a later chapter, you will use a file called dxapp.json to build custom applets on DNAnexus. To see a full example from a working app, run dx get app-fastqc to download the source code for the FastQC app. This should create a fastqc directory that contains the file dxapp.json.

    Following is a portion of this file showing a typical JSON document you'll encounter on DNAnexus:

    • The root element of this JSON document is an object, as denoted by the curly brackets.

    • The value of inputSpec is a list, as denoted by the square brackets.

    • Each value in the list is another object.

    • The first three values of this object are strings.

    • The patterns value is a list of strings representing file globs that match the input file extensions.

    The following links explain the dxapp.json file in greater detail:

    • Third Party App Style Guide

    • App Metadata

    Validating JSON

    JSON is a strict format that is easy to get wrong if you are manually editing a file. For this reason, we suggest you use text editors that understand JSON syntax, highlight data structures, and spot common mistakes. For instance, a JSON object looks very similar to a Python dictionary, which allows a trailing comma in a list. Open the python3 REPL (read-evaluate-print-loop) and enter the following to verify:

    A similar trailing comma in JSON would make the document invalid. To see this, go to JSONlint.com, paste this into the input box, and press the "Validate JSON" button:

    The result should reformat the JSON onto three lines as follows:

    The second line should be highlighted in red, and the "Results" below show that a JSON value is expected after the last comma and before the closing square bracket.

    Remove the offending comma and revalidate the document to see the "Results" change to "Valid JSON." You may also want to install a command-line tool like jsonlint that can show similar errors:

    Viewing JSON

    JSON is not dependent on whitespace, so the previous example could be compressed to the following:

    The jq program will format JSON into an indented data structure that is easier to read. In the following example, we execute jq with the filter . to indicate we wish to see the entire document, which is the last argument. Depending on your terminal, the keys may be shown in one color and the values in a different color:

    The power of jq lies in the filter argument, which allows you to extract and manipulate the contents of the document. Use the filter .report_html to extract the value for key report_html that lies at the root of the document:

    Note: If you request a key that does not exist, you will get the JavaScript value null, indicating no value is present:

    Filters may chain keys to search further into the document structure. In the following example, we can extract the file identifier by chaining .report_html.dnanexus_link:

    Reading from Unix Pipes

    Unix-type operating systems such as Linux and FreeBSD/macOS have three special filehandles:

    • STDIN (standard in)

    • STDOUT (standard out)

    • STDERR (standard error)

    STDOUT and STDERR control the output of programs where the first is usually the console and the second is an error channel to segregate errors from regular output. For instance, the STDOUT of jq can be redirected to a file using the > operator:

    STDIN is an input filehandle created by using a pipe (|) in the following example:

    Alternatively, you can read from an input redirect using <:

    Using jq For DNAnexus Responses

    Many dx commands can return JSON by appending the --json flag to them. For instance, dx describe app-fastqc will return a table of metadata about the FastQC app. In the following example, I will request the same data as JSON and will pipe it into the head program to see the first 10 lines:

    As with previous examples, the result is a JSON document with an object at the root level; therefore, I can pipe the output into jq .id to extract the app identifier:

    I can use dx find projects --public to view a list of public projects. Using head, I can see the root of the JSON is a list:

    The jq filter .[] will iterate over the values of a list at the root, so I can use .[].id in the following command to extract the project identifier of each. As this returns over 100 results, I'll use head to show the first few lines:

    You can also use pipes inside of the jq filter to extract the same data:

    Recipes for Using jq

    Editing Job Input and Rerunning

    You may wish to re-run an analysis, possibly with slightly different inputs. For this example, I'll use the job.json file rather than piping the output of dx describe.

    Redirect this to a file:

    Note: If you had access to the original job ID, you would run the following:

    Edit the input.json file, perhaps to indicate a different kmer_size, then re-run the app using the new input:

    Finding Failed Jobs

    Sometimes I find that some jobs have failed when processing large batches of data. I can use dx find jobs --state failed to return a list of failed jobs. Jobs might fail if the input files were corrupt or especially large, causing the instances to run out of disk space or memory. First, I'll show you how to use more advanced filtering in jq. The file rap-jobs.json shows example output from dx find jobs --json that I'll use to extract the state of the jobs:

    A select statement in jq can find the "failed" jobs, and pipes join to more filters to extract the job IDs and the app IDs:

    To be useful in a bash loop, I need the job and app IDs on the same line, so I can use paste for this:

    If I had access to the original executions and input files, I could use a bash loop to re-run these jobs. Since I don't, I'll echo the command that should be run:

    This produces the following output:

    If you were using dx find jobs, then the equivalent would be this:

    Review

    You should now be able to:

    • Describe how users interact with the DNAnexus Platform

    • Explain the purpose of using JSON on the DNAnexus platform

    • Articulate the basic elements of JSON

    • Describe and read basic JSON structures on the platform

    • Parse JSON responses from the platform using jq and pipes to other filters or Unix programs

    Helpful Tips

    • Learn the dxapp.json specification

    • Use an Editor like Visual Studio Code with JSON Crack plugin

    • Use JSON checking tools to make sure your JSON is well formed

      • https://jsonlint.com/

      • Run through jq

    • Use dx get to get app code and dxapp.json for an existing app

    Resources

    Full Documentation

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

    version 1.0
    
    task cnvkit_wdl_kyc {
        input {
            Array[File] bam_tumor
            File reference
        }
    
        command <<<
            cnvkit.py batch \
                ~{sep=" " bam_tumor} \
                -r ~{reference} \
            -p $(expr $(nproc) - 1) \
                -d output/ \
                --scatter
        >>>
    
        runtime {
            docker: "etal/cnvkit:latest"
            cpu: 16
        }
    
        output {
            Array[File]+ cns = glob("output/[!.call]*.cns")
            Array[File]+ cns_filtered = glob("output/*.call.cns")
            Array[File]+ plot = glob("output/*-scatter.png")
        }
    }
    $ cd && wget https://github.com/dnanexus/dxCompiler/releases/download/2.10.3/dxCompiler-2.10.3.jar
    $ java -jar ~/dxCompiler-2.10.3.jar compile workflow.wdl \
            -archive \
            -reorg \
            -folder /workflows \
            -project project-GFf2Bq8054J0v8kY8zJ1FGQF
    applet-GFyVxpQ0VGFgGQBy4vJ0kxK2
    $ dx run applet-GFyVxpQ0VGFgGQBy4vJ0kxK2 -h
    usage: dx run applet-GFyVxpQ0VGFgGQBy4vJ0kxK2 [-iINPUT_NAME=VALUE ...]
    
    Applet: cnvkit_wdl_kyc
    
    Inputs:
      bam_tumor: [-ibam_tumor=(file) [-ibam_tumor=... [...]]]
    
      reference: -ireference=(file)
    
     Reserved for dxCompiler
      overrides___: [-ioverrides___=(hash)]
    
      overrides______dxfiles: [-ioverrides______dxfiles=(file) [-ioverrides______dx>
    
    Outputs:
      cns: cns (array:file)
    
      cns_filtered: cns_filtered (array:file)
    
      plot: plot (array:file)
    $ dx run -y --watch applet-GFyVxpQ0VGFgGQBy4vJ0kxK2 \
                -ibam_tumor=file-GFxXjV006kZVQPb20G85VXBp \
                -ireference=file-GFxXvpj06kZfP0QVKq2p2FGF \
                --destination project-GFyPxb00VGFz5JZQ4f5x424q:/users/kyclark
    $ cat inputs.json
    {
        "bam_tumor": [
            {
                "$dnanexus_link": "file-GFxXjV006kZVQPb20G85VXBp"
            }
        ],
        "reference": {
            "$dnanexus_link": "file-GFxXvpj06kZfP0QVKq2p2FGF"
        }
    }
    $ dx run -y --watch applet-GFyVxpQ0VGFgGQBy4vJ0kxK2 -f inputs.json \
                --destination project-GFyPxb00VGFz5JZQ4f5x424q:/users/kyclark
    $ dx run -imax_session_length="1d" app-cloud_workstation --ssh -y
    $ docker pull etal/cnvkit:latest
    $ docker save etal/cnvkit:latest | gzip - > cnvkit.tar.gz
    $ dx upload cnvkit.tar.gz --path project-GFyPxb00VGFz5JZQ4f5x424q:/
    [===========================================================>]
    Uploaded 503,092,072 of 503,092,072 bytes (100%) cnvkit.tar.gz
    ID                    file-GFyq05j0VGFqJqq54q98pbBK
    Class                 file
    Project               project-GFyPxb00VGFz5JZQ4f5x424q
    Folder                /
    Name                  cnvkit.tar.gz
    State                 closing
    Visibility            visible
    Types                 -
    Properties            -
    Tags                  -
    Outgoing links        -
    Created               Thu Aug 18 03:20:55 2022
    Created by            kyclark
     via the job          job-GFypx3Q0VGFgb71g4gYY3GF3
    Last modified         Thu Aug 18 03:20:57 2022
    Media type
    archivalState         "live"
    cloudAccount          "cloudaccount-dnanexus"
    version 1.0
    
    task cnvkit_wdl_tarball {
        input {
            Array[File] bam_tumor
            File reference
        }
    
        command <<<
            cnvkit.py batch \
                ~{sep=" " bam_tumor} \
                -r ~{reference} \
            -p $(expr $(nproc) - 1) \
                -d output/ \
                --scatter
        >>>
    
        runtime {
            docker: "dx://file-GFyq05j0VGFqJqq54q98pbBK"
            cpu: 16
        }
    
        output {
            Array[File]+ cns = glob("output/[!.call]*.cns")
            Array[File]+ cns_filtered = glob("output/*.call.cns")
            Array[File]+ plot = glob("output/*-scatter.png")
        }
    }
    Input Specification
    
    You will now be prompted for each input parameter to your app.  Each parameter
    should have a unique name that uses only the underscore "_" and alphanumeric
    characters, and does not start with a number.
    
    1st input name (<ENTER> to finish): input_file
    Label (optional human-readable name) []: Input file
    Your input parameter must be of one of the following classes:
    applet         array:file     array:record   file           int
    array:applet   array:float    array:string   float          record
    array:boolean  array:int      boolean        hash           string
    
    Choose a class (<TAB> twice for choices): file
    This is an optional parameter [y/n]: n
    2nd input name (<ENTER> to finish): quality_score
    Label (optional human-readable name) []: Quality score
    Choose a class (<TAB> twice for choices): int
    This is an optional parameter [y/n]: y
    A default value should be provided [y/n]: y
      Default value: 30
    Output Specification
    
    You will now be prompted for each output parameter of your app.  Each
    parameter should have a unique name that uses only the underscore "_" and
    alphanumeric characters, and does not start with a number.
    
    1st output name (<ENTER> to finish): output_file
    Label (optional human-readable name) []: Output file
    Choose a class (<TAB> twice for choices): file
      "inputSpec": [
        {
          "name": "input_file",
          "label": "Input file",
          "class": "file",
          "optional": false,
          "patterns": [
            "*"
          ],
          "help": ""
        },
        {
          "name": "quality_score",
          "label": "Quality score",
          "class": "int",
          "optional": true,
          "default": 30,
          "help": ""
        }
      ],
        {
          "name": "input_file",
          "label": "Input file",
          "class": "file",
          "optional": false,
          "patterns": [
            "*.fastq",
            "*.fq"
          ],
          "help": ""
        }
    wget https://github.com/agordon/fastx_toolkit/releases/download/0.0.14/fastx_toolkit-0.0.14.tar.bz2
    tar xvf fastx_toolkit-0.0.14.tar.bz2
    x ./bin/fasta_clipping_histogram.pl
    x ./bin/fasta_formatter
    x ./bin/fasta_nucleotide_changer
    x ./bin/fastq_masker
    x ./bin/fastq_quality_boxplot_graph.sh
    x ./bin/fastq_quality_converter
    x ./bin/fastq_quality_filter
    x ./bin/fastq_quality_trimmer
    x ./bin/fastq_to_fasta
    x ./bin/fastx_artifacts_filter
    x ./bin/fastx_barcode_splitter.pl
    x ./bin/fastx_clipper
    x ./bin/fastx_collapser
    x ./bin/fastx_nucleotide_distribution_graph.sh
    x ./bin/fastx_nucleotide_distribution_line_graph.sh
    x ./bin/fastx_quality_stats
    x ./bin/fastx_renamer
    x ./bin/fastx_reverse_complement
    x ./bin/fastx_trimmer
    x ./bin/fastx_uncollapser
    mkdir -p mytrimmer/resources/usr/bin/
    cp PATH_TO_FASTX/fastq_quality_trimmer mytrimmer/resources/usr/bin/
    #!/bin/bash
    
    set -exuo pipefail
    
    main() {
        echo "Value of input_file: '$input_file'"
        echo "Value of quality_score: '$quality_score'"
    
        dx download "$input_file" -o "$input_file_name" 
    
        outfile="${input_file_prefix}.filtered.fastq" 
    
        fastq_quality_trimmer -Q 33 -t ${quality_score} -i "$input_file_name" -o "$outfile"
    
        outfile_id=$(dx upload $outfile --brief) 
    
        dx-jobutil-add-output output_file "$outfile_id" --class=file 
    }
    wget https://dl.dnanex.us/F/D/Bp43z7pb2JX8jpB035j4424Vp4Y6qpQ6610ZXg5F/small-celegans-sample.fastq
     dx upload small-celegans-sample.fastq
    [===========================================================>]
    Uploaded 16,801,690 of 16,801,690 bytes (100%) small-celegans-sample.fastq
    ID                    file-GJ2k2V80vx88z3zyJbVXZj3G
    Class                 file
    Project               project-GJ2k24j0vx804FPyBbxqpQBk
    Folder                /
    Name                  small-celegans-sample.fastq
    State                 closing
    Visibility            visible
    Types                 -
    Properties            -
    Tags                  -
    Outgoing links        -
    Created               Tue Oct 11 08:52:37 2022
    Created by            kyclark
    Last modified         Tue Oct 11 08:52:53 2022
    Media type
    archivalState         "live"
    cloudAccount          "cloudaccount-dnanexus"
    $ dx build mytrimmer -f
    {"id": "applet-GJ2k5780vx804FPyBbxqpQQ0"}
    $ dx run applet-GJ2k5780vx804FPyBbxqpQQ0 -h
    usage: dx run applet-GJ2k5780vx804FPyBbxqpQQ0 [-iINPUT_NAME=VALUE ...]
    
    Applet: FastQTrimmer
    
    mytrimmer
    
    Inputs:
      Input file: -iinput_file=(file)
    
      Quality score: [-iquality_score=(int, default=30)]
    
    Outputs:
      Output file: output_file (file)
    $ dx run applet-GJ2k5780vx804FPyBbxqpQQ0 \
    > -iinput_file=file-GJ2k2V80vx88z3zyJbVXZj3G -y --watch
    
    Using input JSON:
    {
        "input_file": {
            "$dnanexus_link": "file-GJ2k2V80vx88z3zyJbVXZj3G"
        }
    }
    
    Calling applet-GJ2k5780vx804FPyBbxqpQQ0 with output destination
      project-GJ2k24j0vx804FPyBbxqpQBk:/
    
    Job ID: job-GJ2k5F00vx84k2X3BqqZ5Zpp
    
    Job Log
    -------
    Watching job job-GJ2k5F00vx84k2X3BqqZ5Zpp. Press Ctrl+C to stop watching.
    2022-10-11 16:31:18 FastQTrimmer STDERR + echo 'Value of input_file:
    '\''{"$dnanexus_link": "file-GJ2k2V80vx88z3zyJbVXZj3G"}'\'''
    2022-10-11 16:31:18 FastQTrimmer STDERR + echo 'Value of quality_score:
    '\''30'\'''
    2022-10-11 16:31:18 FastQTrimmer STDOUT Value of input_file:
    '{"$dnanexus_link": "file-GJ2k2V80vx88z3zyJbVXZj3G"}'
    2022-10-11 16:31:18 FastQTrimmer STDOUT Value of quality_score: '30'
    2022-10-11 16:31:18 FastQTrimmer STDERR + dx download '{"$dnanexus_link":
    "file-GJ2k2V80vx88z3zyJbVXZj3G"}' -o small-celegans-sample.fastq
    2022-10-11 16:31:19 FastQTrimmer STDERR + outfile=
    small-celegans-sample.filtered.fastq
    2022-10-11 16:31:19 FastQTrimmer STDERR + fastq_quality_trimmer -Q 33
    -t 30 -i small-celegans-sample.fastq -o small-celegans-sample.filtered.fastq
    2022-10-11 16:31:27 FastQTrimmer STDERR ++ dx upload
    small-celegans-sample.filtered.fastq --brief
    2022-10-11 16:31:28 FastQTrimmer STDERR + outfile_id=
    file-GJ2zkYj06GbzP8XBB4bVGxQ6
    2022-10-11 16:31:28 FastQTrimmer STDERR + dx-jobutil-add-output output_file
    file-GJ2zkYj06GbzP8XBB4bVGxQ6 --class=file
    $ dx download file-GJ2k73j08bbkVxK9Gxx8Z891
    [===========================================================>]
    Completed 15,557,666 of 15,557,666 bytes (100%) .../fastq_trimmer/small-celegans-sample.filtered.fastq
    $ wc -l small-celegans-sample.f*
      100000 small-celegans-sample.fastq
       99848 small-celegans-sample.filtered.fastq
      199848 total
    nextflow.enable.dsl=2
    
    //params.fastq_dir will be exposed as a pipeline input and is given a default here
    
    params.fastq_dir = "./FASTQ/*.fq.gz"
    //make a fastq ch
    fastq_ch = Channel.fromPath(params.fastq_dir)
    
    workflow {
    //fastqc 
    // takes in a fastq_ch and outputs a channel with fastqc html and zip files
    raw_fastqc_ch = fastqc(fastq_ch)
    
    //takes in a fastq_ch and outputs a channel with trimmed reads
    trimmed_reads_ch = read_trimming(fastq_ch)
    
    //takes in the trimmed reads channel this time
    trimmed_fastqc_ch = fastqc_trimmed(trimmed_reads_ch)
    
    //combine the two channels together to use them in multiqc 
    combined_fastqc_ch = raw_fastqc_ch.mix(trimmed_fastqc_ch)
    
    //takes in a channel containing fastqc files
    //collect is used here to make all files available at the same time.
    multiqc(combined_fastqc_ch.collect())
    }
    nextflow run main.nf --fastq_dir "/FASTQ/SRR_*.fastq.gz"
    nextflow run main.nf
    {
       "report_html": {
           "dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY"
       },
       "stats_txt": {
           "dnanexus_link": "file-G4x7GXQ0VBzZxFxz4fqV120B"
       }
    }
    {
        "name": "fastqc",
        "title": "FastQC Reads Quality Control",
        "summary": "Generates a QC report on reads data",
        "dxapi": "1.0.0",
        "openSource": true,
        "version": "3.0.3",
        "inputSpec": [
            {
                "name": "reads",
                "label": "Reads",
                "help": "A file containing the reads to be checked. Accepted formats are gzipped-FASTQ and BAM.",
                "class": "file",
                "patterns": [
                    "*.fq.gz",
                    "*.fastq.gz",
                    "*.sam",
                    "*.bam"
                ]
            },
        ...
    }
    >>> { 'patterns': [ '*.bam', '*.sam', ] }
    {'patterns': ['*.bam', '*.sam']}
    { "patterns": [ "*.bam", "*.sam", ] }
    {
        "patterns": ["*.bam", "*.sam", ]
    }
    Error: Parse error on line 2:
    ... ["*.bam", "*.sam", ]}
    -----------------------^
    Expecting 'STRING', 'NUMBER', 'NULL', 'TRUE', 'FALSE', '{', '[', got ']'
    $ jsonlint dxapp.json
    Error: Parse error on line 15:
    ...*.sam",            ],            "help
    ----------------------^
    Expecting 'STRING', 'NUMBER', 'NULL', 'TRUE', 'FALSE', '{', '[', got ']'
    $ cat minified.json
    {"report_html":{"dnanexus_link":"file-G4x7GX80VBzQy64k4jzgjqgY"},"stats_txt":
    {"dnanexus_link":"file-G4x7GXQ0VBzZxFxz4fqV120B"}}
    $ jq . minified.json
    {
      "report_html": {
        "dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY"
      },
      "stats_txt": {
        "dnanexus_link": "file-G4x7GXQ0VBzZxFxz4fqV120B"
      }
    }
    $ jq .report_html example.json
    {
      "dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY"
    }
    $ jq .report_htm example.json
    null
    $ jq .report_html.dnanexus_link example.json
    "file-G4x7GX80VBzQy64k4jzgjqgY"
    $ jq . minified.json > prettified.json
    $ cat prettified.json
    {
      "report_html": {
        "dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY"
      },
      "stats_txt": {
        "dnanexus_link": "file-G4x7GXQ0VBzZxFxz4fqV120B"
      }
    }
    $ cat minified.json | jq .
    {
      "report_html": {
        "dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY"
      },
      "stats_txt": {
        "dnanexus_link": "file-G4x7GXQ0VBzZxFxz4fqV120B"
      }
    }
    $ jq . < example.json
    {
      "report_html": {
        "dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY"
      },
      "stats_txt": {
        "dnanexus_link": "file-G4x7GXQ0VBzZxFxz4fqV120B"
      }
    }
    $ dx describe app-fastqc --json | head
    {
        "id": "app-G81jg5j9jP7qxb310vg2xQkX",
        "class": "app",
        "billTo": "org-dnanexus_apps",
        "created": 1644399511000,
        "modified": 1644401066806,
        "createdBy": "user-jkotrs",
        "name": "fastqc",
        "version": "3.0.3",
        "aliases": [
    $ dx describe app-fastqc --json | jq .id
    "app-G81jg5j9jP7qxb310vg2xQkX"
    $ dx find projects --public --json | head
    [
        {
            "id": "project-F0yyz6j9Jz8YpxQV8B8Kk7Zy",
            "level": "VIEW",
            "permissionSources": [
                "PUBLIC"
            ],
            "public": true,
            "describe": {
                "id": "project-F0yyz6j9Jz8YpxQV8B8Kk7Zy",
    $ dx find projects --public --json | jq ".[].id" | head -3
    "project-F0yyz6j9Jz8YpxQV8B8Kk7Zy"
    "project-G4FX3QXKzJxqXxGpK2pJ7Z3K"
    "project-FGX8gVQB9X7K5f1pKfPvz9yG"
    $ dx find projects --public --json | jq ".[] | .id" | head -n 3
    "project-F0yyz6j9Jz8YpxQV8B8Kk7Zy"
    "project-G4FX3QXKzJxqXxGpK2pJ7Z3K"
    "project-FGX8gVQB9X7K5f1pKfPvz9yG"
    $ jq .input job.json
    {
      "reads": {
        "$dnanexus_link": "file-BQbXKk80fPFj4Jbfpxb6Ffv2"
      },
      "format": "auto",
      "kmer_size": 7,
      "nogroup": true
    }
    $ jq .input job.json > input.json
    $ dx describe job-G4x7G5j0B3K2FKzgP654ZqpK --json | jq .input > input.json
    $ dx run app-G4YyQ9044b90F1vG8y9YkKk3 -f input.json
    $ jq ".[].state" rap-jobs.json | sort | uniq -c | sort -rn
      15 "failed"
       3 "done"
       2 "terminated"
    $ jq '.[] | select (.state | contains("failed")) | .id, .executable' rap-jobs.json | head
    "job-G6jj9k8JPXfG42094KG5JFX4"
    "applet-G6jj9b0JPXf5Q6ZF4G85K156"
    "job-G6jj1zQJPXf34z8v4KqjZKP1"
    "applet-G6jg9p8JPXf4Q9Pb4GgPK8Vp"
    "job-G6jg9vQJPXfGbJb54GFkJ33Y"
    "applet-G6jg9p8JPXf4Q9Pb4GgPK8Vp"
    "job-G6jg7Y0JPXfG6q53G12vQZK8"
    "applet-G6jg6pQJPXf7ypXq33B75Qq1"
    "job-G6jg57QJPXf90Jjv4K8pgkG7"
    "applet-G6jfg90JPXfGZkVb7PPxjpPY"
    $ jq '.[] | select (.state | contains("failed")) | .id, .executable' rap-jobs.json | paste - -
    "job-G6jj9k8JPXfG42094KG5JFX4"  "applet-G6jj9b0JPXf5Q6ZF4G85K156"
    "job-G6jj1zQJPXf34z8v4KqjZKP1"  "applet-G6jg9p8JPXf4Q9Pb4GgPK8Vp"
    "job-G6jg9vQJPXfGbJb54GFkJ33Y"  "applet-G6jg9p8JPXf4Q9Pb4GgPK8Vp"
    "job-G6jg7Y0JPXfG6q53G12vQZK8"  "applet-G6jg6pQJPXf7ypXq33B75Qq1"
    "job-G6jg57QJPXf90Jjv4K8pgkG7"  "applet-G6jfg90JPXfGZkVb7PPxjpPY"
    "job-G6jZk6jJPXf1q1Py5VKX6gJK"  "applet-G6jZjG0JPXf7ZxZP4G5v0X1k"
    "job-G6jYY28JPXfFvFXY4GXB6jG2"  "applet-G6jYXq0JPXf5Q6ZF4G85JVgG"
    "job-G6jY9FQJPXf3pj894GFJ02jy"  "applet-G6jY7zQJPXfG42094KG5Gkyy"
    "job-G6jY858JPXfBKX1X0j434BY5"  "applet-G6jY7zQJPXfG42094KG5Gkyy"
    "job-G6jY740JPXf7V2vJ4G2Gkfj7"  "applet-G6jY6zQJPXf81J984K6kfB3V"
    "job-G6jY5v8JPXfPGQq15k77zPJ9"  "applet-G6jY5jjJPXf6Ffqg4GqF4KPg"
    "job-G6jY4k0JPXfPGQq15k77zP9Q"  "applet-G6jY39jJPXfG42094KG5GkV9"
    "job-G6jXPJQJPXfBbf694G3Fg07K"  "applet-G6jXJJjJPXf7V2vJ4G2GkFbF"
    "job-G6jX7yQJPXfFjzffKJzpqfj7"  "applet-G6jX7JQJPXf3V99x4Gx7K09X"
    "job-G6jVzJ0JPXf5Q6ZF4G85JG09"  "applet-G6jVxQQJPXfGZ0BF33KZfX5Y"
    jq '.[] | select (.state | contains("failed")) | .id, .executable' \
    rap-jobs.json | paste - - | \
    while read JOB_ID APP_ID; do echo dx run $APP_ID --clone $JOB_ID; done
    dx run "applet-G6jj9b0JPXf5Q6ZF4G85K156" --clone "job-G6jj9k8JPXfG42094KG5JFX4"
    dx run "applet-G6jg9p8JPXf4Q9Pb4GgPK8Vp" --clone "job-G6jj1zQJPXf34z8v4KqjZKP1"
    dx run "applet-G6jg9p8JPXf4Q9Pb4GgPK8Vp" --clone "job-G6jg9vQJPXfGbJb54GFkJ33Y"
    dx run "applet-G6jg6pQJPXf7ypXq33B75Qq1" --clone "job-G6jg7Y0JPXfG6q53G12vQZK8"
    dx run "applet-G6jfg90JPXfGZkVb7PPxjpPY" --clone "job-G6jg57QJPXf90Jjv4K8pgkG7"
    dx run "applet-G6jZjG0JPXf7ZxZP4G5v0X1k" --clone "job-G6jZk6jJPXf1q1Py5VKX6gJK"
    dx run "applet-G6jYXq0JPXf5Q6ZF4G85JVgG" --clone "job-G6jYY28JPXfFvFXY4GXB6jG2"
    dx run "applet-G6jY7zQJPXfG42094KG5Gkyy" --clone "job-G6jY9FQJPXf3pj894GFJ02jy"
    dx run "applet-G6jY7zQJPXfG42094KG5Gkyy" --clone "job-G6jY858JPXfBKX1X0j434BY5"
    dx run "applet-G6jY6zQJPXf81J984K6kfB3V" --clone "job-G6jY740JPXf7V2vJ4G2Gkfj7"
    dx run "applet-G6jY5jjJPXf6Ffqg4GqF4KPg" --clone "job-G6jY5v8JPXfPGQq15k77zPJ9"
    dx run "applet-G6jY39jJPXfG42094KG5GkV9" --clone "job-G6jY4k0JPXfPGQq15k77zP9Q"
    dx run "applet-G6jXJJjJPXf7V2vJ4G2GkFbF" --clone "job-G6jXPJQJPXfBbf694G3Fg07K"
    dx run "applet-G6jX7JQJPXf3V99x4Gx7K09X" --clone "job-G6jX7yQJPXfFjzffKJzpqfj7"
    dx run "applet-G6jVxQQJPXfGZ0BF33KZfX5Y" --clone "job-G6jVzJ0JPXf5Q6ZF4G85JG09"
    dx find jobs --state failed --json | jq '.[] | .id, .executable' | paste - - | \
    while read JOB_ID APP_ID; do echo dx run $APP_ID --clone $JOB_ID; done

The column types, with a description and example of each, are:

    • string: A string column has free-text values. This is the default fallback type when Data Profiler fails to cast a column type. Example: Patient’s name; Patient’s ID

    • integer: An integer column has integer values. Example: Number of children

    • float: A float column has float values. Example: Weight; Height

    • datetime: A datetime column has datetime values. The default time zone is UTC. Example: Date of birth

    • unknown: The column is empty.

    Dataset Level Screen

The Dataset-level screen is the default screen when you open Data Profiler. It has the Table Relationships and Summary pages. In this section, we describe each component of the screen and its key values.

    The default screen of Data Profiler is at the Table Relationships page of the Dataset level

    Manage Tables

    The Manage Tables controller allows you to hide/show the table(s) from the data profile. The table(s) which are hidden from the ERD will also be hidden from the Data Hierarchy. In order to manage the table display, click on the ‘Manage’ button on the bottom right corner of the screen, then use the toggle to hide/show the tables, and click on the ‘Apply’ button to apply the changes.

    Open the ‘Manage Tables’ controller to show/hide the table(s)

    The data profile is updated after the ‘patients’ table is hidden

    Table Relationships

    A Relationship Diagram (left) with some selected edges highlighted in blue. The selected edges create a Diagram of Overlaps (right)

This is a simplified Entity Relationship Diagram displayed as a graph. Each node represents a table in your dataset, and each edge represents a column that links two tables. The linked columns are the referenced_entity_field in the data_dictionary. The direction of an edge represents the reference from a foreign-key column to a primary-key column.

    FAQs

    Question: There are tables supposed to be linked to each other. Why do they appear unlinked in Data Profiler?

Answer: The linkage between any two tables is determined by the data_dictionary. Data Profiler does not remove or add linkages to a dataset. You should check your data_dictionary again and make sure that the linkage is correctly specified.

    By clicking on one or more edges, you can view a Diagram of Overlaps that shows how many values the linked columns share between the tables. There are several chart types for a Diagram of Overlaps:

    Venn Diagram

    Venn diagram is the default chart type of Diagram of Overlaps. Each set in this diagram is a table in the selection. The numbers are the values from the column in the selection.

    Question: How should I interpret a Venn diagram having 2 tables, patients and measurements, and the value of their intersection is 90? The column is patient_id.

Answer: It means that the patients and measurements tables share 90 patient_id values; in other words, 90 patients have measurement data.

    Euler Diagram

Euler diagrams share the same concept as Venn diagrams. The only difference is that the sizes of the overlap sections are proportional to the overlap values.

    Upset Plot

An Upset plot counts the values of all non-empty combinations of the selected tables. This plot type is more scalable than a Venn or Euler diagram.

    A common use case of an Upset plot is to help answer questions such as “How many patients have full information across tables?”. By creating an Upset plot between the “patients” table and other tables (e.g. diagnosis, measurement, sequence_run, etc.), we can answer the question by looking at the number of patient ids that are shared across all tables.

    Summary Page

The Summary page provides a summary of both the tables and the columns in the Dataset. Below are the details of each section.

    The summary of all Tables and Columns in the Dataset

    Table Summary

    The Table Summary shows information about all tables in the dataset. Each row displays various statistics for a table in your dataset, including:

    • # Columns, # Rows: the number of columns, the number of rows

    • Column types: data type of all columns in a table

    • Duplication Rate: the rate of duplication of a whole row in the table

    • Missing Rate: the rate of having an empty cell in the table

    You can click on the hamburger button at the header of each column to sort or filter the data as needed.

    Clicking on the hamburger button to sort or filter the data

    Column Summary

The Column Summary provides details about every column in the dataset, with each row presenting the following information for a specific column:

    • Column name: name of the column

    • Key type: the attributes that are used to define the relationships of tables

    • Description: the title of a column (if provided in the data dictionary file)

    • Provided type: the type of data in the column which is specified in the data dictionary file. If the data dictionary is not provided, it is ‘unknown’

    • Inferred types: the type of data in the column inferred by Data Profiler if the data dictionary is not provided. If the data dictionary is provided, it will be the same as the Provided type

    • Missing Rate: the rate of having an empty cell in a column

    • Duplication Rate: the rate of duplication of values in a column

    You can also click on the hamburger button at the header of each column to sort or filter the data as needed.

    Resources

    Full Documentation

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select “Contact Support”

    3. Fill in the Subject and Message to submit a support ticket.

    [email protected]
    https://synthea.mitre.org/downloads
    https://doi.org/10.1016/j.ibmed.2020.100007

    Expression

    TCGA (via GDC)

    Data is publicly available (RNA-Seq, STAR - Counts) from GDC from this page and is downloaded on May 16, 2025.

    Somatic

    TCGA (via GDC/cBioPortal)

    Derived from public SNV, CNV, and Fusion data:

    • SNV data are publicly available and downloaded from GDC on October 17, 2024.

    • CNV Segmented copy number data (.SEG files) are publicly available and were downloaded from GDC on October 6, 2025.

    • Fusion data are publicly available and downloaded from cBioPortal on September 27, 2025.

    Germline

    Synthetic Data Only

    TCGA germline data is not publicly available. This component uses simulated genotypes.

    General Overview

    • You can use both the phenotypical and genomic data when creating a cohort.

    • The phenotypic data (which is one database) is processed and combined with the genomic data (another database) to ensure that they are paired appropriately, and that forms a dataset.

• You can then use the dataset in Apollo to perform various actions, such as visualizing the data, analyzing all or part of it (called a cohort), and collaborating with others on a particular dataset.

    High Level Structure of Datasets

    Each dataset has an important structure.

First, a data set lies on top of a database. A data set can be copied and moved around the platform, and even deleted. A database, however, cannot be copied, moved, or deleted without the ingestion process having to be repeated.

    Datasets are the top level structure of the data.

    Each dataset has entities, which are equivalent to tables. The tables contain fields.

    Fields are the variables.

    The graphic below also explains the relationship:

    Structure of a Dataset

Data sets are patient-centric. All the information goes back to the patient.

    This is important for filtering. If a patient, for example, takes a medication more than once during the progression of their illness, there will be more instances of that medication than there are people in the cohort.

    Here is a summary graphic of how the data is considered to be patient-centric:

    Datasets, Databases, and Spark

• Once data is ingested, it becomes available as separate Spark databases. Apollo unifies accessing data in these databases through what's called a dataset.

    • A dataset can be thought of as a giant multi-omics matrix.

• Datasets can be further refined into Cohorts within the Apollo Interface, allowing complex queries across omics types.

    • Underlying Apollo is a technology called Spark. All data in Apollo is stored in it.

    • It is made to handle very large datasets and enable fast queries that can't be handled by single computers.

• It does this by creating RDDs (resilient distributed datasets), which are distributed across the worker nodes. Each node handles only part of the query and reports its results back, which is why the queries are very fast.

    • Details about RDDs can be found here and here.

    • Spark databases mean you can query across many columns in the dataset relatively quickly, compared to using a single computer.

    Datasets, Cohorts, and Dashboards

• Once data is ingested, it becomes available as separate Spark databases. Apollo unifies accessing data in these databases through what's called a dataset.

    • A dataset can be thought of as a giant multi-omics matrix.

    • Datasets can be further refined into Cohorts within the Apollo Interface, allowing complex queries across genomic data types.

    Assay Type

    Source

    Notes

    Clinical

    TCGA (via cBioPortal)

    Data is publicly available ("full" 32 studies) from cBioPortal on October 17, 2024.

    Example 1: Word Count (wc)

    In this example, you will:

    • Learn to write a native DNAnexus applet that executes a Python program

    • Use the dxpy module to download and upload files

    • Use the Python subprocess module to execute an external process and check the return value

    Getting Started

    We'll use the same scarlet.txt file from the bash version of the wc applet. Start off using dx-app-wizard and define the same inputs and outputs as before, but be sure to choose Python for the Programming language:

    Python Template

    The Python template looks like the following:

1. The @dxpy.entry_point('main') decorator marks the DNAnexus execution environment entry point.

    2. The input_file listed in the inputSpec is passed to main.

    3. Create a DXFile object.

    4. Download the input file.

    5. Upload the local output file.

    6. Add the DX file ID to the output dictionary.

    7. Return the output.

    Update src/python_wc.py to the following:

1. Import the getstatusoutput function.

    2. Use the local filename input_file.txt.

    3. The output file will be called output.txt.

    4. Shadow the input_file variable, overwriting it with the creation of a new DXFile object.

    5. Call dxpy.download_dxfile to download the input file identified by the file ID to the local_file name.

    6. Execute wc on the local input file and redirect (>) the output to the chosen output filename. This function returns a tuple containing the process's return value and output (STDOUT/STDERR).

    7. If the return value is not zero, use sys.exit to abort the program with the output from the system call.

    8. If the program makes it to this point, the output file should have been created to upload.

    9. Return a Python dictionary with the DNAnexus link to the new outfile object.

    NOTE: Portable Operating System Interface (POSIX) standards dictate that processes return 0 on success (i.e., zero errors) and some positive integer value (usually in the range 1-127) to indicate an error condition.

Run dx build to build the applet. Create a job_input.json file with the file ID of your input:

    Run your applet with the input file using --watch to see the output:

    I can inspect the contents of the output file:

    I can verify this is correct by piping the input file to a local execution of wc:

    Debugging Locally

    You can shorten the build/run development cycle by naming the JSON input job_input.json and executing the Python program locally:

    This will download the input as input_file.txt and then create a new local file with the system call:

    Review

    • You have now translated the bash applet for running wc into a native DNAnexus Python applet.

    • You were introduced to the dxpy module that provides functions for making API calls.

    • You used subprocess.getstatusoutput to call an external process and interpret the return value for success or failure.

    In the next section, we'll continue translating bash to Python.

    Resources

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

    Example 3: fastq_trimmer

    In this example, you will translate the bash app from the previous chapter into Workflow Definition Language (WDL).

    You will learn how to:

    • Use Java Jar files to validate and compile WDL

    • Use WDL to define an applet's inputs, outputs, and runtime specs

    • Compile a WDL task into an applet

    Getting Started

    You will not use a wizard to start this applet, so manually create a directory for your work. Create a file called fastq_trimmer.wdl with the following contents:

• This line indicates that the WDL follows the 1.0 specification.

    • The task defines the body of the applet.

    • The input block defines the same inputs, a File called input_file and an Int (integer) value called quality_score with a default value of 30.

    • This line defines a variable called basename which uses the basename function to get the filename of the input file.

    • The command block will be executed at runtime. It uses the tilde/twiddle syntax (~{}) to dereference variables. The output is written to a filename using the basename of the input.

    • The output defines a single File called output_file.

    • The runtime specifies a Biocontainers Docker image that contains the FASTX toolkit binaries.

    Checking and Compiling the WDL

    To start, validate your WDL with WOMtool:

    Before compiling the WDL into an applet, use dx pwd to ensure you are in your desired project. If not, run dx select to select a different project, then use the following command to compile the applet:

Use dx run as in the previous chapter to run the applet with the -h|--help option to verify that the usage looks identical to the bash version:

From the perspective of the user, there is no difference between native/bash applets and those written in WDL. You should use whichever syntax you find most convenient for the task at hand. For instance, this applet leverages an existing Docker container created by the Biocontainers Community rather than adding the binary as a resource.

    You can run the applet using the command-line arguments as shown, or you can create a JSON file with the arguments as follows:

    You can run the applet and watch the job with the following command:

    The output will look quite different from the bash app, but the basics are still the same. In this version, notice that you do not need to download the inputs or upload the outputs. Once the input files are in place, the command block is run and the input files and variables are dereferenced properly. When the job has completed, run dx describe to see the inputs and outputs:

    Download the output file to ensure it looks like a correct result:

    Documentation with Makefiles

    You may find it useful to create a Makefile with all the steps documented in a runnable fashion:

    Now you can run make compile rather than type out the rather long Java command.

    Review

The WDL version of the fastq_trimmer applet is arguably simpler than the bash version. It uses just one file, fastq_trimmer.wdl, and about 20 lines of text, whereas the bash version requires at least dxapp.json, a bash script, and the resources tarball.

    In this chapter, you learned how to:

    • Use a Biocontainers Docker image for the necessary binary executables from FASTX toolkit

    • Define the same inputs, outputs, and commands as the bash applet from Chapter 3

    • Use a Makefile to define project shortcuts to validate, compile, and run an applet

    Resources

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

    Example 5: samtools with a Docker Image

This tutorial uses the same samtools applet from an earlier example but will use a public Docker Image instead of an asset.

    Step 1: Download the Docker Image

    Please start the Cloud Workstation Application by typing in the following command into the terminal:
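    A minimal sketch of that command, assuming the public Cloud Workstation app and that SSH access has already been configured with dx ssh_config:

    $ dx run app-cloud_workstation --ssh -y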

    Once the Cloud Workstation Application has started, pull the image from the repository, save the Docker image within the Workstation, and then use dx upload to put the saved image onto the project space.

    First, pull the Docker Image using the following command:
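    For example, to pull a public samtools image (the repository and tag below are an illustrative assumption; use whichever trusted samtools image you prefer):

    $ docker pull biocontainers/samtools:v1.9-4-deb_cv1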

    Example 3: cnvkit

    This example will build on the asset you created in the bash version. You will:

    • Learn how to download the input type array:file

    • Use regular expressions to classify output files

    Introduction to Data Profiler

    A license is required to access the Data Profiler on the DNAnexus Platform. For more information, please contact DNAnexus Sales (via [email protected]).

    What is the Data Profiler?

The Data Profiler is an app within the DNAnexus Tool Library that supports data cleaning and harmonization. It organizes your data into three levels of information: Dataset level, Table level, and Column level. Each level surfaces interactive visualizations on data quality, data coverage, and descriptive statistics to help you understand and identify potential data issues. The Data Profiler also includes an Explorer Mode where you can create customizable visualizations using simple drag-and-drop functionality, for deeper exploration beyond the standard metrics. Researchers can bring their data to the Platform and leverage the Data Profiler app to explore and quickly evaluate the readiness of the data for downstream analysis.

    Example 2: fastq_quality_trimmer

    In this exercise, we'll demonstrate a native DNAnexus Python applet that runs the fastq_quality_trimmer binary.

    You will learn:

    • How to use a DXFile object to get file metadata

• How to use Python functions to choose an output filename using the input file's name

    • How to add debugging output to your Python program

    Home JSON File

    Disclaimer: Portals require a license. These documents are to get you started with your portals. By no means is this the only way to make your portal, nor is this the only way to edit a json file.

    Overview of the home.json file

• This .json file edits the home screen.

    Definitions for each of the Somatic Variants Types that were used for data ingestion are:

    Getting Started

    The inputs and outputs are the same as in the bash version of this applet. You can start from scratch using dx-app-wizard with the following input specs:

    Input Name
    Type
    Optional
    Default Value

    input_file

    file

    No

    NA

    quality_score

int

    Yes

    30

    The output specs are as follows:

    Output Name
    Type

    output_file

    file

    Or you can use the dxapp.json from the bash version and change the runSpec file to the name of your Python script and the interpreter to python3 as follows:

    Inside your applet's source code, create resources/usr/local/bin and copy the fastq_quality_trimmer bin to this location. At runtime, the binary will be available at /usr/local/bin/fastq_quality_trimmer, which is in the standard $PATH.

    Python Code

    Update the Python code to the following:

    1. The input_file will be the DNAnexus file ID (e.g., file-FvQGZb00bvyQXzG3250XGbgz), and the quality_score will be an integer value.

    2. Use DXFile.describe to get a Python dictionary of metadata.

    3. Choose a local filename by using either the file's name from the metadata or the file ID.

    4. Download the input file to the chosen local filename.

    5. Split the filename into a basename and extension.

    6. Create an output filename using the input basename and a new extension to indicate that the data has been filtered.

    7. Format a command string.

    8. Print the command for debugging purposes.

    9. Execute the command and check the return value.

    10. If the code makes it to this point, upload the output file and return the file ID to be attached to the job's output.

    Build and Run

    Run dx build in your source directory to create the new applet. Use the new applet ID to execute the applet with a small FASTQ file:

Verify Output

    Use dx head to verify the output looks like a FASTQ file:

To confirm that the applet did winnow the number of reads, I can pipe the output of dx cat to wc to verify that the output file has fewer lines than the input file:

    Review

    • You used DXFile to get the input file's name

    • Your output filename is based on the input file's name rather than a static name like output.txt.

    • You can call Python's print function to add your own STDOUT/STDERR to the applet, which can be an aid in debugging your program.

    Resources

    Full Documentation

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

    Template Options
    
    You can write your app in any programming language, but we provide
    templates for the following supported languages: Python, bash
    Programming language: Python
    python_wc.py
    #!/usr/bin/env python
    # python_wc 0.1.0
    # Generated by dx-app-wizard.
    #
    # Basic execution pattern: Your app will run on a single machine from
    # beginning to end.
    #
    # See https://documentation.dnanexus.com/developer for documentation and
    # tutorials on how to modify this file.
    #
    # DNAnexus Python Bindings (dxpy) documentation:
    #   http://autodoc.dnanexus.com/bindings/python/current/
    
    import os
    import dxpy
    
    @dxpy.entry_point('main') # 1
    def main(input_file): # 2
    
        # The following line(s) initialize your data object inputs on the platform
        # into dxpy.DXDataObject instances that you can start using immediately.
    
        input_file = dxpy.DXFile(input_file) # 3
    
        # The following line(s) download your file inputs to the local file system
        # using variable names for the filenames.
    
        dxpy.download_dxfile(input_file.get_id(), "input_file") # 4
    
        # Fill in your application code here.
    
        # The following line(s) use the Python bindings to upload your file outputs
        # after you have created them on the local file system.  It assumes that you
        # have used the output field name for the filename for each output, but you
        # can change that behavior to suit your needs.
    
        outfile = dxpy.upload_local_file("outfile") # 5
    
        # The following line fills in some basic dummy output and assumes
        # that you have created variables to represent your output with
        # the same name as your output fields.
    
        output = {}
        output["outfile"] = dxpy.dxlink(outfile) # 6
    
        return output # 7
    
    dxpy.run()
    python_wc.py
    #!/usr/bin/env python
    
    import dxpy
    import sys
    from subprocess import getstatusoutput # 1
    
    
    @dxpy.entry_point("main")
    def main(input_file):
        local_file = "input_file.txt" # 2
        output_file = "output.txt" # 3
    
        input_file = dxpy.DXFile(input_file) # 4
        dxpy.download_dxfile(input_file.get_id(), local_file) # 5
    
        rv, out = getstatusoutput(f"wc {local_file} > {output_file}") # 6
    
        if rv != 0: # 7
            sys.exit(out)
    
        outfile = dxpy.upload_local_file(output_file) # 8
        return {"outfile": dxpy.dxlink(outfile)} # 9
    
    
    dxpy.run()
    {
        "input_file": {
            "$dnanexus_link": "file-GgGX7Y8071x46B02JGb515pB"
        }
    }
    $ dx run applet-GgGX740071xJY20Gjkp0JYXB -f python_wc/job_input.json \
        -y --watch \
        --destination project-GXY0PK0071xJpG156BFyXpJF:/output/python_wc/
    Using input JSON:
    {
        "input_file": {
            "$dnanexus_link": "file-GgGX7Y8071x46B02JGb515pB"
        }
    }
    
    Calling applet-GgGX740071xJY20Gjkp0JYXB with output destination
      project-GXY0PK0071xJpG156BFyXpJF:/output/python_wc
    
    Job ID: job-GgGX8P0071x1yfFPkJ8662gQ
    
    Job Log
    -------
    Watching job job-GgGX8P0071x1yfFPkJ8662gQ. Press Ctrl+C to stop watching.
    * Python implementation of wc (python_wc:main) (running) job-GgGX8P0071x1yfFPkJ8662gQ
      kyclark 2024-02-23 16:03:24 (running for 0:01:39)
    2024-02-23 16:11:36 Python implementation of wc INFO Logging initialized (priority)
    2024-02-23 16:11:36 Python implementation of wc INFO Logging initialized (bulk)
    2024-02-23 16:11:40 Python implementation of wc INFO Setting SSH public key
    2024-02-23 16:11:42 Python implementation of wc STDOUT dxpy/0.369.0 (Linux-5.15.0-1053-aws-x86_64-with-glibc2.29) Python/3.8.10
    2024-02-23 16:11:43 Python implementation of wc STDOUT Invoking main with {'input_file': {'$dnanexus_link': 'file-GgGX7Y8071x46B02JGb515pB'}}
    * Python implementation of wc (python_wc:main) (done) job-GgGX8P0071x1yfFPkJ8662gQ
      kyclark 2024-02-23 16:03:24 (runtime 0:01:36)
      Output: outfile = file-GgGXGFj0FbZxjvk1jZPJPkG2
    $ dx cat file-GgGXGFj0FbZxjvk1jZPJPkG2
      8596  86049 513778 input_file.txt
    $ dx cat file-GgGX7Y8071x46B02JGb515pB | wc
        8596   86049  513778
    $ python3 src/python_wc.py
    Invoking main with {'input_file': {'$dnanexus_link': 'file-GgGX7Y8071x46B02JGb515pB'}}
    $ cat output.txt
        8596   86049  513778 input_file.txt
    version 1.0 
    
    task fastq_trimmer { 
        input { 
            File input_file
            Int quality_score = 30
        }
    
        String basename = basename(input_file) 
    
        command <<<
            fastq_quality_trimmer -Q 33 -t ~{quality_score} \ 
                -i ~{input_file} -o ~{basename}.filtered.fastq
        >>>
    
        output { 
            File output_file = "~{basename}.filtered.fastq"
        }
    
        runtime { 
            docker: "biocontainers/fastxtools:v0.0.14_cv2"
        }
    }
    $ java -jar ~/womtool.jar validate fastq_trimmer.wdl
    Success!
    $ java -jar ~/dxCompiler.jar compile fastq_trimmer.wdl
    [warning] Project is unspecified...using currently selected project project-GJ2k24j0vx804FPyBbxqpQBk
    applet-GJ2pgv80vx84zJ4XJF6GPXz7
    usage: dx run applet-GJ2pgv80vx84zJ4XJF6GPXz7 [-iINPUT_NAME=VALUE ...]
    
    Applet: fastq_trimmer
    
    Inputs:
      input_file: -iinput_file=(file)
    
      quality_score: [-iquality_score=(int, default=30)]
    
     Reserved for dxCompiler
      overrides___: [-ioverrides___=(hash)]
    
      overrides______dxfiles: [-ioverrides______dxfiles=(file) [-ioverrides______dxfiles=... [...]]]
    
    Outputs:
      output_file: output_file (file)
    $ cat inputs.json
    {
        "input_file": {
            "$dnanexus_link": "file-GJ2k2V80vx88z3zyJbVXZj3G"
        },
        "quality_score": 35
    }
    $ dx run applet-GJ2pgv80vx84zJ4XJF6GPXz7 -f inputs.json -y --watch
    
    Using input JSON:
    {
        "input_file": {
            "$dnanexus_link": "file-GJ2k2V80vx88z3zyJbVXZj3G"
        },
        "quality_score": 35
    }
    
    Calling applet-GJ2pgv80vx84zJ4XJF6GPXz7 with output destination
    project-GJ2k24j0vx804FPyBbxqpQBk:/
    
    Job ID: job-GJ2ppvQ0vx88k8bv9pvGyjGX
    
    Job Log
    -------
    Watching job job-GJ2ppvQ0vx88k8bv9pvGyjGX. Press Ctrl+C to stop watching.
    $ dx describe job-GJ2ppvQ0vx88k8bv9pvGyjGX
    Result 1:
    ID                    job-GJ2ppvQ0vx88k8bv9pvGyjGX
    Class                 job
    Job name              fastq_trimmer
    Executable name       fastq_trimmer
    Project context       project-GJ2k24j0vx804FPyBbxqpQBk
    Region                aws:us-east-1
    Billed to             org-sos
    Workspace             container-GJ2ppx80773k09b8F6qKGJBb
    Applet                applet-GJ2pgv80vx84zJ4XJF6GPXz7
    Instance Type         mem1_ssd1_v2_x2
    Priority              high
    State                 done
    Root execution        job-GJ2ppvQ0vx88k8bv9pvGyjGX
    Origin job            job-GJ2ppvQ0vx88k8bv9pvGyjGX
    Parent job            -
    Function              main
    Input                 input_file = file-GJ2k2V80vx88z3zyJbVXZj3G
                          quality_score = 35
    Output                output_file = file-GJ2pv300773ypy03Jg2vYZ9f
    ...
    $ dx download file-GJ2pv300773ypy03Jg2vYZ9f
    [===========================================================>]
    Completed 14,357,774 of 14,357,774 bytes (100%) ~/fastq_trimmer_wdl/small-celegans-sample.fastq.filtered.fastq
    $ wc -l small-celegans-sample.fastq.filtered.fastq
       98624 small-celegans-sample.fastq.filtered.fastq
    WDL = fastq_trimmer.wdl
    PROJECT_ID = project-GJ2k24j0vx804FPyBbxqpQBk
    DXCOMPILER = java -jar ~/dxCompiler.jar
    CROMWELL = java -jar ~/cromwell.jar
    WOMTOOL = java -jar ~/womtool.jar
    WORKFLOW_ID = applet-GJ2pgv80vx84zJ4XJF6GPXz7
    
    validate:
        $(WOMTOOL) validate $(WDL)
    
    check:
        miniwdl check $(WDL)
    
    compile:
        $(DXCOMPILER) compile $(WDL) \
            -archive \
            -folder /workflows \
            -project $(PROJECT_ID)
    
    run:
        dx run $(WORKFLOW_ID) \
            -f inputs.json \
            --destination $(PROJECT_ID):/output \
            -y --watch
        "runSpec": {
            "timeoutPolicy": {
                "*": {
                    "hours": 1
                }
            },
            "interpreter": "python3",
            "file": "src/python_fastq_trimmer.py",
            "distribution": "Ubuntu",
            "release": "20.04",
            "version": "0"
        },
    python_fastq_trimmer.py
    #!/usr/bin/env python3
    
    import dxpy
    import os
    import sys
    from subprocess import getstatusoutput
    
    
    @dxpy.entry_point("main")
    def main(input_file, quality_score): # 1
        input_file = dxpy.DXFile(input_file)
        desc = input_file.describe() # 2
        local_file = desc.get("name", input_file.get_id()) # 3
        dxpy.download_dxfile(input_file.get_id(), local_file)  # 4
    
        basename, ext = os.path.splitext(local_file) # 5
        outfile = f"{basename}.filtered{ext}" # 6
        cmd = ( # 7
            f"fastq_quality_trimmer -Q 33 -t {quality_score} "
            f"-i {local_file} -o {outfile}"
        )
        print(cmd) # 8
        rv, out = getstatusoutput(cmd) # 9
    
        if rv != 0:
            sys.exit(out)
    
        dx_output_file = dxpy.upload_local_file(outfile) # 10
        return {"output_file": dxpy.dxlink(dx_output_file)}
    
    
    dxpy.run()
    $ dx run applet-GgKQ5qQ071x5yX7fgbq3PkXB \
    > -f python_fastq_trimmer/job_input.json -y --watch \
    > --destination project-GXY0PK0071xJpG156BFyXpJF:/output/python_fastq_trimmer/
    
    Using input JSON:
    {
        "input_file": {
            "$dnanexus_link": "file-FvQGZb00bvyQXzG3250XGbgz"
        },
        "quality_score": 28
    }
    
    Calling applet-GgKQ5qQ071x5yX7fgbq3PkXB with output destination
      project-GXY0PK0071xJpG156BFyXpJF:/output/python_fastq_trimmer
    
    Job ID: job-GgKQ6x0071x6kf34P5xy2q2b
    
    Job Log
    -------
    Watching job job-GgKQ6x0071x6kf34P5xy2q2b. Press Ctrl+C to stop watching.
    * Python version of fastq_trimmer (python_fastq_trimmer:main) (running)
    * job-GgKQ6x0071x6kf34P5xy2q2b
      kyclark 2024-02-26 14:32:36 (running for 0:00:21)
    2024-02-26 14:33:17 Python version of fastq_trimmer INFO Logging initialized
    (priority)
    2024-02-26 14:33:17 Python version of fastq_trimmer INFO Logging initialized
    (bulk)
    2024-02-26 14:33:21 Python version of fastq_trimmer INFO Downloading bundled
    file resources.tar.gz
    2024-02-26 14:33:22 Python version of fastq_trimmer STDOUT >>> Unpacking
    resources.tar.gz to /
    2024-02-26 14:33:22 Python version of fastq_trimmer STDERR tar: Removing
    leading `/' from member names
    2024-02-26 14:33:22 Python version of fastq_trimmer INFO Setting SSH public key
    2024-02-26 14:33:23 Python version of fastq_trimmer STDOUT dxpy/0.369.0
    (Linux-5.15.0-1053-aws-x86_64-with-glibc2.29) Python/3.8.10
    2024-02-26 14:33:23 Python version of fastq_trimmer STDOUT Invoking main with
    {'input_file': {'$dnanexus_link': 'file-FvQGZb00bvyQXzG3250XGbgz'},
    'quality_score': 28}
    2024-02-26 14:33:24 Python version of fastq_trimmer STDOUT
    fastq_quality_trimmer -Q 33 -t 28 -i small-celegans-sample.fastq -o
    small-celegans-sample.filtered.fastq
    * Python version of fastq_trimmer (python_fastq_trimmer:main) (done)
    * job-GgKQ6x0071x6kf34P5xy2q2b
      kyclark 2024-02-26 14:32:36 (runtime 0:00:20)
      Output: output_file = file-GgKQ79j0B2FQjGbk0qX6j64B
    $ dx head file-GgKQ79j0B2FQjGbk0qX6j64B
    @SRR070372.1 FV5358E02GLGSF length=78
    TTTTTTTTTTTTTTTTTTTTTTTTTTTNTTTNTTTNTTTNTTTATTTATTTATTTATTATTATATATATATA
    +SRR070372.1 FV5358E02GLGSF length=78
    ...000//////999999<<<=<<666!602!777!922!688:669A9=<=122569AAA?>@BBBBAA?=
    @SRR070372.2 FV5358E02FQJUJ length=177
    TTTCTTGTAATTTGTTGGAATACGAGAACATCGTCAATAATATATCGTATGAATTGAACCACACGGCACATATTTGAACTTGTTCGTGAAATTTAGCGAACCTGGCAGGACTCGAACCTCCAATCTTCGGATCCGAAGTCCGACGCCCCCGCGTCGGATGCGTTGTTACCACTGCTT
    +SRR070372.2 FV5358E02FQJUJ length=177
    222@99912088>C<?7779@<GIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIC;6666IIIIIIIIIIII;;;HHIIE>944=>=;22499;CIIIIIIIIIIIIHHHIIIIIIIIIIIIIIIH?;;;?IIEEEEEEEEIIII77777I7EEIIEEHHHHHIIIIIIIIIIIIII
    @SRR070372.3 FV5358E02GYL4S length=70
    TTGGTATCATTGATATTCATTCTGGAGAACGATGGAACATACAAGAATTGTGTTAAGACCTGCAT
    $ dx cat file-GgKQ79j0B2FQjGbk0qX6j64B | wc -l
       99952
    
    $ dx cat file-FvQGZb00bvyQXzG3250XGbgz | wc -l
      100000

    The path will include the tag from the Docker Repository.

  • Use up to date Docker Images from reliable sources

  • Next, save the Docker Image:

• -o: the output file name, which needs to end in .tar.gz

    • The image is referenced by its path, including the tag

    Finally, upload the saved image to the project:

• Add --path project-ID:/ to the dx upload command to ensure that the file is uploaded to your project rather than the job's temporary workspace container.

When finished uploading, you can load and use the Docker image within the Cloud Workstation using:

    or terminate the Cloud Workstation job, and then proceed to building the applet.
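    Taken together, the save, upload, and load steps described above might look like the following sketch (the image name, tag, and project ID are placeholders; substitute your own):

    $ docker save biocontainers/samtools:v1.9-4-deb_cv1 -o samtools.tar.gz
    $ dx upload samtools.tar.gz --path project-XXXX:/
    $ docker load -i samtools.tar.gz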

    Step 2: Building the Applet

    We will use dx-app-wizard to create a skeleton applet structure with these files:

    Metadata:

    First, give the applet a name. The prompt shows that only letters, numbers, a dot, underscore, and a dash can be used. As stated earlier, this applet name will also be the name of the directory. Use samtools_count_docker_bundle:

    Next is the title. Note that the prompt includes empty square brackets ([]), which contain the default value if Enter is pressed. As title is not required, it contains the empty string, but add an informational title “Samtools Count”

    Likewise, the summary field is not required:

    The version is also optional, and press Enter to take the default:

    Input Specification:

    There is one input for this applet, which is a BAM file.

    Use the parameters for the input section:

    • name: bam

    • label: BAM file

    • class: file

    • optional: false

    When prompted for the first input, enter the following:

    • The name of the input will be used as a variable in the bash code, so use only letters, numbers, and underscores as in bam or bam_file.

    • The label is optional, as noted by the empty square brackets.

    • The types include primitives like integers, floating-point numbers, and strings, as well as arrays of primitive types.

    • This is a required input. If an input is optional, provide a default value.

    When prompted for the second input, press Enter:

    Output Specification:

    There is one output for this applet, which is a counts file.

    Use the parameters for the output section:

    • name: counts

    • label: counts file

    • class: file

    When prompted for the first output name, enter the following:

    • This name will also become a bash variable, so best practice is to use letters, numbers, and underscores.

    • The label is optional.

    • The class must be from the preceding list. To be reminded of the choices, press the Tab key twice.

    When prompted for the second output, press Enter:

    Additional Settings

    Here are the final settings to complete the wizard:

    • Timeout Policy: 48h

    • Programming language: bash

    • Access to internet: No (default)

    • Access to parent project: No (default)

    • Instance Type: mem1_ssd1_v2_x4 (default)

Applets are required to set a maximum run time to prevent a job from running for an excessively long time. While some applets may legitimately need days to run, most probably need something in the range of 12-48 hours. As noted in the prompt, use m, h, or d to specify minutes, hours, or days, respectively:

    For the template language, select from bash or Python for the program that is executed when the applet starts. The applet code can execute any program available in the execution environment, including custom programs written in any language. Choose bash:

    Next, determine if the applet has access to the internet and/or the parent project. Unless the applet specifically needs access, such as to download a file at runtime, it's best to answer no:

    Lastly, I must specify a default instance type. The prompt includes an abbreviated list of instance types. The final number indicates the number of cores, e.g., _x4 indicates 4 cores. The greater the number of cores, the more available memory and disk space. In this case, a small 4-core instance is sufficient:

    The user is always free to override the instance type using the --instance-type option to dx run.
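    For example (the applet and file IDs here are placeholders):

    $ dx run applet-xxxx -ibam=file-yyyy --instance-type mem1_ssd1_v2_x8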

    Files From dx-app-wizard

    The final output from dx-app-wizard is a summary of the files that are created:

    1. Readme.developer.md : This file should contain applet implementation details.

    2. Readme.md: This file should contain user help.

    3. dxapp.json: The answers from dx-app-wizard are used to create the app metadata.

    4. resources/ : The resources directory is for any additional files you want available on the runtime instance.

    5. src/ : The src (pronounced "source") is a conventional place for source code, but it's not a requirement that code lives in this directory.

    6. src/samtools_count.sh : This is the bash script that will be executed when the applet is run.

7. test/ : The test directory is empty and will not be discussed in this section.

The contents of the resources directory will be placed into the root directory of the runtime instance. For instance, if there is a file resources/my_tool, then it will be available on the runtime instance as /my_tool. In the bash code, reference the full path (/my_tool) or expand the $PATH variable to include /. Best practice is to create the directory structure resources/usr/local/bin/, and then the file will be at /usr/local/bin/my_tool, as /usr/local/bin is normally part of $PATH.
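    For example, to stage the hypothetical my_tool binary this way:

    $ mkdir -p resources/usr/local/bin
    $ cp my_tool resources/usr/local/bin/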

    Dxapp.json

    This is where the formatting from the dx-app-wizard is listed in a .json file. If needed, change the settings for the output, input, version, etc within the json file.

    The first section is the metadata, as shown below:

    The next section(s) are Inputs and Outputs, shown below:

    Finally, the last section is the Additional Settings, shown below:

    Adding A Docker Image into the Resources Folder

    Add your Docker Image to the resources folder.

    1. dx download the samtools.tar.gz

    2. mv samtools.tar.gz to the samtools_count_docker_bundle/resources/ folder
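    As commands, those two steps might look like this (run from the directory that contains samtools_count_docker_bundle):

    $ dx download samtools.tar.gz
    $ mv samtools.tar.gz samtools_count_docker_bundle/resources/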

    Samtools_docker.sh

    Update the following .sh code file for this applet:

• #!/bin/bash is the “shebang” line indicating that this is a bash script

    • set -exuo pipefail is the pragma to show each command as it is executed and to halt on undefined variables or failed system calls

    • Within the “main” section, there are code lines that:

      • Echo the value of the input, “bam”, using the name $bam, which is part of the input Spec

      • Download the input file onto the job instance, with the output being the name of the bam file (ex: ___.bam)

      • The first Docker command, which loads the saved Docker image, samtools.tar.gz (which is in the resources folder)

      • Assigning a counts_id variable for the name of the counts file output for samtools

      • The second Docker Command

        • Docker run to run the Docker Image

        • -v /home/dnanexus:/home/dnanexus to mount the volume

        • The name of the Docker Image, including the tag.

      • Assigning a variable (upload) for uploading the counts file back to the project

      • Using the upload variable AND the output spec in the json file for the dx-jobutil-add-output command
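    A minimal sketch of that script, following the bullets above (the Docker image name and tag are placeholders that must match the image you saved, and the samtools view -c command is an assumption based on the counts output):

    #!/bin/bash
    # show each command as it runs; stop on errors, unset variables, or failed pipes
    set -exuo pipefail

    main() {
        # echo the value of the "bam" input
        echo "Value of bam: '$bam'"

        # download the input file onto the job instance using its original name
        dx download "$bam" -o "$bam_name"

        # load the saved Docker image; resources/samtools.tar.gz is available at /samtools.tar.gz at runtime
        docker load -i /samtools.tar.gz

        # choose a name for the counts file that samtools will produce
        counts_id="${bam_prefix}.counts.txt"

        # run samtools inside the container, mounting the job's home directory
        # (image name and tag are placeholders; "view -c" counts the reads in the BAM)
        docker run -v /home/dnanexus:/home/dnanexus -w /home/dnanexus \
            biocontainers/samtools:v1.9-4-deb_cv1 \
            samtools view -c "$bam_name" > "$counts_id"

        # upload the counts file back to the project and capture its file ID
        upload=$(dx upload "$counts_id" --brief)

        # attach the uploaded file to the "counts" output defined in dxapp.json
        dx-jobutil-add-output counts "$upload" --class=file
    }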

    Building the Applet

    Once you have added the Docker Image to the resources folder and edited the .sh and .json files, use the following command to create your applet in the project of your choice:
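    For example, from the directory that contains the applet folder:

    $ dx build samtools_count_docker_bundle
    # or, to overwrite an existing applet of the same name:
    $ dx build -f samtools_count_docker_bundle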

    Then, proceed to test your applet!

    Resources

    Full Documentation

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

    Example 3: samtools
    Getting Started

    We'll call our new applet python_cnvkit. If you want to start from dx-app-wizard, use the following specs for the inputs and outputs:

    Input Name
    Type
    Optional
    Default Value

    bam_tumor

    array:file

    No

    NA

    reference

    file

    No

    NA

    The output specs are as follows:

    Output Name
    Type

    cns

    array:file

    cns_filtered

    array:file

    plot

    array:file

    You can also copy the bash applet directory and update the runSpec in dxapp.json to run a Python script and use the CNVKit asset from before:

    Here is the input.json:

    Python Code

    Update src/python_cnvkit.py to the following:

    1. Use a Python list comprehension to generate a list of file IDs for the tumor BAM files.

    2. Download the reference file.

3. Initialize a list to hold the downloaded BAM paths.

    4. Download each BAM file into a directory and append the path to the bam_files list.

    5. Create, print, and run the command to execute CNVkit.

6. Find all the files created in the output directory. The listing function returns only the filenames, so append the directory name.

    7. For each of the output file categories, filter the output files and upload the output files matching the expected extension.

    8. Compile the given regular expression.

    9. Create a DX file ID link for each uploaded file.

10. Filter the given files for those matching the regex.

NOTE: The regex (?<!.call).cns$ uses a negative lookbehind to ensure that .call does not precede .cns.

    Here is the output from the job:

    Review

    • You used a for loop to download multiple input BAM files into a local directory.

    • You used regular expressions to classify the output files into the three output labels.

    Resources

    Full Documentation

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

    Why use the Data Profiler?

The Data Profiler app saves significant time by generating consistent and comprehensive reports on data quality. It supports informed decision-making by allowing experts to fully understand the data before downstream analysis. From data collection and cleaning to feature engineering, continuously profiling your data helps you understand its evolution, maintain consistent quality throughout the transformation process, and identify potential issues early so you can make adjustments that optimize analysis and performance.

    Core features of Data Profiler

This tool quickly analyzes and visualizes large dataset inputs from CSV, Parquet, or a DNAnexus Apollo Dataset (or Cohort). The point-and-click solution efficiently provides summary statistics and visualizations, enabling a comprehensive understanding of the data. It also highlights data inconsistencies and complexities (e.g., missing and imbalanced data) in a logical and organized manner, guiding you through the structure and content of your data.

    Getting Started

    Access to the App

    There are two ways to run the application:

    1. Direct Access: Go to this link to open the app.

2. Platform Navigation: Click Tools on the top navigation bar, proceed to the Tool Library, search for the “Data Profiler” app, select it, then select Run from the app’s page to start it.

    Inputs

To run the app, you need to provide the required input files, which are .csv or .parquet files, or a DNAnexus Apollo Dataset (or Cohort).

    If you run the app with .csv files or .parquet files, there is an optional input for the Data Dictionary. This is the same Data Dictionary used by Data Model Loader to generate the DNAnexus Apollo Dataset.

    Input name

    Mandatory/ Optional

    Input type/format

    Description

    input_files

    Optional

    A list of CSV, TSV, TXT, or parquet files

This is the data that will be profiled by this application. Each file is a table in your dataset. Only one of the two options, input_files or dx_record, should be provided.

    dx_record

    Optional

    A DNAnexus Apollo Dataset (or Cohort)

    The data in this Dataset (or Cohort) will be profiled by this application.

    data_dictionary

    optional

    A CSV file

    This file indicates the relationship between the tables in input_files.

    If not provided, the table relationship will be inferred in the job.

    Tables for Inputs

For this example, there are 2 tables in your dataset:

    • patients.csv: a table with patient IDs and other clinical information of the patient

    • encounters.csv: a table of encounters (i.e. hospital visits) of all patients in patients.csv

    patients.csv

    patient_id

    name

    P0001

    John Doe

    P0002

    Jane Roe

    encounters.csv

    encounter_id

    patient_id

    E0001

    P0001

    E0002

    P0001

    E0003

    P0002

    E0004

    P0002

    In this example dataset, there are 2 patients in the patients.csv, each patient visited the hospital twice.

    Data Dictionary

    Even though data_dictionary is optional, it is crucial for cross-table functions in Data Profiler. We highly recommend creating one for your dataset.

    The data_dictionary is a CSV file that tells Data Profiler how to connect patients.csv and encounters.csv. Given this example, the linked column between these tables is patient_id. The data_dictionary can be as simple as:

    entity

    name

    type

primary_key_type

    referenced_entity_field

    relationship

    patients

    patient_id

    string

encounters

    encounter_id

    string

    There are more columns in the data_dictionary that are not mentioned in this example. However, those columns are not required. If you are interested in the full form of data_dictionary or the meaning of each column, please visit this documentation.

    There is no need to specify anything in the OUTPUTS section. Once your inputs are ready, click Start Analysis to begin.

    Job Settings

    In the Review & Start modal, you can either customize the job settings before running the applet or leave them at their default values. The settings you can modify include:

    • Job Name

    • Output Location

    • Priority

    • Spending Limit

    • Instance Type

    Once you’ve made your adjustments or are satisfied with the default settings, click Launch Analysis to start the job.

    Opening the App

    After launching the analysis, you will be redirected to the Monitor screen. From there, click the job name to view the job details.

    It may take a few minutes for the applet to be ready. To check the status, click View Log and wait for the message indicating that the applet is ready. Once you see the message, click Open Worker URL to launch the app.

    The Data Profiler is an HTTPS application on the DNAnexus Platform, which means it should be accessed via the Job URL. It typically takes a few minutes for the web interface to be ready. If you encounter any issues while visiting the Job URL, you can check the job logs for the following message:

    Logs from a job instance of Data Profiler indicating the web interface is ready

    If this line appears in your job logs, it confirms that the Data Profiler is ready to be accessed through the Job URL.

    If you attempt to click the button before the URL is ready, you may encounter a “502 Bad Gateway” error. This is not a problem— it simply means you need to wait a bit longer before the environment is fully prepared.

    Selecting the data fields to profile

    If you run Data Profiler with a DNAnexus Apollo Dataset (or Cohort), you will be able to select the specific data fields to profile. If you want to profile the whole Dataset, select all data fields and start the job by clicking on the “Start profiling” button.

    The table to select columns (data fields) to profile

    Resources

    Full Documentation

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select “Contact Support”

    3. Fill in the Subject and Message to submit a support ticket.

• You can add different sections, links, projects, etc. into the json file
  • You can also set a banner for the home page

  • If you have questions about how to use a json file, please view this section

    Overview of the Sections of a portal and matching json files:

    Example of the home.json file

    Sections in your home.json file

    In your home.json file, you have to have this as the beginning of the json:

    After that, you can customize exactly what you are wanting.

    Walk-through of Each Section:

    Banner Image:

    Items Under the Banner Image:

    • There can be as many of these as you would like

    • You can also add in tables, images, and footers

    Code For Template Projects Section:

    Code for Academy Links Section:

    Code for DNAnexus Links Section:

    Examples for Other Items to Add (not included in my example image)

    EXAMPLE: Code for Images (not banner image):

    EXAMPLE: Code for Tables:

    EXAMPLE: Code for footer:

Please note that when you are done with your json file, ensure it is in the correct format.

    Resources

    Portal Documentation

    Full Documentation

    Please email [email protected] to create a support ticket if there are technical issues.

    Example 1: Word Count (wc)

    To get started, you will build a native bash applet that will execute the venerable wc (word count) Unix command-line program on a file. In this example, you will:

    • Use the dx-app-wizard to create the skeleton of a native bash applet

    • Define the inputs and outputs of an applet

    • Use dx build to build the applet

    • Import data from a URL

    • Use dx run to run the applet

    Understanding wc

    The wc command takes one or more files as input. So that we have the same input file, please execute the following command to fetch the text from Project Gutenberg and write the contents to the local file scarlet.txt:

    Or use curl:

    By default, wc will print the three columns showing the number of lines, words, and characters of text, in that order, followed by the name of the file:

    The output from your version of wc may differ slightly as there are several implementations of the program. For instance, the preceding output is on macOS, which is the BSD version, but the applet will run on Ubuntu Linux using the GNU version. Both programs work essentially the same.

    The goal of this applet will be to accept a single file as input and capture the standard out (aka STDOUT) of wc to report the number of lines, words, and characters in the file.

    Using dx-app-wizard

    Next, you will create an applet that will accept this file as input, transfer it to a virtual machine, run wc on the file, and return the preceding output as a new file. Run the dx-app-wizard to interactively answer questions about the inputs, outputs, and runtime requirements. Start by executing the program with the -h|--help flag to read the documentation:

    As shown in the preceding usage, the name of the applet may be provided as an argument. For instance, you can run dx-app-wizard wc to answer the first question, which is the name of the applet. Note the naming conventions for the applet name, which you should also follow for naming the input and output variables:

    Because the name was provided as an argument, the prompt shows [wc]. All the prompts will show a default value that will be used if you press the Enter key. If you wish to override this value, type a new name; otherwise, press Enter.

    Next, you will be prompted for a title. The empty brackets ([]) indicate this is optional, but I will provide "Word Count":

    Likewise, the summary is optional, but I will provide one:

    Indicate the version with major, minor, and patch release:

    The input specification follows. Use the name input_file for the first input name and whatever label you like. For the class, choose file to indicate that the user must supply a valid file, and specify that this input is not optional:

    As this is the only input, press Enter when prompted for a second input and move to the output specification. To start, name the output output and use the class of file:

    There is no other output for now, so press Enter to move on to the Timeout Policy. You may choose any amount of time you like such as "1h" to indicate 1 hour:

    Next, you will choose whether to use bash or Python as the primary language of the applet. Choose bash:

    Choosing bash means that your app will execute a bash script that will use commands from the dxpy module to do things like download and upload files as well as execute any command on the runtime instance, such as custom programs you write in Python, R, C, etc. Choosing Python here means that a Python script will be executed, and it can use the same Python module to do everything the bash script does. This tutorial will only demonstrate bash apps. There is no advantage one language has over the other. You should choose whichever suits your tastes.

    During runtime, some apps may need to fetch resources from the internet or from the parent project. Neither of these will apply to this applet, so answer "no" for the next two questions:

    Lastly, you will choose a default instance type on which the applet will run. I usually start with the default value, which is a fairly modest machine. If an applet proves it needs more resources, refer to the list of instance types to choose something else:

    The wizard will finish with a listing of the files it has created:

    As noted, you will find the following structure in the directory wc:

    1. A directory for tests, mostly used internally by DNAnexus.

    2. A directory for assets like files or binaries you would like copied to the runtime instance.

    3. A JSON file describing the metadata for the applet.

    4. A documentation stub you may wish to update.

    5. Another documentation stub.

    6. A directory to place source code for the applet.

    7. The bash script template to execute the applet.

    Inspecting dxapp.json

    In the preceding step, the applet's inputs, outputs, and system requirements were written to the file dxapp.json, which is in JSON (JavaScript Object Notation) format. Open this file to inspect the contents, which begins with the basic metadata about the app:

    The inputSpec section shows that this applet takes a single argument of the type file. Update the patterns to include .txt:

    The outputSpec shows that the applet will return a file:

    The runSpec describes the runtime for the applet:

    • The default VM is Ubuntu 20.04, which includes Python v3 and R v3. You may also indicate Ubuntu 16.04, which has Python v2.

    • If you need Ubuntu 16.04 with Python v3, indicate version 1 here; otherwise, leave this 0.

    The author has more success installing Python v2 on Ubuntu 20.04 rather than running an older Linux distro.

    Finally, the regionalOptions describe the system requirements:

    You may use a text editor to alter this file at any time, after which you will need to rebuild the applet.

    Editing the Runtime Code

    As indicated in runSpec, the applet will execute the bash script src/wc.sh at runtime. The app wizard created a template that shows one method for downloading the input file and uploading the output file. Here is a modified version that removes most of the comments for the sake of brevity and adds the applet's business logic in the middle:

    • I've added this pragma to show each command as it's executed and to halt on undefined variables or failed system calls.

    • This will download the input file to a local file called input_file on the running instance.

    • Execute wc on input_file and redirect standard out to the file output.

    • This will upload the result file called output from the instance back to the project.

    • This command will link the output file as an output of the applet.

    The local variables $input_file and $output match the names used in the inputSpec and outputSpec. They will only exist at runtime.

    Creating a Project for the Applet and Data

    Applets and data must live inside a project, so create a new one either using the web interface or the command line by executing dx new project:

    Next, you will add the scarlet.txt file to the project. There are several ways you can do this. From the web interface, you can click the "Add" button, which will show you two relevant options:

    • "Upload Data": This will allow you to upload a file your local computer. You can drag and drop the file into the dialog box or use the file browser to select the file.

    • "Add Data From Server": This will launch an app that can import files accessible by a URL such as from a web address or FTP server. You should use the Project Gutenberg URL from earlier.

    You can also use the dx upload command. If you created the project using the web interface, you will first need to run dx select to select your project:

    Note the file's ID, which we will use later for the applet's input. If you use the web interface to upload, you can click the information "I" in the circle to see the file's metadata.

    From the command line, you can use dx ls with the -l|--long option to see the file ID:

    Building and Running The Applet

    It's impossible to debug this program locally, so next you will build the applet and run it. If you are in the wc directory, run dx build to build the applet; if you are in the directory above, run dx build wc to indicate the directory that contains the applet. Subsequent builds will require the use of the -f|--overwrite or -a|--archive flag to indicate what to do with the previous version. For consistency's sake, I always run with the -f flag:

    From the web interface, you can now view a web form that will allow you to execute the applet.

    You follow the same process described in the Overview of the Platform section.

    Running the Applet from the Command Line

    You can also run the applet from the command line using the applet's ID. To begin, use dx run with the -h|--help flag to see the inputs and outputs of the applet:

    Run the same command without the help flag to enter an interactive session where you can indicate the input file using the file's ID noted earlier:

    You may also specify the file on the command line:

    Notice in both instances, the input is formatted as a JSON document for submission. Copy that JSON into a file with the following contents:

    Use this file as the -f|--file input for the applet along with the -y flag to indicate you want to proceed without further confirmation and the --watch flag to enter into a watch of the applet's progress:

    The end of the job's output should look like the following:

    Run dx describe on the indicated output file ID to see the metadata about the file. Then execute dx cat to see the contents of the file, which should be the same results as when the program ran locally:

    Review

    In this chapter, you did the following:

    • Learned the structure of a native bash applet and how to use the wizard to create a new app

    • Built an app and ran it from the command line and the web interface

    • Inspected the output of an applet

    Resources

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

    Example 5: workflow

    In this example, you will learn:

    • How to accept a BAM file as a workflow input

    • Break the BAM into slices by chromosome

    • Distribute the slices in parallel to count the number of alignments in each

    Getting Started

    To begin, create a new directory called view_and_count and a workflow.wdl file.

    Here is the workflow definition you should add:

    • The name of this workflow is bam_chrom_counter.

    • The workflow accepts a single, required File input that will be called bam as it is expected to be a BAM file.

    • Use a non-input declaration to define a String value naming the Docker image that contains Samtools.

    • The first call is to the slice_bam task, which will break the BAM into one file per chromosome. The input for this task is the workflow's BAM file.

    • The scatter directive in WDL causes the actions in the block to be executed in parallel, which can lead to significant performance gains. Here, each slice file returned from the slice_bam task will be used as the input to the count_bam task.

    • The workflow defines two outputs: a BAM index file and an array of integer values representing the number of alignments in each of the BAM slices.

    Following is the slice_bam task that uses Samtools to index the input BAM file and break it into separate files for each of the 22 human chromosomes:

    • The inputs to this task are the BAM file and the name of the Docker image.

    • The command block uses triple-angle brackets because it must use the dollar sign ($) in shell code.

    • Use samtools index on the input BAM file for fast random access to the alignments.

    • The $(seq 22) syntax in bash calls the seq function to create a sequence of integer values up to 22 for the human non-sex chromosomes.

    • The samtools view command will display the alignments in BAM format for a region like "chr1" and place the output into the file slices/1.bam. Note the mix of ~ for WDL variables and $ for bash variables.

    • The runtime block allows you to define a Docker image that contains an installation of Samtools.

    • The output of this task is the BAM index, which is the given BAM file plus the suffix .bai, and the sliced alignment files.

    • The slices will be one or more files as indicated by Array[File], and they will be found using the glob function to look in the slices directory for all files with the extension .bam.

    The count_bam task is written to handle just one BAM slice:

    • This BAM input will be a slice of alignments for a given region. Naming this bam does not interfere with the bam variable in the workflow or any other task.

    • Use the samtools view command with -c|--count to count the number of alignments in the given file.

    • The output of this task uses the read_int function to read the STDOUT from the command as an integer value.

    At this point, I like to use miniwdl to check the syntax:

    As no errors are reported, I will compile this onto the DNAnexus platform:

    Finally, I will run this workflow using a sample BAM file:

    Return to the DNAnexus website to monitor the progress of the analysis.

    Placing Task Definitions in Files

    As the number of tasks increase, workflow definitions can get quite long. You can shorten the workflow.wdl by placing each task in a separate file, which also makes it easier to reuse a task in a separate workflow. To do this, create a subdirectory called tasks, and then create a file called tasks/slice_bam.wdl with the following contents:

    Also create the file tasks/count_bam.wdl with the following contents:

    Both of the preceding tasks are identical to the original definitions, but note that the files include a version that matches the version of the workflow. Change workflow.wdl as follows:

    • Use import to include WDL code from a file or URI. Note the use of the as clause to alias the imports using a different name.

    • Call task_slice_bam.slice_bam from the imported file using as to give it the same name as in the original workflow.

    • Do the same with task_count_bam.count_bam.

    Use miniwdl to check your syntax, then use dxCompiler to create an app.
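    The commands are the same as before; for example (the dxCompiler jar version, folder, and project ID are illustrative and should match your own setup):

    $ miniwdl check workflow.wdl
    $ java -jar ~/dxCompiler-2.10.2.jar compile workflow.wdl \
            -archive \
            -folder /workflows \
            -project project-xxxx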

    Review

    In this lesson, you learned how to:

    • Accept a file as a workflow input

    • Define a non-input declaration

    • Use scatter to run tasks in parallel

    • Use the output from one task as the input to another task

    • Mix ~ and $ in command blocks to dereference WDL and shell variables

    • Import WDL from external sources such as local files or remote URIs

    Resources

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

    Example 2: Word Count (wc)

    You can write the wc applet using Workflow Description Language (WDL), which is a high-level way to define and chain tasks. You will start by defining a single task, which compiles to an applet on the DNAnexus platform.

    In this example, you will:

    • Write the wc applet using WDL

    Introducing WDL

    In the bash applet, the inputs, outputs, and runtime specifications are defined in the dxapp.json file, and the code that runs lives in a separate file. WDL combines all of this into a single file. Create a new directory for your work, and then add the following to a file called wc.wdl:

    • There are several versions of WDL, and this indicates the file will use version 1.0.

    • A task in WDL will compile to an applet in DNAnexus.

    • The input block equates to the inputSpec from the previous chapter. Each input value is declared with a WDL type. Here the input is a File.

    • The command block contains the bash code that will be executed at runtime.

    • The output block equates to the outputSpec from the previous chapter. As with inputs, each output must declare a type.

    • The runtime block equates to the runSpec from the previous chapter. Here, you define that the task will use a Docker image of Ubuntu Linux 20.04.

    Validating WDL with WOMtool and miniwdl

    First, ensure you have a working Java compiler and have installed all the Java Jar files as described in Chapter 1. Use WOMtool to validate the WDL syntax:

    If you installed the Python miniwdl program, you can also use it to check the syntax. The output on success is something like a parse tree:

    To demonstrate the output on error, I'll change the word File to Fiel:

    Here is the equivalent error from WOMtool:

    The two tools are written in different languages (Java and Python) and have different stringencies of parsing and different ways of reporting errors. You may find it helpful to use both to track down errors.

    Compiling a WDL Task into an Applet

    First, use dx pwd to check if you are in your wc project; if not, use dx select to change. Now you can use the dxCompiler jar file you downloaded in Chapter 1 to compile the WDL into an applet:

    Run the new applet from the CLI with the help flag to inspect the usage:

    Whether you use bash or WDL to write an applet, the compiled result works the same for the user.

    Running the Applet

    If you look in the web interface, you should see a new wc_wdl object in the project as shown in Figure 1.

    Click on the applet to launch the user interface as shown in Figure 2. Select an input file and launch the applet.

    As with the bash version, you can launch the applet using the command line arguments:

    The output from the job will look different, but the result will be the same. You can use dx describe with the --json option to get a JSON document describing the entire job and pipe this to the jq tool to extract the output section:

    The dx cat command allows you to quickly see the contents of the output file without having to download it to your computer:

    This is the same output as from the previous chapter.

    Review

    Depending on your comfort level with WDL, you may or may not find this version simpler than the bash version. The result is the same no matter how you write the applet, so it's a matter of taste as to which you should select.

    In this chapter, you learned how to:

    • Write a WDL task

    • Use WOMtool and miniwdl to validate WDL syntax

    • Compile a WDL task into an applet

    • Use the JSON output from dx describe and jq to extract the outputs of a job

    • Use dx cat to see the contents of a file on the DNAnexus platform

    Resources

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

     dx run app-cloud_workstation --instance-type mem1_ssd2_v2_x72 --ssh -y
    docker pull biocontainers/samtools:v1.9-4-deb_cv1
    docker save -o samtools.tar.gz biocontainers/samtools:v1.9-4-deb_cv1
    dx upload samtools.tar.gz --path project-ID:/
    docker run -it biocontainers/samtools:v1.9-4-deb_cv1
    dx-app-wizard
    DNAnexus App Wizard, API v1.0.0
    Basic Metadata
    Please enter basic metadata fields that will be used to describe your app. Optional fields are denoted by options with square brackets. At the end of this wizard, the files necessary for building your app will be generated from the answers you provide.
    The name of your app must be unique on the DNAnexus platform.  After creating your app for the first time, you will be able to publish new versions using the same app name.  App names are restricted to alphanumeric characters (a-z, A-Z, 0-9), and the characters ".", "_", and "-".
    
    App Name: samtools_count_docker_bundle
    The title, if provided, is what is shown as the name of your app on the website.  It can be any valid UTF-8 string.
    Title []: Samtools Count
    The summary of your app is a short phrase or one-line description of what your app does.  It can be any UTF-8 human-readable string.
    Summary []: Count SAM/BAM alignments
    You can publish multiple versions of your app, and the version of your app is a string with which to tag a particular version.  We encourage the use of Semantic Versioning for labeling your apps (see http://semver.org/ for more details).
    
    Version [0.0.1]:
    Input Specification
    You will now be prompted for each input parameter to your app. Each parameter should have a unique name that uses only the underscore "_" and alphanumeric characters, and does not start with a number.
    
    1st input name (<ENTER> to finish): bam 
    
    Label (optional human-readable name) []: BAM File 
    
    Your input parameter must be of one of the following classes: 
    applet         array:file     array:record   file           int
    array:applet   array:float    array:string   float          record
    array:boolean  array:int      boolean        hash           string
    
    Choose a class (<TAB> twice for choices): file
    
    This is an optional parameter [y/n]: n
    2nd input name (<ENTER> to finish):
    Output Specification
    You will now be prompted for each output parameter of your app.  Each parameter should have a unique name that uses only the underscore "_" and alphanumeric characters, and does not start with a number.
    
    1st output name (<ENTER> to finish): counts 
    
    Label (optional human-readable name) []: Counts File 
    
    Choose a class (<TAB> twice for choices): file
    2nd output name (<ENTER> to finish):
    Timeout Policy
    Set a timeout policy for your app. Any single entry point of the app that runs longer than the specified timeout will fail with a TimeoutExceeded error. Enter an int greater than 0 with a single-letter suffix (m=minutes,h=hours, d=days) (e.g. "48h").
    Timeout policy [48h]:
    Template Options
    You can write your app in any programming language, but we provide templates for the following supported languages: Python, bash
    Programming language: bash
    Access Permissions
    If you request these extra permissions for your app, users will see this fact when launching your app, and certain other restrictions will apply. For more information, see https://documentation.dnanexus.com/developer/apps/app-permissions.
    
    Access to the Internet (other than accessing the DNAnexus API).
    Will this app need access to the Internet? [y/N]: n
    
    Direct access to the parent project. This is not needed if your app specifies outputs,which will be copied into the project after it's done running.
    
    Will this app need access to the parent project? [y/N]: n
    Default instance type: The instance type you select here will apply to all entry points in your app unless you override it. See https://documentation.dnanexus.com/developer/api/running-analyses/instance-types for more information.
    
    Choose an instance type for your app [mem1_ssd1_v2_x4]:
    
    *** Generating DNAnexus App Template... ***
    Your app specification has been written to the dxapp.json file. You can specify more app options by editing this file directly (see https://documentation.dnanexus.com/developer for complete documentation).
    
    Created files:
        samtools_count_docker_bundle/Readme.developer.md 
        samtools_count_docker_bundle/Readme.md 
        samtools_count_docker_bundle/dxapp.json  
        samtools_count_docker_bundle/resources/  
        samtools_count_docker_bundle/src/ 
        samtools_count_docker_bundle/src/samtools_count.sh 
        samtools_count_docker_bundle/test/  
    
    App directory created!  See https://documentation.dnanexus.com/developer for tutorials on how to modify these files, or run "dx build samtools_count" or "dx build --create-app samtools_count_docker_bundle" while logged in with dx.
    Running the DNAnexus build utility will create an executable on the DNAnexus platform.  Any files found in the resources directory will be uploaded so that they will be present in the root directory when the executable is run.
    {
      "name": "samtools_count_docker_bundle",
      "title": "Samtools Count",
      "summary": " Count SAM/BAM alignments",
      "dxapi": "1.0.0",
      "version": "0.0.1",
    "inputSpec": [
        {
          "name": "bam",
          "label": "BAM file",
          "class": "file",
          "optional": false,
          "patterns": [
            "*.bam"
          ],
          "help": ""
        }
      ],
      "outputSpec": [
        {
          "name": "counts",
          "label": "counts file",
          "class": "file",
          "patterns": [
            "*"
          ],
          "help": ""
        }
      ],
    "runSpec": {
        "timeoutPolicy": {
          "*": {
            "hours": 3
          }
        },
        "interpreter": "bash",
        "file": "src/samtools_docker.sh",
        "distribution": "Ubuntu",
        "release": "24.04",
        "version": "0"
      },
      "regionalOptions": {
        "aws:us-east-1": {
          "systemRequirements": {
            "*": {
              "instanceType": "mem1_ssd1_v2_x4"
            }
          }
        }
      }
    }
    #!/bin/bash
    
    set -exuo pipefail
    
    main() {
        echo "Value of bam: '$bam'"

        # Download the input BAM to the worker using its original file name
        dx download "$bam" -o "$bam_name"

        # Load the Docker image tarball bundled with the applet (present at / on the worker)
        docker load < "/samtools.tar.gz"

        counts_id=${bam_prefix}.counts.txt

        # Run samtools view -c inside the container, writing the counts to /home/dnanexus/${counts_id}
        docker run -v /home/dnanexus:/home/dnanexus \
            biocontainers/samtools:v1.9-4-deb_cv1 samtools view -c "/home/dnanexus/${bam_name}" > "/home/dnanexus/${counts_id}"

        # Upload the counts file and link it as the applet's output
        upload=$(dx upload "$counts_id" --brief)
        dx-jobutil-add-output counts "$upload" --class=file
    }
    
    dx build samtools_count_docker_bundle
        "runSpec": {
            "timeoutPolicy": {
                "*": {
                    "hours": 48
                }
            },
            "interpreter": "python3",
            "file": "src/python_cnvkit.py",
            "distribution": "Ubuntu",
            "release": "20.04",
            "version": "0",
            "assetDepends": [{"id": "record-GgP33b00BppJKpyyFxGpZJYf"}],
        }
    {
        "bam_tumor": [
            {
                "$dnanexus_link": "file-GFxXjV006kZVQPb20G85VXBp"
            }
        ],
        "reference": {
            "$dnanexus_link": "file-GFxXvpj06kZfP0QVKq2p2FGF"
        }
    }
    python_cnvkit.py
    #!/usr/bin/env python
    
    import os
    import dxpy
    import re
    import sys
    from typing import List
    from subprocess import getstatusoutput
    
    
    @dxpy.entry_point("main")
    def main(bam_tumor, reference):
        bam_tumor = [dxpy.DXFile(item) for item in bam_tumor] # 1
    
        reference = dxpy.DXFile(reference) # 2
        reference_name = reference.describe().get("name", "reference.cnn")
        dxpy.download_dxfile(reference.get_id(), reference_name)
    
        bam_dir = "bams"
        os.makedirs(bam_dir)
    
        bam_files = [] # 3
        for file in bam_tumor:
            desc = file.describe()
            file_id = file.get_id()
            path = os.path.join(bam_dir, desc.get("name", file_id))
            dxpy.download_dxfile(file_id, path) # 4
            bam_files.append(path)
    
        out_dir = "cnvkit-out"
        cmd = (
            f"cnvkit.py batch {' '.join(bam_files)} "
            f"-r {reference_name} "
            f"-p $(expr $(nproc) - 1) "
            f"-d {out_dir} --scatter"
        )
        print(cmd)
    
        rv, out = getstatusoutput(cmd) # 5
        if rv != 0:
            sys.exit(out)
    
        out_files = [os.path.join(out_dir, file) for file in os.listdir(out_dir)] # 6
        print(f'out_files = {",".join(out_files)}')
    
        return {
            "cns": upload("\.call\.cns$", out_files), # 7
            "cns_filtered": upload("(?<!\.call)\.cns$", out_files),
            "plot": upload("-scatter.png$", out_files),
        }
    
    
    def upload(pattern: str, paths: List[str]) -> List[str]:
        """Upload files matching a pattern and return DX link"""
    
        regex = re.compile(pattern) # 8
        return [
            dxpy.dxlink(dxpy.upload_local_file(file)) # 9
            for file in filter(regex.search, paths) # 10
        ]
    
    
    dxpy.run()
    Job Log
    -------
    Watching job job-GgP7Z30071x73vpBzXK1jk7X. Press Ctrl+C to stop watching.
    * CNVKit (python_cnvkit:main) (running) job-GgP7Z30071x73vpBzXK1jk7X
      kyclark 2024-02-27 17:10:52 (running for 0:01:57)
    2024-02-27 17:13:28 CNVKit INFO Logging initialized (priority)
    2024-02-27 17:13:28 CNVKit INFO Logging initialized (bulk)
    2024-02-27 17:13:34 CNVKit INFO Downloading bundled file cnvkit_asset.tar.gz
    2024-02-27 17:14:02 CNVKit STDOUT >>> Unpacking cnvkit_asset.tar.gz to /
    2024-02-27 17:14:02 CNVKit STDERR tar: Removing leading `/' from member names
    2024-02-27 17:15:36 CNVKit INFO Setting SSH public key
    2024-02-27 17:15:39 CNVKit STDOUT dxpy/0.369.0
    (Linux-5.15.0-1053-aws-x86_64-with-glibc2.29) Python/3.8.10
    2024-02-27 17:15:40 CNVKit STDOUT Invoking main with {'bam_tumor':
    [{'$dnanexus_link': 'file-GFxXjV006kZVQPb20G85VXBp'}], 'reference':
    {'$dnanexus_link': 'file-GFxXvpj06kZfP0QVKq2p2FGF'}}
    2024-02-27 17:16:16 CNVKit STDOUT Running "cnvkit.py batch
    bams/HCC1187_1x_tumor_markdup.bam -r reference.cnn -p $(expr $(nproc) - 1) -d
    cnvkit-out --scatter"
    2024-02-27 17:19:57 CNVKit STDOUT out_files = {",".join(out_files)}
    * CNVKit (python_cnvkit:main) (done) job-GgP7Z30071x73vpBzXK1jk7X
      kyclark 2024-02-27 17:10:52 (runtime 0:07:54)
      Output: cns = [ file-GgP7jF80K7VPVpkkkzyqBK2Q ]
              cns_filtered = [ file-GgP7jF80K7V7q1jJVPYJj0pg, 
                               file-GgP7jFQ0K7VFfb7BJ3YbYy60 ]
              plot = [ file-GgP7jFQ0K7V115GPfGYB2j6b ]
    {
    "order": ["banner_image", "template_projects", "academy_links", "dnanexus_links"],
    "components": {
    "banner_image": {
         "type": "image",
         "id": "banner_image",
         "src": "#banner_image.png"
       },
     "template_projects": {
         "type": "project",
         "id": "template_projects",
         "title": "Template Projects",
         "query": {
           "tags": "Template Course",
           "limit": 5
         },
         "columns":[
           {
             "property": "name",
             "label": "Name"
           }, 
            {
              "property": "level",
              "formatter": "capitalize",
              "label": "Access"
            }
        ], 
        "viewMore": "/communities/academy_curriculum/projects",
            "minWidth": "400px"
    }, 
    
    "academy_links": {
         "type": "link",
         "id": "academy_links",
         "title": "DNAnexus Academy Links",
         "links": [
           {
             "name": "Academy Documentation",
             "href": "https://academy.dnanexus.com"
           }
         ], 
         "minWidth": "400px"
        }, 
    
    "dnanexus_links": {
         "type": "link",
         "id": "dnanexus_links",
         "title": "DNAnexus Links",
         "links": [
           {
             "name": "DNAnexus Website",
             "href": "https://www.dnanexus.com"
           },
           {
             "name": "DNAnexus Documentation",
             "href": "https://documentation.dnanexus.com"
           }
         ],
         "minWidth": "400px"
        }
    }
    }
    {
    "order": [ #LIST #HERE ],
    "components": {
      #FILL WITH SECTIONS HERE 
      }
    }
    "banner_image": {
         "type": "image", 
         "id": "banner_image", #keep the ids lower case and with no spaces 
         "src": "#banner_image.png" #you will need an image when you upload, chnange this name to whatever you want to call it, but leave the # in front of it
       },
    "template_projects": {
         "type": "project",
         "id": "template_projects",  #keep the ids lower case and with no spaces 
         "title": "Template Projects", #this is what will show up on the portal as the name 
         "query": {
           "tags": "Template Course", #this is the tag for my template course projects
           "limit": 5 #this is how many of the courses I want to show up
         },
         "columns":[ #these are the columns you want viewable as part of your table. I picked name and access level. 
           {
             "property": "name",
             "label": "Name"
           }, 
            {
              "property": "level",
              "formatter": "capitalize",
              "label": "Access"
            }
        ], 
        "viewMore": "/communities/academy_curriculum/projects", #this sets the parameter for a list of the rest of the projects with the tag that I have selected.  
            "minWidth": "400px" #this sets the width on the portal home page for this section. If you want them to take up the whole page, you do not have to have this. I set it to 400 so that I could add multiple columns. If you do not set this, you will have these as rows, one table after another. 
    }, 
    "academy_links": {
         "type": "link",
         "id": "academy_links",  #keep the ids lower case and with no spaces 
         "title": "DNAnexus Academy Links", #title that shows up on the home page 
         "links": [
           {
             "name": "Academy Documentation", #name that shows up for the link 
             "href": "https://academy.dnanexus.com" #link I want used 
           }
         ], 
         "minWidth": "400px" #this sets the width on the portal home page for this section. If you want them to take up the whole page, you do not have to have this. I set it to 400 so that I could add multiple columns. If you do not set this, you will have these as rows, one table after another. 
        }, 
    "dnanexus_links": {
         "type": "link",
         "id": "dnanexus_links",  #keep the ids lower case and with no spaces 
         "title": "DNAnexus Links", #title that shows up for the home page 
         "links": [
           {
             "name": "DNAnexus Website", #name that shows up for the link
             "href": "https://www.dnanexus.com" #link I want used 
           },
           {
             "name": "DNAnexus Documentation", #name that shows up for the link
             "href": "https://documentation.dnanexus.com" #link I want used 
           }
         ],
         "minWidth": "400px" #this sets the width on the portal home page for this section. If you want them to take up the whole page, you do not have to have this. I set it to 400 so that I could add multiple columns. If you do not set this, you will have these as rows, one table after another. 
        }
    }
    "example_image": {
         "type": "image",
         "id": "example-image", #id for order purposes
         "src": "https://example.com/image.png", #you can set the source for this as a public link or with a "#" if you have the image locally. 
         "alt": "Alt text" #text
       },
    "table-example": {
         "type": "markdown", #format for the table 
         "id": "table_example", #id for the order of content 
         "title": "Table Example",
         "content": "LIST MARKDOWN CONTENT HERE FOR TABLE", #this will need to be your code for a table 
         "minWidth": "100px"
       },
    "footer": {
           "name": "DNAnexus Help",
           "href": "https://www.dnanexus.com/help"
         },
         "minWidth": "300px"

    $ wget -O scarlet.txt https://www.gutenberg.org/cache/epub/33/pg33.txt
    $ curl -o scarlet.txt https://www.gutenberg.org/cache/epub/33/pg33.txt
    $ wc scarlet.txt
        8590   86055  513523 scarlet.txt
    $ dx-app-wizard -h
    usage: dx-app-wizard [-h] [--json-file JSON_FILE] [--language LANGUAGE]
                         [--template {basic,parallelized,scatter-process-gather}]
                         [name]
    
    Create a source code directory for a DNAnexus app. You will be prompted for
    various metadata for the app as well as for its input and output
    specifications.
    
    positional arguments:
      name                  Name of your app
    
    optional arguments:
      -h, --help            show this help message and exit
      --json-file JSON_FILE
                            Use the metadata and IO spec found in the given file
      --language LANGUAGE   Programming language of your app
      --template {basic,parallelized,scatter-process-gather}
                            Execution pattern of your app
    $ dx-app-wizard wc
    DNAnexus App Wizard, API v1.0.0
    
    Basic Metadata
    
    Please enter basic metadata fields that will be used to describe your app.
    Optional fields are denoted by options with square brackets.  At the end of
    this wizard, the files necessary for building your app will be generated from
    the answers you provide.
    
    The name of your app must be unique on the DNAnexus platform.  After
    creating your app for the first time, you will be able to publish new versions
    using the same app name.  App names are restricted to alphanumeric characters
    (a-z, A-Z, 0-9), and the characters ".", "_", and "-".
    App Name [wc]:
    The title, if provided, is what is shown as the name of your app on
    the website.  It can be any valid UTF-8 string.
    Title []: Word Count
    The summary of your app is a short phrase or one-line description of
    what your app does.  It can be any UTF-8 human-readable string.
    Summary []: Find the number of lines, words, and characters in a file
    You can publish multiple versions of your app, and the version of your
    app is a string with which to tag a particular version.  We encourage the use
    of Semantic Versioning for labeling your apps (see http://semver.org/ for more
    details).
    Version [0.0.1]: 0.1.0
    Input Specification
    
    You will now be prompted for each input parameter to your app.  Each parameter
    should have a unique name that uses only the underscore "_" and alphanumeric
    characters, and does not start with a number.
    
    1st input name (<ENTER> to finish): input_file
    Label (optional human-readable name) []: Input file
    Your input parameter must be of one of the following classes:
    applet         array:file     array:record   file           int
    array:applet   array:float    array:string   float          record
    array:boolean  array:int      boolean        hash           string
    
    Choose a class (<TAB> twice for choices): file
    This is an optional parameter [y/n]: n
    Output Specification
    
    You will now be prompted for each output parameter of your app.  Each
    parameter should have a unique name that uses only the underscore "_" and
    alphanumeric characters, and does not start with a number.
    
    1st output name (<ENTER> to finish): output
    Label (optional human-readable name) []: Output file
    Choose a class (<TAB> twice for choices): file
    Timeout Policy
    
    Set a timeout policy for your app. Any single entry point of the app
    that runs longer than the specified timeout will fail with a TimeoutExceeded
    error. Enter an int greater than 0 with a single-letter suffix (m=minutes,
    h=hours, d=days) (e.g. "48h").
    Timeout policy [48h]: 1h
    Template Options
    
    You can write your app in any programming language, but we provide
    templates for the following supported languages: Python, bash
    Programming language: bash
    Access to the Internet (other than accessing the DNAnexus API).
    Will this app need access to the Internet? [y/N]: n
    
    Direct access to the parent project. This is not needed if your app
    specifies outputs,     which will be copied into the project after it's done
    running.
    Will this app need access to the parent project? [y/N]: n
    Default instance type: The instance type you select here will apply to
    all entry points in your app unless you override it. See https://documenta
    tion.dnanexus.com/developer/api/running-analyses/instance-types for more
    information.
    Choose an instance type for your app [mem1_ssd1_v2_x4]:
    *** Generating DNAnexus App Template... ***
    
    Your app specification has been written to the dxapp.json file. You can
    specify more app options by editing this file directly (see
    https://documentation.dnanexus.com/developer for complete documentation).
    
    Created files:
         wc/Readme.developer.md
         wc/Readme.md
         wc/dxapp.json
         wc/resources/
         wc/src/
         wc/src/wc.sh
         wc/test/
    
    App directory created!  See https://documentation.dnanexus.com/developer for
    tutorials on how to modify these files, or run "dx build wc" or "dx build
    --create-app wc" while logged in with dx.
    
    Running the DNAnexus build utility will create an executable on the DNAnexus
    platform.  Any files found in the resources directory will be uploaded
    so that they will be present in the root directory when the executable is run.
    $ find wc
    wc 
    wc/test # 1
    wc/resources #2
    wc/dxapp.json # 3
    wc/Readme.md # 4
    wc/Readme.developer.md # 5
    wc/src # 6
    wc/src/wc.sh # 7
    {
      "name": "wc",
      "title": "Word Count",
      "summary": "Find the number of lines, words, and characters in a file",
      "dxapi": "1.0.0",
      "version": "0.1.0",
      "inputSpec": [
        {
          "name": "input_file",
          "label": "Input file",
          "class": "file",
          "optional": false,
          "patterns": [
            **"*.txt"**
          ],
          "help": ""
        }
      ],
      "outputSpec": [
        {
          "name": "output",
          "label": "Output",
          "class": "file",
          "patterns": [
            "*"
          ],
          "help": ""
        }
      ],
      "runSpec": {
        "timeoutPolicy": {
          "*": {
            "hours": 1
          }
        },
        "interpreter": "bash",
        "file": "src/wc.sh",
        "distribution": "Ubuntu",
        "release": "20.04", 
        "version": "0" 
      },
      "regionalOptions": {
        "aws:us-east-1": {
          "systemRequirements": {
            "*": {
              "instanceType": "mem1_ssd1_v2_x4"
            }
          }
        }
      }
    }
    #!/bin/bash
    
    set -exo pipefail 
    
    main() {
        echo "Value of input_file: '$input_file'"
    
        dx download "$input_file" -o input_file 
    
        wc input_file > output.txt 
    
        output_id=$(dx upload output.txt --brief) 
    
        dx-jobutil-add-output output "$output_id" --class=file 
    }
    $ dx new project wc
    Created new project called "wc" (project-GGyG8K80K9ZKzkX812yY893V)
    Switch to new project now? [y/N]: y
    $ dx select project-GGyG8K80K9ZKzkX812yY893V
    Selected project project-GGyG8K80K9ZKzkX812yY893V
    $ dx upload scarlet.txt
    [===========================================================>]
    Uploaded 513,523 of 513,523 bytes (100%) scarlet.txt
    ID                    file-GGyG8z00K9Z9GQ9jG4qB4gpX
    Class                 file
    Project               project-GGyG8K80K9ZKzkX812yY893V
    Folder                /
    Name                  scarlet.txt
    State                 closing
    Visibility            visible
    Types                 -
    Properties            -
    Tags                  -
    Outgoing links        -
    Created               Tue Oct  4 16:40:44 2022
    Created by            kyclark
    Last modified         Tue Oct  4 16:40:47 2022
    Media type
    archivalState         "live"
    cloudAccount          "cloudaccount-dnanexus"
    $ dx ls -l
    Project: wc (project-GGyG8K80K9ZKzkX812yY893V)
    Folder : /
    State   Last modified       Size      Name (ID)
    closed  2022-10-04 16:40:48 501.49 KB scarlet.txt (file-GGyG8z00K9Z9GQ9jG4qB4gpX)
    $ dx build -f
    {"id": "applet-GGyGVP00K9Z4Z6VgBgkk0b06"}
    $ dx run applet-GGyGVP00K9Z4Z6VgBgkk0b06 -h
    usage: dx run applet-GGyGVP00K9Z4Z6VgBgkk0b06 [-iINPUT_NAME=VALUE ...]
    
    Applet: Word Count
    
    Find the number of lines, words, and characters in a file
    
    Inputs:
      Input file: -iinput_file=(file)
    
    Outputs:
      Output: output (file)
    $ dx run applet-GGyGVP00K9Z4Z6VgBgkk0b06
    Entering interactive mode for input selection.
    
    Input:   Input file (input_file)
    Class:   file
    
    Enter file ID or path (<TAB> twice for compatible files in current directory,
    '?' for more options)
    input_file: file-GGyG8z00K9Z9GQ9jG4qB4gpX
    
    Using input JSON:
    {
        "input_file": {
            "$dnanexus_link": "file-GGyG8z00K9Z9GQ9jG4qB4gpX"
        }
    }
    
    Confirm running the executable with this input [Y/n]: n
    $ dx run applet-GGyGVP00K9Z4Z6VgBgkk0b06 -iinput_file=file-GGyG8z00K9Z9GQ9jG4qB4gpX
    
    Using input JSON:
    {
        "input_file": {
            "$dnanexus_link": "file-GGyG8z00K9Z9GQ9jG4qB4gpX"
        }
    }
    
    Confirm running the executable with this input [Y/n]: n
    $ cat inputs.json
    {
        "input_file": {
            "$dnanexus_link": "file-GGyG8z00K9Z9GQ9jG4qB4gpX"
        }
    }
    $ dx run applet-GGyGVP00K9Z4Z6VgBgkk0b06 -f inputs.json -y --watch
    
    Using input JSON:
    {
        "input_file": {
            "$dnanexus_link": "file-GGyG8z00K9Z9GQ9jG4qB4gpX"
        }
    }
    
    Calling applet-GGyGVP00K9Z4Z6VgBgkk0b06 with output destination
      project-GGyG8K80K9ZKzkX812yY893V:/
    
    Job ID: job-GGyGZPQ0K9Z7PXybBp52P3xF
    
    Job Log
    -------
    Watching job job-GGyGZPQ0K9Z7PXybBp52P3xF. Press Ctrl+C to stop watching.
    2022-10-04 17:08:36 Word Count STDERR + wc input_file
    2022-10-04 17:08:36 Word Count STDERR ++ dx upload output --brief
    2022-10-04 17:08:37 Word Count STDERR + output=file-GGyGf100qZbvFjb3GqfG6kzj
    2022-10-04 17:08:37 Word Count STDERR + dx-jobutil-add-output output
    file-GGyGf100qZbvFjb3GqfG6kzj --class=file
    $ dx cat file-GGyGf100qZbvFjb3GqfG6kzj
      8590  86055 513523 input_file
    version 1.0
    
    workflow bam_chrom_counter { 
        input {
            File bam 
        }
    
        String docker_img = "quay.io/biocontainers/samtools:1.12--hd5e65b6_0" 
    
        call slice_bam {
            input : bam = bam, 
                    docker_img = docker_img
        }
    
        scatter (slice in slice_bam.slices) { 
            call count_bam {
                input: bam = slice,
                       docker_img = docker_img
            }
        }
    
        output { 
            File bai = slice_bam.bai
            Array[Int] count = count_bam.count
        }
    }
    task slice_bam {
        input { 
            File bam
            String docker_img
        }
    
        command <<< 
        set -ex
        samtools index "~{bam}" 
        mkdir slices
    
        for i in $(seq 22); do 
            samtools view -b -o "slices/$i.bam" "~{bam}" "chr${i}" 
        done
        >>>
    
        runtime { 
            docker: docker_img
        }
    
        output { 
            File bai = "~{bam}.bai"
            Array[File] slices = glob("slices/*.bam") 
        }
    }
    task count_bam {
        input {
            File bam 
            String docker_img
        }
    
        command <<<
            samtools view -c "~{bam}" 
        >>>
    
        runtime {
            docker: docker_img
        }
    
        output {
            Int count = read_int(stdout()) 
        }
    }
    $ miniwdl check workflow.wdl
    workflow.wdl
        workflow bam_chrom_counter
            call slice_bam
            scatter slice
                call count_bam
        task count_bam
        task slice_bam
    $ java -jar ~/dxCompiler-2.10.2.jar compile workflow.wdl \
            -archive \
            -folder /workflows \
            -project project-GFPQvY007GyyXgXGP7x9zbGb
    workflow-GFqF27j07GyZ33JX4vzqgK32
    $ dx run workflow-GFqF27j07GyZ33JX4vzqgK32 \
    > -istage-common.bam=file-G8V38KQ0zQ713kZGF6xQQvjJ -y
    
    Using input JSON:
    {
        "stage-common.bam": {
            "$dnanexus_link": "file-G8V38KQ0zQ713kZGF6xQQvjJ"
        }
    }
    
    Calling workflow-GFqF27j07GyZ33JX4vzqgK32 with output destination
      project-GFPQvY007GyyXgXGP7x9zbGb:/
    
    Analysis ID: analysis-GFqF7Zj07GyZQ957Jy822gQY
    version 1.0
    
    task slice_bam {
        input {
            File bam
            String docker_img
        }
    
        command <<<
        set -ex
        samtools index "~{bam}"
        mkdir slices
    
        for i in $(seq 22); do
            samtools view -b -o "slices/$i.bam" "~{bam}" "chr${i}"
        done
        >>>
    
        runtime {
            docker: docker_img
        }
    
        output {
            File bai = "~{bam}.bai"
            Array[File] slices = glob("slices/*.bam")
        }
    }
    version 1.0
    
    task count_bam {
        input {
            File bam
            String docker_img
        }
    
        command <<<
            samtools view -c "~{bam}"
        >>>
    
        runtime {
            docker: docker_img
        }
    
        output {
            Int count = read_int(stdout())
        }
    }
    version 1.0
    
    import "./tasks/slice_bam.wdl" as task_slice_bam 
    import "./tasks/count_bam.wdl" as task_count_bam
    
    workflow bam_chrom_counter {
        input {
            File bam
        }
    
        String docker_img = "quay.io/biocontainers/samtools:1.12--hd5e65b6_0"
    
        call task_slice_bam.slice_bam as slice_bam { 
            input : bam = bam,
                    docker_img = docker_img
        }
    
        scatter (slice in slice_bam.slices) {
            call task_count_bam.count_bam as count_bam { 
                input: bam = slice,
                       docker_img = docker_img
            }
        }
    
        output {
            File bai = slice_bam.bai
            Array[Int] count = count_bam.count
        }
    }
    version 1.0 
    
    task wc_wdl { 
        input {
            File input_file 
        }
    
        command {
            wc ~{input_file} > wc.txt 
        }
    
        output {
            File outfile = "wc.txt" 
        }
    
        runtime {
            docker: "ubuntu:20.04" 
        }
    }
    $ java -jar ~/womtool.jar validate wc.wdl
    Success!
    $ miniwdl check wc.wdl
    wc.wdl
        task wc
    $ miniwdl check wc.wdl
    (wc.wdl Ln 13 Col 9) Unknown type Fiel
                Fiel outfile = "wc.txt"
                ^^^^^^^^^^^^^^^^^^^^^^^
    java -jar ~/womtool.jar validate wc.wdl
    Failed to process task definition 'wc' (reason 1 of 1):
    No struct definition for 'Fiel' found in available structs: []
    make: *** [validate] Error 1
    $ java -jar ~/dxCompiler.jar compile wc.wdl
    [warning] Project is unspecified...using currently selected project
    project-GGyG8K80K9ZKzkX812yY893V
    applet-GJ3PxPj0K9Z68x1Y5zK4236B
    $ dx run applet-GJ3PxPj0K9Z68x1Y5zK4236B -h
    usage: dx run applet-GJ3PxPj0K9Z68x1Y5zK4236B [-iINPUT_NAME=VALUE ...]
    
    Applet: wc_wdl
    
    Inputs:
      input_file: -iinput_file=(file)
    
     Reserved for dxCompiler
      overrides___: [-ioverrides___=(hash)]
    
      overrides______dxfiles: [-ioverrides______dxfiles=(file)
        [-ioverrides______dxfiles=... [...]]]
    
    Outputs:
      outfile: outfile (file)
    $ dx run applet-GJ3PxPj0K9Z68x1Y5zK4236B \
    > -iinput_file=file-GGyG8z00K9Z9GQ9jG4qB4gpX -y --watch
    
    Using input JSON:
    {
        "input_file": {
            "$dnanexus_link": "file-GGyG8z00K9Z9GQ9jG4qB4gpX"
        }
    }
    
    Calling applet-GJ3PxPj0K9Z68x1Y5zK4236B with output destination
      project-GGyG8K80K9ZKzkX812yY893V:/
    
    Job ID: job-GJ3Q0V80K9Z54K2X9Bzf2v0B
    
    Job Log
    -------
    Watching job job-GJ3Q0V80K9Z54K2X9Bzf2v0B. Press Ctrl+C to stop watching.
    $ dx describe job-GJ3Q0V80K9Z54K2X9Bzf2v0B --json | jq .output
    {
      "outfile": {
        "$dnanexus_link": "file-GJ3Q10Q0b0qvyB6fG7pgx0bX"
      }
    }
    $ dx cat file-GJ3Q10Q0b0qvyB6fG7pgx0bX
      8590  86055 513523 /home/dnanexus/inputs/input1217954139984307828/scarlet.txt

    Cloud Workstation

    The cloud_workstation app provides a Linux (Ubuntu) terminal running in the cloud, which is the same base execution environment for all DNAnexus apps. This is used most often for testing application code and building Docker images. I especially favor the cloud workstation whenever I need to work with large data files that I don't wish to copy to my local disk (laptop) as the transfer speeds are internal to AWS rather than over the open internet. If you have previously been limited to HPC environments where sysadmins determine what software may or may not be installed, you will find that you have sudo privileges to install any software you like, via apt, downloading pre-built binaries, or building from source code.

    In order to run the cloud workstation, you will need to set up an SSH key pair. You can do this by running the following command:
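    The dx-toolkit provides dx ssh_config, which walks you through generating or selecting a key pair and associating it with your account:

    $ dx ssh_config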

    You can view the usage for the app by running it with the help flag.
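    The help flag prints the app's inputs and outputs:

    $ dx run app-cloud_workstation --help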

    As noted in the usage, the default maximum session length is one hour, but it can be changed if you need more time.
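    For example, here is one way to launch the workstation with a two-hour session:

    $ dx run app-cloud_workstation -imax_session_length=2h -y --ssh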

    In the preceding command, I also use the following flags from dx run:

    • -imax_session_length="2h": changes the max session length to 2 hours

    • -y|--yes: Do not ask for confirmation before launching job

    • --ssh: Configure the job to allow SSH access and connect to it after launching. Defaults --priority to high.


    The app produces no outputs. In the following sections, I want to focus on the inputs.

    Maximum Session Length

    As noted previously, the default maximum session length is one hour.

    You can set the session to a different length with the max_session_length input shown previously; for example, -imax_session_length=2h sets the limit to 2 hours.

    When on the workstation, you can find how much time is left using dx-get-timeout:
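    Running it with no arguments prints the time remaining before the job reaches its limit:

    $ dx-get-timeout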

    If you would like to extend the time left, use dx-set-timeout with the same values shown previously for session length. For example, you can set the timeout back to 2 hours and verify that you now have 2 hours left:
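    For example:

    $ dx-set-timeout 2h
    $ dx-get-timeout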

    Input Files

    You can initiate the app with any files you want copied to the instance:
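    A sketch, assuming the app's file-array input is named fids; replace the file ID with one from your project:

    $ dx run app-cloud_workstation -ifids=file-xxxx -y --ssh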

    One of the main use cases for the cloud workstation is working with large files, and I will mostly use dx download on the instance to download what I want. An especially important case is when I want to download a file to STDOUT rather than to a local file, in which case I would not want to initiate the app using this input. For example, when dealing with a tarball of an entire Illumina BCL run directory, I would prefer to download to STDOUT and pipe this into tar:
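    A minimal sketch with a hypothetical file ID; dx cat streams the file to STDOUT so tar can unpack it as it arrives (add -z if the tarball is gzip-compressed):

    $ dx cat file-xxxx | tar -xvf -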

    The alternative would require at least twice the disk space (to download the tarball and then expand the contents).

    Snapshot

    You can save the state of a workstation---called a "snapshot"---and start a new workstation using that saved state:

    For instance, you may go through a lengthy build of various packages to create the environment you need to run some application that will be lost when the workstation stops.

    To demonstrate, I will show that the Python module "pandas" is not installed by default:
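    For example, attempting to import it fails:

    $ python3 -c "import pandas"
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    ModuleNotFoundError: No module named 'pandas'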

    I use python3 -m pip install pandas to install the module, then dx-create-snapshot to save the state of the machine:
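    The two commands, in order:

    $ python3 -m pip install pandas
    $ dx-create-snapshot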

    I can use the file ID of the snapshot to reconstitute my environment:

    Now I find that "pandas" does exist on the image:

    You can use a snapshot file ID as an asset for native applets.

    Instance Type

By default, this app will choose an 8-core instance type such as "mem1_ssd1_v2_x8" (16G RAM, 200G disk) for AWS:us-east-1. This is usually adequate for my needs, but if I need more memory or disk space, I can specify any valid instance type with the --instance-type argument:

    This is actually an argument to dx run, not the cloud workstation app. You can use this argument with any app to override the default instance chosen by the app developer.

    Running Cloud Workstation

    When the app secures an instance, you will be greeted by the following messages. The first shows the job ID, instance type, project ID, and the workspace container:

    The next part explains that you are running the terminal multiplexer:

This means that the first time you press Ctrl-A (for example, to jump to the beginning of the line), Byobu will display the following configuration screen, prompting you to choose whether to use Screen or Emacs mode:

If you choose Screen mode, then Byobu will emulate GNU Screen keystrokes, such as:

    • Ctrl-A, N: Next window

    • Ctrl-A, C: Create window

    • Ctrl-A, ": show list of windows

    The next message is perhaps the most important:

    This means that if you lose your connection to the workstation, the job will still continue running until you manually terminate it or the maximum session length is reached. For instance, you may lose your internet connection or accidentally close your terminal application. Also, your connection will be lost after an extended period of inactivity. To reconnect, use dx find jobs to find the job ID of the cloud workstation, and then use dx ssh <job-id> to pick up the Byobu session with all your work and windows in the same state.
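In practice, reconnecting looks something like this (replace job-XXXX with the cloud workstation's job ID reported by dx find jobs):

$ dx find jobs
$ dx ssh job-XXXX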

    Next, the message recommends you press F1 to read more about Byobu and how to switch screens:

    Finally, the message reminds you that you have sudo privileges to install anything you like. The dx-toolkit is also installed, so you can run all dx commands:

The preceding tip to use htop is especially useful. When developing application code, I will typically choose an instance type that I estimate is appropriate for the task. I will download sample input files, install all the required software, run the commands needed for the app, then open a new screen (Ctrl-A, C) and run htop there to see resource usage.

    This tip is also useful once you learn to build and run apps. You can shell into a running job using dx ssh <job-id> and connect to Byobu. To see how the system is performing in real time to a given input, you can use Ctrl-A, C to open a new screen to run htop.

    The cloud workstation comes with several programming languages installed:

    • bash 5.x

    • Python 3.x

    • R 3.x

    • Perl 5.x

Note that on the workstation you are not logged in as your DNAnexus username but rather as the dnanexus user:

    This is not to be confused with your DNAnexus ID:

    Relationship to Parent Project

Like any job, a cloud workstation must be run in the context of a DNAnexus project; however, if I execute dx ls on the workstation, I will not see the contents of the project. This is because a containing workspace is created for the job, which I can see as the "Current workspace" value in dx env:

    I can see more details by searching the workstation's environment for all the variables starting with DX:

    The $DX_PROJECT_CONTEXT_ID variable contains the project ID:

I can use this variable to see the parent project:

Any files left on the workstation after termination will be permanently destroyed. If I use dx upload to save my work, it will go into the job's container workspace, not the parent project. To resolve this, I use the $DX_PROJECT_CONTEXT_ID variable to upload an output file to a results folder in the parent project:

Alternatively, I can unset the DX_WORKSPACE_ID variable and change directories into $DX_PROJECT_CONTEXT_ID:

After the preceding command, dx ls and dx upload will reference the parent project rather than the container workspace.

    The ttyd app runs a similar Linux terminal in the browser. Here are some differences to note:

    • You will enter as the root user.

• Commands like dx ls and dx upload will default to the project, not a container workspace.

    • There is no maximum session length, so ttyd runs until manually terminated. This can be costly if you forget to shut down the terminal.

    Resources

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

    Importing Nf-Core

    Import via the User Interface (UI)

Here we will import the nf-core Sarek pipeline from GitHub to demonstrate the functionality, but you can import any Nextflow pipeline from GitHub, not just nf-core ones!

Go to a DNAnexus project. Click 'Add' and, in the drop-down menu, select 'Import Pipeline/Workflow'.

    Next enter the required information (see below) and click 'Start Import'

The GitHub URL is the URL of the Sarek GitHub repo (not what is in 'Clone' in the repo).

    Make sure there is no slash after 'sarek' in the URL as it will cause the importer to fail.

    Choose your folder in the USERS folder to output the applet to.

    To see the possible releases to use, in the github project click 'Tags'. If you leave this part blank it will use the 'main' branch for that repo.

    Click the 'Monitor' tab in your project to see the running/finished import job

You should see your applet in the output folder that you specified in your project

    You can see the version of dxpy that it was built with by looking at the job log for the import job

    To do this click 'View Log' on the right hand side of the screen

    The job log shows that the version of dxpy used here is dxpy v0.369.0

    Test run the nfcore pipeline from the UI

We will run the test profile for Sarek, which should take 40 minutes to 1 hour to run. The test profile inputs are the Nextflow outdir and -profile test,docker.

    1. Click one of the sarek applets that you created

2. Choose the platform output location for your results.

      Click on 'Output to' then make a folder or choose an existing folder. I choose the outputs folder.

3. Click 'Next'

    Output directory considerations

    1. Specify the nextflow output directory.

This is a directory local to the machine that Nextflow will be running on, not a DNAnexus path.

The outdir path must start with ./ or have no slashes in front of it so that the executor will be able to make this folder where it is running on the head node. For example, ./results and results are both valid, but /results or things like dx://project-xx:/results will not produce output in your project. Once the DNAnexus Nextflow executor detects that all files have been written to this folder (and thus all subjobs have completed), it will copy this folder to the specified job destination on the platform. In the event that the pipeline fails before completion, this folder will not be written to the project.

    Here I have chosen to place the nextflow output files in a directory on the head node of the run named ./test. This creates an outdir called test.

    Thus once this job completes, my results will be in dx://project-xxx:/outputs/test

    More details about this are found in our Documentation.

    Where test is the folder that was copied from the head node of the Nextflow run to the destination that I specified for it on platform.

2. Scroll down and in 'Nextflow Options', 'Nextflow Run Options'

      type -profile test,docker

You must use Docker for all Nextflow pipelines run on DNAnexus. Every nf-core pipeline has a Docker profile in its nextflow.config file. You need to specify -profile docker in the Nextflow run options ('Nextflow Run Options' in the UI, -inextflow_run_opts in the CLI) to get it to use Docker containers for each process.

3. Then click 'Start Analysis'. You will be brought to this screen

    Go to the Monitor tab to see your running job.

Note! The estimated cost per hour is the cost to run the head node only! Each of the Nextflow processes (subjobs) will run on its own instance with its own cost.

    Import via the CLI

    Select a project to build the applet in

    and choose the number associated with your project.

    Or select your project using its name or project ID

    Replace the folder name with your folder name

    This will place the sarek applet in a folder called sarek_v3.4.0_cli_import in the /USERS/FOLDERNAME folder in the project.

    You can see the job running/completed in the Monitor tab of your project.

    If you are using a private github repository, you can supply a git credentials file to dx build using the --git-credentials option. The git credentials file has the following format.

It must be stored in a project on the platform. For more information on this file, see the documentation.
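For example, a build against a private repository could look like the following (the repository URL and the platform path of the credentials file are placeholders):

dx build --nextflow \
  --repository https://github.com/your-org/your-private-pipeline \
  --git-credentials project-ID:/USERS/FOLDERNAME/git_credentials \
  --destination project-ID:/USERS/FOLDERNAME/private_pipeline_import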

    Build via the CLI from a Local Folder

    Build the Nextflow pipeline from a folder on your local machine

This approach is useful for building your own Nextflow pipelines into applets, and for pipelines that you do not have in a GitHub repository.

    It is also useful if you need to alter something from a public repo locally (e.g. change some code in a file to fix a bug without fixing it in the public repo) and want to build using the locally updated directory instead of the git repo.

Additionally, if you want to use the most up-to-date dxpy version, you will need to use this approach. Sometimes the workers executing the remote repository builds can be a version or two behind the latest release of dxpy. For instance, you may want the latest version of dxpy if the Nextflow executor bundled with an older dxpy version has a bug that you want to avoid.

    For example, running dx version shows that I am using dx v0.370.2 which is what will be used for the applet we build with this approach.

However, we saw that the UI and CLI import jobs used dxpy v0.369.0, which is two versions behind this one.

    Clone the git repository

    Once you have selected the project to build in using dx select, then build using the --nextflow flag

    You should see an applet ID if it has built successfully.

    Note that this approach does not generate a job log and it will use the version of dxpy on your local machine. So if using dxpy v0.370.2, then the applet will be packaged with this version of dxpy and its corresponding version of nextflow (23.10.0 in this case)

    Test run the nfcore pipeline from the CLI

    To see the help command for the applet:

    Use dx run <applet-name/applet-ID> -h

or use its applet ID (useful when there are multiple versions of the applet with the same name, since each version will have its own ID). Also, you can run an applet using its ID from anywhere in the project, but if using its name you must dx cd to its folder before using it.

    Excerpt of the help command

    Run command

    To run this, copy the command to your terminal and replace 'USERS/FOLDERNAME' with your folder name

    Then press Enter.

    You should see

    Type y to proceed.

    You can also add '-y' to the run command to get it to run without prompting e.g.,

    You can track the progress of your job using the 'Monitor' tab of your project in the UI

• Once the run successfully completes, your results will be in dx://project-xxx:/USERS/FOLDERNAME/test_run_cli, where test_run_cli is the folder on the head node of the Nextflow run that is copied to the destination folder you specified in your project on the platform.

Note that because --destination is a dx run option and not a Nextflow one, it starts with '--' and does not have an '=' after it.

    Controlling the number of parallel subjobs

    In the CLI

By default, the DNAnexus executor will only run 5 subjobs in parallel. You can change this by passing the -queue-size flag, with the number you require, in the Nextflow run options (nextflow_run_opts). There is a limit of 100 concurrent subjobs per user per project for most users, but you can give any number up to 1000 before it produces an error, as noted in the Queue Size Configuration documentation. For example, if you know that you are passing 20 files to a run and that a few processes can each run on all 20 files at a time, you could set the queue size to 60.

Let's change it to 20 for our nf-core Sarek run. Then the command would be

    In the UI, the string would look as below

    To change the Queue Size for your Applet at Build Time

    You can also set the queue size when building your own applets in the nextflow.config. To change the default from 5 to 20 for your applet at build time, add this line to your nextflow.config

    or (equivalent)

However, you can change the queue size at runtime, regardless of whether it is mentioned in your nextflow.config, by passing -queue-size X (where X is a number between 1 and 1000) in the Nextflow run options.

    Resources

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

Some of the links on these pages will take the user to pages that are maintained by third parties. The accuracy and IP rights of the information on these third-party pages are the responsibility of those third parties.

    dx ssh_config
    $ dx run cloud_workstation -h
    usage: dx run cloud_workstation [-iINPUT_NAME=VALUE ...]
    
    App: Cloud Workstation
    
    Version: 2.2.1 (published)
    
    This app sets up a cloud workstation which you can access by running the
    applet with the --ssh or --allow-ssh flags
    
    See the app page for more information:
      https://platform.dnanexus.com/app/cloud_workstation
    Maximum Session Length (suffixes allowed: s, m, h, d, w, M, y):
          [-imax_session_length=(string, default="1h")]
          The maximum length of time to keep the workstation running.
          Value should include units of either s, m, h, d, w, M, y for
          seconds, minutes, hours, days, weeks, months, or years
          respectively.
    $ dx run -imax_session_length="2h" app-cloud_workstation --ssh -y
    Full Documentation
    $ dx run app-cloud_workstation --instance-type mem1_ssd2_v2_x72 --ssh -y
    Maximum Session Length (suffixes allowed: s, m, h, d, w, M, y):
          [-imax_session_length=(string, default="1h")]
          The maximum length of time to keep the workstation running.
          Value should include units of either s, m, h, d, w, M, y for
          seconds, minutes, hours, days, weeks, months, or years
          respectively.
    $ dx run -imax_session_length="2h" app-cloud_workstation --ssh -y
    dnanexus@job-GXfvYxj071x5P87Fxx6f5k47:~$ dx-get-timeout
    0 days 1 hours 42 minutes 50 seconds
    dnanexus@job-GXfvYxj071x5P87Fxx6f5k47:~$ dx-set-timeout 1d
    dnanexus@job-GXfvYxj071x5P87Fxx6f5k47:~$ dx-get-timeout
    0 days 1 hours 59 minutes 57 seconds
    Files: [-ifids=(file) [-ifids=... [...]]]
          An optional list of files to download to the cloud workstation
          on startup.
    $ dx download file-XXXX -o - | tar xv
    Snapshot: [-isnapshot=(file)]
          An optional snapshot file to restore the workstation environment.
    dnanexus@job-GXfvYxj071x5P87Fxx6f5k47:~$ python3
    Python 3.8.10 (default, May 26 2023, 14:05:08)
    [GCC 9.4.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import pandas as pd
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ModuleNotFoundError: No module named 'pandas'
    Created snapshot: project-GXY0PK0071xJpG156BFyXpJF:July_11_2023_23_54.snapshot
    (file-GXfygVj071xGjVfg1KQ9B7PP)
    $ dx run app-cloud_workstation -isnapshot=file-GXfygVj071xGjVfg1KQ9B7PP -y --ssh
    dnanexus@job-GXfyj58071xB4VJ9X0yk75k3:~$ python3
    Python 3.8.10 (default, May 26 2023, 14:05:08)
    [GCC 9.4.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import pandas as pd
    >>> help(pd.read_csv)
    $ dx run app-cloud_workstation --instance-type mem1_ssd2_v2_x72 --ssh -y
    Welcome to DNAnexus!
    
    This is the DNAnexus Execution Environment, running job-GXfvYxj071x5P87Fxx6f5k47.
    Job: Cloud Workstation
    App: cloud_workstation:main
    Instance type: mem1_ssd1_v2_x8
    Project: kyclark_test (project-GXY0PK0071xJpG156BFyXpJF)
    Workspace: container-GXfvYyj0p4QgFgP4zZyBFv7Y
    Running since: Tue Jul 11 21:31:40 UTC 2023
    Running for: 0:01:37
    The public address of this instance is ec2-3-90-239-144.compute-1.amazonaws.com.
    You are running byobu, a terminal session manager.
    Configure Byobu's ctrl-a behavior...
    
    When you press ctrl-a in Byobu, do you want it to operate in:
        (1) Screen mode (GNU Screen's default escape sequence)
        (2) Emacs mode  (go to beginning of line)
    
    Note that:
      - F12 also operates as an escape in Byobu
      - You can press F9 and choose your escape character
      - You can run 'byobu-ctrl-a' at any time to change your selection
    
    Select [1 or 2]:
    If you get disconnected from this instance, you can log in again;
    your work will be saved as long as the job is running.
    For more information on byobu, press F1.
    The job is running in terminal 1. To switch to it, use the F4 key
    (fn+F4 on Macs; press F4 again to switch back to this terminal).
    Use sudo to run administrative commands.
    From this window, you can:
     - Use the DNAnexus API with dx
     - Monitor processes on the worker with htop
     - Install packages with apt-get install or pip3 install
     - Use this instance as a general-purpose Linux workstation
    OS version: Ubuntu 20.04.6 LTS (GNU/Linux 5.15.0-1031-aws x86_64)
    $ whoami
    dnanexus
    $ dx whoami
    kyclark
    $ dx env
    Auth token used         4Gv26bY2YJ6gJjxGkV6Qg62B51X1VF7kq3gPZp2V
    API server protocol     http
    API server host         10.0.3.1
    API server port         8124
    Current workspace       container-GXfvYyj0p4QgFgP4zZyBFv7Y
    Current folder          None
    Current user            None
    $ env | grep DX
    DX_APISERVER_PROTOCOL=http
    DX_JOB_ID=job-GXfvYxj071x5P87Fxx6f5k47
    DX_APISERVER_HOST=10.0.3.1
    DX_WATCH_PORT=8090
    DX_WORKSPACE_ID=container-GXfvYyj0p4QgFgP4zZyBFv7Y
    DX_PROJECT_CACHE_ID=container-GXfvYxj071x5P87Fxx6f5k48
    DX_SNAPSHOT_FILE=null
    DX_SECURITY_CONTEXT={"auth_token_type": "Bearer", "auth_token": "4Gv26bY2YJ6gJjxGkV6Qg62B51X1VF7kq3gPZp2V"}
    DX_RESOURCES_ID=container-GKyz0G00FY38jv564gjXxb46
    DX_THRIFT_URI=query.us-east-1.apollo.dnanexus.com:10000
    DX_APISERVER_PORT=8124
    DX_DXDA_DOWNLOAD_URI=http://10.0.3.1:8090/F/D2PRJ/
    DX_PROJECT_CONTEXT_ID=project-GXY0PK0071xJpG156BFyXpJF
    DX_RUN_DETACH=1
    $ echo $DX_PROJECT_CONTEXT_ID
    project-GXY0PK0071xJpG156BFyXpJF
    $ dx ls $DX_PROJECT_CONTEXT_ID:/
    $ dx upload output.txt --path $DX_PROJECT_CONTEXT_ID:/results
    $ unset DX_WORKSPACE_ID && dx cd $DX_PROJECT_CONTEXT_ID
  • Click 'Launch Analysis'.

  • Sarek release info
    https://github.com/nf-core/sarek/blob/3.4.0/conf/test.config#L8
    here
    here
    Full Documentation
    https://github.com/nf-core/sarek
    dx select  # press enter
    dx select project-ID
    #or
    dx select my_project_name
    dx build --nextflow --repository https://github.com/nf-core/sarek --repository-tag 3.4.0 --destination project-ID:/USERS/FOLDERNAME/sarek_v3.4.0_cli_import
    providers {
      github {
        user = 'username'
        password = 'ghp_xxxx'
      }
    }
    dx --version
    #dx v0.370.2
    git clone --branch 3.4.0 https://github.com/nf-core/sarek.git
    # Here I change the folder name to something with the version in it to help me keep track of different versions of sarek
    mv sarek sarek_v3.4.0_cli
    dx build --nextflow sarek_v3.4.0_cli --destination project-ID:/USERS/FOLDERNAME/sarek_v3.4.0_cli
    applet-xxx
    dx run sarek_v3.4.0_ui -h 
    dx run applet-ID -h
    usage: dx run sarek_v3.4.0_ui [-iINPUT_NAME=VALUE ...]
    
    Applet: sarek
    
    sarek
    
    Inputs:
      outdir: [-ioutdir=(string)]
            (Nextflow pipeline required)
    
      step: [-istep=(string)]
            (Nextflow pipeline required) Default value:mapping The pipeline starts
            from this step and then runs through the possible subsequent steps.
    
      input: [-iinput=(file)]
            (Nextflow pipeline optional) A design file with information about the
            samples in your experiment. Use this parameter to specify the location
            of the input files. It has to be a comma-separated file with a header
            row. See [usage docs](https://nf-co.re/sarek/usage#input).  If no
            input file is specified, sarek will attempt to locate one in the
            `{outdir}` directory. If no input should be supplied, i.e. when --step
            is supplied or --build_from_index, then set --input false
    ...
    dx run sarek_v3.4.0_ui -ioutdir='./test_run_cli' -inextflow_run_opts='-profile test,docker' --destination 'project-ID:/USERS/FOLDERNAME'
     
    Using input JSON:
    {
        "outdir": "./test_run_cli",
        "nextflow_run_opts": "-profile test,docker"
    }
    Confirm running the executable with this input [Y/n]:
    dx run sarek_v3.4.0_ui -ioutdir='./test_run_cli' -inextflow_run_opts='-profile test,docker' --destination 'project-ID:/USERS/FOLDERNAME' -y
    dx run sarek_v3.4.0_ui -ioutdir='./test_run_cli_qs' -inextflow_run_opts='-profile test,docker -queue-size 20' --destination 'project-ID:/USERS/FOLDERNAME'
    executor.queueSize = 20 
    executor {
        queueSize = 20 
    }

    Example 3: samtools

    Building a Native Applet with Bash

    Using dx-app-wizard to Create An Applet

In this applet, I'll show how to count the number of reads in a SAM or BAM file using samtools. The SAM (Sequence Alignment Map) format is a tab-delimited text description of sequence alignments, and the BAM format is the same data stored in binary for better compression. As the SAM format uses a line break to delineate each record, counting the alignments could be as simple as using wc -l; however, the BAM format requires a program like samtools to read the input file, so I'll show how to install this into the applet's execution environment.
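For example, assuming local files named sample.sam and sample.bam (hypothetical names), the two cases look like this; note that header lines in a SAM file start with '@', so a simple line count would include them:

# Count alignment records in a plain-text SAM file, excluding '@' header lines
grep -vc '^@' sample.sam

# Counting records in a binary BAM file requires samtools
samtools view -c sample.bam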

    A minimal native applet requires just two files that exist in a directory with the same name as the applet:

• dxapp.json: a JSON-formatted metadata file

    • a bash or Python program to execute

    I'll use dx-app-wizard to create a skeleton applet structure with these files:

    First, I must give my applet a name. The prompt shows that I must use only letters, numbers, a dot, underscore, and a dash. As stated earlier, this applet name will also be the name of the directory, and I'll use samtools_count:

Next, I'm asked for the title. Note that the prompt includes empty square brackets ([]), which show the default value used if I press Enter. As the title is not required, the default is the empty string, but I will provide an informative title:

    Likewise, the summary field is not required:

    The version is also optional, and I will press Enter to take the default:

    Input Specification

This applet requires a single input, as shown in Table 1.

Input Name    Label       Type    Optional    Default Value
bam           BAM File    file    No          NA

    When prompted for the first input, I'll enter the following:

    • The name of the input will be used as a variable in the bash code, so I will use only letters, numbers, and underscores as in bam or bam_file.

    • The label is optional, as noted by the empty square brackets.

• The types include primitives like integers, floating-point numbers, and strings, as well as arrays of primitive types.

• This is a required input. If an input is optional, I can also provide a default value.

    When prompted for the second input, press Enter:

    Output Specification

As shown in Table 2, the applet will produce a single output file containing the number of alignments:

Output Name    Label          Type
counts         Counts File    file

    When prompted for the first output name, I enter the following:

    • This name will also become a bash variable, so best practice is to use letters, numbers, and underscores.

    • The label is optional.

• The class must be from the preceding list. To be reminded of the choices, press the Tab key twice.

    When prompted for the second output, press Enter:

    Additional Settings

    Here are the final settings I'll use to complete the wizard:

Name                        Value
Timeout Policy              48h
Programming language        bash
Access to internet          No (default)
Access to parent project    No (default)
Instance Type               mem1_ssd1_v2_x4 (default)

Applets are required to set a maximum run time to prevent a job from running excessively long. While some applets may legitimately need days to run, most probably need something in the range of 12-48 hours. As noted in the prompt, I can use m, h, or d to specify minutes, hours, or days, respectively:

For the template language, I must select either bash or Python for the program that is executed when the applet starts. The applet code can execute any program available in the execution environment, including custom programs written in any language. I will choose bash:

    Next, I determine if the applet has access to the internet and/or the parent project. Unless the applet specifically needs access, such as to download a file at runtime, it's best to answer no:

Lastly, I must specify a default instance type. The prompt includes an abbreviated list of instance types. The final number indicates the number of cores, e.g., _x4 indicates 4 cores. The greater the number of cores, the more available memory and disk space. In this case, a small 4-core instance is sufficient:

    The user is always free to override the instance type using the --instance-type option to dx run.

    The final output from dx-app-wizard is a summary of the files that are created:

    1. This file should contain applet implementation details.

    2. This file should contain user help.

    3. The answers from dx-app-wizard are used to create the app metadata.

4. The resources directory is for any additional files you want available on the runtime instance.

5. The src (pronounced "source") directory is a conventional place for source code, but it's not a requirement that code lives in this directory.

6. This is the bash script that will be executed when the applet is run.

7. The test directory is empty and will not be discussed in this section.

The contents of the resources directory will be placed into the root directory of the runtime instance. For instance, if you create a file resources/my_tool, then it will be available on the runtime instance as /my_tool. You would either need to reference the full path (/my_tool) or expand the $PATH variable to include /. Best practice is to create the directory structure resources/usr/local/bin/, so the file will be at /usr/local/bin/my_tool, since /usr/local/bin is normally part of $PATH.
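For example, assuming a helper program called my_tool, the layout could be created like this:

# Put the helper where it will already be on $PATH at runtime
mkdir -p samtools_count/resources/usr/local/bin
cp my_tool samtools_count/resources/usr/local/bin/
# After building, the file will be available on the worker as /usr/local/bin/my_tool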

    Reading dxapp.json

    Let's look at the dxapp.json that was generated by dx-app-wizard. Note that this is a simple text file that you can edit at any time:

The inputSpec has a section for patterns where I will add a few Unix file globs to indicate acceptable file suffixes:

    The outputSpec needs no update:

    The runSpec contains the timeout along with the indication to use bash to run src/samtools_count.sh. If you ever wanted to change the name or location of the run script, update this section:

    Finally, the regionalOptions indicates the default runtime instance.

    Installing Applet Dependencies

In the preceding runSpec, note that the applet will run on Ubuntu 20.04. This instance will include dx-toolkit and several programming languages including bash, Python 3.x, Perl 5.x, and R 3.x. Anything else needed by the applet must be installed. Edit the runSpec to include the following execDepends to install samtools at runtime using the apt package manager:

    The package_manager may be one of the following:

    • apt (Ubuntu)

    • pip (Python)

• gem (Ruby)

• cpan (Perl)

• cran (R)

    Some caveats:

• This runs apt install on every execution, which is fine for fast installs. Some packages may take 5-15 minutes to install, in which case you will pay for those extra minutes on every run.

• It installs whatever version is current in the package manager, which may be old. For instance, apt installs samtools v1.10 as of this writing, while the current release is v1.17.

• Your applet could break if the program's behavior changes when the package manager updates to a newer version.

    Building An Asset

    An alternative is to build an asset that the applet uses. Assets have many advantages, including:

    • Build asset once

    • Runtime installs are quick decompression of tarballs

    • Assets are static and cannot break your code

    Create a new folder with the name of your asset.

    Then, create the file dxasset.json in the folder with the following contents:

    When I execute dx build_asset in the folder, a new job will run to build the asset:

    As noted, the record ID of the asset can now be used in an assetDepends section, which should replace the execDepends:

    Execute dx build_asset inside this directory to build the asset into the selected project. (You can also use the --destination option to specify where to place the asset file, which will be a tarball.)

    The build process will create a new job to build the asset.

    Writing Applet Code

    The default src/samtools_count.sh contains many lines of comments to guide you in writing your application code. Update the file to the following:

    • This is the colloquially named "shebang" line that indicates this is a bash script.

• Although it's not a requirement that app code be contained in a main() function, it is best practice.

• The original template uses echo to show you the runtime value of the inputs.

• Download the input file.

• Execute samtools to count the alignments in the input file.

• Upload the results file and save the new file ID.

• Add the new file ID to the job's output.

    Remember that the $bam variable matches the name of the input in dxapp.json. If you ever wish to change this, be sure to update both the script and the JSON.

    Building the Applet

    Run dx build to create the applet on the DNAnexus platform.

If you have previously built the applet, you will be prompted to use the -f|--overwrite or -a|--archive flags:

Out of habit, I always use -f to force the build:

    Without the -d|--destination option, the applet will be placed into the root directory of the project. I like to make an apps folder to hold my applets:

    TIP: Best practice is to create folders for applets, resources, assets, etc.

    Executing the Applet

    Understanding the Code

    I'd like to discuss this code a little more. In bash, the echo command will print to the console. As in any language, this is a great way to see what's happening when your code is running. In the following line, the $bam variable will only have a value at runtime, so you will not be able to run this script locally:

    When I execute this code, I see output like the following:

    That means that the following line:

    Will execute the following command at runtime:

    Take a look at the usage for dx download to remind yourself that the -o option here is directing that the output file name be input.bam:

The next line of code executes samtools view with the -c flag. Execute samtools view -h to read the documentation:

    I often use a cloud workstation to work through app building. It's the same execution environment (Ubuntu Linux), so I will install any programs I need there, download sample input files, run commands and verify the behavior and output of the tools, etc.

If I download the input file NA12878.bam (file-FpQKQk00FgkGV3Vb3jJ8xqGV), I can run the following command to see that there are 60,777 alignments:

    I can use Unix output redirection with > to place the output into the file counts.txt and cat to verify the output:

Therefore, the following line of code from the bash script places the count of the input BAM file into counts.txt:

    Next, I upload the counts.txt file to the platform using the --brief option that will only show the new file ID:

    In bash, I can use either backticks (``) or $() to capture the results from a command, so the following line captures the file ID into the variable counts_id:

I add this new file ID as an output from the job using dx-jobutil-add-output:

    Here is the last command of the script that sets the counts output variable defined in the dxapp.json to the new $counts_id value:

    Using Input File Helper Variables

In the preceding applet, the output filename is always counts.txt. It would be better for each output file to use the name of the input BAM. By defining the bam input, I get four variables:

    • bam: the input file ID

    • bam_path: the default path to the downloaded input file

• bam_name: the filename, also the output of basename($bam_path)

• bam_prefix: the filename minus any file extension defined in the patterns of the dxapp.json

    The default patterns for a file input in dxapp.json is ["*"]. This matches the entire input filename, causing bam_prefix to be the empty string.

    TIP: Always be sure to set patterns to the expected file extensions.

    Given an input file of NA12878.bam, the following code will create an output file called NA12878.txt:

    1. Print out the additional variables.

2. Download the input file to the filename. The -o option here is superfluous as the default behavior is to download the file to its filename. In the preceding example, I saved it to the filename input.bam.

3. Define the variable outfile to use the root of the input filename.

4. Write the output from samtools to the preferred output filename.

5. Upload the output file.

    When I run this code, I can see the values of the other input file variables:

    The bam_path value is the default path to write the bam file if I were to use dx-download-all-inputs. In this case, I used dx download with the -o option to write it to a file in the current working directory, so there is no file at that path.

    Using dx-download-all-inputs

    There are two ways to download the input files: one at a time or all at once. So far, I've shown the first way using dx download. The second way uses dx-download-all-inputs to download all the input files to the directory /home/dnanexus/in. This will contain a directory for each file input, so the bam input file will be placed into /home/dnanexus/in/bam as shown for the $bam_path in the preceeding section. If the input is an array:file, there will be additional numbered subdirectories for each of the runtime values.

    Following is the usage:

    I can change my code to use this:

    1. Download the input file to the default location.

    2. Use the $bam_prefix variable (e.g., NA12878) to create the outfile.

    3. Use the $bam_path variable to execute samtools with the path to the in directory.

    TIP: Using dx-download-all-inputs --parallel is best practice to download all input files as fast as possible.

    Resources

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.




    Full Documentation
    $ dx-app-wizard
    DNAnexus App Wizard, API v1.0.0
    
    Basic Metadata
    
    Please enter basic metadata fields that will be used to describe your app.
    Optional fields are denoted by options with square brackets.  At the end of
    this wizard, the files necessary for building your app will be generated from
    the answers you provide.
    The name of your app must be unique on the DNAnexus platform.  After
    creating your app for the first time, you will be able to publish new versions
    using the same app name.  App names are restricted to alphanumeric characters
    (a-z, A-Z, 0-9), and the characters ".", "_", and "-".
    App Name: samtools_count
    The title, if provided, is what is shown as the name of your app on
    the website.  It can be any valid UTF-8 string.
    Title []: Samtools Count
    The summary of your app is a short phrase or one-line description of
    what your app does.  It can be any UTF-8 human-readable string.
    Summary []: Count SAM/BAM alignments
    You can publish multiple versions of your app, and the version of your
    app is a string with which to tag a particular version.  We encourage the use
    of Semantic Versioning for labeling your apps (see http://semver.org/ for more
    details).
    Version [0.0.1]:
    Input Specification
    
    You will now be prompted for each input parameter to your app.  Each parameter
    should have a unique name that uses only the underscore "_" and alphanumeric
    characters, and does not start with a number.
    
    1st input name (<ENTER> to finish): bam 
    Label (optional human-readable name) []: BAM File 
    Your input parameter must be of one of the following classes: 
    applet         array:file     array:record   file           int
    array:applet   array:float    array:string   float          record
    array:boolean  array:int      boolean        hash           string
    
    Choose a class (<TAB> twice for choices): file
    This is an optional parameter [y/n]: n 
    2nd input name (<ENTER> to finish):
    Output Specification
    
    You will now be prompted for each output parameter of your app.  Each
    parameter should have a unique name that uses only the underscore "_" and
    alphanumeric characters, and does not start with a number.
    
    1st output name (<ENTER> to finish): counts 
    Label (optional human-readable name) []: Counts File 
    Choose a class (<TAB> twice for choices): file 
    2nd output name (<ENTER> to finish):
    Timeout Policy
    
    Set a timeout policy for your app. Any single entry point of the app
    that runs longer than the specified timeout will fail with a TimeoutExceeded
    error. Enter an int greater than 0 with a single-letter suffix (m=minutes,
    h=hours, d=days) (e.g. "48h").
    Timeout policy [48h]:
    Template Options
    
    You can write your app in any programming language, but we provide
    templates for the following supported languages: Python, bash
    Programming language: bash
    Access Permissions
    If you request these extra permissions for your app, users will see this fact
    when launching your app, and certain other restrictions will apply. For more
    information, see
    https://documentation.dnanexus.com/developer/apps/app-permissions.
    
    Access to the Internet (other than accessing the DNAnexus API).
    Will this app need access to the Internet? [y/N]: n
    
    Direct access to the parent project. This is not needed if your app
    specifies outputs,     which will be copied into the project after it's done
    running.
    Will this app need access to the parent project? [y/N]: n
    Default instance type: The instance type you select here will apply to
    all entry points in your app unless you override it. See https://documenta
    tion.dnanexus.com/developer/api/running-analyses/instance-types for more
    information.
    Choose an instance type for your app [mem1_ssd1_v2_x4]:
    *** Generating DNAnexus App Template... ***
    
    Your app specification has been written to the dxapp.json file. You can
    specify more app options by editing this file directly (see
    https://documentation.dnanexus.com/developer for complete documentation).
    
    Created files:
         samtools_count/Readme.developer.md # 1
         samtools_count/Readme.md  # 2
         samtools_count/dxapp.json  # 3
         samtools_count/resources/  # 4
         samtools_count/src/  # 5
         samtools_count/src/samtools_count.sh # 6
         samtools_count/test/  # 7
    
    App directory created!  See https://documentation.dnanexus.com/developer for
    tutorials on how to modify these files, or run "dx build samtools_count" or
    "dx build --create-app samtools_count" while logged in with dx.
    
    Running the DNAnexus build utility will create an executable on the DNAnexus
    platform.  Any files found in the resources directory will be uploaded
    so that they will be present in the root directory when the executable is run.
    {
        "name": "samtools_count",
        "title": "Samtools Count",
        "summary": "Count SAM/BAM alignments",
        "dxapi": "1.0.0",
        "version": "0.0.1",
        "inputSpec": [
            {
                "name": "bam",
                "label": "BAM File",
                "class": "file",
                "optional": false,
                "patterns": [
                    "*.bam"
                ],
                "help": ""
            }
        ],
        "outputSpec": [
            {
                "name": "counts",
                "label": "Counts File",
                "class": "file",
                "patterns": [
                    "*"
                ],
                "help": ""
            }
        ],
        "runSpec": {
            "timeoutPolicy": {
                "*": {
                    "hours": 48
                }
            },
            "interpreter": "bash",
            "file": "src/samtools_count.sh",
            "distribution": "Ubuntu",
            "release": "20.04",
            "version": "0"
        },
        "regionalOptions": {
            "aws:us-east-1": {
                "systemRequirements": {
                    "*": {
                        "instanceType": "mem1_ssd1_v2_x4"
                    }
                }
            }
        }
    }
    {
        ...
        "runSpec": {
            "execDepends": [
                {
                    "name": "samtools",
                    "package_manager": "apt"
                }
            ],
            ...
        }
    }
    {
        "name": "samtools",
        "title": "samtools asset",
        "description": "samtools asset",
        "version": "1.10",
        "distribution": "Ubuntu",
        "release": "20.04",
        "execDepends": [
            {
              "name": "samtools",
              "package_manager": "apt"
            }
        ]
    }
    $ dx build_asset
    ...
    * samtools (create_asset_focal:main) (done) job-GXjx8yj071x69xBVz90Zypx1
      kyclark 2023-07-14 16:04:27 (runtime 0:02:05)
      Output: asset_bundle = record-GXjx9V008bgjZqj82f5ybf16
    
    Asset bundle 'record-GXjx9V008bgjZqj82f5ybf16' is built and can now be used
    in your app/applet's dxapp.json
    {
        ...
        "runSpec": {
            "assetDepends": [
                { "id": "record-GXjx9V008bgjZqj82f5ybf16" }
            ],
            ...
        }
    }
    #!/bin/bash 
    
    main() { 
        echo "Value of bam: '$bam'" 
    
        dx download "$bam" -o input.bam 
    
        samtools view -c input.bam > counts.txt 
    
        counts_id=$(dx upload counts.txt --brief) 
    
        dx-jobutil-add-output counts "$counts_id" --class=file 
    }
    $ dx build
    {"id": "applet-GXqG4Z8071x9p1FZ81K5BjGQ"}
    $ dx build
    Error: ('An applet already exists at /samtools_count (id
    applet-GXqG4Z8071x9p1FZ81K5BjGQ) and neither -f/--overwrite
    nor -a/--archive were given.',)
    $ dx build -f
    INFO:dxpy:Deleting applet(s) applet-GXqG4Z8071x9p1FZ81K5BjGQ
    {"id": "applet-GXqG5P0071xF2j1F03qv7Zz6"}
    $ dx mkdir apps
    $ dx build -d /apps/ -f
    {"id": "applet-GXqG7bQ071xKQq3JkbVjGbGv"}
    echo "Value of bam: '$bam'"
    2023-07-17 12:42:23 Samtools Count STDOUT Value of bam:
    '{"$dnanexus_link": "file-FpQKQk00FgkGV3Vb3jJ8xqGV"}'
    dx download "$bam" -o input.bam
    dx download '{"$dnanexus_link": "file-FpQKQk00FgkGV3Vb3jJ8xqGV"}' -o input.bam
    -o OUTPUT, --output OUTPUT Local filename or directory to be used
                               ("-" indicates stdout output); if not supplied or
                               a directory is given, the object's name on the
                               platform will be used, along with any applicable
                               extensions
    -c, --count                Print only the count of matching records
    $ samtools view -c NA12878.bam
    60777
    $ samtools view -c NA12878.bam > counts.txt
    $ cat counts.txt
    60777
    samtools view -c input.bam > counts.txt
    $ dx upload counts.txt --brief
    file-GXpvky0071x6jg2ZVV3fJ5xp
    $ counts_id=$(dx upload counts.txt --brief)
    $ echo $counts_id
    file-GXqFf60071x6p2fbKYzVv9pp
    $ dx-jobutil-add-output -h
    usage: dx-jobutil-add-output [-h] [--class [CLASSNAME]] [--array] name value
    
    Reads and modifies job_output.json in your home directory to be a JSON hash
    with key *name* and value  *value*.
    
    If --class is not provided or is set to "auto", auto-detection of the
    output format will occur.  In particular, it will treat it as a number,
    hash, or boolean if it can be successfully parsed as such.  If it is a
    string which matches the pattern for a data object ID, it will encapsulate
    it in a DNAnexus link hash; otherwise it is treated as a simple string.
    dx-jobutil-add-output counts "$counts_id" --class=file
    #!/bin/bash
    
    main() {
        echo "Value of bam       : '$bam'" # 1
        echo "Value of bam_path  : '$bam_path'" 
        echo "Value of bam_name  : '$bam_name'"
        echo "Value of bam_prefix: '$bam_prefix'"
    
        dx download "$bam" -o "$bam_name"  # 2
    
        outfile="$bam_prefix.txt"  # 3
    
        samtools view -c "$bam_name" > "$outfile"  # 4
    
        counts_id=$(dx upload "$outfile" --brief)  # 5
    
        dx-jobutil-add-output counts "$counts_id" --class=file # 6
    }
    Value of bam       : '{"$dnanexus_link": "file-FpQKQk00FgkGV3Vb3jJ8xqGV"}'
    Value of bam_path  : '/home/dnanexus/in/bam/NA12878.bam'
    Value of bam_name  : 'NA12878.bam'
    Value of bam_prefix: 'NA12878'
    $ dx-download-all-inputs -h
    usage: dx-download-all-inputs [-h] [--except EXCLUDE]
      [--parallel] [--sequential]
    
    Note: this is a utility for use by bash apps running in the DNAnexus Platform.
    
    Downloads all files that were supplied as inputs to the app.  By
    convention, if an input parameter "FOO" has value
    
        {"$dnanexus_link": "file-xxxx"}
    
    and filename INPUT.TXT, then the linked file will be downloaded into the
    path:
    
        $HOME/in/FOO/INPUT.TXT
    
    If an input is an array of files, then all files will be placed into
    numbered subdirectories under a parent directory named for the input. For
    example, if the input key is FOO, and the inputs are {A, B, C}.vcf then,
    the directory structure will be:
    
        $HOME/in/FOO/0/A.vcf
                     1/B.vcf
                     2/C.vcf
    
    Zero padding is used to ensure argument order. For example, if there are 12
    input files {A, B, C, D, E, F, G, H, I, J, K, L}.txt, the directory
    structure will be:
    
        $HOME/in/FOO/00/A.vcf
                     ...
                     11/L.vcf
    
    This allows using shell globbing (FOO/*/*.vcf) to get all the files in the
    input order.
    
    options:
      -h, --help        show this help message and exit
      --except EXCLUDE  Do not download the input with this name. (May be used
                        multiple times.)
      --parallel        Download the files in parallel
      --sequential      Download the files sequentially
    #!/bin/bash
    
    main() {
        echo "Value of bam       : '$bam'"
        echo "Value of bam_path  : '$bam_path'"
        echo "Value of bam_name  : '$bam_name'"
        echo "Value of bam_prefix: '$bam_prefix'"
    
        dx-download-all-inputs # 1
    
        outfile="$bam_prefix.txt" # 2
    
        samtools view -c "$bam_path" > "$outfile" 
    
        counts_id=$(dx upload "$outfile" --brief)
    
        dx-jobutil-add-output counts "$counts_id" --class=file
    }

    Example 4: cnvkit

    To begin, you'll create a bash app to run CNVKit, which will find "genome-wide copy number from high-throughput sequencing." Create a local directory to hold your work, and consider putting the contents into a source code repository like Git.
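For example (the directory name here is arbitrary):

mkdir cnvkit_project && cd cnvkit_project
git init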

    In this example, you will:

    • Use various package managers to install dependencies

    • Build an asset

    • Learn to use dx-download-all-inputs and dx-upload-all-outputs

    Create a Project

    From the web interface, select "Projects → All Projects" to see your project list. Click the "New Project" button to create a new project called "CNVkit." Alternatively, use dx new project to do this from the command line. However you choose to create a project, be sure this has been selected by running dx pwd to check your current working directory and using dx select to select the project, if needed.
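From the command line, that might look like the following:

dx new project "CNVkit"
dx select "CNVkit"
dx pwd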

    Build a bash app with dx-app-wizard

Inside your working directory, run the command dx-app-wizard cnvkit_bash to launch the app wizard tool. Optionally provide a title, summary, and version at the prompts.

    The Input Specification

    The app will accept two inputs:

    1. One or more BAM files of the tumor samples: Give this input the name bam_tumor with the label "BAM Tumor Files." For the class, choose array:file, and indicate that this is not an optional parameter.

    2. A reference file: Give this input the name reference with the label "Reference." For the class, choose file, and indicate that this is not an optional parameter.

    When prompted for the third input, press Enter to end the inputs.

    The Output Specification

    Define three outputs, each of the type array:file with the following names and whatever labels you feel are appropriate:

    1. cns

    2. cns_filtered

    3. plot

    Press Enter when prompted for the fourth output to indicate you are finished.

    Other Options

    • Press Enter to accept the default value for the timeout policy.

    • Type bash for the programming language.

    • Type y to indicate that the app will need internet access.

• Type n to indicate that the app will not need access to the parent project.

• Press Enter to accept the default value for the instance type or select one from the list shown.

You should see a message saying the app's template was created in a directory whose name matches the app's name. For instance, I have the following:

    • This is a JSON file containing metadata that will be used to create the app on the DNAnexus platform.

    • A stub for user documentation.

    • A stub for developer documentation.

    • A template bash script for the app's functionality.

    Examine dxapp.json

    The dxapp.json file that was created by the wizard should look like the following:

See the app metadata documentation for a more complete understanding of all the possible fields and their implications.

    Add Python and R Module Dependencies

    CNVkit has dependencies on both Python and R modules that must be installed before running. In the dxapp.json, you can specify dependencies that can be installed with the following package managers:

    • apt (Ubuntu)

    • pip (Python)

• cpan (Perl)

• cran (R)

• gem (Ruby)

The Python module cnvkit can be installed via pip, but the software also requires an R module called DNAcopy that must be installed using Bioconductor's BiocManager, which must first be installed using cran. This means you'll have to manually install the DNAcopy module when the app starts.
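One way to handle that manual step near the top of the app script might be a line like the following (a sketch; the exact BiocManager::install options can vary):

# Install the DNAcopy R module via BiocManager when the app starts
Rscript -e 'BiocManager::install("DNAcopy", update = FALSE, ask = FALSE)'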

    To add these runtime dependencies, use a text editor to update the runSpec and add the following execDepends section that will install the Python cnvkit and R BiocManager modules before the app is executed:

    Specify File Patterns for Inputs

    In the inputSpec, change the patterns to match the expected file extensions:

• bam_tumor: *.bam

    • reference: *.cnn

    Your dxapp.json should now look like the following:

    Edit the bash Code

    The default bash code generated by the wizard starts with a generous header of comments that you may or may not wish to keep. The default code prints the values of the input variables, then downloads the input files individually. The app code belongs in the middle, after downloading the inputs and before uploading the outputs:

Replace src/cnvkit_bash.sh with the following code:

Rather than downloading the inputs individually as in the original template, this version downloads all the inputs in parallel with the following command:

This will create an in directory with subdirectories named according to the input names. Note that the bam_tumor input is an array of files, so its directory will contain numbered subdirectories starting at 0, one for each input file.
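A sketch of that layout, assuming two tumor BAM files and one reference (the filenames are hypothetical):

in/bam_tumor/0/tumor_a.bam
in/bam_tumor/1/tumor_b.bam
in/reference/reference.cnn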

    Similarly, the preceding code uses dx-upload-all-outputs, which expects an out directory with subdirectories named according to each of the output specifications.
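A minimal sketch of preparing that structure before the upload (the folder names follow the output specifications defined above):

mkdir -p out/cns out/cns_filtered out/plot
# ... app code writes its result files into the matching folders ...
dx-upload-all-outputs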

    Build the Applet

Use dx pwd to ensure you are in the correct project and dx select to change projects, if necessary. If you are inside the bash source directory where the dxapp.json file exists, you can run dx build -f. If you are in the parent directory, run dx build -f cnvkit_bash. Here is a sample output from successfully building the app:

    The -f|--overwrite flag indicates you wish to overwrite any previous version of the applet. You may also want to use the -a|--archive flag to move any previous versions to an archived location. You won't need either of these flags the first time you compile, but subsequent builds will require that you indicate how to handle previous versions of the applet. Run dx build --help to learn more about build options.

    Run the bash applet

Download this BAM file archive (BAM.zip, 15MB) and add it to the inputs directory

    Indicate an output directory, click the Run button, and then click the "View Log" to watch the job's progress.

    You can also run the applet on the command line with the -h|--help flag to verify the inputs and outputs:

    Select the input files on the web interface to note the file IDs that can be used to execute the app from the command line as follows:
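A sketch of such a command, with placeholder file IDs standing in for the real ones:

dx run cnvkit_bash \
    -ibam_tumor=file-XXXX \
    -ibam_tumor=file-YYYY \
    -ireference=file-ZZZZ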

    You should see output from the preceding command that includes a JSON document with the inputs:

    Note that you can place this JSON into a file and launch the applet with the inputs specified with the -f|--input-json-file option, as follows. Use dx run -h to learn about other command-line options:
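For example, assuming the JSON is saved to a file named inputs.json (a hypothetical name):

dx run cnvkit_bash -f inputs.json -y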

    Note the job ID from dx run, and use dx watch to watch the job to completion and dx describe to view the job's metadata. Alternatively, you can use the web platform to launch the job, using the file selector to specify each of the inputs, and then use the "Monitor" view to check the job's status, and view the output reference file when job completes.

    Build an Asset

    You'll notice the applet takes quite a while to run (around 14 minutes for me) because of the module installations. You can build an asset for these installations and use this in dxapp.json. Create a directory called cnvkit_asset with the following file dxasset.json:

    Also create a Makefile with the following contents:

    Run dx build_asset to create the asset. This will launch a job that will report the asset ID at the end:

    Update the runSpec in dxapp.json to the following:

    Use dx build -f and note the new app's ID. Create a JSON input as follows:

    Launch the new app from the CLI with the following command:

    Use dx watch with the new job ID to see how the run now uses the asset to run faster. I see about a 10-minute difference with the asset.

    Review

    You learned more ways to include app dependencies using package managers and a Makefile as well as by building an asset. The first strategy happens at runtime while the latter builds all the dependencies before the applet is run, making the runtime much faster.

    Resources

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.


    $ find cnvkit_bash -type f
    cnvkit_bash/dxapp.json 
    cnvkit_bash/Readme.md 
    cnvkit_bash/Readme.developer.md 
    cnvkit_bash/src/cnvkit_bash.sh 
    {
      "name": "cnvkit_bash",
      "title": "cnvkit_bash",
      "summary": "cnvkit_bash",
      "dxapi": "1.0.0",
      "version": "0.0.1",
      "inputSpec": [
        {
          "name": "bam_tumor",
          "label": "BAM Tumor Files",
          "class": "array:file",
          "optional": false,
          "patterns": [
            "*"
          ],
          "help": ""
        },
        {
          "name": "reference",
          "label": "Reference",
          "class": "file",
          "optional": false,
          "patterns": [
            "*"
          ],
          "help": ""
        }
      ],
      "outputSpec": [
        {
          "name": "cns",
          "label": "CNS",
          "class": "array:file",
          "patterns": [
            "*"
          ],
          "help": ""
        },
        {
          "name": "cns_filtered",
          "label": "CNS Filtered",
          "class": "array:file",
          "patterns": [
            "*"
          ],
          "help": ""
        },
        {
          "name": "plot",
          "label": "Plot",
          "class": "array:file",
          "patterns": [
            "*"
          ],
          "help": ""
        }
      ],
      "runSpec": {
        "timeoutPolicy": {
          "*": {
            "hours": 48
          }
        },
        "interpreter": "bash",
        "file": "src/cnvkit_bash.sh",
        "distribution": "Ubuntu",
        "release": "20.04",
        "version": "0"
      },
      "access": {
        "network": [
          "*"
        ]
      },
      "regionalOptions": {
        "aws:us-east-1": {
          "systemRequirements": {
            "*": {
              "instanceType": "mem1_ssd1_v2_x4"
            }
          }
        }
      }
    }
    "runSpec": {
        "interpreter": "bash",
        "file": "src/cnvkit_bash.sh",
        "distribution": "Ubuntu",
        "release": "20.04",
        "version": "0",
        "execDepends": [
          {
            "name": "cnvkit",
            "package_manager": "pip"
          },
          {
            "name": "BiocManager",
            "package_manager": "cran"
          }
        ],
        "timeoutPolicy": {
          "*": {
            "hours": 48
          }
        }
    }
    {
      "name": "cnvkit_bash",
      "title": "cnvkit_bash",
      "summary": "cnvkit_bash",
      "dxapi": "1.0.0",
      "version": "0.0.1",
      "inputSpec": [
        {
          "name": "bam_tumor",
          "label": "BAM Tumor Files",
          "class": "array:file",
          "optional": false,
          "patterns": [
            "*.bam"
          ],
          "help": ""
        },
        {
          "name": "reference",
          "label": "Reference",
          "class": "file",
          "optional": false,
          "patterns": [
            "*.cnn"
          ],
          "help": ""
        }
      ],
      "outputSpec": [
        {
          "name": "cns",
          "label": "CNS",
          class": "array:file",
          "patterns": [
            "*"
          ],
          "help": ""
        },
        {
          "name": "cns_filtered",
          "label": "CNS Filtered",
          "class": "array:file",
          "patterns": [
            "*"
          ],
          "help": ""
        },
        {
          "name": "plot",
          "label": "Plot",
          "class": "array:file",
          "patterns": [
            "*"
          ],
          "help": ""
        }
      ],
      "runSpec": {
        "timeoutPolicy": {
          "*": {
            "hours": 48
          }
        },
        "execDepends": [
          {
            "name": "cnvkit",
            "package_manager": "pip"
          },
          {
            "name": "BiocManager",
            "package_manager": "cran"
          }
        ],
        "interpreter": "bash",
        "file": "src/cnvkit_bash.sh",
        "distribution": "Ubuntu",
        "release": "20.04",
        "version": "0"
      },
      "access": {
        "network": [
          "*"
        ]
      },
      "regionalOptions": {
        "aws:us-east-1": {
          "systemRequirements": {
            "*": {
              "instanceType": "mem1_ssd1_v2_x4"
            }
          }
        }
      }
    }
    main() {
    
        echo "Value of bam_tumor: '${bam_tumor[@]}'"
        echo "Value of reference: '$reference'"
    
        # The following line(s) use the dx command-line tool to download your file
        # inputs to the local file system using variable names for the filenames. To
        # recover the original filenames, you can use the output of "dx describe
        # "$variable" --name".
    
        dx download "$reference" -o reference
        for i in ${!bam_tumor[@]}
        do
            dx download "${bam_tumor[$i]}" -o bam_tumor-$i
        done
    
        # >>>>> Here is where the app code belongs <<<<<
    
        # The following line(s) use the dx command-line tool to upload your file
        # outputs after you have created them on the local file system.  It assumes
        # that you have used the output field name for the filename for each output,
        # but you can change that behavior to suit your needs.  Run "dx upload -h"
        # to see more options to set metadata.
    
        cns=$(dx upload cns --brief)
        cns_filtered=$(dx upload cns_filtered --brief)
        plot=$(dx upload plot --brief)
    
        # The following line(s) use the utility dx-jobutil-add-output to format and
        # add output variables to your job's output as appropriate for the output
        # class.  Run "dx-jobutil-add-output -h" for more information on what it
        # does.
    
        dx-jobutil-add-output cns "$cns" --class=file
        dx-jobutil-add-output cns_filtered "$cns_filtered" --class=file
        dx-jobutil-add-output plot "$plot" --class=file
    }
    #!/bin/bash
    
    # Set pragmas to print commands and fail on errors
    set -exuo pipefail
    
    # Install required R module
    Rscript -e "BiocManager::install('DNAcopy')"
    
    # Verify the value of inputs
    echo "Value of bam_tumor: '${bam_tumor[@]}'"
    echo "Value of reference: '$reference'"
    
    # Place all inputs into the "in" directory
    dx-download-all-inputs --parallel
    
    # Use "_path" versions of inputs for file paths
    cnvkit.py batch \
        ${bam_tumor_path[@]} \
        -r ${reference_path} \
        -p $(expr $(nproc) - 1) \
        -d cnvkit-out/ \
        --scatter
    
    # Make out directories for each output spec
    mkdir -p ~/out/cns/ ~/out/cns_filtered/ ~/out/plot/
    
    # Move CNVkit outputs to the "out" directory for upload
    mv cnvkit-out/*.call.cns    ~/out/cns_filtered/
    mv cnvkit-out/*.cns         ~/out/cns/
    mv cnvkit-out/*-scatter.png ~/out/plot/
    
    # Upload and annotate all output files
    dx-upload-all-outputs --parallel
    dx-download-all-inputs --parallel
    in/bam_files/0/...
    in/bam_files/1/...
    in/reference/...
    $ dx build -f
    {"id": "applet-GFyV3kj0VGFkV8k04f3K11QY"}
    $ dx run applet-GFyV3kj0VGFkV8k04f3K11QY -h
    usage: dx run applet-GFyV2G8054JBQXY64g4F7ZKk [-iINPUT_NAME=VALUE ...]
    
    Applet: cnvkit_bash
    
    cnvkit_bash
    
    Inputs:
      BAM Tumor Files: -ibam_tumor=(file) [-ibam_tumor=... [...]]
    
      Reference: -ireference=(file)
    
    Outputs:
      CNS: cns (array:file)
    
      CNS Filtered: cns_filtered (array:file)
    
      Plot: plot (array:file)
    $ dx run -y --watch applet-GFyV3kj0VGFkV8k04f3K11QY \
        -ibam_tumor=file-GFxXjV006kZVQPb20G85VXBp \
        -ireference=file-GFxXvpj06kZfP0QVKq2p2FGF \
        --destination /outputs
    Using input JSON:
    {
        "bam_tumor": [
            {
                "$dnanexus_link": "file-GFxXjV006kZVQPb20G85VXBp"
            }
        ],
        "reference": {
            "$dnanexus_link": "file-GFxXvpj06kZfP0QVKq2p2FGF"
        }
    }
    $ dx run -y --watch applet-GFyV3kj0VGFkV8k04f3K11QY \
            -f cnvkit_bash/inputs.json \
            --destination /outputs
    {
        "name": "cnvkit_asset",
        "title": "cnvkit_asset",
        "description": "cnvkit_asset",
        "version": "0.0.1",
        "distribution": "Ubuntu",
        "release": "20.04",
        "execDepends": [
            {
              "name": "cnvkit",
              "package_manager": "pip"
            },
            {
              "name": "BiocManager",
              "package_manager": "cran"
            }
        ]
    }
    SHELL=/bin/bash -exuo pipefail
    all:
        sudo Rscript -e "BiocManager::install('DNAcopy')"
    Asset bundle 'record-GFyVY000X1ZK3yGg4qv32GXv' is built and can now be used
    in your app/applet's dxapp.json
      "runSpec": {
        "timeoutPolicy": {
          "*": {
            "hours": 48
          }
        },
        "assetDepends": [{"id": "record-GFyVY000X1ZK3yGg4qv32GXv"}],
        "interpreter": "bash",
        "file": "src/cnvkit_bash.sh",
        "distribution": "Ubuntu",
        "release": "20.04",
        "version": "0"
      },
    $ cat inputs.json
    {
        "bam_tumor": [
            {
                "$dnanexus_link": "file-GFxXjV006kZVQPb20G85VXBp"
            }
        ],
        "reference": {
            "$dnanexus_link": "file-GFxXvpj06kZfP0QVKq2p2FGF"
        }
    }
    $ dx run applet-GFyVppQ0VGFxvvx44j43YyPz -f inputs.json -y

    Example 1: hello

    To begin, you'll code a "Hello, World!" workflow that captures the output of a command into a file. WDL syntax may look familiar if you know any C-family language like Java or Perl. For example, keywords like workflow and task are used to define blocks of statements contained inside matched curly braces ({}), and variables are defined using a data type like String or File.

    In this example, you will:

    • Write a simple workflow in WDL

    • Learn two ways to capture the standard out (STDOUT) of a command block

    Getting Started

    To see this in action, make a hello directory for your work, and inside that create the file workflow.wdl with the following contents:

    • The version states that the following WDL follows the WDL 1.0 specification.

    • The workflow keyword defines a workflow name. The contents of the workflow are enclosed in matched curly braces.

    • The input block describes the parameters for the workflow.

    • WDL defines several data types you can use to describe an input value. This workflow requires a String parameter called name.

    WDL is not whitespace dependent, so indentation is based on your preference.

    In the Setup section, you should have installed the miniwdl tool, which can be useful to check the syntax of your WDL. The following command shows the output when there are no problems:

    Introduce an error in your WDL to see how the output changes. For instance, change the version to 2.0 and observe the error message:

    Or change the call to write_greetings:

    Cromwell will also find this error, but the message will be buried in literally thousands of lines of output.

    Note that miniwdl uses a different parser than dxCompiler, and each has slightly different ideas of what constitutes valid syntax. For example, miniwdl requires commas in between input items but dxCompiler does not. In spite of their differences, I appreciate the concise reporting of errors that miniwdl provides.

    Executing WDL locally with Cromwell

    To execute this workflow locally using Cromwell, you must first create a JSON file to define the input name. Create a file called inputs.json with the following contents if you'd like to extend salutations to my friend Geoffrey:

    Next, run the following command to execute the workflow:

    The output will be copious and should include an indication that the command was successful and the output landed in a file in the cromwell-executions directory that was created:

    You can use the cat (concatenate) command to see the contents of the file. Be sure to change the file path to the one created by your execution:

    Here is another way to write the command block and capture STDOUT to a named file:

    • The command block here uses triple angle brackets to enclose the shell commands.

    • The variable must be interpolated with ~{} because of the triple angle brackets. The Unix redirect operator > is used to send the STDOUT from echo into the file out.txt.

    If you execute this version, the output should show that the file out.txt was created instead of the file stdout:

    I can use cat again to verify that the file contains the same greeting:

    Creating a WDL applet with dxCompiler

    Now that you have verified that the workflow runs correctly on your local machine, it's time to compile this onto the DNAnexus platform. First, create a project in your organization and take note of the project ID. I'll demonstrate using the dx command-line interface to create a project called Workflow Test:

    All the dx commands will print help documentation if you supply the -h or --help flags. For instance, run dx new project --help.

    You can also use the web interface, in which case you should use dx select to switch to the project. Next, use dxCompiler to compile the workflow into a workflows directory in the new project. In the following command, the dxCompiler prints the new workflow ID upon success:

    Running a Workflow from the Web Interface

    Use the web interface to inspect the new workflow as shown in Figure 1. Click on the info button (an "i" in a circle to the right of the "Run" button) to verify the workflow ID is the same as you see on the command line.

    Use the "Run" button in the web interface to launch the applet as shown in Figure 2. As shown in Figue 2, I indicate the applet's outputs should written to the outputs directory.

    Click on the "Analysis Inputs" view to specify a name for the greeting. In Figure 3, you see I have selected the name "Jonas."

    Click "Start Analysis" to start the workflow. The web interface will show the progress of running the applet as shown in Figure 4.

    Figure 5 shows check marks next to each step that has been completed. Click the button to show inputs and outputs, then click on the link to the output file, which may be stdout or out.txt depending on the version of the workflow you compiled.

    Click on the output file name to view the contents of the file as shown in Figure 6.

    Click on the "Monitor" view to see how long the job lasted and cost as shown in Figure 7.

    Running a Workflow from the Command Line

    You can also use the dx CLI to run the applet as shown in the following interactive session:

    You can also specify the input JSON on the command line as a string or a file. In the following command, I provide the JSON as a string. Also note the use of the -y (yes) flag to have the workflow run without confirmation:

    You can also place the JSON into a file like so:

    You can execute the workflow with this JSON file as follows:

    You may also run the workflow with the -h|--help flag to see how to pass the arguments on the command line:

    For instance, you can also launch the app using the following command to greet "Keith":

    However you choose to launch the workflow, the new run should be displayed in the "Monitor" view of the web interface. As shown in Figure 8, the new run finished in under 1 minute.

    To find out more about the latest run, click on the job's name in the run table. As shown in Figure 9, the platform will reuse files from the first run because it sees that nothing has changed. This is called "smart reuse," and you can disable this feature if you like.

    You can also use the CLI to view the results of the run with the dx describe command:

    Notice in the preceding output that the Output lists file-GFbPkBj0XFYgB7Vj4pF8XXBQ. You can cat the contents of this file with the CLI:

    Alternately, you can download the file:

    The preceding command should create a new local file called stdout or out.txt, depending on the version of the workflow you compiled. Use the cat command again to verify the contents:

    Creating Command Shortcuts Using a Makefile

    You can create command-line shortcuts for all the steps of checking and building your workflow by recording them as targets in a Makefile as follows:

    GNU Make (or a similar Make program, which you may need to install) can turn the command make local into the listed Cromwell command to run one of the workflow versions. Makefiles are a handy way to document your work and automate your efforts.

    Review

    You should now be able to do the following:

    • Write a valid WDL workflow that accepts a string input and interpolates that string in a bash command.

    • Capture the standard output of a command block either using the stdout() WDL directive or by redirecting the output of a Unix command to a named file.

    • Define a File type output from a task

    In the next section, you'll learn how to accept a file input and launch parallel processes to speed execution of large tasks.

    Review

    In this chapter, you learned some more WDL functions.

    Resources

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

    Building Nextflow Applets



    Pipeline Script Folder Structure

    Building and running Nextflow pipelines on DNAnexus.

    A Nextflow pipeline is structured as a folder containing Nextflow scripts, with optional configuration files and subfolders. Below are the basic elements of the folder structure when building a Nextflow executable:

    • (Required) A main Nextflow file with the extension .nf containing the pipeline. The default filename is main.nf. A different filename can be specified in the nextflow.config file using manifest.mainScript = 'myfile.nf'.

    • (Optional, recommended) A nextflow.config file.

    Reviewing an example minimal nextflow applet

    Create the code for fastqc-nf

    We are going to add each file into a folder called fastqc-nf

    This is a very simple applet containing only one process, which runs FASTQC on files specified using an input samplesheet or from a folder in a project on the platform.

    It has only three files:

    • main.nf : The pipeline script file

    • nextflow.config : Contains config info and sets params

    • nextflow_schema.json : Specifies the information used by the UI/CLI run command to serve the nextflow params to the user on DNAnexus

    The main.nf file

    Let's look at the main.nf file. As a reminder, this file can be given a different name, with the new name specified in the nextflow.config file using manifest.mainScript = 'myfile.nf' if needed.

    main.nf

    1. DNAnexus expects Nextflow pipelines to use the Nextflow DSL2 standard. If you learned Nextflow after December 2022 (when Nextflow version 22.12.0 was released), you are using DSL2.

      • "In Nextflow version 22.03.0-edge, DSL2 became the default DSL version. In version 22.12.0-edge, DSL1 support was removed, and the Nextflow documentation was updated to use DSL2 by default."

    2. Each process must use a Docker container to define the software environment for the process. See here for more information on using Docker containers in Nextflow processes. Here I am using a public Docker image on quay.io. This is the same Docker container used by the nfcore fastqc nf module. You might notice that the container line in the nfcore fastqc module is missing 'quay.io'. This is because this part is expected to be given in the nextflow.config using docker.registry = 'quay.io', as sketched below.
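    A minimal illustrative sketch (not taken from this applet) of that split between module and config:

    // In an nfcore-style module/process: the registry prefix is omitted
    container 'biocontainers/fastqc:0.12.1--hdfd78af_0'

    // In nextflow.config: the registry prefix is supplied once for the whole pipeline
    docker.registry = 'quay.io'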

    An example of using publishDir multiple times in one process to send outputs to subfolders

    Only the 'copy' mode of publishDir is supported on DNAnexus. If you do not specify a mode, the DNAnexus executor will use copy by default, so both of the publishDir lines in the example above are valid.

    Assuming at runtime you assign outdir the value of './results', this example places all output files ending in .html in ./results/fastqc/html and all output files ending in .zip in ./results/fastqc/zip on the head node of the Nextflow run.

    The entire outdir, with its subfolder structure intact, will be copied to the platform location specified by --destination in the CLI or 'Output to' in the UI, once all subjobs have been completed.

    Only relative paths are allowed for publishDir on DNAnexus, and thus for params.outdir (since this is where files are published).

    General nextflow publishDir advice: do not attempt to access files in the publishDir directories from within a Nextflow script, as this is bad practice for many reasons. Use channels to pass files between processes.

    1. In this example applet, I have placed the process and workflow parts in the main.nf script. For larger multi-process applets, you can place your processes in modules/workflows/subworkflows and import them into the main script as done in nfcore pipelines.

    The nextflow.config file

    Full File:

    Explanation of Each Section:

    1. Enable docker by default for this pipeline

    2. Define the input parameters. You can also do this in the main.nf script, but by convention nfcore pipelines do it in the nextflow.config. There are three params in this workflow: 'samplesheet', which is a file input; 'reads_dir', which is a directory path; and 'outdir', which is a string defining the name of the output folder.

      • Here I have assigned samplesheet and reads_dir the value of null. Thus, if the user does not provide a samplesheet or a reads_dir to the pipeline at runtime, the pipeline will fail. For items such as the samplesheet that should always or nearly always change at runtime, it is valuable to assign them a null value instead of a default so that a user does not accidentally run the pipeline with a default samplesheet thinking they have used a different one.

      • Here outdir is assigned a default of './results'. Thus, if a user does not specify a string for outdir at runtime, it will use './results'. If a user does specify an outdir, the user-specified value will be used instead.

    3. A common setting to make processes fail quickly and loudly when they encounter an issue.

    Error Strategy: I have not defined an error strategy in the nextflow.config file. Thus, the default strategy (for both the local Nextflow executor and the DNAnexus executor) is 'terminate'. For more detailed information on choosing an errorStrategy, see this section.

    Queue size: I have also not defined the queueSize, so when this applet is run, a maximum of 5 subjobs will run in parallel at any one time, unless you pass the -queue-size flag via the nextflow_run_opts option for the applet.

    The nextflow_schema.json file

    The nextflow_schema.json file is needed to reflect the nextflow params (--samplesheet, --reads_dir and --outdir in this case) as DNAnexus applet inputs in the CLI and UI. If it is not present, you will not get the -isamplesheet, -ireads_dir and -ioutdir options for your applet inputs. You can also use it to do parameter validation at runtime using plugins such as nf-validation.

    nextflow_schema.json

    Creating a nextflow_schema.json file

    Once you have written your script and know your parameters, you can make the schema quite quickly using the nfcore pipeline schema builder website. Note: do not put sensitive information into this builder, as information in it is stored by nfcore for 2 weeks.

    There is also the option of using nfcore tools on your computer to create it. You may need to manually add a format of either file-path or directory-path to some parameters if the tool doesn't do it for you.

    Here we will explain how to use the nfcore pipeline schema builder website.

    1. In the New Schema section, click the blue Submit button to start.

    2. Near the top of the page, click the 'Add group' button. You need at least one group in your schema file for it to function on the platform. All parameters must be placed into a group (you can do this by dragging and dropping them into the group). For example, you might have one group called Inputs for all your input parameters and a group called Output for your output parameters, with the appropriate parameters placed into the correct groups. Click required for every non-optional parameter.

    3. The default type of input is a string input. For file and directory path input parameters, click the little wheel to the right

    To remove an input parameter for the pipeline from the UI and CLI, you can delete it from the nextflow_schema.json file, or place it in a section of the nextflow_schema.json file that is not referenced in the allOf section at the bottom of the json file.

    You can also remove entire sections by removing their reference from the allOf section without deleting them from the file.

    Build the nextflow applet

    Ensure that you are in the project that you want to build the applet in using dx pwd or dx env. Use dx select to switch to the correct project if required.

    Assuming you have the folder called fastqc-nf with these contents (main.nf is required at a minimum):

    Build applet - the applet will build in the root of your project

    If you are in the fastqc-nf folder on your machine, you will need to cd .. back a level for the command below to work.

    or build using --destination to set the project-level folder for the applet

    or, to build in the root of the project and just change the name to test-fastqc-nf, run:

    You should see an output like the one below but with a different applet ID.

    Use -a with dx build to archive previous versions of your applet and -f to force overwrite previous applet versions. The archived versions are placed in a folder called .Applet_archive in the root of the project.

    You can see the build help using dx build -h or dx build --help

    How file-path and directory-path in nextflow_schema.json affect run options

    In the DNAnexus UI:

    • file-path will be rendered as a file-picker which enables loading of a file object by selecting it in the UI (can only select one file)

    • directory-path will be rendered as a string and will appear in the UI as a text box input. You can point to a directory by typing a string path such as dx://<project-id>:/test/ in the box, or to multiple files in a path such as dx://<project-id>:/test/*_R{1,2}.fastq.gz

    • string

    Here is part of the fastqc-nf run setup screen

    Notice how samplesheet has 'Select File' and a file icon but outdir and reads_dir have text input boxes.

    This is because samplesheet was given 'file-path' in the nextflow_schema.json, but outdir and reads_dir were given as directory-path, which renders as a string input, hence the text box.

    In the DNAnexus CLI:

    Run the applet with -h to see the input parameters for the applet

    Excerpt of output from command above

    • string will appear as class string e.g., for param outdir

      The default here is what we specified as the default in nextflow_schema.json. It cannot 'see' the default that we set in the nextflow.config so make sure they match when building the json.

    • directory-path will appear as class (string) e.g., for param reads_dir

    See for more information on options for nextflow_schema.json on DNAnexus.

    Running the Nextflow Pipeline Applet

    Using samplesheets

    When placing a path to a file on the DNAnexus platform in a samplesheet, use the format dx://project-xxx:/path/to/file

    Here is an example of a samplesheet with one sample (the format of the samplesheet is determined by you; this is just for illustration purposes):

    Run the applet from the UI

    1. In your project on the platform, click the fastqc-nf applet.

    2. In the run applet screen, click 'Output to' and choose your output location.

    3. Click 'Next'.

    4. At the setup screen, either input a samplesheet or write the path for reads_dir. In the image below, I have used the reads_dir param. Replace 'project-xxx' and '/path/to/reads' with your project ID and the folder name that your reads are in.

    5. Review the rest of the inputs and change anything that you want, e.g., turn on 'preserve_cache'.

    6. Click 'Start Analysis'.

    7. Review the name, output location, etc.

    8. Click 'Launch Analysis'.

    Run the applet on the CLI

    Running the fastqc applet with the reads_dir as input

    • I am turning on preserve_cache and using -inextflow_run_opts in the command below to demonstrate how to add them to the command, but neither is required here

    • Note that the *_{1,2}.fastq.gz is needed here for Channel.fromFilePairs to correctly pair up related files

    • I do not need -profile docker in -inextflow_run_opts

    Running the fastqc applet with the samplesheet as input

    Notice the different way that the path to the samplesheet is specified compared to the reads_dir in the previous example. You can read more about how this works here.

    Resources

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.

    Some of the links on these pages will take the user to pages that are maintained by third parties. The accuracy and IP rights of the information on these third-party pages are the responsibility of those third parties.

  • call will execute the task named write_greeting. This is similar to executing a function in code.

  • The input keyword allows you to pass arguments to the task. The workflow's name input will be passed as greet_name to the task.

  • The task keyword defines a task called write_greeting.

  • The task also defines an input block with a parameter greet_name. It would be fine to call this name because it would not conflict with the workflow's name.

  • The command keyword defines a block of shell commands to execute. The block may be denoted with matched curly braces ({}) or triple angle brackets (<<</>>>).

  • The shell command echo prints a salutation to standard out, AKA STDOUT, which is the normal place for output from a command-line program to appear. The variable greet_name is interpolated in the string by surrounding it with ~{} or ${} because the command block uses curly braces. When using triple angle brackets, only the first syntax is valid.

  • The output block defines an outfile variable of the type File. The contents of the file are the captured STDOUT from the command block.

  • The outfile is set to the file out.txt, which is created by the command block.
  • Check the syntax of a WDL file using miniwdl.

  • Execute a WDL workflow on your local computer using Cromwell with the inputs defined in a JSON file.

  • Create a new project to contain a workflow.

  • Compile a WDL workflow into a DNAnexus applet using the dxCompiler.

  • Run an applet using the web interface or the CLI.

  • Inspect the file output of an applet using the web interface or the CLI.

  • Download a file from the platform.

  • Use a Makefile to document and automate the steps for building and running a workflow.

  • (Optional, recommended) A nextflow_schema.json file. If this file is present when importing or building the executable, the imported executable will expose the nextflow input parameters to the user on the DNAnexus CLI and UI.
  • (Optional) Subfolders and other configuration files. Subfolders and other configuration files can be referenced by the major Nextflow file or nextflow.config via the include or includeConfig keyword. Ensure that all referenced subfolders and files exist under the pipeline script folder at the time of building or importing the pipeline.

  • (Optional) A bin folder containing scripts required by the pipeline can also be used and this will be added to the PATH environment variable by nextflow - for more info see the nextflow documentation on custom scripts and tools

  • For other files/folders such as assets, an nf-core flavored folder structure is encouraged but not required for nfcore pipelines. See here for an example in sarek. In your own pipeline, you can do it however you please!
  • You should define the cpus, memory, disk (at least one of these 3), or you can use machineType and the name of the exact DNAnexus instance that you want to use for this process.

    For example machineType 'mem2_ssd1_v2_x2'

    If you do not specify the resources required for a process, it will by default use the mem2_ssd1_v2_x4 instance type (this is the same machine type used for the head node) and processes that require more memory than this will fail.

  • You should use the publishDir directive to capture the output files that you want to publish from each process. It is generally advisable to publish your output files to an output directory defined by params.outdir (the naming doesn't matter as long as it's consistent within your pipeline). You can have as many subfolders of your outdir as needed, and you can use the publishDir directive multiple times in the same process to send different output files to different subfolders.

  • At the bottom of the popup in the Format section, for a file input choose File path, or for a directory path choose Directory path. Having these two correct is important for how you specify the inputs on the platform.

  • When you are finished building your schema file, click 'Finished', then 'Copy pipeline schema' and paste the information into a file called nextflow_schema.json in the same directory as your applet main.nf and nextflow.config files.

  • If you note the Schema cache ID then you can type that into the website to pull up and edit that file within 14 days.

  • string is rendered as a string and appears as a text box input on the UI.

    When (string) is given for a parameter (used for folder paths and strings; the input is of the 'string' class), use dx://project-XXXXX:/path/to/folder, e.g., dx run fastqc-nf -ireads_dir=dx://project-GgYbKGQ0QFpxF6qkPK4KxQ6Q:/FASTQ/*_{1,2}.fastq.gz

  • file-path will appear as class file e.g. for param samplesheet:

    When (file) is given for a parameter (i.e., the input is of the 'file' class), use project-XXXXX:/path/to/file, e.g., dx run fastqc-nf -isamplesheet=project-XXXXX:/samplesheet-example.csv ....

  • as docker was enabled in the nextflow.config for this applet
  • --name names the job

  • See here for nextflow config file information
    version 1.0 
    
    workflow hello_world { 
        input { 
            String name 
        }
    
        call write_greeting { 
            input: greet_name = name 
        }
    }
    
    task write_greeting { 
        input {
            String greet_name 
        }
    
        command { 
            echo 'Hello, ${greet_name}!' 
        }
    
        output {
            File outfile = stdout() 
        }
    }
    $ miniwdl check workflow.wdl
    workflow.wdl
        workflow hello_world
            call write_greeting
        task write_greeting
    $ miniwdl check workflow.wdl
    (workflow.wdl Ln 0 Col 0) unknown WDL version 2.0; choices:
    draft-2, 1.0, development, 1.1
    $ miniwdl check workflow.wdl
    (workflow.wdl Ln 8 Col 5) No such task/workflow: write_greetings
            call write_greetings {
            ^^^^^^^^^^^^^^^^^^^^^^
    { "hello_world.name": "Geoffrey" }
    $ java -jar ~/cromwell-82.jar run --inputs inputs.json workflow.wdl
    {
      "hello_world.write_greeting.outfile":
      "/Users/[email protected]/work/srna/wdl_tutorial/hello/
      cromwell-executions/hello_world/7f02fe78-4aff-4e01-95da-c9b6e021773d/
      call-write_greeting/execution/stdout"
    }
    $ cat cromwell-executions/hello_world/7f02fe78-4aff-4e01-95da-c9b6e021773d/call-write_greeting/execution/stdout
    Hello, Geoffrey!
    version 1.0
    
    workflow hello_world {
        input {
            String name
        }
    
        call write_greeting {
            input: greet_name = name
        }
    }
    
    task write_greeting {
        input {
            String greet_name
        }
    
        command <<< 
            echo 'Hello, ~{greet_name}!' > out.txt 
        >>>
    
        output {
            File outfile = "out.txt" 
        }
    }
    {
      "outputs": {
        "hello_world.write_greeting.outfile":
        "/Users/[email protected]/work/srna/wdl_tutorial/hello/
        cromwell-executions/hello_world/1dd3abd8-be70-418b-9a31-b4ea9d5add99/
        call-write_greeting/execution/out.txt"
      },
      "id": "1dd3abd8-be70-418b-9a31-b4ea9d5add99"
    }
    $ cat cromwell-executions/hello_world/1dd3abd8-be70-418b-9a31-b4ea9d5add99/
      call-write_greeting/execution/out.txt
    Hello, Geoffrey!
    $ dx new project "Workflow Test"
    Created new project called "Workflow Test" (project-GFbKy7Q0ff1k3fGq48ZFZ45p)
    Switch to new project now? [y/N]: y
    $ java -jar ~/dxCompiler-2.10.2.jar compile workflow.wdl -folder /workflows \
    > -project project-GFbKy7Q0ff1k3fGq48ZFZ45p
    workflow-GFbP9480ff1zVQPG48zXpfzb
    $ dx run workflow-GFbP9480ff1zVQPG48zXpfzb
    Entering interactive mode for input selection.
    
    Input:   stage-common.name (stage-common.name)
    Class:   string
    
    Enter string value ('?' for more options)
    stage-common.name: Ronald
    
    Select an optional parameter to set by its # (^D or <ENTER> to finish):
    
     [0] stage-common.overrides___ (stage-common.overrides___)
     [1] stage-common.overrides______dxfiles (stage-common.overrides______dxfiles)
     [2] stage-0.greet_name (stage-0.greet_name) [default={"$dnanexus_link": {"outputField": "name", "stage": "stage-common"}}]
     [3] stage-0.overrides___ (stage-0.overrides___)
     [4] stage-0.overrides______dxfiles (stage-0.overrides______dxfiles)
     [5] stage-outputs.overrides___ (stage-outputs.overrides___)
     [6] stage-outputs.overrides______dxfiles (stage-outputs.overrides______dxfiles)
    
    Optional param #:
    The following 1 stage(s) will reuse results from a previous analysis:
      Stage 2: outputs (job-GFbPJx80ff1gYQy5Fg3pK3GY)
    
    
    Using input JSON:
    {
        "stage-common.name": "Ronald"
    }
    
    Confirm running the executable with this input [Y/n]: y
    Calling workflow-GFbP9480ff1zVQPG48zXpfzb with output destination
      project-GFbKy7Q0ff1k3fGq48ZFZ45p:/
    
    Analysis ID: analysis-GFbPjVj0ff1ZypqJ8vQj8kjZ
    $ dx run workflow-GFbP9480ff1zVQPG48zXpfzb -j '{"stage-common.name": "Ronald"}'
    -y
    The following 3 stage(s) will reuse results from a previous analysis:
      Stage 0: common (job-GFbPjVj0ff1ZypqJ8vQj8kjf)
      Stage 1: write_greeting (job-GFbPjVj0ff1ZypqJ8vQj8kjg)
      Stage 2: outputs (job-GFbPJx80ff1gYQy5Fg3pK3GY)
    
    
    Using input JSON:
    {
        "stage-common.name": "Ronald"
    }
    
    Calling workflow-GFbP9480ff1zVQPG48zXpfzb with output destination
      project-GFbKy7Q0ff1k3fGq48ZFZ45p:/
    
    Analysis ID: analysis-GFbPkFj0ff1k3fGq48ZFZ5Jy
    $ cat app_inputs.json
    {"stage-common.name": "Ronald"}
    $ dx run -f app_inputs.json workflow-GFbP9480ff1zVQPG48zXpfzb
    $ dx run workflow-GFbP9480ff1zVQPG48zXpfzb -h
    usage: dx run workflow-GFbP9480ff1zVQPG48zXpfzb [-iINPUT_NAME=VALUE ...]
    
    Workflow: hello_world
    
    Inputs:
     stage-common
      stage-common.name: -istage-common.name=(string)
    
     stage-common: Reserved for dxCompiler
      stage-common.overrides___: [-istage-common.overrides___=(hash)]
    
      stage-common.overrides______dxfiles: [-istage-common.overrides______dxfiles=(>
    
     stage-0
      stage-0.greet_name: [-istage-0.greet_name=(string, default={"$dnanexus_link":>
    
     stage-0: Reserved for dxCompiler
      stage-0.overrides___: [-istage-0.overrides___=(hash)]
    
      stage-0.overrides______dxfiles: [-istage-0.overrides______dxfiles=(file) [-is>
    
     stage-outputs: Reserved for dxCompiler
      stage-outputs.overrides___: [-istage-outputs.overrides___=(hash)]
    
      stage-outputs.overrides______dxfiles: [-istage-outputs.overrides______dxfiles>
    
    Outputs:
      stage-common.name: stage-common.name (string)
    
      stage-0.outfile: stage-0.outfile (file)
    $ dx run workflow-GFbP9480ff1zVQPG48zXpfzb -istage-common.name=Keith
    Result 1:
    ID                    analysis-GFbPjVj0ff1ZypqJ8vQj8kjZ
    Class                 analysis
    Job name              hello_world
    Executable name       hello_world
    Project context       project-GFbKy7Q0ff1k3fGq48ZFZ45p
    Billed to             org-sos
    Workspace             container-GFbPjVj0ff1ZypqJ8vQj8kjb
    Workflow              workflow-GFbP9480ff1zVQPG48zXpfzb
    Priority              normal
    State                 done
    Root execution        analysis-GFbPjVj0ff1ZypqJ8vQj8kjZ
    Parent job            -
    Stage 0               common (stage-common)
      Executable          applet-GFbP93j0ff1py9y87vzB2QQJ
      Execution           job-GFbPjVj0ff1ZypqJ8vQj8kjf (done)
    Stage 1               write_greeting (stage-0)
      Executable          applet-GFbP9380ff1XzVKkG9kyVg64
      Execution           job-GFbPjVj0ff1ZypqJ8vQj8kjg (done)
    Stage 2               outputs (stage-outputs)
      Executable          applet-GFbP9400ff1pK6v113KJQF9g
      Execution           [job-GFbPJx80ff1gYQy5Fg3pK3GY] (done)
      Cached from         analysis-GFbPJx80ff1gYQy5Fg3pK3GP
    Input                 stage-common.name = "Ronald"
                          [stage-0.greet_name = {"$dnanexus_link": {"analysis":
                           "analysis-GFbPjVj0ff1ZypqJ8vQj8kjZ", "stage":
                           "stage-common", "field": "name", "wasInternal": true}}]
    Output                stage-common.name = "Ronald"
                          stage-0.outfile = file-GFbPkBj0XFYgB7Vj4pF8XXBQ
    Output folder         /
    Launched by           kyclark
    Created               Wed Aug  3 15:52:55 2022
    Finished              Wed Aug  3 15:54:51 2022 (Wall-clock time: 0:01:55)
    Last modified         Wed Aug  3 15:54:54 2022
    Depends on            -
    Tags                  -
    Properties            -
    Total Price           $0.00
    detachedFrom          null
    rank                  0
    priceComputedAt       1659567291327
    currency              {"dxCode": 0, "code": "USD", "symbol": "$",
                          "symbolPosition": "left",
                          "decimalSymbol": ".",
                          "groupingSymbol": ","}
    totalEgress           {"regionLocalEgress": 0, "internetEgress": 0,
                          "interRegionEgress": 0}
    egressComputedAt      1659567291327
    costLimit             null
    $ dx cat file-GFbPkBj0XFYgB7Vj4pF8XXBQ
    Hello, Ronald!
    $ dx download file-GFbPkBj0XFYgB7Vj4pF8XXBQ
    [===========================================================>] Completed 15
    of 15 bytes (100%) /Users/[email protected]/work/srna/wdl_tutorial/stdout
    $ cat stdout
    Hello, Ronald!
    WORKFLOW = workflow.wdl
    PROJECT_ID = project-GFPQvY007GyyXgXGP7x9zbGb
    DXCOMPILER = java -jar ~/dxCompiler-2.10.2.jar
    CROMWELL = java -jar ~/cromwell-82.jar
    
    check:
        miniwdl check $(WORKFLOW)
    
    local:
        $(CROMWELL) run --inputs inputs.json $(WORKFLOW)
    
    local2:
        $(CROMWELL) run workflow2.wdl
    
    app:
        $(DXCOMPILER) compile $(WORKFLOW) \
            -archive \
            -folder /workflows \
            -project $(PROJECT_ID)
    
    clean:
        rm -rf cromwell-workflow-logs cromwell-executions
    samplesheet: [-isamplesheet=(file)]
        (Nextflow pipeline required)
    // Use newest nextflow dsl - not required to add this line - only dsl2 is supported on DNAnexus
    nextflow.enable.dsl = 2
    
    log.info """\
        ===================================
                F A S T Q C - E X A M P L E
        ===================================
        samplesheet : ${params.samplesheet}
        reads_dir   : ${params.reads_dir}
        outdir      : ${params.outdir}
        """
        .stripIndent()
    
    
    process FASTQC {
    
        tag "FastQC - ${sample_id}"
    
        container 'quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0'
        cpus 2
        memory { 4.GB * task.attempt }
        
    
        publishDir "${params.outdir}", pattern: "*", mode:'copy'
    
        input:
        tuple val(sample_id), path(reads)
    
        output:
        path "*"
    
        script:
        """
         fastqc --threads ${task.cpus} $reads                      
        """
    }
    
    
    /*
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        MAIN WORKFLOW
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    */
    
    workflow {
        if (params.samplesheet != null && params.reads_dir == null) {
            
            // Parse the CSV samplesheet; this assumes the example samplesheet format
            // shown below (header row: sample_name,fastq_1,fastq_2)
            reads_ch = Channel
                .fromPath(params.samplesheet)
                .splitCsv(header: true)
                .map { row -> tuple(row.sample_name, [file(row.fastq_1), file(row.fastq_2)]) }
    
                reads_ch.view()
                FASTQC(reads_ch)
    
        } else if (params.samplesheet == null && params.reads_dir != null) {
            reads_ch = Channel.fromFilePairs(params.reads_dir)
    
            reads_ch.view()
            FASTQC(reads_ch)
    
        } else {
            error "Either samplesheet or reads_dir should be provided, not both"
        }
    }
    
    
    workflow.onComplete {
        log.info ( workflow.success ? "\nworkflow is done!\n" : "Oops .. something went wrong" )
    }
    process foo {
    
     publishDir "${params.outdir}/fastqc/html", pattern: "*.html", mode:'copy'
     publishDir "${params.outdir}/fastqc/zip", pattern: "*.zip"
    
     // ...
    }
    // Default parameters
    
    docker {
        enabled = true
    }
    
    params {
        samplesheet = null
        reads_dir = null
        outdir = "./results"
    }
    
    // Processes should always fail if any pipe element has a non-zero exit code.
    process.shell = ['/bin/bash', '-euo', 'pipefail']
    docker {
        enabled = true
    }
    params {
        samplesheet = null
        reads_dir = null
        outdir = "./results"
    }
    // Processes should always fail if any pipe element has a non-zero exit code.
    process.shell = ['/bin/bash', '-euo', 'pipefail']
    {
      "$schema": "http://json-schema.org/draft-07/schema",
      "$id": "https://raw.githubusercontent.com/YOUR_PIPELINE/master/nextflow_schema.json",
      "title": "Nextflow pipeline parameters",
      "description": "This pipeline uses Nextflow and processes some kind of data. The JSON Schema was built using the nf-core pipeline schema builder.",
      "type": "object",
      "definitions": {
          "inputs": {
              "title": "Inputs",
              "type": "object",
              "description": "",
              "default": "",
              "properties": {
                  "samplesheet": {
                      "type": "string",
                      "description": "Input samplesheet in CSV format",
                      "format": "file-path"
                  },
                  "reads_dir": {
                    "type": "string",
                    "description": "Reads directory for file pairs with wildcard",
                    "format": "directory-path"
                },             
                  "outdir": {
                      "type": "string",
                      "format": "directory-path",
                      "description": "Local path to output directory",
                      "default": "./results"
                  }
              }
          }
      },
      "allOf": [
          {
              "$ref": "#/definitions/inputs"
          }
      ]
    }
    #select project
    dx select project-ID
    main.nf 
    nextflow.config
    nextflow_schema.json
    dx build --nextflow fastqc-nf
    dx build -a --nextflow fastqc-nf --destination project-XXXXX:/TEST/fastqc-nf
    dx build -a --nextflow fastqc-nf --destination project-XXXXX:/test-fastqc-nf
    {"id": "applet-ID"}
    dx run fastqc-nf -h
    usage: dx run fastqc-nf [-iINPUT_NAME=VALUE ...]
    
    Applet: fastqc-nf
    
    fastqc-nf
    
    Inputs:
      outdir: [-ioutdir=(string)]
            (Nextflow pipeline required) Default value:./results
    
      reads_dir: [-ireads_dir=(string)]
            (Nextflow pipeline required)
    
      samplesheet: [-isamplesheet=(file)]
            (Nextflow pipeline required)
    
            ....
    outdir: [-ioutdir=(string)]
        (Nextflow pipeline required) Default value:./results
    sample_name,fastq_1,fastq_2
    sampleA,dx://project-xxx:/path/to/sampleA_r1.fastq.gz,dx://project-xxx:/path/to/sampleA_r2.fastq.gz
    dx run fastqc-nf \
    -ireads_dir="dx://project-ID:/FASTQ/*_{1,2}.fastq.gz" \
    -ioutdir="./fastqc-out-rd" \
    -ipreserve_cache=true \
    -inextflow_run_opts='-queue-size 10' \
    --destination "project-ID:/USERS/FOLDERNAME" \
    --name fastqc-nf-with-reads-dir \
    -y
    dx run fastqc-nf -isamplesheet="project-ID:/samplesheet-example.csv" \
    -ioutdir="./fastqc-out-sh" \
    --destination "project-ID:/USERS/FILENAME" \
    --name fastqc-nf-with-samplesheet \
    -y
    reads_dir: [-ireads_dir=(string)]
        (Nextflow pipeline required)

    Introduction to CLI

    Overview of Interacting with the Platform

    Users of the platform interact with it in a variety of ways, but this section is dedicated to those who want to learn how to interact with it using the command line, or CLI.

    Terms

    The CLI interacts with the platform in the following way:

    • The CLI (command line interface) is run locally on your own machine.

    • On your local machine, you will download the SDK (software development kit), which we also call dx-toolkit. Information on downloading it and other requirements is found in the Getting Started Guide. Once set up, this allows you to log into the platform and explore your data/projects, create apps and workflows, and launch analyses.

    • API (application programming interface) servers are used to interact with the platform using HTTP requests. The arguments for these requests are fields in a JSON document. If you want more details on this structure, see our API documentation.

    Installation

    Please ensure that you are running Python 3 before starting this install.

    To install:
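    A typical installation with pip (assuming Python 3 and pip are available on your machine):

    $ pip3 install dxpy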

    To upgrade dxpy
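    Again assuming pip, upgrading looks like this:

    $ pip3 install --upgrade dxpy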

    Further details can be found in our documentation if you need them.

    Introducing dx-toolkit

    The dx command will be your most used utility for interacting with the DNAnexus platform. You can run the command with no arguments or with the -h or --help flags to see the usage:

    Sometimes the usage may occupy more than your entire terminal, in which case you may see (END) to show that you are at the end of the documentation. Press q to quit the usage, or use the universal Ctrl-C to send an interrupt signal to the process to kill it.

    Run dx help to read about the categories of commands you can run:

    Logging Into the Platform

    Let's start by using dx login to gain access to the DNAnexus platform from the command line. All dx commands will respond to -h|--help, so run the command with one of these flags to read the usage:

    The help documentation is often called the usage because that is often the first word of the output. In the previous output, notice that all the arguments are enclosed in square brackets, e.g., [--token TOKEN]. This is a common convention in Unix documentation to indicate that the argument is optional. The lack of such square brackets means the argument is required.

    Some of the arguments require a value to follow. For example, --token TOKEN means the argument --token must be followed by the string value for the token. Arguments like --save are known as flags. They are either present or not and often represent a Boolean value, usually "True" when present and "False" when absent.

    The most basic usage for login is to enter your username and password when prompted:

    You may also generate a token in the web UI for use on the command line:
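    For example, with a placeholder for the token value:

    $ dx login --token <TOKEN>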

    Information on setting up tokens can be found in our documentation.

    Use dx logout to log out of the platform. This invalidates a token.

    If you are ever in doubt of your username, use dx whoami to see your identity.

    • When you ssh into a cloud workstation, you will be your normal DNAnexus user.

    • When running the ttyd app to access a cloud workstation through the UI, you will be the privileged Unix user root.

    • When you ssh into a running job, you will be the user dnanexus.

    Working with Projects and Users

    A project is the smallest unit of sharing in DNAnexus, and you must always work in the context of a project. Upon login, you will be prompted to select a project. To change projects, use dx select. Use -h|--help to view the usage:

    When run with no options, you will be presented with a list of your projects and your permission level for each:

    Press Enter to choose the first project, or select a number 0-9 to choose a project or m for "more" options. You can also provide a project name or ID as the first argument:

    Use the --level option to specify only projects where you have a particular permission. For instance, dx select --level ADMINISTER will show only projects where you are an administrator.

    Normally, projects are private to your organization, but the --public option will display the public projects that DNAnexus uses to share common resources like sequence files or indexes for reference genomes:

    Press Ctrl-C to exit the program without making a selection.

    If you are ever in doubt as to your current project, run dx pwd (print working directory):

    Alternatively, you can run dx env to see your current environment:

    If I wanted to share some data with a collaborator, I would use dx new project to create a new project to hold select data and apps. Following is the usage:

    I will use this command to create a new project in the AWS US-East-1 region. See the documentation for a list of available regions. The command displays the new project ID and prompts to switch into the new project:
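    A sketch of that command, using a placeholder project name:

    $ dx new project "Collaboration Project" --region aws:us-east-1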

    Next, I would use dx invite <user-id> to invite users to the project. Start with the usage to see how to call the command:

    The usage shows that this command includes three positional arguments, the first of which (invitee) is required and the other two (project, permissions) are optional. Your currently selected project is the default project, and "VIEW" is the default permission. If you wish to indicate some permission other than "VIEW," you must specify the project first.

    Use dx uninvite <user-id> to revoke a user's access to a project:
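    For example, with placeholder user and project IDs:

    $ dx invite user-collaborator project-xxxx VIEW
    $ dx uninvite user-collaborator project-xxxx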

    Data Exploration

    Earlier, I introduced dx pwd to print working directory to find my currently selected project.

    Notice that the output shows the project name and the directory /, which is the root directory of the project:

    The command dx ls will list the contents of a directory. Notice in the usage that the directory name is optional, in which case it will use the current working directory:

    There is nothing to list because I just created this project, so I'll add some data next.

    Copying and Moving Files

    I will use the command dx cp to copy a small file from one of the public projects into my project. I'll start with the usage:

    The usage shows source [source ...], which is another Unix convention to indicate that the argument may be repeated. This means you can indicate several source files or directories to be copied to the final destination.

    I'll copy the file hs38DH.dict from the project "Reference Genome Files: AWS US (East)" into the root directory of my new project. The command will only produce output on error:

    I must specify the source file using the project and file ID. When you refer to files inside your current project, it's only necessary to use the file ID.

    Now I can list the one file:

    Often you'll want to use the file ID, which you can view using the -l|--long flag to see the long listing that includes more metadata:

    I've decided I want to create a data directory to hold files such as this, so I will use dx mkdir data. The command will produce no output on success. A new listing shows data/ where the trailing slash indicates this is a directory:

    To move the hs38DH.dict into the data directory, I can either use the file name or ID:

    A new listing shows that the file is no longer in the root directory:

    I can specify the data directory to view the contents:

    Alternatively, I can use dx cd data to change directories. The command dx pwd will verify that I'm in the new folder:

    If I execute dx ls now, I'll see the contents of the data directory:

Return to the root directory of the project by running dx cd or dx cd /.

    Another way to inspect the structure of a project is using dx tree:

    With no options, you will see a tree structure of the project:

    This command will also show the long listing with -l|--long:

    Uploading Data

    I want to create a local file on my computer and add it to the project. I'll use the echo command to redirect some text into a file:

    I'll use the dx upload command. The usage shows that filename is required and may be repeated.

    There are many options to the command, and here are a few to highlight:

    • --brief: Display a brief version of the return value; for most commands, prints a DNAnexus ID per line

    • -r, --recursive: Upload directories recursively

    • --path [PATH], --destination [PATH]: DNAnexus path to upload file(s) to (default uses current project and folder if not provided)
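For instance, a hypothetical local directory of FASTQ files could be uploaded recursively into a project folder like so (the trailing slash marks the destination as a folder rather than a new file name):

$ dx upload -r fastq_files --path /data/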

    Run dx upload hello.txt and see that the new file exists in the root directory of your current project:

    You can also upload data using the UI. Under the "Add" menu, you will find the following:

    • Upload Data: Use your browser to add files to the project. This is the same as using dx upload.

    • Copy Data From Project: Add data from existing projects on the platform. This is the same as dx cp.

• Add Data From Server: Add data from any publicly accessible URL such as an HTTP or FTP site. This is the same as running the URL Fetcher app.

• Import From AWS S3: Add data from an S3 bucket. This is the same as running the AWS S3 Importer app.

In addition, we offer an app.

I would like to check the new file on the platform. The dx cat command will, like the Unix cat (concatenate) command, print the entire contents of a file to the console:

    I can use this to verify that the file was correctly uploaded:

    You might expect the following command to upload hello.txt into the data directory:

    Unfortunately, this will create a file called data alongside a directory called data:

    I can verify that the data file contains "hello":

    Note this important part of upload's usage:

    This brings up an interesting point that file names are not unique on the DNAnexus platform. The only unique identifier is the file ID, and so this is always the best way to refer to a file. To rectify the duplication, I will get the file ID:

    I can remove the file using dx rm file-GXZB2180fF65j2G1197pP7By.
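Had I wanted the upload to land inside the data folder in the first place, the destination needs a trailing slash so that dx upload treats it as a folder and keeps the original filename; a minimal sketch (not run here):

$ dx upload hello.txt --path data/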

If I run dx upload hello.txt again, I will not overwrite the existing file. Rather, another copy of the file will be created with a new file ID:

The concept of immutability was covered in "Course 101 Overview of the DNAnexus Platform User Interface": Remember the crucially important fact that data objects on the DNAnexus platform are immutable. They can only be created (e.g., by uploading them) or removed, but they can never be overwritten. A given object ID always points to the same collection of bits, which leads to downstream benefits like reusing the outputs of jobs that share the same executable and input IDs (smart reuse).

    I cannot remove the file by filename as it's not unique, so I'm prompted to select which file I want:

    I used dx cat hello.txt to read the contents of the entire file because I knew the file had only one line. It's far safer to use dx head to look at just the first few lines (the default is 10):

    For instance, I can peek at the data/hs38DH.dict file:

    Another option to check the file is to download it:

    Inspecting Object Metadata

    Every data object on the platform has a unique identifier prefixed with the type of object such as "file-," "record-," or "applet-." Earlier, I saw that hello.txt has the ID file-GXZB1v80fF6BXJ8p7PvZPy1v. I can use the dx describe command to view the metadata:

    I could use the filename, if it's unique, but it's always best practice to use the file ID:

    As shown in the usage, the --delim option causes the output table to use whatever delimiter you want between the columns. This could be useful if you wish to parse the output programmatically. The tab character is the default delimiter, but I can use a comma like so:

    The --json flag returns the same data in JavaScript Object Notation (JSON), which we'll discuss in a later chapter:

I can use dx describe to view the metadata associated with any object identifier on the platform. For instance, I'll use head to view the first few lines of the project's metadata:

    Find another entity ID, such as your billing org, to use with the command.
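For example, this project is billed to an org, so describing that org ID is one option (a sketch, assuming you have permission to view it):

$ dx describe org-sos | head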

Moving and Renaming Files

    I can use dx mv to move a file or directory within a project:

    For instance, I can rename hello.txt to goodbye.txt with the command dx mv hello.txt goodbye.txt. The file ID remains the same:

    I can also move goodbye.txt to the data directory and rename it back to hello.txt. Again, the file ID remains the same because I have only changed some of the file's metadata:

As noted in the preceding usage, I should use dx cp to copy data from one project to another. If I attempt to copy a file within a project, I will get an error:

    The only way to make an actual copy of a file is to upload it again as I did earlier when I added the hello.txt file a second time.

    Data objects on the platform exist as bits in AWS or Azure storage, and the associated metadata is stored in a DNAnexus database. If two projects are in the same region such as AWS US-East-1, then dx cp doesn't actually copy the bits but rather creates a new database entry pointing to the object. This means you don't pay for additional storage. Copying between regions, however, does make a physical copy of the bits and will cost money for data egress and storage. When in doubt, use dx describe <project-id> to see a project's "Region" attribute or check the "Settings" in the project view UI.
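For example, a quick way to check a project's region from the command line:

$ dx describe project-GXZ90x00fF6F4fy1K20x4gv9 | grep Region
Region                      aws:us-east-1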

    Finding Data

    The dx find command will help you search for entities including:

    • apps

    • globalworkflows

    • jobs

• data

• projects

• orgs

• org members

• org projects

• org apps

    I can use the dx find data command to search data objects such as files and applets. I'll display the first part of the usage as it's rather long:

    Run the command in the current project to see the two files:

    I can use the --name option to look for a file by name:

    I can also specify a Unix file glob pattern, such as all files that begin with h:

Or all files that end with .dict. Note in this example that the asterisk is escaped with a backslash to prevent my shell from expanding it locally, as I want the literal star to be given as the argument:

    The --brief flag will return only the file ID:

    This is useful, for instance, for downloading a file:

    The --json flag will return the results in JSON format. In the JSON chapter, you will learn how to parse these results for more advanced querying and data manipulation:

    The --class option accepts the following values:

    • applet

    • database

    • file

• record

• workflow

The --state option accepts the following values:

    • open: A file that is currently being uploaded

    • closing: A file that is done uploading but is still being validated

• closed: A file that is uploaded and validated

• any: any of the above
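These filters can be combined; for example, a sketch that lists only the closed files in the current project:

$ dx find data --class file --state closed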

    There are many more options for finding data and other entities on the platform that will be covered in later chapters.

    Running Jobs

    It's time to run an app, but which one? I'd like to have a FASTQ file to work with, so I'll start by using the SRA FASTQ Importer. I can never quite remember the name of the app, so I'll search for it using a wildcard:

    The "x" in the first column indicates this is an app supported by DNAnexus.

    I can find information about the inputs and outputs to the app using either of these commands:

    • dx describe sra_fastq_importer

    • dx run sra_fastq_importer -h

    I prefer the output from the second command:

Looking at the usage for the app, I see that only the -iaccession argument is required, as all the others are shown enclosed in square brackets, e.g., [-ingc_key=(file)]. I can run the app with the SRA accession SRR070372 (C. elegans), answering "yes" to both launching and watching the app:

    The equal sign in -iaccession=SRR070372 is required.

    The output of watching is the same as you would see from the UI if you click the "MONITOR" tab in the project view and then "View Log" while the app is running. The end of the watch shows the app ran successfully and that a new file was created in my project:

    I can find the size of the file with dx ls:

Now I'd like to run this file through FastQC. I'll search for the app by name just to be sure, and, yes, it's called "fastqc":

Again, I use either dx describe or dx run -h to see that the app requires a reads file (gzipped FASTQ or BAM) as its only required input:

    I will use the new file's ID as the input to FastQC, and I'll run it using the additional flags -y to confirm launching and --watch to immediately start watching the job:

    Notice that the confirmation shows "Using input JSON". If you like, you can save that to a file called, for example, input.json:

    I can then launch the job using the -f|--input-json-file argument along with the --brief flag to show only the resulting job ID:

    Since the output will be the same, I can kill the job using dx terminate job-GXf930j071xJfYqfJ2kkvk8v.

    The end of the watch shows that the job finishes successfully:

    I would like to get a feel for the output, so I'll use dx head on the stats_txt output file ID:

    Review

    You are now able to:

• List the advantages of interacting with the platform via the command-line interface

    • List the functions of the SDK and the API

    • Describe the purpose of the dx-toolkit

    • Apply frequently used dx-toolkit commands to execute common use cases, applicable to a broad audience of users

    Resources

    To create a support ticket if there are technical issues:

    1. Go to the Help header (same section where Projects and Tools are) inside the platform

    2. Select "Contact Support"

    3. Fill in the Subject and Message to submit a support ticket.


    pip3 install dxpy
pip3 install --upgrade dxpy
    usage: dx [-h] [--version] command ...
    
    DNAnexus Command-Line Client, API v1.0.0, client v0.346.0
    
    dx is a command-line client for interacting with the DNAnexus platform.  You
    can log in, navigate, upload, organize and share your data, launch analyses,
    and more.  For a quick tour of what the tool can do, see
    
      https://documentation.dnanexus.com/getting-started/tutorials/cli-quickstart#q>
    
    For a breakdown of dx commands by category, run "dx help".
    
    dx exits with exit code 3 if invalid input is provided or an invalid operation
    is requested, and exit code 1 if an internal error is encountered.  The latter
    usually indicate bugs in dx; please report them at
    
      https://github.com/dnanexus/dx-toolkit/issues
    
    options:
      -h, --help  show this help message and exit
      --env-help  Display help message for overriding environment
                  variables
      --version   show program's version number and exit
    $ dx help
    usage: dx help [-h] [command_or_category] [subcommand]
    
    Displays the help message for the given command (and subcommand if given), or
    displays the list of all commands in the given category.
    
    CATEGORIES
    
      all       All commands
      session   Manage your login session
      fs        Navigate and organize your projects and files
      data      View, download, and upload data
      metadata  View and modify metadata for projects, data, and executions
      workflow  View and modify workflows
      exec      Manage and run apps, applets, and workflows
      org       Administer and operate on orgs
      other     Miscellaneous advanced utilities
    $ dx login -h
    usage: dx login [-h] [--env-help] [--token TOKEN] [--noprojects] [--save]
                    [--timeout TIMEOUT]
    
    Log in interactively and acquire credentials. Use "--token" to log in with an
    existing API token.
    
    options:
      -h, --help         show this help message and exit
      --env-help         Display help message for overriding environment variables
      --token TOKEN      Authentication token to use
      --noprojects       Do not print available projects
      --save             Save token and other environment variables for future
                         sessions
      --timeout TIMEOUT  Timeout for this login token (in seconds, or use suffix
                         s, m, h, d, w, M, y)
    $ dx login
    Acquiring credentials from https://auth.dnanexus.com
    Username: XXXXXXXX
    Password: XXXXXXXX
    $ dx login --token xxxxxxxxxxx
    $ dx select -h
    usage: dx select [-h] [--env-help] [--name NAME]
                     [--level {VIEW,UPLOAD,CONTRIBUTE,ADMINISTER}] [--public]
                     [project]
    
    Interactively list and select a project to switch to. By default, only lists
    projects for which you have at least CONTRIBUTE permissions. Use --public to
    see the list of public projects.
    
    positional arguments:
      project               Name or ID of a project to switch to; if not provided
                            a list will be provided for you
    
    options:
      -h, --help            show this help message and exit
      --env-help            Display help message for overriding environment
                            variables
      --name NAME           Name of the project (wildcard patterns supported)
      --level {VIEW,UPLOAD,CONTRIBUTE,ADMINISTER}
                            Minimum level of permissions expected
      --public              Include ONLY public projects (will automatically set
                            --level to VIEW)
    $ dx select
    
    Note: Use dx select --level VIEW or dx select --public to
    select from projects for which you only have VIEW permissions.
    
    Available projects (CONTRIBUTE or higher):
    0) App Dev (ADMINISTER)
    1) Methylation (ADMINISTER)
    2) Genomes (ADMINISTER)
    3) WTS (ADMINISTER)
    4) WGS (ADMINISTER)
    5) Exome (ADMINISTER)
    6) QC (ADMINISTER)
    7) Collaborators (ADMINISTER)
    8) Pipeline Dev (ADMINISTER)
    9) WDL Test (ADMINISTER)
    m) More options not shown...
    
    Pick a numbered choice or "m" for more options [0]:
    $ dx select project-XXXXXXXXXXXXXXXXXXXXXXXX
    $ dx select "Pipeline Dev"
    $ dx select --public
    
    Available public projects:
    0) Reference Genome Files: Azure US (West) (VIEW)
    1) App_Assets_Europe(London)_Internal (VIEW)
    2) Reference Genome Files: Azure Amsterdam (VIEW)
    3) Reference Genome Files: AWS Germany (VIEW)
    4) Reference Genome Files: AWS US (East) (VIEW)
    5) Reference Genome Files: AWS Europe (London) (VIEW)
    6) App and Applet Assets Azure (VIEW)
    7) dxCompiler_Europe_London (VIEW)
    8) dxCompiler_Sydney (VIEW)
    9) dxCompiler_Berlin (VIEW)
    m) More options not shown...
    
    Pick a numbered choice or "m" for more options:
    $ dx pwd
    Pipeline Dev:/
    $ dx env
    Auth token used         XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    API server protocol     https
    API server host         api.dnanexus.com
    API server port         443
    Current workspace       project-XXXXXXXXXXXXXXXXXXXXXXXX
    Current workspace name  "Pipeline Dev"
    Current folder          /
    Current user            test_user
    $ dx new project -h
    usage: dx new project [-h] [--brief | --verbose] [--env-help]
                          [--region REGION] [-s] [--bill-to BILL_TO] [--phi]
                          [--database-ui-view-only]
                          [name]
    
    Create a new project
    
    positional arguments:
      name                  Name of the new project
    
    options:
      -h, --help            show this help message and exit
      --brief               Display a brief version of the return value; for most
                            commands, prints a DNAnexus ID per line
      --verbose             If available, displays extra verbose output
      --env-help            Display help message for overriding environment
                            variables
      --region REGION       Region affinity of the new project
      -s, --select          Select the new project as current after creating
      --bill-to BILL_TO     ID of the user or org to which the project will be
                            billed. The default value is the billTo of the
                            requesting user.
      --phi                 Add PHI protection to project
      --database-ui-view-only
                            Viewers on the project cannot access database data
                            directly
    $ dx new project --region aws:us-east-1 demo_project
    Created new project called "demo_project" (project-GXZ90x00fF6F4fy1K20x4gv9)
    Switch to new project now? [y/N]: y
    $ dx invite -h
    usage: dx invite [-h] [--env-help] [--no-email]
                     invitee [project] [{VIEW,UPLOAD,CONTRIBUTE,ADMINISTER}]
    
    Invite a DNAnexus entity to a project. If the invitee is not recognized as a
    DNAnexus ID, it will be treated as a username, i.e. "dx invite alice : VIEW"
    is equivalent to inviting the user with user ID "user-alice" to view your
    current default project.
    
    positional arguments:
      invitee               Entity to invite
      project               Project to invite the invitee to
      {VIEW,UPLOAD,CONTRIBUTE,ADMINISTER}
                            Permissions level the new member should have
    
    options:
      -h, --help            show this help message and exit
      --env-help            Display help message for overriding environment
                            variables
      --no-email            Disable email notifications to invitee
    $ dx uninvite -h
    usage: dx uninvite [-h] [--env-help] entity [project]
    
    Revoke others' permissions on a project you administer. If the entity is not
    recognized as a DNAnexus ID, it will be treated as a username, i.e. "dx
    uninvite alice :" is equivalent to revoking the permissions of the user with
    user ID "user-alice" to your current default project.
    
    positional arguments:
      entity      Entity to uninvite
      project     Project to revoke permissions from
    
    options:
      -h, --help  show this help message and exit
      --env-help  Display help message for overriding environment variables
    $ dx pwd -h
    usage: dx pwd [-h] [--env-help]
    
    Print current working directory
    
    options:
      -h, --help  show this help message and exit
      --env-help  Display help message for overriding environment variables
    $ dx pwd
    demo_project:/
    $ dx ls -h
    usage: dx ls [-h] [--color {off,on,auto}] [--delimiter [DELIMITER]]
                 [--env-help] [--brief | --verbose] [-a] [-l] [--obj] [--folders]
                 [--full]
                 [path]
    
    List folders and/or objects in a folder
    
    positional arguments:
      path                  Folder (possibly in another project) to list the
                            contents of, default is the current directory in the
                            current project. Syntax: projectID:/folder/path
    usage: dx cp [-h] [--env-help] [-a] source [source ...] destination
    
    Copy objects and/or folders between different projects.  Folders will
    automatically be copied recursively.  To specify which project to use as a
    source or destination, prepend the path or ID of the object/folder with the
    project ID or name and a colon.
    
    EXAMPLES
    
      The first example copies a file in a project called "FirstProj" to the
      current directory of the current project.  The second example copies the
      object named "reads.fq.gz" in the current directory to the folder
      /folder/path in the project with ID "project-B0VK6F6gpqG6z7JGkbqQ000Q",
      and finally renaming it to "newname.fq.gz".
    
      $ dx cp FirstProj:file-B0XBQFygpqGK8ZPjbk0Q000q .
      $ dx cp reads.fq.gz project-B0VK6F6gpqG6z7JGkbqQ000Q:/folder/path/newname.fq.>
    
    positional arguments:
      source       Objects and/or folder names to copy
      destination  Folder into which to copy the sources or new pathname (if only
                   one source is provided).  Must be in a different
                   project/container than all source paths.
    
    options:
      -h, --help   show this help message and exit
      --env-help   Display help message for overriding environment
                   variables
      -a, --all    Apply to all results with the same name without
                   prompting
    $ dx cp project-BQpp3Y804Y0xbyG4GJPQ01xv:file-GFz5xf00Bqx2j79G4q4F5jXV /
    $ dx ls
    hs38DH.dict
    $ dx ls -l
    Project: demo_project (project-GXZ90x00fF6F4fy1K20x4gv9)
    Folder : /
    State   Last modified       Size      Name (ID)
    closed  2023-07-07 16:11:56 334.68 KB hs38DH.dict (file-GFz5xf00Bqx2j79G4q4F5jXV)
    $ dx ls
    data/
    hs38DH.dict
$ dx mv file-GFz5xf00Bqx2j79G4q4F5jXV data
$ dx mv hs38DH.dict data
    $ dx ls
    data/
    $ dx ls data
    hs38DH.dict
    $ dx pwd
    demo_project:/data
    $ dx ls -l
    Project: demo_project (project-GXZ90x00fF6F4fy1K20x4gv9)
    Folder : /data
    State   Last modified       Size      Name (ID)
    closed  2023-07-07 16:11:56 334.68 KB hs38DH.dict (file-GFz5xf00Bqx2j79G4q4F5jXV)
    $ dx tree -h
    usage: dx tree [-h] [--color {off,on,auto}] [--env-help] [-a] [-l] [path]
    
    List folders and objects in a tree
    
    positional arguments:
      path                  Folder (possibly in another project) to list the
                            contents of, default is the current directory in the
                            current project. Syntax: projectID:/folder/path
    
    options:
      -h, --help            show this help message and exit
      --color {off,on,auto}
                            Set when color is used (color=auto is used when stdout
                            is a TTY)
      --env-help            Display help message for overriding environment
                            variables
      -a, --all             show hidden files
      -l, --long            use a long listing format
    $ dx tree
    .
    └─ data
        └─ hs38DH.dict
    $ dx tree -l
    .
    └─ data
        └─ closed  2023-07-07 16:11:56 334.68 KB hs38DH.dict
                   (file-GFz5xf00Bqx2j79G4q4F5jXV)
    $ echo hello > hello.txt
    $ dx upload -h
    usage: dx upload [-h] [--visibility {hidden,visible}] [--property KEY=VALUE]
                     [--type TYPE] [--tag TAG] [--details DETAILS] [-p]
                     [--brief | --verbose] [--env-help] [--path [PATH]] [-r]
                     [--wait] [--no-progress] [--buffer-size WRITE_BUFFER_SIZE]
                     [--singlethread]
                     filename [filename ...]
    
    Upload local file(s) or directory. If "-" is provided, stdin will be used
    instead. By default, the filename will be used as its new name. If
    --path/--destination is provided with a path ending in a slash, the filename
    will be used, and the folder path will be used as a destination. If it does not
    end in a slash, then it will be used as the final name.
    
    positional arguments:
      filename              Local file or directory to upload ("-" indicates stdin
                            input); provide multiple times to upload multiple files
                            or directories
    $ dx ls
    data/
    hello.txt
    $ dx cat -h
    usage: dx cat [-h] [--env-help] [--unicode] path [path ...]
    
    positional arguments:
      path        File ID or name(s) to print to stdout
    
    options:
      -h, --help  show this help message and exit
      --env-help  Display help message for overriding environment variables
      --unicode   Display the characters as text/unicode when writing to stdout
    $ dx cat hello.txt
    hello
    $ dx upload hello.txt --path data
    $ dx ls
    data/
    data
    hello.txt
    $ dx cat data
    hello
    If --path/--destination is provided with a path ending in a slash, the
    filename will be used, and the folder path will be used as a destination.
    If it does not end in a slash, then it will be used as the final name.
    $ dx ls -l
    Project: demo_project (project-GXZ90x00fF6F4fy1K20x4gv9)
    Folder : /
    data/
    State   Last modified       Size      Name (ID)
    closed  2023-07-07 16:34:31 6 bytes   data (file-GXZB2180fF65j2G1197pP7By)
    closed  2023-07-07 16:34:10 6 bytes   hello.txt (file-GXZB1v80fF6BXJ8p7PvZPy1v)
    $ dx ls -l
    Project: demo_project (project-GXZ90x00fF6F4fy1K20x4gv9)
    Folder : /
    data/
    State   Last modified       Size      Name (ID)
    closed  2023-07-07 17:01:20 6 bytes   hello.txt (file-GXZBKYQ0fF6Pf2ZKPBF7G7j9)
    closed  2023-07-07 16:34:10 6 bytes   hello.txt (file-GXZB1v80fF6BXJ8p7PvZPy1v)
    $ dx rm hello.txt
    The given path "hello.txt" resolves to the following data objects:
    0) closed  2023-07-07 17:01:20 6 bytes   hello.txt (file-GXZBKYQ0fF6Pf2ZKPBF7G7j9)
    1) closed  2023-07-07 16:34:10 6 bytes   hello.txt (file-GXZB1v80fF6BXJ8p7PvZPy1v)
    
    Pick a numbered choice or "*" for all: 0
    $ dx head -h
    usage: dx head [-h] [--color {off,on,auto}] [--env-help] [-n N] path
    
    Print the first part of a file. By default, prints the first 10 lines.
    
    positional arguments:
      path                  File ID or name to access
    
    options:
      -h, --help            show this help message and exit
      --color {off,on,auto}
                            Set when color is used (color=auto is used when stdout
                            is a TTY)
      --env-help            Display help message for overriding environment
                            variables
      -n N, --lines N       Print the first N lines (default 10)
    $ dx head data/hs38DH.dict
    @HD VN:1.6
    @SQ SN:chr1 LN:248956422    M5:6aef897c3d6ff0c78aff06ac189178dd UR:file:/home/hs38DH.fa.gz
    @SQ SN:chr2 LN:242193529    M5:f98db672eb0993dcfdabafe2a882905c UR:file:/home/hs38DH.fa.gz
    @SQ SN:chr3 LN:198295559    M5:76635a41ea913a405ded820447d067b0 UR:file:/home/hs38DH.fa.gz
    @SQ SN:chr4 LN:190214555    M5:3210fecf1eb92d5489da4346b3fddc6e UR:file:/home/hs38DH.fa.gz
    @SQ SN:chr5 LN:181538259    M5:a811b3dc9fe66af729dc0dddf7fa4f13 UR:file:/home/hs38DH.fa.gz
    @SQ SN:chr6 LN:170805979    M5:5691468a67c7e7a7b5f2a3a683792c29 UR:file:/home/hs38DH.fa.gz
    @SQ SN:chr7 LN:159345973    M5:cc044cc2256a1141212660fb07b6171e UR:file:/home/hs38DH.fa.gz
    @SQ SN:chr8 LN:145138636    M5:c67955b5f7815a9a1edfaa15893d3616 UR:file:/home/hs38DH.fa.gz
    @SQ SN:chr9 LN:138394717    M5:6c198acf68b5af7b9d676dfdd531b5de UR:file:/home/hs38DH.fa.gz
    $ dx download file-GFz5xf00Bqx2j79G4q4F5jXV
    [===========================================================>]
    Downloaded 342,714
    [===========================================================>]
    Completed 342,714 of 342,714 bytes (100%) /Users/[email protected]/work/academy/hs38DH.dict
    $ dx describe -h
    usage: dx describe [-h] [--json] [--color {off,on,auto}]
                       [--delimiter [DELIMITER]] [--env-help] [--details]
                       [--verbose] [--name] [--multi]
                       path
    
    Describe a DNAnexus entity.  Use this command to describe data objects by name
    or ID, jobs, apps, users, organizations, etc.  If using the "--json" flag, it
    will thrown an error if more than one match is found (but if you would like a
    JSON array of the describe hashes of all matches, then provide the "--multi"
    flag).  Otherwise, it will always display all results it finds.
    
    NOTES:
    
    - The project found in the path is used as a HINT when you are using an object ID;
    you may still get a result if you have access to a copy of the object in some
    other project, but if it exists in the specified project, its description will
    be returned.
    
    - When describing apps or applets, options marked as advanced inputs will be
    hidden unless --verbose is provided
    
    positional arguments:
      path                  Object ID or path to an object (possibly in another
                            project) to describe.
    
    options:
      -h, --help            show this help message and exit
      --json                Display return value in JSON
      --color {off,on,auto}
                            Set when color is used (color=auto is used when stdout
                            is a TTY)
      --delimiter [DELIMITER], --delim [DELIMITER]
                            Always use exactly one of DELIMITER to separate fields
                            to be printed; if no delimiter is provided with this
                            flag, TAB will be used
      --env-help            Display help message for overriding environment
                            variables
      --details             Include details of data objects
      --verbose             Include additional metadata
      --name                Only print the matching names, one per line
      --multi               If the flag --json is also provided, then returns a JSON
                            array of describe hashes of all matching results
    $ dx describe file-GXZB1v80fF6BXJ8p7PvZPy1v
    Result 1:
    ID                          file-GXZB1v80fF6BXJ8p7PvZPy1v
    Class                       file
    Project                     project-GXZ90x00fF6F4fy1K20x4gv9
    Folder                      /
    Name                        hello.txt
    State                       closed
    Visibility                  visible
    Types                       -
    Properties                  -
    Tags                        -
    Outgoing links              -
    Created                     Fri Jul  7 16:34:09 2023
    Created by                  kyclark
    Last modified               Fri Jul  7 16:34:10 2023
    Media type                  text/plain
    archivalState               "live"
    Size                        6 bytes
    cloudAccount                "cloudaccount-dnanexus"
    $ dx describe file-GXZB1v80fF6BXJ8p7PvZPy1v --delim ,
    Result 1:
    ID,file-GXZB1v80fF6BXJ8p7PvZPy1v
    Class,file
    Project,project-GXZ90x00fF6F4fy1K20x4gv9
    Folder,/
    Name,hello.txt
    State,closed
    Visibility,visible
    Types,-
    Properties,-
    Tags,-
    Outgoing links,-
    Created,Fri Jul  7 16:34:09 2023
    Created by,kyclark
    Last modified,Fri Jul  7 16:34:10 2023
    Media type,text/plain
    archivalState,"live"
    Size,6 bytes
    cloudAccount,"cloudaccount-dnanexus"
    $ dx describe file-GXZB1v80fF6BXJ8p7PvZPy1v --json
    {
        "id": "file-GXZB1v80fF6BXJ8p7PvZPy1v",
        "project": "project-GXZ90x00fF6F4fy1K20x4gv9",
        "class": "file",
        "sponsored": false,
        "name": "hello.txt",
        "types": [],
        "state": "closed",
        "hidden": false,
        "links": [],
        "folder": "/",
        "tags": [],
        "created": 1688772849000,
        "modified": 1688772850572,
        "createdBy": {
            "user": "user-kyclark"
        },
        "properties": {},
        "details": {},
        "media": "text/plain",
        "archivalState": "live",
        "size": 6,
        "cloudAccount": "cloudaccount-dnanexus"
    }
    $ dx describe project-GXZ90x00fF6F4fy1K20x4gv9 | head
    Result 1:
    ID                          project-GXZ90x00fF6F4fy1K20x4gv9
    Class                       project
    Name                        demo_project
    Summary
    Billed to                   org-sos
    Access level                ADMINISTER
    Region                      aws:us-east-1
    Protected                   false
    Restricted                  false
    $ dx mv -h
    usage: dx mv [-h] [--env-help] [-a] source [source ...] destination
    
    Move or rename data objects and/or folders inside a single project.  To copy
    data between different projects, use 'dx cp' instead.
    
    positional arguments:
      source       Objects and/or folder names to move
      destination  Folder into which to move the sources or new pathname (if only
                   one source is provided).  Must be in the same project/container
                   as all source paths.
    
    options:
      -h, --help   show this help message and exit
      --env-help   Display help message for overriding environment
                   variables
      -a, --all    Apply to all results with the same name without
                   prompting
    $ dx ls -l
    Project: demo_project (project-GXZ90x00fF6F4fy1K20x4gv9)
    Folder : /
    data/
    State   Last modified       Size      Name (ID)
    closed  2023-07-10 10:11:31 6 bytes   goodbye.txt (file-GXZB1v80fF6BXJ8p7PvZPy1v)
    $ dx mv file-GXZB1v80fF6BXJ8p7PvZPy1v data/hello.txt
    $ dx tree -l
    .
    └── data
        ├── closed  2023-07-10 10:13:31 6 bytes   hello.txt (file-GXZB1v80fF6BXJ8p7PvZPy1v)
        └── closed  2023-07-07 16:11:56 334.68 KB hs38DH.dict (file-GFz5xf00Bqx2j79G4q4F5jXV)
    $ dx cp hello.txt data/hello_copy.txt
    dxpy.exceptions.DXCLIError: A source path and the destination path resolved
    to the same project or container. Please specify different source and
    destination containers, e.g.
    dx cp source-project:source-id-or-path dest-project:dest-path
    usage: dx find data [-h] [--brief | --verbose] [--json]
                        [--color {off,on,auto}] [--delimiter [DELIMITER]]
                        [--env-help] [--property KEY[=VALUE]] [--tag TAG]
                        [--class {record,file,applet,workflow,database}]
                        [--state {open,closing,closed,any}]
                        [--visibility {hidden,visible,either}] [--name NAME]
                        [--type TYPE] [--link LINK] [--all-projects]
                        [--path PROJECT:FOLDER] [--norecurse]
                        [--created-after CREATED_AFTER]
                        [--created-before CREATED_BEFORE] [--mod-after MOD_AFTER]
                        [--mod-before MOD_BEFORE] [--region REGION]
    
    Finds data objects subject to the given search parameters. By default,
    restricts the search to the current project if set. To search over all
    projects (excluding public projects), use --all-projects (overrides --path and
    --norecurse).
    $ dx find data
    closed  2023-07-10 10:13:31 6 bytes   /data/hello.txt (file-GXZB1v80fF6BXJ8p7PvZPy1v)
    closed  2023-07-07 16:11:56 334.68 KB /data/hs38DH.dict (file-GFz5xf00Bqx2j79G4q4F5jXV)
    $ dx find data --name hs38DH.dict
    closed  2023-07-07 16:11:56 334.68 KB /data/hs38DH.dict (file-GFz5xf00Bqx2j79G4q4F5jXV)
    $ dx find data --name "h*"
    closed  2023-07-10 10:13:31 6 bytes   /data/hello.txt (file-GXZB1v80fF6BXJ8p7PvZPy1v)
    closed  2023-07-07 16:11:56 334.68 KB /data/hs38DH.dict (file-GFz5xf00Bqx2j79G4q4F5jXV)
    $ dx find data --name \*.dict
    closed  2023-07-07 16:11:56 334.68 KB /data/hs38DH.dict (file-GFz5xf00Bqx2j79G4q4F5jXV)
    $ dx find data --name \*.dict --brief
    project-GXZ90x00fF6F4fy1K20x4gv9:file-GFz5xf00Bqx2j79G4q4F5jXV
    $ dx download $(dx find data --name \*.dict --brief)
    [=======================>] Completed 342,714 of 342,714 bytes (100%)
                               /Users/[email protected]/work/academy/hs38DH.dict
    $ dx find data --name \*.dict --json
    [
        {
            "project": "project-GXZ90x00fF6F4fy1K20x4gv9",
            "id": "file-GFz5xf00Bqx2j79G4q4F5jXV",
            "describe": {
                "id": "file-GFz5xf00Bqx2j79G4q4F5jXV",
                "project": "project-GXZ90x00fF6F4fy1K20x4gv9",
                "class": "file",
                "name": "hs38DH.dict",
                "state": "closed",
                "folder": "/data",
                "modified": 1688771516882,
                "size": 342714
            }
        }
    ]
    $ dx find apps --name "sra*"
    x SRA FASTQ Importer (sra_fastq_importer), v4.0.0
    $ dx run sra_fastq_importer -h
    usage: dx run sra_fastq_importer [-iINPUT_NAME=VALUE ...]
    
    App: SRA FASTQ Importer
    
    Version: 4.0.0 (published)
    
    Download SE or PE reads in FASTQ or FASTA format from SRA using SRR accessions
    
    See the app page for more information:
      https://platform.dnanexus.com/app/sra_fastq_importer
    
    Inputs:
      dbGaP Repository key: [-ingc_key=(file)]
            (Optional) Security token required for configuring NCBI SRA toolkit and decryption tools.
    
      SRR Accession: -iaccession=(string)
            Single SRR accession to fetch.
    $ dx run sra_fastq_importer -iaccession=SRR070372
    
    Using input JSON:
    {
        "accession": "SRR070372"
    }
    
    Confirm running the executable with this input [Y/n]: y
    Calling app-G49BFZ093qKvjFYgF8fyv6Z7 with output destination project-GXY0PK0071xJpG156BFyXpJF:/
    
    Job ID: job-GXf8Qg8071xBJJg417YVYJX3
    Watch launched job now? [Y/n] y
    * SRA FASTQ Importer (sra_fastq_importer:main) (done)
      job-GXf8Qg8071xBJJg417YVYJX3
      kyclark 2023-07-10 15:38:21 (runtime 0:02:36)
      Output: single_reads_fastq = [ file-GXf8VgQ09bzK5q1XV5z1gx7j ]
    $ dx ls -l file-GXf8VgQ09bzK5q1XV5z1gx7j
    closed  2023-07-10 15:41:38 206.59 MB SRR070372.fastq.gz (file-GXf8VgQ09bzK5q1XV5z1gx7j)
    $ dx find apps --name fastqc
    x FastQC Reads Quality Control (fastqc), v3.0.3
    usage: dx run fastqc [-iINPUT_NAME=VALUE ...]
    
    App: FastQC Reads Quality Control
    
    Version: 3.0.3 (published)
    
    Generates a QC report on reads data
    
    See the app page for more information:
      https://platform.dnanexus.com/app/fastqc
    
    Inputs:
      Reads: -ireads=(file)
            A file containing the reads to be checked. Accepted formats are
            gzipped-FASTQ and BAM.
    $ dx run fastqc -ireads=file-GXf8P880FjgZGJQqx8Bf30YK -y --watch
    
    Using input JSON:
    {
        "reads": {
            "$dnanexus_link": "file-GXf8P880FjgZGJQqx8Bf30YK"
        }
    }
    
    Calling app-G81jg5j9jP7qxb310vg2xQkX with output destination project-GXY0PK0071xJpG156BFyXpJF:/
    
    Job ID: job-GXf8fJQ071x00P5bQzQ62gjY
    $ cat input.json
    {
        "reads": {
            "$dnanexus_link": "file-GXf8P880FjgZGJQqx8Bf30YK"
        }
    }
    $ dx run fastqc -f input.json -y --brief
    job-GXf930j071xJfYqfJ2kkvk8v
    * FastQC Reads Quality Control (fastqc:main) (done) job-GXf8fgj071x3KV4qyyKGZQVY
      kyclark 2023-07-10 15:51:11 (runtime 0:02:01)
      Output: report_html = file-GXf8gbQ06GxZ38zFXB46XYYj
              stats_txt = file-GXf8gbj06Gxy9F8P66pJG7J3
    $ dx head file-GXf8gbj06Gxy9F8P66pJG7J3
    ##FastQC    0.11.9
    >>Basic Statistics    pass
    #Measure    Value
    Filename    SRR070372.fastq.gz
    File type   Conventional base calls
    Encoding    Sanger / Illumina 1.9
    Total Sequences 498843
    Sequences flagged as poor quality   0
    Sequence length 48-2044
    %GC 39