Welcome to DNAnexus Academy's online guidebook! This resource is designed for educational purposes to provide you with a foundational understanding of how to utilize DNAnexus for performing analyses. Please note that this guide does not aim to instruct you on every aspect of using the platform, nor does it suggest that this is the only method for leveraging DNAnexus solutions. Instead, it serves as an instructional tool with examples designed to help you begin your journey.
Included in this documentation are guides to assist with your projects, including videos and content covering the terms and concepts that we think are important for your understanding. There are also walk-through examples to get you comfortable on the platform.
As-Is Software Disclaimer: The content in this repository is delivered "As-Is". Notwithstanding anything to the contrary, DNAnexus will have no warranty, support, liability or other obligations with respect to materials provided hereunder.
If you are new to the DNAnexus platform and computational biology/ bioinformatics, these sections are recommended for you:
Background Information
General Information
Cloud Computing for Scientists
Overview of the Platform
For Titan Users
For Apollo Users
Welcome to DNAnexus!
Before you go through the information here, there is some background information that we think will be useful for you to have.
Some of the users of the platform have limited coding experience. As bioinformaticians and computational biologists, we are members of a community that wants to help alleviate that stress. On this page, we have attached some helpful links and tutorials that will hopefully make the world of computational biology a bit less intimidating. This is not a partnership or affiliation, but rather a list of what we found useful when we were learning ourselves.
Additionally, users may need resources on the different types of sequencing and their impacts, and we have gathered some here for the ever-evolving field of genetics/genomics. Again, these do not endorse any particular company, lab, or resource, but instead serve as a general guide to help fill in the gaps.
Please note, the data present on this page is synthetic data and is intended for training purposes only. Information about the data present in this documentation is listed here.
When germline variant data is present in your data ingestion for the cohort, the Germline Variants tab will appear in the Cohort Browser. The goal of viewing data within the Germline Variants tab is to view germline mutations in genes or genomic regions of interest.
To filter with phenotypic data, you can filter from the tiles that you added in the “Overview” tab, or through the “+ Add Filter” button in the Cohort Banner. These filters allow for assessing the impact of phenotypic/ clinical data and the creation of cohorts.
1. In the Cohort section, select the "+ Add Filter" button.
2. Search or select your characteristic. Ex: Diagnoses, Tumor Details > Tumor Disease Anatomic Site
If you are a Titan user, these sections are recommended for you:
Any background information that could be necessary is listed in the For HPC or For Scientists pages to get you started there as well.
When choosing an instance type, each question below maps to a part of the instance type name (for example, mem2_ssd1_v2_x16):
Does the software utilize multiple cores? The core count is the x suffix (x16 in mem2_ssd1_v2_x16).
Is the software GPU optimized? GPU instances include gpu in the name (for example, mem2_ssd1_gpu_x32).
How much memory does the software use (per core)? The memory class is the mem prefix (mem2 in mem2_ssd1_v2_x16).
How much disk space is needed for the software (per core)? The storage class is the ssd designation (ssd1 in mem2_ssd1_v2_x16).
Always use version 2 of an instance type! (the v2 in mem2_ssd1_v2_x16)
Each class (like mem1) is scaled so that each core in an instance has access to the same amount of memory/disk space:
Example: mem1_ssd1_v2_x2: 4 GB total memory / 2 cores = 2 GB per core
Example: mem1_ssd1_v2_x8: 16 GB total memory / 8 cores = 2 GB per core
Scale your usage/instance type according to usage statistics and dataset size:
If the job doesn't utilize all of the resources, use a smaller instance type.
If it runs out of memory, or is slow, consider using a larger instance type.
Each stage of a workflow is run by a different set of workers
Each stage can be customized in terms of instance type
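If an app's default does not fit your data, you can request a different instance type at run time; a minimal sketch, where app-example is a placeholder name:
# Run an app on a larger instance than its default (instance type name is only an example)
dx run app-example --instance-type mem1_ssd1_v2_x8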
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.

If you are an HPC user new to the DNAnexus platform, these sections are recommended for you:
Background Information
General Information
For HPC Users
Overview of the Platform
Command Line Interface (CLI)
JSON
For Titan Users
For Apollo Users
If you are an experienced user new to the DNAnexus platform, these sections are recommended for you:
For Titan Users
For Apollo Users
JSON
Docker
If you are an Apollo user, these sections are recommended for you:
Overview of the Platform
Billing Access and Orgs
Command Line Interface (CLI)
Cohort Browser
JupyterLab
Any background information that could be necessary is listed in the For HPC or For Scientists pages to get you started there as well.
In this section, you will build the same applet examples from bash and Python as tasks, and then graduate to building workflows by chaining tasks together.
Workflows are a set of 2 or more apps that are linked together by dependencies, meaning the output of one app/applet is the input to another app/applet. A workflow allows these apps to be run once their dependencies are met without having to submit another job (unless there is an error).
We support the following options for building workflows:
Native (GUI)
WDL
Nextflow
In order to kill a job/workflow/app/applet, you will need to terminate the job/analysis. Please use dx terminate or the Terminate option in the Monitor tab in the UI, as shown below.
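A minimal example from the command line; job-xxxx is a placeholder for the ID shown in the Monitor tab or by dx find jobs:
# Terminate a running job or analysis by its ID
dx terminate job-xxxx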
All the same examples from bash, now in Python.
Within the Germline Variants Tab, there are the following sections: the search bar for genes by gene symbol and genomic ranges, the Allele Frequency Lollipop Plot, and the Allele Table with the germline mutations that are present in the lollipop plot above. The tables and figures of the Germline Variants Tab are highlighted in the figure below:
The first figure that is shown on the tab is the lollipop plot. The x axis is the position of the mutation, and the y axis is the allele frequency. You can search for the genomic range by Gene Symbol, Genomic Range, or rsID. The lollipop plot and allele table will be updated once you search for the new genomic range.
The second figure that is shown on the tab is the Allele Table. The columns available are the location (defined by chromosome and position), rsID, Reference and Alternate nucleotide, Type of Mutation, Consequence, Cohort AF (Allele Frequency), Population AF, and GnomAD AF. You can search for the genomic range by Gene Symbol, Genomic Range, or rsID. The lollipop plot (described above) and allele table will be updated once you search for the new genomic range.
4. Make sure "Is Any of" is selected, click on empty field
5. Select details for the characteristic. Ex: selecting Ovary
6. Your cohort panel will then look like this:
Repeat steps as necessary to filter as needed to create your cohort
In this example, we are going to create 2 different filters: one where the tumor disease anatomic site is the ovary, and another where the site is the breast.
If in the cohort filter we select the tumor disease anatomic site “is ovary” AND tumor disease anatomic site “is breast”, then we have zero patients.
This is seen in the figure below:
Instead, we would need to change this to “OR” by pressing the “AND With” portion of the filter.
Now, we have a filter that has the tumor disease anatomic site as the ovary or the breast.


Please note, the data present in this page is intended for training purposes only. Information about the data present in this documentation is listed here.
The Overview tab is dedicated to the phenotypic data that has been ingested in your dataset. The phenotypic data can be displayed using tiles, and these tiles will have different tables or figures based on their data type.
Open Cohort Browser
Select "+ Add Tile" on the top right corner
Find the characteristic you want as a tile and select "Add Tile"
Repeat until you have added the number of tiles that you want (up to 15)
Used for more advanced comparisons
Add comparisons by selecting the first filter, then selecting the "+" sign for a secondary field
Then, edit the data field details
Here is the overview of the 2D plots that are available based on data types:
Open Cohort Browser
Select Add tile on the top right corner
Find the characteristic you want to start with and select it, such as biological sex. This is the same step as adding a regular tile, but you will NOT select Add Tile.
Instead, add a secondary field by selecting the "+" sign next to the second characteristic you want to view.
You will then have options to change the graph with those parameters.
Then, select the add tile button on the bottom right below the new graph. This will add it to the cohort browser.
Limited to 15 tiles overall in dashboard
Limited to 30 columns in Data Preview
Add 1-2 tiles at a time, wait for them to refresh before adding more tiles.
Billing occurs monthly based on your use of the platform. These invoices are received at the end of the month.
The relationship of DNAnexus and billing are highlighted here:
Regions and Pricing can be referred to as the "Rate Card"
These are negotiated at the time of signing
This is the area of expertise of the DNAnexus Sales Account Director. For further details, please refer to them.
For everyone else, the rate card can be useful for deciding which instances you choose to run on the platform.
Job Errors happen
Some of which are charged to you
Some of which are not
Error details are found in our
Orgs can be used to consolidate and simplify billing.
An org can be associated with a billing account. This allows all users of the org to bill projects and apps to the org billing account.
Billing a project to an org is useful when, for example, users within a group or a particular lab are working with a shared budget, and each member needs the ability to work independently within their own project.
By associating a billing account with an org, this allows groups with a shared budget to consolidate all platform activities onto one invoice.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Your Computer: When we utilize cloud resources, we as users request them from our own computer using commands from the dx toolkit.
DNAnexus platform: The platform has many working pieces, but we can treat it as one entity here. Our request gets sent to the platform, and given availability, it will grant access to a temporary DNAnexus Worker.
DNAnexus Worker: This temporary worker is the third key player and is where we do our computation. We'll see that it starts out as a blank slate.
A project contains the files, executables, and logs associated with analyses, securely stored on the platform.
The executables on the platform are referred to as apps. Apps are executables that can be run on the DNAnexus platform. Most importantly, they need to contain a software environment to run the executable.
A software environment in general is everything needed to run software on a brand new computer. This includes the software itself that you are needing as well as any dependencies that are needed to run the software. Some examples of dependencies are languages (such as R) that are needed to execute the software.
Project storage is permanent, but the workers are temporary. This means that you have to relay information back and forth as shown in the figure below.
The key concept with cloud computing: project storage can be considered as permanent on the platform. Note that workers are temporary. Because workers are temporary, we need to transfer the files we want to process to them. When we are done, we need to transfer any output files back to the project storage. If we don't do this, the files will be lost when we lose access to the worker.
On your local computer, everything is on your machine.
This includes your data and scripts, as well as your software environment and dependencies, which are also downloaded.
The results and intermediate steps are also generated and saved on your machine.
You own it and you control it.
In comparison, cloud computing adds layers into analysis to increase computational power and storage.
This relationship and the layers involved are in the figure below:
Let's contrast this with processing a file on the DNAnexus platform.
The first difference is that we need to request a worker and we only have temporary access to it. We need to bring everything to the worker, including the software environment.
The second key difference is that we need to bring our files and scripts from project storage to the worker.
Our first barrier is requesting an appropriate worker that can do our computational job.
For example, our app may require more memory, or if it is optimized for working on multiple CPUs, more CPUs.
We need to understand how big our files are and the computing requirements of our software to do this.
Our second barrier is installing the software environment on the worker, such as R.
Because we are starting from scratch on a worker, we will need ways to reproducibly install the software environment on the worker.
We'll see that this is one of the roles of Apps. As part of their job, they will install the appropriate software environment.
There is some good news. If we are running apps, they will handle both of these barriers.
Number one, all apps have a default instance type to use. We'll see that we can tailor this.
Secondly, Apps install the required software environment on their workers.
Our third barrier is getting our files onto the worker from project storage, and then doing computations with them on the worker. The last barrier we'll talk about is getting the file outputs we've generated from the worker back into the project storage.
Cloud computing has a nestedness to it and transferring files back and forth can make learning it difficult.
Having a mental model of how cloud computing works can help us overcome these barriers.
Cloud computing is indirect, and you need to think 2 steps ahead.
Here is the visual for thinking about the steps for file management:
Apps help you address installing software on the worker
Prebuilt software environment that is installed onto the temporary worker
Can build our own apps
Apps serve to (at minimum):
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
To filter with gene expression data, you can add a filter based on the tiles created in the Gene Expression tab or use the “+ Add Filter” button in the Cohort Banner.
Assessing impact of genes/ features and their expression levels
Building Cohorts based on Gene Expression Level
Gene Symbol or Ensembl ID with Expression Level
Add in your dataset
Select "+ Add Filter"
Select Assays and then under Gene Expression, select “Features/ Expression”
Select the genes that you want as well as the expression range. Please note, for the Gene/ Feature value, you can select by Gene Symbol or the ENSEMBL ID.
Please note: in order to use Cohort Browser on the Platform, an Apollo License is needed.
The cohort browser is used for browsing and visualizing data and creating cohorts. These cohorts can then be shared in a project space to your collaborators.
Projects have a series of features designed to facilitate collaboration, help project members coordinate and organize their work, and ensure appropriate control over both data and tools.
All work takes place in the context of a project. Projects allow a defined set of users and orgs to:
Access specific data
Please note, the data present is intended for training purposes only. Information about the data present in this documentation is listed here.
When somatic variant data is present in your data ingestion for the cohort, the Somatic Variants tab will appear in the Cohort Browser. The goal of viewing data within the Somatic Variants tab is to view somatic mutations present in your data, and to explore variants and events for certain genomic regions. You can also compare these values within 2 different cohorts, as long as they have the same underlying database.
In order for Nextflow to run correctly on the platform, please do the following:
Install dxpy/dx-toolkit. Details on how to do this are in the Command Line Interface section under Introduction to the CLI.
As Nextflow on DNAnexus is being updated with bugfixes and improvements on a regular basis, we recommend updating dxpy to the latest version prior to building your Nextflow applet.
You can upgrade dxpy by using the following
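pip3 install --upgrade dxpy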
To upload your files, you will need to do the following:
Create a folder with the org name for the portal. It will be org-NAME OF COMMUNITY.
Make sure all of your json files are in the folder
Make sure all of your assets/ images are in the folder.
In this section, we will build several native bash applets that will increase in complexity:
An applet that takes an input file, runs a single Unix command, and returns the result as a file.
An applet that includes a binary executable file in the resources directory.
An applet that installs the dependency cnvkit
An applet that runs samtools.
In order to kill a job/workflow/app/applet, you will need to terminate the job/analysis. Please use dx terminate or the Terminate option in the Monitor tab in the UI.







This is great, but you are limited by how much storage and computational power you have on your local machine.
This is highlighted in the figure below:
We first start out by using the dx run command, requesting to run an app on a file in project storage. This request is then sent to the platform, and an appropriate worker from the pool of workers is made available.
When the worker is available, we can transfer a file from the project to the worker.
The platform handles installing the app and its software environment to the worker as well.
Once our app is ready and our file is set, we can run the computation on the worker.
Any files that we generate must be transferred back into project storage.
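A minimal sketch of this flow from the command line; app-example, its input_file field, and the IDs are placeholders:
# Request a worker and run an app on a file stored in project storage
dx run app-example -iinput_file=file-xxxx
# Follow the job log while the worker runs
dx watch job-xxxx
When the job finishes, the output files declared by the app are placed back into project storage.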
Request a worker (Challenge 1)
Configure the worker's environment (Challenge 2)
Establish data transfer (Challenge 3)
Running apps are covered throughout the rest of the documentation.










Files have to be .json, .png, or .jpg.
Then,
Ensure that you have md5 and jq downloaded
Ensure that you have the manage_community_assets.sh script (this is already provided to you when you have a license for the portal)
Finally,
Run one of the following lines of code
To upload or update the portal assets:
To delete the portal assets:
Remember to clear your browser cache after updating the portal assets.
Please email [email protected] to create a support ticket if there are technical issues.
bash manage_community_assets.sh path/to/org-org_name 2
bash manage_community_assets.sh path/to/org-org_name 1
Here is an image of what a rate card looks like, and what each of the sections means. The details of the rate card are subject to change.
If you cannot access the rate card or are not an org admin, please see Appendix A of your order form.
When a user makes a project billable to an account, the user assigns responsibility for the charges from that project to the account.
The org admins, in this case admins A and D, have the ability to oversee and discover all projects that are billed to the org, and to revoke permissions to a project billed to the org.


Apps and Applets
Workflows
Jobs
Analyses
Records
Each object receives its own unique ID
These can be file IDs or job IDs
These NEVER change
The same file can be uploaded multiple times into a project; different objects will be created. The platform DOES NOT overwrite a file. Instead, it creates a new file ID every time you upload it.
Metadata is essential to keep track of these files and their properties, since we cannot change the file ID.
Data objects can have 2 different items of custom metadata that can be added at any point (see the example after this list). They are:
Tags: words that describe the file format, genome, etc.
Examples: fastq, control, bam, vcf
Properties: key/value pairs that can be used to describe the file
Example: key = sample_id, value = 001
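From the command line, tags and properties can also be added with the dx tag and dx set_properties commands; a minimal sketch, where file-xxxx is a placeholder for a file ID in your project:
# Add a tag to a data object
dx tag file-xxxx fastq
# Add a key/value property to a data object
dx set_properties file-xxxx sample_id=001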
Go to your project folder and find the file information. Identify the columns for the name, type/ class, and tags
You can do this in the overview of each, without selecting the file:
Or, you can do this by selecting each file individually to see a detailed view:
You can sort and organize data based on pre-established and custom metadata by selecting the “column” icon in the top right. Columns can also be sorted by hovering over the title.
You can filter by the metadata present in the project space. The options are drop-down menus above the overview of the metadata headings.
When viewing details of a particular data object, you will have a section for the data operations of a file. These include archive, copy, delete, and download. These operations will vary based on the access permission that you have for a given project. You can see the data operations available in the image below:
Browse, explore, and analyze this data
Once you have access to the platform and an org that allows for billable activity, you can start working by creating a new project in the UI.
Navigate to the Projects list page, by selecting Projects in the UI from the main menu, then clicking All Projects.
Click the New Project button (highlighted in gold).
The New Project wizard will open in a modal window.
In the Project Name field (highlighted in light blue), enter a name for your project.
In the More Info section (highlighted in gold), add in the optional fields for sorting projects, such as
Tags
Properties
Project Summary
Project Description.
In the Billing Section (highlighted in navy), select the billed to org and Region.
In the Billed To field, choose an account to which project billable activities should be charged.
In the Region field, select your region if it is not already selected.
In the Usage Limits section (highlighted in chartreuse), select the optional compute usage limit and the egress limit. Please note, if you do not have this option and would like to, please contact our sales team at [email protected] or a member of our Success Team.
Compute Usage Limits are the monthly compute usage limit for a given project. This value is in USD ($).
Egress Usage Limits are the monthly egress limits for a given project. This value is in bytes.
In the Access section (highlighted in black), specify which users will be able to conduct data-related operations within the project.
Copy Access will limit who can copy data into other projects, or who can use the data as inputs in other projects. The options are All Members or No One.
Delete Access will limit who can delete the project. The options are Contributors and Admins or Admins Only.
Download Access will limit who can download data from the project. The options are All Members or No One.
Apps in a workflow will always begin executing as soon as their inputs are satisfied, and where possible they will run independently.
Workflows can be created by clicking on the Add button and selecting the New Workflow.
This is what it will look like once you select "New Workflow"
Add the apps that you want for the workflow and order them where the dependencies are generated first
After that, add in the necessary requirements. They are featured below:
Select Start Analysis
You will have a "pre-flight" check to make sure everything that is needed is there. Once that is complete, select Start Analysis again and it will start to run.
You will be redirected once you have started the analysis.
The monitor has panels to show what is running, how long it took to complete, and the order they were done in.
You can view the information in order to see the details of the workflow.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Within the Somatic Variants Tab, there are the following sections: the Variant Frequency Matrix for the Cohort, a gene Lollipop Plot with search bar by Gene Symbol or Genomic Range, and the Variants and Events Table with the somatic mutations that are present in the lollipop plot above. The tables and figures of the Somatic Variants Tab are highlighted in the figure below:
A variant frequency matrix has the following features:
Genes are sorted (rows of the plot) in descending order of percent of affected samples.
Samples are sorted (columns of the plot) by the greatest number of mutated genes across all genes, independent of top mutated genes, in descending order.
Each Variant Frequency Matrix has a color scheme by consequence.
These features will also work while comparing cohorts.
You can also hover over the patient tiles individually for more information.
There are several options to view these Somatic Variant Frequency Matrices. You can see an overview of all of the somatic mutations, or a particular mutation type, such as Single Nucleotide Variants and Insertions/ Deletions (SNV and Indel), Structural Variants (SV), Copy Number Variants (CNV), and Fusions.
The first figure gives an overview of the top genes that are mutated in “All” categories, as shown below:
You can select the individual Variant Frequency Matrices in the drop down menu next to the heading “Variant Frequency Matrix”.
The options of the matrices are shown in the figure below. The options are SNV and Indel (top left), SV (top right), CNV (bottom left), and Fusions (bottom right).
A Lollipop Plot has the following features:
Only one gene / canonical protein can be viewed at a time
Each lollipop will be color coded by consequence
You are able to navigate to a particular Gene Symbol or Genomic range utilizing the search bar
You can select (click) a single amino acid change (one lollipop) to quickly filter the somatic variants table
Features also work while comparing cohorts
You can also hover over the patient tiles individually for more information
This is a tabular version of the data that you see in the Lollipop plot. You can quickly filter this data while using the lollipop plot (described above) or by filtering on any of the column headers in the table.
The version of dxpy that you use controls the version of the DNAnexus Nextflow executor and thus the version of Nextflow that is used for executing your pipeline.
Nextflow and dxpy versions
Most nf-core pipelines require Nextflow versions starting with '23', and you will need to use a recent version of dxpy.
For example, a Nextflow applet built using dxpy v0.370.2 will have Nextflow version 23.10.0 bundled with it, and it will use this version of Nextflow and v0.370.2 of dxpy for executing the Nextflow pipeline on the platform.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Exploratory Data Analysis
Gene Symbol or Genomic Range
Variant effect
Variant type
HGVS Notation
Variant IDs
Add in your dataset
Select "+ Add Filter"
Select Assays and then under Variant (Somatic), select “Genes/ Effects”
Select the genes/ impact/ variants that you want. Please note that the Genes/ Genomic Ranges will accept only Gene Symbols or genomic ranges.
It sets up the environment every time you utilize the snapshot. You do not need to manage dependencies every time you open a JupyterLab job if you utilize a snapshot.
Snapshots are saved in the .Notebook_Snapshots/ folder in the project space, and they have a .tar.gz file ending.
Snapshots are used in the input section when setting up the JupyterLab Job.
The input is highlighted in the figure below:
Don't save data in your snapshot - it uses storage space and impacts costs.
Snapshots can be large and take up storage space.
Make sure to rename the snapshot according to your organization's naming conventions so that you can remember what it refers to when returning to the project in the future.
There is both worker related/ JupyterLab storage, as well as what is present in the Project storage. This is annotated in the figure below:
When you are running code blocks, remember that in JupyterLab you can run them out of order. This means that you need to pay attention to the numbers on the side of the code blocks for the order. This is highlighted in gold below:
If you choose to write in Python or R primarily, you can use the following at the top of your code block to "switch" to bash scripting. Example below
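A minimal sketch, assuming the Python (IPython) kernel, where the %%bash cell magic runs the contents of the cell as a bash script:
%%bash
# everything in this cell now runs as bash, for example:
dx ls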








Please note, the data present is intended for training purposes only. Information about the data present in this documentation is listed here.
When gene expression data is present in your data ingestion for the cohort, the Gene Expression tab will appear in the Cohort Browser. The goal of viewing data within the Gene Expression tab is to view gene expression values in your cohort, and to compare between 2 cohorts within the same database.
Within the Gene Expression Tab, there are the following sections: plots for Gene Expression, where you can search for genes by Gene Symbol or Ensembl ID, and an Expression per Feature table. The tables and figures of the Gene Expression Tab are highlighted in the figure below:
To view gene expression for a specific gene, type the gene symbol or Ensembl ID into the search bar for the charts labelled “Expression Level”. There are 3 options for the plots: Expression Level with a box plot, Expression Level with a histogram, and a Feature Correlation scatter plot between 2 genes. More than 3 tiles can be added with the “Add Tile” button, and typing in the Gene Symbol or Ensembl ID.
For the box plot, you can see the distribution of the expression level for a given gene by typing in the gene symbol or Ensembl ID to the search bar. You can view the detailed distribution as a violin plot, or as a box plot. The x axis is the distribution of gene expression levels in the cohort and the y axis is the Expression Level. The options to view the detailed distribution are part of the Chart Settings. The Bar Chart with the Violin Plot (detailed distribution) is shown below:
For the histogram, you can see the distribution of the expression level for a given gene by typing in the gene symbol or Ensembl ID to the search bar. You can see the histogram with or without the display statistics. The x axis is the distribution of gene expression levels in the cohort and the y axis is the Expression Level. The options to view the detailed distribution are part of the Chart Settings. The histogram with the display statistics settings are shown below:
For the feature correlation, you can see the expression level for a given gene for the x and y axis by typing in the gene symbol or Ensembl ID to the search bar. You can see the feature correlation with or without the display statistics. The x axis is the gene expression level for one gene and the y axis is the gene expression level for another gene. The options to view the detailed distribution are part of the Chart Settings. The feature correlation with the display statistics settings are shown below:
DNAnexus apps and applets are ways to package executable code. The biggest difference between apps and applets is their visibility. Apps such as you find in the Tool Library are globally available and maintained by DNAnexus and partners like Nvidia and Sentieon. Applets are private to an organization and exist as data objects in a project. They can be shared across projects and promoted to generally available apps. Native DNAnexus applets are built using dx build to create an executable for bash or Python code, which in turn may execute any program installed on the instance.
Later, we will discuss how to build a workflow, which is a combination of two or more apps/applets. We will build native workflows using the GUI, as well as workflows written in languages like WDL (Workflow Description Language) and Nextflow combined with Docker images.
As shown in following figure, the development cycle is to write code locally, use dx build to create a native applet on the platform, and then dx run to run the applet. You can view the execution logs with dx watch, then make changes to your code to build and run again.
To install the Python modules required for this tutorial, run the following command:
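python3 -m pip install dxpy miniwdl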
You may be prompted to expand PATH with installation directory such as ~/.local/bin:
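PATH=~/.local/bin:$PATH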
Next, ensure you have a recent version of Java. For this tutorial, I'm using the following:
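$ javac -version
javac 18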
If you want to use Cromwell to execute WDL locally, you should download the Cromwell JAR file. This tutorial assumes you will place this file in your home directory using the following commands.
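cd ~
wget https://github.com/broadinstitute/cromwell/releases/download/84/cromwell-84.jar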
I suggest you use the link command (ln) to create a symlink to the filename cromwell.jar so that upgrading in the future will not break your commands:
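ln -s cromwell-84.jar cromwell.jar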
Womtool (Workflow Object Model) is also quite useful, and I suggest you similarly download it and link it to womtool.jar:
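cd ~
wget https://github.com/broadinstitute/cromwell/releases/download/84/womtool-84.jar
ln -s womtool-84.jar womtool.jar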
You will use the DNAnexus dxCompiler to build WDL applications on the platform. Find a link to the latest JAR file under the releases of the dxCompiler GitHub repository. For example, the following commands will download dxCompiler-2.10.5.jar to your home directory and symlink it to dxCompiler.jar:
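cd ~
wget https://github.com/dnanexus/dxCompiler/releases/download/2.10.5/dxCompiler-2.10.5.jar
ln -s dxCompiler-2.10.5.jar dxCompiler.jar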
Some tools may attempt to use the ShellCheck tool to validate any shell code in your WDL. To install it on Ubuntu, run the following:
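sudo apt install shellcheck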
On macOS, you can use Homebrew to install the program:
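brew install shellcheck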
If the dxpy module installed properly, you should be able to run dx on the command line. For instance, run dx all to see the full list of dx subcommands (dx will reject the invalid "all" argument and list the valid choices):
To get started, do the following:
Run dx login to identify yourself to the DNAnexus platform. Enter your username and password. You can also set up a token to log in. Information on setting up tokens can be found in the section of our Documentation.
You may also be prompted to select a project. If not, you should use dx select to select a project that will contain your work.
If you do not see a project you wish to use for your work, run dx new project to create one from the command line, or click "New Project" in the web interface.
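A minimal sketch of creating and selecting a project from the command line (the project name is just an example):
# Create a new project and make it the current project
dx new project "My Research Project" --select
Finally, run dx ssh_config to set up SSH keys for connecting to cloud instances.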
Note that each subcommand will respond to the flags -h|--help to display the usage documentation. For instance, dx new can create several object types, which you can discover by reading the documentation:
You should now be prepared to develop DNAnexus apps and workflows.
Please email to create a support ticket if there are technical issues.
To filter with gene expression data, you can add a filter based on the tiles created in the Gene Expression tab or use the “+ Add Filter” button in the Cohort Banner.
Creating a more complex cohort with Phenotypic and genomic filtering
Phenotype/ Clinical data
Germline Variants
Somatic Variants
Gene Expression Changes
Add in your dataset
Select "Add Filter"
Choose the filtering that you are interested in. Details for Phenotype Filtering, Germline Filtering, Somatic Filtering, and Gene Expression are available in previous sections of the documentation. (These will have links).
Once the initial filter is complete, select “Add Additional Criteria” next to the filter, as shown below:
Repeat the process for the next cohort filter that you need.
You can start a TTYD job the same way as you would any other job in the UI.
Select Start Analysis in the top right corner in the project space.
Select the app called “ttyd”.
Select Next and then Start Analysis.
As the last step before launching the tool, you can review and confirm various runtime settings. Click on Launch Analysis. The job will be launched and you will be redirected to the Monitor tab in a few seconds.
In the Monitor tab, select the name of the ttyd job to view more details.
Once the state of the job switches to “Running”, you will be able to enter the ttyd with the “Open Worker URL” link in the top heading of the details page. If the page to which you get redirected says “502 Bad Gateway”, the worker is not yet fully initialized. Close the page, give it a few more minutes and try to open the worker URL again.
This will open a terminal in your browser that will give you access to the files in the DNAnexus project in which the app is running by mounting it in a read-only mode in the /mnt/project directory of the worker execution environment.
Once you are done with your work in ttyd, don’t forget to terminate the job by clicking the red button Terminate in the job’s details page.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select “Contact Support”
Fill in the Subject and Message to submit a support ticket.
TTYD vs. Cloud Workstation
Purpose: TTYD provides terminal access in your web browser; the Cloud Workstation sets up a virtual workstation that lets you access and work with data stored on the DNAnexus Platform.
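Time Limits: a TTYD job has to be manually terminated and does not have an input for a time limit; the Cloud Workstation takes a time limit as an input.
Snapshots: TTYD has none; the Cloud Workstation can save snapshots.
SSH: TTYD does not need SSH access; the Cloud Workstation does need SSH access.
Common Uses: TTYD is used for CLI operations and to launch https apps within the web browser; the Cloud Workstation is used for analysis of platform data and testing applets, since the environment is what is opened when launching an app or applet.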
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select “Contact Support”
Fill in the Subject and Message to submit a support ticket.
To filter with germline data, use the “+ Add Filter” button in the Cohort Banner.
Assessing impact of ingested variants in cohorts
Note: only non-ref variants are represented in the genomic data
Building Cohorts based on Variants
Develop basket studies based on your population
Exploratory Data Analysis before GWAS
Ask questions about co-occurrence with other mutations
Gene Symbol or region
Variant effect
Variant type
Variant ID
Add in your dataset
Select "+ Add Filter"
Select Assays and then under Genome Sequencing, select “Genes/ Effects”
Then, select the genes/ impact/ variants that you want. Please note, the filtering for the Genes/ Genomic regions is by Gene Symbol or Genomic range.
Disclaimer: Portals require a license. These documents are to get you started with your portals. By no means is this the only way to make your portal, nor is this the only way to edit a json file.
Each section of a portal has a different json file.
Here is a visual of which json file edits which section of a portal:
This section defines the following:
navigation/ header bar
items that are in the header after the logo that are also not included in the branding.json
You can also add/ delete navigation items
This section defines the following:
logo
colors
if you want a login page
a home URL attached to the logo
This controls the home page for the community portal
You can specify the following:
order of the sections
components
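The components can include:
descriptions
text
tables
images
reference material/links (shown above)
links to DNAnexus projects (shown above)
featured tools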
Please email to create a support ticket if there are technical issues.
JavaScript Object Notation
Common format for communicating with Application Program Interface (API)
Used to access DNAnexus API servers
Reading and modifying JSON is at the heart of building and running apps
Understanding JSON responses from the API will help you debug jobs
Automation and Batch submissions: running the same app on multiple files
Find which jobs have failed and why
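Run the failed jobs again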
A valid JSON document is enclosed in one of two data structures, either a list of values contained in square brackets:
Or an object composed of key/value pairs contained in curly brackets:
Example:
A JSON value may be any of the following:
double-quoted string, e.g., "samtools" or "file-G4x7GX80VBzQy64k4jzgjqgY"
integer, e.g. 19 or -4
float, e.g., 3.14 or 6.67384e-11
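boolean, e.g., true or false
null
object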
Lists are braced in square brackets [ ]
Similar to Python syntax
Used for multiple values separated by commas
Example:
An object starts and ends with curly braces
An object contains key/value pairs
Keys must be quoted strings
Values may be any JSON value, including another object
Example:
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
You also do not need to define an executor like you might for some other cloud Nextflow setups. By default, the executor is 'local'. However, if you are, for instance, going to be running Nextflow in multiple locations and want different settings based on location, you could set a DNAnexus profile in your nextflow.config which explicitly defines the executor and things like the default queueSize.
Here is an example DNAnexus executor profile which also enables docker.
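profiles {
    dnanexus {
        executor {
            name = 'local'
            queueSize = 50
        }
        docker {
            enabled = true
        }
    }
    cluster {
        executor {
            name = 'sge'
            memory = '20GB'
        }
    }
}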
When running on DNAnexus, you would then give '-profile dnanexus' to 'nextflow_run_opts' in the UI; in the CLI it would be -inextflow_run_opts='-profile dnanexus'.
You could also create a test profile for testing on your own servers/cloud workstation and on DNAnexus.
If the pipeline contains inputs from external sources (such as S3, FTP, HTTPS), those files are staged on the head node and may run out of storage space (input sources from DNAnexus are not staged in this way).
The instance size of the head node can be customized: in "Applet Settings" in the UI, or with the --instance-type flag on the CLI.
20 sessions can be cached per project
The number of times any of those sessions can be resumed is unlimited
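Sessions can be deleted to allow more, or development/running can be migrated to another project, which will have its own 20-session limit.
Private S3 buckets can be referenced by adding an AWS scope to your configs: https://www.nextflow.io/docs/latest/amazons3.html?#aws-access-and-secret-keys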
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Some of the links on these pages will take the user to pages that are maintained by third parties. The accuracy and IP rights of the information on these third-party pages are the responsibility of those third parties.
You can collaborate on the DNAnexus Platform by giving project access to other users. Project access can be revoked at any time by a project administrator.
Once you've created a project, you can add members by doing the following:
From the project's Manage screen, click the Share Project button - the "two people" icon - in the top right corner of the project page.
This is a walk-through of how to add an existing Docker image to the platform and save it as a snapshot file on the platform.
To get started with this, you will either need to 1) open a ttyd or 2) have Docker installed and use your local terminal, with the dx-toolkit installed as well.
Disclaimer: Portals require a license. These documents are to get you started with your portals. By no means is this the only way to make your portal, nor is this the only way to edit a json file.
This .json file personalizes the banner that you use to navigate to different sections.
If you also have access to the ML JupyterLab (another solution in the AI/ML Accelerator Package), Data Profiler can be seamlessly opened in the JupyterLab environment, offering an intuitive and interactive tool for profiling multiple datasets directly within one workspace.
To get started, simply open an ML JupyterLab notebook, load the dataset, and profile it.
The integrated version of Data Profiler in ML JupyterLab (dxprofiler) offers four methods for loading your datasets to profile the data:






















To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Some of the links on these pages will take the user to pages that are maintained by third parties. The accuracy and IP rights of the information on these third-party pages are the responsibility of those third parties.

To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.

Loading the dataset by a list of .csv or .parquet files.
Loading the dataset by Pandas dataframes ('patient_df' and 'clinical_df')
Loading the dataset by a record object (DNAnexus Dataset or Cohort). "project-xxxx:record-yyyy" is the ID of your Apollo Dataset (or Cohort) on the DNAnexus platform.
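import dxprofiler

# Load from an explicit list of .csv or .parquet files
dataset = dxprofiler.profile_files(path_to_csv_or_parquet=['/path/to/table1.csv', '/path/to/table2.csv'], data_dictionary=None)

# Load from a directory containing the tables
dataset = dxprofiler.profile_files(path_to_csv_or_parquet='/path/to/tables/', data_dictionary=None)

# Load from Pandas dataframes
dataset = dxprofiler.profile_dfs(dataframes={'patient_df': patient, 'clinical_df': clinical}, data_dictionary=None)

# Load from a DNAnexus Dataset or Cohort record
dataset = dxprofiler.profile_cohort_record(record_id="project-xxxx:record-yyyy")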
Once you finish profiling the dataset, here is the command to open the Data Profiler GUI:
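dataset.visualize()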
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
$ dx all
usage: dx [-h] [--version] command ...
DNAnexus Command-Line Client, API v1.0.0, client v0.320.0
dx is a command-line client for interacting with the DNAnexus platform. You
can log in, navigate, upload, organize and share your data, launch analyses,
and more. For a quick tour of what the tool can do, see
https://documentation.dnanexus.com/getting-started/tutorials/cli-quickstart#quickstart-for-cli
For a breakdown of dx commands by category, run "dx help".
dx exits with exit code 3 if invalid input is provided or an invalid operation
is requested, and exit code 1 if an internal error is encountered. The latter
usually indicate bugs in dx; please report them at
https://github.com/dnanexus/dx-toolkit/issues
optional arguments:
-h, --help show this help message and exit
--env-help Display help message for overriding environment
variables
--version show program's version number and exit
dx: error: argument command: invalid choice: all
(choose from login, logout, exit, whoami, env, setenv, clearenv, invite,
uninvite, ls, tree, pwd, select, cd, cp, mv, mkdir, rmdir, rm, describe,
upload, download, make_download_url, cat, head, build, build_asset, add, list,
remove, update, install, uninstall, run, watch, ssh_config, ssh, terminate,
rmproject, new, get_details, set_details, set_visibility, add_types,
remove_types, tag, untag, rename, set_properties, unset_properties, close,
wait, get, find, api, upgrade, generate_batch_inputs,
publish, archive, unarchive, help)
$ dx new -h
usage: dx new [-h] class ...
Use this command with one of the available subcommands (classes) to create a
new project or data object from scratch. Not all data types are supported. See
'dx upload' for files and 'dx build' for applets.
positional arguments:
class
user Create a new user account
org Create new non-billable org
project Create a new project
record Create a new record
workflow Create a new workflow
optional arguments:
-h, --help show this help message and exit
[
{
"project": "project-Gg2QQx002Q7yY4kFQF7GKYPV",
"id": "applet-G1951vj0YyjJjbvGJ9FZB967",
"describe": {
"id": "applet-G1951vj0YyjJjbvGJ9FZB967",
"project": "project-Gg2QQx002Q7yY4kFQF7GKYPV"
}
},
{
"project": "project-Gg2QQx002Q7yY4kFQF7GKYPV",
"id": "file-GGy7Pbj0Xf47XZk125k22g9v",
"describe": {
"id": "file-GGy7Pbj0Xf47XZk125k22g9v",
"project": "project-Gg2QQx002Q7yY4kFQF7GKYPV"
}
}
]
{
"report_html": {
"dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY"
},
"stats_txt": {
"dnanexus_link": "file-G4x7GXQ0VBzZxFxz4fqV120B"
}
}
{
"dnanexus-link": [
"file-G4x7GXQ0VBzZxFxz4fqV120B", "file-G4x7GX80VBzQy64k4jzgjqgY"
]
}
{
"report_html": {
"dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY"
}
}
Type the username or the email address of an existing Platform user, or the ID of an org whose members you want to add to the project.
In the Access pulldown, choose the type of access the user or org will have to the project.
If you don't want the user to receive an email notification on being added to the project, click the Email Notification to "Off."
Click the Add User button.
Repeat Steps 2-5, for each user you want to add to the project.
Click Done when you're finished adding members.
To remove a user or org from a project to which you have ADMINISTER access:
1. On the project's Manage screen, click the Share Project button - the "two people" icon - in the top right corner of the page. A modal window will open, showing a list of project members.
2. Find the row showing the user you want to remove from the project.
3. Move your mouse over that row, then click the Remove from Members button at the right end of the row.
VIEW: Allows users to browse and visualize data stored in the project, download data to a local computer, and copy data to other projects.
UPLOAD: Gives users VIEW access, plus the ability to create new folders and data objects, modify the metadata of open data objects, and close data objects.
CONTRIBUTE: Gives users UPLOAD access, plus the ability to run executions directly in the project.
ADMINISTER: Gives users CONTRIBUTE access, plus the power to change project permissions and policies, including giving other users access, revoking access, transferring project ownership, and deleting the project.
Spark JupyterLab is ideal for extracting and interacting with the dataset or cohort.
Spark JupyterLab is NOT meant for downstream analysis.
Create a DX JupyterLab Notebook so that it will automatically save onto the Trusted Research Environment. You can do so by selecting these 2 different options:
Option 1 is from the Launcher:
b. Option 2 is from the DNAnexus Tab:
Start writing your JupyterLab Notebook. Select which kernel you are going to use (options will vary depending on the Image you selected in set up).
Download packages and save the software environment as a snapshot.
Download Packages
Save the Snapshot of the environment
Start writing your code.
Import Packages using import (at minimum, you will need dx data and pyspark)
b. Load the dataset with dx extract dataset
c. Initialize Spark
d. Retrieve data and cohorts that you are interested in
e. Upload Results back to Project Space
Save your DX Jupyterlab Notebook
Notebooks can also be directly opened from project storage
When you save in JupyterLab, the notebook gets uploaded to the platform as a new file. This goes back to the concept of immutability.
Old version of notebook goes into .Notebook_archive/ folder in project.
You have to pull the Docker image from the registry to the platform. For this example, the code is
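For illustration, assume a GATK image from a public registry; substitute your own image name and tag:
# Pull the image from the registry onto the machine you are working on (ttyd worker or local terminal)
docker pull broadinstitute/gatk:4.2.6.1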
That results in this view:
Notice that you will have extract and then pull complete on each of the "layers" of the image on the left hand side. This takes a few minutes depending on the size of the docker image
Now you have to save this docker image file. For this example, the code is
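Continuing the illustration above, docker save writes the pulled image to a tarball; the gatk.tar.gz name is just an example:
# Save the image to a gzipped tar archive
docker save broadinstitute/gatk:4.2.6.1 | gzip > gatk.tar.gz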
This again takes time depending on the size of your docker file.
Now you will need to upload this image back to the platform. For this example, the code is:
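A minimal sketch, assuming the dx toolkit is installed and a project is selected:
# Upload the image snapshot into the current DNAnexus project
dx upload gatk.tar.gz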
The last 2 steps have the following output:
It should then be in the project space that you have chosen. You can also check this in the GUI.
Example:
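dx run app-swiss-army-knife -iimage_file="gatk.tar.gz" -iin="data/mock.vcf" -icmd='gatk SelectVariants -V mock.vcf -O selected.snp.vcf --select-type-to-include SNP'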
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.

Make it easy to run batch jobs on multiple instances
Reproducibility - be able to run code and generate the same outputs given a set of input files
Tie all software to specific versions
Utilize Docker images with multiple bioinformatics software installed
Examples: Rocker Project, GATK4
Docker Registries
Collection of repositories that hold container images
docker pull: pulls an image from a registry down to the machine we are working on
docker commit: saves changes made in a running container as a new image (which can then be pushed back to a registry)
There are hard limits for using Docker Images.
DockerHub and other registries have a pull limit of 200 pulls/user/day
Saving a snapshot file to your project lets you scale without these limits
Especially helpful in batch processing
Use images from trusted vendors whenever possible
Examples: Official Ubuntu Image, Amazon Linux Image, Biocontainers
Avoid "kitchen sink" images - hard to manage vulnerabilities
In general: pay attention to possible vulnerabilities and whether they affect your containers
Use dockerfiles to uninstall/patch possible vulnerabilities in images
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Examples: projects in the project tab, different tools you want immediate access to in the tool section.
If you have questions about how to use a json file, please view this section
The navigation.json file for this example is blank. These are the default items in the header
This file must be at least accessible to community members.
This file is optional. It allows you to edit the feature list of projects. _projects, _tools, and supportURL are all optional.
They can be:
null, which will remove the item from the header
an array of objects, with:
text for the text of the new menu item
url as the destination
newTab for whether the link should open in a new tab
If there is another entry, it indicates that a new navigation item needs to be added.
They can be objects with a url and optional parameters; with this method, newTab makes the item a link in the navigation.
They can also be an array of objects with text, url, and newTab (which will give it a dropdown menu with listed items).
Please email [email protected] to create a support ticket if there are technical issues.











VIEW (most restrictive): view the project, move and copy data across projects.
UPLOAD: VIEW access, plus create folders and modify metadata.
CONTRIBUTE: UPLOAD access, plus run executions.
ADMINISTER: CONTRIBUTE access, plus change permissions for users, project ownership, and deletion.
is used to represent a group of users
Can be used to simplify the sharing of projects, apps, and billing
Have members and admins
control the access to
Allows access to the shared apps
This applies to the apps for which the org is an authorized user
If the org cannot use the app, the member cannot either
Allow seeing the price column in the UI monitor tab and on the command line
By default, when a project is created, the settings tab shows the following:
The owner of the project can change these
You may want to restrict them depending on your org policy
Copy access
The org allows for the sharing
of the same resources
Control the access as stated above
Org admins can remove and add users
to users performing similar functions
Sharing projects and apps within orgs allows a group of users performing similar functions to be given the same level of access to shared resources.
In this example, there is the org administrator, admin A, who provides view access to the project resources to the org. Additionally, admin A adds users B and C to the org, and also adds admin D to the org.
Admin D then provides upload permissions to the project raw data, and makes the org an authorized user of the QC app. So in addition to being a convenient way to share projects and data, the org also provides access to apps.
You can have multiple people in multiple orgs.
Example 2: Multiple Orgs
Members may be working on two separate projects and need access to different data and apps that have different budgets.
A user may need to create and work on projects that are billed to two separate teams or groups. This is where creating multiple orgs comes in handy.
Admin D is admin of both org and org-new because admin D needs to work within both of these orgs.
Admin D adds user E to both org and org-new, and only adds user F to org-new because user F only needs to work within org-new.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Before you begin, review the overview documentation and log onto the DNAnexus Platform
The tool library is a set of ready-to-use apps and workflows that are maintained by DNAnexus.
There are different categories and you can search by name of the tool.
Navigate to the Tool Library
In the Any Name search box, start entering "FASTQC...."
Click on the tool name, and you will be at the info tab of the tool.
Select the Version: If you want the same version that is loaded automatically, this is all that you will need to do. If you want a different Version, select the Versions tab and select which version you want.
You can also select "Run" to run the app
There are 2 options for running the tool. First, select "Run" where you find the tool documentation.
Then, there are 2 different UIs for setting up the app to run:
The guided set up, which is what you normally start with
Or, the I/O graph
In the Stage settings tab, you can set the version of the app you want to use, instance type and specify the output folder. By specifying the instance type, you will set the computational resources of the machine on which the analysis will be run. For example, if your input data is large, you will choose an instance type with more storage space available.
Required inputs are indicated by asterisks; the rest are optional.
It is point and click.
Can select your instance here.
can be enabled here. At this time, the feature applies to a batch of inputs. The output is aggregated in one output file. (e.g. 10 inputs results in 1 output).
Once you have selected the app you want to use and read the documentation (if applicable), you will use the guided setup to run the app in the UI.
Set the Output folder
Set the inputs. In the example of FASTQC, it is one FASTQ file
Launch the app using the start analysis button in the upper right
You will automatically be redirected to the monitor page
When the job is completed, you will have buttons to access the inputs (such as a FASTQ file) and outputs (such as an HTML file).
Here is the view when the app is completed:
Using Apps in the GUI
Batch Processing in the GUI
Monitoring An App/ Workflow
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
If you have never used a JupyterLab notebook before, please view this information:
We can interact with the platform in several different ways and install software packages in these different environments, depending on what we want to use and how we want to use it. As shown in the diagram below, we will be explaining JupyterLab Python/R/Stata and Spark JupyterLab Python/R:
Data Scientists’ tasks can be interactive. Options for interactive analysis in JupyterLab are:
Notebook-based Analysis
Exploratory Data Analysis (EDA)
Data Preprocessing/ Cleaning
Implementing New Machine Learning(ML)/ Model
The work can be done on a single machine instance
Main Use Cases:
Python/R
Image Processing
Working with very large datasets that will not fit in memory on a single instance
Using the Cohort Browser and querying a large ingested dataset
Needing to use Spark based tools such as dxdata, HAIL or GLOW
Select JupyterLab with Python, R, Stata, ML, Image Processing or JupyterLab from Spark from the Tool Library, or select “Start Analysis” from the project space and select JupyterLab from the tool list. Once selected, press “Run Selected”
Select the output location, and change the job name if desired.
Then, select the inputs you intend on using
Snapshot file (not required, and how to create a snapshot is in the Utilizing Snapshot section)
Input files (not required, can do in the notebook analysis)
Stata settings file (license required for Stata)
Then, press “Start Analysis” in the far right corner
Next, confirm the following parameters:
Job Name
Output Folder
Priority (defaults to normal, can be set to high)
Then, press “Launch Analysis”
When redirected to the monitor tab, select the job name
It will redirect you to the details of the JupyterLab job. Wait for the job to start running, and for the worker URL to appear
Press “Open Worker URL” and the JupyterLab home page will appear
Note: Sometimes, the job is still initializing, so if you press Open Worker URL immediately, it may show a 502 error message. This is okay, and the job will update when the job is finished initializing.
Running instances may take several minutes to load as the allocations become available.
Nextflow's errorStrategy directive allows you to define how the error condition is managed by the Nextflow executor at the process level.
There are 4 possible strategies:
terminate (default)
terminate all subjobs as soon as any subjob has an error
finish
when any subjob has an error, do not start any additional subjobs and wait for existing jobs to finish before exiting
ignore
pretend you didn't see it... just report a message that the subjob had an error, but continue all other subjobs
retry
when a subjob returns an error, retry (resubmit) that subjob
The DNAnexus Nextflow documentation has a
Generally the errorStrategy is defined in either the base.config (which is referenced using includeConfig in the nextflow.config file) or in the nextflow.config file.
In nf-core pipelines, the default errorStrategy is usually defined in base.config and is set to 'finish', except for error codes in a specific numeric range, which are retried.
The code below is from the
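```
// memory errors which should be retried. otherwise error out
errorStrategy = { task.exitStatus in ((130..145) + 104) ? 'retry' : 'finish' }
maxRetries = 1
maxErrors = '-1'
```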
The maxRetries directive defines the maximum number of times the exact same subjob can be re-submitted in case of failure, and the maxErrors directive specifies the maximum number of times a process (across all subjobs executed for that process) can fail when using the retry error strategy.
In the code above, if the exit status of the subjob (task) is within 130 to 145, inclusive, or is equal to 104, then that subjob will be retried once (maxRetries = 1). If other subjobs of the same process hit the same issue, they will also be retried once (maxErrors = '-1' disables the limit on how many times a process can fail, so even if every subjob of a particular process failed, each would still be retried the number of times set by maxRetries). Otherwise, the finish errorStrategy is applied: no new subjobs are started, but other running, non-errored subjobs are allowed to complete.
For example, imagine you have a fastqc process that takes in one file at a time from a channel with 3 files (file_A, file_B, file_C)
The process is as below and is run for each file in parallel
fastqc(file_A)
fastqc(file_B)
fastqc(file_C)
If the subjob with file_A and the subjob with file_C fail first with errors in range 130-145 or with a 104 error, they can each be retried once if maxRetries =1 .
Now imagine that you set maxErrors = 2. In this case, there are 3 instances of the process but only 2 errors are allowed for all instances of the process. Thus, it will only retry 2 of the subjobs e.g. fastqc(file_A) fastqc(file_C)
If fastqc(file_B) encounters an error at any point, it won't be retried and then the whole job will go to the finish errorStrategy.
Thus, disabling the maxErrors directive by setting it to '-1' allows all failing subjobs with the specified error codes to be retried X times, with X set by maxRetries.
Check what version of dxpy was used to build the Nextflow pipeline and make sure it is the newest
Look at the head-node log (hopefully it was run with "debug mode" set to false, because when true, the log gets injected with details that aren't always useful and can make it hard to find errors)
Look for the process (sub-job) which caused the error, there will be a record of the error log from that process, though it may be truncated
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Some of the links on these pages will take the user to pages that are maintained by third parties. The accuracy and IP rights of the information on these third-party pages are the responsibility of those third parties.
Please note: in order to use Cohort Browser on the Platform, an Apollo License is needed.
Cohort combine logic allows you to combine existing cohorts with Boolean Logic operations
Here is a summary of the functions for Cohort Combine
All cohorts must be from the same dataset.
All cohorts must be saved before being combined.
A cohort that is a result of combine cannot be combined a second time.
Cohorts from different projects can be combined if they use the same underlying database.
Add your cohorts into the cohort browser by selecting “Load Saved Cohort” if the cohort has already been created and saved into the project, or “New Cohort” if a new cohort needs to be created. You can select up to 10 cohorts to load into the side menu.
Pick your cohort and add it to the browser. It will look like this.
At the bottom of the cohort tab, select "Combine Cohorts"
You will then have the following screen to combine. Pick your cohort combine logic, then select combine
Important note: the order of the cohorts matters in this.
Important notes:
Cohort must be saved before creating its complement (same rule as previous)
A combined cohort (Intersection, Union, Subtraction, Unique) can be used to create a complement.
A cohort created as a complement cannot be further used for combine / complement.
Applet
App
Purpose
Early development, experiment with analyses
Applet is stable, ready to use and possibly moved to a wider audience
When publishing an app, the following items are needed:
A working applet that you have tested
A name that is unique. Generally, the recommendation is to have an abbreviation for your org as part of the name. Example: If the org is named “academy_demos” and the app is for fastqc, then the name of the app could be “academy-fastqc”, “academy_demo-fastqc”, or “academydemo-fastqc”.
Documentation to add to a README.md for users to understand what your app does
Developer notes for you to keep track of version information and added to the Developer README.md
Use dx get applet-name to have the most recent version of your applet
Make your changes to the dxapp.json
Then use dx build app_name --publish --app
Forget to add users or need to add more users? Use:
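```bash
# Add authorized users (user-username) or orgs (org-orgname) to your app;
# placeholders here are illustrative -- substitute your app and entity names
dx add users app-name user-username org-orgname
```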
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select “Contact Support”
Fill in the Subject and Message to submit a support ticket.
Before you begin, set up a DNAnexus Platform account here: https://platform.dnanexus.com/login
There are several ways to interact with the platform. All of these will be covered in future lessons/ courses/ documentation.
We are going to be focused on the user interface (highlighted in green), also known as UI.
This information can also be found in the documentation for the Platform.
First, what is a project?
It is a collaborative workspace
The smallest unit of sharing on the platform
A place to store objects that are made on the platform
Examples of these objects can be files, applets, and workflows
The user folder is the storage area for your output files
You can add more folders into your user folder for organization (maybe one for data, each project, etc. This is however you and your organization/ company wants to do this)
Data can be in one of 3 states
Open: initial, empty state, awaits upload
Closing: uploading, not instantaneous
Closed: Finalization completed, available for next steps
Log into the DNAnexus Platform.
When you login, you will see a list of projects that you are a part of.
Navigating to a project
We have prebuilt projects for you
Copying means from one project to another project
You cannot copy within the same project because of the file ID.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
R needs to run in a regular (non-Spark) notebook and is best used for downstream analysis.
If you are directly interacting with the database/dataset, it is recommended that you either 1) use Python and/or 2) use Spark for extracting the data that is relevant for the downstream analysis.
Create a DX JupyterLab Notebook so that it will automatically save onto the Trusted Research Environment. You can do so by selecting these 2 different options:
Option 1 is from the Launcher:
b. Option 2 is from the DNAnexus Tab:
Start writing your JupyterLab Notebook. Select which kernel you are going to use (options will vary depending on the Image you selected in set up).
Download packages and save the software environment as a snapshot
Download Packages
b. Save the Snapshot of the environment
Notebooks can also be directly opened from project storage
When you save in JupyterLab, the notebook gets uploaded to the platform as a new file. This goes back to the concept of immutability.
The old version of notebook goes into .Notebook_archive/ folder in project.
A license is required to access the Data Profiler on the DNAnexus Platform. For more information, please contact DNAnexus Sales (via [email protected]).
The data used in this section of Academy documentation can be found here to download: https://synthea.mitre.org/downloads
The citation for this synthetic dataset is:
Walonoski J, Klaus S, Granger E, Hall D, Gregorowicz A, Neyarapally G, Watson A, Eastman J. Synthea™ Novel coronavirus (COVID-19) model and synthetic data set. Intelligence-Based Medicine. 2020 Nov;1:100007. https://doi.org/10.1016/j.ibmed.2020.100007
PygWalker shows a sample of the dataset in a table format
PygWalker simplifies data analysis and visualization by transforming pandas dataframes into an interactive interface for easy exploration. It is available within the table-level view of the application. To use it, simply click the Go to Explorer Mode button to access the raw data slide. You can learn more about its features by referring to the documentation or watching demo videos.*
A custom plot created with PygWalker
*DNAnexus is not responsible for the accuracy or updating of any 3rd party content or applications*
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
If your nextflow run fails, the nextflow job log is written to your project Output location (CLI flag --destination) that you set for the applet at runtime.
However, on failure, your results files in params.outdir are not written to the project, unless you are using the 'ignore' error strategy.
To guard against long-running or expensive (or both!) runs that produce no output when they fail, you need to think carefully about what should happen when your job fails and whether you need the ability to resume it. Resuming means that successfully completed processes won't be run again, saving you the cost and time of re-running them.
Nextflow has a resume feature to enable runs that fail to be resumed again which
A license is required to access the Data Profiler on the DNAnexus Platform. For more information, please contact DNAnexus Sales (via ).
The data used in this section of Academy documentation can be found here to download:
The citation for this synthetic dataset is:
Walonoski J, Klaus S, Granger E, Hall D, Gregorowicz A, Neyarapally G, Watson A, Eastman J. Synthea™ Novel coronavirus (COVID-19) model and synthetic data set. Intelligence-Based Medicine. 2020 Nov;1:100007.
pip install ___ #python

docker pull broadinstitute/gatk

docker save broadinstitute/gatk -o gatk.tar.gz

dx upload gatk.tar.gz

{
}

{
"_projects": null, #deletes the current list of projects
"_tools": [
{"text": "Custom Menu Item", "url": "http://example.com"}, #creating a new item within tools
{"text": "Opens in New Tab", "url": "http://example.com", "newTab": true} #creating a new tab in tools
],
"_help": null, #removes help
"A New Menu": [
{"text": "New Menu Item", "url": "http://example.com"}, #new menu
],
"A New Link": {"url": "http://example.com", "newTab": true} #new link
}

Stata
Update the Duration if desired
Add Commands to run in the JupyterLab environment (optional)
Finally, update the Feature. For a full list of packages in each feature, please look in the Preinstalled Packages List. The options are
Python_R
ML
IMAGE_PROCESSING
STATA
MONAI_ML
Spending Limit (optional)
Instance Type (change the default value if needed)







Cohort combine operations result in very complex queries.
Beware of performance delays and timeouts as query gets more complex.
Use extra caution when:
Combining cohorts with genomic filters
Combining cohorts with complicated filters
Combining cohorts based on very large datasets
















billable activities
shared apps
shared projects
are either allowed or not allowed to access
billable activities
shared apps
shared projects
is a single user on the platform
can be an org admin or an org member
They can also be added to a project without being a member of an org, but they will not see pricing or have access to org-specific options unless they are part of the org itself.
holds one of 4 types of permissions to a project
could be to limit how the data is handled
Can be changed from all members to no one
Delete Access
Limit how the data is handled
Can be changed from Contributors and Admins to Admins only
Download Access
Limit who can see the data (this would allow accessing the data outside of the platform)
Can be changed from all members to no one
Org admins can define projects and project access
Introduce apps and app access
Looking at the permissions associated with each of these users: admin A and users B and C have access only to org, whereas admin D and user E have access to both org and org-new, and user F has access only to org-new.
Orgs are flexible tools used to represent groups of users. They can be used to simplify resource sharing, consolidate billing, and associate platform work with real-world billing structures.

















Look at the raw code
Look at the cached work directories
.command.run runs to setup the runtime environment
Including staging file
Setting up Docker
.command.sh is the translated script block of the process
Translated because input channels are rendered as actual locations
.command.log, .command.out etc are all logs
Look at logs with "debug mode" as true
when a subprocess returns an error, retry the process
None is present at the applet creation
Each time the app is built, it must be given a new version.
A default spending account set up for yourself as the app author. For published apps, they will require storage for their resources and assets, and the storage will be billed on a monthly basis to the billing account of the original author of each app. You can set multiple authors, but the original author is where the billing is tied to.
Decide if you want the app to be open source. In dxapp.json, add a key called "openSource" with a boolean value of true or false.
A consistent version policy for your meaningful updates. DNAnexus suggests Semantic Versioning.
Add authorized users. In dxapp.json, add a key called "authorizedUsers" with a value being an array of strings, corresponding to the entities allowed to run the app. Entities are encoded as user-username for single users, or org-orgname for organizations.
Perks of Each
Easy to collaborate, members of the project can edit the code, and publish
Once published, the app cannot be modified (version control is enforced), and apps can carry assets in their own private container.
Goal
Add an executable into an application for more efficient use, while keeping the ability to edit the code easily
Add an executable into an application for more efficient use, while enhancing reproducibility and minimizing risk
Applets
Apps
Location
in projects
in the Tool Library, if you are the developer or an authorized user
Naming Structure
project:/applet_ID
project:/folder/name
app-name
Can they be shared?
Through projects, as a data object
App developer manages a list of users authorized to access the app
Updating
Deleting the previous applet with the same name, and creating a new one
New version per release
Versioning
Load Packages
b. Download or Access data files to the JupyterLab environment
c. Import the data
d. Then, perform the analysis for your data
e. Upload results back to Project Space
Save your DX Jupyterlab Notebook
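A minimal sketch of steps b, c, and e in a notebook, based on the dx download, dxFUSE, and dx upload snippets shown on this page (all paths and filenames are placeholders):

```python
import pandas as pd

# Step b: bring a data file into the JupyterLab environment
# Option 1: download a copy with dx download
!dx download "PATH/TO/FILE.csv"
# Option 2: read it directly through the dxFUSE project mount
# data = pd.read_csv("/mnt/project/PATH/TO/FILE.csv")

# Step c: import the data
data = pd.read_csv("FILE.csv")

# Step d: ...perform your analysis on `data`...

# Step e: upload results back to project storage
data.to_csv("results.csv", index=False)
!dx upload results.csv --destination /your/path/for/results
```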



import dxdata
import pprint
import pyspark
from pyspark.sql import functions as F

dx extract_dataset dataset_id -ddd --delimiter

sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)

%%bash
dx upload FILE --destination /your/path/for/results

// memory errors which should be retried. otherwise error out
errorStrategy = { task.exitStatus in ((130..145) + 104) ? 'retry' : 'finish' }
maxRetries = 1
maxErrors = '-1'

dx add users USER OR ORG NAME OF APP

pip install ___ #python
install.packages() #R

import ____ #python
library() #R

%%bash
#option 1: dx download
dx download "PATH TO FILE"
#option 2: dx fuse
data = pd.read_csv("/mnt/project/PATH.csv")

import ___ as pd
NAME = pd.read_csv("PATH.csv")

%%bash
dx upload FILE --destination /your/path/for/results

Notice, you will automatically return to the Info tab for that version.
You will have a review step. This is to review the content as well as add additional parameters such as a spending limit.





A place to contain details of running jobs/ analyses and their results
In your project space, select "DNAnexus Academy 101"
Navigate to the users folder and use Add > New Folder



To be able to resume a run, set preserve_cache to true for the initial run. This will cache the Nextflow workDir of the run in your project on the platform, in a folder called .nextflow_cache_db/<session_id>/. The session ID is a unique ID given to each (non-resumed) Nextflow run. Resumed Nextflow runs will share the same session ID as the run that they are resuming, since they are using the same cache.
The cache is the nextflow workDir which is where nextflow stores each tasks files during runs. By default when you run a nextflow applet, preserve_cache is set to false. In this state, if the applet fails you will not have the ability to resume the run and you are not able to see the contents of the work directory in your project.
To turn on preserve_cache for a run add -ipreserve_cache=true to your run command.
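```bash
dx run applet-xxxx -ipreserve_cache=true
```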
In the UI, scroll to the bottom of the Nextflow run setup screen
So if you are running a job and think there is a chance that you might want to resume it if it fails, then turn on preserve_cache.
Note that if you terminate a job manually i.e., using the terminate button in the UI or with dx terminate the cache will not be preserved and you will not be able to resume the run even if preserve_cache was set to true for the run. The same applies if a job is terminated due to a job cost limit being exceeded. Essentially, if it is not the DNAnexus executor terminating the run, then the cache is not preserved and so resuming the run is not possible.
You can store up to 20 caches in a project, and a cache will be stored for a maximum of 6 months. Once that limit has been reached, you will get a failure if you try to run another job with preserve_cache switched on. In practice, you should regularly delete your cache folders once you have had successful runs and no longer need them, to save on storage costs.
You can make changes to the Nextflow applet, dx build it again and/or make changes to the run inputs before resuming a run.
When you resume a run in the CLI using the session ID, the run will resume from what is cached for the session id on the project.
Only one Nextflow job with the same session ID can run at any time.
When resume is assigned 'true' or 'last', the run will determine the session ID that corresponds to the latest valid execution in the current project and resume the run from it.
or
To set up the sarek command to preserve the cache:
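```bash
dx run sarek_v3.4.0_ui -ioutdir='./test_run_cli_qs_ch' -ipreserve_cache=true -inextflow_run_opts='-profile test,docker -queue-size 20' --destination 'project-ID:/USERS/FOLDERNAME'
```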
To resume a sarek run and preserve updates to the cache from the new run (which will allow further resumes in case this resumed run fails) use the code below:
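```bash
dx run sarek_v3.4.0_ui -ioutdir='./test_run_cli_qs_ch' -ipreserve_cache=true -iresume='last' -inextflow_run_opts='-profile test,docker -queue-size 20' --destination 'project-ID:/USERS/FOLDERNAME'
```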
To get the session-id of a run, click the run in the monitor tab of your project and scroll down to the bottom of the page. On the bottom right you should see the session ID in the 'Properties' section
If you know your job ID, you can also use that to get the session ID on the CLI using
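```bash
dx describe job-ID --json | jq -r .properties.nextflow_session_id
```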
Check what version of dxpy was used to build the Nextflow pipeline and make sure it is the newest
Look at the head-node log (hopefully it was run with "debug mode" set to false, because when true, the log gets injected with details that aren't always useful and can make it hard to find errors)
Look for the process (sub-job) which caused the error, there will be a record of the error log from that process, though it may be truncated
Look at the failed sub-job log
Look at the raw code
Look at the cached work directories
.command.run runs to setup the runtime environment
Including staging file
Setting up Docker
Look at logs with "debug mode" as true
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
The Table-level screen appears when the user selects one particular table in the Navigator.
Table-level Screen of a table in Data Profiler
Overview details on the header of the Table-level screen
On the header of the Table-level screen, the user can find overall statistics on the selected table, that include:
Table size: number of rows and columns of the table
Missing rate: the rate of empty cells in the table
Duplicate rate: the rate of duplication of an entire row in the table
Pie chart of Column types on the header of the Table-level screen
The pie chart shows the composition of column types in the table. The size of each part of the pie is determined by the number of columns of that type. The user can also hover on the chart to get the count value.
Table-level screen has a Controller section that configures the visualization in the Chart area
The main function of the Table-level Screen is the Chart Area, which is controlled by a Controller in the top right corner of the screen. There are 2 main types of visualizations: Completeness and Column Profiles.
Completeness is the default mode of the Table-level screen. It aims to provide an overview on the count/rate of non-null values in a table. Completeness has 2 options: One-way view and Two-way view
One-way view in Table-level screen
One-way view is a stacked bar chart that displays the percentage of missing values, non-duplicates, and duplicates for each column in the table. You can click on the Legend/Key to show or hide specific statistics on the chart. Hover over each column to view detailed statistics.
Two-way view in Table-level screen
Two-way view is a heat map showing data completeness for all columns in the table. The Y-axis of the heatmap is the columns of the table. The X-axis of the heatmap is the unique values of the group-by column. The value of the heatmap shows how many entities of the table (as a raw count in Raw count mode, or a percentage in Percentage mode) have non-null values in the column (y-axis) with respect to the value of the group-by column (x-axis). The user can choose another column as the grouping factor; each value of this group-by column becomes a column in the heat map. Only categorical columns with a maximum of 30 unique values will show up as options.
The Controller of Two-way view
The numbers in the heat map can be configured in two ways:
Raw count displays the exact number of values available in each column.
Percentage shows the completeness statistic as a percentage. The completeness statistic ranges from 0 to 100, where 0 means the data is completely missing, and 100 indicates that the data is 100% complete.
Two-way View: Heat map, cross-table analysis
The user can also join the current table with another table using the Join with table option. By joining with another table, the user can use a column from that table as the Group-by column.
FAQs
Question: Can I use the Two-way View to check how many female patients have sequencing data?
Answer: Yes. Assume that your question involves two metadata fields: patient_sex (from the patient table) and sequencing_run_id (from the sequencing table), and that the patient and sequencing tables are join-able by patient_id. If that is the case, you can open the patient table with the Two-way View, join it with the sequencing table, and choose patient_sex as the Group-by column. On the sequencing.sequencing_run_id row, you can see the completeness rate broken down by each sex in patient_sex.
The heatmap options controller when doing cross-table analysis. We are joining "patients" table into the "observations" table
Completeness heatmap in case of cross-table analysis. In this example, the main table is "patients", the joined table is "observations". This heatmap shows how many patients who have available data (not-null values) on the fields which respect to the patient race: white, black, asian, native, or other
Column Profiles mode shows each column as a tile. The chart type depends on the type of the column.
This screen provides detailed statistics and distribution charts for the columns in the table. For all column types, it displays the missing rate and the duplication rate.
For columns containing string data, it shows the number of unique values and the value frequency, which is represented in a distribution chart.
For columns containing float data, the screen provides information about the variance, standard deviation, and the value range frequency, which is displayed in a distribution chart. Additionally, a box plot is shown, illustrating the maximum value, Q3 (upper quartile), median, Q1 (lower quartile), and the minimum value.
For columns containing datetime data, the screen displays the variance, standard deviation, and value range frequency on a distribution chart. A box plot is also provided, showing the maximum value, Q3 (upper quartile), median, Q1 (lower quartile), and the minimum value.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
You can add different sections, links, projects, etc into the json file
If you have questions about how to use a json file, please view this section
This is the file to create what you see above
Other parameters to the header section
Please email [email protected] to create a support ticket if there are technical issues.
Portable Batch System (PBS) or SLURM
dx-toolkit
Worker
Requested from pool of machines in private cluster
requested from pool of machines in AWS/ Azure
Shared Storage
Shared file system for all nodes (Lustre, GPFS, etc)
Project storage (Amazon S3/ Azure storage)
Worker File I/O
Handled by Shared file system
Needs to be transferred to and from project storage by commands on the worker
With an HPC, there is a collection of specialized hardware, including mainframe computers, as well as a distributed processing software framework so that the incredibly large computer system can handle massive amounts of data and processing at high speeds.
The goal of an HPC is to have the files on the hardware and to also do the analysis on it. In this way, it is similar to a local computer, but with more specialty hardware and software to have more data and processing power.
Your computer: this communicates with the HPC cluster for resources
HPC Cluster
Shared Storage: common area for where files are stored. You may have directories branching out by users or in another format
Head Node: manages the workers and the shared storage
HPC Worker: is where we do our computation and is part of the HPC cluster.
These work together to increase processing power and to have jobs and queues so that when the amount of workers that are needed are available, the jobs can run.
In comparison, cloud computing adds layers into analysis to increase computational power and storage.
This relationship and the layers involved are in the figure below:
Let's contrast this with processing a file on the DNAnexus platform.
We'll start with our computer, the DNAnexus platform, and a file from project storage.
We first use the dx run command, requesting to run an app on a file in project storage. This request is then sent to the platform, and an appropriate worker from the pool of workers is made available.
When the worker is available, we can transfer a file from the project to the worker.
The platform handles installing the app and its software environment to the worker as well.
Once our app is ready and our file is set, we can run the computation on the worker.
Any files that we generate must be transferred back into project storage.
HPC jobs are limited by how many workers are physically present on the HPC.
Cloud computing, by contrast, can request workers on demand from a much larger pool, so jobs are not limited by a fixed set of machines.
One common barrier is getting our files onto the worker from project storage, and then doing computations with them on the worker. The last barrier we'll review is getting the file outputs we've generated from the worker back into the project storage.
Cloud computing has a nestedness to it and transferring files can make learning it difficult.
A mental model of how cloud computing works can help us overcome these barriers.
Cloud computing is indirect, and you need to think 2 steps ahead.
Here is the visual for thinking about the steps for file management:
Creating apps and running them is covered later in the documentation.
Apps serve to (at minimum):
Request an EC2/Azure worker
Configure the worker's environment
Establish data transfer
Highly secure platform with built-in compliance infrastructure
Fully configurable platform
User can run single scripts to fully-automated, production-level workflows
Data transfer designed to be fast and efficient
Read and analyze massive files directly using dxfuse
Instances are configured for you via apps
Variety of ways to configure your own environments
Access to the wealth of
Largest Azure instances: ~4 TB RAM
Largest AWS instances: ~2 TB RAM
Run Job
dx run <app-id> <script>
qsub <script>
sbatch <script>
Monitor Job
dx find jobs
qstat
squeue
Kill Job
dx terminate <jobid>
qdel <jobid>
Single Job
Use `dx run` on the CLI directly
Use `dx run` in a shell script
Use a shell script to use `dx run` on multiple files
Use dxFUSE to directly access files (read only)
/ dx run --batch-tsv
1
List Files
List Files
2
Request 1 worker/ file
Use loop for each file: 1) use dx run, 2) transfer file, and 3) run commands
3
use array ids to process 1 file/worker
4
submit job to head node
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Driver/ Requestor
Head Node of Cluster
API Server
Submission Script Language
There are about 100 dx commands, which you can find by executing dx help all:
add: Add one or more items to a list
add developers: Add developers for an app
add member: Grant a user membership to an org
You are now able to:
Describe how to use metadata and the dx find data command on the CLI
Create and use batch file processing using the CLI
Describe the use cases that warrant the Cloud Workstation
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
There is an existing public Docker image available for CNVkit ("etal/cnvkit:latest"), so another option is to build a WDL version that will download and use this image at runtime rather than installing the Python and R modules ourselves.
In this example, you will:
Use WDL and Docker to build the CNVkit
To start, create a new directory called cnvkit_wdl parallel to the bash directory. Inside this new directory, create the file workflow.wdl with the following contents:
Next, ensure you have a working Java compiler and then download the latest dxCompiler Jar file. You can use the following command to place the 2.10.3 release into your home directory:
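One way to do that, assuming the release asset follows the usual dxCompiler-<version>.jar naming on the dxCompiler GitHub releases page:

```bash
wget -O ~/dxCompiler-2.10.3.jar \
  https://github.com/dnanexus/dxCompiler/releases/download/2.10.3/dxCompiler-2.10.3.jar
```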
Use dxCompiler to turn workflow.wdl into an applet equivalent to the bash version. In the following command, the workflow and all related applets will be placed into a workflows directory in the given project to keep all of this neatly contained. The given project ID project-GFf2Bq8054J0v8kY8zJ1FGQF is the caris_cnvkit project, so change this if you wish to place the workflow into a different project. Note the use of the -archive option to archive any existing version of the applet and allow the new version to take precedence, and the -reorg option to reorganize the output files. As shown in the following command, successful compilation will result in printing the new workflow's ID:
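A sketch of the compile command (assuming these dxCompiler flag names; check the dxCompiler documentation for your version):

```bash
java -jar ~/dxCompiler-2.10.3.jar compile workflow.wdl \
  -project project-GFf2Bq8054J0v8kY8zJ1FGQF \
  -folder /workflows \
  -archive -reorg
```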
Run the new workflow with the -h|--help flag to verify the inputs:
As with the bash version, you can launch the workflow from the CLI as follows:
The resulting output will show the JSON you can alternatively use to launch the job:
Following is the command you can use to launch the workflow from the CLI with the JSON file:
As before, you can use the web interface to monitor the progress of the workflow and inspect the outputs.
Run the following command to start a new cloud workstation:
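A sketch, assuming the standard Cloud Workstation app name (session length and instance type can be added as inputs if needed):

```bash
dx run app-cloud_workstation --ssh
```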
From the cloud workstation, pull the CNVkit Docker image:
Save and compress the image to a file:
Add the tarball to the project:
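A combined sketch of the pull, save, and upload steps above, run from inside the cloud workstation (the tarball filename is arbitrary; the image name comes from the earlier section):

```bash
# Pull the public CNVkit image
docker pull etal/cnvkit:latest

# Save and compress the image to a tarball
docker save etal/cnvkit:latest | gzip > cnvkit_latest.tar.gz

# Upload the tarball to the project
dx upload cnvkit_latest.tar.gz
```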
Update the WDL to use the tarball:
Build the app and run it.
In this chapter, you learned another strategy for packaging an applet's dependencies using Docker and then running the applet's code inside the Docker image using WDL.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
In this chapter, you'll learn to create an applet that uses an executable from the FASTX-Toolkit collection of command-line tools for processing short-read FASTA and FASTQ files. You'll use the applet to run fastq_quality_trimmer on a FASTQ file, creating a trimmed reads file that you can then use for further analysis.
You will learn the following:
How to accept an optional integer argument from the user
How to add resource files to an applet such as a binary executable that can be used in your applet code
Run dx-app-wizard mytrimmer to create the mytrimmer applet. You have already provided the applet name on the command line, so you can press Enter when prompted for it. You can add a title, summary, and version if you would like.
Start the input specification with the input FASTQ:
Next, indicate an optional integer for the quality score:
Press Enter to skip a third input and move to the output specification, which should define a single output file:
Press enter to exit the output section.
Set a timeout policy if you would like.
Answer the remaining questions to create a bash applet. The applet does not need access to the internet or parent project, and you can choose the default instance type.
Open the mytrimmer/dxapp.json in a text editor to view the inputSpec:
To make input file selection more convenient for the user, edit the patterns for the file extensions of the input_file to be those commonly used for FASTQ files:
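For example, the input_file entry might include a patterns array like this (the exact extension list is only a suggestion):

```json
{
  "name": "input_file",
  "class": "file",
  "patterns": ["*.fq", "*.fastq", "*.fq.gz", "*.fastq.gz"]
}
```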
These patterns are used in the web interface to filter files for the user, but it's not a requirement that the input files match these patterns. The file filter can be turned off by the user, so these patterns are merely suggestions.
Next, you will add a binary executable file from the FASTX toolkit. Download and unpack the FASTX toolkit binaries:
Then run make with the provided Makefile to build the executable.
The files are also here to download and for you to unpack:
Create the directory resources/usr/bin inside the mytrimmer directory:
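```bash
mkdir -p mytrimmer/resources/usr/bin
```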
When the app is bundled, the directory structure in the resources directory will be compressed and unpacked as is on the instance, so you should create a directory that is in the standard $PATH such as /usr/bin or /usr/local/bin.
This applet only requires the fastq_quality_trimmer binary, so copy it to the preceding directory:
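```bash
# The path to the unpacked binary may differ depending on where you extracted the toolkit
cp bin/fastq_quality_trimmer mytrimmer/resources/usr/bin/
```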
You should remove the downloaded binary artefacts as they are no longer needed.
Update mytrimmer/src/mytrimmer.sh with the following code:
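A sketch of what the script might look like, assuming the input is named input_file, the optional integer is named quality, and the output is named output_file in your dxapp.json (adjust to match your actual spec):

```bash
#!/bin/bash
# mytrimmer -- sketch only; variable names assume the inputSpec/outputSpec described above

main() {
    set -e -x -o pipefail

    # Download the input FASTQ file to the worker
    dx download "$input_file" -o "$input_file_name"

    # Build the output filename from the input prefix
    output_name="${input_file_prefix}.filtered.fastq"

    # fastq_quality_trimmer is on $PATH because it was bundled in resources/usr/bin;
    # -t is the quality threshold (the optional integer input, default set in dxapp.json)
    fastq_quality_trimmer -t "$quality" -i "$input_file_name" -o "$output_name"

    # Upload the trimmed file and register it as the applet output
    trimmed_id=$(dx upload "$output_name" --brief)
    dx-jobutil-add-output output_file "$trimmed_id" --class=file
}
```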
The variables $input_file and $input_file_name are based on the inputSpec name input_file. The first is a record-like string {"$dnanexus_link": "file-GJ2k2V80vx88z3zyJbVXZj3G"}, while the latter is the filename small-celegans-sample.fastq.
The variable $input_file_prefix is the name of the input file without the file extension, so small-celegans-sample, which is used to create the output filename small-celegans-sample.filtered.fastq.
You don't need to indicate the full path to fastq_quality_trimmer because it will exist in the directory /usr/bin, which is in the standard $PATH.
Add the sample FASTQ file to the project either by using the URL importer as shown in Figure 6, or download the file to your computer and upload through the web interface or using dx upload:
Use dx build to build the applet:
Run the applet with the -h|--help flag from the CLI to see the usage:
Run the applet using the file ID of the FASTQ file you uploaded:
The job's output should end with something like the following:
You can select the output file and view the results.
You can download the output file and check that the filtering actually removed some of the input sequences by using wc to count the original file and the result:
Run the applet with a higher quality score and verify that the result includes even fewer sequences.
In this chapter, you learned how to do the following:
Indicate an optional argument with a default value
Add a binary executable to a project in the resources directory and use that binary in your applet
Use variations on the input file variables to get the full filename or the filename prefix without the extension
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
A license is required to access the Data Profiler on the DNAnexus Platform. For more information, please contact DNAnexus Sales (via [email protected]).
The data used in this section of Academy documentation can be found here to download: https://synthea.mitre.org/downloads
The citation for this synthetic dataset is:
Walonoski J, Klaus S, Granger E, Hall D, Gregorowicz A, Neyarapally G, Watson A, Eastman J. Synthea™ Novel coronavirus (COVID-19) model and synthetic data set. Intelligence-Based Medicine. 2020 Nov;1:100007. https://doi.org/10.1016/j.ibmed.2020.100007
Data Profiler helps the user explore different levels of a dataset. There are 3 levels of a dataset in Data Profiler:
Dataset level: Show relationships between tables in the dataset and overview of all tables, columns in the dataset
Table level: Show statistics of one particular table. It can also join with another table to create a joint profile.
Column level: Show statistics of one particular column of a table. It can also combine with other columns in the same table to create a joint profile.
To navigate between these 3 levels, the user can select from a navigator on the left side of the application. Once an option of the navigator is selected, the content of the main interface will change accordingly.
The user interface of Data Profiler consists of a navigator (left, highlighted in blue), which controls the content of the main section (right, highlighted in green).
Navigator controls the content on the main section of Data Profiler. The main component of the Navigator is a hierarchical structure of the dataset, called Data Hierarchy
The top level of a Data Hierarchy is All Tables, indicating the dataset level. This level is selected by default.
Under All Tables are individual tables in the dataset. Each table has a number on the far right indicating the number of columns in the table.
Once a table is selected, the Data Hierarchy will show all columns from that table. Each column has a colored tag indicating the column type.
Above the Data Hierarchy, the user can search for one or more columns. The Data Hierarchy will show tables that have at least one of the column names in the search list (OR logic).
At the bottom of the Navigator, the user can switch to an Explorer Mode to create charts on their own. The functionality of this mode is discussed in another section of this document.
The 📜 button shows the Inference Logs Screen, which shows details of the profiling process. This feature is in development.
The type of a column in Data Profiler can be specified in a data_dictionary. If that information is not available, Data Profiler will infer the column type based on the content of the column.
In Data Profiler, there are 4 column types. These types are consistent with the data types used via the Data Model Loader on the DNAnexus platform:
Null (or empty) values are allowed in all column types and they do not affect how a column type is determined.
In my data_dictionary, the type of column A is “integer”. After loading with Data Profiler, the application says column A is a “string” column. What happened?
There is at least one non-null arbitrary value in column A that cannot be cast to an integer. Therefore, the Data Profiler falls back to “string”.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select “Contact Support”
Fill in the Subject and Message to submit a support ticket.
Nextflow pipelines are composed of processes e.g., a task such as fastqc would be one process, then read trimming would be another process etc. Processes pass files between them using channels (queues) so every process usually has an input and output channel. Nextflow is implicitly parallel - if it can run something in parallel, it will! There is no need to loop over channels etc.
For example, you could have a script with fastqc and read_trimming processes that take in a fastq reads channel. As these two processes have no links between them, they will be run at the same time.
The Nextflow workflow file is called main.nf.
Let's think about a quick workflow that takes in some single-end fastq files, runs fastqc on them, then trims them, runs fastqc again, and finally runs multiqc on the fastqc outputs.
An example of code that would achieve the workflow in the image (not showing what each process script looks like here)
An example local run (not on or interacting with DNAnexus) would look like the command below. This assumes you have Nextflow on your own local machine, which is not required for DNAnexus
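A sketch of such a command (the path is a placeholder; --fastq_dir is the parameter assumed above):

```bash
nextflow run main.nf --fastq_dir 'path/to/your/fastq_files'
```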
As we gave --fastq_dir a default, if your inputs match that default you could just run
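```bash
nextflow run main.nf
```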
DNAnexus has developed a version of the Nextflow executor that can orchestrate Nextflow runs on the DNAnexus platform.
Once you kick-off a Nextflow run, a Nextflow 'head-node' is spun up. This stays on for the duration of the run and it spins up and controls the subjobs (each instance of a process).
orchestrates subjobs
contains the Nextflow output directory which is usually specified by params.outdir in nfcore pipelines
copies the output directory to the DNAnexus project once all subjobs have completed (--destination)
one for every instance of a process
each subjob is one virtual machine (instance) e.g., fastqc_process(fileA) is run on one machine and fastqc_process(fileB) is run on a different machine
Uses a Docker image for the process environment
Required files pulled onto machine and outputs sent back to head node once subjob completed
Nextflow uses a 'work' directory (workDir) for executing tasks. Each instance of a process gets its own folder in the work directory and this directory stores task execution info, intermediate files etc.
Depending on if you choose to or not, you will be able to see this work directory on the platform during/after your nextflow run.
Otherwise, the work directory exists in a and it will be destroyed once a run has completed.
You may have learned about batching some inputs for WDL workflows previously. You do not need to do this for Nextflow applets - all parallelisation is done automatically by Nextflow.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Some of the links on these pages will take the user to pages that are maintained by third parties. The accuracy and IP rights of the information on these third-party pages are the responsibility of those third parties.
A license is required to access the Data Profiler on the DNAnexus Platform. For more information, please contact DNAnexus Sales (via [email protected]).
A Note on Data:
The data used in this section of Academy documentation can be found here to download: https://synthea.mitre.org/downloads
The citation for this synthetic dataset is:
Walonoski J, Klaus S, Granger E, Hall D, Gregorowicz A, Neyarapally G, Watson A, Eastman J. Synthea™ Novel coronavirus (COVID-19) model and synthetic data set. Intelligence-Based Medicine. 2020 Nov;1:100007. https://doi.org/10.1016/j.ibmed.2020.100007
Column-level screen shows a string column
For columns containing string data, the data profiler will display several statistics and charts to help analyze the data.
The statistics include:
The missing rate, expressed as a percentage of the missing values in the column.
The number of unique values present in the column.
The charts provided include:
Top Records Bar Chart: This chart displays the top values that occur most frequently in the column. You can select how many top records to display using a dropdown list. By hovering over the bars, you can see the exact count of records for each value.
Character Length Distribution Chart: This chart shows how the lengths of the strings are distributed. By hovering over different parts of the chart, you can view the range of character lengths and how frequently each range occurs. Besides, the average length of the strings in the column and standard deviation (which measures the amount of variation in the string lengths) are also reported.
Boxplot: The boxplot provides a visual summary of the data in terms of its distribution, showing the maximum value, Q3 (upper quartile)
Column-level screen shows a float column
For columns containing float data, the data profiler provides several statistics and charts to help analyze the data.
The statistics include:
The missing rate, displayed as a percentage of missing values.
The standard deviation, which measures the spread of the data values.
The Interquartile range, which measures the difference between the 75th and 25th percentiles of the data.
The charts provided include:
Distribution Chart: This chart displays the distribution of values in the column. You can hover over the chart to view the range of values and their frequencies.
Boxplot: The boxplot visually represents the distribution of the data, showing the maximum value, Q3 (upper quartile), median, Q1 (lower quartile), and the minimum value.
Grouping Frequency Chart (Two way plot): This chart shows the frequency of unique values in the current column, grouped with values from another column. You can select the column for grouping from a dropdown list.
Column-level screen shows a datetime column
For columns containing datetime data, the data profiler provides several statistics and charts for in-depth analysis.
The statistics include:
The missing rate, displayed as a percentage of missing values.
The standard deviation, measuring the dispersion of the datetime values.
The Mode, showing the mode/format of the datetime data in the column.
The charts provided include:
Distribution Chart: This chart shows the distribution of datetime values in the column. You can hover over the chart to view the range of values and their frequencies.
Boxplot: The boxplot visually represents the distribution of the datetime data, displaying the maximum value, Q3 (upper quartile), median, Q1 (lower quartile), and the minimum value.
Radar Chart: This chart displays the frequency of values, grouped by year, month, or day. You can change the grouping option using the dropdown at the top.
Even though each column type has a different layout on the Column-level Screen, Pairwise plot between columns is a common component.
The user can create a plot between the current column and any other column from the same table. However, not all columns are available for this feature. Data Profiler will show columns that satisfy the following conditions:
Not a string column
If it is a string column:
Not a primary key
The number of unique values count is no larger than 30
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Be sure to install .
JavaScript Object Notation (JSON) is a data exchange format designed to be easy for humans and machines to read. You will encounter JSON in several places on the DNAnexus platform, such as when you create and edit native applets and workflows. As shown in Figure 1, JSON is used to communicate with the DNAnexus Application Programming Interface (API). Understanding the responses from the API will help you debug applets, find failed jobs, and relaunch analyses.
A license is required to access the Data Profiler on the DNAnexus Platform. For more information, please contact DNAnexus Sales (via ).
The data used in this section of Academy documentation can be found here to download:
The citation for this synthetic dataset is:
Walonoski J, Klaus S, Granger E, Hall D, Gregorowicz A, Neyarapally G, Watson A, Eastman J. Synthea™ Novel coronavirus (COVID-19) model and synthetic data set. Intelligence-Based Medicine. 2020 Nov;1:100007.
dx run applet-xxxx -ipreserve_cache=true

dx run applet-xxxx -iresume='session-id'

dx run applet-xxxx -iresume='last'

dx run applet-xxxx -iresume=true

dx run sarek_v3.4.0_ui -ioutdir='./test_run_cli_qs_ch' -ipreserve_cache=true -inextflow_run_opts='-profile test,docker -queue-size 20' --destination 'project-ID:/USERS/FOLDERNAME'

dx run sarek_v3.4.0_ui -ioutdir='./test_run_cli_qs_ch' -ipreserve_cache=true -iresume='last' -inextflow_run_opts='-profile test,docker -queue-size 20' --destination 'project-ID:/USERS/FOLDERNAME'

dx describe job-ID --json | jq -r .properties.nextflow_session_id
#ID{
"header": {
"logo": "#logo_header.png",
"logoOpensNewTab": true,
"hideCommunitySwitch": true,
"colors": {
"background": "#EEEEEE",
"border": "#EEEEEE",
"text": "#000000"
}
},
"homeURL": "http://academy.dnanexus.com"
}
{
"header": {
"logo": "#logo_header.png", #image for the logo; has to be an appropriate size. min 15x15px, max 50x30px
"logoOpensNewTab": true, #opens new tab if you select the logo
"hideCommunitySwitch": true,
"colors": {
"background": "#123456", #background color for the header
"border": "#123456", #border color for the header
"text": "#123456", #text color
}
} "header": {
"colors": {
"hoverBackground": "#123456", #hover background color
"userColors": ["#123456", "#234567", "#345678"], #user colors
"button": {"success": {"border-color": "green", "background":
"pink", "hover": {"background": "dusk"}}} #setting colors for buttons or hover selections
}"login": {
"logo": "#logo_login.png", #image for login
"text": "# ADD TEXT IN MARKDOWN FORMAT HERE.",
"colors": {
"loginButton": "#123456" #set color for login button here
}"register": {
"disable": true,
"logo": "#logo_register.png", #image for registering
"text": "#ADD TEXT IN MARKDOWN FORMAT HERE.",
"agreeToText": "Plain text you need to agree to before registering", #plain text, string
"region": "aws:us-east-1",
"colors": {
"registerButton": "#123456" #color for register button
}"homeURL": "http://example.com", #url for logo
"supportURL": "http://example.com/support", #support URL
"hideCommunitySwitch": true,
"description": "A short description of two or three lines for the community selector" #description for the community # Generate batch file by regex
$ dx generate_batch_inputs -iinput_fwd='(.*)_R1_001.fastq.gz' -iinput_rev='(.*)_R2_001.fastq.gz'
# Show the local file
$ cat dx_batch.0000.tsv
# Use the local batch file
$ dx run fastp --batch-tsv dx_batch.0000.tsv -iadapter_fa=/data/adapters.fa -iprefix='Sample1'
add stage: Add a stage to a workflow
add users: Add authorized users for an app
add_types: Add types to a data object
api: Call an API method
archive: Requests for the specified set of files or for the files in a single specified folder in one project to be archived on the platform
build: Create a new applet/app, or a workflow
build_asset: Build an asset bundle
cat: Print file(s) to stdout
cd: Change the current working directory
clearenv: Clears all environment variables set by dx
close: Close data object(s)
cp: Copy objects and/or folders between different projects
describe: Describe a remote object
download: Download file(s)
env: Print all environment variables in use
exit: Exit out of the interactive shell
extract_dataset: Retrieves the data or generates SQL to retrieve the data from a dataset or cohort for a set of entity.fields. Additionally, the dataset's dictionary can be extracted independently or in conjunction with data. Listing options enable enumeration of the entities and their respective fields in the dataset.
find analyses: List analyses in the current project
find apps: List available apps
find data: List data objects in the current project
find executions: List executions (jobs and analyses) in the current project
find globalworkflows: List available global workflows
find jobs: List jobs in the current project
find org apps: List apps billed to the specified org
find org members: List members in the specified org
find org projects: List projects billed to the specified org
find orgs: List orgs
find projects: List projects
generate_batch_inputs: Generate a batch plan (one or more TSV files) for batch execution
get: Download records, apps, applets, workflows, files, and databases
get_details: Get details of a data object (cf details)
head: Print part of a file
help: Display help messages and dx commands by category
install: Install an app
invite: Invite another user to a project or make it public
list database: List entities associated with a specific database
list database files: List database files associated with a specific database
list developers: List developers for an app
list stages: List the stages in a workflow
list users: List authorized users for an app
login: Log in (interactively or with an existing API token)
logout: Log out and remove credentials
ls: List folders and/or objects in a folder
make_download_url: Create a file download link for sharing
mkdir: Create a new folder
mv: Move or rename objects and/or folders inside a project
new org: Create new non-billable org
new project: Create a new project
new record: Create a new record
new user: Create a new user account
new workflow: Create a new workflow
publish: Publish an app or a global workflow
pwd: Print current working directory
remove developers: Remove developers for an app
remove member: Revoke the org membership of a user
remove stage: Remove a stage from a workflow
remove users: Remove authorized users for an app
remove_types: Remove types from a data object
rename: Rename a project or data object
rm: Remove data objects and folders
rmdir: Remove a folder
rmproject: Delete a project
run: Run an applet, app, or workflow
select: List and select a project to switch to
set_details: Set details on a data object
set_properties: Set properties of a project, data object, or execution
set_visibility: Set visibility on a data object
setenv: Sets environment variables for the session
ssh: Connect to a running job via SSH
ssh_config: Configure SSH keys for your DNAnexus account
tag: Tag a project, data object, or execution
terminate: Terminate jobs or analyses
tree: List folders and objects in a tree
unarchive: Requests for the specified set of files or for the files in a single specified folder in one project to be unarchived on the platform.
uninstall: Uninstall an app
uninvite: Revoke others' permissions on a project you administer
unset_properties: Unset properties of a project, data object, or execution
untag: Untag a project, data object, or execution
update member: Update the membership of a user in an org
update org: Update information about an org
update project: Updates a specified project with the specified options
update stage: Update the metadata for a stage in a workflow
update workflow: Update the metadata for a workflow
upgrade: Upgrade dx-toolkit (the DNAnexus SDK and this program)
upload: Upload file(s) or directory
wait: Wait for data object(s) to close or job(s) to finish
watch: Watch logs of a job and its subjobs
whoami: Print the username of the current user
Grouping Frequency Chart: This chart displays how often unique values in the current column occur when grouped with values from another column. You can choose the column to group by using a dropdown list.
Grouping Frequency Chart (Two Way Plot): This chart shows the frequency of unique datetime values in the current column, grouped with values from another column. You can select the column for grouping from a dropdown list.





.command.sh is the translated script block of the process
Translated, because input channels are rendered as actual file locations
.command.log, .command.out, etc. are all log files
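For example, once a task has completed you can inspect these files directly in its work directory. A minimal sketch, assuming a hypothetical two-level hash subdirectory (copy the real path from the Nextflow log for the task you want to inspect):
$ cd work/3f/9c1a2b        # hypothetical task work directory
$ cat .command.sh          # the rendered script block for this task
$ tail .command.log        # combined log
$ tail .command.err        # stderr only
$ cat .exitcode            # exit status of the task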




scancel <jobid>



Run fastq_quality_trimmer using the given $quality_score and write to the output filename. The -Q option is an undocumented option to indicate that the scores are in phred 33.
Upload the output file, which returns another record-like string describing the newly created file.
Add the newly uploaded record as a file output of the job.
Task execution status, temp files, stdout and stderr logs, etc. are sent to the work directory

Here is an example of objects nested inside other objects, describing the output of the FastQC app, which creates two files as outputs: an HTML report and a text file containing statistics on the input FASTQ:
In a later chapter, you will use a file called dxapp.json to build custom applets on DNAnexus. To see a full example from a working app, run dx get app-fastqc to download the source code for the FastQC app. This should create a fastqc directory that contains the file dxapp.json.
Following is a portion of this file showing a typical JSON document you'll encounter on DNAnexus:
The root element of this JSON document is an object, as denoted by the curly brackets.
The value of inputSpec is a list, as denoted by the square brackets.
Each value in the list is another object.
The first three values of this object are strings.
The patterns value is a list of strings representing file globs that match the input file extensions.
The following links explain the dxapp.json file in greater detail:
JSON is a strict format that is easy to get wrong if you are manually editing a file. For this reason, we suggest you use text editors that understand JSON syntax, highlight data structures, and spot common mistakes. For instance, a JSON object looks very similar to a Python dictionary, which allows a trailing comma in a list. Open the python3 REPL (read-evaluate-print-loop) and enter the following to verify:
A similar trailing comma in JSON would make the document invalid. To see this, go to JSONlint.com, paste this into the input box, and press the "Validate JSON" button:
The result should reformat the JSON onto three lines as follows:
The second line should be highlighted in red, and the "Results" below show that a JSON value is expected after the last comma and before the closing square bracket.
Remove the offending comma and revalidate the document to see the "Results" change to "Valid JSON." You may also want to install a command-line tool like jsonlint that can show similar errors:
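One common way to install jsonlint, assuming Node.js and npm are available on your machine:
$ npm install -g jsonlint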
JSON is not dependent on whitespace, so the previous example could be compressed to the following:
The jq program will format JSON into an indented data structure that is easier to read. In the following example, we execute jq with the filter . to indicate we wish to see the entire document, which is the last argument. Depending on your terminal, the keys may be shown in one color and the values in a different color:
The power of jq lies in the filter argument, which allows you to extract and manipulate the contents of the document. Use the filter .report_html to extract the value for key report_html that lies at the root of the document:
::: note If you request a key that does not exist, you will get the JavaScript value null, indicating no value is present: :::
Filters may chain keys to search further into the document structure. In the following example, we can extract the file identifier by chaining .report_html.dnanexus_link:
Unix-type operating systems such as Linux and FreeBSD/macOS have three special filehandles:
STDIN (standard in)
STDOUT (standard out)
STDERR (standard error)
STDOUT and STDERR control the output of programs where the first is usually the console and the second is an error channel to segregate errors from regular output. For instance, the STDOUT of jq can be redirected to a file using the > operator:
STDIN is an input filehandle created by using a pipe (|) in the following example:
Alternatively, you can read from an input redirect using <:
Many dx commands can return JSON by appending the --json flag to them. For instance, dx describe app-fastqc will return a table of metadata about the FastQC app. In the following example, I will request the same data as JSON and will pipe it into the head program to see the first 10 lines:
As with previous examples, the result is a JSON document with an object at the root level; therefore, I can pipe the output into jq .id to extract the app identifier:
I can use dx find projects --public to view a list of public projects. Using head, I can see the root of the JSON is a list:
The jq filter .[] will iterate over the values of a list at the root, so I can use .[].id in the following command to extract the project identifier of each. As this returns over 100 results, I'll use head to show the first few lines:
You can also use pipes inside of the jq filter to extract the same data:
You may wish to re-run an analysis, possibly with slightly different inputs. For this example, I'll use the job.json file rather than using a pipe.
Redirect this to a file:
::: note If you had access to the original job ID, you would run the following: :::
Edit the input.json file, perhaps to indicate a different kmer_size, then re-run the app using the new input:
Sometimes I find that some jobs have failed when processing large batches of data. I can use dx find jobs --state failed to return a list of failed jobs; jobs might fail if the input files were corrupt or especially large, causing the instances to run out of disk space or memory. First, I'll show you how to use more advanced filtering in jq. The file jobs.json shows example output from dx find jobs --json that I'll use to extract the state of the jobs:
A select statement in jq can find the "failed" jobs, and pipes join more filters to extract the job IDs and the app IDs:
To be useful in a bash loop, I need the job and app IDs on the same line, so I can use paste for this:
If I had access to the original executions and input files, I could use a bash loop to re-run these jobs. Since I don't, I'll echo the command that should be run:
This produces the following output:
If you were using dx find jobs, then the equivalent would be this:
You should now be able to:
Describe how users interact with the DNAnexus Platform
Explain the purpose of using JSON on the DNAnexus platform
Articulate the basic elements of JSON
Describe and read basic JSON structures on the platform
Parse JSON responses from the platform using jq and pipes to other filters or Unix programs
Learn the dxapp.json specification
Use an Editor like Visual Studio Code with JSON Crack plugin
Use JSON checking tools to make sure your JSON is well formed
Run through jq
Use dx get to get app code and dxapp.json for an existing app
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.

version 1.0
task cnvkit_wdl_kyc {
input {
Array[File] bam_tumor
File reference
}
command <<<
cnvkit.py batch \
~{sep=" " bam_tumor} \
-r ~{reference} \
-p $(expr $(nproc) -1) \
-d output/ \
--scatter
>>>
runtime {
docker: "etal/cnvkit:latest"
cpu: 16
}
output {
Array[File]+ cns = glob("output/[!.call]*.cns")
Array[File]+ cns_filtered = glob("output/*.call.cns")
Array[File]+ plot = glob("output/*-scatter.png")
}
}
$ cd && wget https://github.com/dnanexus/dxCompiler/releases/download/2.10.3/dxCompiler-2.10.3.jar
$ java -jar ~/dxCompiler-2.10.3.jar compile workflow.wdl \
-archive \
-reorg \
-folder /workflows \
-project project-GFf2Bq8054J0v8kY8zJ1FGQF
applet-GFyVxpQ0VGFgGQBy4vJ0kxK2
$ dx run applet-GFyVxpQ0VGFgGQBy4vJ0kxK2 -h
usage: dx run applet-GFyVxpQ0VGFgGQBy4vJ0kxK2 [-iINPUT_NAME=VALUE ...]
Applet: cnvkit_wdl_kyc
Inputs:
bam_tumor: [-ibam_tumor=(file) [-ibam_tumor=... [...]]]
reference: -ireference=(file)
Reserved for dxCompiler
overrides___: [-ioverrides___=(hash)]
overrides______dxfiles: [-ioverrides______dxfiles=(file) [-ioverrides______dx>
Outputs:
cns: cns (array:file)
cns_filtered: cns_filtered (array:file)
plot: plot (array:file)
$ dx run -y --watch applet-GFyVxpQ0VGFgGQBy4vJ0kxK2 \
-ibam_tumor=file-GFxXjV006kZVQPb20G85VXBp \
-ireference=file-GFxXvpj06kZfP0QVKq2p2FGF \
--destination project-GFyPxb00VGFz5JZQ4f5x424q:/users/kyclark
$ cat inputs.json
{
"bam_tumor": [
{
"$dnanexus_link": "file-GFxXjV006kZVQPb20G85VXBp"
}
],
"reference": {
"$dnanexus_link": "file-GFxXvpj06kZfP0QVKq2p2FGF"
}
}
$ dx run -y --watch applet-GFyVxpQ0VGFgGQBy4vJ0kxK2 -f inputs.json \
--destination project-GFyPxb00VGFz5JZQ4f5x424q:/users/kyclark
$ dx run -imax_session_length="1d" app-cloud_workstation --ssh -y
$ docker pull etal/cnvkit:latest
$ docker save etal/cnvkit:latest | gzip - > cnvkit.tar.gz
$ dx upload cnvkit.tar.gz --path project-GFyPxb00VGFz5JZQ4f5x424q:/
[===========================================================>]
Uploaded 503,092,072 of 503,092,072 bytes (100%) cnvkit.tar.gz
ID file-GFyq05j0VGFqJqq54q98pbBK
Class file
Project project-GFyPxb00VGFz5JZQ4f5x424q
Folder /
Name cnvkit.tar.gz
State closing
Visibility visible
Types -
Properties -
Tags -
Outgoing links -
Created Thu Aug 18 03:20:55 2022
Created by kyclark
via the job job-GFypx3Q0VGFgb71g4gYY3GF3
Last modified Thu Aug 18 03:20:57 2022
Media type
archivalState "live"
cloudAccount "cloudaccount-dnanexus"version 1.0
task cnvkit_wdl_tarball {
input {
Array[File] bam_tumor
File reference
}
command <<<
cnvkit.py batch \
~{sep=" " bam_tumor} \
-r ~{reference} \
-p $(expr $(nproc) -1) \
-d output/ \
--scatter
>>>
runtime {
docker: "dx://file-GFyq05j0VGFqJqq54q98pbBK"
cpu: 16
}
output {
Array[File]+ cns = glob("output/[!.call]*.cns")
Array[File]+ cns_filtered = glob("output/*.call.cns")
Array[File]+ plot = glob("output/*-scatter.png")
}
}
Input Specification
You will now be prompted for each input parameter to your app. Each parameter
should have a unique name that uses only the underscore "_" and alphanumeric
characters, and does not start with a number.
1st input name (<ENTER> to finish): input_file
Label (optional human-readable name) []: Input file
Your input parameter must be of one of the following classes:
applet array:file array:record file int
array:applet array:float array:string float record
array:boolean array:int boolean hash string
Choose a class (<TAB> twice for choices): file
This is an optional parameter [y/n]: n
2nd input name (<ENTER> to finish): quality_score
Label (optional human-readable name) []: Quality score
Choose a class (<TAB> twice for choices): int
This is an optional parameter [y/n]: y
A default value should be provided [y/n]: y
Default value: 30
Output Specification
You will now be prompted for each output parameter of your app. Each
parameter should have a unique name that uses only the underscore "_" and
alphanumeric characters, and does not start with a number.
1st output name (<ENTER> to finish): output_file
Label (optional human-readable name) []: Output file
Choose a class (<TAB> twice for choices): file
"inputSpec": [
{
"name": "input_file",
"label": "Input file",
"class": "file",
"optional": false,
"patterns": [
"*"
],
"help": ""
},
{
"name": "quality_score",
"label": "Quality score",
"class": "int",
"optional": true,
"default": 30,
"help": ""
}
],
{
"name": "input_file",
"label": "Input file",
"class": "file",
"optional": false,
"patterns": [
"*.fastq",
"*.fq"
],
"help": ""
}
wget https://github.com/agordon/fastx_toolkit/releases/download/0.0.14/fastx_toolkit-0.0.14.tar.bz2
tar xvf fastx_toolkit-0.0.14.tar.bz2
x ./bin/fasta_clipping_histogram.pl
x ./bin/fasta_formatter
x ./bin/fasta_nucleotide_changer
x ./bin/fastq_masker
x ./bin/fastq_quality_boxplot_graph.sh
x ./bin/fastq_quality_converter
x ./bin/fastq_quality_filter
x ./bin/fastq_quality_trimmer
x ./bin/fastq_to_fasta
x ./bin/fastx_artifacts_filter
x ./bin/fastx_barcode_splitter.pl
x ./bin/fastx_clipper
x ./bin/fastx_collapser
x ./bin/fastx_nucleotide_distribution_graph.sh
x ./bin/fastx_nucleotide_distribution_line_graph.sh
x ./bin/fastx_quality_stats
x ./bin/fastx_renamer
x ./bin/fastx_reverse_complement
x ./bin/fastx_trimmer
x ./bin/fastx_uncollapser
mkdir -p mytrimmer/resources/usr/bin/
cp PATH_TO_FASTX/fastq_quality_trimmer mytrimmer/resources/usr/bin/
#!/bin/bash
set -exuo pipefail
main() {
echo "Value of input_file: '$input_file'"
echo "Value of quality_score: '$quality_score'"
dx download "$input_file" -o "$input_file_name"
outfile="${input_file_prefix}.filtered.fastq"
fastq_quality_trimmer -Q 33 -t ${quality_score} -i "$input_file_name" -o "$outfile"
outfile_id=$(dx upload $outfile --brief)
dx-jobutil-add-output output_file "$outfile_id" --class=file
}
wget https://dl.dnanex.us/F/D/Bp43z7pb2JX8jpB035j4424Vp4Y6qpQ6610ZXg5F/small-celegans-sample.fastq
dx upload small-celegans-sample.fastq
[===========================================================>]
Uploaded 16,801,690 of 16,801,690 bytes (100%) small-celegans-sample.fastq
ID file-GJ2k2V80vx88z3zyJbVXZj3G
Class file
Project project-GJ2k24j0vx804FPyBbxqpQBk
Folder /
Name small-celegans-sample.fastq
State closing
Visibility visible
Types -
Properties -
Tags -
Outgoing links -
Created Tue Oct 11 08:52:37 2022
Created by kyclark
Last modified Tue Oct 11 08:52:53 2022
Media type
archivalState "live"
cloudAccount "cloudaccount-dnanexus"$ dx build mytrimmer -f
{"id": "applet-GJ2k5780vx804FPyBbxqpQQ0"}$ dx run applet-GJ2k5780vx804FPyBbxqpQQ0 -h
usage: dx run applet-GJ2k5780vx804FPyBbxqpQQ0 [-iINPUT_NAME=VALUE ...]
Applet: FastQTrimmer
mytrimmer
Inputs:
Input file: -iinput_file=(file)
Quality score: [-iquality_score=(int, default=30)]
Outputs:
Output file: output_file (file)
$ dx run applet-GJ2k5780vx804FPyBbxqpQQ0 \
> -iinput_file=file-GJ2k2V80vx88z3zyJbVXZj3G -y --watch
Using input JSON:
{
"input_file": {
"$dnanexus_link": "file-GJ2k2V80vx88z3zyJbVXZj3G"
}
}
Calling applet-GJ2k5780vx804FPyBbxqpQQ0 with output destination
project-GJ2k24j0vx804FPyBbxqpQBk:/
Job ID: job-GJ2k5F00vx84k2X3BqqZ5Zpp
Job Log
-------
Watching job job-GJ2k5F00vx84k2X3BqqZ5Zpp. Press Ctrl+C to stop watching.
2022-10-11 16:31:18 FastQTrimmer STDERR + echo 'Value of input_file:
'\''{"$dnanexus_link": "file-GJ2k2V80vx88z3zyJbVXZj3G"}'\'''
2022-10-11 16:31:18 FastQTrimmer STDERR + echo 'Value of quality_score:
'\''30'\'''
2022-10-11 16:31:18 FastQTrimmer STDOUT Value of input_file:
'{"$dnanexus_link": "file-GJ2k2V80vx88z3zyJbVXZj3G"}'
2022-10-11 16:31:18 FastQTrimmer STDOUT Value of quality_score: '30'
2022-10-11 16:31:18 FastQTrimmer STDERR + dx download '{"$dnanexus_link":
"file-GJ2k2V80vx88z3zyJbVXZj3G"}' -o small-celegans-sample.fastq
2022-10-11 16:31:19 FastQTrimmer STDERR + outfile=
small-celegans-sample.filtered.fastq
2022-10-11 16:31:19 FastQTrimmer STDERR + fastq_quality_trimmer -Q 33
-t 30 -i small-celegans-sample.fastq -o small-celegans-sample.filtered.fastq
2022-10-11 16:31:27 FastQTrimmer STDERR ++ dx upload
small-celegans-sample.filtered.fastq --brief
2022-10-11 16:31:28 FastQTrimmer STDERR + outfile_id=
file-GJ2zkYj06GbzP8XBB4bVGxQ6
2022-10-11 16:31:28 FastQTrimmer STDERR + dx-jobutil-add-output output_file
file-GJ2zkYj06GbzP8XBB4bVGxQ6 --class=file
$ dx download file-GJ2k73j08bbkVxK9Gxx8Z891
[===========================================================>]
Completed 15,557,666 of 15,557,666 bytes (100%) .../fastq_trimmer/small-celegans-sample.filtered.fastq
$ wc -l small-celegans-sample.f*
100000 small-celegans-sample.fastq
99848 small-celegans-sample.filtered.fastq
199848 total
nextflow.enable.dsl=2
//params.fastq_dir will be exposed as a pipeline input and is given a default here
params.fastq_dir = "./FASTQ/*.fq.gz"
//make a fastq ch
fastq_ch = Channel.fromPath(params.fastq_dir)
workflow {
//fastqc
// takes in a fastq_ch and outputs a channel with fastqc html and zip files
raw_fastqc_ch = fastqc(fastq_ch)
//takes in a fastq_ch and outputs a channel with trimmed reads
trimmed_reads_ch = read_trimming(fastq_ch)
//takes in the trimmed reads channel this time
trimmed_fastqc_ch = fastqc_trimmed(trimmed_reads_ch)
//combine the two channels together to use them in multiqc
combined_fastqc_ch = raw_fastqc_ch.mix(trimmed_fastqc_ch)
//takes in a channel containing fastqc files
//collect is used here to make all files available at the same time.
multiqc(combined_fastqc_ch.collect())
}
nextflow run main.nf --fastq_dir "/FASTQ/SRR_*.fastq.gz"
nextflow run main.nf
{
"report_html": {
"dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY"
},
"stats_txt": {
"dnanexus_link": "file-G4x7GXQ0VBzZxFxz4fqV120B"
}
}
{
"name": "fastqc",
"title": "FastQC Reads Quality Control",
"summary": "Generates a QC report on reads data",
"dxapi": "1.0.0",
"openSource": true,
"version": "3.0.3",
"inputSpec": [
{
"name": "reads",
"label": "Reads",
"help": "A file containing the reads to be checked. Accepted formats are gzipped-FASTQ and BAM.",
"class": "file",
"patterns": [
"*.fq.gz",
"*.fastq.gz",
"*.sam",
"*.bam"
]
},
...
}>>> { 'patterns': [ '*.bam', '*.sam', ] }
{'patterns': ['*.bam', '*.sam']}
{ "patterns": [ "*.bam", "*.sam", ] }
{
"patterns": ["*.bam", "*.sam", ]
}
Error: Parse error on line 2:
... ["*.bam", "*.sam", ]}
-----------------------^
Expecting 'STRING', 'NUMBER', 'NULL', 'TRUE', 'FALSE', '{', '[', got ']'
$ jsonlint dxapp.json
Error: Parse error on line 15:
...*.sam", ], "help
----------------------^
Expecting 'STRING', 'NUMBER', 'NULL', 'TRUE', 'FALSE', '{', '[', got ']'
$ cat minified.json
{"report_html":{"dnanexus_link":"file-G4x7GX80VBzQy64k4jzgjqgY"},"stats_txt":
{"dnanexus_link":"file-G4x7GXQ0VBzZxFxz4fqV120B"}}$ jq . minified.json
{
"report_html": {
"dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY"
},
"stats_txt": {
"dnanexus_link": "file-G4x7GXQ0VBzZxFxz4fqV120B"
}
}
$ jq .report_html example.json
{
"dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY"
}
$ jq .report_htm example.json
null
$ jq .report_html.dnanexus_link example.json
"file-G4x7GX80VBzQy64k4jzgjqgY"
$ jq . minified.json > prettified.json
$ cat prettified.json
{
"report_html": {
"dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY"
},
"stats_txt": {
"dnanexus_link": "file-G4x7GXQ0VBzZxFxz4fqV120B"
}
}
$ cat minified.json | jq .
{
"report_html": {
"dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY"
},
"stats_txt": {
"dnanexus_link": "file-G4x7GXQ0VBzZxFxz4fqV120B"
}
}
$ jq . < example.json
{
"report_html": {
"dnanexus_link": "file-G4x7GX80VBzQy64k4jzgjqgY"
},
"stats_txt": {
"dnanexus_link": "file-G4x7GXQ0VBzZxFxz4fqV120B"
}
}
$ dx describe app-fastqc --json | head
{
"id": "app-G81jg5j9jP7qxb310vg2xQkX",
"class": "app",
"billTo": "org-dnanexus_apps",
"created": 1644399511000,
"modified": 1644401066806,
"createdBy": "user-jkotrs",
"name": "fastqc",
"version": "3.0.3",
"aliases": [$ dx describe app-fastqc --json | jq .id
"app-G81jg5j9jP7qxb310vg2xQkX"$ dx find projects --public --json | head
[
{
"id": "project-F0yyz6j9Jz8YpxQV8B8Kk7Zy",
"level": "VIEW",
"permissionSources": [
"PUBLIC"
],
"public": true,
"describe": {
"id": "project-F0yyz6j9Jz8YpxQV8B8Kk7Zy",$ dx find projects --public --json | jq ".[].id" | head -3
"project-F0yyz6j9Jz8YpxQV8B8Kk7Zy"
"project-G4FX3QXKzJxqXxGpK2pJ7Z3K"
"project-FGX8gVQB9X7K5f1pKfPvz9yG"$ dx find projects --public --json | jq ".[] | .id" | head -n 3
"project-F0yyz6j9Jz8YpxQV8B8Kk7Zy"
"project-G4FX3QXKzJxqXxGpK2pJ7Z3K"
"project-FGX8gVQB9X7K5f1pKfPvz9yG"$ jq .input job.json
{
"reads": {
"$dnanexus_link": "file-BQbXKk80fPFj4Jbfpxb6Ffv2"
},
"format": "auto",
"kmer_size": 7,
"nogroup": true
}
$ jq .input job.json > input.json
$ dx describe job-G4x7G5j0B3K2FKzgP654ZqpK --json | jq .input > input.json
$ dx run app-G4YyQ9044b90F1vG8y9YkKk3 -f input.json
$ jq ".[].state" rap-jobs.json | sort | uniq -c | sort -rn
15 "failed"
3 "done"
2 "terminated"$ jq '.[] | select (.state | contains("failed")) | .id, .executable' rap-jobs.json | head
"job-G6jj9k8JPXfG42094KG5JFX4"
"applet-G6jj9b0JPXf5Q6ZF4G85K156"
"job-G6jj1zQJPXf34z8v4KqjZKP1"
"applet-G6jg9p8JPXf4Q9Pb4GgPK8Vp"
"job-G6jg9vQJPXfGbJb54GFkJ33Y"
"applet-G6jg9p8JPXf4Q9Pb4GgPK8Vp"
"job-G6jg7Y0JPXfG6q53G12vQZK8"
"applet-G6jg6pQJPXf7ypXq33B75Qq1"
"job-G6jg57QJPXf90Jjv4K8pgkG7"
"applet-G6jfg90JPXfGZkVb7PPxjpPY"$ jq '.[] | select (.state | contains("failed")) | .id, .executable' rap-jobs.json | paste - -
"job-G6jj9k8JPXfG42094KG5JFX4" "applet-G6jj9b0JPXf5Q6ZF4G85K156"
"job-G6jj1zQJPXf34z8v4KqjZKP1" "applet-G6jg9p8JPXf4Q9Pb4GgPK8Vp"
"job-G6jg9vQJPXfGbJb54GFkJ33Y" "applet-G6jg9p8JPXf4Q9Pb4GgPK8Vp"
"job-G6jg7Y0JPXfG6q53G12vQZK8" "applet-G6jg6pQJPXf7ypXq33B75Qq1"
"job-G6jg57QJPXf90Jjv4K8pgkG7" "applet-G6jfg90JPXfGZkVb7PPxjpPY"
"job-G6jZk6jJPXf1q1Py5VKX6gJK" "applet-G6jZjG0JPXf7ZxZP4G5v0X1k"
"job-G6jYY28JPXfFvFXY4GXB6jG2" "applet-G6jYXq0JPXf5Q6ZF4G85JVgG"
"job-G6jY9FQJPXf3pj894GFJ02jy" "applet-G6jY7zQJPXfG42094KG5Gkyy"
"job-G6jY858JPXfBKX1X0j434BY5" "applet-G6jY7zQJPXfG42094KG5Gkyy"
"job-G6jY740JPXf7V2vJ4G2Gkfj7" "applet-G6jY6zQJPXf81J984K6kfB3V"
"job-G6jY5v8JPXfPGQq15k77zPJ9" "applet-G6jY5jjJPXf6Ffqg4GqF4KPg"
"job-G6jY4k0JPXfPGQq15k77zP9Q" "applet-G6jY39jJPXfG42094KG5GkV9"
"job-G6jXPJQJPXfBbf694G3Fg07K" "applet-G6jXJJjJPXf7V2vJ4G2GkFbF"
"job-G6jX7yQJPXfFjzffKJzpqfj7" "applet-G6jX7JQJPXf3V99x4Gx7K09X"
"job-G6jVzJ0JPXf5Q6ZF4G85JG09" "applet-G6jVxQQJPXfGZ0BF33KZfX5Y"jq '.[] | select (.state | contains("failed")) | .id, .executable' \
rap-jobs.json | paste - - | \
while read JOB_ID APP_ID; do echo dx run $APP_ID --clone $JOB_ID; done
dx run "applet-G6jj9b0JPXf5Q6ZF4G85K156" --clone "job-G6jj9k8JPXfG42094KG5JFX4"
dx run "applet-G6jg9p8JPXf4Q9Pb4GgPK8Vp" --clone "job-G6jj1zQJPXf34z8v4KqjZKP1"
dx run "applet-G6jg9p8JPXf4Q9Pb4GgPK8Vp" --clone "job-G6jg9vQJPXfGbJb54GFkJ33Y"
dx run "applet-G6jg6pQJPXf7ypXq33B75Qq1" --clone "job-G6jg7Y0JPXfG6q53G12vQZK8"
dx run "applet-G6jfg90JPXfGZkVb7PPxjpPY" --clone "job-G6jg57QJPXf90Jjv4K8pgkG7"
dx run "applet-G6jZjG0JPXf7ZxZP4G5v0X1k" --clone "job-G6jZk6jJPXf1q1Py5VKX6gJK"
dx run "applet-G6jYXq0JPXf5Q6ZF4G85JVgG" --clone "job-G6jYY28JPXfFvFXY4GXB6jG2"
dx run "applet-G6jY7zQJPXfG42094KG5Gkyy" --clone "job-G6jY9FQJPXf3pj894GFJ02jy"
dx run "applet-G6jY7zQJPXfG42094KG5Gkyy" --clone "job-G6jY858JPXfBKX1X0j434BY5"
dx run "applet-G6jY6zQJPXf81J984K6kfB3V" --clone "job-G6jY740JPXf7V2vJ4G2Gkfj7"
dx run "applet-G6jY5jjJPXf6Ffqg4GqF4KPg" --clone "job-G6jY5v8JPXfPGQq15k77zPJ9"
dx run "applet-G6jY39jJPXfG42094KG5GkV9" --clone "job-G6jY4k0JPXfPGQq15k77zP9Q"
dx run "applet-G6jXJJjJPXf7V2vJ4G2GkFbF" --clone "job-G6jXPJQJPXfBbf694G3Fg07K"
dx run "applet-G6jX7JQJPXf3V99x4Gx7K09X" --clone "job-G6jX7yQJPXfFjzffKJzpqfj7"
dx run "applet-G6jVxQQJPXfGZ0BF33KZfX5Y" --clone "job-G6jVzJ0JPXf5Q6ZF4G85JG09"dx find jobs --state failed --json | jq '.[] | .id, .executable' | paste - - | \
while read JOB_ID APP_ID; do echo dx run $APP_ID --clone $JOB_ID; done

| Column type | Description | Example |
| --- | --- | --- |
| string | A string column has free-text values. This is the default fallback type when Data Profiler fails to cast a column type. | Patient’s name; Patient’s ID |
| integer | An integer column has integer values. | Number of children |
| float | A float column has float values. | Weight; Height |
| datetime | A datetime column has datetime values. The default time zone is UTC. | Date of birth |
| unknown | The column is empty. | |
The Dataset-level screen is the default screen when you open Data Profiler. It has the Table Relationships and Table Summary pages. In this section, we describe each component of the screen and its key values.
The default screen of Data Profiler is at the Table Relationships page of the Dataset level
The Manage Tables controller allows you to hide/show the table(s) from the data profile. The table(s) which are hidden from the ERD will also be hidden from the Data Hierarchy. In order to manage the table display, click on the ‘Manage’ button on the bottom right corner of the screen, then use the toggle to hide/show the tables, and click on the ‘Apply’ button to apply the changes.
Open the ‘Manage Tables’ controller to show/hide the table(s)
The data profile is updated after the ‘patients’ table is hidden
A Relationship Diagram (left) with some selected edges highlighted in blue. The selected edges create a Diagram of Overlaps (right)
This is a simplified Entity Relationship Diagram displayed as a graph. Each node represents a table in your dataset, and each edge represents a column that links two tables. The linked columns are the referenced_entity_field in the data_dictionary. The direction of an edge represents the reference from a foreign-key column to a primary-key column
FAQs
Question: There are tables supposed to be linked to each other. Why do they appear unlinked in Data Profiler?
Answer: The linkage between any two tables is determined by the data_dictionary. Data Profiler does not remove or add linkages to a dataset. You should check your data_dictionary again and make sure that the linkage is correctly specified.
By clicking on one or more edges, you can view a Diagram of Overlaps that shows how many values the linked columns share between the tables. There are several chart types for a Diagram of Overlaps:
Venn diagram is the default chart type of Diagram of Overlaps. Each set in this diagram is a table in the selection. The numbers are the values from the column in the selection.
Question: How should I interpret a Venn diagram having 2 tables, patients and measurements, and the value of their intersection is 90? The column is patient_id.
Answer: The patients and measurements tables share 90 patient_ids, which means there are 90 patients that have measurement data.
Euler diagrams share the same concept as Venn diagrams. The only difference is that the sizes of the overlap sections are proportional to the overlap values.
An Upset plot counts the values of all possible non-empty combinations of the selected tables. This plot type is more scalable than the Venn or Euler diagram.
A common use case of Upset plot is to help answer questions such as “How many patients have full information across tables?”. By creating an Upset plot between the “patients” table and other tables (e.g. diagnosis, measurement, sequence_run, etc.), we can answer the questions by looking at the number of patient ids that are shared across all tables.
The Summary page provides a summary of both tables and columns in the dataset. Below are the details of each section.
The summary of all Tables and Columns in the Dataset
The Table Summary shows information about all tables in the dataset. Each row displays various statistics for a table in your dataset, including:
# Columns, # Rows: the number of columns, the number of rows
Column types: data type of all columns in a table
Duplication Rate: the rate of duplication of a whole row in the table
Missing Rate: the rate of having an empty cell in the table
You can click on the hamburger button at the header of each column to sort or filter the data as needed.
Clicking on the hamburger button to sort or filter the data
The Column Summary provides details about every column in the dataset, with each row presenting the following information for a specific column:
Column name: name of the column
Key type: the attributes that are used to define the relationships of tables
Description: the title of a column (if provided in the data dictionary file)
Provided type: the type of data in the column which is specified in the data dictionary file. If the data dictionary is not provided, it is ‘unknown’
Inferred types: the type of data in the column inferred by Data Profiler if the data dictionary is not provided. If the data dictionary is provided, it will be the same as the Provided type
Missing Rate: the rate of having an empty cell in a column
Duplication Rate: the rate of duplication of values in a column
You can also click on the hamburger button at the header of each column to sort or filter the data as needed.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select “Contact Support”
Fill in the Subject and Message to submit a support ticket.
| Assay Type | Source | Notes |
| --- | --- | --- |
| Expression | TCGA (via GDC) | Data is publicly available (RNA-Seq, STAR - Counts) from GDC from this page and was downloaded on May 16, 2025. |
| Somatic | TCGA (via GDC/cBioPortal) | Derived from public SNV, CNV, and Fusion data: SNV data are publicly available and were downloaded from GDC on October 17, 2024; CNV segmented copy number data (.SEG files) are publicly available and were downloaded from GDC on October 6, 2025; Fusion data are publicly available and were downloaded from cBioPortal on September 27, 2025. |
| Germline | Synthetic Data Only | TCGA germline data is not publicly available. This component uses simulated genotypes. |
You can use both the phenotypical and genomic data when creating a cohort.
The phenotypic data (which is one database) is processed and combined with the genomic data (another database) to ensure that they are paired appropriately, and that forms a dataset.
You can then use the dataset in Apollo to perform various actions, such as visualizing the data, analyzing all or part of it (called a cohort), and collaborating with others on a particular dataset.
Each dataset has an important structure.
First, a dataset lies on top of a database. A dataset can be copied, moved around the platform, and even deleted. A database, however, cannot; if a database is removed, the ingestion process has to be repeated.
Datasets are the top level structure of the data.
Each dataset has entities, which are equivalent to tables. The tables contain fields.
Fields are the variables.
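If you prefer the command line, the dx extract_dataset command (described in the dx command list above) can enumerate these entities and fields directly from a dataset record. A minimal sketch; the record ID is a placeholder:
$ dx extract_dataset record-xxxx --list-entities   # one row per entity (table)
$ dx extract_dataset record-xxxx --list-fields     # entity.field names across the dataset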
The graphic below also explains the relationship:
Datasets are patient-centric. All the information goes back to the patient.
This is important for filtering. If a patient, for example, takes a medication more than once during the progression of their illness, there will be more instances of that medication than there are people in the cohort.
Here is a summary graphic of how the data is considered to be patient-centric:
Once data is ingested, they are available as separate Spark databases. Apollo unifies accessing data in these databases through what's called a dataset.
A dataset can be thought of as a giant multi-omics matrix.
Datasets can be further refined into Cohorts within the Apollo Interface, allowing complex queries across omics types.
Underlying Apollo is a technology called Spark. All data in Apollo is stored in it.
It is made to handle very large datasets and enable fast queries that can't be handled by single computers.
It does this by creating RDDs (resilient distributed datasets), which are distributed across the worker nodes. Each node handles only part of the query and reports its results back, which is why the queries are very fast.
Details about RDDs can be found and
Spark databases mean you can query across many columns in the dataset relatively quickly, compared to using a single computer.
Once data is ingested, they are available as separate Spark databases. Apollo unifies accessing data in these databases through what's called a dataset.
A dataset can be thought of as a giant multi-omics matrix.
Datasets can be further refined into Cohorts within the Apollo Interface, allowing complex queries across omics types.
| Assay Type | Source | Notes |
| --- | --- | --- |
| Clinical | TCGA (via cBioPortal) | Data is publicly available ("full" 32 studies) from cBioPortal on October 17, 2024. |









In this example, you will:
Learn to write a native DNAnexus applet that executes a Python program
Use the dxpy module to download and upload files
Use the Python subprocess module to execute an external process and check the return value
We'll use the same scarlet.txt file from the bash version of the wc applet. Start off using dx-app-wizard and define the same inputs and outputs as before, but be sure to choose Python for the Programming language:
The Python template looks like the following:
@dxpy.entry_point('main'): the DNAnexus execution environment entry point
The input_file listed in the inputSpec is passed to main.
Create a DXFile object.
Update src/python_wc.py to the following:
Import the getstatusoutput function.
Use the local filename input_file.txt.
The output file will be called output.txt.
Shadow the input_file variable, overwriting it with the creation of a new DXFile object.
NOTE: Portable Operating System Interface (POSIX) standards dictate that processes return 0 on success (i.e., zero errors) and some positive integer value (usually in the range 1-127) to indicate an error condition.
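You can see this convention in any shell; for example:
$ true; echo $?       # a successful command exits with 0
0
$ false; echo $?      # a failing command exits with a non-zero value
1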
Run dx build to build the applet. Create a job_input.json file with the file ID of your input:
Run your applet with the input file using --watch to see the output:
I can inspect the contents of the output file:
I can verify this is correct by piping the input file to a local execution of wc:
You can shorten the build/run development cycle by naming the JSON input job_input.json and executing the Python program locally:
This will download the input as input_file.txt and then create a new local file with the system call:
You have now translated the bash applet for running wc into a native DNAnexus Python applet.
You were introduced to the dxpy module that provides functions for making API calls.
You used subprocess.getstatusoutput to call an external process and interpret the return value for success or failure.
In the next section, we'll continue translating bash to Python.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
In this example, you will translate the bash app from the previous chapter into Workflow Definition Language (WDL).
You will learn how to:
Use Java Jar files to validate and compile WDL
Use WDL to define an applet's inputs, outputs, and runtime specs
Compile a WDL task into an applet
You will not use a wizard to start this applet, so manually create a directory for your work. Create a file called fastq_trimmer.wdl with the following contents:
This line indicates that the WDL follows the version 1.0 specification.
The task defines the body of the applet.
The input block defines the same inputs, a File called input_file and an Int (integer) value called quality_score with a default value of 30.
To start, validate your WDL with WOMtool:
Before compiling the WDL into an applet, use dx pwd to ensure you are in your desired project. If not, run dx select to select a different project, then use the following command to compile the applet:
Use dx run as in the previous chapter to run the applet with the -h|--help option to see that the usage looks identical to the bash version:
From the perspective of the user, there is no difference between native/bash applets and those written in WDL. You should use whichever syntax you find most convenient for the task at hand. For instance, this applet leverages an existing Docker container created by the Biocontainers project rather than adding the binary as a resource.
You can run the applet using the command-line arguments as shown, or you can create a JSON file with the arguments as follows:
You can run the applet and watch the job with the following command:
The output will look quite different from the bash app, but the basics are still the same. In this version, notice that you do not need to download the inputs or upload the outputs. Once the input files are in place, the command block is run and the input files and variables are dereferenced properly. When the job has completed, run dx describe to see the inputs and outputs:
Download the output file to ensure it looks like a correct result:
You may find it useful to create a Makefile with all the steps documented in a runnable fashion:
Now you can run make compile rather than type out the rather long Java command.
The WDL version of the FastQTrimmer applet is arguably simpler than the bash version. It uses just one file, fastq_trimmer.wdl, and about 20 lines of text, whereas the bash version requires at least dxapp.json, a bash script, and the resources tarball.
In this chapter, you learned how to:
Use a Biocontainers Docker image for the necessary binary executables from FASTX toolkit
Define the same inputs, outputs, and commands as the bash applet from Chapter 3
Use a Makefile to define project shortcuts to validate, compile, and run an applet
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
This tutorial uses the same samtools applet as the earlier example but uses a public Docker image instead of an asset.
Please start the Cloud Workstation Application by typing in the following command into the terminal:
Once the Cloud Workstation Application has started, pull the image from the repository, save the Docker image within the Workstation, and then use dx upload to put the saved image onto the project space.
First, pull the Docker Image using the following command:
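For example (the repository, image name, and tag below are placeholders; substitute the samtools image you intend to use):
$ docker pull <repository>/<image>:<tag>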
A license is required to access the Data Profiler on the DNAnexus Platform. For more information, please contact DNAnexus Sales (via [email protected]).
The Data Profiler is an app within the DNAnexus Tool Library that supports data cleaning and harmonization. It organizes your data into three levels of information: Dataset level, Table level, and Column level. Each level surfaces interactive visualizations on data quality, data coverage, and descriptive statistics to help you understand and identify potential data issues. The Data Profiler also includes an Explorer Mode where you can create customizable visualizations using simple drag-and-drop functionality, for deeper exploration beyond the standard metrics. Researchers can bring their data to the Platform and leverage the Data Profiler app to explore and quickly evaluate the readiness of the data for downstream analysis.
In this exercise, we'll demonstrate a native DNAnexus Python applet that runs the fastq_quality_trimmer binary.
You will learn:
How to use a DXFile object to get file metadata
How to use Python functions to choose an output filename using the input file's name





Upload the local output file.
Add the DX file ID to the output dictionary.
Return the output
Call dxpy.download_dxfile to download the input file identified by the file ID to the local_file name.
Execute wc on the local input file and redirect (>) the output to the chosen output filename. This function returns a tuple containing the process's return value and output (STDOUT/STDERR).
If the return value is not zero, use sys.exit to abort the program with the output from the system call.
If the program makes it to this point, the output file should have been created to upload.
Return a Python dictionary with the DNAnexus link to the new outfile object.
This line defines a variable called basename which uses the basename function to get the filename of the input file.
The command block will be executed at runtime. It uses the tilde/twiddle syntax (~{}) to dereference variables. The output is written to a filename using the basename of the input.
The output defines a single File called output_file.
The runtime block specifies a Biocontainers Docker image that contains the FASTX toolkit binaries.
The inputs and outputs are the same as in the bash version of this applet. You can start from scratch using dx-app-wizard with the following input specs:
| Name | Class | Optional | Default |
| --- | --- | --- | --- |
| input_file | file | No | NA |
| quality_score | int | Yes | 30 |

The output specs are as follows:

| Name | Class |
| --- | --- |
| output_file | file |
Or you can use the dxapp.json from the bash version and change the runSpec file to the name of your Python script and the interpreter to python3 as follows:
Inside your applet's source code, create resources/usr/local/bin and copy the fastq_quality_trimmer bin to this location. At runtime, the binary will be available at /usr/local/bin/fastq_quality_trimmer, which is in the standard $PATH.
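For example, from the directory containing your applet source (the applet directory name here is hypothetical; PATH_TO_FASTX is wherever you built or unpacked the FASTX toolkit):
$ mkdir -p python_fastq_trimmer/resources/usr/local/bin
$ cp PATH_TO_FASTX/fastq_quality_trimmer python_fastq_trimmer/resources/usr/local/bin/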
Update the Python code to the following:
The input_file will be the DNAnexus file ID (e.g., file-FvQGZb00bvyQXzG3250XGbgz), and the quality_score will be an integer value.
Use DXFile.describe to get a Python dictionary of metadata.
Choose a local filename by using either the file's name from the metadata or the file ID.
Download the input file to the chosen local filename.
Split the filename into a basename and extension.
Create an output filename using the input basename and a new extension to indicate that the data has been filtered.
Format a command string.
Print the command for debugging purposes.
Execute the command and check the return value.
If the code makes it to this point, upload the output file and return the file ID to be attached to the job's output.
Run dx build in your source directory to create the new applet. Use the new applet ID to execute the applet with a small FASTQ file:
Use dx head to verify the output looks like a FASTQ file:
To verify that the applet did winnow the number of reads, I can pipe the output of dx cat to wc to verify that the output file has fewer lines than the input file:
You used DXFile to get the input file's name
Your output filename is based on the input file's name rather than a static name like output.txt.
You can call Python's print function to add your own STDOUT/STDERR to the applet, which can be an aid in debugging your program.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Template Options
You can write your app in any programming language, but we provide
templates for the following supported languages: Python, bash
Programming language: Python
#!/usr/bin/env python
# python_wc 0.1.0
# Generated by dx-app-wizard.
#
# Basic execution pattern: Your app will run on a single machine from
# beginning to end.
#
# See https://documentation.dnanexus.com/developer for documentation and
# tutorials on how to modify this file.
#
# DNAnexus Python Bindings (dxpy) documentation:
# http://autodoc.dnanexus.com/bindings/python/current/
import os
import dxpy
@dxpy.entry_point('main') # 1
def main(input_file): # 2
# The following line(s) initialize your data object inputs on the platform
# into dxpy.DXDataObject instances that you can start using immediately.
input_file = dxpy.DXFile(input_file) # 3
# The following line(s) download your file inputs to the local file system
# using variable names for the filenames.
dxpy.download_dxfile(input_file.get_id(), "input_file") # 4
# Fill in your application code here.
# The following line(s) use the Python bindings to upload your file outputs
# after you have created them on the local file system. It assumes that you
# have used the output field name for the filename for each output, but you
# can change that behavior to suit your needs.
outfile = dxpy.upload_local_file("outfile") # 5
# The following line fills in some basic dummy output and assumes
# that you have created variables to represent your output with
# the same name as your output fields.
output = {}
output["outfile"] = dxpy.dxlink(outfile) # 6
return output # 7
dxpy.run()
#!/usr/bin/env python
import dxpy
import sys
from subprocess import getstatusoutput # 1
@dxpy.entry_point("main")
def main(input_file):
local_file = "input_file.txt" # 2
output_file = "output.txt" # 3
input_file = dxpy.DXFile(input_file) # 4
dxpy.download_dxfile(input_file.get_id(), local_file) # 5
rv, out = getstatusoutput(f"wc {local_file} > {output_file}") # 6
if rv != 0: # 7
sys.exit(out)
outfile = dxpy.upload_local_file(output_file) # 8
return {"outfile": dxpy.dxlink(outfile)} # 9
dxpy.run()
{
"input_file": {
"$dnanexus_link": "file-GgGX7Y8071x46B02JGb515pB"
}
}
$ dx run applet-GgGX740071xJY20Gjkp0JYXB -f python_wc/job_input.json \
-y --watch \
--destination project-GXY0PK0071xJpG156BFyXpJF:/output/python_wc/
Using input JSON:
{
"input_file": {
"$dnanexus_link": "file-GgGX7Y8071x46B02JGb515pB"
}
}
Calling applet-GgGX740071xJY20Gjkp0JYXB with output destination
project-GXY0PK0071xJpG156BFyXpJF:/output/python_wc
Job ID: job-GgGX8P0071x1yfFPkJ8662gQ
Job Log
-------
Watching job job-GgGX8P0071x1yfFPkJ8662gQ. Press Ctrl+C to stop watching.
* Python implementation of wc (python_wc:main) (running) job-GgGX8P0071x1yfFPkJ8662gQ
kyclark 2024-02-23 16:03:24 (running for 0:01:39)
2024-02-23 16:11:36 Python implementation of wc INFO Logging initialized (priority)
2024-02-23 16:11:36 Python implementation of wc INFO Logging initialized (bulk)
2024-02-23 16:11:40 Python implementation of wc INFO Setting SSH public key
2024-02-23 16:11:42 Python implementation of wc STDOUT dxpy/0.369.0 (Linux-5.15.0-1053-aws-x86_64-with-glibc2.29) Python/3.8.10
2024-02-23 16:11:43 Python implementation of wc STDOUT Invoking main with {'input_file': {'$dnanexus_link': 'file-GgGX7Y8071x46B02JGb515pB'}}
* Python implementation of wc (python_wc:main) (done) job-GgGX8P0071x1yfFPkJ8662gQ
kyclark 2024-02-23 16:03:24 (runtime 0:01:36)
Output: outfile = file-GgGXGFj0FbZxjvk1jZPJPkG2
$ dx cat file-GgGXGFj0FbZxjvk1jZPJPkG2
8596 86049 513778 input_file.txt
$ dx cat file-GgGX7Y8071x46B02JGb515pB | wc
8596 86049 513778
$ python3 src/python_wc.py
Invoking main with {'input_file': {'$dnanexus_link': 'file-GgGX7Y8071x46B02JGb515pB'}}
$ cat output.txt
8596 86049 513778 input_file.txt
version 1.0
task fastq_trimmer {
input {
File input_file
Int quality_score = 30
}
String basename = basename(input_file)
command <<<
fastq_quality_trimmer -Q 33 -t ~{quality_score} \
-i ~{input_file} -o ~{basename}.filtered.fastq
>>>
output {
File output_file = "~{basename}.filtered.fastq"
}
runtime {
docker: "biocontainers/fastxtools:v0.0.14_cv2"
}
}
$ java -jar ~/womtool.jar validate fastq_trimmer.wdl
Success!
$ java -jar ~/dxCompiler.jar compile fastq_trimmer.wdl
[warning] Project is unspecified...using currently selected project project-GJ2k24j0vx804FPyBbxqpQBk
applet-GJ2pgv80vx84zJ4XJF6GPXz7
usage: dx run applet-GJ2pgv80vx84zJ4XJF6GPXz7 [-iINPUT_NAME=VALUE ...]
Applet: fastq_trimmer
Inputs:
input_file: -iinput_file=(file)
quality_score: [-iquality_score=(int, default=30)]
Reserved for dxCompiler
overrides___: [-ioverrides___=(hash)]
overrides______dxfiles: [-ioverrides______dxfiles=(file) [-ioverrides______dxfiles=... [...]]]
Outputs:
output_file: output_file (file)
$ cat inputs.json
{
"input_file": {
"$dnanexus_link": "file-GJ2k2V80vx88z3zyJbVXZj3G"
},
"quality_score": 35
}
$ dx run applet-GJ2pgv80vx84zJ4XJF6GPXz7 -f inputs.json -y --watch
Using input JSON:
{
"input_file": {
"$dnanexus_link": "file-GJ2k2V80vx88z3zyJbVXZj3G"
},
"quality_score": 35
}
Calling applet-GJ2pgv80vx84zJ4XJF6GPXz7 with output destination
project-GJ2k24j0vx804FPyBbxqpQBk:/
Job ID: job-GJ2ppvQ0vx88k8bv9pvGyjGX
Job Log
-------
Watching job job-GJ2ppvQ0vx88k8bv9pvGyjGX. Press Ctrl+C to stop watching.
$ dx describe job-GJ2ppvQ0vx88k8bv9pvGyjGX
Result 1:
ID job-GJ2ppvQ0vx88k8bv9pvGyjGX
Class job
Job name fastq_trimmer
Executable name fastq_trimmer
Project context project-GJ2k24j0vx804FPyBbxqpQBk
Region aws:us-east-1
Billed to org-sos
Workspace container-GJ2ppx80773k09b8F6qKGJBb
Applet applet-GJ2pgv80vx84zJ4XJF6GPXz7
Instance Type mem1_ssd1_v2_x2
Priority high
State done
Root execution job-GJ2ppvQ0vx88k8bv9pvGyjGX
Origin job job-GJ2ppvQ0vx88k8bv9pvGyjGX
Parent job -
Function main
Input input_file = file-GJ2k2V80vx88z3zyJbVXZj3G
quality_score = 35
Output output_file = file-GJ2pv300773ypy03Jg2vYZ9f
...
$ dx download file-GJ2pv300773ypy03Jg2vYZ9f
[===========================================================>]
Completed 14,357,774 of 14,357,774 bytes (100%) ~/fastq_trimmer_wdl/small-celegans-sample.fastq.filtered.fastq
$ wc -l small-celegans-sample.fastq.filtered.fastq
98624 small-celegans-sample.fastq.filtered.fastq
WDL = fastq_trimmer.wdl
PROJECT_ID = project-GJ2k24j0vx804FPyBbxqpQBk
DXCOMPILER = java -jar ~/dxCompiler.jar
CROMWELL = java -jar ~/cromwell.jar
WOMTOOL = java -jar ~/womtool.jar
WORKFLOW_ID = applet-GJ2pgv80vx84zJ4XJF6GPXz7
validate:
$(WOMTOOL) validate $(WDL)
check:
miniwdl check $(WDL)
compile:
$(DXCOMPILER) compile $(WDL) \
-archive \
-folder /workflows \
-project $(PROJECT_ID)
run:
dx run $(WORKFLOW_ID) \
-f inputs.json \
--destination $(PROJECT_ID):/output \
-y --watch "runSpec": {
"timeoutPolicy": {
"*": {
"hours": 1
}
},
"interpreter": "python3",
"file": "src/python_fastq_trimmer.py",
"distribution": "Ubuntu",
"release": "20.04",
"version": "0"
},
#!/usr/bin/env python3
import dxpy
import os
import sys
from subprocess import getstatusoutput
@dxpy.entry_point("main")
def main(input_file, quality_score): # 1
input_file = dxpy.DXFile(input_file)
desc = input_file.describe() # 2
local_file = desc.get("name", input_file.get_id()) # 3
dxpy.download_dxfile(input_file.get_id(), local_file) # 4
basename, ext = os.path.splitext(local_file) # 5
outfile = f"{basename}.filtered{ext}" # 6
cmd = ( # 7
f"fastq_quality_trimmer -Q 33 -t {quality_score} "
f"-i {local_file} -o {outfile}"
)
print(cmd) # 8
rv, out = getstatusoutput(cmd) # 9
if rv != 0:
sys.exit(out)
dx_output_file = dxpy.upload_local_file(outfile) # 10
return {"output_file": dxpy.dxlink(dx_output_file)}
dxpy.run()
$ dx run applet-GgKQ5qQ071x5yX7fgbq3PkXB \
> -f python_fastq_trimmer/job_input.json -y --watch \
> --destination project-GXY0PK0071xJpG156BFyXpJF:/output/python_fastq_trimmer/
Using input JSON:
{
"input_file": {
"$dnanexus_link": "file-FvQGZb00bvyQXzG3250XGbgz"
},
"quality_score": 28
}
Calling applet-GgKQ5qQ071x5yX7fgbq3PkXB with output destination
project-GXY0PK0071xJpG156BFyXpJF:/output/python_fastq_trimmer
Job ID: job-GgKQ6x0071x6kf34P5xy2q2b
Job Log
-------
Watching job job-GgKQ6x0071x6kf34P5xy2q2b. Press Ctrl+C to stop watching.
* Python version of fastq_trimmer (python_fastq_trimmer:main) (running)
* job-GgKQ6x0071x6kf34P5xy2q2b
kyclark 2024-02-26 14:32:36 (running for 0:00:21)
2024-02-26 14:33:17 Python version of fastq_trimmer INFO Logging initialized
(priority)
2024-02-26 14:33:17 Python version of fastq_trimmer INFO Logging initialized
(bulk)
2024-02-26 14:33:21 Python version of fastq_trimmer INFO Downloading bundled
file resources.tar.gz
2024-02-26 14:33:22 Python version of fastq_trimmer STDOUT >>> Unpacking
resources.tar.gz to /
2024-02-26 14:33:22 Python version of fastq_trimmer STDERR tar: Removing
leading `/' from member names
2024-02-26 14:33:22 Python version of fastq_trimmer INFO Setting SSH public key
2024-02-26 14:33:23 Python version of fastq_trimmer STDOUT dxpy/0.369.0
(Linux-5.15.0-1053-aws-x86_64-with-glibc2.29) Python/3.8.10
2024-02-26 14:33:23 Python version of fastq_trimmer STDOUT Invoking main with
{'input_file': {'$dnanexus_link': 'file-FvQGZb00bvyQXzG3250XGbgz'},
'quality_score': 28}
2024-02-26 14:33:24 Python version of fastq_trimmer STDOUT
fastq_quality_trimmer -Q 33 -t 28 -i small-celegans-sample.fastq -o
small-celegans-sample.filtered.fastq
* Python version of fastq_trimmer (python_fastq_trimmer:main) (done)
* job-GgKQ6x0071x6kf34P5xy2q2b
kyclark 2024-02-26 14:32:36 (runtime 0:00:20)
Output: output_file = file-GgKQ79j0B2FQjGbk0qX6j64B
$ dx head file-GgKQ79j0B2FQjGbk0qX6j64B
@SRR070372.1 FV5358E02GLGSF length=78
TTTTTTTTTTTTTTTTTTTTTTTTTTTNTTTNTTTNTTTNTTTATTTATTTATTTATTATTATATATATATA
+SRR070372.1 FV5358E02GLGSF length=78
...000//////999999<<<=<<666!602!777!922!688:669A9=<=122569AAA?>@BBBBAA?=
@SRR070372.2 FV5358E02FQJUJ length=177
TTTCTTGTAATTTGTTGGAATACGAGAACATCGTCAATAATATATCGTATGAATTGAACCACACGGCACATATTTGAACTTGTTCGTGAAATTTAGCGAACCTGGCAGGACTCGAACCTCCAATCTTCGGATCCGAAGTCCGACGCCCCCGCGTCGGATGCGTTGTTACCACTGCTT
+SRR070372.2 FV5358E02FQJUJ length=177
222@99912088>C<?7779@<GIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIC;6666IIIIIIIIIIII;;;HHIIE>944=>=;22499;CIIIIIIIIIIIIHHHIIIIIIIIIIIIIIIH?;;;?IIEEEEEEEEIIII77777I7EEIIEEHHHHHIIIIIIIIIIIIII
@SRR070372.3 FV5358E02GYL4S length=70
TTGGTATCATTGATATTCATTCTGGAGAACGATGGAACATACAAGAATTGTGTTAAGACCTGCAT
$ dx cat file-GgKQ79j0B2FQjGbk0qX6j64B | wc -l
99952
$ dx cat file-FvQGZb00bvyQXzG3250XGbgz | wc -l
100000
The path will include the tag from the Docker Repository.
Use up to date Docker Images from reliable sources
Next, save the Docker Image:
-o : the output file. The file name needs to end with .tar.gz
The image will be referenced with the path, including tags
Finally, upload the saved image to the project:
Add --path project-ID:/ to the dx upload command to ensure that it is added to the Cloud Workspace Container.
When finished uploading, you can test the Docker image from the Cloud Workstation using docker run (see the example commands after this list),
or terminate the Cloud Workstation job, and then proceed to building the applet.
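As a concrete example, the full sequence from the Cloud Workstation might look like the following (using the samtools image from this walkthrough; substitute your own image name, tag, and project ID):
docker pull biocontainers/samtools:v1.9-4-deb_cv1
docker save -o samtools.tar.gz biocontainers/samtools:v1.9-4-deb_cv1
dx upload samtools.tar.gz --path project-ID:/
docker run -it biocontainers/samtools:v1.9-4-deb_cv1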
We will use dx-app-wizard to create a skeleton applet structure with these files:
First, give the applet a name. The prompt shows that only letters, numbers, a dot, underscore, and a dash can be used. As stated earlier, this applet name will also be the name of the directory. Use samtools_count_docker_bundle:
Next is the title. Note that the prompt includes empty square brackets ([]), which show the default value used if Enter is pressed. Because the title is not required, the default is an empty string; add an informational title, "Samtools Count":
Likewise, the summary field is not required:
The version is also optional, and press Enter to take the default:
There is one input for this applet, which is a BAM file.
Use the parameters for the input section:
name: bam
label: BAM file
class: file
optional: false
When prompted for the first input, enter the following:
The name of the input will be used as a variable in the bash code, so use only letters, numbers, and underscores as in bam or bam_file.
The label is optional, as noted by the empty square brackets.
The types include primitives like integers, floating-point numbers, and strings, as well as arrays of primitive types.
This is a required input. If an input is optional, provide a default value.
When prompted for the second input, press Enter:
There is one output for this applet, which is a counts file.
Use the parameters for the output section:
name: counts
label: counts file
class: file
When prompted for the first output name, enter the following:
This name will also become a bash variable, so best practice is to use letters, numbers, and underscores.
The label is optional.
The class must be from the preceding list. To be reminded of the choices, press the Tab key twice.
When prompted for the second output, press Enter:
Here are the final settings to complete the wizard:
Timeout Policy: 48h
Programming language: bash
Access to internet: No (default)
Access to parent project: No (default)
Instance Type: mem1_ssd1_v2_x4 (default)
Applets are required to set a maximum time for running to prevent a job from running an excessively long time. While some applets may legitimately need days to run, most probably need something in the range of 12-48 hours. As noted in the prompt, use m, h, or d to specify minutes, hours, or days, respectively:
For the template language, select from bash or Python for the program that is executed when the applet starts. The applet code can execute any program available in the execution environment, including custom programs written in any language. Choose bash:
Next, determine if the applet has access to the internet and/or the parent project. Unless the applet specifically needs access, such as to download a file at runtime, it's best to answer no:
Lastly, specify a default instance type. The prompt includes an abbreviated list of instance types. The final number indicates the number of cores, e.g., _x4 indicates 4 cores. The greater the number of cores, the more available memory and disk space. In this case, a small 4-core instance is sufficient:
The user is always free to override the instance type using the --instance-type option to dx run.
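For example, a hypothetical run of the finished applet on a larger 8-core instance might look like this (the applet ID and input file ID are placeholders):
dx run applet-xxxx -ibam=file-xxxx --instance-type mem1_ssd1_v2_x8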
The final output from dx-app-wizard is a summary of the files that are created:
Readme.developer.md : This file should contain applet implementation details.
Readme.md: This file should contain user help.
dxapp.json: The answers from dx-app-wizard are used to create the app metadata.
resources/ : The resources directory is for any additional files you want available on the runtime instance.
src/ : The src (pronounced "source") is a conventional place for source code, but it's not a requirement that code lives in this directory.
src/samtools_count.sh : This is the bash script that will be executed when the applet is run.
test/ The test directory is empty and will not be discussed in this section.
The contents of the resources directory will be placed into the root directory of the runtime instance. For instance, if there is a file resources/my_tool, then it will be available on the runtime instance as /my_tool. In the bash code, reference the full path (/my_tool) or extend the $PATH variable to include /. A better practice is to create the directory structure resources/usr/local/bin/, so the file ends up at /usr/local/bin/my_tool, as /usr/local/bin is normally part of $PATH.
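As a sketch, assuming a helper script or binary named my_tool (the name used above for illustration), the recommended layout can be created like this:
mkdir -p samtools_count_docker_bundle/resources/usr/local/bin
cp my_tool samtools_count_docker_bundle/resources/usr/local/bin/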
Dxapp.json
This is where the formatting from the dx-app-wizard is listed in a .json file. If needed, change the settings for the output, input, version, etc within the json file.
The first section is the metadata, as shown below:
The next section(s) are Inputs and Outputs, shown below:
Finally, the last section is the Additional Settings, shown below:
Adding A Docker Image into the Resources Folder
Add your Docker Image to the resources folder.
dx download the samtools.tar.gz
mv samtools.tar.gz to the samtools_count_docker_bundle/resources/ folder
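For example, assuming the image archive was uploaded to the root of your project (project-ID is a placeholder):
dx download project-ID:/samtools.tar.gz
mv samtools.tar.gz samtools_count_docker_bundle/resources/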
Samtools_docker.sh
Update the following .sh code file for this applet:
#!/bin/bash is the “shebang” command to show that it is a bash script
set -exuo pipefail is the pragma to show each command as it is executed and to halt on undefined variables or failed system calls
Within the “main” section, there are code lines that:
Echo the value of the input, “bam”, using the name $bam, which is part of the input Spec
Download the input file onto the job instance, with the output being the name of the bam file (ex: ___.bam)
The first Docker command, which loads the saved Docker image, samtools.tar.gz (which is in the resources folder)
Assigning a counts_id variable for the name of the counts file output for samtools
The second Docker Command
Docker run to run the Docker Image
-v /home/dnanexus:/home/dnanexus to mount the volume
The name of the Docker Image, including the tag.
Assigning a variable (upload) for uploading the counts file back to the project
Using the upload variable AND the output spec in the json file for the dx-jobutil-add-output command
Once you have added the Docker Image to the resources folder and edited the .sh and .json files, use the following command to create your applet in the project of your choice:
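A minimal sketch of that build step, run from the directory above the applet directory (optionally add --destination to target a specific project and folder):
dx build samtools_count_docker_bundle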
Then, proceed to test your applet!
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
We'll call our new applet python_cnvkit. If you want to start from dx-app-wizard, use the following specs for the inputs and outputs:
bam_tumor | array:file | No | NA
reference | file | No | NA
The output specs are as follows:
cns | array:file
cns_filtered | array:file
plot | array:file
You can also copy the bash applet directory and update the runSpec in dxapp.json to run a Python script and use the CNVKit asset from before:
Here is the input.json:
Update src/python_cnvkit.py to the following:
Use a Python list comprehension to generate a list of file IDs for the tumor BAM files.
Download the reference file.
Initialize a list to hold the downloaded BAM paths.
Download each BAM file into a directory and append the path to the bam_files list.
Create, print, and run the command to execute CNVkit.
Find all the files created in the output directory. os.listdir returns only the filenames, so join each with the directory name.
For each of the output file categories, filter the output files and upload the output files matching the expected extension.
Compile the given regular expression.
Create a DX file ID link for each uploaded file.
Filter the given files for those matching the regex.
NOTE: The regex (?<!\.call)\.cns$ uses a negative lookbehind to ensure that .call does not precede .cns.
Here is the output from the job:
You used a for loop to download multiple input BAM files into a local directory.
You used regular expressions to classify the output files into the three output labels.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
The Data Profiler app saves significant time by generating consistent and comprehensive reports on data quality. It supports informed decision-making by allowing experts to fully understand the data before downstream analysis. From data collection and cleaning to feature engineering, continuously profiling data to understand its evolution and to maintain consistent quality throughout the transformation process helps identify potential issues early, enabling adjustments that optimize analysis and performance.
This tool quickly analyzes and visualizes large datasets from CSV, Parquet, or DNAnexus Apollo Dataset (or Cohort) inputs. The point-and-click solution efficiently provides summary statistics and visualizations, enabling a comprehensive understanding of the data. It also highlights data inconsistencies and complexities (e.g., missing and imbalanced data) in a logical and organized manner, guiding you through the structure and content of your data.
There are two ways to run the application:
Direct Access: Go to this link to open the app.
Platform Navigation: Click on the top navigation bar, then select Tools, proceed to the tool library, search for the “Data Profiler” app, select it, then select Run within the documentation page to start the app.
To run the app, you need to provide the required input files, which are .csv or .parquet files, or a DNAnexus Apollo Dataset (or Cohort).
If you run the app with .csv files or .parquet files, there is an optional input for the Data Dictionary. This is the same Data Dictionary used by Data Model Loader to generate the DNAnexus Apollo Dataset.
Input name | Mandatory/Optional | Input type/format | Description
input_files | Optional | A list of CSV, TSV, TXT, or Parquet files | This is the data that will be profiled by this application. Each file is a table in your dataset. Only one of the following two options should be provided: input_files and dx_record.
dx_record | Optional | A DNAnexus Apollo Dataset (or Cohort) | The data in this Dataset (or Cohort) will be profiled by this application.
data_dictionary | Optional | A CSV file | This file indicates the relationship between the tables in input_files. If not provided, the table relationship will be inferred in the job.
Tables for Inputs
For this example, there are 2 tables in your dataset:
patients.csv: a table with patient IDs and other clinical information of the patient
encounters.csv: a table of encounters (i.e., hospital visits) of all patients in patients.csv
patients.csv
patient_id | name
P0001 | John Doe
P0002 | Jane Roe
encounters.csv
encounter_id | patient_id
E0001 | P0001
E0002 | P0001
E0003 | P0002
E0004 | P0002
In this example dataset, there are 2 patients in patients.csv, and each patient visited the hospital twice.
Data Dictionary
Even though data_dictionary is optional, it is crucial for cross-table functions in Data Profiler. We highly recommend creating one for your dataset.
The data_dictionary is a CSV file that tells Data Profiler how to connect patients.csv and encounters.csv. Given this example, the linked column between these tables is patient_id. The data_dictionary can be as simple as:
entity | name | type | primary_key_type | referenced_entity_field | relationship
patients | patient_id | string | | |
encounters | encounter_id | string | | |
encounters | patient_id | string | | patients:patient_id | many_to_one
There are more columns in the data_dictionary that are not mentioned in this example. However, those columns are not required. If you are interested in the full form of data_dictionary or the meaning of each column, please visit this documentation.
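As a sketch, the same dictionary written out as a raw CSV (with the unused columns simply left empty) could be created like this:
cat > data_dictionary.csv <<'EOF'
entity,name,type,primary_key_type,referenced_entity_field,relationship
patients,patient_id,string,,,
encounters,encounter_id,string,,,
encounters,patient_id,string,,patients:patient_id,many_to_one
EOF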
There is no need to specify anything in the OUTPUTS section. Once your inputs are ready, click Start Analysis to begin.
In the Review & Start modal, you can either customize the job settings before running the applet or leave them at their default values. The settings you can modify include:
Job Name
Output Location
Priority
Spending Limit
Instance Type
Once you’ve made your adjustments or are satisfied with the default settings, click Launch Analysis to start the job.
After launching the analysis, you will be redirected to the Monitor screen. From there, click the job name to view the job details.
It may take a few minutes for the applet to be ready. To check the status, click View Log and wait for the message indicating that the applet is ready. Once you see the message, click Open Worker URL to launch the app.
The Data Profiler is an HTTPS application on the DNAnexus Platform, which means it should be accessed via the Job URL. It typically takes a few minutes for the web interface to be ready. If you encounter any issues while visiting the Job URL, you can check the job logs for the following message:
Logs from a job instance of Data Profiler indicating the web interface is ready
If this line appears in your job logs, it confirms that the Data Profiler is ready to be accessed through the Job URL.
If you attempt to click the button before the URL is ready, you may encounter a “502 Bad Gateway” error. This is not a problem— it simply means you need to wait a bit longer before the environment is fully prepared.
If you run Data Profiler with a DNAnexus Apollo Dataset (or Cohort), you will be able to select the specific data fields to profile. If you want to profile the whole Dataset, select all data fields and start the job by clicking on the “Start profiling” button.
The table to select columns (data fields) to profile
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select “Contact Support”
Fill in the Subject and Message to submit a support ticket.
You can also set a banner for the home page
If you have questions about how to use a json file, please view this section
In your home.json file, you have to have this as the beginning of the json:
After that, you can customize exactly what you want.
There can be as many of these as you would like
You can also add in tables, images, and footers
EXAMPLE: Code for Images (not banner image):
EXAMPLE: Code for Tables:
EXAMPLE: Code for footer:
Please note that when you are done with your JSON, ensure it is in the correct format.
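One quick way to check that home.json is valid JSON, assuming Python 3 is available locally, is:
python3 -m json.tool home.json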
Please email [email protected] to create a support ticket if there are technical issues.
To get started, you will build a native bash applet that will execute the venerable wc (word count) Unix command-line program on a file. In this example, you will:
Use the dx-app-wizard to create the skeleton of a native bash applet
Define the inputs and outputs of an applet
Use dx build to build the applet
Import data from a URL
Use dx run to run the applet
The wc command takes one or more files as input. So that we have the same input file, please execute the following command to fetch the URL from Project Gutenberg and write the contents to the local file scarlet.txt:
Or use curl:
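These are the same commands shown in the transcript later in this section; either one works:
wget -O scarlet.txt https://www.gutenberg.org/cache/epub/33/pg33.txt
curl -o scarlet.txt https://www.gutenberg.org/cache/epub/33/pg33.txt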
By default, wc will print the three columns showing the number of lines, words, and characters of text, in that order, followed by the name of the file:
The output from your version of wc may differ slightly as there are several implementations of the program. For instance, the preceding output is on macOS, which is the BSD version, but the applet will run on Ubuntu Linux using the GNU version. Both programs work essentially the same.
The goal of this applet will be to accept a single file as input and capture the standard out (aka STDOUT) of wc to report the number of lines, words, and characters in the file.
Next, you will create an applet that will accept this file as input, transfer it to a virtual machine, run wc on the file, and return the preceding output as a new file. Run the dx-app-wizard to interactively answer questions about the inputs, outputs, and runtime requirements. Start by executing the program with the -h|--help flag to read the documentation:
As shown in the preceding usage, the name of the applet may be provided as an argument. For instance, you can run dx-app-wizard wc to answer the first question, which is the name of the applet. Note the naming conventions for the applet name, which you should also follow for naming the input and output variables:
Because the name was provided as an argument, the prompt shows [wc]. All the prompts will show a default value that will be used if you press the Enter key. If you wish to override this value, type a new name; otherwise, press Enter.
Next, you will be prompted for a title. The empty brackets ([]) indicate this is optional, but I will provide "Word Count":
Likewise, the summary is optional, but I will provide one:
Indicate the version with major, minor, and patch release:
The input specification follows. Use the name input_file for the first input name and whatever label you like. For the class, choose file to indicate that the user must supply a valid file, and specify that this input is not optional:
As this is the only input, press Enter when prompted for a second input and move to the output specification. To start, call the output outfile and use the class of file:
There is no other output for now, so press Enter to move on to the Timeout Policy. You may choose any amount of time you like such as "1h" to indicate 1 hour:
Next, you will choose whether to use bash or Python as the primary language of the applet. Choose bash:
Choosing bash means that your app will execute a bash script that will use commands from the dxpy module to do things like download and upload files as well as execute any command on the runtime instance, such as custom programs you write in Python, R, C, etc. Choosing Python here means that a Python script will be executed, and it can use the same Python module to do everything the bash script does. This tutorial will only demonstrate bash apps. There is no advantage one language has over the other. You should choose whichever suits your tastes.
During runtime, some apps may need to fetch resources from the internet or from the parent project. Neither of these will apply to this applet, so answer "no" for the next two questions:
Lastly, you will choose a default instance type on which the applet will run. I usually start with the default value, which is a fairly modest machine. If an applet proves it needs more resources, refer to the instance type documentation to choose something else:
The wizard will finish with a listing of the files it has created:
As noted, you will find the following structure in the directory wc:
A directory for tests, mostly used internally by DNAnexus.
A directory for assets like files or binaries you would like copied to the runtime instance.
A JSON file describing the metadata for the applet.
A documentation stub you may wish to update.
In the preceding step, the applet's inputs, outputs, and system requirements were written to the file dxapp.json, which is in JSON (JavaScript Object Notation) format. Open this file to inspect the contents, which begins with the basic metadata about the app:
The inputSpec section shows that this applet takes a single argument of the type file. Update the patterns to include .txt:
The outputSpec shows that the applet will return a file:
The runSpec describes the runtime for the applet:
The default VM is Ubuntu 20.04, which includes Python v3 and R v3. You may also indicate Ubuntu 16.04, which has Python v2.
If you need Ubuntu 16.04 with Python v3, indicate version 1 here; otherwise, leave this 0.
The author has more success installing Python v2 on Ubuntu 20.04 rather than running an older Linux distro.
Finally, the regionalOptions describe the system requirements:
You may use a text editor to alter this file at any time, after which you will need to rebuild the applet.
As indicated in runSpec, the applet will execute the bash script src/wc.sh at runtime. The app wizard created a template that shows one method for downloading the input file and uploading the output file. Here is a modified version that removes most of the comments for the sake of brevity and adds the applet's business logic in the middle:
I've added this pragma to show each command as it's executed and to halt on undefined variables or failed system calls.
This will download the input file to a local file called input_file on the running instance.
Execute wc on input_file and redirect standard out to the file output.
The local variables $input_file and $output match the names used in the inputSpec and outputSpec. They will only exist at runtime.
Applets and data must live inside a project, so create a new one either using the web interface or the command line by executing dx new project:
Next, you will add the scarlet.txt file to the project. There are several ways you can do this. From the web interface, you can click the "Add" button, which will show two relevant options:
"Upload Data": This will allow you to upload a file from your local computer. You can drag and drop the file into the dialog box or use the file browser to select the file.
"Add Data From Server": This will launch an app that can import files accessible by a URL such as from a web address or FTP server. You should use the Project Gutenberg URL from earlier.
You can also use the dx upload command. If you created the project using the web interface, you will first need to run dx select to select your project:
Note the file's ID, which we will use later for the applet's input. If you use the web interface to upload, you can click the information "I" in the circle to see the file's metadata.
From the command line, you can use dx ls with the -l|--long option to see the file ID:
It's impossible to debug this program locally, so next you will build the applet and run it. If you are in the wc directory, run dx build to build the applet; if you are in the directory above, run dx build wc to indicate the directory that contains the applet. Subsequent builds will require the use of the -f|--overwrite or -a|--archive flag to indicate what to do with the previous version. For consistency's sake, I always run with the -f flag:
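For example, from inside the wc directory:
dx build -f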
From the web interface, you can now view a web form that will allow you to execute the applet.
You follow the same process that is listed in the Overview of the Platform section.
You can also run the applet from the command line using the applet's ID. To begin, use dx run with the -h|--help flag to see the inputs and outputs of the applet:
Run the same command without the help flag to enter an interactive session where you can indicate the input file using the file's ID noted earlier:
You may also specify the file on the command line:
Notice in both instances, the input is formatted as a JSON document for submission. Copy that JSON into a file with the following contents:
Use this file as the -f|--file input for the applet along with the -y flag to indicate you want to proceed without further confirmation and the --watch flag to enter into a watch of the applet's progress:
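For example, using the applet ID printed by dx build (yours will differ):
dx run applet-GGyGVP00K9Z4Z6VgBgkk0b06 -f inputs.json -y --watch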
The end of the job's output should look like the following:
Run dx describe on the indicated output file ID to see the metadata about the file. Then execute dx cat to see the contents of the file, which should be the same results as when the program ran locally:
In this chapter, you did the following:
Learned the structure of a native bash applet and how to use the wizard to create a new app
Built an app and ran it from the command line and the web interface
Inspected the output of an applet
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
In this example, you will learn:
How to accept a BAM file as a workflow input
Break the BAM into slices by chromosome
Distribute the slices in parallel to count the number of alignments in each
To begin, create a new directory called view_and_count and a workflow.wdl file.
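For example:
mkdir view_and_count
cd view_and_count
touch workflow.wdl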
Here is the workflow definition you should add:
The name of this workflow is bam_chrom_counter.
The workflow accepts a single, required File input that will be called bam as it is expected to be a BAM file.
Use a non-input declaration to define a String value naming the Docker image that contains Samtools.
Following is the slice_bam task, which uses samtools to index the input BAM file and break it into separate files for each of the 22 human chromosomes:
The inputs to this task are the BAM file and the name of the Docker image.
The command block uses triple-angle brackets because it must use the dollar sign ($) in shell code.
Use samtools index on the input BAM file for fast random access to the alignments.
The $(seq 22) command generates the chromosome numbers 1 through 22 used in the bash for loop.
The count_bam task is written to handle just one BAM slice:
This BAM input will be a slice of alignments for a given region. Naming this bam does not interfere with the bam variable in the workflow or any other task.
Use the samtools view command with -c|--count to count the number of alignments in the given file.
The output of this task uses the read_int function to read the STDOUT from the command as an integer value.
At this point, I like to use miniwdl to check the syntax:
As no errors are reported, I will compile this onto the DNAnexus platform:
Finally, I will run this workflow using a sample BAM file:
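For example, using the workflow and file IDs from the transcript later in this section (yours will differ):
dx run workflow-GFqF27j07GyZ33JX4vzqgK32 -istage-common.bam=file-G8V38KQ0zQ713kZGF6xQQvjJ -y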
Return to the DNAnexus website to monitor the progress of the analysis.
As the number of tasks increases, workflow definitions can get quite long. You can shorten the workflow.wdl by placing each task in a separate file, which also makes it easier to reuse a task in a separate workflow. To do this, create a subdirectory called tasks, and then create a file called tasks/slice_bam.wdl with the following contents:
Also create the file tasks/count_bam.wdl with the following contents:
Both of the preceding tasks are identical to the original definitions, but note that the files include a version that matches the version of the workflow. Change workflow.wdl as follows:
Use import to include WDL code from a file or URI. Note the use of the as clause to alias the imports using a different name.
Call task_slice_bam.slice_bam from the imported file using as to give it the same name as in the original workflow.
Do the same with task_count_bam.count_bam.
Use miniwdl to check your syntax, then use dxCompiler to create an app.
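For example (the jar location and project ID are placeholders; adjust to your setup):
miniwdl check workflow.wdl
java -jar ~/dxCompiler.jar compile workflow.wdl -archive -folder /workflows -project project-ID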
In this lesson, you learned how to:
Accept a file as a workflow input
Define a non-input declaration
Use scatter to run tasks in parallel
Use the output from one task as the input to another task
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
You can write the wc applet using Workflow Description Language (WDL), which is a high-level way to define and chain tasks. You will start by defining a single task, which compiles to an applet on the DNAnexus platform.
In this example, you will:
Write the wc applet using WDL
In the bash applet, the inputs, outputs, and runtime specifications are defined in the dxapp.json file, and the code that runs lives in a separate file. WDL combines all of this into a single file. Create a new directory for your work, and then add the following to a file called wc.wdl:
There are several versions of WDL, and this indicates the file will use version 1.0.
A task in WDL will compile to an applet in DNAnexus.
The input block equates to the inputSpec from the previous chapter. Each input value is declared with a type. Here the input is a File.
First, ensure you have a working Java compiler and have installed all the Java Jar files as described in Chapter 1. Use WOMtool to validate the WDL syntax:
If you installed the Python miniwdl program, you can also use it to check the syntax. The output on success is something like a parse tree:
To demonstrate the output on error, I'll change the word File to Fiel:
Here is the equivalent error from WOMtool:
The two tools are written in different languages (Java and Python) and have different stringencies of parsing and different ways of reporting errors. You may find it helpful to use both to track down errors.
First, use dx pwd to check if you are in your wc project; if not, use dx select to change. Now you can use the dxCompiler jar file you downloaded in Chapter 1 to compile the WDL into an applet:
Run the new applet from the CLI with the help flag to inspect the usage:
Whether you use bash or WDL to write an applet, the compiled result works the same for the user.
If you look in the web interface, you should see a new wc_wdl object in the project as shown in Figure 1.
Click on the applet to launch the user interface as shown in Figure 2. Select an input file and launch the applet.
As with the bash version, you can launch the applet using the command line arguments:
The output from the job will look different, but the result will be the same. You can use dx describe with the --json option to get a JSON document describing the entire job and pipe this to the jq tool to extract the output section:
The dx cat command allows you to quickly see the contents of the output file without having to download it to your computer:
This is the same output as from the previous chapter.
Depending on your comfort level with WDL, you may or may not find this version simpler than the bash version. The result is the same no matter how you write the applet, so it's a matter of taste as to which you should select.
In this chapter, you learned how to:
Write a WDL task
Use WOMtool and miniwdl to validate WDL syntax
Compile a WDL task into an applet
Use the JSON output from dx describe and jq
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
dx run app-cloud_workstation --instance-type mem1_ssd2_v2_x72 --ssh -y
docker pull biocontainers/samtools:v1.9-4-deb_cv1
docker save -o samtools.tar.gz biocontainers/samtools:v1.9-4-deb_cv1
dx upload samtools.tar.gz --path project-ID:/
docker run -it biocontainers/samtools:v1.9-4-deb_cv1
dx-app-wizard
DNAnexus App Wizard, API v1.0.0
Basic Metadata
Please enter basic metadata fields that will be used to describe your app. Optional fields are denoted by options with square brackets. At the end of this wizard, the files necessary for building your app will be generated from the answers you provide.
The name of your app must be unique on the DNAnexus platform. After creating your app for the first time, you will be able to publish new versions using the same app name. App names are restricted to alphanumeric characters (a-z, A-Z, 0-9), and the characters ".", "_", and "-".
App Name: samtools_count_docker_bundle
The title, if provided, is what is shown as the name of your app on the website. It can be any valid UTF-8 string.
Title []: Samtools Count
The summary of your app is a short phrase or one-line description of what your app does. It can be any UTF-8 human-readable string.
Summary []: Count SAM/BAM alignments
You can publish multiple versions of your app, and the version of your app is a string with which to tag a particular version. We encourage the use of Semantic Versioning for labeling your apps (see http://semver.org/ for more details).
Version [0.0.1]:
Input Specification
You will now be prompted for each input parameter to your app. Each parameter should have a unique name that uses only the underscore "_" and alphanumeric characters, and does not start with a number.
1st input name (<ENTER> to finish): bam
Label (optional human-readable name) []: BAM File
Your input parameter must be of one of the following classes:
applet array:file array:record file int
array:applet array:float array:string float record
array:boolean array:int boolean hash string
Choose a class (<TAB> twice for choices): file
This is an optional parameter [y/n]: n
2nd input name (<ENTER> to finish):
Output Specification
You will now be prompted for each output parameter of your app. Each parameter should have a unique name that uses only the underscore "_" and alphanumeric characters, and does not start with a number.
1st output name (<ENTER> to finish): counts
Label (optional human-readable name) []: Counts File
Choose a class (<TAB> twice for choices): file
2nd output name (<ENTER> to finish):
Timeout Policy
Set a timeout policy for your app. Any single entry point of the app that runs longer than the specified timeout will fail with a TimeoutExceeded error. Enter an int greater than 0 with a single-letter suffix (m=minutes,h=hours, d=days) (e.g. "48h").
Timeout policy [48h]:
Template Options
You can write your app in any programming language, but we provide templates for the following supported languages: Python, bash
Programming language: bash
Access Permissions
If you request these extra permissions for your app, users will see this fact when launching your app, and certain other restrictions will apply. For more information, see https://documentation.dnanexus.com/developer/apps/app-permissions.
Access to the Internet (other than accessing the DNAnexus API).
Will this app need access to the Internet? [y/N]: n
Direct access to the parent project. This is not needed if your app specifies outputs,which will be copied into the project after it's done running.
Will this app need access to the parent project? [y/N]: n
Default instance type: The instance type you select here will apply to all entry points in your app unless you override it. See https://documentation.dnanexus.com/developer/api/running-analyses/instance-types for more information.
Choose an instance type for your app [mem1_ssd1_v2_x4]:
*** Generating DNAnexus App Template... ***
Your app specification has been written to the dxapp.json file. You can specify more app options by editing this file directly (see https://documentation.dnanexus.com/developer for complete documentation).
Created files:
samtools_count_docker_bundle/Readme.developer.md
samtools_count_docker_bundle/Readme.md
samtools_count_docker_bundle/dxapp.json
samtools_count_docker_bundle/resources/
samtools_count_docker_bundle/src/
samtools_count_docker_bundle/src/samtools_count.sh
samtools_count_docker_bundle/test/
App directory created! See https://documentation.dnanexus.com/developer for tutorials on how to modify these files, or run "dx build samtools_count" or "dx build --create-app samtools_count_docker_bundle" while logged in with dx.
Running the DNAnexus build utility will create an executable on the DNAnexus platform. Any files found in the resources directory will be uploaded so that they will be present in the root directory when the executable is run.
{
"name": "samtools_count_docker_bundle",
"title": "Samtools Count",
"summary": " Count SAM/BAM alignments",
"dxapi": "1.0.0",
"version": "0.0.1",
"inputSpec": [
{
"name": "bam",
"label": "BAM file",
"class": "file",
"optional": false,
"patterns": [
"*.bam"
],
"help": ""
}
],
"outputSpec": [
{
"name": "counts",
"label": "counts file",
"class": "file",
"patterns": [
"*"
],
"help": ""
}
],
"runSpec": {
"timeoutPolicy": {
"*": {
"hours": 3
}
},
"interpreter": "bash",
"file": "src/samtools_docker.sh",
"distribution": "Ubuntu",
"release": "24.04",
"version": "0"
},
"regionalOptions": {
"aws:us-east-1": {
"systemRequirements": {
"*": {
"instanceType": "mem1_ssd1_v2_x4"
}
}
}
}
}
#!/bin/bash
set -exuo pipefail
main() {
echo "Value of bam: '$bam'"
dx download "$bam" -o "$bam_name"
docker load < "/samtools.tar.gz"
counts_id=${bam_prefix}.counts.txt
docker run -v /home/dnanexus:/home/dnanexus \
biocontainers/samtools:v1.9-4-deb_cv1 samtools view -c "/home/dnanexus/${bam_name}" > "/home/dnanexus/${counts_id}"
upload=$(dx upload "$counts_id" --brief)
dx-jobutil-add-output counts "$upload" --class=file
}
dx build samtools_count_docker_bundle
"runSpec": {
"timeoutPolicy": {
"*": {
"hours": 48
}
},
"interpreter": "python3",
"file": "src/python_cnvkit.py",
"distribution": "Ubuntu",
"release": "20.04",
"version": "0",
"assetDepends": [{"id": "record-GgP33b00BppJKpyyFxGpZJYf"}]
}
{
"bam_tumor": [
{
"$dnanexus_link": "file-GFxXjV006kZVQPb20G85VXBp"
}
],
"reference": {
"$dnanexus_link": "file-GFxXvpj06kZfP0QVKq2p2FGF"
}
}
#!/usr/bin/env python
import os
import dxpy
import re
import sys
from typing import List
from subprocess import getstatusoutput
@dxpy.entry_point("main")
def main(bam_tumor, reference):
bam_tumor = [dxpy.DXFile(item) for item in bam_tumor] # 1
reference = dxpy.DXFile(reference) # 2
reference_name = reference.describe().get("name", "reference.cnn")
dxpy.download_dxfile(reference.get_id(), reference_name)
bam_dir = "bams"
os.makedirs(bam_dir)
bam_files = [] # 3
for file in bam_tumor:
desc = file.describe()
file_id = file.get_id()
path = os.path.join(bam_dir, desc.get("name", file_id))
dxpy.download_dxfile(file_id, path) # 4
bam_files.append(path)
out_dir = "cnvkit-out"
cmd = (
f"cnvkit.py batch {' '.join(bam_files)} "
f"-r {reference_name} "
f"-p $(expr $(nproc) - 1) "
f"-d {out_dir} --scatter"
)
print(cmd)
rv, out = getstatusoutput(cmd) # 5
if rv != 0:
sys.exit(out)
out_files = [os.path.join(out_dir, file) for file in os.listdir(out_dir)] # 6
print(f'out_files = {",".join(out_files)}')
return {
"cns": upload("\.call\.cns$", out_files), # 7
"cns_filtered": upload("(?<!\.call)\.cns$", out_files),
"plot": upload("-scatter.png$", out_files),
}
def upload(pattern: str, paths: List[str]) -> List[str]:
"""Upload files matching a pattern and return DX link"""
regex = re.compile(pattern) # 8
return [
dxpy.dxlink(dxpy.upload_local_file(file)) # 9
for file in filter(regex.search, paths) # 10
]
dxpy.run()
Job Log
-------
Watching job job-GgP7Z30071x73vpBzXK1jk7X. Press Ctrl+C to stop watching.
* CNVKit (python_cnvkit:main) (running) job-GgP7Z30071x73vpBzXK1jk7X
kyclark 2024-02-27 17:10:52 (running for 0:01:57)
2024-02-27 17:13:28 CNVKit INFO Logging initialized (priority)
2024-02-27 17:13:28 CNVKit INFO Logging initialized (bulk)
2024-02-27 17:13:34 CNVKit INFO Downloading bundled file cnvkit_asset.tar.gz
2024-02-27 17:14:02 CNVKit STDOUT >>> Unpacking cnvkit_asset.tar.gz to /
2024-02-27 17:14:02 CNVKit STDERR tar: Removing leading `/' from member names
2024-02-27 17:15:36 CNVKit INFO Setting SSH public key
2024-02-27 17:15:39 CNVKit STDOUT dxpy/0.369.0
(Linux-5.15.0-1053-aws-x86_64-with-glibc2.29) Python/3.8.10
2024-02-27 17:15:40 CNVKit STDOUT Invoking main with {'bam_tumor':
[{'$dnanexus_link': 'file-GFxXjV006kZVQPb20G85VXBp'}], 'reference':
{'$dnanexus_link': 'file-GFxXvpj06kZfP0QVKq2p2FGF'}}
2024-02-27 17:16:16 CNVKit STDOUT Running "cnvkit.py batch
bams/HCC1187_1x_tumor_markdup.bam -r reference.cnn -p $(expr $(nproc) - 1) -d
cnvkit-out --scatter"
2024-02-27 17:19:57 CNVKit STDOUT out_files = {",".join(out_files)}
* CNVKit (python_cnvkit:main) (done) job-GgP7Z30071x73vpBzXK1jk7X
kyclark 2024-02-27 17:10:52 (runtime 0:07:54)
Output: cns = [ file-GgP7jF80K7VPVpkkkzyqBK2Q ]
cns_filtered = [ file-GgP7jF80K7V7q1jJVPYJj0pg,
file-GgP7jFQ0K7VFfb7BJ3YbYy60 ]
plot = [ file-GgP7jFQ0K7V115GPfGYB2j6b ]
{
"order": ["banner_image", "template_projects", "academy_links", "dnanexus_links"],
"components": {
"banner_image": {
"type": "image",
"id": "banner_image",
"src": "#banner_image.png"
},
"template_projects": {
"type": "project",
"id": "template_projects",
"title": "Template Projects",
"query": {
"tags": "Template Course",
"limit": 5
},
"columns":[
{
"property": "name",
"label": "Name"
},
{
"property": "level",
"formatter": "capitalize",
"label": "Access"
}
],
"viewMore": "/communities/academy_curriculum/projects",
"minWidth": "400px"
},
"academy_links": {
"type": "link",
"id": "academy_links",
"title": "DNAnexus Academy Links",
"links": [
{
"name": "Academy Documentation",
"href": "https://academy.dnanexus.com"
}
],
"minWidth": "400px"
},
"dnanexus_links": {
"type": "link",
"id": "dnanexus_links",
"title": "DNAnexus Links",
"links": [
{
"name": "DNAnexus Website",
"href": "https://www.dnanexus.com"
},
{
"name": "DNAnexus Documentation",
"href": "https://documentation.dnanexus.com"
}
],
"minWidth": "400px"
}
}
}
{
"order": [ #LIST #HERE ],
"components": {
#FILL WITH SECTIONS HERE
}
}
"banner_image": {
"type": "image",
"id": "banner_image", #keep the ids lower case and with no spaces
"src": "#banner_image.png" #you will need an image when you upload; change this name to whatever you want to call it, but leave the # in front of it
},
"template_projects": {
"type": "project",
"id": "template_projects", #keep the ids lower case and with no spaces
"title": "Template Projects", #this is what will show up on the portal as the name
"query": {
"tags": "Template Course", #this is the tag for my template course projects
"limit": 5 #this is how many of the courses I want to show up
},
"columns":[ #these are the columns you want viewable as part of your table. I picked name and access level.
{
"property": "name",
"label": "Name"
},
{
"property": "level",
"formatter": "capitalize",
"label": "Access"
}
],
"viewMore": "/communities/academy_curriculum/projects", #this sets the parameter for a list of the rest of the projects with the tag that I have selected.
"minWidth": "400px" #this sets the width on the portal home page for this section. If you want them to take up the whole page, you do not have to have this. I set it to 400 so that I could add multiple columns. If you do not set this, you will have these as rows, one table after another.
},
"academy_links": {
"type": "link",
"id": "academy_links", #keep the ids lower case and with no spaces
"title": "DNAnexus Academy Links", #title that shows up on the home page
"links": [
{
"name": "Academy Documentation", #name that shows up for the link
"href": "https://academy.dnanexus.com" #link I want used
}
],
"minWidth": "400px" #this sets the width on the portal home page for this section. If you want them to take up the whole page, you do not have to have this. I set it to 400 so that I could add multiple columns. If you do not set this, you will have these as rows, one table after another.
},
"dnanexus_links": {
"type": "link",
"id": "dnanexus_links", #keep the ids lower case and with no spaces
"title": "DNAnexus Links", #title that shows up for the home page
"links": [
{
"name": "DNAnexus Website", #name that shows up for the link
"href": "https://www.dnanexus.com" #link I want used
},
{
"name": "DNAnexus Documentation", #name that shows up for the link
"href": "https://documentation.dnanexus.com" #link I want used
}
],
"minWidth": "400px" #this sets the width on the portal home page for this section. If you want them to take up the whole page, you do not have to have this. I set it to 400 so that I could add multiple columns. If you do not set this, you will have these as rows, one table after another.
}
}
"example_image": {
"type": "image",
"id": "example-image", #id for order purposes
"src": "https://example.com/image.png", #you can set the source for this as a public link or with a "#" if you have the image locally.
"alt": "Alt text" #text
},
"table-example": {
"type": "markdown", #format for the table
"id": "table_example", #id for the order of content
"title": "Table Example",
"content": "LIST MARKDOWN CONTENT HERE FOR TABLE", #this will need to be your code for a table
"minWidth": "100px"
},
"footer": {
"name": "DNAnexus Help",
"href": "https://www.dnanexus.com/help"
},
"minWidth": "300px"
The samtools command that is being run in the applet, including the location of the output file as /home/dnanexus/${counts_id}

Another documentation stub.
A directory to place source code for the applet.
The bash script template to execute the applet.
This command will link the output file as an output of the applet.
The first call will be to the slice_bam task that will break the BAM into one file per chromosome. The input for this task is the workflow's BAM file.
The scatter directive in WDL causes the actions in the block to be executed in parallel, which can lead to significant performance gains. Here, the each slice file returned from the slice_bam task will be used as the input to the count_bam task.
The workflow defines two outputs: a BAM index file and an array of integer values representing the number of alignments in each of the BAM slices.
The samtools view command will output the alignments in BAM format for a region like "chr1" and place the output into the file slices/1.bam. Note the mix of ~ for WDL variables and $ for bash variables.
The runtime block allows you to define a Docker image that contains an installation of Samtools.
The output of this task is the BAM index, which is the given BAM file plus the suffix .bai, and the sliced alignment files.
The slices will be one or more files as indicated by Array[File], and they will be found using the glob function to look in the slices directory for all files with the extension .bam.
Mix ~ and $ in command blocks to dereference WDL and shell variables
Import WDL from external sources such as local files or remote URIs.
The command block contains the bash code that will be executed at runtime.
The output block equates to the outputSpec from the previous chapter. As with inputs, each output must declare a type.
The runtime block equates to the runSpec from the previous chapter. Here, you define that the task will use a Docker image of Ubuntu Linux 20.04.
Use dx cat to see the contents of a file on the DNAnexus platform


$ wget -O scarlet.txt https://www.gutenberg.org/cache/epub/33/pg33.txt
$ curl -o scarlet.txt https://www.gutenberg.org/cache/epub/33/pg33.txt
$ wc scarlet.txt
8590 86055 513523 scarlet.txt
$ dx-app-wizard -h
usage: dx-app-wizard [-h] [--json-file JSON_FILE] [--language LANGUAGE]
[--template {basic,parallelized,scatter-process-gather}]
[name]
Create a source code directory for a DNAnexus app. You will be prompted for
various metadata for the app as well as for its input and output
specifications.
positional arguments:
name Name of your app
optional arguments:
-h, --help show this help message and exit
--json-file JSON_FILE
Use the metadata and IO spec found in the given file
--language LANGUAGE Programming language of your app
--template {basic,parallelized,scatter-process-gather}
Execution pattern of your app
$ dx-app-wizard wc
DNAnexus App Wizard, API v1.0.0
Basic Metadata
Please enter basic metadata fields that will be used to describe your app.
Optional fields are denoted by options with square brackets. At the end of
this wizard, the files necessary for building your app will be generated from
the answers you provide.
The name of your app must be unique on the DNAnexus platform. After
creating your app for the first time, you will be able to publish new versions
using the same app name. App names are restricted to alphanumeric characters
(a-z, A-Z, 0-9), and the characters ".", "_", and "-".
App Name [wc]:
The title, if provided, is what is shown as the name of your app on
the website. It can be any valid UTF-8 string.
Title []: Word Count
The summary of your app is a short phrase or one-line description of
what your app does. It can be any UTF-8 human-readable string.
Summary []: Find the number of lines, words, and characters in a file
You can publish multiple versions of your app, and the version of your
app is a string with which to tag a particular version. We encourage the use
of Semantic Versioning for labeling your apps (see http://semver.org/ for more
details).
Version [0.0.1]: 0.1.0
Input Specification
You will now be prompted for each input parameter to your app. Each parameter
should have a unique name that uses only the underscore "_" and alphanumeric
characters, and does not start with a number.
1st input name (<ENTER> to finish): input_file
Label (optional human-readable name) []: Input file
Your input parameter must be of one of the following classes:
applet array:file array:record file int
array:applet array:float array:string float record
array:boolean array:int boolean hash string
Choose a class (<TAB> twice for choices): file
This is an optional parameter [y/n]: n
Output Specification
You will now be prompted for each output parameter of your app. Each
parameter should have a unique name that uses only the underscore "_" and
alphanumeric characters, and does not start with a number.
1st output name (<ENTER> to finish): output
Label (optional human-readable name) []: Output file
Choose a class (<TAB> twice for choices): file
Timeout Policy
Set a timeout policy for your app. Any single entry point of the app
that runs longer than the specified timeout will fail with a TimeoutExceeded
error. Enter an int greater than 0 with a single-letter suffix (m=minutes,
h=hours, d=days) (e.g. "48h").
Timeout policy [48h]: 1h
Template Options
You can write your app in any programming language, but we provide
templates for the following supported languages: Python, bash
Programming language: bash
Access to the Internet (other than accessing the DNAnexus API).
Will this app need access to the Internet? [y/N]: n
Direct access to the parent project. This is not needed if your app
specifies outputs, which will be copied into the project after it's done
running.
Will this app need access to the parent project? [y/N]: n
Default instance type: The instance type you select here will apply to
all entry points in your app unless you override it. See https://documenta
tion.dnanexus.com/developer/api/running-analyses/instance-types for more
information.
Choose an instance type for your app [mem1_ssd1_v2_x4]:
*** Generating DNAnexus App Template... ***
Your app specification has been written to the dxapp.json file. You can
specify more app options by editing this file directly (see
https://documentation.dnanexus.com/developer for complete documentation).
Created files:
wc/Readme.developer.md
wc/Readme.md
wc/dxapp.json
wc/resources/
wc/src/
wc/src/wc.sh
wc/test/
App directory created! See https://documentation.dnanexus.com/developer for
tutorials on how to modify these files, or run "dx build wc" or "dx build
--create-app wc" while logged in with dx.
Running the DNAnexus build utility will create an executable on the DNAnexus
platform. Any files found in the resources directory will be uploaded
so that they will be present in the root directory when the executable is run.
$ find wc
wc
wc/test # 1
wc/resources #2
wc/dxapp.json # 3
wc/Readme.md # 4
wc/Readme.developer.md # 5
wc/src # 6
wc/src/wc.sh # 7
{
"name": "wc",
"title": "Word Count",
"summary": "Find the number of lines, words, and characters in a file",
"dxapi": "1.0.0",
"version": "0.1.0",
"inputSpec": [
{
"name": "input_file",
"label": "Input file",
"class": "file",
"optional": false,
"patterns": [
"*.txt"
],
"help": ""
}
],
"outputSpec": [
{
"name": "output",
"label": "Output",
"class": "file",
"patterns": [
"*"
],
"help": ""
}
],
"runSpec": {
"timeoutPolicy": {
"*": {
"hours": 1
}
},
"interpreter": "bash",
"file": "src/wc.sh",
"distribution": "Ubuntu",
"release": "20.04",
"version": "0"
},
"regionalOptions": {
"aws:us-east-1": {
"systemRequirements": {
"*": {
"instanceType": "mem1_ssd1_v2_x4"
}
}
}
}
}
#!/bin/bash
set -exo pipefail
main() {
echo "Value of input_file: '$input_file'"
dx download "$input_file" -o input_file
wc input_file > output.txt
output_id=$(dx upload output.txt --brief)
dx-jobutil-add-output output "$output_id" --class=file
}
$ dx new project wc
Created new project called "wc" (project-GGyG8K80K9ZKzkX812yY893V)
Switch to new project now? [y/N]: y
$ dx select project-GGyG8K80K9ZKzkX812yY893V
Selected project project-GGyG8K80K9ZKzkX812yY893V
$ dx upload scarlet.txt
[===========================================================>]
Uploaded 513,523 of 513,523 bytes (100%) scarlet.txt
ID file-GGyG8z00K9Z9GQ9jG4qB4gpX
Class file
Project project-GGyG8K80K9ZKzkX812yY893V
Folder /
Name scarlet.txt
State closing
Visibility visible
Types -
Properties -
Tags -
Outgoing links -
Created Tue Oct 4 16:40:44 2022
Created by kyclark
Last modified Tue Oct 4 16:40:47 2022
Media type
archivalState "live"
cloudAccount "cloudaccount-dnanexus"
$ dx ls -l
Project: wc (project-GGyG8K80K9ZKzkX812yY893V)
Folder : /
State Last modified Size Name (ID)
closed 2022-10-04 16:40:48 501.49 KB scarlet.txt (file-GGyG8z00K9Z9GQ9jG4qB4gpX)
$ dx build -f
{"id": "applet-GGyGVP00K9Z4Z6VgBgkk0b06"}
$ dx run applet-GGyGVP00K9Z4Z6VgBgkk0b06 -h
usage: dx run applet-GGyGVP00K9Z4Z6VgBgkk0b06 [-iINPUT_NAME=VALUE ...]
Applet: Word Count
Find the number of lines, words, and characters in a file
Inputs:
Input file: -iinput_file=(file)
Outputs:
Output: output (file)
$ dx run applet-GGyGVP00K9Z4Z6VgBgkk0b06
Entering interactive mode for input selection.
Input: Input file (input_file)
Class: file
Enter file ID or path (<TAB> twice for compatible files in current directory,
'?' for more options)
input_file: file-GGyG8z00K9Z9GQ9jG4qB4gpX
Using input JSON:
{
"input_file": {
"$dnanexus_link": "file-GGyG8z00K9Z9GQ9jG4qB4gpX"
}
}
Confirm running the executable with this input [Y/n]: n
$ dx run applet-GGyGVP00K9Z4Z6VgBgkk0b06 -iinput_file=file-GGyG8z00K9Z9GQ9jG4qB4gpX
Using input JSON:
{
"input_file": {
"$dnanexus_link": "file-GGyG8z00K9Z9GQ9jG4qB4gpX"
}
}
Confirm running the executable with this input [Y/n]: n
$ cat inputs.json
{
"input_file": {
"$dnanexus_link": "file-GGyG8z00K9Z9GQ9jG4qB4gpX"
}
}
$ dx run applet-GGyGVP00K9Z4Z6VgBgkk0b06 -f inputs.json -y --watch
Using input JSON:
{
"input_file": {
"$dnanexus_link": "file-GGyG8z00K9Z9GQ9jG4qB4gpX"
}
}
Calling applet-GGyGVP00K9Z4Z6VgBgkk0b06 with output destination
project-GGyG8K80K9ZKzkX812yY893V:/
Job ID: job-GGyGZPQ0K9Z7PXybBp52P3xF
Job Log
-------
Watching job job-GGyGZPQ0K9Z7PXybBp52P3xF. Press Ctrl+C to stop watching.
2022-10-04 17:08:36 Word Count STDERR + wc input_file
2022-10-04 17:08:36 Word Count STDERR ++ dx upload output --brief
2022-10-04 17:08:37 Word Count STDERR + output=file-GGyGf100qZbvFjb3GqfG6kzj
2022-10-04 17:08:37 Word Count STDERR + dx-jobutil-add-output output
file-GGyGf100qZbvFjb3GqfG6kzj --class=file
$ dx cat file-GGyGf100qZbvFjb3GqfG6kzj
8590 86055 513523 input_file
version 1.0
workflow bam_chrom_counter {
input {
File bam
}
String docker_img = "quay.io/biocontainers/samtools:1.12--hd5e65b6_0"
call slice_bam {
input : bam = bam,
docker_img = docker_img
}
scatter (slice in slice_bam.slices) {
call count_bam {
input: bam = slice,
docker_img = docker_img
}
}
output {
File bai = slice_bam.bai
Array[Int] count = count_bam.count
}
}
task slice_bam {
input {
File bam
String docker_img
}
command <<<
set -ex
samtools index "~{bam}"
mkdir slices
for i in $(seq 22); do
samtools view -b -o "slices/$i.bam" "~{bam}" "chr${i}"
done
>>>
runtime {
docker: docker_img
}
output {
File bai = "~{bam}.bai"
Array[File] slices = glob("slices/*.bam")
}
}task count_bam {
input {
File bam
String docker_img
}
command <<<
samtools view -c "~{bam}"
>>>
runtime {
docker: docker_img
}
output {
Int count = read_int(stdout())
}
}$ miniwdl check workflow.wdl
workflow.wdl
workflow bam_chrom_counter
call slice_bam
scatter slice
call count_bam
task count_bam
task slice_bam$ java -jar ~/dxCompiler-2.10.2.jar compile workflow.wdl \
-archive \
-folder /workflows \
-project project-GFPQvY007GyyXgXGP7x9zbGb
workflow-GFqF27j07GyZ33JX4vzqgK32$ dx run workflow-GFqF27j07GyZ33JX4vzqgK32 \
> -istage-common.bam=file-G8V38KQ0zQ713kZGF6xQQvjJ -y
Using input JSON:
{
"stage-common.bam": {
"$dnanexus_link": "file-G8V38KQ0zQ713kZGF6xQQvjJ"
}
}
Calling workflow-GFqF27j07GyZ33JX4vzqgK32 with output destination
project-GFPQvY007GyyXgXGP7x9zbGb:/
Analysis ID: analysis-GFqF7Zj07GyZQ957Jy822gQYversion 1.0
task slice_bam {
input {
File bam
String docker_img
}
command <<<
set -ex
samtools index "~{bam}"
mkdir slices
for i in $(seq 22); do
samtools view -b -o "slices/$i.bam" "~{bam}" "chr${i}"
done
>>>
runtime {
docker: docker_img
}
output {
File bai = "~{bam}.bai"
Array[File] slices = glob("slices/*.bam")
}
}version 1.0
task count_bam {
input {
File bam
String docker_img
}
command <<<
samtools view -c "~{bam}"
>>>
runtime {
docker: docker_img
}
output {
Int count = read_int(stdout())
}
}version 1.0
import "./tasks/slice_bam.wdl" as task_slice_bam
import "./tasks/count_bam.wdl" as task_count_bam
workflow bam_chrom_counter {
input {
File bam
}
String docker_img = "quay.io/biocontainers/samtools:1.12--hd5e65b6_0"
call task_slice_bam.slice_bam as slice_bam {
input : bam = bam,
docker_img = docker_img
}
scatter (slice in slice_bam.slices) {
call task_count_bam.count_bam as count_bam {
input: bam = slice,
docker_img = docker_img
}
}
output {
File bai = slice_bam.bai
Array[Int] count = count_bam.count
}
}version 1.0
task wc_wdl {
input {
File input_file
}
command {
wc ~{input_file} > wc.txt
}
output {
File outfile = "wc.txt"
}
runtime {
docker: "ubuntu:20.04"
}
}$ java -jar ~/womtool.jar validate wc.wdl
Success!$ miniwdl check wc.wdl
wc.wdl
task wc$ miniwdl check wc.wdl
(wc.wdl Ln 13 Col 9) Unknown type Fiel
Fiel outfile = "wc.txt"
^^^^^^^^^^^^^^^^^^^^^^^java -jar ~/womtool.jar validate wc.wdl
Failed to process task definition 'wc' (reason 1 of 1):
No struct definition for 'Fiel' found in available structs: []
make: *** [validate] Error 1$ java -jar ~/dxCompiler.jar compile wc.wdl
[warning] Project is unspecified...using currently selected project
project-GGyG8K80K9ZKzkX812yY893V
applet-GJ3PxPj0K9Z68x1Y5zK4236B$ dx run applet-GJ3PxPj0K9Z68x1Y5zK4236B -h
usage: dx run applet-GJ3PxPj0K9Z68x1Y5zK4236B [-iINPUT_NAME=VALUE ...]
Applet: wc_wdl
Inputs:
input_file: -iinput_file=(file)
Reserved for dxCompiler
overrides___: [-ioverrides___=(hash)]
overrides______dxfiles: [-ioverrides______dxfiles=(file)
[-ioverrides______dxfiles=... [...]]]
Outputs:
outfile: outfile (file)$ dx run applet-GJ3PxPj0K9Z68x1Y5zK4236B \
> -iinput_file=file-GGyG8z00K9Z9GQ9jG4qB4gpX -y --watch
Using input JSON:
{
"input_file": {
"$dnanexus_link": "file-GGyG8z00K9Z9GQ9jG4qB4gpX"
}
}
Calling applet-GJ3PxPj0K9Z68x1Y5zK4236B with output destination
project-GGyG8K80K9ZKzkX812yY893V:/
Job ID: job-GJ3Q0V80K9Z54K2X9Bzf2v0B
Job Log
-------
Watching job job-GJ3Q0V80K9Z54K2X9Bzf2v0B. Press Ctrl+C to stop watching.$ dx describe job-GJ3Q0V80K9Z54K2X9Bzf2v0B --json | jq .output
{
"outfile": {
"$dnanexus_link": "file-GJ3Q10Q0b0qvyB6fG7pgx0bX"
}
}$ dx cat file-GJ3Q10Q0b0qvyB6fG7pgx0bX
8590 86055 513523 /home/dnanexus/inputs/input1217954139984307828/scarlet.txt
The cloud_workstation app provides a Linux (Ubuntu) terminal running in the cloud, which is the same base execution environment for all DNAnexus apps. This is used most often for testing application code and building Docker images. I especially favor the cloud workstation whenever I need to work with large data files that I don't wish to copy to my local disk (laptop), as the transfer speeds are internal to AWS rather than over the open internet. If you have previously been limited to HPC environments where sysadmins determine what software may or may not be installed, you will find that you have sudo privileges to install any software you like, via apt, downloading pre-built binaries, or building from source code.
In order to run the cloud workstation, you will need to set up an SSH key pair. You can do this by running the following command:
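$ dx ssh_config
This walks you through generating or selecting the SSH key pair that dx ssh will use to connect to jobs.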
Here is the start of the usage for the app:
As noted in the following usage, the default timeout is one hour, but can be changed if you need to.
In the preceding command, I also use the following flags from dx run (the full command is shown again after this list):
-imax_session_length="2h": changes the max session length to 2 hours
-y|--yes: Do not ask for confirmation before launching job
--ssh: Configure the job to allow SSH access and connect to it after launching. Defaults --priority to high.
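Putting these flags together, the command referenced above looks like the following:
$ dx run -imax_session_length="2h" app-cloud_workstation --ssh -y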
By default, this app will choose an 8-core instance type such as "mem1_ssd1_v2_x8" (16 GB RAM, 200 GB disk) for AWS us-east-1. This is usually adequate for my needs, but if I need more memory or disk space, I can specify any valid instance type with the --instance-type argument:
This is actually an argument to dx run, not to the cloud workstation app. You can use this argument with any app to override the default instance type chosen by the app developer.
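For example, to request a much larger 72-core instance:
$ dx run app-cloud_workstation --instance-type mem1_ssd2_v2_x72 --ssh -y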
The app produces no outputs. In the following sections, I want to focus on the inputs.
As noted in the following usage, the default timeout is one hour.
You can set the session to a different length with the following command, which sets the limit to 2 hours:
When on the workstation, you can find how much time is left using dx-get-timeout:
If you would like to extend the time left, use dx-set-timeout with the same values shown previously for session length. For example, you can set the timeout back to 2 hours and verify that you now have 2 hours left:
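On the workstation, that exchange looks like the following:
$ dx-set-timeout 2h
$ dx-get-timeout
0 days 1 hours 59 minutes 57 seconds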
You can initiate the app with any files you want copied to the instance:
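$ dx run app-cloud_workstation -ifids=file-xxxx --ssh -y
Here file-xxxx is a placeholder for the ID of a file you want copied to the workstation; you can repeat -ifids= to copy several files.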
One of the main use cases for the cloud workstation is working with large files, and I will mostly use dx download on the instance to download what I want. An especially important case is when I want to download a file to STDOUT rather than to a local file, in which case I would not want to initiate the app using this input. For example, when dealing with a tarball of an entire Illumina BCL run directory, I would prefer to download to STDOUT and pipe this into tar:
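$ dx download file-XXXX -o - | tar xv
Here file-XXXX stands in for the tarball's file ID, and -o - writes the download to STDOUT so that tar can read it from the pipe.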
The alternative would require at least twice the disk space (to download the tarball and then expand the contents).
You can save the state of a workstation---called a "snapshot"---and start a new workstation using that saved state:
For instance, you may go through a lengthy build of various packages to create the environment you need to run some application that will be lost when the workstation stops.
To demonstrate, I will show that the Python module "pandas" is not installed by default:
I use python3 -m pip install pandas to install the module, then dx-create-snapshot to save the state of the machine, which shows:
I can use the file ID of the snapshot to reconstitute my environment:
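$ dx run app-cloud_workstation -isnapshot=file-GXfygVj071xGjVfg1KQ9B7PP -y --ssh
(Substitute your own snapshot file ID.)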
Now I find that "pandas" does exist on the image:
You can use a snapshot file ID as an asset for native applets.
By default, this app will choose an 8-core instance type such as "mem1_ssd1_v2_x8" (16 GB RAM, 200 GB disk) for AWS us-east-1. This is usually adequate for my needs, but if I need more memory or disk space, I can specify any valid instance type with the --instance-type argument:
This is actually an argument to dx run, not to the cloud workstation app. You can use this argument with any app to override the default instance type chosen by the app developer.
When the app secures an instance, you will be greeted by the following messages. The first shows the job ID, instance type, project ID, and the workspace container:
The next part explains that you are running the terminal multiplexer:
This means that the first time you press Ctrl-A (which normally jumps to the beginning of the line in a terminal), Byobu will show the following configuration screen, prompting you to choose whether to use Screen or Emacs mode:
If you choose Screen mode, then Byobu will emulate GNU Screen keybindings, such as:
Ctrl-A, N: Next window
Ctrl-A, C: Create window
Ctrl-A, ": show list of windows
The next message is perhaps the most important:
This means that if you lose your connection to the workstation, the job will still continue running until you manually terminate it or the maximum session length is reached. For instance, you may lose your internet connection or accidentally close your terminal application. Also, your connection will be lost after an extended period of inactivity. To reconnect, use dx find jobs to find the job ID of the cloud workstation, and then use dx ssh <job-id> to pick up the Byobu session with all your work and windows in the same state.
Next, the message recommends you press F1 to read more about Byobu and how to switch screens:
Finally, the message reminds you that you have sudo privileges to install anything you like. The dx-toolkit is also installed, so you can run all dx commands:
The preceding tip to use htop is especially useful. When developing application code, I will typically choose an instance type I estimate is appropriate for the task. I will download sample input files, install all the required software, run the commands needed for the app, then open a new screen (Ctrl-A, C) and run htop there to see resource usage.
This tip is also useful once you learn to build and run apps. You can shell into a running job using dx ssh <job-id> and connect to Byobu. To see how the system is performing in real time for a given input, use Ctrl-A, C to open a new screen and run htop there.
The cloud workstation comes with several programming languages installed:
bash 5.x
Python 3.x
R 3.x
Perl 5.x
Note that you are not your DNAnexus username on the workstation but rather the dnanexus user:
This is not to be confused with your DNAnexus ID:
Like any job, a cloud workstation must be run in the context of a DNAnexus project; however, if I execute dx ls on the workstation, I will not see the contents of the project. This is because a containing workspace is created for the job, which I can see as the "Current workspace" value in dx env:
I can see more details by searching the workstation's environment for all the variables starting with DX:
The $DX_PROJECT_CONTEXT_ID variable contains the project ID:
I can use this variable to see the parent project:
Any files left on the workstation after termination will be permanently destroyed. If I use dx upload to save my work, it will go into the job's container workspace, not the parent project. To resolve this, I use the $DX_PROJECT_CONTEXT_ID variable to upload an output file to a results folder in the parent project:
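$ dx upload output.txt --path $DX_PROJECT_CONTEXT_ID:/results
Here output.txt and the results folder are only example names; substitute your own file and destination.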
Alternatively, I can unset the DX_WORKSPACE_ID variable and change directories into the $DX_PROJECT_CONTEXT_ID:
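$ unset DX_WORKSPACE_ID && dx cd $DX_PROJECT_CONTEXT_ID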
After the preceding command, dx ls and dx upload will reference the parent project rather than the container workspace.
The ttyd app runs a similar Linux terminal in the browser. Here are some differences to note:
You will enter as the root user.
Commands like dx ls and dx upload will default to the project, not a container workspace.
There is no maximum session length, so ttyd runs until manually terminated. This can be costly if you forget to shut down the terminal.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Here we will import the nf-core Sarek pipeline from GitHub to demonstrate the functionality, but you can import any Nextflow pipeline from GitHub, not just nf-core ones!
Go to a DNAnexus project. Click Add and, in the drop-down menu, select 'Import Pipeline/Workflow'.
Next enter the required information (see below) and click 'Start Import'
The GitHub URL is the URL of the Sarek GitHub repo itself (not the URL shown under 'Clone' in the repo).
Make sure there is no slash after 'sarek' in the URL, as it will cause the importer to fail.
Choose your folder in the USERS folder to output the applet to.
To see the possible releases to use, click 'Tags' in the GitHub project. If you leave this part blank, it will use the 'main' branch for that repo.
Click the 'Monitor' tab in your project to see the running/finished import job
You should see your applet in the output folder that you specified in your project.
You can see the version of dxpy that it was built with by looking at the job log for the import job
To do this click 'View Log' on the right hand side of the screen
The job log shows that the version of dxpy used here is dxpy v0.369.0
We will run the test profile for Sarek, which should take 40 minutes to 1 hour to run. The test profile inputs are the Nextflow outdir and -profile test,docker.
Click one of the sarek applets that you created
Choose the platform output location for your results.
Click on 'Output to' then make a folder or choose an existing folder. I choose the outputs folder.
Click 'Next'
Output directory considerations
Specify the nextflow output directory.
This is a directory local to the machine that Nextflow will be running on, not a DNAnexus path.
The outdir path must start with ./ or have no slashes in front of it so that the executor will be able to make this folder where it is running on the head node. For example, ./results and results are both valid, but /results or paths like dx://project-xx:/results will not produce output in your project. Once the DNAnexus Nextflow executor detects that all files have been written to this folder (and thus all subjobs have completed), it will copy this folder to the specified job destination on the platform. In the event that the pipeline fails before completion, this folder will not be written to the project.
Here I have chosen to place the nextflow output files in a directory on the head node of the run named ./test. This creates an outdir called test.
Thus once this job completes, my results will be in dx://project-xxx:/outputs/test
More details about this are found in our Documentation.
Where test is the folder that was copied from the head node of the Nextflow run to the destination that I specified for it on platform.
Scroll down and in 'Nextflow Options', 'Nextflow Run Options'
type -profile test,docker
You must use Docker for all Nextflow pipelines run on DNAnexus. Every nf-core pipeline has a Docker profile in its nextflow.config file. You need to specify -profile docker in the Nextflow run options ('Nextflow Run Options' in the UI, -inextflow_run_opts in the CLI) to get it to use Docker containers for each process.
Then click 'Start Analysis'. You will be brought to this screen
Go to the Monitor tab to see your running job.
Note! The estimated cost per hour is the cost to run the head node only! Each of the Nextflow processes (subjobs) will run on its own instance with its own cost.
Select a project to build the applet in
and choose the number associated with your project.
Or select your project using its name or project ID
Replace the folder name with your folder name
This will place the sarek applet in a folder called sarek_v3.4.0_cli_import in the /USERS/FOLDERNAME folder in the project.
You can see the job running/completed in the Monitor tab of your project.
If you are using a private github repository, you can supply a git credentials file to dx build using the --git-credentials option. The git credentials file has the following format.
It must be stored in a project on the platform. For more information on this file, see the documentation.
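A sketch of a build command using a credentials file stored in your project (the repository URL, credentials file path, and destination are placeholders):
$ dx build --nextflow \
    --repository https://github.com/ORG/private-repo \
    --git-credentials project-ID:/path/to/git_credentials \
    --destination project-ID:/USERS/FOLDERNAME/my_pipeline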
Build the Nextflow pipeline from a folder on your local machine
This approach is useful for building your own Nextflow pipelines into applets and for pipelines that are not in a GitHub repository.
It is also useful if you need to alter something from a public repo locally (e.g. change some code in a file to fix a bug without fixing it in the public repo) and want to build using the locally updated directory instead of the git repo.
Additionally, if you want to use the most up-to-date dxpy version, you will need to use this approach. Sometimes the workers executing the remote repository builds can be a version or two behind the latest release of dxpy. You may want to use the latest version of dxpy if, for instance, the Nextflow executor bundled with an older dxpy version has a bug that you want to avoid.
For example, running dx --version shows that I am using dx v0.370.2, which is what will be used for the applet we build with this approach.
However, we saw that the UI and CLI import jobs used dxpy v0.369.0.
Clone the git repository
Once you have selected the project to build in using dx select, then build using the --nextflow flag
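For example, for Sarek 3.4.0 (replace project-ID and FOLDERNAME with your own):
$ git clone --branch 3.4.0 https://github.com/nf-core/sarek.git
$ mv sarek sarek_v3.4.0_cli
$ dx build --nextflow sarek_v3.4.0_cli --destination project-ID:/USERS/FOLDERNAME/sarek_v3.4.0_cli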
You should see an applet ID if it has built successfully.
Note that this approach does not generate a job log, and it will use the version of dxpy on your local machine. So if using dxpy v0.370.2, the applet will be packaged with this version of dxpy and its corresponding version of Nextflow (23.10.0 in this case).
To see the help command for the applet:
Use dx run <applet-name/applet-ID> -h
or use its applet ID (useful when there are multiple versions of the applet with the same name, as each version will have its own ID). Also, you can run an applet by its ID from anywhere in the project, but if using its name you must dx cd to its folder before using it.
Excerpt of the help command
Run command
To run this, copy the command to your terminal and replace 'USERS/FOLDERNAME' with your folder name
Then press Enter.
You should see
Type y to proceed.
You can also add '-y' to the run command to get it to run without prompting e.g.,
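$ dx run sarek_v3.4.0_ui -ioutdir='./test_run_cli' -inextflow_run_opts='-profile test,docker' --destination 'project-ID:/USERS/FOLDERNAME' -y
(Replace the applet name, project-ID, and FOLDERNAME with your own values.)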
You can track the progress of your job using the 'Monitor' tab of your project in the UI
Once the run successfully completes, your results will be in your specified output location, where test_run_cli is the folder on the head node of the Nextflow run that is copied to the 'outputs' folder in your project on the platform.
Note that because destination is a DNAnexus option and not a Nextflow one, it starts with '--' and does not take an '=' after it.
By default, the DNAnexus executor will only run 5 subjobs in parallel. You can change this by passing the -queue-size flag in nextflow_run_opts with the number you require. There is a limit of 100 concurrent subjobs per user per project for most users, but you can give any number up to 1000 before it produces an error, as noted in the documentation. For example, if you know that you are passing 20 files to a run and that only a few of the processes can be run on all 20 files at a time, you could set the queue size to 60.
Let's change it to 20 for our nf-core Sarek run. The command would then be:
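$ dx run sarek_v3.4.0_ui -ioutdir='./test_run_cli_qs' -inextflow_run_opts='-profile test,docker -queue-size 20' --destination 'project-ID:/USERS/FOLDERNAME'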
You can also set the queue size when building your own applets in the nextflow.config. To change the default from 5 to 20 for your applet at build time, add this line to your nextflow.config
or (equivalent)
However, you can change the queue size at runtime, regardless of whether it is set in your nextflow.config, by passing -queue-size X (where X is a number between 1 and 1000) in the Nextflow run options.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Some of the links on these pages will take the user to pages that are maintained by third parties. The accuracy and IP rights of the information on these third-party pages are the responsibility of those third parties.
dx ssh_config$ dx run cloud_workstation -h
usage: dx run cloud_workstation [-iINPUT_NAME=VALUE ...]
App: Cloud Workstation
Version: 2.2.1 (published)
This app sets up a cloud workstation which you can access by running the
applet with the --ssh or --allow-ssh flags
See the app page for more information:
https://platform.dnanexus.com/app/cloud_workstationMaximum Session Length (suffixes allowed: s, m, h, d, w, M, y):
[-imax_session_length=(string, default="1h")]
The maximum length of time to keep the workstation running.
Value should include units of either s, m, h, d, w, M, y for
seconds, minutes, hours, days, weeks, months, or years
respectively.$ dx run -imax_session_length="2h" app-cloud_workstation --ssh -yCtrl-A, K: Kill/delete window$ dx run app-cloud_workstation --instance-type mem1_ssd2_v2_x72 --ssh -yMaximum Session Length (suffixes allowed: s, m, h, d, w, M, y):
[-imax_session_length=(string, default="1h")]
The maximum length of time to keep the workstation running.
Value should include units of either s, m, h, d, w, M, y for
seconds, minutes, hours, days, weeks, months, or years
respectively.$ dx run -imax_session_length="2h" app-cloud_workstation --ssh -ydnanexus@job-GXfvYxj071x5P87Fxx6f5k47:~$ dx-get-timeout
0 days 1 hours 42 minutes 50 secondsdnanexus@job-GXfvYxj071x5P87Fxx6f5k47:~$ dx-set-timeout 1d
dnanexus@job-GXfvYxj071x5P87Fxx6f5k47:~$ dx-get-timeout
0 days 1 hours 59 minutes 57 secondsFiles: [-ifids=(file) [-ifids=... [...]]]
An optional list of files to download to the cloud workstation
on startup.$ dx download file-XXXX -o - | tar xvSnapshot: [-isnapshot=(file)]
An optional snapshot file to restore the workstation environment.dnanexus@job-GXfvYxj071x5P87Fxx6f5k47:~$ python3
Python 3.8.10 (default, May 26 2023, 14:05:08)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pandas'Created snapshot: project-GXY0PK0071xJpG156BFyXpJF:July_11_2023_23_54.snapshot
(file-GXfygVj071xGjVfg1KQ9B7PP)$ dx run app-cloud_workstation -isnapshot=file-GXfygVj071xGjVfg1KQ9B7PP -y --sshdnanexus@job-GXfyj58071xB4VJ9X0yk75k3:~$ python3
Python 3.8.10 (default, May 26 2023, 14:05:08)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> help(pd.read_csv)$ dx run app-cloud_workstation --instance-type mem1_ssd2_v2_x72 --ssh -yWelcome to DNAnexus!
This is the DNAnexus Execution Environment, running job-GXfvYxj071x5P87Fxx6f5k47.
Job: Cloud Workstation
App: cloud_workstation:main
Instance type: mem1_ssd1_v2_x8
Project: kyclark_test (project-GXY0PK0071xJpG156BFyXpJF)
Workspace: container-GXfvYyj0p4QgFgP4zZyBFv7Y
Running since: Tue Jul 11 21:31:40 UTC 2023
Running for: 0:01:37
The public address of this instance is ec2-3-90-239-144.compute-1.amazonaws.com.You are running byobu, a terminal session manager.Configure Byobu's ctrl-a behavior...
When you press ctrl-a in Byobu, do you want it to operate in:
(1) Screen mode (GNU Screen's default escape sequence)
(2) Emacs mode (go to beginning of line)
Note that:
- F12 also operates as an escape in Byobu
- You can press F9 and choose your escape character
- You can run 'byobu-ctrl-a' at any time to change your selection
Select [1 or 2]:If you get disconnected from this instance, you can log in again;
your work will be saved as long as the job is running.For more information on byobu, press F1.
The job is running in terminal 1. To switch to it, use the F4 key
(fn+F4 on Macs; press F4 again to switch back to this terminal).Use sudo to run administrative commands.
From this window, you can:
- Use the DNAnexus API with dx
- Monitor processes on the worker with htop
- Install packages with apt-get install or pip3 install
- Use this instance as a general-purpose Linux workstation
OS version: Ubuntu 20.04.6 LTS (GNU/Linux 5.15.0-1031-aws x86_64)$ whoami
dnanexus$ dx whoami
kyclark$ dx env
Auth token used 4Gv26bY2YJ6gJjxGkV6Qg62B51X1VF7kq3gPZp2V
API server protocol http
API server host 10.0.3.1
API server port 8124
Current workspace container-GXfvYyj0p4QgFgP4zZyBFv7Y
Current folder None
Current user None$ env | grep DX
DX_APISERVER_PROTOCOL=http
DX_JOB_ID=job-GXfvYxj071x5P87Fxx6f5k47
DX_APISERVER_HOST=10.0.3.1
DX_WATCH_PORT=8090
DX_WORKSPACE_ID=container-GXfvYyj0p4QgFgP4zZyBFv7Y
DX_PROJECT_CACHE_ID=container-GXfvYxj071x5P87Fxx6f5k48
DX_SNAPSHOT_FILE=null
DX_SECURITY_CONTEXT={"auth_token_type": "Bearer", "auth_token": "4Gv26bY2YJ6gJjxGkV6Qg62B51X1VF7kq3gPZp2V"}
DX_RESOURCES_ID=container-GKyz0G00FY38jv564gjXxb46
DX_THRIFT_URI=query.us-east-1.apollo.dnanexus.com:10000
DX_APISERVER_PORT=8124
DX_DXDA_DOWNLOAD_URI=http://10.0.3.1:8090/F/D2PRJ/
DX_PROJECT_CONTEXT_ID=project-GXY0PK0071xJpG156BFyXpJF
DX_RUN_DETACH=1$ echo $DX_PROJECT_CONTEXT_ID
project-GXY0PK0071xJpG156BFyXpJF
$ dx ls $DX_PROJECT_CONTEXT_ID:/
$ dx upload output.txt --path $DX_PROJECT_CONTEXT_ID:/results
$ unset DX_WORKSPACE_ID && dx cd $DX_PROJECT_CONTEXT_ID
Click 'Launch Analysis'.


https://github.com/nf-core/sarekdx select # press enterdx select project-ID
#or
dx select my_project_namedx build --nextflow --repository https://github.com/nf-core/sarek --repository-tag 3.4.0 --destination project-ID:/USERS/FOLDERNAME/sarek_v3.4.0_cli_importproviders {
github {
user = 'username'
password = 'ghp_xxxx'
}
}dx --version
#dx v0.370.2git clone --branch 3.4.0 https://github.com/nf-core/sarek.git
# Here I change the folder name to something with the version in it to help me keep track of different versions of sarek
mv sarek sarek_v3.4.0_clidx build --nextflow sarek_v3.4.0_cli --destination project-ID:/USERS/FOLDERNAME/sarek_v3.4.0_cliapplet-xxxdx run sarek_v3.4.0_ui -h dx run applet-ID -husage: dx run sarek_v3.4.0_ui [-iINPUT_NAME=VALUE ...]
Applet: sarek
sarek
Inputs:
outdir: [-ioutdir=(string)]
(Nextflow pipeline required)
step: [-istep=(string)]
(Nextflow pipeline required) Default value:mapping The pipeline starts
from this step and then runs through the possible subsequent steps.
input: [-iinput=(file)]
(Nextflow pipeline optional) A design file with information about the
samples in your experiment. Use this parameter to specify the location
of the input files. It has to be a comma-separated file with a header
row. See [usage docs](https://nf-co.re/sarek/usage#input). If no
input file is specified, sarek will attempt to locate one in the
`{outdir}` directory. If no input should be supplied, i.e. when --step
is supplied or --build_from_index, then set --input false
...dx run sarek_v3.4.0_ui -ioutdir='./test_run_cli' -inextflow_run_opts='-profile test,docker' --destination 'project-ID:/USERS/FOLDERNAME'
Using input JSON:
{
"outdir": "./test_run_cli",
"nextflow_run_opts": "-profile test,docker"
}
Confirm running the executable with this input [Y/n]:dx run sarek_v3.4.0_ui -ioutdir='./test_run_cli' -inextflow_run_opts='-profile test,docker' --destination 'project-ID:/USERS/FOLDERNAME' -ydx run sarek_v3.4.0_ui -ioutdir='./test_run_cli_qs' -inextflow_run_opts='-profile test,docker -queue-size 20' --destination 'project-ID:/USERS/FOLDERNAME'executor.queueSize = 20 executor {
queueSize = 20
}












In this applet, I'll show how to count the number of reads in a SAM or BAM file using samtools. The SAM (Sequence Alignment Map) format is a tab-delimited text description of sequence alignments, and the BAM format is the same data stored in binary for better compression. As the SAM format uses a line break to delineate each record, counting the alignments could be as simple as using wc -l; however, the BAM format requires a program like samtools to read the input file, so I'll show how to install this into the applet's execution environment.
A minimal native applet requires just two files that exist in a directory with the same name as the applet:
dxapp.json: a JSON-formatted file describing the applet's metadata
a bash or Python program to execute
I'll use dx-app-wizard to create a skeleton applet structure with these files:
First, I must give my applet a name. The prompt shows that I must use only letters, numbers, a dot, underscore, and a dash. As stated earlier, this applet name will also be the name of the directory, and I'll use samtools_count:
Next, I'm asked for the title. Note that the prompt includes empty square brackets ([]), which contain the default value if I press Enter. As title is not required, it contains the empty string, but I will provide an informative title:
Likewise, the summary field is not required:
The version is also optional, and I will press Enter to take the default:
This applet requires a single input, as shown in Table 1.
When prompted for the first input, I'll enter the following:
The name of the input will be used as a variable in the bash code, so I will use only letters, numbers, and underscores as in bam or bam_file.
The label is optional, as noted by the empty square brackets.
The types include primitives like integers, floating-point numbers, and strings, as well as arrays of primitive types.
When prompted for the second input, press Enter:
As shown in Table 2, the applet will produce a single output file containing the number of alignments:
When prompted for the first output name, I enter the following:
This name will also become a bash variable, so best practice is to use letters, numbers, and underscores.
The label is optional.
The class must be from the preceding list. To be reminded of the choices, press the Tab key twice.
When prompted for the second output, press Enter:
Here are the final settings I'll use to complete the wizard:
Applets are required to set a maximum run time to prevent a job from running for an excessively long time. While some applets may legitimately need days to run, most probably need something in the range of 12-48 hours. As noted in the prompt, I can use m, h, or d to specify minutes, hours, or days, respectively:
For the template language, I must select from bash or Python for the program that is executed when the applet starts. The applet code can execute any program available in the execution environment, including custom programs written in any language. I will choose bash:
Next, I determine if the applet has access to the internet and/or the parent project. Unless the applet specifically needs access, such as to download a file at runtime, it's best to answer no:
Lastly, I must specify a default instance type. The prompt includes an abbreviated list of available instance types. The final number indicates the number of cores, e.g., _x4 indicates 4 cores. The greater the number of cores, the more available memory and disk space. In this case, a small 4-core instance is sufficient:
The user is always free to override the instance type using the --instance-type option to dx run.
The final output from dx-app-wizard is a summary of the files that are created:
This file should contain applet implementation details.
This file should contain user help.
The answers from dx-app-wizard are used to create the app metadata.
The resources directory is for any additional files you want available on the runtime instance.
The contents of the resources directory will be placed into the root directory of the runtime instance. For instance, if you create a file resources/my_tool, then it will be available on the runtime instance as /my_tool. You would either need to reference the full path (/my_tool) or expand the $PATH variable to include /. Best practice is to create the directory structure resources/usr/local/bin/, and then the file will be at /usr/local/bin/my_tool, as /usr/local/bin is normally part of $PATH.
Let's look at the dxapp.json that was generated by dx-app-wizard. Note that this is a simple text file that you can edit at any time:
The inputSpec has a section for patterns where I will add a few Unix file globs to indicate acceptable file suffixes:
The outputSpec needs no update:
The runSpec contains the timeout along with the indication to use bash to run src/samtools_count.sh. If you ever wanted to change the name or location of the run script, update this section:
Finally, the regionalOptions indicates the default runtime instance.
In the preceding runSpec, note that the applet will run on Ubuntu 20.04. This instance will include dx-toolkit and several programming languages, including bash, Python 3.x, Perl 5.x, and R 3.x. Anything else needed by the applet must be installed. Edit the runSpec to include the following execDepends to install samtools at runtime using the apt package manager:
The package_manager may be one of the following:
apt (Ubuntu)
pip (Python)
gem (Ruby)
Some caveats:
This runs apt install on every execution, which is fine for fast installs. Some packages may take 5-15 minutes to install, in which case you will pay for those extra minutes on every run.
It installs whatever version is current in the package manager, which may be old. For instance, apt installs samtools v1.10 as of this writing, while the current release is v1.17.
Your applet could break if the package manager updates to a newer version of the program that changes its behavior.
An alternative is to build an asset that the applet uses. Assets have many advantages, including:
You build the asset once
Runtime installation is a quick decompression of a tarball
Assets are static and cannot break your code
Create a new folder with the name of your asset.
Then, create the file dxasset.json in the folder with the following contents:
When I execute dx build_asset in the folder, a new job will run to build the asset:
As noted, the record ID of the asset can now be used in an assetDepends section, which should replace the execDepends:
Execute dx build_asset inside this directory to build the asset into the selected project. (You can also use the --destination option to specify where to place the asset file, which will be a tarball.)
The build process will create a new job to build the asset.
The default src/samtools_count.sh contains many lines of comments to guide you in writing your application code. Update the file to the following:
This is the colloquially named "shebang" line that indicates this is a bash script.
Although it's not a requirement that app code be contained in a main() function, it is best practice.
The original template uses echo to show you the runtime value of the inputs.
Remember that the $bam variable matches the name of the input in dxapp.json. If you ever wish to change this, be sure to update both the script and the JSON.
Run dx build to create the applet on the DNAnexus platform.
If you have previously built the applet, you will be prompted to use the -f|--overwrite or -a|--archive flag:
Out of habit, I always use -f to force the build:
Without the -d|--destination option, the applet will be placed into the root directory of the project. I like to make an apps folder to hold my applets:
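$ dx mkdir apps
$ dx build -d /apps/ -f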
TIP: Best practice is to create folders for applets, resources, assets, etc.
I'd like to discuss this code a little more. In bash, the echo command will print to the console. As in any language, this is a great way to see what's happening when your code is running. In the following line, the $bam variable will only have a value at runtime, so you will not be able to run this script locally:
When I execute this code, I see output like the following:
That means that the following line:
Will execute the following command at runtime:
Take a look at the usage for dx download to remind yourself that the -o option here is directing that the output file name be input.bam:
The next line of code executes samtools view with the -c flag. Execute samtools view -h to read the documentation:
I often use a cloud workstation to work through app building. It's the same execution environment (Ubuntu Linux), so I will install any programs I need there, download sample input files, run commands and verify the behavior and output of the tools, etc.
If I download the input file NA12878.bam (file-FpQKQk00FgkGV3Vb3jJ8xqGV), I can run the following command to see that there are 60,777 alignments:
I can use Unix output redirection with > to place the output into the file counts.txt and cat to verify the output:
Therefore, the following line of code from the bash script places the count of the input BAM file into counts.txt:
Next, I upload the counts.txt file to the platform using the --brief option that will only show the new file ID:
In bash, I can use either backticks (``) or $() to capture the results from a command, so the following line captures the file ID into the variable counts_id:
I add this new file ID as an output from the job using dx-jobutil-add-output:
Here is the last command of the script that sets the counts output variable defined in the dxapp.json to the new $counts_id value:
In the preceding applet, the output filename is always counts.txt. It would be better for each output file to use the name of the input BAM. When I define the bam input, I get four variables:
bam: the input file ID
bam_path: the default path to the downloaded input file
bam_name: the filename, also the output of basename($bam_path)
The default patterns value for a file input in dxapp.json is ["*"]. This matches the entire input filename, causing bam_prefix to be the empty string.
TIP: Always be sure to set patterns to the expected file extensions.
Given an input file of NA12878.bam, the following code will create an output file called NA12878.txt:
Print out the additional variables.
Download the input file using its original filename. The -o option here is superfluous, as the default behavior is to download the file to its filename. In the preceding example, I saved it to the filename input.bam.
Define the variable outfile to use the root of the input filename.
When I run this code, I can see the values of the other input file variables:
The bam_path value is the default path to write the bam file if I were to use dx-download-all-inputs. In this case, I used dx download with the -o option to write it to a file in the current working directory, so there is no file at that path.
There are two ways to download the input files: one at a time or all at once. So far, I've shown the first way using dx download. The second way uses dx-download-all-inputs to download all the input files to the directory /home/dnanexus/in. This will contain a directory for each file input, so the bam input file will be placed into /home/dnanexus/in/bam as shown for the $bam_path in the preceding section. If the input is an array:file, there will be additional numbered subdirectories for each of the runtime values.
Following is the usage:
I can change my code to use this:
Download the input file to the default location.
Use the $bam_prefix variable (e.g., NA12878) to create the outfile.
Use the $bam_path variable to execute samtools with the path to the in directory.
TIP: Using dx-download-all-inputs --parallel is best practice to download all input files as fast as possible.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
The src directory (pronounced "source") is a conventional place for source code, but it's not a requirement that code lives in this directory.
This is the bash script that will be executed when the applet is run.
The test directory is empty and will not be discussed in this section.
cpan (Perl)
cran (R)
Execute samtools to count the alignments in the input file.
Upload the results file and save the new file ID.
Add the new file ID to the job's output.
bam_prefix: the filename minus any file extension defined in the patterns of the dxapp.json
Run samtools, writing the count to the preferred output filename.
Upload the output file.
Table 1. Applet input
Name: bam
Label: BAM File
Class: file
Optional: No
Default: NA
Table 2. Applet output
Name: counts
Label: Counts File
Class: file
Wizard settings
Timeout Policy: 48h
Programming language: bash
Access to internet: No (default)
Access to parent project: No (default)
Instance Type: mem1_ssd1_v2_x4 (default)
$ dx-app-wizard
DNAnexus App Wizard, API v1.0.0
Basic Metadata
Please enter basic metadata fields that will be used to describe your app.
Optional fields are denoted by options with square brackets. At the end of
this wizard, the files necessary for building your app will be generated from
the answers you provide.The name of your app must be unique on the DNAnexus platform. After
creating your app for the first time, you will be able to publish new versions
using the same app name. App names are restricted to alphanumeric characters
(a-z, A-Z, 0-9), and the characters ".", "_", and "-".
App Name: samtools_countThe title, if provided, is what is shown as the name of your app on
the website. It can be any valid UTF-8 string.
Title []: Samtools CountThe summary of your app is a short phrase or one-line description of
what your app does. It can be any UTF-8 human-readable string.
Summary []: Count SAM/BAM alignmentsYou can publish multiple versions of your app, and the version of your
app is a string with which to tag a particular version. We encourage the use
of Semantic Versioning for labeling your apps (see http://semver.org/ for more
details).
Version [0.0.1]:Input Specification
You will now be prompted for each input parameter to your app. Each parameter
should have a unique name that uses only the underscore "_" and alphanumeric
characters, and does not start with a number.
1st input name (<ENTER> to finish): bam
Label (optional human-readable name) []: BAM File
Your input parameter must be of one of the following classes:
applet array:file array:record file int
array:applet array:float array:string float record
array:boolean array:int boolean hash string
Choose a class (<TAB> twice for choices): file
This is an optional parameter [y/n]: n 2nd input name (<ENTER> to finish):Output Specification
You will now be prompted for each output parameter of your app. Each
parameter should have a unique name that uses only the underscore "_" and
alphanumeric characters, and does not start with a number.
1st output name (<ENTER> to finish): counts
Label (optional human-readable name) []: Counts File
Choose a class (<TAB> twice for choices): file 2nd output name (<ENTER> to finish):Timeout Policy
Set a timeout policy for your app. Any single entry point of the app
that runs longer than the specified timeout will fail with a TimeoutExceeded
error. Enter an int greater than 0 with a single-letter suffix (m=minutes,
h=hours, d=days) (e.g. "48h").
Timeout policy [48h]:Template Options
You can write your app in any programming language, but we provide
templates for the following supported languages: Python, bash
Programming language: bashAccess Permissions
If you request these extra permissions for your app, users will see this fact
when launching your app, and certain other restrictions will apply. For more
information, see
https://documentation.dnanexus.com/developer/apps/app-permissions.
Access to the Internet (other than accessing the DNAnexus API).
Will this app need access to the Internet? [y/N]: n
Direct access to the parent project. This is not needed if your app
specifies outputs, which will be copied into the project after it's done
running.
Will this app need access to the parent project? [y/N]: nDefault instance type: The instance type you select here will apply to
all entry points in your app unless you override it. See https://documenta
tion.dnanexus.com/developer/api/running-analyses/instance-types for more
information.
Choose an instance type for your app [mem1_ssd1_v2_x4]:*** Generating DNAnexus App Template... ***
Your app specification has been written to the dxapp.json file. You can
specify more app options by editing this file directly (see
https://documentation.dnanexus.com/developer for complete documentation).
Created files:
samtools_count/Readme.developer.md # 1
samtools_count/Readme.md # 2
samtools_count/dxapp.json # 3
samtools_count/resources/ # 4
samtools_count/src/ # 5
samtools_count/src/samtools_count.sh # 6
samtools_count/test/ # 7
App directory created! See https://documentation.dnanexus.com/developer for
tutorials on how to modify these files, or run "dx build samtools_count" or
"dx build --create-app samtools_count" while logged in with dx.
Running the DNAnexus build utility will create an executable on the DNAnexus
platform. Any files found in the resources directory will be uploaded
so that they will be present in the root directory when the executable is run.{
"name": "samtools_count",
"title": "Samtools Count",
"summary": "Count SAM/BAM alignments",
"dxapi": "1.0.0",
"version": "0.0.1", "inputSpec": [
{
"name": "bam",
"label": "BAM File",
"class": "file",
"optional": false,
"patterns": [
"*.bam"
],
"help": ""
}
], "outputSpec": [
{
"name": "counts",
"label": "Counts File",
"class": "file",
"patterns": [
"*"
],
"help": ""
}
], "runSpec": {
"timeoutPolicy": {
"*": {
"hours": 48
}
},
"interpreter": "bash",
"file": "src/samtools_count.sh",
"distribution": "Ubuntu",
"release": "20.04",
"version": "0"
}, "regionalOptions": {
"aws:us-east-1": {
"systemRequirements": {
"*": {
"instanceType": "mem1_ssd1_v2_x4"
}
}
}
}
}{
...
"runSpec": {
"execDepends": [
{
"name": "samtools",
"package_manager": "apt"
}
],
...
}
}{
"name": "samtools",
"title": "samtools asset",
"description": "samtools asset",
"version": "1.10",
"distribution": "Ubuntu",
"release": "20.04",
"execDepends": [
{
"name": "samtools",
"package_manager": "apt"
}
]
}$ dx build_asset
...
* samtools (create_asset_focal:main) (done) job-GXjx8yj071x69xBVz90Zypx1
kyclark 2023-07-14 16:04:27 (runtime 0:02:05)
Output: asset_bundle = record-GXjx9V008bgjZqj82f5ybf16
Asset bundle 'record-GXjx9V008bgjZqj82f5ybf16' is built and can now be used
in your app/applet's dxapp.json{
...
"runSpec": {
"assetDepends": [
{ "id": "record-GXjx9V008bgjZqj82f5ybf16" }
],
...
}
}#!/bin/bash
main() {
echo "Value of bam: '$bam'"
dx download "$bam" -o input.bam
samtools view -c input.bam > counts.txt
counts_id=$(dx upload counts.txt --brief)
dx-jobutil-add-output counts "$counts_id" --class=file
}$ dx build
{"id": "applet-GXqG4Z8071x9p1FZ81K5BjGQ"}$ dx build
Error: ('An applet already exists at /samtools_count (id
applet-GXqG4Z8071x9p1FZ81K5BjGQ) and neither -f/--overwrite
nor -a/--archive were given.',)$ dx build -f
INFO:dxpy:Deleting applet(s) applet-GXqG4Z8071x9p1FZ81K5BjGQ
{"id": "applet-GXqG5P0071xF2j1F03qv7Zz6"}$ dx mkdir apps
$ dx build -d /apps/ -f
{"id": "applet-GXqG7bQ071xKQq3JkbVjGbGv"}echo "Value of bam: '$bam'"2023-07-17 12:42:23 Samtools Count STDOUT Value of bam:
'{"$dnanexus_link": "file-FpQKQk00FgkGV3Vb3jJ8xqGV"}'dx download "$bam" -o input.bamdx download '{"$dnanexus_link": "file-FpQKQk00FgkGV3Vb3jJ8xqGV"}' -o input.bam-o OUTPUT, --output OUTPUT Local filename or directory to be used
("-" indicates stdout output); if not supplied or
a directory is given, the object's name on the
platform will be used, along with any applicable
extensions-c, --count Print only the count of matching records$ samtools view -c NA12878.bam
60777$ samtools view -c NA12878.bam > counts.txt
$ cat counts.txt
60777samtools view -c input.bam > counts.txt$ dx upload counts.txt --brief
file-GXpvky0071x6jg2ZVV3fJ5xp$ counts_id=$(dx upload counts.txt --brief)
$ echo $counts_id
file-GXqFf60071x6p2fbKYzVv9pp$ dx-jobutil-add-output -h
usage: dx-jobutil-add-output [-h] [--class [CLASSNAME]] [--array] name value
Reads and modifies job_output.json in your home directory to be a JSON hash
with key *name* and value *value*.
If --class is not provided or is set to "auto", auto-detection of the
output format will occur. In particular, it will treat it as a number,
hash, or boolean if it can be successfully parsed as such. If it is a
string which matches the pattern for a data object ID, it will encapsulate
it in a DNAnexus link hash; otherwise it is treated as a simple string.dx-jobutil-add-output counts "$counts_id" --class=file#!/bin/bash
main() {
echo "Value of bam : '$bam'" # 1
echo "Value of bam_path : '$bam_path'"
echo "Value of bam_name : '$bam_name'"
echo "Value of bam_prefix: '$bam_prefix'"
dx download "$bam" -o "$bam_name" # 2
outfile="$bam_prefix.txt" # 3
samtools view -c "$bam_name" > "$outfile" # 4
counts_id=$(dx upload "$outfile" --brief) # 5
dx-jobutil-add-output counts "$counts_id" --class=file # 6
}Value of bam : '{"$dnanexus_link": "file-FpQKQk00FgkGV3Vb3jJ8xqGV"}'
Value of bam_path : '/home/dnanexus/in/bam/NA12878.bam'
Value of bam_name : 'NA12878.bam'
Value of bam_prefix: 'NA12878'$ dx-download-all-inputs -h
usage: dx-download-all-inputs [-h] [--except EXCLUDE]
[--parallel] [--sequential]
Note: this is a utility for use by bash apps running in the DNAnexus Platform.
Downloads all files that were supplied as inputs to the app. By
convention, if an input parameter "FOO" has value
{"$dnanexus_link": "file-xxxx"}
and filename INPUT.TXT, then the linked file will be downloaded into the
path:
$HOME/in/FOO/INPUT.TXT
If an input is an array of files, then all files will be placed into
numbered subdirectories under a parent directory named for the input. For
example, if the input key is FOO, and the inputs are {A, B, C}.vcf then,
the directory structure will be:
$HOME/in/FOO/0/A.vcf
1/B.vcf
2/C.vcf
Zero padding is used to ensure argument order. For example, if there are 12
input files {A, B, C, D, E, F, G, H, I, J, K, L}.txt, the directory
structure will be:
$HOME/in/FOO/00/A.vcf
...
11/L.vcf
This allows using shell globbing (FOO/*/*.vcf) to get all the files in the
input order.
options:
-h, --help show this help message and exit
--except EXCLUDE Do not download the input with this name. (May be used
multiple times.)
--parallel Download the files in parallel
--sequential Download the files sequentially#!/bin/bash
main() {
echo "Value of bam : '$bam'"
echo "Value of bam_path : '$bam_path'"
echo "Value of bam_name : '$bam_name'"
echo "Value of bam_prefix: '$bam_prefix'"
dx-download-all-inputs # 1
outfile="$bam_prefix.txt" # 2
samtools view -c "$bam_path" > "$outfile"
counts_id=$(dx upload "$outfile" --brief)
dx-jobutil-add-output counts "$counts_id" --class=file
}
To begin, you'll create a bash app to run CNVkit, which will find "genome-wide copy number from high-throughput sequencing." Create a local directory to hold your work, and consider putting the contents into a source code repository like Git.
In this example, you will:
Use various package managers to install dependencies
Build an asset
Learn to use dx-download-all-inputs and dx-upload-all-outputs
From the web interface, select "Projects → All Projects" to see your project list. Click the "New Project" button to create a new project called "CNVkit." Alternatively, use dx new project to do this from the command line. However you choose to create a project, be sure this has been selected by running dx pwd to check your current working directory and using dx select to select the project, if needed.
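From the command line, that might look like the following sketch (the project name is just an example):
$ dx new project CNVkit
$ dx select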
Inside your working directory, run the command dx-app-wizard cnvkit_bash to launch the app wizard. Optionally provide a title, summary, and version at the prompts.
The app will accept two inputs:
One or more BAM files of the tumor samples: Give this input the name bam_tumor with the label "BAM Tumor Files." For the class, choose array:file, and indicate that this is not an optional parameter.
A reference file: Give this input the name reference with the label "Reference." For the class, choose file, and indicate that this is not an optional parameter.
When prompted for the third input, press Enter to end the inputs.
Define three outputs, each of the type array:file with the following names and whatever labels you feel are appropriate:
cns
cns_filtered
plot
Press Enter when prompted for the fourth output to indicate you are finished.
Press Enter to accept the default value for the timeout policy.
Type bash for the programming language.
Type y to indicate that the app will need internet access.
Type n to indicate that the app will not need access to the parent project.
You should see a message saying the app's template was created in a directory whose name matches the app's name. For instance, I have the following:
This is a JSON file containing metadata that will be used to create the app on the DNAnexus platform.
A stub for user documentation.
A stub for developer documentation.
A template bash script for the app's functionality.
The dxapp.json file that was created by the wizard should look like the following:
See the documentation for a more complete understanding of all the possible fields and their implications.
CNVkit has dependencies on both Python and R modules that must be installed before running. In the dxapp.json, you can specify dependencies that can be installed with the following package managers:
apt (Ubuntu)
pip (Python)
cpan (Perl)
The Python module cnvkit can be installed via pip, but the software also requires an R module called DNAcopy that must be installed using BiocManager, which must first be installed using cran. This means you'll have to manually install the DNAcopy module when the app starts.
To add these runtime dependencies, use a text editor to update the runSpec and add the following execDepends section that will install the Python cnvkit and R BiocManager modules before the app is executed:
In the inputSpec, change the patterns to match the expected file extensions:
bam_tumor: *.bam
reference: *.cnn
Your dxapp.json should now look like the following:
The default bash code generated by the wizard starts with a generous header of comments that you may or may not wish to keep. The default code prints the values of the input variables, then downloads the input files individually. The app code belongs in the middle, after downloading the inputs and before uploading the outputs:
Replace src/cnvkit_bash.sh with the following code:
Rather than downloading the inputs individually as in the original template, this version downloads all the inputs in parallel with the following command:
This will create an in directory with subdirectories named according to the input names. Note that the bam_tumor input is an array of files, so this directory will contain numbered subdirectories starting at 0 for each input file:
Similarly, the preceding code uses dx-upload-all-outputs, which expects an out directory with subdirectories named according to each of the output specifications.
Use dx pwd to ensure you are in the correct project and dx select to change projects, if necessary. If you are inside the bash source directory where the dxapp.json file exists, you can run dx build -f. If you are in the parent directory, run dx build -f cnvkit_bash. Here is a sample output from successfully compiling the app:
The -f|--overwrite flag indicates you wish to overwrite any previous version of the applet. You may also want to use the -a|--archive flag to move any previous versions to an archived location. You won't need either of these flags the first time you compile, but subsequent builds will require that you indicate how to handle previous versions of the applet. Run dx build --help to learn more about build options.
Download this BAM file and add it to the inputs directory
Indicate an output directory, click the Run button, and then click the "View Log" to watch the job's progress.
You can also run the applet on the command line with the -h|--help flag to verify the inputs and outputs:
Select the input files on the web interface to note the file IDs that can be used to execute the app from the command line as follows:
You should see output from the preceding command that includes a JSON document with the inputs:
Note that you can place this JSON into a file and launch the applet with the inputs specified with the -f|--input-json-file option, as follows. Use dx run -h to learn about other command-line options:
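$ dx run cnvkit_bash -f inputs.json -y
Here inputs.json is a hypothetical filename for the JSON document you saved with the inputs.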
Note the job ID from dx run, and use dx watch to watch the job to completion and dx describe to view the job's metadata. Alternatively, you can use the web platform to launch the job, using the file selector to specify each of the inputs, then use the "Monitor" view to check the job's status and view the output reference file when the job completes.
You'll notice the applet takes quite a while to run (around 14 minutes for me) because of the module installations. You can build an asset for these installations and use this in dxapp.json. Create a directory called cnvkit_asset with the following file dxasset.json:
Also create a Makefile with the following contents:
Run dx build_asset to create the asset. This will launch a job that will report the asset ID at the end:
Update the runSpec in dxapp.json to the following:
Use dx build -f and note the new app's ID. Create a JSON input as follows:
Launch the new app from the CLI with the following command:
Use dx watch with the new job ID to see how the run now uses the asset to run faster. I see about a 10-minute difference with the asset.
You learned more ways to include app dependencies using package managers and a Makefile as well as by building an asset. The first strategy happens at runtime while the latter builds all the dependencies before the applet is run, making the runtime much faster.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Press Enter to accept the default value for the instance type or select one from the list shown.
cran (R)
gem (Ruby)
$ find cnvkit_bash -type f
cnvkit_bash/dxapp.json
cnvkit_bash/Readme.md
cnvkit_bash/Readme.developer.md
cnvkit_bash/src/cnvkit_bash.sh
{
"name": "cnvkit_bash",
"title": "cnvkit_bash",
"summary": "cnvkit_bash",
"dxapi": "1.0.0",
"version": "0.0.1",
"inputSpec": [
{
"name": "bam_tumor",
"label": "BAM Tumor Files",
"class": "array:file",
"optional": false,
"patterns": [
"*"
],
"help": ""
},
{
"name": "reference",
"label": "Reference",
"class": "file",
"optional": false,
"patterns": [
"*"
],
"help": ""
}
],
"outputSpec": [
{
"name": "cns",
"label": "CNS",
"class": "array:file",
"patterns": [
"*"
],
"help": ""
},
{
"name": "cns_filtered",
"label": "CNS Filtered",
"class": "array:file",
"patterns": [
"*"
],
"help": ""
},
{
"name": "plot",
"label": "Plot",
"class": "array:file",
"patterns": [
"*"
],
"help": ""
}
],
"runSpec": {
"timeoutPolicy": {
"*": {
"hours": 48
}
},
"interpreter": "bash",
"file": "src/cnvkit_bash.sh",
"distribution": "Ubuntu",
"release": "20.04",
"version": "0"
},
"access": {
"network": [
"*"
]
},
"regionalOptions": {
"aws:us-east-1": {
"systemRequirements": {
"*": {
"instanceType": "mem1_ssd1_v2_x4"
}
}
}
}
}"runSpec": {
"interpreter": "bash",
"file": "src/cnvkit_bash.sh",
"distribution": "Ubuntu",
"release": "20.04",
"version": "0",
"execDepends": [
{
"name": "cnvkit",
"package_manager": "pip"
},
{
"name": "BiocManager",
"package_manager": "cran"
}
],
"timeoutPolicy": {
"*": {
"hours": 48
}
}
}
{
"name": "cnvkit_bash",
"title": "cnvkit_bash",
"summary": "cnvkit_bash",
"dxapi": "1.0.0",
"version": "0.0.1",
"inputSpec": [
{
"name": "bam_tumor",
"label": "BAM Tumor Files",
"class": "array:file",
"optional": false,
"patterns": [
"*.bam"
],
"help": ""
},
{
"name": "reference",
"label": "Reference",
"class": "file",
"optional": false,
"patterns": [
"*.cnn"
],
"help": ""
}
],
"outputSpec": [
{
"name": "cns",
"label": "CNS",
class": "array:file",
"patterns": [
"*"
],
"help": ""
},
{
"name": "cns_filtered",
"label": "CNS Filtered",
"class": "array:file",
"patterns": [
"*"
],
"help": ""
},
{
"name": "plot",
"label": "Plot",
"class": "array:file",
"patterns": [
"*"
],
"help": ""
}
],
"runSpec": {
"timeoutPolicy": {
"*": {
"hours": 48
}
},
"execDepends": [
{
"name": "cnvkit",
"package_manager": "pip"
},
{
"name": "BiocManager",
"package_manager": "cran"
}
],
"interpreter": "bash",
"file": "src/cnvkit_bash.sh",
"distribution": "Ubuntu",
"release": "20.04",
"version": "0"
},
"access": {
"network": [
"*"
]
},
"regionalOptions": {
"aws:us-east-1": {
"systemRequirements": {
"*": {
"instanceType": "mem1_ssd1_v2_x4"
}
}
}
}
}
main() {
echo "Value of bam_tumor: '${bam_tumor[@]}'"
echo "Value of reference: '$reference'"
# The following line(s) use the dx command-line tool to download your file
# inputs to the local file system using variable names for the filenames. To
# recover the original filenames, you can use the output of "dx describe
# "$variable" --name".
dx download "$reference" -o reference
for i in ${!bam_tumor[@]}
do
dx download "${bam_tumor[$i]}" -o bam_tumor-$i
done
>>>>> Here is where the app code belongs <<<<<
# The following line(s) use the dx command-line tool to upload your file
# outputs after you have created them on the local file system. It assumes
# that you have used the output field name for the filename for each output,
# but you can change that behavior to suit your needs. Run "dx upload -h"
# to see more options to set metadata.
cns=$(dx upload cns --brief)
cns_filtered=$(dx upload cns_filtered --brief)
plot=$(dx upload plot --brief)
# The following line(s) use the utility dx-jobutil-add-output to format and
# add output variables to your job's output as appropriate for the output
# class. Run "dx-jobutil-add-output -h" for more information on what it
# does.
dx-jobutil-add-output cns "$cns" --class=file
dx-jobutil-add-output cns_filtered "$cns_filtered" --class=file
dx-jobutil-add-output plot "$plot" --class=file
}
#!/bin/bash
# Set pragmas to print commands and fail on errors
set -exuo pipefail
# Install required R module
Rscript -e "BiocManager::install('DNAcopy')"
# Verify the value of inputs
echo "Value of bam_tumor: '${bam_tumor[@]}'"
echo "Value of reference: '$reference'"
# Place all inputs into the "in" directory
dx-download-all-inputs --parallel
# Use "_path" versions of inputs for file paths
cnvkit.py batch \
${bam_tumor_path[@]} \
-r ${reference_path} \
-p $(expr $(nproc) - 1) \
-d cnvkit-out/ \
--scatter
# Make out directories for each output spec
mkdir -p ~/out/cns/ ~/out/cns_filtered/ ~/out/plot/
# Move CNVkit outputs to the "out" directory for upload
mv cnvkit-out/*.call.cns ~/out/cns_filtered/
mv cnvkit-out/*.cns ~/out/cns/
mv cnvkit-out/*-scatter.png ~/out/plot/
# Upload and annotate all output files
dx-upload-all-outputs --parallel
dx-download-all-inputs --parallel
in/bam_files/0/...
in/bam_files/1/...
in/reference/...
$ dx build -f
{"id": "applet-GFyV3kj0VGFkV8k04f3K11QY"}$ dx run applet-GFyV3kj0VGFkV8k04f3K11QY -h
usage: dx run applet-GFyV2G8054JBQXY64g4F7ZKk [-iINPUT_NAME=VALUE ...]
Applet: cnvkit_bash
cnvkit_bash
Inputs:
BAM Tumor Files: -ibam_tumor=(file) [-ibam_tumor=... [...]]
Reference: -ireference=(file)
Outputs:
CNS: cns (array:file)
CNS Filtered: cns_filtered (array:file)
Plot: plot (array:file)
$ dx run -y --watch applet-GFyV3kj0VGFkV8k04f3K11QY \
-ibam_tumor=file-GFxXjV006kZVQPb20G85VXBp \
-ireference=file-GFxXvpj06kZfP0QVKq2p2FGF \
--destination /outputs
Using input JSON:
{
"bam_tumor": [
{
"$dnanexus_link": "file-GFxXjV006kZVQPb20G85VXBp"
}
],
"reference": {
"$dnanexus_link": "file-GFxXvpj06kZfP0QVKq2p2FGF"
}
}
$ dx run -y --watch applet-GFyV3kj0VGFkV8k04f3K11QY \
-f cnvkit_bash/inputs.json \
--destination /outputs
{
"name": "cnvkit_asset",
"title": "cnvkit_asset",
"description": "cnvkit_asset",
"version": "0.0.1",
"distribution": "Ubuntu",
"release": "20.04",
"execDepends": [
{
"name": "cnvkit",
"package_manager": "pip"
},
{
"name": "BiocManager",
"package_manager": "cran"
}
]
}
SHELL=/bin/bash -exuo pipefail
all:
sudo Rscript -e "BiocManager::install('DNAcopy')"
Asset bundle 'record-GFyVY000X1ZK3yGg4qv32GXv' is built and can now be used
in your app/applet's dxapp.json
"runSpec": {
"timeoutPolicy": {
"*": {
"hours": 48
}
},
"assetDepends": [{"id": "record-GFyVY000X1ZK3yGg4qv32GXv"}],
"interpreter": "bash",
"file": "src/cnvkit_bash.sh",
"distribution": "Ubuntu",
"release": "20.04",
"version": "0"
},
$ cat inputs.json
{
"bam_tumor": [
{
"$dnanexus_link": "file-GFxXjV006kZVQPb20G85VXBp"
}
],
"reference": {
"$dnanexus_link": "file-GFxXvpj06kZfP0QVKq2p2FGF"
}
}
$ dx run applet-GFyVppQ0VGFxvvx44j43YyPz -f inputs.json -y
To begin, you'll code a "Hello, World!" workflow that captures the output of a command into a file. WDL syntax may look familiar if you know any C-family language like Java or Perl. For example, keywords like workflow and task are used to define blocks of statements contained inside matched curly braces ({}), and variables are defined using a data type like String or File.
In this example, you will:
Write a simple workflow in WDL
Learn two ways to capture the standard out (STDOUT) of a command block
To see this in action, make a hello directory for your work, and inside that create the file workflow.wdl with the following contents:
The version 1.0 statement states that the following WDL follows the 1.0 specification.
The workflow keyword defines a workflow name. The contents of the workflow are enclosed in matched curly braces.
The input block describes the parameters for the workflow.
WDL defines several data types you can use to describe an input value. This workflow requires a String.
WDL is not whitespace dependent, so indentation is based on your preference.
In the Setup section, you should have installed the miniwdl tool, which is useful for checking the syntax of your WDL. The following command shows the output when there are no problems:
Introduce an error in your WDL to see how the output changes. For instance, change the version to 2.0 and observe the error message:
Or change the call to write_greetings:
Cromwell will also find this error, but the message will be buried in literally thousands of lines of output.
Note that miniwdl uses a different parser than dxCompiler, and each has slightly different ideas of what constitutes valid syntax. For example, miniwdl requires commas in between input items but dxCompiler does not. In spite of their differences, I appreciate the concise reporting of errors that miniwdl provides.
To execute this workflow locally using Cromwell, you must first create a JSON file to define the input name. Create a file called inputs.json with the following contents if you'd like to extend salutations to my friend Geoffrey:
Next, run the following command to execute the workflow:
The output will be copious and should include an indication that the command was successful and the output landed in a file in the cromwell-executions directory that was created:
You can use the cat (concatenate) command to see the contents of the file. Be sure to change the file path to the one created by your execution:
Here is another way to write the command block and capture STDOUT to a named file:
The command block here uses triple angle brackets to enclose the shell commands.
The variable must be interpolated with ~{} because of the triple angle brackets. The Unix redirect operator > is used to send the STDOUT from echo into the file out.txt.
If you execute this version, the output should show that the file out.txt was created instead of the file stdout:
I can use cat again to verify that the same file was created:
Now that you have verified that the workflow runs correctly on your local machine, it's time to compile this onto the DNAnexus platform. First, create a project in your organization and take note of the project ID. I'll demonstrate using the dx command-line interface to create a project called Workflow Test:
All the dx commands will print help documentation if you supply the -h or --help flags. For instance, run dx new project --help.
You can also use the web interface, in which case you should use dx select to switch to the project. Next, use dxCompiler to compile the workflow into a workflows directory in the new project. In the following command, the dxCompiler prints the new workflow ID upon success:
Use the web interface to inspect the new workflow as shown in Figure 1. Click on the info button (an "i" in a circle to the right of the "Run" button) to verify the workflow ID is the same as you see on the command line.
Use the "Run" button in the web interface to launch the applet as shown in Figure 2. As shown in Figue 2, I indicate the applet's outputs should written to the outputs directory.
Click on the "Analysis Inputs" view to specify a name for the greeting. In Figure 3, you see I have selected the name "Jonas."
Click "Start Analysis" to start the workflow. The web interface will show the progress of running the applet as shown in Figure 4.
Figure 5 shows check marks next to each step that has been completed. Click the button to show inputs and outputs, then click on the link to the output file, which may be stdout or out.txt depending on the version of the workflow you compiled.
Click on the output file name to view the contents of the file as shown in Figure 6.
Click on the "Monitor" view to see how long the job lasted and cost as shown in Figure 7.
You can also use the dx CLI to run the applet as shown in the following interactive session:
You can also specify the input JSON on the command line as a string or a file. In the following command, I provide the JSON as a string. Also note the use of the -y (yes) flag to have the workflow run without confirmation:
You can also place the JSON into a file like so:
You can execute the workflow with this JSON file as follows:
You may also run the workflow with the -h|--help flag to see how to pass the arguments on the command line:
For instance, you can also launch the app using the following command to greet "Keith":
However you choose to launch the workflow, the new run should be displayed in the "Monitor" view of the web interface. As shown in Figure 8, the new run finished in under 1 minute.
To find out more about the latest run, click on job's name in the run table. As shown in Figure 9, the platform will reuse files from the first run as it sees that nothing has changed. This is called "smart reuse," and you can disable this feature if you like.
You can also use the CLI to view the results of the run with the dx describe command:
Notice in the preceding output that the Output lists file-GFbPkBj0XFYgB7Vj4pF8XXBQ. You can cat the contents of this file with the CLI:
Alternately, you can download the file:
The preceding command should create a new local file called stdout or out.txt, depending on the version of the workflow you compiled. Use the cat command again to verify the contents:
You can create command-line shortcuts for all the steps of checking and building your workflow by recording them as targets in a Makefile as follows:
GNU make (or a similar Make program, which you may need to install) can turn the command make local into the listed Cromwell command to run one of the workflow versions. Makefiles are a handy way to document your work and automate your efforts.
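For example, with the Makefile shown later in place, the individual steps reduce to short commands (a sketch of typical usage with the targets defined in that file):
$ make check
$ make local
$ make app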
You should now be able to do the following:
Write a valid WDL workflow that accepts a string input and interpolates that string in a bash command.
Capture the standard output of a command block either using the stdout() WDL directive or by redirecting the output of a Unix command to a named file.
Define a File type output from a task
In the next section, you'll learn how to accept a file input and launch parallel processes to speed execution of large tasks.
In this chapter, you learned some more WDL functions.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Building and running Nextflow pipelines on DNAnexus.
A Nextflow pipeline script is structured as a folder with Nextflow scripts with optional configuration files and subfolders. Below are the basic elements of the folder structure when building a Nextflow executable:
(Required) A major Nextflow file with the extension .nf containing the pipeline. The default filename is main.nf. A different filename can be specified in the nextflow.config file using manifest.mainScript = 'myfile.nf'
(Optional, recommended) A nextflow.config file.
Create the code for fastqc-nf
We are going to add each file into a folder called fastqc-nf
This is a very simple applet containing only one process which runs FASTQC on files specified using an input samplesheet or from a folder in a project on platform.
It has only three files:
main.nf : The pipeline script file
nextflow.config : Contains config info and sets params
nextflow_schema.json : Specifies the information used by the UI/CLI run command to serve the nextflow params to the user on DNAnexus
The main.nf file
Let's look at the main.nf file. As a reminder, this file can have a different name, specified in the nextflow.config file using manifest.mainScript = 'myfile.nf' if needed.
main.nf
DNAnexus expects Nextflow pipelines to use the Nextflow DSL2 standard. If you have learned Nextflow after December 2022 (when Nextflow version 22.12.0 was released) you are using DSL2.
"In Nextflow version 22.03.0-edge, DSL2 became the default DSL version. In version 22.12.0-edge, DSL1 support was removed, and the Nextflow documentation was updated to use DSL2 by default."
Each process must use a Docker container to define the software environment for the process. See the documentation for more information on using Docker containers in Nextflow processes. Here I am using a public Docker image on quay.io. This is the same Docker container used by the nf-core fastqc module. You might notice that the container line in the nf-core fastqc module is missing 'quay.io'. This is because this part is expected to be given in the nextflow.config using docker.registry = 'quay.io'.
An example of using publishDir multiple times in one process to send outputs to subfolders
Only the 'copy' mode of publishDir is supported on DNAnexus. If you do not specify a mode, then the DNAnexus executor will use copy by default so both of the publishDir lines in the example above are valid.
Assuming at runtime you assign outdir the value of './results', this example places all output files ending in .html in ./results/fastqc/html and all output files ending in .zip in ./results/fastqc/zip on the head node of the Nextflow run.
The entire outdir, with its subfolder structure intact, will be copied to the platform location specified by '--destination' in the CLI or 'Output to' in the UI once all subjobs have completed.
Only relative paths are allowed for publishDir on DNAnexus and thus params.outdir (since this is where files are published to)
As a general rule, do not attempt to access files in the publishDir directories from within a Nextflow script, as this is bad practice for many reasons. Use channels to pass files between processes.
In this example applet, I have placed the process and workflow parts in the main.nf script. For larger multi-process applets, you can place your processes in modules/workflows/subworkflows and import them into the main script as done in nfcore pipelines.
The nextflow.config file
Full File:
Explanation of Each Section:
Enable docker by default for this pipeline
Define the input parameters. You can also do this in the main.nf script but by convention nfcore pipelines do it in the nextflow.config. There are three params in this workflow, 'samplesheet' which is a file input, 'reads_dir' which is a directory path and 'outdir' which is a string defining the name of the output folder.
Here I have assigned samplesheet and reads_dir the value of null. Thus if the user does not provide a samplesheet or a reads_dir to the pipeline at runtime, the pipeline will fail. For items such as the samplesheet that should always or nearly always change at runtime, it is valuable to assign them a null value instead of a default so that a user does not accidentally run the pipeline with a default samplesheet thinking they have used a different one.
Here outdir is assigned a default of './results'. Thus, if a user does not specify a string for outdir at runtime, it will use './results'. If a user does specify an outdir, it will use the user specified one instead.
A common setting to make a process fail quickly and loudly when it encounters an issue.
Error Strategy: I have not defined an error strategy in the nextflow.config file, so the default strategy (for both the local Nextflow executor and the DNAnexus executor) is 'terminate'. See the documentation for more detailed information on choosing an errorStrategy.
Queue size: I have also not defined the queueSize, so when this applet is run, a maximum of 5 subjobs will run in parallel at any one time, unless you pass the -queue-size flag to the nextflow_run_opts option for the applet.
The nextflow_schema.json file is needed to reflect the Nextflow params (--samplesheet, --reads_dir and --outdir in this case) as DNAnexus applet inputs in the CLI and UI. If it is not present, you will not get the -isamplesheet, -ireads_dir and -ioutdir options for your applet inputs. You can also use it to do parameter validation at runtime using plugins.
nextflow_schema.json
Once you have written your script and know your parameters, you can make the schema quite quickly using the nf-core schema builder website. Note: do not put sensitive information into this builder, as information in it is stored by nf-core for 2 weeks.
There is also the option of using the nf-core schema tools on your computer to create it. You may need to manually add a format of either file-path or directory-path to some parameters if the tool doesn't do it for you.
Here we will explain how to use the schema builder website.
In the New Schema section, click the blue Submit button to start.
Near the top of the page, click the 'Add group' button. You need at least one group in your schema file to have it function on platform. All parameters must be placed into a group (you can do this by dragging and dropping them into the group). For example you might have one group called Inputs for all your input parameters and a group called Output for your output parameters with the appropriate parameters placed into the correct groups. Click required for every non optional parameter.
The default type of input is a string input. For file and directory path input parameters, click the little wheel to the right
To remove an input parameter for the pipeline from the UI and CLI, you can delete it from the nextflow_schema.json file, or place it in a section of the nextflow_schema.json file that is not referenced in the allOf section at the bottom of the json file.
You can also remove entire sections by removing their reference from the allOf section without deleting them from the file.
Ensure that you are in the project that you want to build the applet in, using dx pwd or dx env. Use dx select to switch to the correct project if required.
Assuming you have the folder called fastqc-nf with these contents (main.nf is required at a minimum):
Build applet - the applet will build in the root of your project
If you are in the fastqc-nf folder on your machine, you will need to cd .. back up a level for the command below to work.
or build using --destination to set the project level folder for the applet
or, to build in the root of the project and just change the applet name to test-fastqc-nf, run
You should see an output like the one below but with a different applet ID.
Use -a with dx build to archive previous versions of your applet and -f to force overwrite previous applet versions. The archived versions are placed in a folder called .Applet_archive in the root of the project.
You can see the build help using dx build -h or dx build --help
In the DNAnexus UI:
file-path will be rendered as a file-picker which enables loading of a file object by selecting it in the UI (can only select one file)
directory-path will be rendered as a string and will appear in the UI as a text box input. You can point to a directory by typing a string path such as dx://<project-id>:/test/ in the box or multiple files in a path such as dx://<project-id>:/test/*_R{1,2}.fastq.gz
string
Here is part of the fastqc-nf run setup screen
Notice how samplesheet has 'Select File' and a file icon but outdir and reads_dir have text input boxes.
This is because samplesheet was given 'file-path' in the nextflow_schema.json, but outdir and reads_dir were given 'directory-path', which renders as a string input, hence the text box.
In the DNAnexus CLI:
Run the applet with -h to see the input parameters for the applet
Excerpt of output from command above
string will appear as class string e.g., for param outdir
The default here is what we specified as the default in nextflow_schema.json. It cannot 'see' the default that we set in the nextflow.config so make sure they match when building the json.
directory-path will appear as class (string) e.g., for param reads_dir
See for more information on options for nextflow_schema.json on DNAnexus.
When placing a path to a file on the DNAnexus platform in a samplesheet, use the format dx://project-xxx:/path/to/file
Here is an example of a samplesheet with one sample (format of samplesheet is determined by you - this is just for illustration purposes)
In your project on the platform, click the fastqc-nf applet.
In the run applet screen, click 'Output to' and choose your output location.
Click 'Next'
At the setup screen, either input a samplesheet or write the path for reads_dir. In the image below, I have used the reads_dir param. Replace 'project-xxx' and '/path/to/reads' with your project ID and the folder name that your reads are in.
Review the rest of the inputs and change anything that you want, e.g., turn on 'preserve_cache'.
Click 'Start Analysis'
Review the name, output location etc
Click 'Launch Analysis'
Running the fastqc applet with the reads_dir as input
I am turning on preserve_cache and using -inextflow_run_opts in the command below to demonstrate how to add them to the command, but neither is required here.
Note that the *_{1,2}.fastq.gz is needed here for Channel.fromFilePairs to correctly pair up related files
I do not need -profile docker in -inextflow_run_opts
Running the fastqc applet with the samplesheet as input
Notice the different way that the path to the samplesheet is specified compared to the reads_dir in the previous example. You can read more about how this works in the documentation.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Some of the links on these pages will take the user to pages that are maintained by third parties. The accuracy and IP rights of the information on these third-party pages are the responsibility of those third parties.
The call keyword will execute the task named write_greeting. This is similar to executing a function in code.
The input keyword allows you to pass arguments to the task. The workflow's name input will be passed as greet_name to the task.
The task keyword defines a task called write_greeting.
The task also defines an input block with a parameter greet_name. It would be fine to call this name because it would not conflict with the workflow's name.
The command keyword defines a block of shell commands to execute. The block may be denoted with matched curly braces ({}) or triple angle brackets (<<</>>>).
The shell command echo prints a salutation to standard out, AKA STDOUT, which is the normal place for output from a command-line program to appear. The variable greet_name is interpolated in the string by surrounding it with ~{} or ${} because the command block uses curly braces. When using triple angle brackets, only the first syntax is valid.
outfile is set to the file out.txt, which is created by the command block.
Check the syntax of a WDL file using miniwdl.
Execute a WDL workflow on your local computer using Cromwell with the inputs defined in a JSON file.
Create a new project to contain a workflow.
Compile a WDL workflow into a DNAnexus applet using the dxCompiler.
Run an applet using the web interface or the CLI.
Inspect the file output of an applet using the web interface or the CLI.
Download a file from the platform.
Use a Makefile to document and automate the steps for building and running a workflow.









nextflow_schema.json
(Optional) Subfolders and other configuration files. Subfolders and other configuration files can be referenced by the major Nextflow file or nextflow.config via the include or includeConfig keyword. Ensure that all referenced subfolders and files exist under the pipeline script folder at the time of building or importing the pipeline.
(Optional) A bin folder containing scripts required by the pipeline can also be used and this will be added to the PATH environment variable by nextflow - for more info see the nextflow documentation on custom scripts and tools
For other files/folders such as assets, an nf-core flavored folder structure is encouraged but not required
You should define the cpus, memory, disk (at least one of these 3), or you can use machineType and the name of the exact DNAnexus instance that you want to use for this process.
For example machineType 'mem2_ssd1_v2_x2'
If you do not specify the resources required for a process, it will by default use the mem2_ssd1_v2_x4 instance type (this is the same machine type used for the head node) and processes that require more memory than this will fail.
You should use the publishDir directive to capture the output files that you want to publish from each process. It is generally advisable to publish your output files to an output directory defined by params.outdir (the naming doesn't matter as long as it's consistent within your pipeline). You can have as many subfolders of your outdir as needed, and you can use the publishDir directive multiple times in the same process to send different output files to different subfolders.
At the bottom of the popup in the Format section, for a file input, choose File path, or for a directory path choose Directory path. Having these two correct is important for how you specify the inputs on the platform.
When you are finished building your schema file, click 'Finished', then 'Copy pipeline schema' and paste the information into a file called nextflow_schema.json in the same directory as your applet main.nf and nextflow.config files.
If you note the Schema cache ID then you can type that into the website to pull up and edit that file within 14 days.
When (string) is given for a parameter (used for folder paths and strings; the input is of the 'string' class), use dx://project-XXXXX:/path/to/folder, e.g., dx run fastqc-nf -ireads_dir=dx://project-GgYbKGQ0QFpxF6qkPK4KxQ6Q:/FASTQ/*_{1,2}.fastq.gz
file-path will appear as class file, e.g., for param samplesheet:
When (file) is given for a parameter (i.e., the input is of the 'file' class), use project-XXXXX:/path/to/file, e.g., dx run fastqc-nf -isamplesheet=project-XXXXX:/samplesheet-example.csv ....
nextflow.config
--name names the job







version 1.0
workflow hello_world {
input {
String name
}
call write_greeting {
input: greet_name = name
}
}
task write_greeting {
input {
String greet_name
}
command {
echo 'Hello, ${greet_name}!'
}
output {
File outfile = stdout()
}
}
$ miniwdl check workflow.wdl
workflow.wdl
workflow hello_world
call write_greeting
task write_greeting
$ miniwdl check workflow.wdl
(workflow.wdl Ln 0 Col 0) unknown WDL version 2.0; choices:
draft-2, 1.0, development, 1.1
$ miniwdl check workflow.wdl
(workflow.wdl Ln 8 Col 5) No such task/workflow: write_greetings
call write_greetings {
^^^^^^^^^^^^^^^^^^^^^^
{ "hello_world.name": "Geoffrey" }
$ java -jar ~/cromwell-82.jar run --inputs inputs.json workflow.wdl
{
"hello_world.write_greeting.outfile":
"/Users/[email protected]/work/srna/wdl_tutorial/hello/
cromwell-executions/hello_world/7f02fe78-4aff-4e01-95da-c9b6e021773d/
call-write_greeting/execution/stdout"
}
$ cat cromwell-executions/hello_world/7f02fe78-4aff-4e01-95da-c9b6e021773d/call-write_greeting/execution/stdout
Hello, Geoffrey!
version 1.0
workflow hello_world {
input {
String name
}
call write_greeting {
input: greet_name = name
}
}
task write_greeting {
input {
String greet_name
}
command <<<
echo 'Hello, ~{greet_name}!' > out.txt
>>>
output {
File outfile = "out.txt"
}
}
{
"outputs": {
"hello_world.write_greeting.outfile":
"/Users/[email protected]/work/srna/wdl_tutorial/hello/
cromwell-executions/hello_world/1dd3abd8-be70-418b-9a31-b4ea9d5add99/
call-write_greeting/execution/out.txt"
},
"id": "1dd3abd8-be70-418b-9a31-b4ea9d5add99"
}
$ cat cromwell-executions/hello_world/1dd3abd8-be70-418b-9a31-b4ea9d5add99/
call-write_greeting/execution/out.txt
Hello, Geoffrey!
$ dx new project "Workflow Test"
Created new project called "Workflow Test" (project-GFbKy7Q0ff1k3fGq48ZFZ45p)
Switch to new project now? [y/N]: y
$ java -jar ~/dxCompiler-2.10.2.jar compile workflow.wdl -folder /workflows \
> -project project-GFbKy7Q0ff1k3fGq48ZFZ45p
workflow-GFbP9480ff1zVQPG48zXpfzb
$ dx run workflow-GFbP9480ff1zVQPG48zXpfzb
Entering interactive mode for input selection.
Input: stage-common.name (stage-common.name)
Class: string
Enter string value ('?' for more options)
stage-common.name: Ronald
Select an optional parameter to set by its # (^D or <ENTER> to finish):
[0] stage-common.overrides___ (stage-common.overrides___)
[1] stage-common.overrides______dxfiles (stage-common.overrides______dxfiles)
[2] stage-0.greet_name (stage-0.greet_name) [default={"$dnanexus_link": {"outputField": "name", "stage": "stage-common"}}]
[3] stage-0.overrides___ (stage-0.overrides___)
[4] stage-0.overrides______dxfiles (stage-0.overrides______dxfiles)
[5] stage-outputs.overrides___ (stage-outputs.overrides___)
[6] stage-outputs.overrides______dxfiles (stage-outputs.overrides______dxfiles)
Optional param #:
The following 1 stage(s) will reuse results from a previous analysis:
Stage 2: outputs (job-GFbPJx80ff1gYQy5Fg3pK3GY)
Using input JSON:
{
"stage-common.name": "Ronald"
}
Confirm running the executable with this input [Y/n]: y
Calling workflow-GFbP9480ff1zVQPG48zXpfzb with output destination
project-GFbKy7Q0ff1k3fGq48ZFZ45p:/
Analysis ID: analysis-GFbPjVj0ff1ZypqJ8vQj8kjZ
$ dx run workflow-GFbP9480ff1zVQPG48zXpfzb -j '{"stage-common.name": "Ronald"}'
-y
The following 3 stage(s) will reuse results from a previous analysis:
Stage 0: common (job-GFbPjVj0ff1ZypqJ8vQj8kjf)
Stage 1: write_greeting (job-GFbPjVj0ff1ZypqJ8vQj8kjg)
Stage 2: outputs (job-GFbPJx80ff1gYQy5Fg3pK3GY)
Using input JSON:
{
"stage-common.name": "Ronald"
}
Calling workflow-GFbP9480ff1zVQPG48zXpfzb with output destination
project-GFbKy7Q0ff1k3fGq48ZFZ45p:/
Analysis ID: analysis-GFbPkFj0ff1k3fGq48ZFZ5Jy
$ cat app_inputs.json
{"stage-common.name": "Ronald"}$ dx run -f app_inputs.json workflow-GFbP9480ff1zVQPG48zXpfzb$ dx run workflow-GFbP9480ff1zVQPG48zXpfzb -h
usage: dx run workflow-GFbP9480ff1zVQPG48zXpfzb [-iINPUT_NAME=VALUE ...]
Workflow: hello_world
Inputs:
stage-common
stage-common.name: -istage-common.name=(string)
stage-common: Reserved for dxCompiler
stage-common.overrides___: [-istage-common.overrides___=(hash)]
stage-common.overrides______dxfiles: [-istage-common.overrides______dxfiles=(>
stage-0
stage-0.greet_name: [-istage-0.greet_name=(string, default={"$dnanexus_link":>
stage-0: Reserved for dxCompiler
stage-0.overrides___: [-istage-0.overrides___=(hash)]
stage-0.overrides______dxfiles: [-istage-0.overrides______dxfiles=(file) [-is>
stage-outputs: Reserved for dxCompiler
stage-outputs.overrides___: [-istage-outputs.overrides___=(hash)]
stage-outputs.overrides______dxfiles: [-istage-outputs.overrides______dxfiles>
Outputs:
stage-common.name: stage-common.name (string)
stage-0.outfile: stage-0.outfile (file)
$ dx run workflow-GFbP9480ff1zVQPG48zXpfzb -istage-common.name=Keith
Result 1:
ID analysis-GFbPjVj0ff1ZypqJ8vQj8kjZ
Class analysis
Job name hello_world
Executable name hello_world
Project context project-GFbKy7Q0ff1k3fGq48ZFZ45p
Billed to org-sos
Workspace container-GFbPjVj0ff1ZypqJ8vQj8kjb
Workflow workflow-GFbP9480ff1zVQPG48zXpfzb
Priority normal
State done
Root execution analysis-GFbPjVj0ff1ZypqJ8vQj8kjZ
Parent job -
Stage 0 common (stage-common)
Executable applet-GFbP93j0ff1py9y87vzB2QQJ
Execution job-GFbPjVj0ff1ZypqJ8vQj8kjf (done)
Stage 1 write_greeting (stage-0)
Executable applet-GFbP9380ff1XzVKkG9kyVg64
Execution job-GFbPjVj0ff1ZypqJ8vQj8kjg (done)
Stage 2 outputs (stage-outputs)
Executable applet-GFbP9400ff1pK6v113KJQF9g
Execution [job-GFbPJx80ff1gYQy5Fg3pK3GY] (done)
Cached from analysis-GFbPJx80ff1gYQy5Fg3pK3GP
Input stage-common.name = "Ronald"
[stage-0.greet_name = {"$dnanexus_link": {"analysis":
"analysis-GFbPjVj0ff1ZypqJ8vQj8kjZ", "stage":
"stage-common", "field": "name", "wasInternal": true}}]
Output stage-common.name = "Ronald"
stage-0.outfile = file-GFbPkBj0XFYgB7Vj4pF8XXBQ
Output folder /
Launched by kyclark
Created Wed Aug 3 15:52:55 2022
Finished Wed Aug 3 15:54:51 2022 (Wall-clock time: 0:01:55)
Last modified Wed Aug 3 15:54:54 2022
Depends on -
Tags -
Properties -
Total Price $0.00
detachedFrom null
rank 0
priceComputedAt 1659567291327
currency {"dxCode": 0, "code": "USD", "symbol": "$",
"symbolPosition": "left",
"decimalSymbol": ".",
"groupingSymbol": ","}
totalEgress {"regionLocalEgress": 0, "internetEgress": 0,
"interRegionEgress": 0}
egressComputedAt 1659567291327
costLimit null
$ dx cat file-GFbPkBj0XFYgB7Vj4pF8XXBQ
Hello, Ronald!
$ dx download file-GFbPkBj0XFYgB7Vj4pF8XXBQ
[===========================================================>] Completed 15
of 15 bytes (100%) /Users/[email protected]/work/srna/wdl_tutorial/stdout
$ cat stdout
Hello, Ronald!
WORKFLOW = workflow.wdl
PROJECT_ID = project-GFPQvY007GyyXgXGP7x9zbGb
DXCOMPILER = java -jar ~/dxCompiler-2.10.2.jar
CROMWELL = java -jar ~/cromwell-82.jar
check:
miniwdl check $(WORKFLOW)
local:
$(CROMWELL) run --inputs inputs.json $(WORKFLOW)
local2:
$(CROMWELL) run workflow2.wdl
app:
$(DXCOMPILER) compile $(WORKFLOW) \
-archive \
-folder /workflows \
-project $(PROJECT_ID)
clean:
rm -rf cromwell-workflow-logs cromwell-executions
samplesheet: [-isamplesheet=(file)]
(Nextflow pipeline required)
// Use newest nextflow dsl - not required to add this line - only dsl2 is supported on DNAnexus
nextflow.enable.dsl = 2
log.info """\
===================================
F A S T Q C - E X A M P L E
===================================
samplesheet : ${params.samplesheet}
reads_dir : ${params.reads_dir}
outdir : ${params.outdir}
"""
.stripIndent()
process FASTQC {
tag "FastQC - ${sample_id}"
container 'quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0'
cpus 2
memory { 4.GB * task.attempt }
publishDir "${params.outdir}", pattern: "*", mode:'copy'
input:
tuple val(sample_id), path(reads)
output:
path "*"
script:
"""
fastqc --threads ${task.cpus} $reads
"""
}
/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
MAIN WORKFLOW
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*/
workflow {
if (params.samplesheet != null && params.reads_dir == null) {
reads_ch = Channel
.fromPath(params.samplesheet)
.splitCsv()
.map { row -> tuple(row[0], row[1], row[2]) }
reads_ch.view()
FASTQC(reads_ch)
} else if (params.samplesheet == null && params.reads_dir != null) {
reads_ch = Channel.fromFilePairs(params.reads_dir)
reads_ch.view()
FASTQC(reads_ch)
} else {
error "Either samplesheet or reads_dir should be provided, not both"
}
}
workflow.onComplete {
log.info ( workflow.success ? "\nworkflow is done!\n" : "Oops .. something went wrong" )
}
process foo {
publishDir "${params.outdir}/fastqc/html", pattern "*.html", mode:'copy'
publishDir "${params.outdir}/fastqc/zip", pattern "*.zip"
..
}
// Default parameters
docker {
enabled = true
}
params {
samplesheet = null
reads_dir = null
outdir = "./results"
}
// Processes should always fail if any pipe element has a non-zero exit code.
process.shell = ['/bin/bash', '-euo', 'pipefail']
docker {
enabled = true
}
params {
samplesheet = null
reads_dir = null
outdir = "./results"
}
// Processes should always fail if any pipe element has a non-zero exit code.
process.shell = ['/bin/bash', '-euo', 'pipefail']
{
"$schema": "http://json-schema.org/draft-07/schema",
"$id": "https://raw.githubusercontent.com/YOUR_PIPELINE/master/nextflow_schema.json",
"title": "Nextflow pipeline parameters",
"description": "This pipeline uses Nextflow and processes some kind of data. The JSON Schema was built using the nf-core pipeline schema builder.",
"type": "object",
"definitions": {
"inputs": {
"title": "Inputs",
"type": "object",
"description": "",
"default": "",
"properties": {
"samplesheet": {
"type": "string",
"description": "Input samplesheet in CSV format",
"format": "file-path"
},
"reads_dir": {
"type": "string",
"description": "Reads directory for file pairs with wildcard",
"format": "directory-path"
},
"outdir": {
"type": "string",
"format": "directory-path",
"description": "Local path to output directory",
"default": "./results"
}
}
}
},
"allOf": [
{
"$ref": "#/definitions/inputs"
}
]
}
#select project
dx select project-ID
main.nf
nextflow.config
nextflow_schema.json
dx build --nextflow fastqc-nf
dx build -a --nextflow fastqc-nf --destination project-XXXXX:/TEST/fastqc-nf
dx build -a --nextflow fastqc-nf --destination project-XXXXX:/test-fastqc-nf
{"id": "applet-ID"}
dx run fastqc-nf -h
usage: dx run fastqc-nf [-iINPUT_NAME=VALUE ...]
Applet: fastqc-nf
fastqc-nf
Inputs:
outdir: [-ioutdir=(string)]
(Nextflow pipeline required) Default value:./results
reads_dir: [-ireads_dir=(string)]
(Nextflow pipeline required)
samplesheet: [-isamplesheet=(file)]
(Nextflow pipeline required)
....
outdir: [-ioutdir=(string)]
(Nextflow pipeline required) Default value:./results
sample_name,fastq_1,fastq_2
sampleA,dx://project-xxx:/path/to/sampleA_r1.fastq.gz,dx://project-xxx:/path/to/sampleA_r2.fastq.gz
dx run fastqc-nf \
-ireads_dir="dx://project-ID:/FASTQ/*_{1,2}.fastq.gz" \
-ioutdir="./fastqc-out-rd" \
-ipreserve_cache=true \
-inextflow_run_opts='-queue-size 10' \
--destination "project-ID:/USERS/FOLDERNAME" \
--name fastqc-nf-with-reads-dir \
-y
dx run fastqc-nf -isamplesheet="project-ID:/samplesheet-example.csv" \
-ioutdir="./fastqc-out-sh" \
--destination "project-ID:/USERS/FILENAME" \
--name fastqc-nf-with-samplesheet \
-y
reads_dir: [-ireads_dir=(string)]
(Nextflow pipeline required)
Users of the platform like to interact with it in a variety of ways (shown below), but this section is dedicated to those that want to learn how to interact with it using the command line, or CLI.
The CLI interacts with the platform in the following way:
The CLI (command line interface) is run locally on your own machine.
On your local machine, you will download the SDK (software development kit), which we also call dx-toolkit. Information on downloading it and other requirements is found in the Getting Started Guide. Once set up, this allows you to log into the platform and explore your data/ projects, create apps and workflows, and launch analyses.
API (application programming interface) servers are used to interact with the Platform using HTTP requests. The arguments for this request are fields in a JSON file. If you want more details on this structure, you can refer to the API documentation.
Please ensure that you are running Python 3 before starting this install.
To install:
To upgrade dxpy
Further details can be found in our documentation if you need them.
The dx command will be your most used utility for interacting with the DNAnexus platform. You can run the command with no arguments or with the -h or --help flags to see the usage:
Sometimes the usage may occupy your entire terminal, in which case you may see (END) to show that you are at the end of the documentation. Press q to quit the usage, or use the universal Ctrl-C to send an interrupt signal to the process to kill it.
Run dx help to read about the categories of commands you can run:
Let's start by using dx login to gain access to the DNAnexus platform from the command line. All dx commands will respond to -h|--help, so run the command with one of these flags to read the usage:
The help documentation is often called the usage because that is often the first word of the output. In the previous output, notice that all the arguments are enclosed in square brackets, e.g., [--token TOKEN]. This is a common convention in Unix documentation to indicate that the argument is optional. The lack of such square brackets means the argument is required.
Some of the arguments require a value to follow. For example, --token TOKEN means the argument --token must be followed by the string value for the token. Arguments like --save are known as flags. They are either present or not and often represent a Boolean value, usually "True" when present and "False" when absent.
The most basic usage for login is to enter your username and password when prompted:
You may also generate a token in the web UI for use on the command line:
Information on setting up tokens can be found in our documentation.
Use dx logout to log out of the platform. This invalidates a token.
If you are ever in doubt of your username, use dx whoami to see your identity.
When you ssh into a cloud workstation, you will be your normal DNAnexus user.
When running the ttyd app to access a cloud workstation through the UI, you will be the privileged Unix user root.
When you ssh into a running job, you will be the user dnanexus.
A project is the smallest unit of sharing in DNAnexus, and you must always work in the context of a project. Upon login, you will be prompted to select a project. To change projects, use dx select. Use -h|--help to view the usage:
When run with no options, you will be presented a list of your projects and privilege:
Press Enter to choose the first project, or select a number 0-9 to choose a project or m for "more" options. You can also provide a project name or ID as the first argument:
Use the --level option to specify only projects where you have a particular permission. For instance, dx select --level ADMINISTER will show only projects where you are an administrator.
Normally, projects are private to your organization, but the --public option will display the public projects that DNAnexus uses to share common resources like sequence files or indexes for reference genomes:
Press Ctrl-C to exit the program without making a selection.
If you are ever in doubt as to your current project, run dx pwd (print working directory):
Alternatively, you can run dx env to see your current environment:
If I wanted to share some data with a collaborator, I would use dx new project to create a new project to hold select data and apps. Following is the usage:
I will use this command to create a new project in the AWS US-East-1 region. See the documentation for a list of available regions. The command displays the new project ID and prompts to switch into the new project:
Next, I would use dx invite <user-id> to invite users to the project. Start with the usage to see how to call the command:
The usage shows that this command includes three positional arguments, the first of which (invitee) is required and the other two (project, permissions) are optional. Your currently selected project is the default project, and "VIEW" is the default permission. If you wish to indicate some permission other than "VIEW," you must specify the project first.
Use dx uninvite <user-id> to revoke a user's access to a project:
Earlier, I introduced dx pwd to print working directory to find my currently selected project.
Notice that the output shows the project name and the directory /, which is the root directory of the project:
The command dx ls will list the contents of a directory. Notice in the usage that the directory name is optional, in which case it will use the current working directory:
There is nothing to list because I just created this project, so I'll add some data next.
I will use the command dx cp to copy a small file from one of the public projects into my project. I'll start with the usage:
The usage shows source [source …], which is another Unix convention to indicate that the argument may be repeated. This means you can indicate several source files or directories to be copied to the final destination.
I'll copy the file hs38DH.dict from the project "Reference Genome Files: AWS US (East)" into the root directory of my new project. The command will only produce output on error:
I must specify the source file using the project and file ID. When you refer to files inside your current project, it's only necessary to use the file ID.
Now I can list the one file:
Often you'll want to use the file ID, which you can view using the -l|--long flag to see the long listing that includes more metadata:
I've decided I want to create a data directory to hold files such as this, so I will use dx mkdir data. The command will produce no output on success. A new listing shows data/ where the trailing slash indicates this is a directory:
To move the hs38DH.dict into the data directory, I can either use the file name or ID:
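For illustration, a sketch of either form (the file ID here is a placeholder; use the ID from your own long listing):
$ dx mv hs38DH.dict data/
$ dx mv file-xxxx data/   # hypothetical file ID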
A new listing shows that the file is no longer in the root directory:
I can specify the data directory to view the contents:
Alternatively, I can use dx cd data to change directories. The command dx pwd will verify that I'm in the new folder:
If I execute dx ls now, I'll see the contents of the data directory:
Return to the root directory of the project by running dx cd or dx cd /.
Another way to inspect the structure of a project is using dx tree:
With no options, you will see a tree structure of the project:
This command will also show the long listing with -l|--long:
I want to create a local file on my computer and add it to the project. I'll use the echo command to redirect some text into a file:
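A minimal sketch of creating the local file, assuming the contents are simply "hello":
$ echo "hello" > hello.txt
$ cat hello.txt
hello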
I'll use the dx upload command. The usage shows that filename is required and may be repeated.
There are many options to the command, and here are a few to highlight:
--brief: Display a brief version of the return value; for most commands, prints a DNAnexus ID per line
-r, --recursive: Upload directories recursively
--path [PATH], --destination [PATH]: DNAnexus path to upload file(s) to (default uses current project and folder if not provided)
Run dx upload hello.txt and see that the new file exists in the root directory of your current project:
You can also upload data using the UI. Under the "Add" menu, you will find the following:
Upload Data: Use your browser to add files to the project. This is the same as using dx upload.
Copy Data From Project: Add data from existing projects on the platform. This is the same as dx cp.
Add Data From Server: Add data from any publicly accessible URL such as an HTTP or FTP site. This is the same as running the app.
In addition, we offer an app.
I would like to check the new file on the platform. The dx cat command will, like the Unix cat concatenate command, print the entire contents of a file to the console:
I can use this to verify that the file was correctly uploaded:
You might expect the following command to upload hello.txt into the data directory:
Unfortunately, this will create a file called data alongside a directory called data:
I can verify that the data file contains "hello":
Note this important part of upload's usage:
This brings up an interesting point that file names are not unique on the DNAnexus platform. The only unique identifier is the file ID, and so this is always the best way to refer to a file. To rectify the duplication, I will get the file ID:
I can remove the file using dx rm file-GXZB2180fF65j2G1197pP7By.
If I dx upload hello.txt file again, I will not overwrite the existing file. Rather, another copy of the file will be created with a new file ID:
The concept of immutability was covered in "Course 101 Overview of the DNAnexus Platform User Interface": Remember the crucially important fact that data objects on the DNAnexus platform are immutable. They can only be created (e.g., by uploading them) or removed, but they can never be overwritten. A given object ID always points to the same collection of bits, which leads to downstream benefits like reusing the outputs of jobs that share the same executable and input IDs.
I cannot remove the file by filename as it's not unique, so I'm prompted to select which file I want:
I used dx cat hello.txt to read the contents of the entire file because I knew the file had only one line. It's far safer to use dx head to look at just the first few lines (the default is 10):
For instance, I can peek at the data/hs38DH.dict file:
Another option to check the file is to download it:
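A sketch using dx download with the -o option to choose a local filename (using the file ID avoids ambiguity when the name is duplicated; the local filename here is just an example):
$ dx download file-GXZB1v80fF6BXJ8p7PvZPy1v -o hello_check.txt
$ cat hello_check.txt
hello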
Every data object on the platform has a unique identifier prefixed with the type of object such as "file-," "record-," or "applet-." Earlier, I saw that hello.txt has the ID file-GXZB1v80fF6BXJ8p7PvZPy1v. I can use the dx describe command to view the metadata:
I could use the filename, if it's unique, but it's always best practice to use the file ID:
As shown in the usage, the --delim option causes the output table to use whatever delimiter you want between the columns. This could be useful if you wish to parse the output programmatically. The tab character is the default delimiter, but I can use a comma like so:
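For illustration, a sketch of the comma-delimited form with the file ID noted above:
$ dx describe file-GXZB1v80fF6BXJ8p7PvZPy1v --delim ,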
The --json flag returns the same data in JavaScript Object Notation (JSON), which we'll discuss in a later chapter:
I can use dx describe to view the metadata associated with any object identifier on the platform. For instance, I'll use head to view the first few lines of the project's metadata:
Find another entity ID, such as your billing org, to use with the command.
I can use dx mv to move a file or directory within a project:
For instance, I can rename hello.txt to goodbye.txt with the command dx mv hello.txt goodbye.txt. The file ID remains the same:
I can also move goodbye.txt to the data directory and rename it back to hello.txt. Again, the file ID remains the same because I have only changed some of the file's metadata:
As noted in the preceding usage, I should use dx cp to copy data from one project to another. If I attempt to copy a file within a project, I will get an error:
The only way to make an actual copy of a file is to upload it again as I did earlier when I added the hello.txt file a second time.
Data objects on the platform exist as bits in AWS or Azure storage, and the associated metadata is stored in a DNAnexus database. If two projects are in the same region such as AWS US-East-1, then dx cp doesn't actually copy the bits but rather creates a new database entry pointing to the object. This means you don't pay for additional storage. Copying between regions, however, does make a physical copy of the bits and will cost money for data egress and storage. When in doubt, use dx describe <project-id> to see a project's "Region" attribute or check the "Settings" in the project view UI.
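For example, a quick sketch of checking a project's region from the CLI (the project ID is a placeholder):
$ dx describe project-xxxx | grep -i region   # hypothetical project ID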
The dx find command will help you search for entities including:
apps
globalworkflows
jobs
data
I can use the dx find data command to search data objects such as files and applets. I'll display the first part of the usage as it's rather long:
Run the command in the current project to see the two files:
I can use the --name option to look for a file by name:
I can also specify a Unix file glob pattern, such as all files that begin with h:
Or all files that end with .dict. Note in this example that the asterisk is escaped with a backslash to prevent my shell from expanding it locally, as I want the literal star to be given as the argument:
The --brief flag will return only the file ID:
This is useful, for instance, for downloading a file:
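A sketch combining the two commands with shell command substitution (the quotes keep the glob from expanding locally):
$ dx download $(dx find data --name "*.dict" --brief)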
The --json flag will return the results in JSON format. In the JSON chapter, you will learn how to parse these results for more advanced querying and data manipulation:
The --class option accepts the following values:
applet
database
file
record
The --state options accepts the following values:
open: A file that is currently being uploaded
closing: A file that is done uploading but is still being validated
closed: A file that is uploaded and validated
There are many more options for finding data and other entities on the platform that will be covered in later chapters.
It's time to run an app, but which one? I'd like to have a FASTQ file to work with, so I'll start by using the SRA FASTQ Importer. I can never quite remember the name of the app, so I'll search for it using a wildcard:
The "x" in the first column indicates this is an app supported by DNAnexus.
I can find information about the inputs and outputs to the app using either of these commands:
dx describe sra_fastq_importer
dx run sra_fastq_importer -h
I prefer the output from the second command:
Looking at the usage for the app, I see that only the -iaccession argument is required as all the others are shown enclosed with square brackets, e.g., [-ingc_key=(file)]. I can run the app with the SRA accession SRR070372 (C. elegans), answering "yes" to both launching and watching the app:
The equal sign in -iaccession=SRR070372 is required.
The output of watching is the same as you would see from the UI if you click the "MONITOR" tab in the project view and then "View Log" while the app is running. The end of the watch shows the app ran successfully and that a new file was created in my project:
I can find the size of the file with dx ls:
Now I'd like to feed this into FastQC. I'll search for the app by name just to be sure, and, yes, it's called "fastqc":
Again, I use either dx describe or dx run with -h to see what inputs the app requires.
I will use the new file's ID as the input to FastQC, and I'll run it using the additional flags -y to confirm launching and --watch to immediately start watching the job:
Notice that the confirmation shows "Using input JSON". If you like, you can save that to a file called, for example, input.json:
I can then launch the job using the -f|--input-json-file argument along with the --brief flag to show only the resulting job ID:
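A minimal sketch of that launch (input.json is the file saved above; the job ID returned is the one referenced below and will differ for you):
$ dx run fastqc -f input.json -y --brief
job-GXf930j071xJfYqfJ2kkvk8v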
Since the output will be the same, I can kill the job using dx terminate job-GXf930j071xJfYqfJ2kkvk8v.
The end of the watch shows that the job finishes successfully:
I would like to get a feel for the output, so I'll use dx head on the stats_txt output file ID:
You are now able to:
List the advantages of interacting with the platform via the command line interface
List the functions of the SDK and the API
Describe the purpose of the dx-toolkit
Apply frequently used dx-toolkit commands to execute common use cases, applicable to a broad audience of users
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.
Import From AWS S3: Add data from an S3 bucket. This is the same as running the AWS S3 Importer app.
orgs
org members
org projects
org apps
workflow
any: any of the above

pip3 install dxpy
pip3 install --upgrade dxpy

usage: dx [-h] [--version] command ...
DNAnexus Command-Line Client, API v1.0.0, client v0.346.0
dx is a command-line client for interacting with the DNAnexus platform. You
can log in, navigate, upload, organize and share your data, launch analyses,
and more. For a quick tour of what the tool can do, see
https://documentation.dnanexus.com/getting-started/tutorials/cli-quickstart#q>
For a breakdown of dx commands by category, run "dx help".
dx exits with exit code 3 if invalid input is provided or an invalid operation
is requested, and exit code 1 if an internal error is encountered. The latter
usually indicate bugs in dx; please report them at
https://github.com/dnanexus/dx-toolkit/issues
options:
-h, --help show this help message and exit
--env-help Display help message for overriding environment
variables
--version show program's version number and exit$ dx help
usage: dx help [-h] [command_or_category] [subcommand]
Displays the help message for the given command (and subcommand if given), or
displays the list of all commands in the given category.
CATEGORIES
all All commands
session Manage your login session
fs Navigate and organize your projects and files
data View, download, and upload data
metadata View and modify metadata for projects, data, and executions
workflow View and modify workflows
exec Manage and run apps, applets, and workflows
org Administer and operate on orgs
other Miscellaneous advanced utilities$ dx login -h
usage: dx login [-h] [--env-help] [--token TOKEN] [--noprojects] [--save]
[--timeout TIMEOUT]
Log in interactively and acquire credentials. Use "--token" to log in with an
existing API token.
options:
-h, --help show this help message and exit
--env-help Display help message for overriding environment variables
--token TOKEN Authentication token to use
--noprojects Do not print available projects
--save Save token and other environment variables for future
sessions
--timeout TIMEOUT Timeout for this login token (in seconds, or use suffix
s, m, h, d, w, M, y)$ dx login
Acquiring credentials from https://auth.dnanexus.com
Username: XXXXXXXX
Password: XXXXXXXX

$ dx login --token xxxxxxxxxxx

$ dx select -h
usage: dx select [-h] [--env-help] [--name NAME]
[--level {VIEW,UPLOAD,CONTRIBUTE,ADMINISTER}] [--public]
[project]
Interactively list and select a project to switch to. By default, only lists
projects for which you have at least CONTRIBUTE permissions. Use --public to
see the list of public projects.
positional arguments:
project Name or ID of a project to switch to; if not provided
a list will be provided for you
options:
-h, --help show this help message and exit
--env-help Display help message for overriding environment
variables
--name NAME Name of the project (wildcard patterns supported)
--level {VIEW,UPLOAD,CONTRIBUTE,ADMINISTER}
Minimum level of permissions expected
--public Include ONLY public projects (will automatically set
--level to VIEW)$ dx select
Note: Use dx select --level VIEW or dx select --public to
select from projects for which you only have VIEW permissions.
Available projects (CONTRIBUTE or higher):
0) App Dev (ADMINISTER)
1) Methylation (ADMINISTER)
2) Genomes (ADMINISTER)
3) WTS (ADMINISTER)
4) WGS (ADMINISTER)
5) Exome (ADMINISTER)
6) QC (ADMINISTER)
7) Collaborators (ADMINISTER)
8) Pipeline Dev (ADMINISTER)
9) WDL Test (ADMINISTER)
m) More options not shown...
Pick a numbered choice or "m" for more options [0]:$ dx select project-XXXXXXXXXXXXXXXXXXXXXXXX
$ dx select "Pipeline Dev"$ dx select --public
Available public projects:
0) Reference Genome Files: Azure US (West) (VIEW)
1) App_Assets_Europe(London)_Internal (VIEW)
2) Reference Genome Files: Azure Amsterdam (VIEW)
3) Reference Genome Files: AWS Germany (VIEW)
4) Reference Genome Files: AWS US (East) (VIEW)
5) Reference Genome Files: AWS Europe (London) (VIEW)
6) App and Applet Assets Azure (VIEW)
7) dxCompiler_Europe_London (VIEW)
8) dxCompiler_Sydney (VIEW)
9) dxCompiler_Berlin (VIEW)
m) More options not shown...
Pick a numbered choice or "m" for more options:$ dx pwd
Pipeline Dev:/$ dx env
Auth token used XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
API server protocol https
API server host api.dnanexus.com
API server port 443
Current workspace project-XXXXXXXXXXXXXXXXXXXXXXXX
Current workspace name "Pipeline Dev"
Current folder /
Current user test_user$ dx new project -h
usage: dx new project [-h] [--brief | --verbose] [--env-help]
[--region REGION] [-s] [--bill-to BILL_TO] [--phi]
[--database-ui-view-only]
[name]
Create a new project
positional arguments:
name Name of the new project
options:
-h, --help show this help message and exit
--brief Display a brief version of the return value; for most
commands, prints a DNAnexus ID per line
--verbose If available, displays extra verbose output
--env-help Display help message for overriding environment
variables
--region REGION Region affinity of the new project
-s, --select Select the new project as current after creating
--bill-to BILL_TO ID of the user or org to which the project will be
billed. The default value is the billTo of the
requesting user.
--phi Add PHI protection to project
--database-ui-view-only
Viewers on the project cannot access database data
directly$ dx new project --region aws:us-east-1 demo_project
Created new project called "demo_project" (project-GXZ90x00fF6F4fy1K20x4gv9)
Switch to new project now? [y/N]: y$ dx invite -h
usage: dx invite [-h] [--env-help] [--no-email]
invitee [project] [{VIEW,UPLOAD,CONTRIBUTE,ADMINISTER}]
Invite a DNAnexus entity to a project. If the invitee is not recognized as a
DNAnexus ID, it will be treated as a username, i.e. "dx invite alice : VIEW"
is equivalent to inviting the user with user ID "user-alice" to view your
current default project.
positional arguments:
invitee Entity to invite
project Project to invite the invitee to
{VIEW,UPLOAD,CONTRIBUTE,ADMINISTER}
Permissions level the new member should have
options:
-h, --help show this help message and exit
--env-help Display help message for overriding environment
variables
--no-email Disable email notifications to invitee$ dx uninvite -h
usage: dx uninvite [-h] [--env-help] entity [project]
Revoke others' permissions on a project you administer. If the entity is not
recognized as a DNAnexus ID, it will be treated as a username, i.e. "dx
uninvite alice :" is equivalent to revoking the permissions of the user with
user ID "user-alice" to your current default project.
positional arguments:
entity Entity to uninvite
project Project to revoke permissions from
options:
-h, --help show this help message and exit
--env-help Display help message for overriding environment variables$ dx pwd -h
usage: dx pwd [-h] [--env-help]
Print current working directory
options:
-h, --help show this help message and exit
--env-help Display help message for overriding environment variables$ dx pwd
demo_project:/$ dx ls -h
usage: dx ls [-h] [--color {off,on,auto}] [--delimiter [DELIMITER]]
[--env-help] [--brief | --verbose] [-a] [-l] [--obj] [--folders]
[--full]
[path]
List folders and/or objects in a folder
positional arguments:
path Folder (possibly in another project) to list the
contents of, default is the current directory in the
current project. Syntax: projectID:/folder/path

usage: dx cp [-h] [--env-help] [-a] source [source ...] destination
Copy objects and/or folders between different projects. Folders will
automatically be copied recursively. To specify which project to use as a
source or destination, prepend the path or ID of the object/folder with the
project ID or name and a colon.
EXAMPLES
The first example copies a file in a project called "FirstProj" to the
current directory of the current project. The second example copies the
object named "reads.fq.gz" in the current directory to the folder
/folder/path in the project with ID "project-B0VK6F6gpqG6z7JGkbqQ000Q",
and finally renaming it to "newname.fq.gz".
$ dx cp FirstProj:file-B0XBQFygpqGK8ZPjbk0Q000q .
$ dx cp reads.fq.gz project-B0VK6F6gpqG6z7JGkbqQ000Q:/folder/path/newname.fq.>
positional arguments:
source Objects and/or folder names to copy
destination Folder into which to copy the sources or new pathname (if only
one source is provided). Must be in a different
project/container than all source paths.
options:
-h, --help show this help message and exit
--env-help Display help message for overriding environment
variables
-a, --all Apply to all results with the same name without
prompting

$ dx cp project-BQpp3Y804Y0xbyG4GJPQ01xv:file-GFz5xf00Bqx2j79G4q4F5jXV /

$ dx ls
hs38DH.dict$ dx ls -l
Project: demo_project (project-GXZ90x00fF6F4fy1K20x4gv9)
Folder : /
State Last modified Size Name (ID)
closed 2023-07-07 16:11:56 334.68 KB hs38DH.dict (file-GFz5xf00Bqx2j79G4q4F5jXV)$ dx ls
data/
hs38DH.dict

$ dx mv file-GFz5xf00Bqx2j79G4q4F5jXV data
$ dx mv hs38DH.dict data

$ dx ls
data/$ dx ls data
hs38DH.dict$ dx pwd
demo_project:/data$ dx ls -l
Project: demo_project (project-GXZ90x00fF6F4fy1K20x4gv9)
Folder : /data
State Last modified Size Name (ID)
closed 2023-07-07 16:11:56 334.68 KB hs38DH.dict (file-GFz5xf00Bqx2j79G4q4F5jXV)$ dx tree -h
usage: dx tree [-h] [--color {off,on,auto}] [--env-help] [-a] [-l] [path]
List folders and objects in a tree
positional arguments:
path Folder (possibly in another project) to list the
contents of, default is the current directory in the
current project. Syntax: projectID:/folder/path
options:
-h, --help show this help message and exit
--color {off,on,auto}
Set when color is used (color=auto is used when stdout
is a TTY)
--env-help Display help message for overriding environment
variables
-a, --all show hidden files
-l, --long use a long listing format$ dx tree
.
└─ data
└─ hs38DH.dict$ dx tree -l
.
└─ data
└─ closed 2023-07-07 16:11:56 334.68 KB hs38DH.dict
(file-GFz5xf00Bqx2j79G4q4F5jXV)

$ echo hello > hello.txt

$ dx upload -h
usage: dx upload [-h] [--visibility {hidden,visible}] [--property KEY=VALUE]
[--type TYPE] [--tag TAG] [--details DETAILS] [-p]
[--brief | --verbose] [--env-help] [--path [PATH]] [-r]
[--wait] [--no-progress] [--buffer-size WRITE_BUFFER_SIZE]
[--singlethread]
filename [filename ...]
Upload local file(s) or directory. If "-" is provided, stdin will be used
instead. By default, the filename will be used as its new name. If
--path/--destination is provided with a path ending in a slash, the filename
will be used, and the folder path will be used as a destination. If it does not
end in a slash, then it will be used as the final name.
positional arguments:
filename Local file or directory to upload ("-" indicates stdin
input); provide multiple times to upload multiple files
or directories$ dx ls
data/
hello.txt$ dx cat -h
usage: dx cat [-h] [--env-help] [--unicode] path [path ...]
positional arguments:
path File ID or name(s) to print to stdout
options:
-h, --help show this help message and exit
--env-help Display help message for overriding environment variables
--unicode Display the characters as text/unicode when writing to stdout$ dx cat hello.txt
hello

$ dx upload hello.txt --path data

$ dx ls
data/
data
hello.txt

$ dx cat data
hello

If --path/--destination is provided with a path ending in a slash, the
filename will be used, and the folder path will be used as a destination.
If it does not end in a slash, then it will be used as the final name.

$ dx ls -l
Project: demo_project (project-GXZ90x00fF6F4fy1K20x4gv9)
Folder : /
data/
State Last modified Size Name (ID)
closed 2023-07-07 16:34:31 6 bytes data (file-GXZB2180fF65j2G1197pP7By)
closed 2023-07-07 16:34:10 6 bytes hello.txt (file-GXZB1v80fF6BXJ8p7PvZPy1v)$ dx ls -l
Project: demo_project (project-GXZ90x00fF6F4fy1K20x4gv9)
Folder : /
data/
State Last modified Size Name (ID)
closed 2023-07-07 17:01:20 6 bytes hello.txt (file-GXZBKYQ0fF6Pf2ZKPBF7G7j9)
closed 2023-07-07 16:34:10 6 bytes hello.txt (file-GXZB1v80fF6BXJ8p7PvZPy1v)$ dx rm hello.txt
The given path "hello.txt" resolves to the following data objects:
0) closed 2023-07-07 17:01:20 6 bytes hello.txt (file-GXZBKYQ0fF6Pf2ZKPBF7G7j9)
1) closed 2023-07-07 16:34:10 6 bytes hello.txt (file-GXZB1v80fF6BXJ8p7PvZPy1v)
Pick a numbered choice or "*" for all: 0$ dx head -h
usage: dx head [-h] [--color {off,on,auto}] [--env-help] [-n N] path
Print the first part of a file. By default, prints the first 10 lines.
positional arguments:
path File ID or name to access
options:
-h, --help show this help message and exit
--color {off,on,auto}
Set when color is used (color=auto is used when stdout
is a TTY)
--env-help Display help message for overriding environment
variables
-n N, --lines N Print the first N lines (default 10)$ dx head data/hs38DH.dict
@HD VN:1.6
@SQ SN:chr1 LN:248956422 M5:6aef897c3d6ff0c78aff06ac189178dd UR:file:/home/hs38DH.fa.gz
@SQ SN:chr2 LN:242193529 M5:f98db672eb0993dcfdabafe2a882905c UR:file:/home/hs38DH.fa.gz
@SQ SN:chr3 LN:198295559 M5:76635a41ea913a405ded820447d067b0 UR:file:/home/hs38DH.fa.gz
@SQ SN:chr4 LN:190214555 M5:3210fecf1eb92d5489da4346b3fddc6e UR:file:/home/hs38DH.fa.gz
@SQ SN:chr5 LN:181538259 M5:a811b3dc9fe66af729dc0dddf7fa4f13 UR:file:/home/hs38DH.fa.gz
@SQ SN:chr6 LN:170805979 M5:5691468a67c7e7a7b5f2a3a683792c29 UR:file:/home/hs38DH.fa.gz
@SQ SN:chr7 LN:159345973 M5:cc044cc2256a1141212660fb07b6171e UR:file:/home/hs38DH.fa.gz
@SQ SN:chr8 LN:145138636 M5:c67955b5f7815a9a1edfaa15893d3616 UR:file:/home/hs38DH.fa.gz
@SQ SN:chr9 LN:138394717 M5:6c198acf68b5af7b9d676dfdd531b5de UR:file:/home/hs38DH.fa.gz$ dx download file-GFz5xf00Bqx2j79G4q4F5jXV
[===========================================================>]
Downloaded 342,714
[===========================================================>]
Completed 342,714 of 342,714 bytes (100%) /Users/[email protected]/work/academy/hs38DH.dict$ dx describe -h
usage: dx describe [-h] [--json] [--color {off,on,auto}]
[--delimiter [DELIMITER]] [--env-help] [--details]
[--verbose] [--name] [--multi]
path
Describe a DNAnexus entity. Use this command to describe data objects by name
or ID, jobs, apps, users, organizations, etc. If using the "--json" flag, it
will thrown an error if more than one match is found (but if you would like a
JSON array of the describe hashes of all matches, then provide the "--multi"
flag). Otherwise, it will always display all results it finds.
NOTES:
- The project found in the path is used as a HINT when you are using an object ID;
you may still get a result if you have access to a copy of the object in some
other project, but if it exists in the specified project, its description will
be returned.
- When describing apps or applets, options marked as advanced inputs will be
hidden unless --verbose is provided
positional arguments:
path Object ID or path to an object (possibly in another
project) to describe.
options:
-h, --help show this help message and exit
--json Display return value in JSON
--color {off,on,auto}
Set when color is used (color=auto is used when stdout
is a TTY)
--delimiter [DELIMITER], --delim [DELIMITER]
Always use exactly one of DELIMITER to separate fields
to be printed; if no delimiter is provided with this
flag, TAB will be used
--env-help Display help message for overriding environment
variables
--details Include details of data objects
--verbose Include additional metadata
--name Only print the matching names, one per line
--multi If the flag --json is also provided, then returns a JSON
array of describe hashes of all matching results$ dx describe file-GXZB1v80fF6BXJ8p7PvZPy1v
Result 1:
ID file-GXZB1v80fF6BXJ8p7PvZPy1v
Class file
Project project-GXZ90x00fF6F4fy1K20x4gv9
Folder /
Name hello.txt
State closed
Visibility visible
Types -
Properties -
Tags -
Outgoing links -
Created Fri Jul 7 16:34:09 2023
Created by kyclark
Last modified Fri Jul 7 16:34:10 2023
Media type text/plain
archivalState "live"
Size 6 bytes
cloudAccount "cloudaccount-dnanexus"$ dx describe file-GXZB1v80fF6BXJ8p7PvZPy1v --delim ,
Result 1:
ID,file-GXZB1v80fF6BXJ8p7PvZPy1v
Class,file
Project,project-GXZ90x00fF6F4fy1K20x4gv9
Folder,/
Name,hello.txt
State,closed
Visibility,visible
Types,-
Properties,-
Tags,-
Outgoing links,-
Created,Fri Jul 7 16:34:09 2023
Created by,kyclark
Last modified,Fri Jul 7 16:34:10 2023
Media type,text/plain
archivalState,"live"
Size,6 bytes
cloudAccount,"cloudaccount-dnanexus"$ dx describe file-GXZB1v80fF6BXJ8p7PvZPy1v --json
{
"id": "file-GXZB1v80fF6BXJ8p7PvZPy1v",
"project": "project-GXZ90x00fF6F4fy1K20x4gv9",
"class": "file",
"sponsored": false,
"name": "hello.txt",
"types": [],
"state": "closed",
"hidden": false,
"links": [],
"folder": "/",
"tags": [],
"created": 1688772849000,
"modified": 1688772850572,
"createdBy": {
"user": "user-kyclark"
},
"properties": {},
"details": {},
"media": "text/plain",
"archivalState": "live",
"size": 6,
"cloudAccount": "cloudaccount-dnanexus"
}$ dx describe project-GXZ90x00fF6F4fy1K20x4gv9 | head
Result 1:
ID project-GXZ90x00fF6F4fy1K20x4gv9
Class project
Name demo_project
Summary
Billed to org-sos
Access level ADMINISTER
Region aws:us-east-1
Protected false
Restricted false$ dx mv -h
usage: dx mv [-h] [--env-help] [-a] source [source ...] destination
Move or rename data objects and/or folders inside a single project. To copy
data between different projects, use 'dx cp' instead.
positional arguments:
source Objects and/or folder names to move
destination Folder into which to move the sources or new pathname (if only
one source is provided). Must be in the same project/container
as all source paths.
options:
-h, --help show this help message and exit
--env-help Display help message for overriding environment
variables
-a, --all Apply to all results with the same name without
prompting$ dx ls -l
Project: demo_project (project-GXZ90x00fF6F4fy1K20x4gv9)
Folder : /
data/
State Last modified Size Name (ID)
closed 2023-07-10 10:11:31 6 bytes goodbye.txt (file-GXZB1v80fF6BXJ8p7PvZPy1v)$ dx mv file-GXZB1v80fF6BXJ8p7PvZPy1v data/hello.txt
$ dx tree -l
.
└── data
├── closed 2023-07-10 10:13:31 6 bytes hello.txt (file-GXZB1v80fF6BXJ8p7PvZPy1v)
└── closed 2023-07-07 16:11:56 334.68 KB hs38DH.dict (file-GFz5xf00Bqx2j79G4q4F5jXV)$ dx cp hello.txt data/hello_copy.txt
dxpy.exceptions.DXCLIError: A source path and the destination path resolved
to the same project or container. Please specify different source and
destination containers, e.g.
dx cp source-project:source-id-or-path dest-project:dest-path

usage: dx find data [-h] [--brief | --verbose] [--json]
[--color {off,on,auto}] [--delimiter [DELIMITER]]
[--env-help] [--property KEY[=VALUE]] [--tag TAG]
[--class {record,file,applet,workflow,database}]
[--state {open,closing,closed,any}]
[--visibility {hidden,visible,either}] [--name NAME]
[--type TYPE] [--link LINK] [--all-projects]
[--path PROJECT:FOLDER] [--norecurse]
[--created-after CREATED_AFTER]
[--created-before CREATED_BEFORE] [--mod-after MOD_AFTER]
[--mod-before MOD_BEFORE] [--region REGION]
Finds data objects subject to the given search parameters. By default,
restricts the search to the current project if set. To search over all
projects (excluding public projects), use --all-projects (overrides --path and
--norecurse).$ dx find data
closed 2023-07-10 10:13:31 6 bytes /data/hello.txt (file-GXZB1v80fF6BXJ8p7PvZPy1v)
closed 2023-07-07 16:11:56 334.68 KB /data/hs38DH.dict (file-GFz5xf00Bqx2j79G4q4F5jXV)$ dx find data --name hs38DH.dict
closed 2023-07-07 16:11:56 334.68 KB /data/hs38DH.dict (file-GFz5xf00Bqx2j79G4q4F5jXV)$ dx find data --name "h*"
closed 2023-07-10 10:13:31 6 bytes /data/hello.txt (file-GXZB1v80fF6BXJ8p7PvZPy1v)
closed 2023-07-07 16:11:56 334.68 KB /data/hs38DH.dict (file-GFz5xf00Bqx2j79G4q4F5jXV)$ dx find data --name \*.dict
closed 2023-07-07 16:11:56 334.68 KB /data/hs38DH.dict (file-GFz5xf00Bqx2j79G4q4F5jXV)$ dx find data --name \*.dict --brief
project-GXZ90x00fF6F4fy1K20x4gv9:file-GFz5xf00Bqx2j79G4q4F5jXV$ dx download $(dx find data --name \*.dict --brief)
[=======================>] Completed 342,714 of 342,714 bytes (100%)
/Users/[email protected]/work/academy/hs38DH.dict$ dx find data --name \*.dict --json
[
{
"project": "project-GXZ90x00fF6F4fy1K20x4gv9",
"id": "file-GFz5xf00Bqx2j79G4q4F5jXV",
"describe": {
"id": "file-GFz5xf00Bqx2j79G4q4F5jXV",
"project": "project-GXZ90x00fF6F4fy1K20x4gv9",
"class": "file",
"name": "hs38DH.dict",
"state": "closed",
"folder": "/data",
"modified": 1688771516882,
"size": 342714
}
}
]$ dx find apps --name "sra*"
x SRA FASTQ Importer (sra_fastq_importer), v4.0.0$ dx run sra_fastq_importer -h
usage: dx run sra_fastq_importer [-iINPUT_NAME=VALUE ...]
App: SRA FASTQ Importer
Version: 4.0.0 (published)
Download SE or PE reads in FASTQ or FASTA format from SRA using SRR accessions
See the app page for more information:
https://platform.dnanexus.com/app/sra_fastq_importer
Inputs:
dbGaP Repository key: [-ingc_key=(file)]
(Optional) Security token required for configuring NCBI SRA toolkit and decryption tools.
SRR Accession: -iaccession=(string)
Single SRR accession to fetch.$ dx run sra_fastq_importer -iaccession=SRR070372
Using input JSON:
{
"accession": "SRR070372"
}
Confirm running the executable with this input [Y/n]: y
Calling app-G49BFZ093qKvjFYgF8fyv6Z7 with output destination project-GXY0PK0071xJpG156BFyXpJF:/
Job ID: job-GXf8Qg8071xBJJg417YVYJX3
Watch launched job now? [Y/n] y

* SRA FASTQ Importer (sra_fastq_importer:main) (done)
job-GXf8Qg8071xBJJg417YVYJX3
kyclark 2023-07-10 15:38:21 (runtime 0:02:36)
Output: single_reads_fastq = [ file-GXf8VgQ09bzK5q1XV5z1gx7j ]$ dx ls -l file-GXf8VgQ09bzK5q1XV5z1gx7j
closed 2023-07-10 15:41:38 206.59 MB SRR070372.fastq.gz (file-GXf8VgQ09bzK5q1XV5z1gx7j)$ dx find apps --name fastqc
x FastQC Reads Quality Control (fastqc), v3.0.3

usage: dx run fastqc [-iINPUT_NAME=VALUE ...]
App: FastQC Reads Quality Control
Version: 3.0.3 (published)
Generates a QC report on reads data
See the app page for more information:
https://platform.dnanexus.com/app/fastqc
Inputs:
Reads: -ireads=(file)
A file containing the reads to be checked. Accepted formats are
gzipped-FASTQ and BAM.$ dx run fastqc -ireads=file-GXf8P880FjgZGJQqx8Bf30YK -y --watch
Using input JSON:
{
"reads": {
"$dnanexus_link": "file-GXf8P880FjgZGJQqx8Bf30YK"
}
}
Calling app-G81jg5j9jP7qxb310vg2xQkX with output destination project-GXY0PK0071xJpG156BFyXpJF:/
Job ID: job-GXf8fJQ071x00P5bQzQ62gjY$ cat input.json
{
"reads": {
"$dnanexus_link": "file-GXf8P880FjgZGJQqx8Bf30YK"
}
}$ dx run fastqc -f input.json -y --brief
job-GXf930j071xJfYqfJ2kkvk8v

* FastQC Reads Quality Control (fastqc:main) (done) job-GXf8fgj071x3KV4qyyKGZQVY
kyclark 2023-07-10 15:51:11 (runtime 0:02:01)
Output: report_html = file-GXf8gbQ06GxZ38zFXB46XYYj
stats_txt = file-GXf8gbj06Gxy9F8P66pJG7J3$ dx head file-GXf8gbj06Gxy9F8P66pJG7J3
##FastQC 0.11.9
>>Basic Statistics pass
#Measure Value
Filename SRR070372.fastq.gz
File type Conventional base calls
Encoding Sanger / Illumina 1.9
Total Sequences 498843
Sequences flagged as poor quality 0
Sequence length 48-2044
%GC 39