Introduction to CLI

Overview of Interacting with the Platform

Users of the platform like to interact with it in a variety of ways (shown below), but this section is dedicated to those who want to learn how to interact with it using the command line interface (CLI).

Terms

The CLI interacts with the platform in the following way:

  • The CLI (command line interface) is run locally on your own machine.

  • On your local machine, you will download the SDK (software development kit), which we also call dx-toolkit. Information on downloading it and other requirements can be found in the Getting Started Guide. Once set up, this allows you to log into the platform and explore your data and projects, create apps and workflows, and launch analyses.

  • The API (application programming interface) servers let us interact with the platform using HTTP requests. The arguments for each request are fields in a JSON document. If you want more details on this structure, you can go to DNAnexus API.
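To make the last point concrete, here is a minimal Python sketch of a request whose arguments are JSON fields. It only builds the request and never sends it. The server URL, the object ID file-xxxx, the token, and the choice of the describe route with a "fields" argument are all assumptions made for illustration, so check the DNAnexus API documentation before relying on any of them.

```python
import json
import urllib.request

# Hypothetical values for illustration only; substitute real ones before use.
API_SERVER = "https://api.dnanexus.com"   # assumed API server URL
OBJECT_ID = "file-xxxx"                    # placeholder object ID
TOKEN = "YOUR_TOKEN"                       # placeholder authentication token


def build_describe_request(object_id, token):
    """Build (but do not send) an HTTP POST whose arguments are JSON fields."""
    body = json.dumps({"fields": {"name": True, "size": True}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_SERVER}/{object_id}/describe",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_describe_request(OBJECT_ID, TOKEN)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

In day-to-day use the dx command assembles requests like this for you, which is why the rest of this section stays on the command line.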

Installation

Please ensure that you are running Python 3 before starting this install.

To install, run pip3 install dxpy.

To upgrade dxpy, run pip3 install --upgrade dxpy.

Further details can be found in our Documentation if you need them.
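If you are unsure whether your environment is ready, a short check like the following may help. It is only a sketch using the Python standard library: it confirms the interpreter is Python 3 and reports whether a package named dxpy can be found, without importing it. The helper names are my own.

```python
import importlib.util
import sys


def python_is_supported(version_info=sys.version_info):
    """Return True when the interpreter is Python 3 or newer."""
    return version_info[0] >= 3


def package_is_installed(name):
    """Return True when a top-level package can be found without importing it."""
    return importlib.util.find_spec(name) is not None


print("Python 3:", python_is_supported())
print("dxpy installed:", package_is_installed("dxpy"))
```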

Introducing dx-toolkit

The dx command will be your most used utility for interacting with the DNAnexus platform. You can run the command with no arguments or with the -h or --help flags to see the usage:

Sometimes the usage may occupy your entire terminal, in which case you may see (END) to show that you are at the end of the documentation. Press q to quit the usage, or use Ctrl-C to send an interrupt signal that kills the process.

Run dx help to read about the categories of commands you can run:

Logging Into the Platform

Let's start by using dx login to gain access to the DNAnexus platform from the command line. All dx commands will respond to -h|--help, so run the command with one of these flags to read the usage:

The help documentation is often called the usage because that is often the first word of the output. In the previous output, notice that all the arguments are enclosed in square brackets, e.g., [--token TOKEN]. This is a common convention in Unix documentation to indicate that the argument is optional. The lack of such square brackets means the argument is required.

Some of the arguments require a value to follow. For example, --token TOKEN means the argument --token must be followed by the string value for the token. Arguments like --save are known as flags. They are either present or not and often represent a Boolean value, usually "True" when present and "False" when absent.
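The difference between an option that takes a value and a flag can be seen in a toy parser built with Python's argparse module. This is an illustration of the convention only, not the real dx login parser, and the program name login-demo is made up.

```python
import argparse

# A toy parser that mirrors two of the optional arguments discussed above.
# It is not the real "dx login" parser.
parser = argparse.ArgumentParser(prog="login-demo")
parser.add_argument("--token", metavar="TOKEN", help="option that takes a value")
parser.add_argument("--save", action="store_true", help="flag: True when present")

with_both = parser.parse_args(["--token", "abc123", "--save"])
with_none = parser.parse_args([])

print(with_both.token, with_both.save)   # value supplied, flag present
print(with_none.token, with_none.save)   # value missing, flag absent
print(parser.format_usage().strip())     # optional arguments appear in brackets
```

Because both arguments are optional, the generated usage encloses each of them in square brackets, just as in the dx login usage.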

The most basic usage for login is to enter your username and password when prompted:

You may also generate a token in the web UI for use on the command line:

Information on setting up tokens can be found in the Using Tokens section of our Documentation.

Use dx logout to log out of the platform. This invalidates a token.

If you are ever in doubt about your username, use dx whoami to see your identity.

  • When you ssh into a cloud workstation, you will be your normal DNAnexus user.

  • When running the ttyd app to access a cloud workstation through the UI, you will be the privileged Unix user root.

  • When you ssh into a running job, you will be the user dnanexus.

Working with Projects and Users

A project is the smallest unit of sharing in DNAnexus, and you must always work in the context of a project. Upon login, you will be prompted to select a project. To change projects, use dx select. Use -h|--help to view the usage:

When run with no options, you will be presented with a list of your projects and your permission level in each:

Press Enter to choose the first project, or select a number 0-9 to choose a project or m for "more" options. You can also provide a project name or ID as the first argument:

Use the --level option to specify only projects where you have a particular permission. For instance, dx select --level ADMINISTER will show only projects where you are an administrator.

Normally, projects are private to your organization, but the --public option will display the public projects that DNAnexus uses to share common resources like sequence files or indexes for reference genomes:

Press Ctrl-C to exit the program without making a selection.

If you are ever in doubt as to your current project, run dx pwd (print working directory):

Alternatively, you can run dx env to see your current environment:

If I wanted to share some data with a collaborator, I would use dx new project to create a new project to hold select data and apps. Following is the usage:

I will use this command to create a new project in the AWS US-East-1 region. See the documentation for a list of all available regions. The command displays the new project ID and prompts to switch into the new project:

Next, I would use dx invite <user-id> to invite users to the project. Start with the usage to see how to call the command:

The usage shows that this command takes three positional arguments, the first of which (invitee) is required and the other two (project, permissions) are optional. Your currently selected project is the default project, and "VIEW" is the default permission. If you wish to indicate some permission other than "VIEW," you must specify the project first.
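The reason the project must come first is that positional arguments are filled strictly in order. The following toy argparse parser sketches this behavior. It is not the real dx invite parser, and the list of permission names, the placeholder project-xxxx, and the user ID user-alice are assumptions for illustration.

```python
import argparse

# Assumed permission names; only VIEW and ADMINISTER appear in this section.
PERMISSIONS = ["VIEW", "UPLOAD", "CONTRIBUTE", "ADMINISTER"]

# A toy parser with one required and two optional positional arguments.
parser = argparse.ArgumentParser(prog="invite-demo")
parser.add_argument("invitee")
parser.add_argument("project", nargs="?", default="<current project>")
parser.add_argument("permission", nargs="?", default="VIEW", choices=PERMISSIONS)

only_invitee = parser.parse_args(["user-alice"])
skipped_project = parser.parse_args(["user-alice", "CONTRIBUTE"])
full = parser.parse_args(["user-alice", "project-xxxx", "CONTRIBUTE"])

print(only_invitee.project, only_invitee.permission)
print(skipped_project.project, skipped_project.permission)
print(full.project, full.permission)
```

In the second call, the word CONTRIBUTE lands in the project slot and the permission silently stays at its default, which is exactly the mistake the rule above is meant to prevent.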

Use dx uninvite <user-id> to revoke a user's access to a project:

Data Exploration

Earlier, I introduced dx pwd (print working directory) to find my currently selected project.

Notice that the output shows the project name and the directory /, which is the root directory of the project:

The command dx ls will list the contents of a directory. Notice in the usage that the directory name is optional, in which case it will use the current working directory:

There is nothing to list because I just created this project, so I'll add some data next.

Copying and Moving Files

I will use the command dx cp to copy a small file from one of the public projects into my project. I'll start with the usage:

The usage shows source [source ...], which is another Unix convention to indicate that the argument may be repeated. This means you can indicate several source files or directories to be copied to the final destination.

I'll copy the file hs38DH.dict from the project "Reference Genome Files: AWS US (East)" into the root directory of my new project. The command will only produce output on error:

I must specify the source file using the project and file ID. When you refer to files inside your current project, it's only necessary to use the file ID.

Now I can list the one file:

Often you'll want to use the file ID, which you can view using the -l|--long flag to see the long listing that includes more metadata:

I've decided I want to create a data directory to hold files such as this, so I will use dx mkdir data. The command will produce no output on success. A new listing shows data/ where the trailing slash indicates this is a directory:

To move hs38DH.dict into the data directory, I can use either the file name or the file ID:

A new listing shows that the file is no longer in the root directory:

I can specify the data directory to view the contents:

Alternatively, I can use dx cd data to change directories. The command dx pwd will verify that I'm in the new folder:

If I execute dx ls now, I'll see the contents of the data directory:

Return to the root directory of the project by running dx cd or dx cd /.

Another way to inspect the structure of a project is using dx tree:

With no options, you will see a tree structure of the project:

This command will also show the long listing with -l|--long:

Uploading Data

I want to create a local file on my computer and add it to the project. I'll use the echo command to redirect some text into a file:

I'll use the dx upload command. The usage shows that filename is required and may be repeated.

There are many options to the command, and here are a few to highlight:

  • --brief: Display a brief version of the return value; for most commands, prints a DNAnexus ID per line

  • -r, --recursive: Upload directories recursively

  • --path [PATH], --destination [PATH]: DNAnexus path to upload file(s) to (default uses current project and folder if not provided)

Run dx upload hello.txt and see that the new file exists in the root directory of your current project:

You can also upload data using the UI. Under the "Add" menu, you will find the following:

  • Upload Data: Use your browser to add files to the project. This is the same as using dx upload.

  • Copy Data From Project: Add data from existing projects on the platform. This is the same as dx cp.

  • Add Data From Server: Add data from any publicly accessible URL such as an HTTP or FTP site. This is the same as running the URL Fetcher app.

  • Import From AWS S3: Add data from an S3 bucket. This is the same as running the AWS S3 Importer app.

In addition, we offer an SRA FASTQ Importer app.

I would like to check the new file on the platform. Like the Unix cat (concatenate) command, dx cat prints the entire contents of a file to the console:

I can use this to verify that the file was correctly uploaded:

You might expect the following command to upload hello.txt into the data directory:

Unfortunately, this will create a file called data alongside a directory called data:

I can verify that the data file contains "hello":

Note this important part of upload's usage:

This brings up an interesting point: file names are not unique on the DNAnexus platform. The only unique identifier is the file ID, and so this is always the best way to refer to a file. To rectify the duplication, I will get the file ID:

I can remove the file using dx rm file-GXZB2180fF65j2G1197pP7By.

If I run dx upload hello.txt again, I will not overwrite the existing file. Rather, another copy of the file will be created with a new file ID:

The concept of immutability was covered in "Course 101 Overview of the DNAnexus Platform User Interface": Remember the crucially important fact that data objects on the DNAnexus platform are immutable. They can only be created (e.g., by uploading them) or removed, but they can never be overwritten. A given object ID always points to the same collection of bits, which leads to downstream benefits like reusing the outputs of jobs that share the same executable and input IDs (smart reuse).
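A toy in-memory model can make these rules easier to picture. The sketch below is not how the platform is implemented. The class name ToyProject and the ID format file-0001 are invented for illustration, and the model only captures three ideas: uploads always create a new object, names may repeat, and removal is by unique ID.

```python
import itertools


class ToyProject:
    """A toy in-memory model: IDs are unique, names are not, objects are immutable."""

    def __init__(self):
        self._objects = {}                  # object ID -> (name, content)
        self._counter = itertools.count(1)

    def upload(self, name, content):
        """Always create a new object; an existing object is never overwritten."""
        object_id = f"file-{next(self._counter):04d}"   # made-up ID format
        self._objects[object_id] = (name, content)
        return object_id

    def find(self, name):
        """Return every ID that carries this name (there may be several)."""
        return [oid for oid, (n, _) in self._objects.items() if n == name]

    def remove(self, object_id):
        """Remove exactly one object, addressed by its unique ID."""
        del self._objects[object_id]


project = ToyProject()
first = project.upload("hello.txt", "hello\n")
second = project.upload("hello.txt", "hello\n")
print(project.find("hello.txt"))   # two IDs share one name
project.remove(second)
print(project.find("hello.txt"))   # only the first upload remains
```

Note that the model has no method for changing an object's content. Producing different content always means a new upload and therefore a new ID.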

I cannot remove the file by filename as it's not unique, so I'm prompted to select which file I want:

I used dx cat hello.txt to read the contents of the entire file because I knew the file had only one line. It's far safer to use dx head to look at just the first few lines (the default is 10):

For instance, I can peek at the data/hs38DH.dict file:

Another option to check the file is to download it:

Inspecting Object Metadata

Every data object on the platform has a unique identifier prefixed with the type of object such as "file-," "record-," or "applet-." Earlier, I saw that hello.txt has the ID file-GXZB1v80fF6BXJ8p7PvZPy1v. I can use the dx describe command to view the metadata:

I could use the filename, if it's unique, but it's always best practice to use the file ID:

As shown in the usage, the --delim option causes the output table to use whatever delimiter you want between the columns. This could be useful if you wish to parse the output programmatically. The tab character is the default delimiter, but I can use a comma like so:
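Once the output uses a known delimiter, it can be split programmatically. Here is a minimal Python sketch that turns key and value lines into a dictionary. The sample text is hypothetical and uses the placeholder ID file-xxxx. Real output contains more fields, and its exact labels may differ.

```python
def parse_delimited(text, delim=","):
    """Split each "key<delim>value" line once, keeping delimiters inside values."""
    record = {}
    for line in text.splitlines():
        if not line.strip():
            continue                      # skip blank lines
        key, _, value = line.partition(delim)
        record[key] = value
    return record


# Hypothetical sample text in a key,value layout; real output has more fields.
sample = "ID,file-xxxx\nClass,file\nName,hello.txt\nFolder,/\n"
info = parse_delimited(sample)
print(info["ID"], info["Name"])
```

Splitting only on the first delimiter matters because a value, such as a file name, may itself contain the delimiter character.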

The --json flag returns the same data in JavaScript Object Notation (JSON), which we'll discuss in a later chapter:

I can use dx describe to view the metadata associated with any object identifier on the platform. For instance, I'll use head to view the first few lines of the project's metadata:

Find another entity ID, such as your billing org, to use with the command.

Copying and Moving Files

I can use dx mv to move a file or directory within a project:

For instance, I can rename hello.txt to goodbye.txt with the command dx mv hello.txt goodbye.txt. The file ID remains the same:

I can also move goodbye.txt to the data directory and rename it back to hello.txt. Again, the file ID remains the same because I have only changed some of the file's metadata:

As noted in the preceding usage, I should use dx cp to copy data from one project to another. If I attempt to copy a file within a project, I will get an error:

The only way to make an actual copy of a file is to upload it again as I did earlier when I added the hello.txt file a second time.

Data objects on the platform exist as bits in AWS or Azure storage, and the associated metadata is stored in a DNAnexus database. If two projects are in the same region such as AWS US-East-1, then dx cp doesn't actually copy the bits but rather creates a new database entry pointing to the object. This means you don't pay for additional storage. Copying between regions, however, does make a physical copy of the bits and will cost money for data egress and storage. When in doubt, use dx describe <project-id> to see a project's "Region" attribute or check the "Settings" in the project view UI.

Finding Data

The dx find command will help you search for entities including:

  • apps

  • globalworkflows

  • jobs

  • data

  • projects

  • orgs

  • org members

  • org projects

  • org apps

I can use the dx find data command to search data objects such as files and applets. I'll display the first part of the usage as it's rather long:

Run the command in the current project to see the two files:

I can use the --name option to look for a file by name:

I can also specify a Unix file glob pattern, such as all files that begin with h:

Or all files that end with .dict. Note in this example that the asterisk is escaped with a backslash to prevent my shell from expanding it locally, as I want the literal star to be given as the argument:
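Glob matching and shell quoting can both be tried locally in Python. In this sketch, fnmatch stands in for the platform's pattern matching, which is an approximation on my part rather than the platform's actual implementation. The two file names are the ones used in this section, and shlex.quote shows that single quotes are an alternative to the backslash for keeping the star literal.

```python
import fnmatch
import shlex

names = ["hello.txt", "hs38DH.dict"]

# Case-sensitive glob matching against each file name.
starts_with_h = [n for n in names if fnmatch.fnmatchcase(n, "h*")]
ends_with_dict = [n for n in names if fnmatch.fnmatchcase(n, "*.dict")]
print(starts_with_h)     # both names begin with "h"
print(ends_with_dict)    # only the .dict file matches

# Quoting is an alternative to a backslash for keeping the star literal.
command = " ".join(["dx", "find", "data", "--name", shlex.quote("*.dict")])
print(command)
```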

The --brief flag will return only the file ID:

This is useful, for instance, for downloading a file:

The --json flag will return the results in JSON format. In the JSON chapter, you will learn how to parse these results for more advanced querying and data manipulation:

The --class option accepts the following values:

  • applet

  • database

  • file

  • record

  • workflow

The --state option accepts the following values:

  • open: A file that is currently being uploaded

  • closing: A file that is done uploading but is still being validated

  • closed: A file that is uploaded and validated

  • any: any of the above

There are many more options for finding data and other entities on the platform that will be covered in later chapters.

Running Jobs

It's time to run an app, but which one? I'd like to have a FASTQ file to work with, so I'll start by using the SRA FASTQ Importer. I can never quite remember the name of the app, so I'll search for it using a wildcard:

The "x" in the first column indicates this is an app supported by DNAnexus.

I can find information about the inputs and outputs to the app using either of these commands:

  • dx describe sra_fastq_importer

  • dx run sra_fastq_importer -h

I prefer the output from the second command:

Looking at the usage for the app, I see that only the -iaccession argument is required as all the others are shown enclosed in square brackets, e.g., [-ingc_key=(file)]. I can run the app with the SRA accession SRR070372 (C. elegans), answering "yes" to both launching and watching the app:

The equal sign in -iaccession=SRR070372 is required.
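If you launch apps from a script, it can help to assemble the arguments in one place so that the equal sign is never forgotten. The helper below is a hypothetical sketch of my own. It only builds the argument list and does not run anything. The -y option it can append confirms the launch without prompting.

```python
def build_run_command(app, inputs, confirm=False):
    """Return an argument list such as ["dx", "run", app, "-iname=value"]."""
    command = ["dx", "run", app]
    for name, value in inputs.items():
        command.append(f"-i{name}={value}")   # the equal sign joins name and value
    if confirm:
        command.append("-y")                   # skip the launch confirmation
    return command


command = build_run_command("sra_fastq_importer", {"accession": "SRR070372"})
print(" ".join(command))
```

The resulting list could be handed to subprocess.run on a machine where dx is installed and you are logged in.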

The output of watching is the same as you would see from the UI if you click the "MONITOR" tab in the project view and then "View Log" while the app is running. The end of the watch shows the app ran successfully and that a new file was created in my project:

I can find the size of the file with dx ls:

Now I'd like to run this file through FastQC. I'll search for the app by name just to be sure, and, yes, it's called "fastqc":

Again, I use either dx describe or dx run to see that the app requires

I will use the new file's ID as the input to FastQC, and I'll run it using the additional flags -y to confirm launching and --watch to immediately start watching the job:

Notice that the confirmation shows "Using input JSON". If you like, you can save that to a file called, for example, input.json:

I can then launch the job using the -f|--input-json-file argument along with the --brief flag to show only the resulting job ID:
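The input file can also be written from Python. In this sketch, the input name reads and the file ID file-xxxx are placeholders that I have assumed for illustration, so take the real names from the app's usage and from your own project. The "$dnanexus_link" key is how I understand file inputs to be referenced in the JSON, so compare it with the "Using input JSON" text shown at launch. Nothing here contacts the platform.

```python
import json
import os
import tempfile

# Hypothetical input name and placeholder file ID; check the app's usage for real names.
job_input = {"reads": {"$dnanexus_link": "file-xxxx"}}

# Write the JSON into a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "input.json")
with open(path, "w") as handle:
    json.dump(job_input, handle, indent=2)

# Read the file back to confirm that it round-trips unchanged.
with open(path) as handle:
    loaded = json.load(handle)

command = ["dx", "run", "fastqc", "-f", path, "--brief"]
print(loaded == job_input)
print(" ".join(command))
```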

Since the output will be the same, I can kill the job using dx terminate job-GXf930j071xJfYqfJ2kkvk8v.

The end of the watch shows that the job finishes successfully:

I would like to get a feel for the output, so I'll use dx head on the stats_txt output file ID:

Review

You are now able to:

  • List the advantages of interacting with the platform via the command line interface

  • List the functions of the SDK and the API

  • Describe the purpose of the dx-toolkit

  • Apply frequently used dx-toolkit commands to execute common use cases, applicable to a broad audience of users

Resources

Full Documentation

To create a support ticket if there are technical issues:

  1. Go to the Help header (same section where Projects and Tools are) inside the platform

  2. Select "Contact Support"

  3. Fill in the Subject and Message to submit a support ticket.
