Cloud Computing for Scientists
Your Computer: When we utilize cloud resources, we as users request them from our own computer using commands from the dx toolkit.
DNAnexus platform: The platform has many working pieces, but we can treat it as one entity here. Our request gets sent to the platform, and given availability, it will grant access to a temporary DNAnexus Worker.
DNAnexus Worker: This temporary worker is the third key player and is where we do our computation. We'll see that it starts out as a blank slate.
A project contains the files, executables, and logs associated with an analysis, all securely stored on the platform.
Executables on the DNAnexus platform are referred to as apps. Most importantly, an app needs to contain the software environment required to run its executable.
A software environment, in general, is everything needed to run software on a brand new computer. This includes the software itself as well as any dependencies it needs to run. One example of a dependency is the language (such as R) needed to execute the software.
Project storage is permanent, but the workers are temporary. This means that you have to relay information back and forth as shown in the figure below.
The key concept with cloud computing: project storage can be considered permanent on the platform, whereas workers are temporary. Because workers are temporary, we need to transfer the files we want to process to them. When we are done, we need to transfer any output files back to project storage. If we don't do this, the files will be lost when we lose access to the worker.
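The permanent-versus-temporary distinction above can be sketched as a toy model in Python. This is only an illustration of the mental model, not how the platform is implemented: `ProjectStorage` and `TemporaryWorker` are hypothetical names invented here, and files are represented as plain strings in a dictionary.

```python
class ProjectStorage:
    """Hypothetical stand-in for project storage: files persist here."""

    def __init__(self):
        self.files = {}


class TemporaryWorker:
    """Hypothetical stand-in for a worker: starts empty, wiped on release."""

    def __init__(self):
        self.files = {}

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Losing access to the worker means losing everything stored on it.
        self.files.clear()
        return False


project = ProjectStorage()
project.files["sample.txt"] = "acgt"

with TemporaryWorker() as worker:
    # Transfer the input from project storage to the worker.
    worker.files["sample.txt"] = project.files["sample.txt"]
    # Do the computation on the worker.
    worker.files["result.txt"] = worker.files["sample.txt"].upper()
    worker.files["scratch.tmp"] = "intermediate data"
    # Transfer the output we want to keep back to project storage.
    project.files["result.txt"] = worker.files["result.txt"]

print(sorted(project.files))  # ['result.txt', 'sample.txt']
print(worker.files)           # {}
```

Note that `scratch.tmp` was never copied back, so it is gone once the worker is released, while `result.txt` survives in project storage.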
On your local computer, everything is on your machine.
This includes your data and scripts, as well as your software environment and its dependencies, which are downloaded to your machine.
The results and intermediate steps are also generated and saved on your machine.
You own it and you control it.
This is great, but you are limited by how much storage and computational power you have on your local machine.
This is highlighted in the figure below:
In comparison, cloud computing adds layers to an analysis in exchange for more computational power and storage.
This relationship and the layers involved are shown in the figure below:
Let's contrast this with processing a file on the DNAnexus platform.
We'll start with our computer, the DNAnexus platform, and a file from project storage.
We start by using the dx run command to request that an app be run on a file in project storage. This request is sent to the platform, and an appropriate worker from the pool of workers is made available.
When the worker is available, we can transfer a file from the project to the worker.
The platform also handles installing the app and its software environment on the worker.
Once our app is ready and our file is set, we can run the computation on the worker.
Any files that we generate must be transferred back into project storage.
The first difference is that we need to request a worker and we only have temporary access to it. We need to bring everything to the worker, including the software environment.
The second key difference is that we need to bring our files and scripts from project storage to the worker.
Our first barrier is requesting an appropriate worker that can do our computational job.
For example, our app may require more memory or, if it is optimized to run on multiple CPUs, more CPUs.
To choose well, we need to understand how big our files are and what the computing requirements of our software are.
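As a rough sketch of this first barrier, the snippet below picks the smallest worker that satisfies a job's requirements. The pool, the worker names (`small`, `medium`, `large`), and their sizes are all made up for illustration and do not correspond to actual DNAnexus instance types. It also assumes the worker's storage only needs to hold the input files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerType:
    name: str
    cpus: int
    memory_gb: int
    storage_gb: int


# Hypothetical pool, ordered from smallest to largest.
POOL = [
    WorkerType("small", cpus=2, memory_gb=4, storage_gb=50),
    WorkerType("medium", cpus=4, memory_gb=16, storage_gb=100),
    WorkerType("large", cpus=16, memory_gb=64, storage_gb=500),
]


def pick_worker(input_gb, memory_gb, cpus, pool=POOL):
    """Return the first (smallest) worker type that meets every requirement."""
    for worker in pool:
        if (
            worker.cpus >= cpus
            and worker.memory_gb >= memory_gb
            and worker.storage_gb >= input_gb
        ):
            return worker
    raise ValueError("no worker type in the pool satisfies the request")


# An 80 GB input processed by software that needs 8 GB of memory and 4 CPUs.
choice = pick_worker(input_gb=80, memory_gb=8, cpus=4)
print(choice.name)  # medium
```

Requesting the smallest worker that fits is one reasonable policy, since larger workers generally cost more. A real job may also need headroom for intermediate and output files.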
Our second barrier is installing the software environment, such as R, on the worker.
Because we are starting from scratch on a worker, we will need ways to reproducibly install the software environment on the worker.
We'll see that this is one of the roles of Apps. As part of their job, they will install the appropriate software environment.
There is some good news. If we are running apps, they will handle both of these barriers.
First, all apps have a default instance type to use. We'll see that we can tailor this.
Second, apps install the required software environment on their workers.
Our third barrier is getting our files onto the worker from project storage, and then doing computations with them on the worker. The last barrier we'll talk about is getting the file outputs we've generated from the worker back into the project storage.
Cloud computing is nested by nature, and transferring files back and forth can make it difficult to learn.
Having a mental model of how cloud computing works can help us overcome these barriers.
Cloud computing is indirect, and you need to think two steps ahead.
Here is the visual for thinking about the steps for file management:
Apps help you install software on the worker
An app is a prebuilt software environment that is installed onto the temporary worker
We can build our own apps
Apps serve to (at minimum):
Request a worker (Challenge 1)
Configure the worker's environment (Challenge 2)
Establish data transfer (Challenge 3)
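The three responsibilities listed above can be sketched as a toy `App` class. This is a hypothetical model for building intuition, not the platform's actual app format or API: `Worker`, `App`, `count_lines`, and the instance type labels are all invented for this example.

```python
class Worker:
    """Hypothetical temporary worker: a blank slate with no software or files."""

    def __init__(self, instance_type):
        self.instance_type = instance_type
        self.software = set()
        self.files = {}


class App:
    """Hypothetical app: an executable bundled with the environment it needs."""

    def __init__(self, name, default_instance_type, environment, executable):
        self.name = name
        self.default_instance_type = default_instance_type
        self.environment = environment
        self.executable = executable

    def run(self, project_files, input_name, output_name, instance_type=None):
        # Challenge 1: request a worker (the default type can be overridden).
        worker = Worker(instance_type or self.default_instance_type)
        # Challenge 2: configure the worker's software environment.
        worker.software.update(self.environment)
        # Challenge 3: transfer the input from project storage to the worker.
        worker.files[input_name] = project_files[input_name]
        # Run the computation on the worker.
        worker.files[output_name] = self.executable(worker.files[input_name])
        # Transfer the output back to project storage before the worker goes away.
        project_files[output_name] = worker.files[output_name]
        return worker.instance_type


def count_lines(text):
    """Toy executable: count the lines in a text file."""
    return str(len(text.splitlines()))


project_files = {"reads.txt": "read1\nread2\nread3\n"}
app = App("line-counter", "medium", {"python"}, count_lines)
used = app.run(project_files, "reads.txt", "count.txt")
print(used, project_files["count.txt"])  # medium 3
```

Passing `instance_type="large"` to `run` mirrors the idea that an app's default instance type can be tailored for a particular job.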
Running apps is covered throughout the rest of the documentation.
To create a support ticket if there are technical issues:
Go to the Help header (same section where Projects and Tools are) inside the platform
Select "Contact Support"
Fill in the Subject and Message to submit a support ticket.