nf-core: Proteinfold

Necessary Disclaimers and Legal

The user is responsible for reviewing and complying with the license requirements of the software, notebooks, and data referenced in this documentation.

Users are responsible for the costs associated with running nf-core/proteinfold, with protein structure prediction methods such as AlphaFold2, ESMFold, and ColabFold, and with the genetic databases and their storage in their project spaces.

Instance type availability and pricing are subject to the contract between the user or the user’s organization and DNAnexus.

Citation for nf-core/proteinfold

nf-core is a community-driven collection of curated bioinformatics pipelines built using Nextflow, providing standardized, scalable, and portable workflows for reproducible analysis across diverse computing environments. Following these standards, nf-core/proteinfold enables state-of-the-art protein structure prediction through optimized Nextflow execution and currently integrates leading AI methods such as AlphaFold, ESMFold and ColabFold.

If you use nf-core/proteinfold, please cite the pipeline using nf-core/proteinfold print. Additional references for the integrated tools are listed in the pipeline’s CITATIONS.md file.

Overview of nf-core/proteinfold pipeline

The nf-core/proteinfold pipeline provides an automated workflow for protein structure prediction using AlphaFold2, ESMFold, and ColabFold. Users start by uploading a samplesheet that links sequence IDs to their FASTA files; the pipeline then prepares the input data, fetches the required databases, and runs the selected folding model. Each FASTA entry is processed as an individual prediction, while multimer sequences can be combined in a single multi-FASTA file. Built with nf-core and Nextflow best practices, the pipeline is scalable, reproducible, and ready to run on cloud or HPC systems, making advanced deep-learning structure prediction easier to use and share. Detailed information about the pipeline, including descriptions, parameters, and outputs, is available in the official documentation and on GitHub.
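As an illustration of the samplesheet described above, the following minimal Python sketch builds a .csv that links each sequence ID to its FASTA file. The column names `sequence` and `fasta`, the helper name `build_samplesheet`, and the example paths are assumptions made for this sketch; check the header expected by your pipeline version in the nf-core/proteinfold documentation, or use the provided notebook, before relying on it.

```python
import csv
import io
from pathlib import PurePosixPath


def build_samplesheet(fasta_paths):
    """Return samplesheet CSV text with one row per FASTA file.

    The header ("sequence", "fasta") is an assumption for illustration;
    confirm it against the pipeline documentation for your version.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sequence", "fasta"])
    for path in fasta_paths:
        # Use the file name without its extension as the sequence ID.
        sequence_id = PurePosixPath(path).stem
        writer.writerow([sequence_id, path])
    return buffer.getvalue()


# Hypothetical paths, used only to show the resulting layout.
sheet = build_samplesheet(["/data/T1024.fasta", "/data/T1026.fasta"])
print(sheet)
```

Each row corresponds to one prediction; for a multimer, the row would point to a single multi-FASTA file that contains all chains.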

On DNAnexus, we have made the nf-core/proteinfold workflow (version 1.1.1) directly available, together with a notebook for preparing inputs and both mini and full database options (up to ~3 TB). See the “Where to Access nf-core/proteinfold” section below to start accessing the dataset.

Where to Access nf-core/proteinfold

The DNAnexus platform includes:

- The imported nf-core/proteinfold version 1.1.1, with applet ID applet-J5PY67j03fbk0Xz7vqp7x9bb, on the platform.
- A CUDA 12 build of the applet. The standard nf-core/proteinfold v1.1.1 release supports only CUDA 11 (Driver R470). To support CUDA 12 (Driver R535), we built the proteinfold_modified_GPU_R535_v1.1.1 applet using the development containers for both AlphaFold2 and ColabFold. Access this applet on the platform.
- A sample sheet from the nf-core/proteinfold GitHub repository, available on the platform, along with sample sheets prepared using our notebook. Sample sheets are in .csv format. Example .fasta files are also provided on the platform.
- Configuration files for GPU runs, CPU runs, and tarball setups for the pdb_mmcif dataset, which contains a large number of files. These config files are located here on the platform.
- Full AlphaFold2 databases and model parameters (~3 TB), as well as a smaller “mini” version (~340 GB), available for testing. These files are here on the platform.
- Full ColabFold databases and model parameters (~1 TB). These files are here on the platform.
- Data assets for the ESMFold model (~8 GB), also provided here on the platform.

To use these resources, copy the datasets and notebooks into your own DNAnexus project. Instructions are included in the “Copying Data and Notebooks into a Project” section.

Running nf-core/proteinfold on DNAnexus

Copying Data and Notebooks into a Project

To utilize the dataset, please copy the data from this project into your own project. Here are the steps to copy the data into a Project Space:

1. Create a project for your analysis, billed to your own organization. Tutorials on how to set up a project can be found on this page.
2. Go to the Resources tab, find the project titled “Public Datasets AWS US (East)”, and select the folder "ProteinFold".
3. Select the data folder and the notebook.
4. Select "Copy" in the top right menu, and select the project that you created in Step 1.
5. Go to the project space you created in Step 1 to start exploring the pipeline.

To run Nextflow on DNAnexus, please see the Running Nextflow Pipelines Documentation. To learn more about Nextflow on DNAnexus, please read the Academy Documentation.

Instance Type Selection

- Instance start times depend on the queue for each instance type. Less common instance types may have longer wait times because of their limited availability.
- GPU instances take longer to set up than single CPU instance types because of their availability and complexity.
- To select CPU and GPU instances, use the prepared configuration files as examples and refer to the nf-core/proteinfold documentation.
- Instance type availability and pricing are subject to the contract between the user or the user’s organization and DNAnexus.
Requirements for our example nf-core/proteinfold pipeline:
- nf-core/proteinfold pipeline with mini version of databases for AlphaFold2
  - GPU instance type for AlphaFold2 step: mem2_ssd1_gpu_x32
    - Config files: proteinfold_gpu_instance.config, beforescript_tarball_minidb.config
    - Applet: proteinfold_v.1.1.1
- nf-core/proteinfold pipeline with full version of databases for AlphaFold2
  - CPU instance type for AlphaFold2 step: mem3_ssd3_x24
    - Config files: beforescript_tarball_fulldb.config, run_alphafold2_highmem_cpu.config
    - Applet: proteinfold_v.1.1.1
  - GPU instance type for AlphaFold2 step: mem2_ssd2_gpu4_v2_x48
    - Config files: alphafold2_R535.config, beforescript_tarball_fulldb.config
    - Applet: proteinfold_modified_GPU_R535_v1.1.1
- nf-core/proteinfold pipeline with full version of databases for ColabFold
  - CPU instance type for ColabFold step: mem3_ssd3_x24
    - Config files: full_db_cpu_colabfold.config
    - Applet: proteinfold_v.1.1.1
  - GPU instance type for ColabFold step: mem2_ssd2_gpu1_v2_x64
    - Config files: proteinfold_gpu_instance_full_db_Colabfold.config
    - Applet: proteinfold_modified_GPU_R535_v1.1.1
- nf-core/proteinfold pipeline with full version of databases for ESMFold
  - CPU instance type for ESMFold step: mem2_ssd1_v2_x4
    - Applet: proteinfold_v.1.1.1
  - GPU instance type for ESMFold step: mem2_ssd1_gpu_x16
    - Config files: run_esmfold_gpu.config
    - Applet: proteinfold_v.1.1.1
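
For scripted runs, the example requirements above can be kept as a small lookup table. The Python sketch below copies the instance types, config file names, and applet names from the list; the `RunSetup` type, the `lookup` helper, and the key spelling (method, database size, device) are hypothetical conveniences for this sketch and are not part of the pipeline or the platform.

```python
from typing import NamedTuple, Tuple


class RunSetup(NamedTuple):
    instance_type: str
    config_files: Tuple[str, ...]
    applet: str


# Keys are (method, database, device). Values mirror the example list above.
SETUPS = {
    ("alphafold2", "mini", "gpu"): RunSetup(
        "mem2_ssd1_gpu_x32",
        ("proteinfold_gpu_instance.config", "beforescript_tarball_minidb.config"),
        "proteinfold_v.1.1.1",
    ),
    ("alphafold2", "full", "cpu"): RunSetup(
        "mem3_ssd3_x24",
        ("beforescript_tarball_fulldb.config", "run_alphafold2_highmem_cpu.config"),
        "proteinfold_v.1.1.1",
    ),
    ("alphafold2", "full", "gpu"): RunSetup(
        "mem2_ssd2_gpu4_v2_x48",
        ("alphafold2_R535.config", "beforescript_tarball_fulldb.config"),
        "proteinfold_modified_GPU_R535_v1.1.1",
    ),
    ("colabfold", "full", "cpu"): RunSetup(
        "mem3_ssd3_x24",
        ("full_db_cpu_colabfold.config",),
        "proteinfold_v.1.1.1",
    ),
    ("colabfold", "full", "gpu"): RunSetup(
        "mem2_ssd2_gpu1_v2_x64",
        ("proteinfold_gpu_instance_full_db_Colabfold.config",),
        "proteinfold_modified_GPU_R535_v1.1.1",
    ),
    # No config file is listed for the ESMFold CPU example.
    ("esmfold", "full", "cpu"): RunSetup("mem2_ssd1_v2_x4", (), "proteinfold_v.1.1.1"),
    ("esmfold", "full", "gpu"): RunSetup(
        "mem2_ssd1_gpu_x16", ("run_esmfold_gpu.config",), "proteinfold_v.1.1.1"
    ),
}


def lookup(method, database, device):
    """Return the example setup for a combination, or raise ValueError."""
    key = (method.lower(), database.lower(), device.lower())
    if key not in SETUPS:
        raise ValueError(f"No example setup listed for {key}")
    return SETUPS[key]


setup = lookup("AlphaFold2", "full", "GPU")
print(setup.instance_type, setup.applet)
```

These are example configurations only; instance availability still depends on your contract and on queue conditions.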

A note on nf-core/proteinfold pipeline

A note on the output file from the MultiQC step in the pipeline:
- We identified an off-by-one pLDDT indexing issue in the summary .tsv files and prepared a notebook, fix_plddt_mqc_file.ipynb, to correct it automatically. The notebook is here on the platform. We also reported the problem to the nf-core/proteinfold project.
- You can learn more about the pipeline outputs in the nf-core/proteinfold documentation.
A note on the outdir parameter:
- Specify the name of the folder where you want to save your output. You do not need to create this folder beforehand, as the pipeline will create it automatically. You can provide the destination path (parent folder) in the destination parameter.
We prepared a Markdown file, nf-proteinfold.md, that explains the pipeline on DNAnexus and a few considerations for running it.
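
To make the off-by-one pLDDT note above more concrete, here is a rough Python sketch of what such a correction could look like. It is not the logic of fix_plddt_mqc_file.ipynb. The sketch assumes, purely for illustration, that the summary .tsv has a residue index column named `Positions` whose values are one too low, and that lines starting with `#` are header comments. The column name, the direction of the shift, and the `shift_positions` helper are all hypothetical, so use the provided notebook for real data.

```python
def shift_positions(tsv_text, position_column="Positions", offset=1):
    """Add `offset` to the residue index column of a tab-separated table.

    Assumptions (hypothetical, for illustration only): comment lines start
    with "#", the first non-comment line is the header, and the index
    column holds integers.
    """
    out_lines = []
    pos_idx = None
    for line in tsv_text.splitlines():
        if line.startswith("#") or not line.strip():
            out_lines.append(line)  # keep comments and blank lines as-is
            continue
        fields = line.split("\t")
        if pos_idx is None:
            # First data-bearing line is the header row.
            pos_idx = fields.index(position_column)
            out_lines.append(line)
            continue
        fields[pos_idx] = str(int(fields[pos_idx]) + offset)
        out_lines.append("\t".join(fields))
    return "\n".join(out_lines) + "\n"


# Made-up two-residue table, used only to show the effect of the shift.
example = "# id: plddt\nPositions\trank_0\n0\t91.2\n1\t88.7\n"
fixed = shift_positions(example)
print(fixed)
```

Only the index column is changed in this sketch; the pLDDT values themselves are left untouched.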

Video: Running Proteinfold on the DNAnexus Platform
