nf-core: Proteinfold
Necessary Disclaimers and Legal
The user is responsible for reviewing and complying with the license requirements of the software, notebooks, and data referenced in this documentation.
Users are responsible for the costs associated with running nf-core/proteinfold and its protein structure prediction methods (AlphaFold2, ESMFold, and ColabFold), as well as the costs of storing the genetic databases in their project spaces.
Instance type availability and pricing are subject to the contract between the user or the user’s organization and DNAnexus.
Citation for nf-core/proteinfold
nf-core is a community-driven collection of curated bioinformatics pipelines built using Nextflow, providing standardized, scalable, and portable workflows for reproducible analysis across diverse computing environments. Following these standards, nf-core/proteinfold enables state-of-the-art protein structure prediction through optimized Nextflow execution and currently integrates leading AI methods such as AlphaFold, ESMFold and ColabFold.
If you use nf-core/proteinfold, please cite the pipeline using nf-core/proteinfold print. Additional references for the integrated tools are listed in the pipeline’s CITATIONS.md file.
Overview of nf-core/proteinfold pipeline
The nf-core/proteinfold pipeline provides an automated workflow for protein structure prediction using AlphaFold2, ESMFold, and ColabFold. Users start by uploading a samplesheet that links sequence IDs to their FASTA files, and the pipeline then prepares input data, fetches required databases, and runs the selected folding model. Each FASTA entry is processed as an individual prediction, while multimer sequences can be combined in a single multi-FASTA file.
Built with nf-core and Nextflow best practices, the pipeline is scalable, reproducible, and ready to run on cloud or HPC systems, making advanced deep-learning structure prediction easier to use and share. Detailed information about the pipeline, including descriptions, parameters, and outputs, is available in the official documentation and on GitHub.
On DNAnexus, we have made the nf-core/proteinfold workflow (version 1.1.1) directly available, together with a notebook for preparing inputs and both mini and full database options (up to ~3 TB). See the “Where to Access nf-core/proteinfold” section below to start accessing the dataset.

Figure: Scalable and reproducible protein structure prediction pipeline on DNAnexus. Input sequences are validated and processed through AlphaFold2, ColabFold, or ESMFold workflows, with standardized database preparation and parameterization, producing PDB structure outputs.
Where to Access nf-core/proteinfold
The DNAnexus Platform has projects (AWS US East, AWS Europe (Frankfurt), AWS Europe (London), Azure Amsterdam, Azure US (West)) that include the following resources: applets, sample sheets and input files, soft configuration files, and databases.
The nf-core/proteinfold project on DNAnexus is organized as follows:
Applets:
proteinfold_v.1.1.1: Imported nf-core/proteinfold version 1.1.1, available on the platform
proteinfold_modified_GPU_R535_v1.1.1: While based on nf-core/proteinfold v1.1.1, the standard release only supports CUDA 11 (Driver R470). To upgrade support to CUDA 12 (Driver R535), we built this applet using the development containers for both AlphaFold2 and ColabFold. Available on the platform.
Sample sheet & Input files:
A sample sheet from GitHub's nf-core/proteinfold and sample sheets prepared using our notebook are available on the platform. All sample sheets are in .csv format.
To simplify sample sheet creation, we prepared a Jupyter notebook that generates a valid CSV file. In our example, we used five CASP14 targets (T1024–T1028). You can refer to this notebook to prepare your own input files.
Example .fasta files from CASP14 targets are also provided on the platform.
Soft configuration files: Configuration files are provided as examples for users to adapt for their own runs, available on the platform. They are organized into three groups:
GPU/CPU instance setup:
proteinfold_gpu_instance.config: GPU with Driver R470 (AlphaFold2)
alphafold2_R535.config: GPU with Driver R535 (AlphaFold2); use with proteinfold_modified_GPU_R535_v1.1.1 applet
colabfold_R535.config: GPU with Driver R535 (ColabFold); use with proteinfold_modified_GPU_R535_v1.1.1 applet
run_alphafold2_highmem_cpu.config: High-memory CPU (AlphaFold2)
run_colabfold_high_storage.config: High-storage CPU (ColabFold)
run_esmfold_gpu.config: GPU (ESMFold)
pdb_mmcif tarball setup: load the pdb_mmcif dataset from a single tarball file instead of transferring thousands of individual files
beforescript_tarball_minidb.config: for the mini database
beforescript_tarball_fulldb.config: for the full database
Parameter file:
test_profile_params.json: example parameter file
Databases: Pre-downloaded databases are available on the platform to avoid downloading them at runtime. The following databases are provided:
AlphaFold2: Full database (~3 TB) and Mini database (~340 GB)
ColabFold: Full database (~1 TB) and Mini database
ESMFold: Model assets (~8 GB)
Here is an example folder structure of the AlphaFold2 database, available on the Platform in AWS US East, AWS Europe (Frankfurt), AWS Europe (London), Azure Amsterdam, and Azure US (West).
To use these resources, you have two options:
Copy to your own project by cloning the datasets and applets into your own DNAnexus project. Instructions are included in the "Copying Data and Applets into a Project" section.
Load directly from the public dataset by referencing the databases in the Public Datasets project without copying them, which avoids storage costs in your own project. Our example uses AWS US (East), but the Public Datasets are available in multiple regions.
Running nf-core/proteinfold on DNAnexus
Copying Data and Applets into a Project
To utilize the dataset and applets, please copy the data from the project listed above into your own project. Here are the steps to copy the data into a Project Space:
Create a project for your analysis, billed to your own organization. Tutorials on how to set up a project can be found on this page.
Go to the Resources tab, find the project titled “Public Datasets AWS US (East)”, and select the folder "Proteinfold".
Select the applets, samplesheet_input, and config folders.
Select "Copy" on the top right menu, and select the project that you created in Step 1.
Then, go to the project space you created in Step 1 to start exploring the pipeline.
To run Nextflow pipelines on DNAnexus, please see the Running Nextflow Pipelines Documentation. To learn more about Nextflow on DNAnexus, please read the Academy Documentation.
Preparing the Sample Sheet Input
A sample sheet (.csv) is required to describe the sequences to be analyzed. Two columns are required:
sequence: sequence identifier (e.g., T1024, T1025, …)
fasta: full path to the FASTA file. Extension must be .fasta or .fa
Here is our example:
We provide the following sample sheets and notebooks on the Platform for AWS US East, AWS Europe (Frankfurt), AWS Europe (London), Azure Amsterdam, Azure US (West):
samplesheet_proteinfold.csv: example sample sheet from nf-core/proteinfold with two samples (T1024 and T1026)
prepare_samplesheet.ipynb: Jupyter notebook to generate a valid sample sheet. In our example, we used five CASP14 targets (T1024–T1028)
samplesheet.csv: generated sample sheet from the notebook above
Please refer to the nf-core/proteinfold website for more information about the input file.
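The two-column format described above can be sketched in a few lines of Python. This is a minimal illustration, not the contents of prepare_samplesheet.ipynb: the function names and the FASTA paths are hypothetical, and the sketch assumes the sequence identifier can be taken from the FASTA file name.

```python
import csv
from pathlib import PurePosixPath

# Extensions accepted for the "fasta" column, per the requirement above.
ALLOWED_EXTENSIONS = {".fasta", ".fa"}


def build_samplesheet_rows(fasta_paths):
    """Return one {"sequence", "fasta"} row per FASTA path.

    Assumption: the sequence ID is the file name without its extension.
    """
    rows = []
    for path in fasta_paths:
        parsed = PurePosixPath(path)
        if parsed.suffix not in ALLOWED_EXTENSIONS:
            raise ValueError(f"{path}: extension must be .fasta or .fa")
        rows.append({"sequence": parsed.stem, "fasta": path})
    return rows


def write_samplesheet(rows, out_path):
    """Write the rows as a CSV file with the two required columns."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["sequence", "fasta"])
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    # Hypothetical paths; replace them with the full paths to your own files.
    example = build_samplesheet_rows(
        ["/data/fasta/T1024.fasta", "/data/fasta/T1025.fa"]
    )
    write_samplesheet(example, "samplesheet.csv")
```

Rejecting unexpected extensions before writing the file surfaces a malformed entry early, rather than after the pipeline has been launched.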
Minimal test
Use the test configuration (test.config) provided by nf-core/proteinfold to quickly validate the pipeline. See this command line for the minimal test.
Runtime: ~9 minutes
Running nf-core/proteinfold with AlphaFold2
For database and configuration file details, refer to the Databases and Soft Configuration Files sections above.
The table below shows the instance types used in our example runs for the RUN_ALPHAFOLD2 process. Instance type, memory, and CPU are defined directly in the soft configuration file, and users can adapt these values for their own requirements.
| Option | Instance type | Soft config files | Applet |
| --- | --- | --- | --- |
| GPU - mini database | mem2_ssd1_gpu_x32 | proteinfold_gpu_instance.config + beforescript_tarball_minidb.config | proteinfold_v.1.1.1 |
| CPU - full database | mem3_ssd3_x24 | run_alphafold2_highmem_cpu.config + beforescript_tarball_fulldb.config | proteinfold_v.1.1.1 |
| GPU - full database | mem2_ssd2_gpu4_v2_x48 | alphafold2_R535.config + beforescript_tarball_fulldb.config | proteinfold_modified_GPU_R535_v1.1.1 |
Below is an example configuration file for a GPU instance (alphafold2_R535.config)
For more information on how subjob instance types are determined, refer to the DNAnexus Running Nextflow Pipelines documentation. Note that instance start times depend on queue availability, so less common instance types may incur longer wait times.
Running nf-core/proteinfold with ColabFold
For database and configuration file details, refer to the Databases and Configuration Files sections above. For more database information, refer to the nf-core/proteinfold usage documentation. The sample sheet format is the same as AlphaFold2.
The ColabFold workflow runs two processes sequentially: MMSEQS_COLABFOLDSEARCH (MSA search) and COLABFOLD_BATCH (structure prediction). Both processes are defined in soft configuration files.
| Option | Instance type | Soft config files | Applet |
| --- | --- | --- | --- |
| GPU - full database | mem2_ssd2_gpu1_v2_x64 | colabfold_R535.config | proteinfold_v.1.1.1 |
| CPU - full database | mem3_ssd3_x24 | run_colabfold_high_storage.config | proteinfold_modified_GPU_R535_v1.1.1 |
Below are examples of the configuration files: run_colabfold_high_storage.config (both processes run on CPU):
Running nf-core/proteinfold with ESMFold
For database and configuration file details, refer to the Databases and Configuration Files sections above. For more information, refer to the nf-core/proteinfold usage documentation and the ESMFold GitHub repository.
The sample sheet format is the same as for AlphaFold2, with additional support for multimer predictions via samplesheet_multimer.csv in the samplesheet_input folder.
The table below shows the instance types used in our example runs for the RUN_ESMFOLD process. Users can adapt the provided configuration files for their own requirements.
| Option | Instance type | Soft config files | Applet |
| --- | --- | --- | --- |
| GPU | mem2_ssd1_gpu_x16 | run_esmfold_gpu.config | proteinfold_v.1.1.1 |
| CPU | mem2_ssd1_v2_x4 | No soft config file | proteinfold_v.1.1.1 |
Below is an example configuration file for a GPU instance (run_esmfold_gpu.config)
For more information on how subjob instance types are determined, refer to the DNAnexus Running Nextflow Pipelines documentation. Note that instance start times depend on queue availability, so less common instance types may incur longer wait times.
Technical considerations
outdir parameter:
The outdir parameter is required and defines the subdirectory where results are stored. When launching the pipeline, choose a parent directory (e.g., results) and specify the desired output folder name (e.g., test_full_database) in the outdir field.
Note: You do not need to manually create the test_full_database folder inside results. The pipeline will automatically create it.
Running with GPU:
To enable GPU acceleration, set use_gpu = true and use the appropriate GPU configuration file. Note that GPU instance types are used only for the AlphaFold2, ColabFold, and ESMFold subworkflow steps. The head node and MultiQC steps always run on CPU instances.
The full AlphaFold2 database (~3 TB) and ColabFold database (~1.4 TB) cannot run on Driver R470 due to insufficient storage on available GPU instance types. Full-database runs must use Driver R535 with the proteinfold_modified_GPU_R535_v1.1.1 applet.
Out-of-Memory Issue in RUN_ALPHAFOLD2
Problem: Despite running on a large-memory instance, the RUN_ALPHAFOLD2 process may fail with out-of-memory (OOM) errors at ~34–36 GB RAM. This is caused by Nextflow's per-process memory limits defined in the pipeline's base.config, not the total available memory on the instance:
This launches the AlphaFold2 container with a Docker cgroup memory limit of ~36 GB. When HHblits exceeds this limit, the kernel OOM killer terminates the process.
Solution: Override the resource limits for RUN_ALPHAFOLD2 in a custom configuration file, as shown in run_alphafold2_highmem_cpu.config:
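The general shape of such an override can be illustrated with a small Python helper that renders a Nextflow process selector block. This is a hedged sketch and not the contents of run_alphafold2_highmem_cpu.config: the helper name and the memory and CPU values are hypothetical, and only the standard Nextflow withName selector and the memory and cpus directives are assumed.

```python
def render_process_override(process_name, memory_gb, cpus):
    """Render a Nextflow config fragment that overrides resources for one process.

    The values passed in are illustrative; choose them to fit the instance
    type you actually run on.
    """
    return (
        "process {\n"
        f"    withName: '{process_name}' {{\n"
        f"        memory = {memory_gb}.GB\n"
        f"        cpus = {cpus}\n"
        "    }\n"
        "}\n"
    )


if __name__ == "__main__":
    # Hypothetical values, not taken from the provided configuration file.
    print(render_process_override("RUN_ALPHAFOLD2", memory_gb=180, cpus=24))
```

Because the selector targets only RUN_ALPHAFOLD2, the limits that base.config applies to the other processes are left unchanged.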
Using Tarball Configuration for pdb_mmcif
The pdb_mmcif dataset contains 188,085 files (92 in the mini version). Transferring this many individual files via object storage is slow due to I/O overhead. To improve performance, we provide the dataset as a single tarball file (pdb_mmcif.tar for full, pdb_mmcif_mini.tar for mini).
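The idea of bundling many small files into one archive and unpacking it locally can be sketched with Python's standard tarfile module. This is only an illustration of the technique under assumed names and paths; the provided configuration files take care of the actual extraction during a pipeline run.

```python
import tarfile
import tempfile
from pathlib import Path


def pack_directory(src_dir, tar_path):
    """Bundle a directory of many small files into a single uncompressed tarball."""
    with tarfile.open(tar_path, "w") as tar:
        tar.add(src_dir, arcname=Path(src_dir).name)


def unpack_tarball(tar_path, dest_dir):
    """Extract the tarball and return the sorted names of the .cif files it held."""
    with tarfile.open(tar_path, "r") as tar:
        tar.extractall(dest_dir)
    return sorted(p.name for p in Path(dest_dir).rglob("*.cif"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical miniature stand-in for the pdb_mmcif directory.
        src = Path(tmp) / "pdb_mmcif"
        src.mkdir()
        for name in ("1abc.cif", "2xyz.cif"):
            (src / name).write_text("data_example\n")
        tar_path = Path(tmp) / "pdb_mmcif_example.tar"
        pack_directory(src, tar_path)
        print(unpack_tarball(tar_path, Path(tmp) / "extracted"))
```

A single archive turns thousands of per-file transfers into one sequential read, which is where the performance gain comes from.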
To load the tarball, use the appropriate configuration file and update the path to where you stored the tarball:
beforescript_tarball_minidb.config (update the path to pdb_mmcif_mini.tar in soft config file)
beforescript_tarball_fulldb.config (update the path to pdb_mmcif.tar in the soft config file)
Here is beforescript_tarball_fulldb.config
Because pdb_mmcif is provided as a tarball, the pipeline's default pdb_mmcif path must be overridden to avoid a directory not found error. To do this, specify a fake S3 path for pdb_mmcif_path:
This causes the pipeline to assume it is reading from S3, while the actual data is extracted from the tarball via the configuration file.
Database Path for test_profile.config
Two databases (bfd and uniref30) are not included in test_profile.config. However, the DNAnexus Nextflow applet still requires paths for these databases. To avoid directory not found errors, use the following intentionally empty S3 paths when you replicate the test profile config provided by nf-core/proteinfold:
Check the test_profile_params.json in the config folder for an example:
Misaligned pLDDT Values in MultiQC Output
An off-by-one indexing error exists in the summary .tsv files produced by the MultiQC step (e.g., T1024.1_plddt_mqc.tsv):
Row 409 contains pLDDT values for rank_1–rank_4 despite having no valid residue position (sequence length is only 408 residues).
Row 1 has position = 1 for rank_0 only; rank_1–rank_4 values are missing.
The pLDDT score at position N in the PDB appears at position N+1 in the TSV for rank_1 through rank_4. The rank_0 column is correctly aligned.
A Jupyter notebook (fix_plddt_mqc_file.ipynb) is available on the platform to automatically correct this issue. We have also reported this to the nf-core/proteinfold project (see this issue). For more information on pipeline outputs, refer to the nf-core/proteinfold output documentation.
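The realignment itself amounts to shifting the affected columns up by one row and dropping the overflow row. The sketch below is a minimal illustration and not the contents of fix_plddt_mqc_file.ipynb; the column names (Positions, rank_0, and so on) and the tab-separated layout are assumptions, so check them against your own file before relying on it.

```python
import csv
import io


def realign_plddt(rows, shifted_columns):
    """Shift the named columns up by one row and drop the trailing overflow row.

    rows: list of dicts, one per TSV data row, in file order.
    shifted_columns: columns whose values sit one row too low.
    """
    fixed = []
    for i in range(len(rows) - 1):
        row = dict(rows[i])
        for col in shifted_columns:
            # The value belonging to this position is stored on the next row.
            row[col] = rows[i + 1][col]
        fixed.append(row)
    return fixed


def fix_tsv_text(text, position_column="Positions", aligned_column="rank_0"):
    """Return corrected TSV text; every column except the two named ones is shifted."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = list(reader)
    shifted = [
        c for c in reader.fieldnames if c not in (position_column, aligned_column)
    ]
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(realign_plddt(rows, shifted))
    return out.getvalue()
```

For a 408-residue target, this turns the 409 data rows into 408, with every rank column aligned to its residue position.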
We prepared a markdown file to explain the pipeline on DNAnexus and a few considerations when you run the pipeline. The file name is nf-proteinfold_README.md.
Video: Running Proteinfold on the DNAnexus Platform