nf-core: Proteinfold
Necessary Disclaimers and Legal
The user is responsible for reviewing and complying with the license requirements of the software, notebooks, and data referenced in this documentation.
Users are responsible for the costs associated with running nf-core/proteinfold and its protein structure prediction methods (such as AlphaFold2, ESMFold, and ColabFold), as well as for the genetic databases and their storage in their project spaces.
Instance type availability and pricing are subject to the contract between the user or the user’s organization and DNAnexus.
Citation for nf-core/proteinfold
nf-core is a community-driven collection of curated bioinformatics pipelines built using Nextflow, providing standardized, scalable, and portable workflows for reproducible analysis across diverse computing environments. Following these standards, nf-core/proteinfold enables state-of-the-art protein structure prediction through optimized Nextflow execution and currently integrates leading AI methods such as AlphaFold, ESMFold and ColabFold.
If you use nf-core/proteinfold, please cite the pipeline using nf-core/proteinfold print. Additional references for integrated tools are listed in the pipeline’s CITATIONS.md file.
Overview of nf-core/proteinfold pipeline
The nf-core/proteinfold pipeline provides an automated workflow for protein structure prediction using AlphaFold2, ESMFold, and ColabFold. Users start by uploading a samplesheet that links sequence IDs to their FASTA files. The pipeline then prepares the input data, fetches the required databases, and runs the selected folding model. Each FASTA entry is processed as an individual prediction, while multimer sequences can be combined in a single multi-FASTA file. Built with nf-core and Nextflow best practices, the pipeline is scalable, reproducible, and ready to run on cloud or HPC systems, making advanced deep-learning structure prediction easier to use and share. Detailed information about the pipeline, including descriptions, parameters, and outputs, is available in the official documentation and on GitHub.
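As an illustration of the samplesheet described above, the following is a minimal Python sketch that builds a .csv file with one row per FASTA file. The helper name `write_samplesheet` is hypothetical, and the header `sequence,fasta` is an assumption based on the samplesheet format documented for nf-core/proteinfold 1.1.x; check the pipeline documentation for the version you run.

```python
import csv
from pathlib import Path


def write_samplesheet(fasta_dir, out_csv, id_column="sequence"):
    """Write one samplesheet row per FASTA file found in fasta_dir.

    The column names (id_column and "fasta") are assumptions; adjust
    them to match the pipeline version in use.
    """
    # Collect FASTA files in a stable order so the sheet is reproducible.
    fasta_files = sorted(
        p for p in Path(fasta_dir).iterdir()
        if p.suffix.lower() in {".fasta", ".fa"}
    )
    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([id_column, "fasta"])
        for path in fasta_files:
            # Use the file name without its extension as the sequence ID.
            writer.writerow([path.stem, str(path)])
    return len(fasta_files)
```

The paths written here are local paths. For a run on the platform, the `fasta` column would need to point to wherever the FASTA files are actually stored.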
On DNAnexus, we have made the nf-core/proteinfold workflow (version 1.1.1) directly available, together with a notebook for preparing inputs and both mini and full database options (up to ~3 TB). See the “Where to Access nf-core/proteinfold” section below to start accessing the dataset.
Where to Access nf-core/proteinfold
The DNAnexus platform includes:
The imported nf-core/proteinfold pipeline (version 1.1.1), available on the platform with applet ID applet-J5PY67j03fbk0Xz7vqp7x9bb.
The standard nf-core/proteinfold v1.1.1 release supports only CUDA 11 (driver R470). To add support for CUDA 12 (driver R535), we built the proteinfold_modified_GPU_R535_v1.1.1 applet using the development containers for both AlphaFold2 and ColabFold. This applet is also available on the platform.
A sample sheet from the nf-core/proteinfold GitHub repository, along with sample sheets prepared using our notebook. Sample sheets are in .csv format. Example .fasta files are also provided on the platform.
Configuration files for GPU runs, CPU runs, and tarball setups for the pdb_mmcif dataset, which contains a large number of files. These config files are available on the platform.
Full AlphaFold2 databases and model parameters (~3 TB), as well as a smaller “mini” version (~340 GB), are available for testing. These files are here on the platform.
Full ColabFold databases and model parameters (~1 TB). These files are here on the platform.
Data assets for the ESMFold model (~8 GB) are also provided here on the platform.
To use these resources, simply copy the datasets and notebooks into your own DNAnexus project. Instructions are included in the “Copying Data and Notebooks into a Project” section.
Running nf-core/proteinfold on DNAnexus
Copying Data and Notebooks into a Project
To use the dataset, copy the data from this project into your own project by following these steps:
1. Create a project for your protein structure prediction analysis, billed to your own organization. Tutorials on how to set up a project can be found on this page.
2. Go to the Resources tab, find the project titled “Public Datasets AWS US (East)”, and select the folder "ProteinFold".
3. Select the data folder and the notebook.
4. Select "Copy" in the top right menu, and choose the project that you created in Step 1.
5. Go to the project you created in Step 1 to start exploring the pipeline.
To run Nextflow pipelines on DNAnexus, please see the Running Nextflow Pipelines Documentation. To learn more about Nextflow on DNAnexus, please read the Academy Documentation.
Instance Type Selection
Instance start times depend on queue availability. Less common instance types may have longer wait times because of their limited availability.
GPU instances take longer to set up than CPU-only instance types because of their availability and complexity.
To select CPU and GPU instance types, use the prepared configuration files as examples and consult the nf-core/proteinfold documentation.
Instance type availability and pricing are subject to the contract between the user or the user’s organization and DNAnexus.
Requirements for our example nf-core/proteinfold pipeline:
nf-core/proteinfold pipeline with mini version of databases for AlphaFold2
GPU instance type for AlphaFold2 step: mem2_ssd1_gpu_x32
Config files: proteinfold_gpu_instance.config, beforescript_tarball_minidb.config
Applet: proteinfold_v.1.1.1
nf-core/proteinfold pipeline with full version of databases for AlphaFold2
CPU instance type for AlphaFold2 step: mem3_ssd3_x24
Config files: beforescript_tarball_fulldb.config, run_alphafold2_highmem_cpu.config
Applet: proteinfold_v.1.1.1
GPU instance type for AlphaFold2 step: mem2_ssd2_gpu4_v2_x48
Config files: alphafold2_R535.config, beforescript_tarball_fulldb.config
Applet: proteinfold_modified_GPU_R535_v1.1.1
nf-core/proteinfold pipeline with full version of databases for ColabFold
CPU instance type for ColabFold step: mem3_ssd3_x24
Config files: full_db_cpu_colabfold.config
Applet: proteinfold_v.1.1.1
GPU instance type for ColabFold step: mem2_ssd2_gpu1_v2_x64
Config files: proteinfold_gpu_instance_full_db_Colabfold.config
Applet: proteinfold_modified_GPU_R535_v1.1.1
nf-core/proteinfold pipeline with full version of databases for ESMFold
CPU instance type for ESMFold step: mem2_ssd1_v2_x4
Applet: proteinfold_v.1.1.1
GPU instance type for ESMFold step: mem2_ssd1_gpu_x16
Config files: run_esmfold_gpu.config
Applet: proteinfold_v.1.1.1
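The example requirements above can be kept as a small lookup table when scripting runs. The sketch below records the instance types, config files, and applet names exactly as listed; the dictionary layout and the helper name `lookup_example` are hypothetical conveniences, not part of the pipeline or the platform.

```python
# Example run settings as listed in this section, keyed by
# (method, database size, device). The structure is illustrative only.
EXAMPLE_RUNS = {
    ("alphafold2", "mini", "gpu"): {
        "instance_type": "mem2_ssd1_gpu_x32",
        "configs": ["proteinfold_gpu_instance.config",
                    "beforescript_tarball_minidb.config"],
        "applet": "proteinfold_v.1.1.1",
    },
    ("alphafold2", "full", "cpu"): {
        "instance_type": "mem3_ssd3_x24",
        "configs": ["beforescript_tarball_fulldb.config",
                    "run_alphafold2_highmem_cpu.config"],
        "applet": "proteinfold_v.1.1.1",
    },
    ("alphafold2", "full", "gpu"): {
        "instance_type": "mem2_ssd2_gpu4_v2_x48",
        "configs": ["alphafold2_R535.config",
                    "beforescript_tarball_fulldb.config"],
        "applet": "proteinfold_modified_GPU_R535_v1.1.1",
    },
    ("colabfold", "full", "cpu"): {
        "instance_type": "mem3_ssd3_x24",
        "configs": ["full_db_cpu_colabfold.config"],
        "applet": "proteinfold_v.1.1.1",
    },
    ("colabfold", "full", "gpu"): {
        "instance_type": "mem2_ssd2_gpu1_v2_x64",
        "configs": ["proteinfold_gpu_instance_full_db_Colabfold.config"],
        "applet": "proteinfold_modified_GPU_R535_v1.1.1",
    },
    ("esmfold", "full", "cpu"): {
        "instance_type": "mem2_ssd1_v2_x4",
        "configs": [],  # no config file is listed for this case
        "applet": "proteinfold_v.1.1.1",
    },
    ("esmfold", "full", "gpu"): {
        "instance_type": "mem2_ssd1_gpu_x16",
        "configs": ["run_esmfold_gpu.config"],
        "applet": "proteinfold_v.1.1.1",
    },
}


def lookup_example(method, database, device):
    """Return the example settings for one combination, case-insensitively."""
    key = (method.lower(), database.lower(), device.lower())
    if key not in EXAMPLE_RUNS:
        raise ValueError(f"No example listed for {key}")
    return EXAMPLE_RUNS[key]
```

For instance, `lookup_example("AlphaFold2", "full", "GPU")` returns the entry that uses the proteinfold_modified_GPU_R535_v1.1.1 applet.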
Notes on the nf-core/proteinfold pipeline
A note on the output file from the MultiQC step
We identified an off-by-one pLDDT indexing issue in the summary .tsv files and prepared a notebook, fix_plddt_mqc_file.ipynb, that corrects it automatically. The notebook is available here on the platform. We also reported the problem to the nf-core/proteinfold project.
You can learn more about the pipeline outputs in the nf-core/proteinfold documentation.
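To give a sense of what an off-by-one index correction on a .tsv file can look like, here is a generic Python sketch. It is not the logic of fix_plddt_mqc_file.ipynb: this page does not say which column is affected or in which direction, so the function name `shift_position_column` and the `column` and `offset` parameters are hypothetical and must be set from the actual file.

```python
def shift_position_column(tsv_text, column=0, offset=-1):
    """Shift the integer values in one TSV column by a fixed offset.

    Comment lines (starting with "#"), blank lines, and the header row
    are copied unchanged. Both `column` and `offset` are assumptions
    that must be chosen to match the real file.
    """
    fixed = []
    header_seen = False
    for line in tsv_text.splitlines():
        if not line.strip() or line.startswith("#"):
            fixed.append(line)  # keep comments and blank lines as they are
        elif not header_seen:
            header_seen = True
            fixed.append(line)  # first data-like line is treated as the header
        else:
            fields = line.split("\t")
            fields[column] = str(int(fields[column]) + offset)
            fixed.append("\t".join(fields))
    return "\n".join(fixed) + "\n"
```

Use the provided notebook for real outputs; this sketch only shows the general shape of such a fix.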
A note on the outdir parameter:
Specify the folder name where you want to save your output. You do not need to create this folder beforehand, as the pipeline will create it automatically. You can provide the destination path (parent folder) in the destination parameter.
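The relationship between the two parameters can be sketched in Python as a simple path join. This assumes that a relative outdir is resolved under the destination folder; the helper name `resolve_output_folder` and the example folder names are hypothetical.

```python
import posixpath


def resolve_output_folder(destination, outdir):
    """Combine the destination (parent folder) with a relative outdir.

    Assumption: outdir is a relative folder name created under
    destination. If outdir is an absolute path, posixpath.join
    returns outdir on its own.
    """
    return posixpath.normpath(posixpath.join(destination, outdir))


# Hypothetical folder names, for illustration only.
print(resolve_output_folder("/proteinfold_runs", "alphafold2_mini"))
```

With these example values, the output folder would be /proteinfold_runs/alphafold2_mini.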
We prepared a Markdown file, nf-proteinfold.md, that explains the pipeline on DNAnexus and a few considerations for running it.