# nf-core: Proteinfold

## Necessary Disclaimers and Legal

The user is responsible for reviewing and complying with the license requirements of the software, notebooks, and data referenced in this documentation.

Users are responsible for the costs of running nf-core/proteinfold and its protein structure prediction methods (such as AlphaFold2, ESMFold, and ColabFold), and for the costs of storing the genetic databases in their project spaces.

Instance type availability and pricing are subject to the contract between the user or the user’s organization and DNAnexus.

## Citation for nf-core/proteinfold

nf-core is a community-driven collection of curated bioinformatics pipelines built using Nextflow, providing standardized, scalable, and portable workflows for reproducible analysis across diverse computing environments. Following these standards, nf-core/proteinfold enables state-of-the-art protein structure prediction through optimized Nextflow execution and currently integrates leading AI methods such as AlphaFold, ESMFold and ColabFold.

If you use nf-core/proteinfold, please cite the pipeline using the [nf-core/proteinfold Zenodo record](https://zenodo.org/records/7437038). Additional references for integrated tools are listed in the pipeline’s [CITATIONS.md](https://github.com/nf-core/proteinfold/blob/1.1.1/CITATIONS.md).

## Overview of nf-core/proteinfold pipeline

The nf-core/proteinfold pipeline provides an automated workflow for protein structure prediction using AlphaFold2, ESMFold, and ColabFold. Users start by uploading a samplesheet that links sequence IDs to their FASTA files, and the pipeline then prepares input data, fetches required databases, and runs the selected folding model. Each FASTA entry is processed as an individual prediction, while multimer sequences can be combined in a single multi-FASTA file. Built with nf-core and Nextflow best practices, the pipeline is scalable, reproducible, and ready to run on cloud or HPC systems, making advanced deep-learning structure prediction easier to use and share. Detailed information about the pipeline, including descriptions, parameters, and outputs, is available in the [official documentation](https://nf-co.re/proteinfold/1.1.1) and on [GitHub](https://github.com/nf-core/proteinfold/tree/1.1.1).
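
As a rough illustration of the samplesheet described above, the following Python sketch writes one row per FASTA file found in a folder. The header columns `sequence` and `fasta` are an assumption based on the v1.1.1 example samplesheet, and the function name `write_samplesheet` and the use of the file stem as the sequence ID are illustrative choices, not part of the pipeline. Paths are written as local paths here; adjust them to however your FASTA files are referenced in your project, and check the example samplesheet before relying on the header.

```python
import csv
from pathlib import Path


def write_samplesheet(fasta_dir, out_csv):
    """Write a samplesheet with one row per FASTA file in fasta_dir.

    Assumed layout (verify against the example samplesheet):
    header "sequence,fasta", then one "<id>,<path>" row per file.
    Returns the number of FASTA files written.
    """
    fasta_files = sorted(
        p for p in Path(fasta_dir).iterdir()
        if p.suffix.lower() in {".fasta", ".fa"}
    )
    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sequence", "fasta"])  # assumed column names
        for path in fasta_files:
            # The file stem (name without extension) serves as the sequence ID.
            writer.writerow([path.stem, str(path)])
    return len(fasta_files)
```

For example, `write_samplesheet("Fasta", "samplesheet.csv")` would list every `.fasta` or `.fa` file in a local `Fasta` folder; a multimer would still be a single row pointing at its multi-FASTA file.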

On DNAnexus, we have made the nf-core/proteinfold workflow (version 1.1.1) directly available, together with a notebook for preparing inputs and both mini and full database options (up to \~3 TB). See the “Where to Access nf-core/proteinfold” section below to find these resources.

## Where to Access nf-core/proteinfold

[The DNAnexus platform](https://platform.dnanexus.com/panx/projects/J3JyY6j030gzQypGpk273241/data/ProteinFold) includes:

* Imported nf-core/proteinfold version 1.1.1 with applet ID: applet-J5PY67j03fbk0Xz7vqp7x9bb [on the platform](https://platform.dnanexus.com/panx/projects/J3JyY6j030gzQypGpk273241/data/ProteinFold/applets).
* A modified applet, proteinfold\_modified\_GPU\_R535\_v1.1.1, for CUDA 12 (Driver R535). The standard v1.1.1 release supports only CUDA 11 (Driver R470), so we built this applet from nf-core/proteinfold v1.1.1 using the development containers for both AlphaFold2 and ColabFold. Access this applet [on the platform](https://platform.dnanexus.com/panx/projects/J3JyY6j030gzQypGpk273241/data/ProteinFold/applets).
* Sample sheets in .csv format: the sample sheet from [GitHub's nf-core/proteinfold](https://github.com/nf-core/proteinfold/blob/1.1.1/assets/samplesheet.csv) and sample sheets prepared using our notebook are available [on the platform](https://platform.dnanexus.com/panx/projects/J3JyY6j030gzQypGpk273241/data/ProteinFold/samplesheet_input). We also provide example .fasta files [on the platform](https://platform.dnanexus.com/panx/projects/J3JyY6j030gzQypGpk273241/data/ProteinFold/samplesheet_input/Fasta).
* Configuration files for GPU runs, CPU runs, and tarball setups for the pdb\_mmcif dataset, which contains a large number of files. These config files are [here on the platform](https://platform.dnanexus.com/panx/projects/J3JyY6j030gzQypGpk273241/data/ProteinFold/configs).
* Full AlphaFold2 databases and model parameters (\~3 TB), as well as a smaller “mini” version (\~340 GB), are available for testing. These files are [here on the platform](https://platform.dnanexus.com/panx/projects/J3JyY6j030gzQypGpk273241/data/ProteinFold/full_AlphaFold_db).
* Full ColabFold databases and model parameters (\~1 TB). These files are [here on the platform](https://platform.dnanexus.com/panx/projects/J3JyY6j030gzQypGpk273241/data/ProteinFold/colabfold).
* Data assets for the ESMFold model (\~8 GB) are also provided [here on the platform](https://platform.dnanexus.com/panx/projects/J3JyY6j030gzQypGpk273241/data/ProteinFold/Esmfold_db).

To use these resources, copy the datasets and notebooks into your own DNAnexus project. Instructions are included in the “Copying Data and Notebooks into a Project” section.

## Running nf-core/proteinfold on DNAnexus

### Copying Data and Notebooks into a Project

To use the dataset, [copy the data from this project](https://platform.dnanexus.com/panx/projects/J3JyY6j030gzQypGpk273241/data/ProteinFold) into your own project. The steps to copy the data into a project space are:

1. Create a project for your protein structure prediction analysis, billed to your own organization. Tutorials on how to set up a project can be [found on this page](https://academy.dnanexus.com/overview-of-the-platform/setting-up-a-project).
2. Go to the Resources tab, find the project titled “Public Datasets AWS US (East)”, and select the folder "ProteinFold".
3. Select the data folder and the notebook.
4. Select "Copy" in the top right menu, and choose the project that you created in Step 1.
5. Then, go to the project space you created in Step 1 to start exploring the pipeline.
6. To run Nextflow pipelines on DNAnexus, see the [Running Nextflow Pipelines Documentation](https://documentation.dnanexus.com/user/running-apps-and-workflows/running-nextflow-pipelines). To learn more about Nextflow on DNAnexus, read the [Academy Documentation](https://academy.dnanexus.com/buildingworkflows/nf/overviewnextflow).
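
The copy in steps 2 to 4 can also be done from the command line with the `dx cp` command of the DNAnexus CLI. The Python sketch below only assembles that command so it can be reviewed before running. The source project ID is an assumption derived from the platform URL used in this document, `build_copy_command` is a hypothetical helper name, and `project-xxxx` is a placeholder for the project created in Step 1.

```python
import shlex

# Assumption: project ID taken from the platform URLs in this document.
SOURCE_PROJECT = "project-J3JyY6j030gzQypGpk273241"
SOURCE_FOLDER = "/ProteinFold"


def build_copy_command(dest_project, dest_folder="/"):
    """Return the argv list for copying the ProteinFold folder with `dx cp`.

    `dx cp` takes one or more sources followed by a destination and
    copies folders recursively.
    """
    return [
        "dx", "cp",
        f"{SOURCE_PROJECT}:{SOURCE_FOLDER}",
        f"{dest_project}:{dest_folder}",
    ]


if __name__ == "__main__":
    # Placeholder destination: replace with your own project ID.
    print(shlex.join(build_copy_command("project-xxxx")))
```

Printing the command instead of executing it keeps the sketch safe to run anywhere; pass the list to `subprocess.run` once you have logged in with `dx login` and confirmed the source and destination.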

### Instance Type Selection

* Instance start times depend on queueing. Less common instance types may have longer wait times because of their limited availability.
* GPU instances take longer to set up than CPU-only instance types because of their availability and complexity.
* To select CPU or GPU instances, use the [prepared configuration files](https://platform.dnanexus.com/panx/projects/J3JyY6j030gzQypGpk273241/data/ProteinFold/configs) as examples and consult the [nf-core/proteinfold documentation](https://nf-co.re/proteinfold/1.1.1/parameters/).
* Instance type availability and pricing are subject to the contract between the user or the user’s organization and DNAnexus.
* Requirements for our example nf-core/proteinfold pipeline runs:
  * nf-core/proteinfold pipeline with the mini version of the databases for AlphaFold2
    * GPU instance type for the AlphaFold2 step: mem2\_ssd1\_gpu\_x32
      * Config files: proteinfold\_gpu\_instance.config, beforescript\_tarball\_minidb.config
      * Applet: proteinfold\_v.1.1.1
  * nf-core/proteinfold pipeline with the full version of the databases for AlphaFold2
    * CPU instance type for the AlphaFold2 step: mem3\_ssd3\_x24
      * Config files: beforescript\_tarball\_fulldb.config, run\_alphafold2\_highmem\_cpu.config
      * Applet: proteinfold\_v.1.1.1
    * GPU instance type for the AlphaFold2 step: mem2\_ssd2\_gpu4\_v2\_x48
      * Config files: alphafold2\_R535.config, beforescript\_tarball\_fulldb.config
      * Applet: proteinfold\_modified\_GPU\_R535\_v1.1.1
  * nf-core/proteinfold pipeline with the full version of the databases for ColabFold
    * CPU instance type for the ColabFold step: mem3\_ssd3\_x24
      * Config files: full\_db\_cpu\_colabfold.config
      * Applet: proteinfold\_v.1.1.1
    * GPU instance type for the ColabFold step: mem2\_ssd2\_gpu1\_v2\_x64
      * Config files: proteinfold\_gpu\_instance\_full\_db\_Colabfold.config
      * Applet: proteinfold\_modified\_GPU\_R535\_v1.1.1
  * nf-core/proteinfold pipeline with the full version of the databases for ESMFold
    * CPU instance type for the ESMFold step: mem2\_ssd1\_v2\_x4
      * Applet: proteinfold\_v.1.1.1
    * GPU instance type for the ESMFold step: mem2\_ssd1\_gpu\_x16
      * Config files: run\_esmfold\_gpu.config
      * Applet: proteinfold\_v.1.1.1
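
For scripted runs, the example requirements listed above can be kept in a small lookup table. The Python sketch below transcribes the instance types, config file names, and applet names from this section; the `RunSetup` type, the `lookup` helper, and the key spelling (`"alphafold2"`, `"mini"`, `"gpu"`, and so on) are hypothetical conveniences, not anything defined by the pipeline or the platform.

```python
from typing import NamedTuple, Tuple


class RunSetup(NamedTuple):
    instance_type: str
    configs: Tuple[str, ...]
    applet: str


STANDARD = "proteinfold_v.1.1.1"
R535 = "proteinfold_modified_GPU_R535_v1.1.1"

# Keys are (method, database size, device), transcribed from the list above.
SETUPS = {
    ("alphafold2", "mini", "gpu"): RunSetup(
        "mem2_ssd1_gpu_x32",
        ("proteinfold_gpu_instance.config", "beforescript_tarball_minidb.config"),
        STANDARD),
    ("alphafold2", "full", "cpu"): RunSetup(
        "mem3_ssd3_x24",
        ("beforescript_tarball_fulldb.config", "run_alphafold2_highmem_cpu.config"),
        STANDARD),
    ("alphafold2", "full", "gpu"): RunSetup(
        "mem2_ssd2_gpu4_v2_x48",
        ("alphafold2_R535.config", "beforescript_tarball_fulldb.config"),
        R535),
    ("colabfold", "full", "cpu"): RunSetup(
        "mem3_ssd3_x24", ("full_db_cpu_colabfold.config",), STANDARD),
    ("colabfold", "full", "gpu"): RunSetup(
        "mem2_ssd2_gpu1_v2_x64",
        ("proteinfold_gpu_instance_full_db_Colabfold.config",), R535),
    # No config file is listed for the ESMFold CPU example.
    ("esmfold", "full", "cpu"): RunSetup("mem2_ssd1_v2_x4", (), STANDARD),
    ("esmfold", "full", "gpu"): RunSetup(
        "mem2_ssd1_gpu_x16", ("run_esmfold_gpu.config",), STANDARD),
}


def lookup(method, database, device):
    """Return the example setup for a combination, case-insensitively."""
    key = (method.lower(), database.lower(), device.lower())
    if key not in SETUPS:
        raise KeyError(f"No example setup listed for {key}")
    return SETUPS[key]
```

For instance, `lookup("AlphaFold2", "full", "GPU")` returns the entry that pairs the mem2\_ssd2\_gpu4\_v2\_x48 instance type with the R535 applet, which makes it harder to combine a CUDA 12 config with the standard applet by accident.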

### A note on nf-core/proteinfold pipeline

* A note on the output file from the MultiQC step in the pipeline
  * We identified an off-by-one pLDDT indexing issue in the summary .tsv files and prepared a notebook, fix\_plddt\_mqc\_file.ipynb, that corrects it automatically. The notebook is [here on the platform](https://platform.dnanexus.com/panx/projects/J3JyY6j030gzQypGpk273241/data/ProteinFold). We also reported the problem to the [nf-core/proteinfold project](https://github.com/nf-core/proteinfold/issues/419).
* You can learn more about the pipeline outputs in the [nf-core/proteinfold documentation](https://nf-co.re/proteinfold/1.1.1/docs/output/).
* A note on the outdir parameter:
  * Specify the name of the folder where you want to save your output. You do not need to create this folder beforehand, because the pipeline creates it automatically. You can provide the destination path (parent folder) in the destination parameter.
* We prepared a markdown file that explains the pipeline on DNAnexus and a few considerations for running it. The file name is [nf-proteinfold.md](https://platform.dnanexus.com/panx/projects/J3Kf7bj03P0XJ4b5xp556pG8/data/ProteinFold).
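
To make the relationship between outdir and the destination concrete, here is a minimal Python sketch that assembles a `dx run` command line. It assumes the imported applet exposes the pipeline's `outdir` parameter as an applet input of the same name, which you should confirm on the applet's input page; `build_run_command`, the folder name `results`, and the destination `project-xxxx:/proteinfold_runs` are hypothetical placeholders. A real run also needs the samplesheet, database, and config inputs, which are omitted here.

```python
import shlex


def build_run_command(applet_id, outdir, destination):
    """Return the argv list for launching the applet with `dx run`.

    outdir      -- output folder name; the pipeline creates it
    destination -- parent folder on the platform that receives the results
    """
    return [
        "dx", "run", applet_id,
        f"-ioutdir={outdir}",          # assumed input name, mirrors the outdir parameter
        "--destination", destination,  # dx run flag: where results are placed
    ]


if __name__ == "__main__":
    cmd = build_run_command(
        "applet-J5PY67j03fbk0Xz7vqp7x9bb",  # applet ID listed in this document
        "results",                          # placeholder output folder name
        "project-xxxx:/proteinfold_runs",   # placeholder destination
    )
    print(shlex.join(cmd))
```

Under these assumptions the outputs would end up in a `results` folder beneath `/proteinfold_runs` in the destination project.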

### Video: Running Proteinfold on the DNAnexus Platform

{% embed url="<https://youtu.be/UoOEkWgiVdM>" %}
