
# nf-core: Proteinfold

## Necessary Disclaimers and Legal

The user is responsible for reviewing and complying with the license requirements of the software, notebooks, and data referenced in this documentation.

Users are responsible for the costs associated with nf-core/proteinfold, protein structure prediction methods such as AlphaFold2, ESMFold, and ColabFold, and the genetic databases and their storage in their project spaces.

Instance type availability and pricing are subject to the contract between the user or the user’s organization and DNAnexus.

## Citation for nf-core/proteinfold

nf-core is a community-driven collection of curated bioinformatics pipelines built using Nextflow, providing standardized, scalable, and portable workflows for reproducible analysis across diverse computing environments. Following these standards, nf-core/proteinfold enables state-of-the-art protein structure prediction through optimized Nextflow execution and currently integrates leading AI methods such as AlphaFold, ESMFold and ColabFold.

If you use nf-core/proteinfold, please cite the pipeline using [nf-core/proteinfold print](https://zenodo.org/records/7437038). Additional references for integrated tools are listed in the pipeline’s [CITATIONS.md](https://github.com/nf-core/proteinfold/blob/1.1.1/CITATIONS.md).

## Overview of nf-core/proteinfold pipeline

The nf-core/proteinfold pipeline provides an automated workflow for protein structure prediction using AlphaFold2, ESMFold, and ColabFold. Users start by uploading a samplesheet that links sequence IDs to their FASTA files, and the pipeline then prepares input data, fetches required databases, and runs the selected folding model. Each FASTA entry is processed as an individual prediction, while multimer sequences can be combined in a single multi-FASTA file.

Built with nf-core and Nextflow best practices, the pipeline is scalable, reproducible, and ready to run on cloud or HPC systems, making advanced deep-learning structure prediction easier to use and share. Detailed information about the pipeline, including descriptions, parameters, and outputs, is available in the [official documentation](https://nf-co.re/proteinfold/1.1.1) and on [GitHub](https://github.com/nf-core/proteinfold/tree/1.1.1).

On DNAnexus, we have made the nf-core/proteinfold workflow (version 1.1.1) directly available, together with a notebook for preparing inputs and both mini and full database options (up to \~3 TB). See the “Where to Access nf-core/proteinfold” section below to start accessing the dataset.

<img src="/files/SDjQkLN7dgbK4x0adpH5" alt="" height="614" width="437">

Figure: Scalable and reproducible protein structure prediction pipeline on DNAnexus. Input sequences are validated and processed through AlphaFold2, ColabFold, or ESMFold workflows, with standardized database preparation and parameterization, producing PDB structure outputs.

## Where to Access nf-core/proteinfold

1. The DNAnexus Platform hosts projects ([AWS US East](https://platform.dnanexus.com/panx/projects/J3Kf7bj03P0XJ4b5xp556pG8/data/ProteinFold), [AWS Europe (Frankfurt)](https://platform.dnanexus.com/panx/projects/J780j7848VpfB6kJ8p7y29xG/data/ProteinFold), [AWS Europe (London)](https://platform.dnanexus.com/panx/projects/J780fzpKpb7Gq5X4ZJfBP7QX/data/ProteinFold), [Azure Amsterdam](https://platform.dnanexus.com/panx/projects/J780gY0B34pvq5X4ZJfBP7YP/data/ProteinFold), [Azure US (West)](https://platform.dnanexus.com/panx/projects/J780v289Z00G4Kx14b188ybj/data/ProteinFold)) that include the following resources: applets, sample sheets and input files, soft configuration files, and databases.

The nf-core/proteinfold project on DNAnexus is organized as follows:

```
ProteinFold/
├── applets/              # proteinfold_v.1.1.1, proteinfold_modified_GPU_R535_v1.1.1
├── colabfold/            # Full ColabFold database (~1 TB) and mini database
├── configs/              # GPU, CPU, and tarball configuration files
├── Esmfold_db/           # ESMFold model assets (~8 GB)
├── full_AlphaFold_db/    # Full AlphaFold2 database (~3 TB) and mini version (~340 GB)
├── results/              # Pipeline output directory
├── samplesheet_input/    # Sample sheets (.csv) and example .fasta files
├── fix_plddt_mqc_file.ipynb
└── nf-proteinfold_1_1_1.md
```

2. Applets:
   1. proteinfold\_v.1.1.1: Imported nf-core/proteinfold version 1.1.1, available on the platform.
   2. proteinfold\_modified\_GPU\_R535\_v1.1.1: While based on nf-core/proteinfold v1.1.1, the standard release only supports CUDA 11 (Driver R470). To upgrade support to CUDA 12 (Driver R535), we built this applet using the development containers for both AlphaFold2 and ColabFold. Available on the platform.
3. Sample sheet & Input files:
   1. A sample sheet from GitHub's [nf-core/proteinfold](https://github.com/nf-core/proteinfold/blob/1.1.1/assets/samplesheet.csv) and sample sheets prepared using our notebook are available on the platform. All sample sheets are in .csv format.
   2. To simplify sample sheet creation, we prepared a Jupyter notebook that generates a valid CSV file. In our example, we used five [CASP14 targets](https://predictioncenter.org/casp14/targetlist.cgi) (T1024–T1028). You can refer to this notebook to prepare your own input files.
   3. Example .fasta files from [CASP14 targets](https://predictioncenter.org/casp14/targetlist.cgi) are also provided on the platform.
4. Soft configuration files: Example configuration files are available on the platform for users to adapt for their own runs. They are organized into three groups:
   1. GPU/CPU instance setup:
      * proteinfold\_gpu\_instance.config: GPU with Driver R470 (AlphaFold2)
      * alphafold2\_R535.config: GPU with Driver R535 (AlphaFold2); use with proteinfold\_modified\_GPU\_R535\_v1.1.1 applet
      * colabfold\_R535.config: GPU with Driver R535 (ColabFold); use with proteinfold\_modified\_GPU\_R535\_v1.1.1 applet
      * run\_alphafold2\_highmem\_cpu.config: High-memory CPU (AlphaFold2)
      * run\_colabfold\_high\_storage.config: High-storage CPU (ColabFold)
      * run\_esmfold\_gpu.config: GPU (ESMFold)
   2. pdb\_mmcif tarball setup: load the pdb\_mmcif dataset from a single tarball file instead of transferring thousands of individual files
      * beforescript\_tarball\_minidb.config: for the mini database
      * beforescript\_tarball\_fulldb.config: for the full database
   3. Parameter file:
      * test\_profile\_params.json: example parameter file
5. Databases: Pre-downloaded databases are available on the platform to avoid downloading them at runtime. The following databases are provided:
   1. AlphaFold2: Full database (\~3 TB) and Mini database (\~340 GB)
   2. ColabFold: Full database (\~1 TB) and Mini database
   3. ESMFold: Model assets (\~8 GB)

Below is an example folder structure for the AlphaFold2 databases, available on the Platform in [AWS US East](https://platform.dnanexus.com/panx/projects/J3JyY6j030gzQypGpk273241/data/ProteinFold/full_AlphaFold_db/nf-core-aws-s3), [AWS Europe (Frankfurt)](https://platform.dnanexus.com/panx/projects/J780j7848VpfB6kJ8p7y29xG/data/ProteinFold/full_AlphaFold_db), [AWS Europe (London)](https://platform.dnanexus.com/panx/projects/J780fzpKpb7Gq5X4ZJfBP7QX/data/ProteinFold/full_AlphaFold_db), [Azure Amsterdam](https://platform.dnanexus.com/panx/projects/J780gY0B34pvq5X4ZJfBP7YP/data/ProteinFold/full_AlphaFold_db), and [Azure US (West)](https://platform.dnanexus.com/panx/projects/J780v289Z00G4Kx14b188ybj/data/ProteinFold/full_AlphaFold_db).

```
full_AlphaFold_db/
├── nf-core-aws-mini-s3/        # Mini database (~340 GB) — recommended for testing
└── nf-core-aws-s3/             # Full database (~3 TB)
    ├── alphafold_params_2022-12-06/
    ├── bfd/
    ├── mgnify/
    ├── pdb70/
    ├── pdb_seqres/
    ├── small_bfd/
    ├── uniclust30/
    ├── uniprot/
    ├── uniref30/
    ├── uniref90/
    └── pdb_mmcif.tar            # pdb_mmcif dataset packed as a single tarball
                                 # to avoid I/O overhead from transferring
                                 # 188,085 individual files
```

To use these resources, you have two options:

* Copy to your own project by cloning the datasets and applets into your own DNAnexus project. Instructions are included in the "Copying Data and Applets into a Project" section.
* Load directly from the public dataset by referencing the databases in the Public Datasets project without copying them, which avoids storage costs in your own project. Our examples use AWS US (East), but the Public Datasets are available in multiple regions.

## Running nf-core/proteinfold on DNAnexus

### Copying Data and Applets into a Project

To utilize the dataset and applets, please copy the data from the project listed above into your own project. Here are the steps to copy the data into a Project Space:

1. Create a project for your analysis, billed to your own organization. Tutorials on how to set up a project can be [found on this page](https://academy.dnanexus.com/overview-of-the-platform/setting-up-a-project).
2. Go to the Resources tab, find the project titled “Public Datasets AWS US (East)”, and select the "ProteinFold" folder.
3. Select the applets, samplesheet\_input, and configs folders.
4. Select "Copy" in the top right menu, and select the project that you created in Step 1.
5. Then, go to the project space you created in Step 1 to start exploring the pipeline.
6. To run Nextflow pipelines on DNAnexus, please see the [Running Nextflow Pipelines Documentation](https://documentation.dnanexus.com/user/running-apps-and-workflows/running-nextflow-pipelines). To learn more about Nextflow on DNAnexus, please read the [Academy Documentation](https://academy.dnanexus.com/buildingworkflows/nf/overviewnextflow).

### Preparing the Sample Sheet Input

A sample sheet (.csv) is required to describe the sequences to be analyzed. Two columns are required:

* sequence:  sequence identifier (e.g., T1024, T1025, …)
* fasta: full path to the FASTA file. Extension must be .fasta or .fa

Here is our example:

```
sequence,fasta
T1024,dx://project-J3JyY6j030gzQypGpk273241:/ProteinFold/samplesheet_input/Fasta/T1024.fasta
T1025,dx://project-J3JyY6j030gzQypGpk273241:/ProteinFold/samplesheet_input/Fasta/T1025.fasta
T1026,dx://project-J3JyY6j030gzQypGpk273241:/ProteinFold/samplesheet_input/Fasta/T1026.fasta
T1027,dx://project-J3JyY6j030gzQypGpk273241:/ProteinFold/samplesheet_input/Fasta/T1027.fasta
T1028,dx://project-J3JyY6j030gzQypGpk273241:/ProteinFold/samplesheet_input/Fasta/T1028.fasta
```

We provide the following sample sheets and notebooks on the Platform in [AWS US East](https://platform.dnanexus.com/panx/projects/J3JyY6j030gzQypGpk273241/data/ProteinFold/samplesheet_input), [AWS Europe (Frankfurt)](https://platform.dnanexus.com/panx/projects/J780j7848VpfB6kJ8p7y29xG/data/ProteinFold/samplesheet_input), [AWS Europe (London)](https://platform.dnanexus.com/panx/projects/J780fzpKpb7Gq5X4ZJfBP7QX/data/ProteinFold/samplesheet_input), [Azure Amsterdam](https://platform.dnanexus.com/panx/projects/J780gY0B34pvq5X4ZJfBP7YP/data/ProteinFold/samplesheet_input), and [Azure US (West)](https://platform.dnanexus.com/panx/projects/J780v289Z00G4Kx14b188ybj/data/ProteinFold/samplesheet_input):

* samplesheet\_proteinfold.csv:  example sample sheet from nf-core/proteinfold with two samples (T1024 and T1026)
* prepare\_samplesheet.ipynb: Jupyter notebook to generate a valid sample sheet. In our example, we used five CASP14 targets (T1024–T1028)
* samplesheet.csv: generated sample sheet from the notebook above

Please refer to the [nf-core/proteinfold website](https://nf-co.re/proteinfold/1.1.1/docs/usage) for more information about the input file.
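As a rough illustration of the two-column format described above, the following Python sketch builds a sample sheet from a list of FASTA paths. It is not the code from prepare\_samplesheet.ipynb; the function name `build_samplesheet`, the placeholder project ID `project-xxx`, and the choice to derive the sequence identifier from the file name are assumptions made for this example.

```python
import csv
import io
from pathlib import PurePosixPath

# Extensions accepted for the "fasta" column, as stated above.
ALLOWED_EXTENSIONS = {".fasta", ".fa"}


def build_samplesheet(fasta_paths):
    """Return CSV text with the two required columns: sequence, fasta."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sequence", "fasta"])
    for path in fasta_paths:
        # Drop any "dx://project-...:" prefix for name parsing only;
        # the full path is written to the sheet unchanged.
        name = PurePosixPath(path.rsplit(":", 1)[-1])
        if name.suffix not in ALLOWED_EXTENSIONS:
            raise ValueError(f"Unsupported FASTA extension: {path}")
        # Assumption: the file stem (e.g. T1024) is the sequence identifier.
        writer.writerow([name.stem, path])
    return buffer.getvalue()


# Placeholder project ID and folder; replace with your own.
base = "dx://project-xxx:/ProteinFold/samplesheet_input/Fasta"
sheet = build_samplesheet([f"{base}/T{n}.fasta" for n in range(1024, 1029)])
print(sheet)
```

The resulting text can be saved as a .csv file and uploaded to your project alongside the FASTA files it references.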

### Minimal test

Use the test configuration (test.config) provided by nf-core/proteinfold to quickly validate the pipeline. The following command runs the minimal test:

```
dx run project-xxx:/ProteinFold/applets/proteinfold_v1.1.1 \
  --destination project-xx:/users/test_proteinfold/ \
  --priority high \
  --name "proteinfold_test" \
  -inextflow_run_opts="-profile test,docker" \
  -ioutdir="output_proteinfold_v1.1.1_test_profile" \
  -y
```

Runtime: \~9 minutes
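If you prefer to launch the minimal test from a script, the command above can be assembled as an argument list. The sketch below only reuses the flags and inputs shown in that command; the helper name `build_dx_run_command` and the `project-xxx` placeholders are assumptions for illustration, and the script prints the command rather than executing it.

```python
import shlex


def build_dx_run_command(applet, destination, name, profile, outdir):
    """Assemble the dx run argument list used in the minimal test."""
    return [
        "dx", "run", applet,
        "--destination", destination,
        "--priority", "high",
        "--name", name,
        # Applet inputs are passed as -i<name>=<value>.
        f"-inextflow_run_opts=-profile {profile}",
        f"-ioutdir={outdir}",
        "-y",
    ]


# Placeholder project IDs and paths; replace with your own.
args = build_dx_run_command(
    applet="project-xxx:/ProteinFold/applets/proteinfold_v1.1.1",
    destination="project-xxx:/users/test_proteinfold/",
    name="proteinfold_test",
    profile="test,docker",
    outdir="output_proteinfold_v1.1.1_test_profile",
)
command_line = shlex.join(args)
print(command_line)
```

Passing `args` to `subprocess.run(args, check=True)` would submit the job, provided the dx toolkit is installed and you are logged in.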

### Running nf-core/proteinfold with AlphaFold2

For database and configuration file details, refer to the Databases and Soft Configuration Files sections above.

The table below shows the instance types used in our example runs for the RUN\_ALPHAFOLD2 process. Instance type, memory, and CPU are defined directly in the soft configuration file, and users can adapt these values for their own requirements.

| Option              | Instance type             | Soft config files                                                           | Applet                                   |
| ------------------- | ------------------------- | --------------------------------------------------------------------------- | ---------------------------------------- |
| GPU - mini database | mem2\_ssd1\_gpu\_x32      | proteinfold\_gpu\_instance.config + beforescript\_tarball\_minidb.config    | proteinfold\_v.1.1.1                     |
| CPU - full database | mem3\_ssd3\_x24           | run\_alphafold2\_highmem\_cpu.config + beforescript\_tarball\_fulldb.config | proteinfold\_v.1.1.1                     |
| GPU - full database | mem2\_ssd2\_gpu4\_v2\_x48 | alphafold2\_R535.config + beforescript\_tarball\_fulldb.config              | proteinfold\_modified\_GPU\_R535\_v1.1.1 |

Below is an example configuration file for a GPU instance (**alphafold2\_R535.config**):

```
docker.runOptions = '--entrypoint "" $(if command -v nvidia-smi &> /dev/null; then echo "--gpus all "; else echo "-u $(id -u):$(id -g) "; fi)'
process {
    withName: 'RUN_ALPHAFOLD2' {
        machineType = 'mem2_ssd2_gpu4_v2_x48'
        memory = 192.GB
        cpus = 48
        time = 24.h
    }
}
```
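When adapting these values for several processes or instance types, it can help to generate the process block from a template. The following Python sketch renders only the `process` block of a soft configuration file (it omits the `docker.runOptions` line); the function name `render_process_config` and its parameters are assumptions for this example.

```python
def render_process_config(process_name, machine_type, memory_gb, cpus, hours):
    """Render a Nextflow process block with per-process resource settings."""
    return (
        "process {\n"
        f"    withName: '{process_name}' {{\n"
        f"        machineType = '{machine_type}'\n"
        f"        memory = {memory_gb}.GB\n"
        f"        cpus = {cpus}\n"
        f"        time = {hours}.h\n"
        "    }\n"
        "}\n"
    )


# Values taken from the alphafold2_R535.config example.
config_text = render_process_config(
    "RUN_ALPHAFOLD2", "mem2_ssd2_gpu4_v2_x48", 192, 48, 24
)
print(config_text)
```

Choose memory and CPU values that fit within the selected instance type.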

For more information on how subjob instance types are determined, refer to the [DNAnexus Running Nextflow Pipelines documentation](https://documentation.dnanexus.com/user/running-apps-and-workflows/running-nextflow-pipelines). Note that job start times depend on instance availability, so less common instance types may incur longer wait times.

### Running nf-core/proteinfold with ColabFold

For database and configuration file details, refer to the Databases and Configuration Files sections above. For more database information, refer to the [nf-core/proteinfold usage documentation](https://nf-co.re/proteinfold/1.1.1/). The sample sheet format is the same as AlphaFold2.

The ColabFold workflow runs two processes sequentially: MMSEQS\_COLABFOLDSEARCH (MSA search) and COLABFOLD\_BATCH (structure prediction). Both processes are defined in soft configuration files.

| Option              | Instance type             | Soft config files                    | Applet                                   |
| ------------------- | ------------------------- | ------------------------------------ | ---------------------------------------- |
| GPU - full database | mem2\_ssd2\_gpu1\_v2\_x64 | colabfold\_R535.config               | proteinfold\_modified\_GPU\_R535\_v1.1.1 |
| CPU - full database | mem3\_ssd3\_x24           | run\_colabfold\_high\_storage.config | proteinfold\_v.1.1.1                     |

Below is an example configuration file, **run\_colabfold\_high\_storage.config**, in which both processes run on CPU:

```
process {
    withName: 'NFCORE_PROTEINFOLD:COLABFOLD:MMSEQS_COLABFOLDSEARCH' {
        machineType = "mem3_ssd3_x24"
        memory = 190.GB
        cpus   = 24
        time   = 24.h
    }
    withName: 'NFCORE_PROTEINFOLD:COLABFOLD:COLABFOLD_BATCH' {
        machineType = "mem3_ssd3_x24"
        memory = 190.GB
        cpus   = 24
        time   = 24.h
    }
}
```

### Running nf-core/proteinfold with ESMFold

For database and configuration file details, refer to the Databases and Configuration Files sections above. For more information, refer to the [nf-core/proteinfold usage documentation](https://nf-co.re/proteinfold/1.1.1/docs/usage/) and the [ESMFold GitHub repository.](https://github.com/facebookresearch/esm)

The sample sheet format is the same as for AlphaFold2, with additional support for multimer predictions via samplesheet\_multimer.csv in the samplesheet\_input folder.

The table below shows the instance types used in our example runs for the RUN\_ESMFOLD process. Users can adapt the provided configuration files for their own requirements.

| Option | Instance type        | Soft config files        | Applet               |
| ------ | -------------------- | ------------------------ | -------------------- |
| GPU    | mem2\_ssd1\_gpu\_x16 | run\_esmfold\_gpu.config | proteinfold\_v.1.1.1 |
| CPU    | mem2\_ssd1\_v2\_x4   | No soft config file      | proteinfold\_v.1.1.1 |

Below is an example configuration file for a GPU instance (**run\_esmfold\_gpu.config**):

```
docker.runOptions = '$(if command -v nvidia-smi &> /dev/null; then echo "--gpus all"; else echo "-u $(id -u):$(id -g)"; fi)'

process {
    withName: 'RUN_ESMFOLD' {
        machineType = 'mem2_ssd1_gpu_x16'
    }
}
```

For more information on how subjob instance types are determined, refer to the [DNAnexus Running Nextflow Pipelines documentation](https://documentation.dnanexus.com/user/running-apps-and-workflows/running-nextflow-pipelines). Note that job start times depend on instance availability, so less common instance types may incur longer wait times.

## Technical considerations

### outdir parameter

The **outdir** parameter is required and defines the subdirectory where results are stored. When launching the pipeline, choose a parent directory (e.g., **results**) and specify the desired output folder name (e.g., **test\_full\_database**) in the **outdir** field.

Note: You do not need to manually create the test\_full\_database folder inside results. The pipeline will automatically create it.

### Running with GPU

To enable GPU acceleration, set use\_gpu = true and use the appropriate GPU configuration file. Note that GPU instance types are used only for the AlphaFold2, ColabFold, and ESMFold subworkflow steps; the head node and MultiQC steps always run on CPU instances.

The full AlphaFold2 database (\~3 TB) and ColabFold database (\~1.4 TB) cannot run on Driver R470 due to insufficient storage on available GPU instance types. Full-database runs must use Driver R535 with the **proteinfold\_modified\_GPU\_R535\_v1.1.1** applet.

### Out-of-Memory Issue in RUN\_ALPHAFOLD2

Problem: Despite running on a large-memory instance, the RUN\_ALPHAFOLD2 process may fail with out-of-memory (OOM) errors at \~34–36 GB RAM. This is caused by Nextflow's per-process memory limits defined in the pipeline's base.config, not the total available memory on the instance:

```
withLabel: process_medium {
    memory = 36.GB
}
```

This launches the AlphaFold2 container with a Docker cgroup memory limit of \~36 GB. When HHblits exceeds this limit, the kernel OOM killer terminates the process.

Solution: Override the resource limits for RUN\_ALPHAFOLD2 in a custom configuration file, as shown in run\_alphafold2\_highmem\_cpu.config:

```
process {
    withName: 'NFCORE_PROTEINFOLD:ALPHAFOLD2:RUN_ALPHAFOLD2' {
        machineType = "mem3_ssd3_x24"
        memory = 190.GB
        cpus   = 24
        time   = 24.h
    }
}
```

### Using Tarball Configuration for pdb\_mmcif

The **pdb\_mmcif** dataset contains 188,085 files (92 in the mini version). Transferring this many individual files via object storage is slow due to I/O overhead. To improve performance, we provide the dataset as a single tarball file (**pdb\_mmcif.tar** for full, **pdb\_mmcif\_mini.tar** for mini).

To load the tarball, use the appropriate configuration file and update the path to where you stored the tarball:

* beforescript\_tarball\_minidb.config (update the path to pdb\_mmcif\_mini.tar in the soft config file)
* beforescript\_tarball\_fulldb.config (update the path to pdb\_mmcif.tar in the soft config file)

Here is **beforescript\_tarball\_fulldb.config**:

```
process {
    withName: 'RUN_ALPHAFOLD2' {
        beforeScript = """
            set -e
            echo "Extracting tarball to dx_tmp..."
            mkdir -p dx_tmp
            dx cat project-J3JyY6j030gzQypGpk273241:/ProteinFold/full_AlphaFold_db/nf-core-aws-s3/pdb_mmcif.tar | tar -xf - -C /tmp/nxf.*
            echo "✓ Extraction complete"
        """
    }
}
```

Because **pdb\_mmcif** is provided as a tarball, the pipeline's default **pdb\_mmcif\_path** must be overridden to avoid a "directory not found" error. To do this, set **pdb\_mmcif\_path** to a placeholder S3 path that does not exist:

```
s3://proteinfold-dataset/test-data/db/alphafold_mini/pdb_mmcif_nonexist/*
```

This causes the pipeline to assume it is reading from S3, while the actual data is extracted from the tarball via the configuration file.
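If you maintain your own copy of the pdb\_mmcif directory, a similar single-file archive can be created with Python's standard `tarfile` module. The sketch below is a minimal illustration, not the procedure used to build the tarballs on the platform; the function name `pack_directory` and the dummy file names in the demo are assumptions, and the internal layout of the provided tarballs is not described here.

```python
import tarfile
import tempfile
from pathlib import Path


def pack_directory(source_dir, tar_path):
    """Pack source_dir into an uncompressed tarball and return the file count."""
    source_dir = Path(source_dir)
    count = 0
    with tarfile.open(tar_path, "w") as tar:
        for file_path in sorted(source_dir.rglob("*")):
            if file_path.is_file():
                # Store paths relative to the parent so the archive
                # extracts into a top-level "<source_dir.name>/" folder.
                arcname = file_path.relative_to(source_dir.parent).as_posix()
                tar.add(file_path, arcname=arcname)
                count += 1
    return count


# Self-contained demo with two dummy files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "pdb_mmcif"
    (demo_dir / "mmcif_files").mkdir(parents=True)
    (demo_dir / "mmcif_files" / "1abc.cif").write_text("data_1ABC\n")
    (demo_dir / "obsolete.dat").write_text("")
    tar_path = Path(tmp) / "pdb_mmcif.tar"
    packed = pack_directory(demo_dir, tar_path)
    with tarfile.open(tar_path) as tar:
        members = sorted(tar.getnames())
print(packed, members)
```

An uncompressed archive is used here because the goal is to reduce the number of objects transferred, not their total size.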

### Database Path for test\_profile.config

Two databases (**bfd** and **uniref30**) are not included in test\_profile.config. However, the DNAnexus Nextflow applet still requires paths for these databases. To avoid "directory not found" errors, use the following intentionally empty S3 paths when you replicate the test profile configuration provided by nf-core/proteinfold:

```
s3://proteinfold-dataset/test-data/db/alphafold_mini/bfd/*
s3://proteinfold-dataset/test-data/db/alphafold_mini/uniref30/*
```

See test\_profile\_params.json in the [configs folder](https://platform.dnanexus.com/panx/projects/J3JyY6j030gzQypGpk273241/data/ProteinFold/configs) for an example. An excerpt is shown below:

```
  "mode": "alphafold2",
  "use_gpu": true,

  "alphafold2_db": "dx://project-J3Kf7bj03P0XJ4b5xp556pG8:/ProteinFold/full_AlphaFold_db/nf-core-aws-mini-s3",
  "full_dbs": false,
  "bfd_path": "s3://proteinfold-dataset/test-data/db/alphafold_mini/bfd/*",
  "pdb_mmcif_path": "s3://proteinfold-dataset/test-data/db/alphafold_mini/pdb_mmcif_nonexist/*",
  "uniref30_alphafold2_path": "s3://proteinfold-dataset/test-data/db/alphafold_mini/uniref30/*",
```
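A parameter file with these placeholder paths can also be generated programmatically. The Python sketch below writes only the keys shown in the excerpt above; the actual test\_profile\_params.json may contain additional keys, and the function name `build_test_params` and the `project-xxx` placeholder are assumptions for this example.

```python
import json

# Placeholder S3 location used above for databases that are absent
# from the test profile.
PLACEHOLDER_ROOT = "s3://proteinfold-dataset/test-data/db/alphafold_mini"


def build_test_params(alphafold2_db, use_gpu=True):
    """Return a parameter dict mirroring the excerpt shown above."""
    return {
        "mode": "alphafold2",
        "use_gpu": use_gpu,
        "alphafold2_db": alphafold2_db,
        "full_dbs": False,
        "bfd_path": f"{PLACEHOLDER_ROOT}/bfd/*",
        "pdb_mmcif_path": f"{PLACEHOLDER_ROOT}/pdb_mmcif_nonexist/*",
        "uniref30_alphafold2_path": f"{PLACEHOLDER_ROOT}/uniref30/*",
    }


# Placeholder project ID; replace with the project that holds the mini database.
params = build_test_params(
    "dx://project-xxx:/ProteinFold/full_AlphaFold_db/nf-core-aws-mini-s3"
)
params_json = json.dumps(params, indent=2)
print(params_json)
```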

### Misaligned pLDDT Values in MultiQC Output

An off-by-one indexing error exists in the summary .tsv files produced by the MultiQC step (e.g., T1024.1\_plddt\_mqc.tsv):

* Row 409 contains pLDDT values for rank\_1–rank\_4 despite having no valid residue position (sequence length is only 408 residues).
* Row 1 has position = 1 for rank\_0 only; rank\_1–rank\_4 values are missing.
* The pLDDT score at position N in the PDB appears at position N+1 in the TSV for rank\_1 through rank\_4. The rank\_0 column is correctly aligned.

A Jupyter notebook (fix\_plddt\_mqc\_file.ipynb) is available on the platform to automatically correct this issue. We have also reported this to the nf-core/proteinfold project ([see this issue](https://github.com/nf-core/proteinfold/issues/419)). For more information on pipeline outputs, refer to the [nf-core/proteinfold output documentation](https://nf-co.re/proteinfold/1.1.1/docs/output/).
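To make the correction concrete, the sketch below shifts the misaligned columns up by one row and drops the overflow row. It is not the code from fix\_plddt\_mqc\_file.ipynb; the function name `realign_plddt`, the `Positions` header, and the three-residue demo table are assumptions made for illustration, so check them against your own files.

```python
import csv
import io


def realign_plddt(tsv_text, shifted=("rank_1", "rank_2", "rank_3", "rank_4")):
    """Shift the named columns up by one row and drop the overflow row."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    indexes = [header.index(name) for name in shifted if name in header]
    # Pad short rows so every column index is addressable.
    body = [row + [""] * (len(header) - len(row)) for row in body]
    for i in range(len(body) - 1):
        for col in indexes:
            body[i][col] = body[i + 1][col]
    fixed = body[:-1]  # the last row only held shifted values
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(fixed)
    return out.getvalue()


# Tiny hypothetical table showing the same offset pattern.
demo = (
    "Positions\trank_0\trank_1\trank_2\n"
    "1\t90.0\t\t\n"
    "2\t80.0\t91.0\t92.0\n"
    "3\t70.0\t81.0\t82.0\n"
    "\t\t71.0\t72.0\n"
)
fixed_tsv = realign_plddt(demo)
print(fixed_tsv)
```

This function drops the last row unconditionally, so apply it only to files that actually show the offset described above.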

We prepared a markdown file to explain the pipeline on DNAnexus and a few considerations when you run the pipeline. The file name is [nf-proteinfold\_README.md](https://platform.dnanexus.com/panx/projects/J3JyY6j030gzQypGpk273241/data/ProteinFold).

### Video: Running Proteinfold on the DNAnexus Platform

{% embed url="<https://youtu.be/UoOEkWgiVdM>" %}


