nf-core: Proteinfold

The user is responsible for reviewing and complying with the license requirements of the software, notebooks, and data referenced in this documentation.

Users are responsible for the costs of running nf-core/proteinfold and its protein structure prediction methods (such as AlphaFold2, ESMFold, and ColabFold), and for the costs of storing the genetic databases in their project spaces.

Instance type availability and pricing are subject to the contract between the user or the user’s organization and DNAnexus.

Citation for nf-core/proteinfold

nf-core is a community-driven collection of curated bioinformatics pipelines built using Nextflow, providing standardized, scalable, and portable workflows for reproducible analysis across diverse computing environments. Following these standards, nf-core/proteinfold enables state-of-the-art protein structure prediction through optimized Nextflow execution and currently integrates leading AI methods such as AlphaFold, ESMFold and ColabFold.

If you use nf-core/proteinfold, please cite the pipeline using nf-core/proteinfold print. Additional references for integrated tools are listed in the pipeline’s CITATIONS.md.

Overview of nf-core/proteinfold pipeline

The nf-core/proteinfold pipeline provides an automated workflow for protein structure prediction using AlphaFold2, ESMFold, and ColabFold. Users start by uploading a samplesheet that links sequence IDs to their FASTA files; the pipeline then prepares the input data, fetches the required databases, and runs the selected folding model. Each FASTA entry is processed as an individual prediction, while multimer sequences can be combined in a single multi-FASTA file. Built with nf-core and Nextflow best practices, the pipeline is scalable, reproducible, and ready to run on cloud or HPC systems, making advanced deep-learning structure prediction easier to use and share. Detailed information about the pipeline, including descriptions, parameters, and outputs, is available in the official documentation and on GitHub.
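As a rough illustration of the samplesheet described above, the Python sketch below writes a two-column CSV that pairs each sequence ID with a FASTA path, and counts the records in a FASTA text to tell a single-sequence file from a multi-FASTA (multimer) file. The column names `sequence` and `fasta` are an assumption about the version 1.1.x samplesheet format and should be checked against the official documentation; the sequence IDs, file paths, and helper names are hypothetical placeholders.

```python
import csv
import io

# Assumed column names for the v1.1.x samplesheet; verify them against the
# official nf-core/proteinfold usage documentation before relying on them.
SAMPLESHEET_COLUMNS = ["sequence", "fasta"]


def build_samplesheet(entries):
    """Return samplesheet CSV text for a list of (sequence_id, fasta_path) pairs."""
    seen = set()
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(SAMPLESHEET_COLUMNS)
    for sequence_id, fasta_path in entries:
        # Each sequence ID should identify exactly one FASTA file.
        if sequence_id in seen:
            raise ValueError(f"duplicate sequence ID: {sequence_id}")
        seen.add(sequence_id)
        writer.writerow([sequence_id, fasta_path])
    return buffer.getvalue()


def count_fasta_records(fasta_text):
    """Count FASTA records (header lines starting with '>') in a text block."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


# Hypothetical inputs: one monomer file and one multimer (multi-FASTA) file.
samplesheet_text = build_samplesheet([
    ("monomer_example", "inputs/monomer_example.fasta"),
    ("multimer_example", "inputs/multimer_example.fasta"),
])
print(samplesheet_text)
```

A file for which `count_fasta_records` returns more than one is a multi-FASTA file, which is how multimer sequences are supplied together.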

On DNAnexus, we have made the nf-core/proteinfold workflow (version 1.1.1) directly available, together with a notebook for preparing inputs and both mini and full database options (up to ~3 TB). See the “Where to Access nf-core/proteinfold” section below to start accessing the dataset.

Where to Access nf-core/proteinfold

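The DNAnexus platform includes the nf-core/proteinfold workflow (version 1.1.1), a notebook for preparing inputs, and both mini and full database options.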

To use these resources, copy the datasets and notebooks into your own DNAnexus project. Instructions are included in the “Copying Data and Notebooks into a Project” section.

Running nf-core/proteinfold on DNAnexus

Copying Data and Notebooks into a Project

To use the dataset, copy the data from this project into your own project. The steps to copy the data into a project space are:

  1. Create a project for your analysis, billed to your own organization. Tutorials on how to set up a project can be found on this page.

  2. Go to the Resources tab, find the project titled “Public Datasets AWS US (East)”, and select the folder "ProteinFold".

  3. Select the data folder and the notebook.

  4. Select "Copy" in the top right menu, and select the project that you created in Step 1.

  5. Go to the project space you created in Step 1 to start exploring the pipeline.

  6. To run Nextflow pipelines on DNAnexus, see the Running Nextflow Pipelines Documentation. To learn more about Nextflow on DNAnexus, read the Academy Documentation.

Instance Type Selection

  • Instance wait times depend on the queue for each instance type. Less common instance types may have longer wait times because of their limited availability.

  • GPU instances take longer to set up than CPU-only instance types because of their limited availability and greater complexity.

  • To select CPU and GPU instances, use the prepared configuration files as examples and consult the nf-core/proteinfold documentation.

  • Requirements for our example nf-core/proteinfold pipelines:

    • nf-core/proteinfold pipeline with the mini version of the databases for AlphaFold2

      • GPU instance type for the AlphaFold2 step: mem2_ssd1_gpu_x32

        • Config files: proteinfold_gpu_instance.config, beforescript_tarball_minidb.config

        • Applet: proteinfold_v.1.1.1

    • nf-core/proteinfold pipeline with the full version of the databases for AlphaFold2

      • CPU instance type for the AlphaFold2 step: mem3_ssd3_x24

        • Config files: beforescript_tarball_fulldb.config, run_alphafold2_highmem_cpu.config

        • Applet: proteinfold_v.1.1.1

      • GPU instance type for the AlphaFold2 step: mem2_ssd2_gpu4_v2_x48

        • Config files: alphafold2_R535.config, beforescript_tarball_fulldb.config

        • Applet: proteinfold_modified_GPU_R535_v1.1.1

    • nf-core/proteinfold pipeline with the full version of the databases for ColabFold

      • CPU instance type for the ColabFold step: mem3_ssd3_x24

        • Config files: full_db_cpu_colabfold.config

        • Applet: proteinfold_v.1.1.1

      • GPU instance type for the ColabFold step: mem2_ssd2_gpu1_v2_x64

        • Config files: proteinfold_gpu_instance_full_db_Colabfold.config

        • Applet: proteinfold_modified_GPU_R535_v1.1.1

    • nf-core/proteinfold pipeline with the full version of the databases for ESMFold

      • CPU instance type for the ESMFold step: mem2_ssd1_v2_x4

        • Applet: proteinfold_v.1.1.1

      • GPU instance type for the ESMFold step: mem2_ssd1_gpu_x16

        • Config files: run_esmfold_gpu.config

        • Applet: proteinfold_v.1.1.1
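The example setups listed above can be kept in a small lookup table so that the instance type, config files, and applet for a given run are retrieved together. The Python sketch below is only one possible way to organize them: the instance types, config file names, and applet names are copied from the list above, while the key labels (`"alphafold2"`, `"mini"`, `"gpu"`, and so on) and the names `RunSetup`, `EXAMPLE_SETUPS`, and `find_setup` are hypothetical and not part of the pipeline or of DNAnexus.

```python
from typing import NamedTuple, Tuple


class RunSetup(NamedTuple):
    """One example setup: instance type, config files, and applet name."""
    instance_type: str
    config_files: Tuple[str, ...]
    applet: str


# Keys are (method, database, hardware). The key labels are hypothetical;
# the values are copied from the requirements list above.
EXAMPLE_SETUPS = {
    ("alphafold2", "mini", "gpu"): RunSetup(
        "mem2_ssd1_gpu_x32",
        ("proteinfold_gpu_instance.config", "beforescript_tarball_minidb.config"),
        "proteinfold_v.1.1.1",
    ),
    ("alphafold2", "full", "cpu"): RunSetup(
        "mem3_ssd3_x24",
        ("beforescript_tarball_fulldb.config", "run_alphafold2_highmem_cpu.config"),
        "proteinfold_v.1.1.1",
    ),
    ("alphafold2", "full", "gpu"): RunSetup(
        "mem2_ssd2_gpu4_v2_x48",
        ("alphafold2_R535.config", "beforescript_tarball_fulldb.config"),
        "proteinfold_modified_GPU_R535_v1.1.1",
    ),
    ("colabfold", "full", "cpu"): RunSetup(
        "mem3_ssd3_x24",
        ("full_db_cpu_colabfold.config",),
        "proteinfold_v.1.1.1",
    ),
    ("colabfold", "full", "gpu"): RunSetup(
        "mem2_ssd2_gpu1_v2_x64",
        ("proteinfold_gpu_instance_full_db_Colabfold.config",),
        "proteinfold_modified_GPU_R535_v1.1.1",
    ),
    # No config file is listed above for the ESMFold CPU setup.
    ("esmfold", "full", "cpu"): RunSetup("mem2_ssd1_v2_x4", (), "proteinfold_v.1.1.1"),
    ("esmfold", "full", "gpu"): RunSetup(
        "mem2_ssd1_gpu_x16",
        ("run_esmfold_gpu.config",),
        "proteinfold_v.1.1.1",
    ),
}


def find_setup(method, database, hardware):
    """Return the example setup for a combination, or raise KeyError if none is listed."""
    key = (method.lower(), database.lower(), hardware.lower())
    if key not in EXAMPLE_SETUPS:
        raise KeyError(f"no example setup listed for {key}")
    return EXAMPLE_SETUPS[key]


setup = find_setup("AlphaFold2", "full", "GPU")
print(setup.instance_type, setup.config_files, setup.applet)
```

Combinations that are not listed above, such as ColabFold with the mini databases, raise a `KeyError` rather than falling back to a guess.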

Notes on the nf-core/proteinfold pipeline

  • A note on the output file from the MultiQC step in the pipeline

  • You can learn more about the pipeline outputs in the nf-core/proteinfold documentation.

  • A note on the outdir parameter:

    • Specify the name of the folder where you want to save your output. You do not need to create this folder beforehand, because the pipeline creates it automatically. You can provide the destination path (parent folder) in the destination parameter.

  • We prepared a Markdown file that explains the pipeline on DNAnexus and a few considerations for running it. The file name is nf-proteinfold.md.
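To make the relationship between the two output settings concrete, the Python sketch below joins a destination path (the parent folder) with an outdir value (the output folder name). It assumes the output folder ends up directly under the destination path, as the note on outdir suggests; the function name `resolve_output_folder` and the example paths are hypothetical.

```python
import posixpath


def resolve_output_folder(destination, outdir):
    """Join the destination (parent folder) with outdir (output folder name)."""
    folder_name = outdir.strip("/")
    if not folder_name:
        raise ValueError("outdir must name a folder")
    # Keep a bare "/" destination as the root instead of an empty string.
    parent = destination.rstrip("/") or "/"
    return posixpath.join(parent, folder_name)


# Hypothetical values for the destination and outdir parameters.
print(resolve_output_folder("/proteinfold_runs", "alphafold2_mini"))
```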
