Protein Data Bank (PDB) Curation

Users are responsible for reviewing and complying with the license requirements of the software, notebooks, and data referenced in this documentation.

Users are responsible for compute and storage costs incurred within their DNAnexus project spaces.

Instance type availability and pricing are subject to the agreement between the user (or their organization) and DNAnexus.

Citations and Acknowledgments

This documentation references data and tools from the following resources:

Overview of the PDB query and processing notebook

This notebook provides a workflow for retrieving and organizing experimental protein–ligand structures from the Protein Data Bank (PDB). Users can define structural and experimental criteria to construct reproducible, curated datasets for downstream analysis.

The notebook is available on the Platform as PDB_tutorial_python_query_and_processing_2026-04-08.ipynb. It is available in the following regions: AWS US East, AWS Europe (Frankfurt), AWS Europe (London), Azure Amsterdam, and Azure US (West).

Workflow description:

The notebook performs the following steps:

  • Query the PDB using defined selection criteria

  • Retrieve and rank matching structures

  • Download selected complexes

  • Align structures and extract ligands for comparative analysis

The output is a curated and aligned set of experimental protein–ligand structures prepared for comparative structural analysis.
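The query, ranking, and download steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual code: it assumes the public RCSB Search API v2 endpoint and its JSON query schema, and the default criteria (X-ray structures at 2.0 Å or better with at least one non-polymer entity) and all function names are hypothetical choices made for the example.

```python
import json
import urllib.request

# Assumed endpoint: the public RCSB Search API v2.
SEARCH_URL = "https://search.rcsb.org/rcsbsearch/v2/query"


def build_query(max_resolution=2.0, method="X-RAY DIFFRACTION", rows=25):
    """Build a search payload for entries solved by `method` at or below
    `max_resolution` (angstroms) that contain at least one non-polymer entity."""
    def terminal(attribute, operator, value):
        return {
            "type": "terminal",
            "service": "text",
            "parameters": {"attribute": attribute, "operator": operator, "value": value},
        }

    return {
        "query": {
            "type": "group",
            "logical_operator": "and",
            "nodes": [
                terminal("exptl.method", "exact_match", method),
                terminal("rcsb_entry_info.resolution_combined", "less_or_equal", max_resolution),
                terminal("rcsb_entry_info.nonpolymer_entity_count", "greater", 0),
            ],
        },
        "return_type": "entry",
        "request_options": {"paginate": {"start": 0, "rows": rows}},
    }


def search(payload, timeout=30):
    """POST the payload and return matching entry IDs (requires network access)."""
    request = urllib.request.Request(
        SEARCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read()
    # An empty body is treated as "no entries matched".
    if not body:
        return []
    return [hit["identifier"] for hit in json.loads(body)["result_set"]]


def rank_by_resolution(entries):
    """Sort (entry_id, resolution) pairs so the best-resolved structures come first."""
    return sorted(entries, key=lambda pair: pair[1])


def download_url(entry_id, fmt="cif"):
    """Return the RCSB file-download URL for an entry in the given format."""
    return f"https://files.rcsb.org/download/{entry_id.upper()}.{fmt}"


# Placeholder IDs and resolutions, for illustration only (not real PDB data).
ranked = rank_by_resolution([("AAAA", 2.1), ("BBBB", 1.4), ("CCCC", 1.9)])
urls = [download_url(entry_id) for entry_id, _ in ranked]
```

Keeping the payload construction separate from the network call makes the selection criteria easy to record alongside the curated dataset, which helps with reproducibility.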

Running notebooks on the DNAnexus platform

Copying notebooks and snapshot into a Project

To use the notebooks, copy them into your own project space as follows:

  1. Create a project for your analysis, billed to your own organization. Tutorials on how to set up a project can be found on this page.

  2. Go to the Resources tab, find the project titled “Public Datasets AWS US (East)”, and select the following folders:

    1. “Protein_Data_Bank” (PDB query notebook)

    2. “Notebook_snapshot”

  3. Select the notebooks and files in these folders that you want to copy. For environment setup, use the snapshot snapshot-molecular_modeling-jupyterlab-2026-04-08.tar.gz.

  4. Select “Copy” in the top-right menu, then choose the project that you created in Step 1.

  5. Go to the project space you created in Step 1 to start exploring the notebooks.

  6. To run the JupyterLab Notebooks, please see the JupyterLab section of the Academy Documentation.
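As an alternative to the steps in the web interface, the copy could in principle be scripted with the `dx` command-line client. The sketch below only builds and prints `dx cp` commands. It assumes that the `dx` CLI is installed and logged in, that the source project can be resolved by the name shown above (a project ID may be needed instead), and that `my-pdb-project` is a hypothetical name standing in for the project you created in Step 1.

```python
import shlex
import subprocess

SOURCE_PROJECT = "Public Datasets AWS US (East)"  # project named in the steps above
SOURCE_FOLDERS = ["Protein_Data_Bank", "Notebook_snapshot"]
DEST_PROJECT = "my-pdb-project"  # hypothetical: replace with your own project


def build_copy_commands(source_project, folders, dest_project):
    """Return one `dx cp` argument list per folder, copying into the destination root."""
    return [
        ["dx", "cp", f"{source_project}:/{folder}", f"{dest_project}:/"]
        for folder in folders
    ]


def run_copy(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(shlex.join(command))
        if not dry_run:
            subprocess.run(command, check=True)


commands = build_copy_commands(SOURCE_PROJECT, SOURCE_FOLDERS, DEST_PROJECT)
run_copy(commands)  # dry run: prints the commands without contacting the platform
```

Call `run_copy(commands, dry_run=False)` only after checking the printed commands, since copied data counts toward storage costs in the destination project.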

Instance Type Selection

  • Instance wait times are subject to queue availability. Less common instance types may result in longer wait times due to their limited availability.

  • Instances started with snapshots may take longer to initialize due to environment setup.

  • Instance type availability and pricing are subject to the contract between the user or the user’s organization and DNAnexus.

  • The notebooks are optimized for JupyterLab with Python, R, Stata, ML, Image Processing (version 2.11). If you do not have access, please contact the Success Team at [email protected] or the Sales Team at [email protected].

  • Recommended instance type for this demo: mem1_ssd1_v2_x16.

A note on notebooks

  • Use the snapshot when starting the job (e.g., snapshot-molecular_modeling-jupyterlab-2026-04-08.tar.gz). The snapshots can be found in the “Notebook_snapshot” folder under “Public Datasets AWS US (East)”.

  • Before running the notebooks, follow the instructions in the notebook markdown to select the correct kernel. If the required kernel is not available, activate the corresponding conda environment and register the kernel as described in the provided instructions.

  • If you would like to use this dataset in your own project, follow the steps in the section “Copying notebooks and snapshot into a Project” and update the data path in the notebook accordingly. Alternatively, you may use the provided script to download the data directly from the Public Datasets AWS US (East) project.
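If the required kernel is missing, registering it typically comes down to running `ipykernel install` inside the relevant conda environment. The following is a minimal sketch under stated assumptions: the environment name `molecular_modeling` is a guess based on the snapshot file name, not a documented value, and `ipykernel` is assumed to be installed in that environment. The instructions in the notebook markdown take precedence.

```python
import shlex
import subprocess


def kernel_install_command(env_name, display_name):
    """Build a `conda run` command that registers env_name as a Jupyter kernel."""
    return [
        "conda", "run", "-n", env_name,
        "python", "-m", "ipykernel", "install",
        "--user", "--name", env_name, "--display-name", display_name,
    ]


# Hypothetical environment name; check `conda env list` for the real one.
command = kernel_install_command("molecular_modeling", "Python (molecular_modeling)")
print(shlex.join(command))

# To register the kernel from within the JupyterLab session, uncomment:
# subprocess.run(command, check=True)
```

After registration, the kernel should appear in the JupyterLab kernel picker under the chosen display name; a browser refresh may be needed.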
