
# Protein Data Bank (PDB) Curation

## Necessary Disclaimers and Legal

Users are responsible for reviewing and complying with the license requirements of the software, notebooks, and data referenced in this documentation.

Users are responsible for compute and storage costs incurred within their DNAnexus project spaces.

Instance type availability and pricing are subject to the agreement between the user (or their organization) and DNAnexus.

## Citations and Acknowledgments

This documentation references data and tools from the following resources:

* [RCSB Protein Data Bank (PDB)](https://www.rcsb.org/) — a comprehensive structural biology database widely used in structural genomics and drug discovery.
* The PDB query workflow adapts concepts from the [TeachOpenCADD tutorial series](https://projects.volkamerlab.org/teachopencadd/talktorials/T008_query_pdb.html).
* The notebooks utilize Python packages including [biotite](https://www.biotite-python.org/latest/tutorial/index.html), [pypdb](https://academic.oup.com/bioinformatics/article/32/1/159/1743800), and [py3Dmol](https://github.com/avirshup/py3dmol).
* The post-folding analysis notebook employs [P2Rank](https://link.springer.com/article/10.1186/s13321-018-0285-8), a machine learning–based method for ligand-binding pocket prediction.

## Overview of the PDB query and processing notebook

This notebook provides a workflow for retrieving and organizing experimental protein–ligand structures from the Protein Data Bank (PDB). Users can define structural and experimental criteria to construct reproducible, curated datasets for downstream analysis.
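As a rough illustration of what "structural and experimental criteria" can look like in code, the sketch below builds a JSON payload for the RCSB Search API using only the Python standard library. It is not taken from the notebook, which relies on pypdb and biotite. The helper names (`terminal`, `build_query`, `run_query`), the example accession, and the thresholds are assumptions made for illustration, and the endpoint and attribute names should be checked against the current RCSB documentation.

```python
import json
import urllib.request

# RCSB Search API endpoint (v2) at the time of writing; verify before relying on it.
SEARCH_URL = "https://search.rcsb.org/rcsbsearch/v2/query"

# Attribute holding reference sequence accessions (for example, UniProt IDs).
ACCESSION_ATTR = (
    "rcsb_polymer_entity_container_identifiers."
    "reference_sequence_identifiers.database_accession"
)


def terminal(attribute, operator, value):
    """Build one attribute filter (a 'terminal' node) for the text search service."""
    return {
        "type": "terminal",
        "service": "text",
        "parameters": {"attribute": attribute, "operator": operator, "value": value},
    }


def build_query(uniprot_id, max_resolution, method="X-RAY DIFFRACTION", rows=25):
    """Combine the selection criteria with a logical AND, sorted by resolution."""
    return {
        "query": {
            "type": "group",
            "logical_operator": "and",
            "nodes": [
                terminal(ACCESSION_ATTR, "exact_match", uniprot_id),
                terminal("exptl.method", "exact_match", method),
                terminal("rcsb_entry_info.resolution_combined", "less_or_equal", max_resolution),
            ],
        },
        "return_type": "entry",
        "request_options": {
            "paginate": {"start": 0, "rows": rows},
            "sort": [{"sort_by": "rcsb_entry_info.resolution_combined", "direction": "asc"}],
        },
    }


def run_query(payload):
    """POST the payload and return matching PDB IDs (requires network access)."""
    request = urllib.request.Request(
        SEARCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        body = response.read()
    # An empty body means no entries matched the criteria.
    if not body:
        return []
    return [hit["identifier"] for hit in json.loads(body).get("result_set", [])]
```

For example, `build_query("P00533", 3.0)` describes X-ray structures of EGFR (UniProt accession P00533) at 3.0 Å resolution or better. Keeping the payload as plain data means the exact criteria can be saved alongside the results, which helps make a curated dataset reproducible.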

The notebook, PDB\_tutorial\_python\_query\_and\_processing\_2026-04-08.ipynb, is available on the Platform in the following regions: [AWS US East](https://platform.dnanexus.com/panx/projects/J3JyY6j030gzQypGpk273241/data/Protein_Data_Bank), [AWS Europe (Frankfurt)](https://platform.dnanexus.com/panx/projects/J780j7848VpfB6kJ8p7y29xG/data/Protein_Data_Bank), [AWS Europe (London)](https://platform.dnanexus.com/panx/projects/J780fzpKpb7Gq5X4ZJfBP7QX/data/Protein_Data_Bank), [Azure Amsterdam](https://platform.dnanexus.com/panx/projects/J780gY0B34pvq5X4ZJfBP7YP/data/Protein_Data_Bank), and [Azure US (West)](https://platform.dnanexus.com/panx/projects/J780v289Z00G4Kx14b188ybj/data/Protein_Data_Bank).

### Workflow description

The notebook:

* Queries the PDB using user-defined selection criteria
* Retrieves and ranks matching structures
* Downloads selected complexes
* Aligns structures and extracts ligands for comparative analysis

The output is a curated and aligned set of experimental protein–ligand structures prepared for comparative structural analysis.
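The ranking and ligand-extraction steps can be sketched in plain Python. This is an illustrative approximation under stated assumptions, not the notebook's implementation: the function names, the `resolution` dictionary key, and the `min_atoms` cutoff are hypothetical, and structural alignment (which the notebook performs with dedicated packages) is not shown. The column positions follow the fixed-width legacy PDB file format.

```python
from collections import defaultdict


def rank_by_resolution(entries, top_n=3):
    """Return the top_n entries with the lowest (best) resolution in angstroms.

    Entries without a resolution value (for example, NMR structures) are skipped.
    """
    with_resolution = [e for e in entries if e.get("resolution") is not None]
    return sorted(with_resolution, key=lambda e: e["resolution"])[:top_n]


def extract_ligands(pdb_text, min_atoms=5, exclude=("HOH",)):
    """Group HETATM records by (residue name, chain ID, residue number).

    Excluded residue names (water by default) are dropped, as are groups with
    fewer than min_atoms atoms, which removes most ions and small fragments.
    """
    groups = defaultdict(list)
    for line in pdb_text.splitlines():
        if not line.startswith("HETATM"):
            continue
        res_name = line[17:20].strip()  # columns 18-20: residue name
        chain_id = line[21:22]          # column 22: chain identifier
        res_seq = int(line[22:26])      # columns 23-26: residue sequence number
        if res_name in exclude:
            continue
        groups[(res_name, chain_id, res_seq)].append(line)
    return {key: atoms for key, atoms in groups.items() if len(atoms) >= min_atoms}
```

A simple atom-count cutoff is a blunt filter: it can discard small but relevant ligands and keep crystallization additives, so a curated workflow would typically combine it with an explicit list of residue names to keep or ignore.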

## Running notebooks on the DNAnexus platform

### Copying notebooks and snapshot into a Project

To use the notebooks, copy them into your own project as follows:

1. Create a project for your analysis, billed to your own organization. Tutorials on how to set up a project can be found on this page.
2. Go to the Resources tab, find the project titled “Public Datasets AWS US (East)”, and open the following folders:
   1. “Protein\_Data\_Bank” (PDB query notebook)
   2. “Notebook\_snapshot”
3. Select the notebooks and files in these folders that you want to copy. For environment setup, use the snapshot **snapshot-molecular\_modeling-jupyterlab-2026-04-08.tar.gz**.
4. Select "Copy" in the top-right menu, then select the project that you created in Step 1.
5. Go to the project you created in Step 1 to start exploring the notebooks.
6. To run the JupyterLab Notebooks, please see the [JupyterLab section of the Academy Documentation](https://academy.dnanexus.com/interactivecloudcomputing/jupyterlab).

### Instance Type Selection

* Instance wait times are subject to queue availability. Less common instance types may result in longer wait times due to their limited availability.
* Instances started with snapshots may take longer to initialize due to environment setup.
* Instance type availability and pricing are subject to the contract between the user or the user’s organization and DNAnexus.
* The two notebooks are optimized for [JupyterLab with Python, R, Stata, ML, Image Processing](https://academy.dnanexus.com/interactivecloudcomputing/jupyterlab) (version 2.11). If you do not have access, please contact the Success Team at <success@dnanexus.com> or the Sales Team at <sales@dnanexus.com>.
* Recommended instance type for this demo: mem1\_ssd1\_v2\_x16.

### A note on notebooks

* Use the snapshot when starting the job (e.g., snapshot-molecular\_modeling-jupyterlab-2026-04-08.tar.gz). The snapshots can be found in the “Notebook\_snapshot” folder under “Public Datasets AWS US (East)”.
* Before running the notebooks, follow the instructions in the notebook markdown to select the correct kernel. If the required kernel is not available, activate the corresponding conda environment and register the kernel as described in the provided instructions.
* If you would like to use this dataset in your own project, follow the steps in the section “Copying notebooks and snapshot into a Project” and update the data path in the notebook accordingly. Alternatively, you may use the provided script to download the data directly from the Public Datasets AWS US (East) project.


