Protein Data Bank (PDB) Curation
Necessary Disclaimers and Legal
Users are responsible for reviewing and complying with the license requirements of the software, notebooks, and data referenced in this documentation.
Users are responsible for compute and storage costs incurred within their DNAnexus project spaces.
Instance type availability and pricing are subject to the agreement between the user (or their organization) and DNAnexus.
Citations and Acknowledgments
This documentation references data and tools from the following resources:
RCSB Protein Data Bank (PDB) — a comprehensive structural biology database widely used in structural genomics and drug discovery.
The PDB query workflow adapts concepts from the TeachOpenCADD tutorial series.
The post-folding analysis notebook employs P2Rank, a machine learning–based method for ligand-binding pocket prediction.
Overview of the PDB query and processing notebook
This notebook provides a workflow for retrieving and organizing experimental protein–ligand structures from the Protein Data Bank (PDB). Users can define structural and experimental criteria to construct reproducible, curated datasets for downstream analysis.
The notebook is available on the Platform as PDB_tutorial_python_query_and_processing_2026-04-08.ipynb in the following regions: AWS US East, AWS Europe (Frankfurt), AWS Europe (London), Azure Amsterdam, and Azure US (West).
Workflow description:
The notebook performs the following steps:
Query the PDB using defined selection criteria
Retrieve and rank matching structures
Download selected complexes
Align structures and extract ligands for comparative analysis
The output is a curated and aligned set of experimental protein–ligand structures prepared for comparative structural analysis.
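The query, ranking, and download steps above can be sketched in a few lines of Python. This is a minimal illustration, not the notebook's actual code: it assumes the public RCSB Search API (v2) and the RCSB file download service, and the helper names (`terminal`, `build_query`, `search`, `rank_by_resolution`, `download_url`), the example selection criteria, and the record layout used for ranking are all hypothetical.

```python
import json
from urllib import request

# Public RCSB Search API endpoint (assumed; the notebook may use a client library instead).
SEARCH_URL = "https://search.rcsb.org/rcsbsearch/v2/query"


def terminal(attribute, operator, value):
    """Build one attribute filter ("terminal" node) for the text search service."""
    return {
        "type": "terminal",
        "service": "text",
        "parameters": {"attribute": attribute, "operator": operator, "value": value},
    }


def build_query(uniprot_id, max_resolution, method="X-RAY DIFFRACTION", max_rows=100):
    """Combine the selection criteria into a single AND query returning entry IDs."""
    nodes = [
        terminal(
            "rcsb_polymer_entity_container_identifiers."
            "reference_sequence_identifiers.database_accession",
            "exact_match",
            uniprot_id,
        ),
        terminal("exptl.method", "exact_match", method),
        terminal("rcsb_entry_info.resolution_combined", "less_or_equal", max_resolution),
    ]
    return {
        "query": {"type": "group", "logical_operator": "and", "nodes": nodes},
        "return_type": "entry",
        "request_options": {"paginate": {"start": 0, "rows": max_rows}},
    }


def search(payload):
    """POST the query and return matching PDB IDs (requires network access)."""
    req = request.Request(
        SEARCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        body = resp.read()
    if not body:  # an empty body means the query matched nothing
        return []
    return [hit["identifier"] for hit in json.loads(body).get("result_set", [])]


def rank_by_resolution(records, top_n):
    """Return the top_n records with the best (numerically lowest) resolution."""
    return sorted(records, key=lambda r: r["resolution"])[:top_n]


def download_url(pdb_id):
    """URL of the PDB-format file for one entry on the RCSB download service."""
    return f"https://files.rcsb.org/download/{pdb_id.upper()}.pdb"
```

A typical call would be `search(build_query("P00533", 3.0))`, where the UniProt accession and resolution cutoff are example values only. The query construction is kept separate from the network call so the criteria can be inspected or saved alongside the results, which helps keep a curated dataset reproducible.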
Running notebooks on the DNAnexus platform
Copying notebooks and snapshot into a Project
To use the notebooks, copy them into your own project as follows:
Create a project for your analysis, billed to your own organization. Tutorials on how to set up a project can be found on this page.
Go to the Resources tab, find the project titled “Public Datasets AWS US (East)”, and open the following folders:
“Protein_Data_Bank” (PDB query notebook)
“Notebook_snapshot”
Select the notebooks and files you want to copy from these folders. For environment setup, use the snapshot snapshot-molecular_modeling-jupyterlab-2026-04-08.tar.gz.
Select "Copy" from the top-right menu, then choose the project that you created in Step 1.
Then go to the project space you created in Step 1 to start exploring the notebooks.
To run the JupyterLab Notebooks, please see the JupyterLab section of the Academy Documentation.
Instance Type Selection
Instance wait times are subject to queue availability. Less common instance types may result in longer wait times due to their limited availability.
Instances started with snapshots may take longer to initialize due to environment setup.
Instance type availability and pricing are subject to the contract between the user or the user’s organization and DNAnexus.
The two notebooks are optimized for JupyterLab with Python, R, Stata, ML, Image Processing (version 2.11). If you do not have access, please contact the Success Team at [email protected] or the Sales Team at [email protected].
Recommended instance type for this demo: mem1_ssd1_v2_x16.
A note on notebooks
Use the snapshot when starting the job (e.g., snapshot-molecular_modeling-jupyterlab-2026-04-08.tar.gz). The snapshots can be found in the “Notebook_snapshot” folder under “Public Datasets AWS US (East)”.
Before running the notebooks, follow the instructions in the notebook markdown to select the correct kernel. If the required kernel is not available, activate the corresponding conda environment and register the kernel as described in the provided instructions.
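As an illustration of the kernel registration step, the sketch below builds and runs the standard `ipykernel install` command from Python. It is not taken from the notebook: the function names, the environment name `molecular_modeling`, and the display name are hypothetical, and the instructions in the notebook markdown take precedence if they differ.

```python
import subprocess
import sys


def kernel_install_command(kernel_name, display_name, python_executable=sys.executable):
    """Build the command that registers a Jupyter kernel for one Python interpreter.

    python_executable should be the interpreter of the conda environment whose
    packages the kernel needs; it defaults to the interpreter running this code.
    """
    return [
        python_executable,
        "-m", "ipykernel", "install",
        "--user",
        "--name", kernel_name,
        "--display-name", display_name,
    ]


def register_kernel(kernel_name, display_name, python_executable=sys.executable):
    """Run the install command; raises CalledProcessError if registration fails."""
    subprocess.run(
        kernel_install_command(kernel_name, display_name, python_executable),
        check=True,
    )
```

For example, `register_kernel("molecular_modeling", "Python (molecular_modeling)")` run from inside the activated environment would make the kernel selectable in JupyterLab, assuming `ipykernel` is installed in that environment.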
If you would like to use this dataset in your own project, follow the section “Copying Notebooks and Snapshot into a Project”, and update the data path in the notebook accordingly. Alternatively, you may use the provided script to download the data directly from the Public Datasets AWS US (East) project.