Python API tutorial#

Setup#

Installation#

plinder is available on PyPI.

pip install plinder

Environment variable configuration#

We need to set environment variables to point to the release and iteration of choice. For demonstration, this tutorial points to a smaller example dataset: PLINDER_RELEASE=2024-06 and PLINDER_ITERATION=tutorial.

Note

The version used for the preprint is PLINDER_RELEASE=2024-04 and PLINDER_ITERATION=v1, while the current version with updated annotations, to be used for the MLSB challenge, is PLINDER_RELEASE=2024-06 and PLINDER_ITERATION=v2.

import os
from pathlib import Path

release = "2024-06"
iteration = "tutorial"
os.environ["PLINDER_RELEASE"] = release
os.environ["PLINDER_ITERATION"] = iteration
os.environ["PLINDER_REPO"] = str(Path.home() / "plinder-org/plinder")
os.environ["PLINDER_LOCAL_DIR"] = str(Path.home() / ".local/share/plinder")
os.environ["GCLOUD_PROJECT"] = "plinder"
version = f"{release}/{iteration}"

Alternatively, these variables can be set from the terminal via export (UNIX) or set (Windows).
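
Because a missing or misspelled variable would point plinder at a different dataset than intended, it can help to verify the variables before importing the package. The following is a minimal sketch that uses only the standard library; the helper name plinder_version is hypothetical and not part of plinder.

```python
import os

# The two variables this tutorial relies on to select a dataset version.
REQUIRED_VARS = ("PLINDER_RELEASE", "PLINDER_ITERATION")


def plinder_version(env=os.environ):
    """Return "<release>/<iteration>", or raise if a variable is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"unset environment variables: {', '.join(missing)}")
    return f"{env['PLINDER_RELEASE']}/{env['PLINDER_ITERATION']}"


# Pass an explicit mapping here so the example does not depend on the shell.
example_env = {"PLINDER_RELEASE": "2024-06", "PLINDER_ITERATION": "tutorial"}
print(plinder_version(example_env))
```

Calling plinder_version() without arguments checks the variables of the current process instead.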

Overview#

The user-facing subpackage of plinder is plinder.core. This provides access to the underlying utility functions for accessing the dataset, split and annotations. It provides access to five top-level functions:

In addition, it provides access to the data class PlinderSystem for reconstituting a PLINDER system from its system_id.

To supplement these data, plinder.core.scores provides functionality for querying metrics, such as protein/ligand similarity and cluster identity.

Getting the configuration#

First, we get the configuration to check that all parameters are set correctly. The snippet below checks whether the local and remote PLINDER paths point to the expected locations.

import plinder.core.utils.config

cfg = plinder.core.get_config()
print(f"local cache directory: {cfg.data.plinder_dir}")
print(f"remote data directory: {cfg.data.plinder_remote}")
local cache directory: /home/runner/.local/share/plinder/2024-06/tutorial
remote data directory: gs://plinder/2024-06/tutorial
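
Judging from the output above, both locations end in the release and the iteration. The sketch below rebuilds the expected values from the settings so that they can be compared with what get_config() reports. The helper expected_locations and the default bucket name are assumptions read off this tutorial's output; they are not part of the plinder API.

```python
from pathlib import PurePosixPath


def expected_locations(local_dir, release, iteration, bucket="plinder"):
    """Build the local and remote paths implied by the settings.

    The "<root>/<release>/<iteration>" layout is inferred from the
    printed configuration in this tutorial, not from the plinder source.
    """
    local = PurePosixPath(local_dir) / release / iteration
    remote = f"gs://{bucket}/{release}/{iteration}"
    return str(local), remote


local, remote = expected_locations(
    "/home/runner/.local/share/plinder", "2024-06", "tutorial"
)
print(local)
print(remote)
```

If either value differs from the one printed by the configuration, the environment variables are a likely place to look first.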

Query annotations#

Query specific columns#

To query the annotations table for specific columns or to filter by specific criteria, use query_index(). Called without arguments, the function returns a pandas DataFrame with the columns system_id and entry_pdb_id. To select other columns, pass the columns argument, which is a list of column names.

from plinder.core.scores import query_index
# Get system_id and entry_pdb_id columns
query_index()
system_id entry_pdb_id
0 3grt__1__1.A_2.A__1.B 3grt
1 3grt__1__1.A_2.A__1.C 3grt
2 3grt__1__1.A_2.A__2.B 3grt
3 3grt__1__1.A_2.A__2.C 3grt
4 1grx__1__1.A__1.B 1grx
... ... ...
1357899 4lpn__1__10.A_24.A_3.A__24.X 4lpn
1357900 2lp3__1__1.A__1.C 2lp3
1357901 2lp3__1__1.A__1.D 2lp3
1357902 2lp3__1__1.B__1.E 2lp3
1357903 2lp3__1__1.B__1.F 2lp3

1357904 rows × 2 columns

# Get specific columns from the annotation table
cols_of_interest = ["system_id", "entry_pdb_id", "entry_release_date", "entry_oligomeric_state",
"entry_clashscore", "entry_resolution"]
query_index(columns=cols_of_interest)
system_id entry_pdb_id entry_release_date entry_oligomeric_state entry_clashscore entry_resolution
0 3grt__1__1.A_2.A__1.B 3grt 1997-02-12 dimeric 12.90 2.50
1 3grt__1__1.A_2.A__1.C 3grt 1997-02-12 dimeric 12.90 2.50
2 3grt__1__1.A_2.A__2.B 3grt 1997-02-12 dimeric 12.90 2.50
3 3grt__1__1.A_2.A__2.C 3grt 1997-02-12 dimeric 12.90 2.50
4 1grx__1__1.A__1.B 1grx 1993-10-01 monomeric NaN NaN
... ... ... ... ... ... ...
1357899 4lpn__1__10.A_24.A_3.A__24.X 4lpn 2013-07-16 24-meric 3.34 1.66
1357900 2lp3__1__1.A__1.C 2lp3 2012-01-31 dimeric NaN NaN
1357901 2lp3__1__1.A__1.D 2lp3 2012-01-31 dimeric NaN NaN
1357902 2lp3__1__1.B__1.E 2lp3 2012-01-31 dimeric NaN NaN
1357903 2lp3__1__1.B__1.F 2lp3 2012-01-31 dimeric NaN NaN

1357904 rows × 6 columns

Query annotations with specific filters#

We can also pass additional filters, where each filter is a logical comparison of a column name with a given value. Only rows that fulfill all conditions are returned. See the description of pandas.read_parquet() (https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html) for more information on the filter syntax.

# Query for single-ligand systems
filters = [("system_num_ligand_chains", "==", "1")]
query_index(columns=cols_of_interest, filters=filters)
system_id entry_pdb_id entry_release_date entry_oligomeric_state entry_clashscore entry_resolution
0 3grt__1__1.A_2.A__1.B 3grt 1997-02-12 dimeric 12.90 2.50
1 3grt__1__1.A_2.A__1.C 3grt 1997-02-12 dimeric 12.90 2.50
2 3grt__1__1.A_2.A__2.B 3grt 1997-02-12 dimeric 12.90 2.50
3 3grt__1__1.A_2.A__2.C 3grt 1997-02-12 dimeric 12.90 2.50
4 1grx__1__1.A__1.B 1grx 1993-10-01 monomeric NaN NaN
... ... ... ... ... ... ...
809504 4lpn__1__10.A_24.A_3.A__24.X 4lpn 2013-07-16 24-meric 3.34 1.66
809505 2lp3__1__1.A__1.C 2lp3 2012-01-31 dimeric NaN NaN
809506 2lp3__1__1.A__1.D 2lp3 2012-01-31 dimeric NaN NaN
809507 2lp3__1__1.B__1.E 2lp3 2012-01-31 dimeric NaN NaN
809508 2lp3__1__1.B__1.F 2lp3 2012-01-31 dimeric NaN NaN

809509 rows × 6 columns
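
To make the filter semantics concrete, the sketch below applies a list of (column, operator, value) tuples to plain Python dictionaries and keeps only the rows that satisfy every condition. It is an illustration of the "all conditions must hold" rule, not how query_index() is implemented; the helper matches and the operator table are hypothetical, and the two rows are copied from the output above.

```python
import operator

# Subset of the comparison operators accepted by the filter syntax.
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda value, options: value in options,
}


def matches(row, filters):
    """Return True only if the row fulfills every (column, op, value) filter."""
    return all(OPS[op](row[column], value) for column, op, value in filters)


rows = [
    {"system_id": "3grt__1__1.A_2.A__1.B",
     "entry_oligomeric_state": "dimeric", "entry_resolution": 2.50},
    {"system_id": "4lpn__1__10.A_24.A_3.A__24.X",
     "entry_oligomeric_state": "24-meric", "entry_resolution": 1.66},
]
filters = [
    ("entry_oligomeric_state", "==", "dimeric"),
    ("entry_resolution", "<=", 3.0),
]
selected = [row["system_id"] for row in rows if matches(row, filters)]
print(selected)
```

The second row has the better resolution but is dropped because it fails the oligomeric state condition. Missing values (NaN) are not handled in this sketch.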

Note

To load all columns, use the function get_plindex(), which returns the complete dataframe. However, since this table has over 1.3 million rows and 500 columns, it has a significant memory footprint, and users are advised to query only the columns they need.

Query protein similarity#

We provide three kinds of similarity datasets:

  • Similarity between ligand bound structures (holo)

  • Similarity between ligand bound and unbound protein structures (apo)

  • Similarity between ligand bound and AlphaFold predicted structures (pred)

Any of these can be selected via the search_db argument of query_protein_similarity().

Note

With the full dataset, some similarity queries might require a large amount of memory. For example, query_protein_similarity(search_db="holo", filters=[("similarity", ">", "50")]) will use more than 500 GB of RAM.

Here, we query the protein similarity dataset to assess the protein-ligand interaction similarity between example training and test sets.

from plinder.core.scores import query_protein_similarity
# Example train systems
train = ["7jxf__1__1.A_1.B__1.G", "1jtu__1__1.A_1.B__1.C_1.D",
         "8f9d__2__1.C_1.D__1.G", "6a9a__1__1.A_2.A__2.C_2.D",
         "1b5e__2__1.A_1.B__1.D"]
# Example test systems
test = ["1b5d__1__1.A_1.B__1.D", "1s2g__1__1.A_2.C__1.D",
        "4agi__1__1.C__1.W", "4n7m__1__1.A_1.B__1.C",
        "7eek__1__1.A__1.I"]

metric = "pli_unique_qcov"
threshold = 50
query_protein_similarity(
    search_db="holo",
    columns=["query_system", "target_system", "similarity"],
    filters=[
        ("query_system", "in", test),
        ("target_system", "in", train),
        ("metric", "==", metric),
        ("similarity", ">=", str(threshold)),
    ],
)
query_system target_system similarity
0 1b5d__1__1.A_1.B__1.D 1b5e__2__1.A_1.B__1.D 83
1 1b5d__1__1.A_1.B__1.D 6a9a__1__1.A_2.A__2.C_2.D 83
2 1b5d__1__1.A_1.B__1.D 1jtu__1__1.A_1.B__1.C_1.D 67
3 1b5d__1__1.A_1.B__1.D 7jxf__1__1.A_1.B__1.G 67
4 4n7m__1__1.A_1.B__1.C 8f9d__2__1.C_1.D__1.G 50
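
A result like this is typically summarized per test system, for example to see which test systems have a close training neighbor. The sketch below does this in plain Python with the five rows copied from the output above; the variable names are illustrative, and with the real dataframe a pandas groupby would be the more natural tool.

```python
# (query_system, target_system, similarity) rows copied from the output above.
hits = [
    ("1b5d__1__1.A_1.B__1.D", "1b5e__2__1.A_1.B__1.D", 83),
    ("1b5d__1__1.A_1.B__1.D", "6a9a__1__1.A_2.A__2.C_2.D", 83),
    ("1b5d__1__1.A_1.B__1.D", "1jtu__1__1.A_1.B__1.C_1.D", 67),
    ("1b5d__1__1.A_1.B__1.D", "7jxf__1__1.A_1.B__1.G", 67),
    ("4n7m__1__1.A_1.B__1.C", "8f9d__2__1.C_1.D__1.G", 50),
]
test_systems = [
    "1b5d__1__1.A_1.B__1.D", "1s2g__1__1.A_2.C__1.D",
    "4agi__1__1.C__1.W", "4n7m__1__1.A_1.B__1.C",
    "7eek__1__1.A__1.I",
]

# Highest similarity to any training system, per test system.
max_similarity = {}
for query, _target, similarity in hits:
    max_similarity[query] = max(similarity, max_similarity.get(query, 0))

# Test systems without any training hit at or above the threshold.
no_hit = [system for system in test_systems if system not in max_similarity]

print(max_similarity)
print(no_hit)
```

Note that a system listed in no_hit is only dissimilar with respect to the chosen metric and threshold, and only relative to the five example training systems.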

Working with a PLINDER system#

A PlinderSystem is the representation of a single System. This object provides access to all PDB entry and system level annotations, as well as the structures of the system components.

Load systems from IDs#

To reconstitute a PLINDER system directly from its ID, use the class PlinderSystem.

from plinder.core import PlinderSystem
plinder_system = PlinderSystem(system_id="4agi__1__1.C__1.W")

Users can choose the granularity of the input: the example above specifies the system by its system ID, but passing PDB IDs (or their two middle characters) is also possible, which returns all systems corresponding to the given PDB IDs.
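
The system IDs used in this tutorial appear to follow the pattern <PDB ID>__<biounit>__<receptor chains>__<ligand chains>, with multiple chains joined by single underscores. The sketch below splits an ID along these lines. The pattern is inferred from the examples on this page, and the helper parse_system_id and its field names are hypothetical rather than part of plinder.

```python
from typing import List, NamedTuple


class SystemIdParts(NamedTuple):
    pdb_id: str
    biounit_id: str
    receptor_chains: List[str]
    ligand_chains: List[str]


def parse_system_id(system_id: str) -> SystemIdParts:
    """Split a system ID on double underscores, then chains on single ones."""
    pdb_id, biounit_id, receptors, ligands = system_id.split("__")
    return SystemIdParts(
        pdb_id, biounit_id, receptors.split("_"), ligands.split("_")
    )


parts = parse_system_id("1jtu__1__1.A_1.B__1.C_1.D")
print(parts)
# The "two middle characters" of the PDB ID mentioned above.
print(parts.pdb_id[1:3])
```

For the system loaded above, 4agi__1__1.C__1.W, the same function yields one receptor chain (1.C) and one ligand chain (1.W).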

Accessing annotations#

The PlinderSystem.entry property provides PDB entry-level annotations for that system. Here, we will list the accessible categories of entry annotations and access the oligomeric state of a given system.

entry_annotations = plinder_system.entry
print(list(entry_annotations.keys()))
print(entry_annotations["oligomeric_state"])
['pdb_id', 'release_date', 'oligomeric_state', 'determination_method', 'keywords', 'pH', 'resolution', 'chains', 'ligand_like_chains', 'systems', 'covalent_bonds', 'chain_to_seqres', 'validation', 'pass_criteria', 'water_chains', 'symmetry_mate_contacts']
dimeric

In contrast, PlinderSystem.system returns system-level annotations. Here, we extract the SMILES string of the first ligand of the system.

system_annotations = plinder_system.system
print(list(system_annotations.keys()))
# Show ligand smiles of the first ligand of a given system
print(system_annotations["ligands"][0]["smiles"])
['pdb_id', 'biounit_id', 'ligands', 'ligand_validation', 'pocket_validation', 'pass_criteria']
C[Se][C@@H]1O[C@@H](C)[C@@H](O)[C@@H](O)[C@@H]1O
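
Since "ligands" is a list, a system with several ligands can be handled with a loop instead of a fixed index. The following sketch collects the SMILES strings of all ligands from a stand-in dictionary that mimics only the fields used above; the real annotations contain more fields, and the helper ligand_smiles is hypothetical.

```python
# Stand-in for plinder_system.system, reduced to the fields shown above.
system_annotations = {
    "ligands": [
        {"smiles": "C[Se][C@@H]1O[C@@H](C)[C@@H](O)[C@@H](O)[C@@H]1O"},
    ],
}


def ligand_smiles(annotations):
    """Return the SMILES string of every ligand that has one."""
    return [
        ligand["smiles"]
        for ligand in annotations.get("ligands", [])
        if "smiles" in ligand
    ]


smiles = ligand_smiles(system_annotations)
print(len(smiles), smiles[0])
```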

Getting structure file paths#

The PlinderSystem also provides access to the structure files the system is based on. This could be helpful for loading the structures for training a model or performing other calculations that require structural information.

print(plinder_system.ligands)
{'1.W': '/home/runner/.local/share/plinder/2024-06/tutorial/systems/4agi__1__1.C__1.W/ligand_files/1.W.sdf'}

The same can be done for the receptor protein.

print(plinder_system.receptor_pdb)
/home/runner/.local/share/plinder/2024-06/tutorial/systems/4agi__1__1.C__1.W/receptor.pdb

Inspect apo and predicted annotations#

For users interested in using apo and predicted structures in model training, the snippet below maps holo system IDs (reference_system_id) to apo or predicted IDs (id) and reports their similarity measures. This similarity data includes protein and pocket similarity, as well as all evaluation metrics calculated upon superposition and transplantation of ligands into each apo/predicted structure. The same information can also be accessed with query_links().

plinder_system.linked_structures
reference_system_id id pocket_fident pocket_lddt protein_fident_qcov_weighted_sum protein_fident_weighted_sum protein_lddt_weighted_sum target_id sort_score receptor_file ... posebusters_volume_overlap_with_inorganic_cofactors posebusters_volume_overlap_with_waters fraction_reference_proteins_mapped fraction_model_proteins_mapped lddt bb_lddt per_chain_lddt_ave per_chain_bb_lddt_ave filename kind
0 4agi__1__1.C__1.W 4uou_B 100.0 100.0 100.0 100.0 99.0 4uou 2.40 /plinder/2024-06/assignments/apo/4agi__1__1.C_... ... True True 1.0 1.0 0.972682 0.994065 0.965793 0.990783 /home/runner/.local/share/plinder/2024-06/tuto... apo
1 4agi__1__1.C__1.W 4uou_C 100.0 99.0 100.0 100.0 99.0 4uou 2.40 /plinder/2024-06/assignments/apo/4agi__1__1.C_... ... True True 1.0 1.0 0.973562 0.994687 0.966137 0.991653 /home/runner/.local/share/plinder/2024-06/tuto... apo
2 4agi__1__1.C__1.W 4uou_D 100.0 100.0 100.0 100.0 99.0 4uou 2.40 /plinder/2024-06/assignments/apo/4agi__1__1.C_... ... True True 1.0 1.0 0.973604 0.994235 0.966834 0.990844 /home/runner/.local/share/plinder/2024-06/tuto... apo
3 4agi__1__1.C__1.W 4uou_A 100.0 99.0 100.0 100.0 99.0 4uou 2.40 /plinder/2024-06/assignments/apo/4agi__1__1.C_... ... True True 1.0 1.0 0.967257 0.994800 0.961169 0.991704 /home/runner/.local/share/plinder/2024-06/tuto... apo
4 4agi__1__1.C__1.W Q4WW81_A 100.0 100.0 99.0 99.0 100.0 Q4WW81 98.57 /plinder/2024-06/assignments/pred/4agi__1__1.C... ... True True 1.0 1.0 0.982275 0.998587 0.977748 0.997611 /home/runner/.local/share/plinder/2024-06/tuto... pred

5 rows × 52 columns

query_links() can also be called directly:

from plinder.core.scores import query_links
links = query_links()
links
reference_system_id id pocket_fident pocket_lddt protein_fident_qcov_weighted_sum protein_fident_weighted_sum protein_lddt_weighted_sum target_id sort_score receptor_file ... posebusters_volume_overlap_with_inorganic_cofactors posebusters_volume_overlap_with_waters fraction_reference_proteins_mapped fraction_model_proteins_mapped lddt bb_lddt per_chain_lddt_ave per_chain_bb_lddt_ave filename kind
0 6pl9__1__1.A__1.C 2vb1_A 100.0 86.0 100.0 100.0 96.0 2vb1 0.65 /plinder/2024-06/assignments/apo/6pl9__1__1.A_... ... True True 1.0 1.0 0.903772 0.968844 0.890822 0.959674 /home/runner/.local/share/plinder/2024-06/tuto... apo
1 6ahh__1__1.A__1.G 2vb1_A 100.0 98.0 100.0 100.0 95.0 2vb1 0.65 /plinder/2024-06/assignments/apo/6ahh__1__1.A_... ... True True 1.0 1.0 0.894349 0.962846 0.883217 0.954721 /home/runner/.local/share/plinder/2024-06/tuto... apo
2 5b59__1__1.A__1.B 2vb1_A 100.0 91.0 100.0 100.0 96.0 2vb1 0.65 /plinder/2024-06/assignments/apo/5b59__1__1.A_... ... True True 1.0 1.0 0.903266 0.962318 0.890656 0.955258 /home/runner/.local/share/plinder/2024-06/tuto... apo
3 3ato__1__1.A__1.B 2vb1_A 100.0 99.0 100.0 100.0 95.0 2vb1 0.65 /plinder/2024-06/assignments/apo/3ato__1__1.A_... ... True True 1.0 1.0 0.890530 0.954696 0.879496 0.946326 /home/runner/.local/share/plinder/2024-06/tuto... apo
4 6mx9__1__1.A__1.K 2vb1_A 100.0 98.0 100.0 100.0 95.0 2vb1 0.65 /plinder/2024-06/assignments/apo/6mx9__1__1.A_... ... True True 1.0 1.0 0.904116 0.964309 0.892434 0.955853 /home/runner/.local/share/plinder/2024-06/tuto... apo
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
597774 6x3q__1__1.A__1.B A8AWU7_A 100.0 79.0 99.0 99.0 88.0 A8AWU7 38.90 /plinder/2024-06/assignments/pred/6x3q__1__1.A... ... True True 1.0 1.0 0.815736 0.877814 0.806444 0.871054 /home/runner/.local/share/plinder/2024-06/tuto... pred
597775 8st5__1__1.A__1.B A8AWU7_A 100.0 95.0 99.0 99.0 88.0 A8AWU7 38.90 /plinder/2024-06/assignments/pred/8st5__1__1.A... ... True True 1.0 1.0 0.814876 0.885938 0.814176 0.881858 /home/runner/.local/share/plinder/2024-06/tuto... pred
597776 6efd__1__1.A__1.B A8AWU7_A 100.0 81.0 99.0 99.0 87.0 A8AWU7 38.90 /plinder/2024-06/assignments/pred/6efd__1__1.A... ... True True 1.0 1.0 0.814404 0.879823 0.810680 0.872417 /home/runner/.local/share/plinder/2024-06/tuto... pred
597777 8st6__1__1.A__1.D A8AWU7_A 100.0 80.0 99.0 99.0 88.0 A8AWU7 38.90 /plinder/2024-06/assignments/pred/8st6__1__1.A... ... True True 1.0 1.0 0.816566 0.884372 0.813010 0.877505 /home/runner/.local/share/plinder/2024-06/tuto... pred
597778 3rgu__3__1.C__1.E A1C3L3_A 100.0 94.0 99.0 100.0 91.0 A1C3L3 37.07 /plinder/2024-06/assignments/pred/3rgu__3__1.C... ... True True 1.0 1.0 0.860175 0.926758 0.849576 0.915247 /home/runner/.local/share/plinder/2024-06/tuto... pred

597779 rows × 52 columns

Here we will use this table to get the PDB and chain IDs for apo structures corresponding to a given system ID.

print(links[
    (links.reference_system_id == "4agi__1__1.C__1.W") & (links.kind == "apo")
].id.to_list())
['4uou_B', '4uou_C', '4uou_D', '4uou_A']

The file locations of the linked structures can also be obtained. The directories are named after the values of the reference_system_id and id columns.

for file in plinder_system.linked_archive.glob("**/*.cif"):
    print(file)
/home/runner/.local/share/plinder/2024-06/tutorial/linked_structures/apo/4agi__1__1.C__1.W/4uou_C/superposed.cif
/home/runner/.local/share/plinder/2024-06/tutorial/linked_structures/apo/4agi__1__1.C__1.W/4uou_D/superposed.cif
/home/runner/.local/share/plinder/2024-06/tutorial/linked_structures/apo/4agi__1__1.C__1.W/4uou_A/superposed.cif
/home/runner/.local/share/plinder/2024-06/tutorial/linked_structures/apo/4agi__1__1.C__1.W/4uou_B/superposed.cif
/home/runner/.local/share/plinder/2024-06/tutorial/linked_structures/pred/4agi__1__1.C__1.W/Q4WW81_A/superposed.cif
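
Given this directory layout, the kind, the reference system and the linked structure can be recovered from a file path alone. The sketch below assumes the layout <kind>/<reference_system_id>/<id>/superposed.cif seen in the output above, and that id is the target_id followed by an underscore and the chain, as the table suggests. The helper parse_linked_path is hypothetical.

```python
from pathlib import PurePosixPath


def parse_linked_path(path):
    """Extract kind, reference system, target and chain from a linked file path."""
    parts = PurePosixPath(path).parts
    # The three directories directly above the file name.
    kind, reference_system_id, linked_id = parts[-4:-1]
    # Split on the last underscore only: "4uou_C" -> ("4uou", "C").
    target_id, chain = linked_id.rsplit("_", 1)
    return {
        "kind": kind,
        "reference_system_id": reference_system_id,
        "target_id": target_id,
        "chain": chain,
    }


root = "/home/runner/.local/share/plinder/2024-06/tutorial/linked_structures"
apo = parse_linked_path(f"{root}/apo/4agi__1__1.C__1.W/4uou_C/superposed.cif")
pred = parse_linked_path(f"{root}/pred/4agi__1__1.C__1.W/Q4WW81_A/superposed.cif")
print(apo)
print(pred)
```

For apo structures the target is a PDB ID, while for predicted structures it looks like a UniProt accession.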

Working with split data#

Get split table#

The split table sorts each PLINDER system into a cluster and defines the split it is part of. To access the splits, use get_split().

from plinder.core import get_split
split_df = get_split()
split_df
system_id uniqueness split cluster cluster_for_val_split system_pass_validation_criteria system_pass_statistics_criteria system_proper_num_ligand_chains system_proper_pocket_num_residues system_proper_num_interactions system_proper_ligand_max_molecular_weight system_has_binding_affinity system_has_apo_or_pred
0 101m__1__1.A__1.C_1.D 101m__A__C_D_c188899 train c14 c0 True True 1 27 20 616.177293 False False
1 102m__1__1.A__1.C 102m__A__C_c237197 train c14 c0 True True 1 26 20 616.177293 False True
2 103m__1__1.A__1.C_1.D 103m__A__C_D_c252759 train c14 c0 False True 1 26 16 616.177293 False False
3 104m__1__1.A__1.C_1.D 104m__A__C_D_c274687 train c14 c0 False True 1 27 21 616.177293 False False
4 105m__1__1.A__1.C_1.D 105m__A__C_D_c221688 train c14 c0 False True 1 28 20 616.177293 False False
... ... ... ... ... ... ... ... ... ... ... ... ... ...
409721 9xia__1__2.A_4.A__4.B_4.D 9xia__A_A__B_D_c20731 train c256 c126 False False 1 23 6 178.084124 False False
409722 9xim__1__1.A_1.B__1.E_1.F_1.G 9xim__A_B__E_F_G_c240203 train c256 c126 False False 1 21 6 150.052823 False False
409723 9xim__1__1.A_1.B__1.H_1.I_1.J 9xim__A_B__H_I_J_c313183 train c256 c126 False False 1 19 5 150.052823 False False
409724 9xim__1__1.C_1.D__1.K_1.L_1.M 9xim__C_D__K_L_M_c215891 train c256 c126 False False 1 20 3 150.052823 False False
409725 9xim__1__1.C_1.D__1.N_1.O_1.P 9xim__C_D__N_O_P_c219610 train c256 c126 False False 1 20 6 150.052823 False False

408695 rows × 13 columns

For example, this table can be used to get all system IDs that belong to the test split.

split_df[split_df.split == "test"].system_id.to_list()
['1b5d__1__1.A_1.B__1.D',
 '1s2g__1__1.A_2.C__1.D',
 '4agi__1__1.C__1.W',
 '4n7m__1__1.A_1.B__1.C',
 '7eek__1__1.A__1.I']
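
Several columns can be combined in the same way, for example to select training systems that pass the validation criteria and have a linked apo or predicted structure. The sketch below shows the logic on plain dictionaries, with values copied from the first three rows of the split table above; only the columns needed for the selection are included.

```python
# Values copied from the first rows of the split table shown above.
rows = [
    {"system_id": "101m__1__1.A__1.C_1.D", "split": "train",
     "system_pass_validation_criteria": True, "system_has_apo_or_pred": False},
    {"system_id": "102m__1__1.A__1.C", "split": "train",
     "system_pass_validation_criteria": True, "system_has_apo_or_pred": True},
    {"system_id": "103m__1__1.A__1.C_1.D", "split": "train",
     "system_pass_validation_criteria": False, "system_has_apo_or_pred": False},
]

# Keep rows where all three conditions hold.
selected = [
    row["system_id"]
    for row in rows
    if row["split"] == "train"
    and row["system_pass_validation_criteria"]
    and row["system_has_apo_or_pred"]
]
print(selected)
```

With the dataframe itself, the equivalent selection would combine boolean masks, along the lines of split_df[(split_df.split == "train") & split_df.system_pass_validation_criteria & split_df.system_has_apo_or_pred].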