The PLINDEX#

import matplotlib.pyplot as plt
import mols2grid
import numpy as np

Querying and filtering the index#

Your main entry point to the dataset is the annotations table or the index, which is a Parquet file containing all the annotations for each system in the dataset.

Note

The entire file has 745 columns, >1.3M rows, and takes 24G of RAM to load into memory :::

We provide a query_index function to access the index and filter it based on the columns and criteria you need.

from plinder.core.scores import query_index

The core of the PLINDER dataset is a collection of Protein-Ligand Interaction (PLI) systems extracted from the Protein Data Bank (PDB). The curation process in a nutshell is as follows:

  1. For every PDB entry, we generate all available biological assemblies (biounits).

  2. For each biounit, we identify all ligands and all protein chains within 6 Å of any ligand.

  3. Ligands within 4 Å of each other are merged into a single PLI system.

  4. For each system, we provide a range of detailed annotations and files to facilitate in-depth analysis and to enable a variety of use-cases.

Thus, a PLI system is defined as a collection of protein and ligand chains that are in close proximity to each other within a given biounit of a given PDB entry. The pocket of a system is defined as the set of protein residues within 6 Å of the ligands in the system.

Let’s look at all the columns that define a system:

plindex = query_index(
    columns=[
        "system_id",
        "entry_pdb_id",
        "system_biounit_id",
        "system_protein_chains_asym_id",
        "system_ligand_chains_asym_id",
        "ligand_instance_chain",
    ]
)
plindex.head(20)
system_id entry_pdb_id system_biounit_id system_protein_chains_asym_id system_ligand_chains_asym_id ligand_instance_chain split
0 3grt__1__1.A_2.A__1.B 3grt 1 [1.A, 2.A] [1.B] 1.B train
1 3grt__1__1.A_2.A__1.C 3grt 1 [1.A, 2.A] [1.C] 1.C train
2 3grt__1__1.A_2.A__2.B 3grt 1 [1.A, 2.A] [2.B] 2.B train
3 3grt__1__1.A_2.A__2.C 3grt 1 [1.A, 2.A] [2.C] 2.C train
4 1grx__1__1.A__1.B 1grx 1 [1.A] [1.B] 1.B train
5 6grf__1__1.A__1.D 6grf 1 [1.A] [1.D] 1.D train
6 6grf__2__1.B__1.E 6grf 2 [1.B] [1.E] 1.E train
7 3grj__2__1.B__1.I_1.L_1.Q 3grj 2 [1.B] [1.I, 1.L, 1.Q] 1.I train
8 3grj__2__1.B__1.I_1.L_1.Q 3grj 2 [1.B] [1.I, 1.L, 1.Q] 1.L train
9 3grj__2__1.B__1.I_1.L_1.Q 3grj 2 [1.B] [1.I, 1.L, 1.Q] 1.Q train
10 3grj__1__1.A__1.F 3grj 1 [1.A] [1.F] 1.F train
11 3grj__1__1.A__1.H 3grj 1 [1.A] [1.H] 1.H train
12 8grn__1__1.A__1.C 8grn 1 [1.A] [1.C] 1.C train
13 8grn__1__1.B__1.D 8grn 1 [1.B] [1.D] 1.D train
14 4grc__1__1.A__1.C_1.D 4grc 1 [1.A] [1.C, 1.D] 1.C train
15 4grc__1__1.A__1.C_1.D 4grc 1 [1.A] [1.C, 1.D] 1.D train
16 4grc__1__2.A__2.C_2.D 4grc 1 [2.A] [2.C, 2.D] 2.C train
17 4grc__1__2.A__2.C_2.D 4grc 1 [2.A] [2.C, 2.D] 2.D train
18 4grr__1__1.B_1.C__1.K 4grr 1 [1.B, 1.C] [1.K] 1.K train
19 2grt__1__1.A_2.A__1.B 2grt 1 [1.A, 2.A] [1.B] 1.B train

Thus, a system is uniquely qualified by its system_id which is a combination of

  • entry_pdb_id - the PDB identifier

  • system_biounit_id - the biological assembly identifier

  • system_protein_chains_asym_id - The list of protein chains in the system defined by their <instance>.<label_asym_id>

  • system_ligand_chains_asym_id - The list of ligand chains in the system defined by their <instance>.<label_asym_id>

Systems containing multiple ligands (e.g 4grc__1__1.A__1.C_1.D) span multiple rows, where each row represents a different ligand in the system (as seen in the ligand_instance_chain column).

print(f"Number of ligand chains: {plindex.shape[0]}")
print(f"Number of systems: {plindex.system_id.nunique()}")
print(
    f"Number of biounits: {plindex[['entry_pdb_id', 'system_biounit_id']].drop_duplicates().shape[0]}"
)
print(f"Number of PDB IDs: {plindex.entry_pdb_id.nunique()}")
Number of ligand chains: 419538
Number of systems: 309972
Number of biounits: 108206
Number of PDB IDs: 77474

While query_index by default loads all the systems in the train and val splits, not every returned system may be useful for training your model.

Single-ligand single-protein predictors#

As an example, we consider the case of training single-ligand single-protein models. One way to filter training data is as follows:

plindex_single = query_index(
    filters=[
        ("system_num_ligand_chains", "==", 1),
        ("system_num_protein_chains", "==", 1),
    ],
    splits=["train"],
)
plindex_single.system_id.nunique()
152822

However, as PLINDER also considers ions and crystallization artifacts as ligands if they are within 4 Å of a non-ion non-artifact ligand, there are also systems in PLINDER which only have one “proper” ligand. So, another strategy would be to only train on the proper ligands and ignore the ions and artifacts in the system.

plindex_single_proper = query_index(
    filters=[
        ("system_proper_num_ligand_chains", "==", 1),
        ("system_num_protein_chains", "==", 1),
        ("ligand_is_proper", "==", True),  # filters out all other ligands in the system
    ],
    splits=["train"],
)
plindex_single_proper.system_id.nunique()
182360

This can provide up to 20% more data for training, however, the caveat is that some of the interactions made by artifacts or ions may influence the binding pose of the “proper” ligand.

One could also choose to include multi-ligand systems but only train with one ligand at a time, and the same for multi-protein. These choices are up to the user and we provide the annotations to enable such choices.

Annotations#

There are 745 columns in the full index. Of course, not every one will be useful for your use-case so we’ll go through some common categories of them and some use-cases requiring different annotations. You can get the full list of columns with descriptions in the index docs.

Structure quality#

A core principle of PLINDER is to be able to annotate which systems are of high enough experimental structure quality to be reliably used as the ground truth for measuring model performance. As the quality of experimentally resolved structures can vary significantly and many crystal structures with ligands contain missing atoms or missing residues in the binding site, comparing prediction results to lower quality structures can incorrectly skew the perception of their performance.

quality

You can access all the crystal structure validation information extracted from PDB validation reports as well as crystal contact annotations by looking for columns starting with “entry_validation”, “system_pocket_validation”, “system_ligand_validation” etc.

For simplicity, we have “system_pass_validation_criteria” as a column that can be used to filter systems which pass our quality definitions:

class QualityCriteria:
    max_entry_resolution: float = 3.5
    max_entry_r: float = 0.4
    max_entry_rfree: float = 0.45
    max_entry_r_minus_rfree: float = 0.05
    ligand_max_num_unresolved_heavy_atoms: int = 0  # except for covalent
    ligand_max_alt_count: int = 1  # misnomer: this counts number of total configurations
    ligand_min_average_occupancy: float = 0.8
    ligand_min_average_rscc: float = 0.8
    ligand_max_average_rsr: float = 0.3
    ligand_max_percent_outliers_clashes: float = 0
    ligand_max_fraction_atoms_with_crystal_contacts: float = 0
    pocket_max_num_unresolved_heavy_atoms: int = 0
    pocket_max_alt_count: int = 1  # same as above
    pocket_min_average_occupancy: float = 0.8
    pocket_min_average_rscc: float = 0.8
    pocket_max_average_rsr: float = 0.3
    pocket_max_percent_outliers_clashes: int = 100
plindex = query_index(
    columns=["system_id", "system_pass_validation_criteria"],
    splits=["train", "val", "test"],
)
data = (
    plindex.drop_duplicates("system_id")
    .sort_values(by="system_pass_validation_criteria")
    .groupby(["split", "system_pass_validation_criteria"])
    .system_id.count()
    .unstack()
)
data_percentage = data.div(data.sum(axis=1), axis=0) * 100
ax = data_percentage.plot(
    kind="bar", stacked=True, figsize=(5, 3), color=["firebrick", "darkseagreen"]
)
ax.set_xticklabels(
    [label.get_text().upper() for label in ax.get_xticklabels()], rotation=0
)
ax.set_xlabel("")
ax.set_ylabel("% passing criteria")
ax.get_legend().remove()
for container, count_data in zip(ax.containers, data.values.T):
    ax.bar_label(container, labels=count_data, label_type="edge")
../_images/3ec758abdafd38ced6452107a4766f65871b0118d03ea0ab8e2fde80192fb113.png

Structure completeness#

Related to structure quality, one aspect that is important to consider when using these structures in your training is completeness. While the inputs for prediction may be a protein sequence and a SMILES string, the protein-ligand complex structure that you get from the PDB may not have all the residues in the protein or all the atoms in the ligand resolved.

plindex = query_index(
    columns=[
        "system_id",
        "ligand_num_unresolved_heavy_atoms",
        "ligand_num_heavy_atoms",
        "system_protein_chains_num_unresolved_residues",
        "system_pocket_validation_num_unresolved_heavy_atoms",
        "system_protein_chains_total_length",
    ],
    splits=["train", "val", "test"],
)

Here we calculate the fraction of unresolved residues in the protein chains and the fraction of unresolved heavy atoms in the ligand, to see their distributions across the splits.

plindex[
    "system_protein_chains_total_num_unresolved_residues"
] = plindex.system_protein_chains_num_unresolved_residues.map(sum)
plindex["system_protein_chains_fraction_unresolved_residues"] = (
    plindex.system_protein_chains_total_num_unresolved_residues
    / plindex.system_protein_chains_total_length
)
plindex["ligand_fraction_unresolved_heavy_atoms"] = (
    plindex.ligand_num_unresolved_heavy_atoms / plindex.ligand_num_heavy_atoms
)
fig, axes = plt.subplots(1, 3, figsize=(10, 3), sharex=True, sharey=True)
grouped_data = plindex.drop_duplicates("system_id").groupby("split")
split_colors = {
    "train": "#ff9999",
    "test": "#66b3ff",
    "val": "#7cc27c",
}
for i, (split, data) in enumerate(grouped_data):
    data.system_protein_chains_fraction_unresolved_residues.hist(
        ax=axes[i], density=True, color=split_colors[split]
    )
    axes[i].set_title(split.upper())
    if i == 0:
        axes[i].set_ylabel("Density")
fig.suptitle(
    "Fraction of Unresolved Protein Residues (SEQRES length - no. resolved residues)"
)
plt.tight_layout()
../_images/a965c26e6b48a3635cf1a6a7ec06f6618ac8f018bca63f0781e1fca0f2158f38.png
print("Percentage of ligands with no unresolved heavy atoms:")
for split in ["train", "val", "test"]:
    print(
        f'{split.capitalize()}: {100 * (plindex[plindex["split"] == split].ligand_fraction_unresolved_heavy_atoms == 0).sum() / plindex[plindex["split"] == split].shape[0]:.2f}%'
    )
Percentage of ligands with no unresolved heavy atoms:
Train: 82.34%
Val: 90.75%
Test: 93.80%

Pocket domains#

We annotated domains from different databases onto the protein chains of each system and then picked the one spanning the pocket residues of the system as the domain of the system pocket.

plindex = query_index(
    columns=["system_id", "system_pocket_ECOD_t_name", "system_pocket_kinase_name"],
)
plindex.drop_duplicates("system_id").system_pocket_ECOD_t_name.value_counts().head(10)
system_pocket_ECOD_t_name
Nucleoplasmin-like/VP (viral coat and capsid proteins)         14051
P-loop containing nucleoside triphosphate hydrolases           13463
Protein kinase                                                 11203
TIM barrels                                                     9384
Immunoglobulin/Fibronectin type III/E set domains/PapD-like     9293
NAD(P)-binding Rossmann-fold domains                            9213
Globin-like                                                     7218
Ribosomal protein L31e/gp120 outer domain                       7141
S2 subunit of coronavirus spike glycoprotein                    6701
Viral protein domain                                            6467
Name: count, dtype: int64
plindex.drop_duplicates("system_id").system_pocket_kinase_name.value_counts().head(10)
system_pocket_kinase_name
CDK2        494
p38a        413
EGFR        360
CK2a1       349
AurA        278
PIM1        211
ALK2        208
JAK2        191
IRAK4       186
MAPKAPK2    183
Name: count, dtype: int64

Ligand properties#

Molecular properties and annotations are calculated from the ligand SMILES strings

properties = [
    "ligand_molecular_weight",
    "ligand_num_rot_bonds",
    "ligand_num_rings",
    "ligand_num_hbd",
    "ligand_num_hba",
    "ligand_num_heavy_atoms",
    "ligand_crippen_clogp",
    "ligand_qed",
    "ligand_tpsa",
    "ligand_is_kinase_inhibitor",
]
plindex = query_index(
    columns=["system_id", "ligand_instance_chain", "ligand_unique_ccd_code"]
    + properties,
    splits=["train", "val", "test"],
    filters=[
        ("ligand_is_proper", "==", True)  # focusing on non-ion, non-artifact ligands
    ],
)
plindex.head(20)
system_id ligand_instance_chain ligand_unique_ccd_code ligand_molecular_weight ligand_num_rot_bonds ligand_num_rings ligand_num_hbd ligand_num_hba ligand_num_heavy_atoms ligand_crippen_clogp ligand_qed ligand_tpsa ligand_is_kinase_inhibitor split
0 3grt__1__1.A_2.A__1.B 1.B FAD 785.157135 13 6 9 21 53 -2.42396 0.044261 362.93 False train
1 3grt__1__1.A_2.A__1.C 1.C TS2 721.288731 10 1 11 13 48 -4.04010 0.095714 313.27 False train
2 3grt__1__1.A_2.A__2.B 2.B FAD 785.157135 13 6 9 21 53 -2.42396 0.044261 362.93 False train
3 3grt__1__1.A_2.A__2.C 2.C TS2 721.288731 10 1 11 13 48 -4.04010 0.095714 313.27 False train
4 1grx__1__1.A__1.B 1.B GSH 307.083806 9 0 6 6 20 -2.20610 0.263437 158.82 True train
5 6grf__1__1.A__1.D 1.D HSR 221.089937 2 1 5 6 15 -3.07760 0.337239 119.25 False train
6 6grf__2__1.B__1.E 1.E HSR 221.089937 2 1 5 6 15 -3.07760 0.337239 119.25 False train
7 3grj__2__1.B__1.I_1.L_1.Q 1.L G14 188.058577 2 2 2 2 14 1.77490 0.754340 65.98 False train
8 3grj__1__1.A__1.F 1.F G14 188.058577 2 2 2 2 14 1.77490 0.754340 65.98 False train
9 3grj__1__1.A__1.H 1.H G14 188.058577 2 2 2 2 14 1.77490 0.754340 65.98 False train
10 8grn__1__1.A__1.C 1.C LPC 468.308466 21 0 2 6 31 4.43140 0.111161 102.29 False train
11 8grn__1__1.B__1.D 1.D LPC 468.308466 21 0 2 6 31 4.43140 0.111161 102.29 False train
12 4grc__1__1.A__1.C_1.D 1.C HEM 616.177293 8 6 2 6 43 4.73854 0.416598 96.26 False train
13 4grc__1__2.A__2.C_2.D 2.C HEM 616.177293 8 6 2 6 43 4.73854 0.416598 96.26 False train
14 4grr__1__1.B_1.C__1.K 1.K AVR 231.137162 3 2 1 4 17 1.98760 0.821636 60.39 False train
15 2grt__1__1.A_2.A__1.B 1.B FAD 785.157135 13 6 9 21 53 -2.42396 0.044261 362.93 False train
16 2grt__1__1.A_2.A__1.C 1.C GDS 612.151962 21 0 10 12 40 -3.87680 0.043742 317.64 False train
17 2grt__1__1.A_2.A__2.B 2.B FAD 785.157135 13 6 9 21 53 -2.42396 0.044261 362.93 False train
18 2grt__1__1.A_2.A__2.C 2.C GDS 612.151962 21 0 10 12 40 -3.87680 0.043742 317.64 False train
19 7grp__1__1.A__1.C 1.C VXQ 149.084064 1 2 1 2 11 1.08020 0.648103 35.25 False train
binwidths = {
    "ligand_molecular_weight": ("Molecular weight", 50),
    "ligand_num_heavy_atoms": ("Number of heavy atoms", 5),
    "ligand_num_rot_bonds": ("Number of rotatable bonds", 2),
    "ligand_num_rings": ("Number of rings", 1),
    "ligand_tpsa": ("Topological polar surface area", 10),
    "ligand_crippen_clogp": ("Crippen logP", 1),
}
fig, axes = plt.subplots(3, 2, figsize=(10, 7))
axes = axes.flatten()
for i, prop in enumerate(binwidths):
    ax = axes[i]
    plindex.groupby("split")[prop].plot(
        kind="hist",
        density=True,
        alpha=0.5,
        histtype="stepfilled",
        legend=True,
        ax=ax,
        color=split_colors,
        bins=np.arange(
            min(plindex[prop]),
            max(plindex[prop]) + binwidths[prop][1],
            binwidths[prop][1],
        ),
    )
    ax.set_xlabel(binwidths[prop][0])
    ax.set_ylabel("Density")
plt.tight_layout()
plt.show()
../_images/76687e8baa9e997f2415bf58e7334adb98430a159e2900d44a68fa1934a17738.png

These properties were further used to categorize the ligands into different types.

ligand_types = [
    f"ligand_is_{x}"
    for x in [
        "lipinski",
        "covalent",
        "cofactor",
        "oligo",
        "ion",
        "fragment",
        "artifact",
    ]
]

These categories were also aggregated to the system level for easier filtering, with system_has_<category> as a column indicating whether any ligand in the system is of that category.

plindex = query_index(
    columns=["system_id", "ligand_unique_ccd_code"] + ligand_types,
    splits=["train", "val", "test"],
)
labels = [c.replace("ligand_is_", "").capitalize() for c in ligand_types]
bar_colors = plt.cm.Pastel2.colors
split_names = ["train", "val", "test"]
fig, axes = plt.subplots(1, 3, figsize=(10, 3))
for i, split in enumerate(split_names):
    ax = axes[i]
    split_data = plindex[plindex["split"] == split]
    bars = ax.bar(
        labels,
        split_data[ligand_types].mean().mul(100).to_list(),
        width=1,
        color=bar_colors,
        edgecolor="black",
        label=split,
        linewidth=1,
    )
    ax.set_xticks(np.arange(len(labels)), labels, rotation=70)
    ax.set_ylim(0, 100)
    ax.set_title(split.upper())
    counts = split_data[ligand_types].sum().to_list()
    for bar, count in zip(bars, counts):
        ax.text(
            bar.get_x() + bar.get_width() / 2.0,
            bar.get_height() + 2,
            f"{count}",
            ha="center",
            va="bottom",
            rotation=70,
            fontsize=10,
        )
../_images/35d3c4d24df0df6ffb5caf37fbc9fade04e40b7bbdea87a34190e4916343eebb.png

PLI-specific properties#

There are also some ligand properties that are specific to the interaction with the particular protein pocket present in the system. These include things like:

  • The kinds of interactions the ligand may have with the protein, calculated using PLIP

  • Experimental binding affinity, pulled from BindingDB when available

  • Potential ligand-protein crystal contacts, defined as ligand-protein contacts below 5 Å which are not in the same asymmetric unit (symmetry mates) and not in the system biounit

pli_specific = [
    "system_fraction_atoms_with_crystal_contacts",
    "system_num_crystal_contacted_residues",
    "ligand_binding_affinity",
    "system_has_binding_affinity",
    "ligand_interactions",
    "system_num_interactions",
]
plindex = query_index(
    columns=["system_id"] + pli_specific,
    splits=["train", "val", "test"],
)

You could filter out systems with crystal contacts:

(
    plindex.drop_duplicates("system_id").system_fraction_atoms_with_crystal_contacts > 0
).sum()
34467

If your model has an additional component for predicting binding affinity, you could see how much data is available for training and evaluation:

plindex.drop_duplicates("system_id").groupby(
    "split"
).system_has_binding_affinity.value_counts()
split  system_has_binding_affinity
test   False                             862
       True                              174
train  False                          261204
       True                            47936
val    False                             685
       True                              147
Name: count, dtype: int64
plindex["ligand_binding_affinity"].hist()
plt.xlabel(r"$pK_i$ or $pK_D$")
plt.ylabel("No. of system ligands")
Text(0, 0.5, 'No. of system ligands')
../_images/1fd3a63764c5256bc8790b25e6bfd94ffd251e4dd5aae0edba396e8eb7e44017.png

You could even look deeper into the kinds of protein-ligand interactions present in your training data:

plindex.groupby("split").system_num_interactions.describe()
count mean std min 25% 50% 75% max
split
test 1436.0 16.591226 9.886007 3.0 9.0 15.0 22.0 51.0
train 418381.0 14.448139 12.087511 0.0 5.0 12.0 21.0 722.0
val 1157.0 16.078652 10.204647 0.0 8.0 14.0 22.0 49.0
plindex["ligand_interactions"].values[0]
array(['1.A_36_type:hydrogen_bonds__protisdon:False__sidechain:False',
       '1.A_113_type:hydrogen_bonds__protisdon:False__sidechain:False',
       '1.A_113_type:hydrogen_bonds__protisdon:True__sidechain:False',
       '1.A_312_type:hydrogen_bonds__protisdon:False__sidechain:False',
       '1.A_320_type:hydrogen_bonds__protisdon:False__sidechain:False',
       '1.A_322_type:hydrogen_bonds__protisdon:False__sidechain:True',
       '1.A_322_type:hydrogen_bonds__protisdon:True__sidechain:False',
       '1.A_40_type:hydrogen_bonds__protisdon:True__sidechain:False',
       '1.A_40_type:hydrophobic_contacts',
       '1.A_13_type:hydrogen_bonds__protisdon:True__sidechain:False',
       '1.A_13_type:hydrogen_bonds__protisdon:True__sidechain:True',
       '1.A_11_type:hydrogen_bonds__protisdon:True__sidechain:False',
       '1.A_35_type:hydrogen_bonds__protisdon:True__sidechain:True',
       '1.A_34_type:hydrogen_bonds__protisdon:True__sidechain:False',
       '1.A_49_type:hydrogen_bonds__protisdon:True__sidechain:True',
       '1.A_180_type:hydrogen_bonds__protisdon:True__sidechain:True',
       '1.A_180_type:pi_stacks__stack_type:T',
       '1.A_41_type:hydrogen_bonds__protisdon:True__sidechain:False',
       '1.A_14_type:hydrogen_bonds__protisdon:True__sidechain:False',
       '1.A_15_type:hydrogen_bonds__protisdon:True__sidechain:False',
       '1.A_140_type:hydrogen_bonds__protisdon:True__sidechain:False',
       '1.A_314_type:hydrogen_bonds__protisdon:True__sidechain:False',
       '2.A_450_type:hydrogen_bonds__protisdon:False__sidechain:False'],
      dtype=object)

Beyond the split#

Here we mainly focused on systems in the train/val/test splits as the train and val splits are the ones that can be used for training models to either participate in the PLINDER leaderboard or compare your models’ performance to methods in the leaderboard.

However, PLINDER itself contains all PLI systems in the PDB (except those containing only crystallization artifacts), and you may be interested in this dataset as a whole. This can be queried with the splits=["*"] option.

plindex = query_index(
    columns=[
        "system_id",
        "entry_pdb_id",
        "system_num_protein_chains",
        "system_num_ligand_chains",
    ]
    + ligand_types,
    splits=["*"],
)
print("No. ligands:", len(plindex))
print("No. systems:", plindex.system_id.nunique())
print("No. PDB entries:", plindex.entry_pdb_id.nunique())
No. ligands: 1357906
No. systems: 990260
No. PDB entries: 143800
fig, ax = plt.subplots(figsize=(5, 3))
bars = ax.bar(
    labels,
    plindex[ligand_types].mean().mul(100).to_list(),
    width=1,
    color=bar_colors,
    edgecolor="black",
    label=split,
    linewidth=1,
)
ax.set_xticks(np.arange(len(labels)), labels, rotation=70)
ax.set_ylim(0, 100)
ax.set_title(f"No. PLINDER ligands: {len(plindex)}")
counts = plindex[ligand_types].sum().to_list()
for bar, count in zip(bars, counts):
    ax.text(
        bar.get_x() + bar.get_width() / 2.0,
        bar.get_height() + 2,
        f"{count}",
        ha="center",
        va="bottom",
        rotation=70,
        fontsize=10,
    )

print(plindex.system_num_protein_chains.value_counts())
print(plindex.system_num_ligand_chains.value_counts())
system_num_protein_chains
1     857629
2     312129
3      73237
4      19875
30     14122
       ...  
72        72
64        64
53        61
52        52
29        30
Name: count, Length: 74, dtype: int64
system_num_ligand_chains
1     809509
2     248916
3     107613
4      45856
5      17065
       ...  
84        84
39        78
74        74
67        67
46        46
Name: count, Length: 238, dtype: int64
../_images/3861b7b831ef060f4df345d8781465764ad1ea95c39dbb2a6bfc77530c688df7.png

When looking at distributions across the entire dataset, you’ll notice that there are quite a lot of ion systems, as well as systems with more than 5 protein or ligand chains.

Note

While we provide annotations for all systems, those which are not in the train and val splits may not be used for training models to participate in the MLSB PLINDER leaderboard. :::

Similarity clusters#

Similarity between two protein-ligand complexes can occur at various levels, including protein sequence, structural features, binding pocket characteristics, or ligand and interaction properties. We calculated a comprehensive set of similarity metrics to cover every combination of these dimensions — from identical systems, where the protein, pocket, interactions and ligand are the same, to systems that differ across all levels.

The entire set of similarity metrics is described here, and they were calculated across all pairs of systems having a Foldseek or MMseqs alignment. We then used graph clustering to group systems into clusters based on their similarity with a specific metric and threshold. There are three types of clusters available: strongly connected graph components (strong component), weakly connected graph components (weak component), and communities detected using asynchronous label propagation (community).

We can load the similarity clusters assigned to each system for different metrics, thresholds, and clustering types.

These cluster labels can be used in many ways with the annotations.

Example: ATP-binding pockets#

For example, maybe you are interested in seeing how many different kinds of ATP-binding pockets we have. First, let’s find all the analogs of ATP. We can do this by finding the 95% ECFP4 tanimoto similarity component of ATP and then getting the CCD codes of all the ligands in the same cluster:

ligand_cluster_column = "tanimoto_similarity_max__95__strong__component"
atp_ligand_cluster = query_index(
    columns=[
        ligand_cluster_column,
    ],
    filters=[
        ("ligand_unique_ccd_code", "==", "ATP"),
        ("system_num_ligand_chains", "==", 1),
    ],
)[ligand_cluster_column].values[0]

atp_analogs = query_index(
    columns=[
        "system_id",
        "entry_pdb_id",
        "ligand_unique_ccd_code",
        "ligand_rdkit_canonical_smiles",
    ],
    filters=[
        (ligand_cluster_column, "==", atp_ligand_cluster),
        ("system_num_ligand_chains", "==", 1),
    ],
).drop_duplicates("ligand_unique_ccd_code")
atp_analogs_code_set = set(atp_analogs.ligand_unique_ccd_code.unique())
atp_analogs_code_set
{'5FA', 'ADP', 'AQP', 'ATP', 'ZF9', 'ZSF'}

Let’s see how they look

grid = mols2grid.MolGrid(atp_analogs, smiles_col="ligand_rdkit_canonical_smiles")
grid.display(subset=["ligand_unique_ccd_code", "img"])

Now we can get the pocket clusters of all systems containing ATP analogs. Here we’re using pocket_qcov__50__weak__component, meaning a system within a cluster has some other system in that cluster with which it shares at least 50% of pocket residues when aligned.

pocket_cluster_column = "pocket_qcov__50__weak__component"
plindex = query_index(
    columns=[
        "system_id",
        "entry_pdb_id",
        "entry_release_date",
        pocket_cluster_column,
        "system_pocket_ECOD_t_name",
    ],
    filters=[
        (
            "ligand_unique_ccd_code",
            "in",
            atp_analogs_code_set,
        ),  # only ATP-binding systems
        ("ligand_num_interactions", ">", 2),  # with >2 interactions with ATP
        ("system_num_ligand_chains", "==", 1),  # and ATP is the only ligand
    ],
)
(
    "No. systems:",
    plindex.system_id.nunique(),
    "No. PDB entries:",
    plindex.entry_pdb_id.nunique(),
    "No. clusters:",
    plindex[pocket_cluster_column].nunique(),
)
('No. systems:', 5409, 'No. PDB entries:', 1484, 'No. clusters:', 154)
plindex[pocket_cluster_column].value_counts().head(10)
pocket_qcov__50__weak__component
c0      3562
c123     260
c11      259
c206     146
c204     141
c132      80
c391      75
c95       73
c205      58
c319      58
Name: count, dtype: int64

When we look at the ECOD topology names of the cluster “c11” for example, we see that these systems are GroEL equatorial domain-like domains. Interestingly, 7 systems from 2 different PDB entries don’t have ECOD annotations yet as they were recently released, but are indeed the same domain.

plindex[plindex[pocket_cluster_column] == "c11"].system_pocket_ECOD_t_name.value_counts(
    dropna=False
)
system_pocket_ECOD_t_name
GroEL equatorial domain-like    252
None                              7
Name: count, dtype: int64
plindex[
    (plindex[pocket_cluster_column] == "c11")
    & (plindex["system_pocket_ECOD_t_name"].isna())
].drop_duplicates("entry_pdb_id")
system_id entry_pdb_id entry_release_date pocket_qcov__50__weak__component system_pocket_ECOD_t_name split
527 8i6j__1__1.B__1.C 8i6j 2023-01-28 c11 None train
2756 8hki__1__1.I__1.Q 8hki 2022-11-27 c11 None train
plindex[plindex["system_pocket_ECOD_t_name"].isna()]["entry_pdb_id"].nunique()
290

Example: Diversity sampling#

The pli_unique_qcov__50__community column clusters systems such that the protein and ligand make similar interactions with the pocket. This clustering combines protein sequence and structural similarity (needed to obtain the pocket alignment), as well as the ligand-pocket interactions, making it a good proxy for unique kinds of binding present in the dataset.

Here’s an example of how one might use torch.utils.data.WeightedRandomSampler to sample training systems based on their PLI community cluster.

from torch.utils.data import WeightedRandomSampler

cluster_column = "pli_unique_qcov__50__community"

# Get train systems and their cluster labels
plindex = query_index(
    columns=["system_id", "entry_pdb_id", cluster_column], splits=["train"]
).drop_duplicates("system_id")

# Add the number of systems in each cluster to the dataframe
plindex = plindex.merge(
    plindex[cluster_column].value_counts().rename("cluster_num_systems"),
    left_on=cluster_column,
    right_index=True,
).reset_index(drop=True)

# Add the number of PDB entries in each cluster to the dataframe
plindex = plindex.merge(
    plindex.drop_duplicates("entry_pdb_id")[cluster_column]
    .value_counts()
    .rename("cluster_num_entries"),
    left_on=cluster_column,
    right_index=True,
).reset_index(drop=True)

# Ignore clusters with only one PDB entry
sample_from = plindex[plindex["cluster_num_entries"] > 1].reset_index(drop=True)

# Calculate the weight for each cluster
cluster_weights = 1.0 / sample_from.cluster_num_systems.values

# Create a WeightedRandomSampler and sample systems from the train set
sampler = WeightedRandomSampler(
    weights=cluster_weights, num_samples=len(cluster_weights)
)
sampled_indices = list(sampler)
sampled_plindex = (
    sample_from.iloc[sampled_indices]
    .reset_index(drop=True)
    .drop(columns=["cluster_num_systems", "cluster_num_entries"])
)

# Add the number of sampled systems in each cluster to the sampled dataframe
sampled_plindex = sampled_plindex.merge(
    sampled_plindex[cluster_column].value_counts().rename("cluster_num_systems"),
    left_on=cluster_column,
    right_index=True,
).reset_index(drop=True)

# Add the number of sampled PDB entries in each cluster to the sampled dataframe
sampled_plindex = sampled_plindex.merge(
    sampled_plindex.drop_duplicates("entry_pdb_id")[cluster_column]
    .value_counts()
    .rename("cluster_num_entries"),
    left_on=cluster_column,
    right_index=True,
)
print("No. of original systems: ", sample_from.system_id.nunique())
print("No. of nonredundant sampled systems: ", sampled_plindex.system_id.nunique())
print("No. of original clusters: ", sample_from[cluster_column].nunique())
print("No. of sampled clusters: ", sampled_plindex[cluster_column].nunique())
No. of original systems:  259772
No. of nonredundant sampled systems:  57627
No. of original clusters:  2723
No. of sampled clusters:  2721

Let’s see how the sampling process has affected the distribution of cluster sizes.

fig, ax = plt.subplots(1, 2, figsize=(10, 3))

cluster_sizes = (
    sample_from.drop_duplicates(cluster_column)
    .cluster_num_systems.value_counts()
    .sort_index()
)
ax[0].hist(cluster_sizes.index, weights=cluster_sizes.values, bins=20, log=True)
ax[0].set_xlabel("Cluster Size")
ax[0].set_ylabel("# Clusters (log scale)")
ax[0].grid(True, which="both", ls="-", alpha=0.2)

cluster_sizes = (
    sampled_plindex.drop_duplicates(cluster_column)
    .cluster_num_systems.value_counts()
    .sort_index()
)
ax[1].hist(cluster_sizes.index, weights=cluster_sizes.values, bins=20, log=False)
ax[1].set_xlabel("Cluster Size")
ax[1].set_ylabel("# Clusters")
Text(0, 0.5, '# Clusters')
../_images/9080345b2e6d830845256b4a14a3b29ef21c13219991a09aa1eeea278821792c.png