The PLINDEX#

import matplotlib.pyplot as plt
import mols2grid
import numpy as np

Querying and filtering the index#

Your main entry point to the dataset is the annotations table, also referred to as the index: a Parquet file containing all the annotations for each system in the dataset.

Note

The entire file has 745 columns and >1.3M rows, and takes about 24 GB of RAM to load into memory.

We provide a query_index function to access the index and filter it based on the columns and criteria you need.

from plinder.core.scores import query_index

The core of the PLINDER dataset is a collection of Protein-Ligand Interaction (PLI) systems extracted from the Protein Data Bank (PDB). The curation process in a nutshell is as follows:

  1. For every PDB entry, we generate all available biological assemblies (biounits).

  2. For each biounit, we identify all ligands and all protein chains within 6 Å of any ligand.

  3. Ligands within 4 Å of each other are merged into a single PLI system.

  4. For each system, we provide a range of detailed annotations and files to facilitate in-depth analysis and to enable a variety of use-cases.

Thus, a PLI system is defined as a collection of protein and ligand chains that are in close proximity to each other within a given biounit of a given PDB entry. The pocket of a system is defined as the set of protein residues within 6 Å of the ligands in the system.

Let’s look at all the columns that define a system:

plindex = query_index(
    columns=[
        "system_id",
        "entry_pdb_id",
        "system_biounit_id",
        "system_protein_chains_asym_id",
        "system_ligand_chains_asym_id",
        "ligand_instance_chain",
    ]
)
plindex.head(20)
system_id entry_pdb_id system_biounit_id system_protein_chains_asym_id system_ligand_chains_asym_id ligand_instance_chain split
0 3grt__1__1.A_2.A__1.B 3grt 1 [1.A, 2.A] [1.B] 1.B train
1 3grt__1__1.A_2.A__1.C 3grt 1 [1.A, 2.A] [1.C] 1.C train
2 3grt__1__1.A_2.A__2.B 3grt 1 [1.A, 2.A] [2.B] 2.B train
3 3grt__1__1.A_2.A__2.C 3grt 1 [1.A, 2.A] [2.C] 2.C train
4 1grx__1__1.A__1.B 1grx 1 [1.A] [1.B] 1.B train
5 6grf__1__1.A__1.D 6grf 1 [1.A] [1.D] 1.D train
6 6grf__2__1.B__1.E 6grf 2 [1.B] [1.E] 1.E train
7 3grj__2__1.B__1.I_1.L_1.Q 3grj 2 [1.B] [1.I, 1.L, 1.Q] 1.I train
8 3grj__2__1.B__1.I_1.L_1.Q 3grj 2 [1.B] [1.I, 1.L, 1.Q] 1.L train
9 3grj__2__1.B__1.I_1.L_1.Q 3grj 2 [1.B] [1.I, 1.L, 1.Q] 1.Q train
10 3grj__1__1.A__1.F 3grj 1 [1.A] [1.F] 1.F train
11 3grj__1__1.A__1.H 3grj 1 [1.A] [1.H] 1.H train
12 8grn__1__1.A__1.C 8grn 1 [1.A] [1.C] 1.C train
13 8grn__1__1.B__1.D 8grn 1 [1.B] [1.D] 1.D train
14 4grc__1__1.A__1.C_1.D 4grc 1 [1.A] [1.C, 1.D] 1.C train
15 4grc__1__1.A__1.C_1.D 4grc 1 [1.A] [1.C, 1.D] 1.D train
16 4grc__1__2.A__2.C_2.D 4grc 1 [2.A] [2.C, 2.D] 2.C train
17 4grc__1__2.A__2.C_2.D 4grc 1 [2.A] [2.C, 2.D] 2.D train
18 4grr__1__1.B_1.C__1.K 4grr 1 [1.B, 1.C] [1.K] 1.K train
19 2grt__1__1.A_2.A__1.B 2grt 1 [1.A, 2.A] [1.B] 1.B train

Thus, a system is uniquely identified by its system_id, which is a combination of:

  • entry_pdb_id - the PDB identifier

  • system_biounit_id - the biological assembly identifier

  • system_protein_chains_asym_id - the list of protein chains in the system, defined by their <instance>.<label_asym_id>

  • system_ligand_chains_asym_id - the list of ligand chains in the system, defined by their <instance>.<label_asym_id>

Systems containing multiple ligands (e.g. 4grc__1__1.A__1.C_1.D) span multiple rows, where each row represents a different ligand in the system (as seen in the ligand_instance_chain column).
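
Since a system_id is just these fields joined by double underscores (with the chain lists themselves joined by single underscores), it can be split back into its components. A minimal sketch, assuming only the ID format shown above (this helper is not part of the plinder API):

def parse_system_id(system_id: str) -> dict:
    # format: <pdb_id>__<biounit_id>__<protein chains>__<ligand chains>,
    # with the chains inside each list joined by "_"
    pdb_id, biounit_id, protein_chains, ligand_chains = system_id.split("__")
    return {
        "entry_pdb_id": pdb_id,
        "system_biounit_id": biounit_id,
        "system_protein_chains_asym_id": protein_chains.split("_"),
        "system_ligand_chains_asym_id": ligand_chains.split("_"),
    }

parse_system_id("4grc__1__1.A__1.C_1.D")
# {'entry_pdb_id': '4grc', 'system_biounit_id': '1',
#  'system_protein_chains_asym_id': ['1.A'],
#  'system_ligand_chains_asym_id': ['1.C', '1.D']}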

print(f"Number of ligand chains: {plindex.shape[0]}")
print(f"Number of systems: {plindex.system_id.nunique()}")
print(
    f"Number of biounits: {plindex[['entry_pdb_id', 'system_biounit_id']].drop_duplicates().shape[0]}"
)
print(f"Number of PDB IDs: {plindex.entry_pdb_id.nunique()}")
Number of ligand chains: 419538
Number of systems: 309972
Number of biounits: 108206
Number of PDB IDs: 77474

While query_index by default loads all the systems in the train and val splits, not every returned system may be useful for training your model.

Single-ligand single-protein predictors#

As an example, we consider the case of training single-ligand single-protein models. One way to filter training data is as follows:

plindex_single = query_index(
    filters=[
        ("system_num_ligand_chains", "==", 1),
        ("system_num_protein_chains", "==", 1),
    ],
    splits=["train"],
)
plindex_single.system_id.nunique()
152822

However, PLINDER also considers ions and crystallization artifacts as ligands when they are within 4 Å of a non-ion, non-artifact ligand, so there are also systems in PLINDER that contain only one “proper” ligand alongside such extras. Another strategy, then, is to train only on the proper ligands and ignore the ions and artifacts in the system.

plindex_single_proper = query_index(
    filters=[
        ("system_proper_num_ligand_chains", "==", 1),
        ("system_num_protein_chains", "==", 1),
        ("ligand_is_proper", "==", True),  # filters out all other ligands in the system
    ],
    splits=["train"],
)
plindex_single_proper.system_id.nunique()
182360

This can provide up to 20% more data for training; the caveat, however, is that some of the interactions made by artifacts or ions may influence the binding pose of the “proper” ligand.

One could also choose to include multi-ligand systems but only train with one ligand at a time, and likewise for multi-protein systems (see the sketch below). These choices are up to the user, and we provide the annotations to enable them.
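
As a rough sketch of the multi-ligand variant, one could keep only the “proper” ligand rows of multi-ligand systems and pick one ligand per system at training time (columns and filters as used above; the per-system sampling step is just an illustration):

plindex_multi = query_index(
    columns=["system_id", "ligand_instance_chain"],
    filters=[
        ("system_proper_num_ligand_chains", ">", 1),
        ("ligand_is_proper", "==", True),
    ],
    splits=["train"],
)
# each row is one (system, proper ligand) pair, so picking one row
# per system gives one ligand at a time for every multi-ligand system
one_ligand_per_system = plindex_multi.groupby("system_id").sample(n=1)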

Annotations#

There are 745 columns in the full index. Not every one will be useful for your use-case, so we’ll go through some common categories and the kinds of use-cases that require different annotations. You can find the full list of columns with descriptions in the index docs.

Structure quality#

A core principle of PLINDER is to annotate which systems have high enough experimental structure quality to be reliably used as ground truth for measuring model performance. The quality of experimentally resolved structures varies considerably, and many crystal structures with ligands have missing atoms or missing residues in the binding site; comparing predictions against such lower-quality structures can skew the perceived performance of a method.


You can access all the crystal structure validation information extracted from PDB validation reports as well as crystal contact annotations by looking for columns starting with “entry_validation”, “system_pocket_validation”, “system_ligand_validation” etc.

For simplicity, we provide the “system_pass_validation_criteria” column, which can be used to filter systems that pass our quality criteria:

class QualityCriteria:
    max_entry_resolution: float = 3.5
    max_entry_r: float = 0.4
    max_entry_rfree: float = 0.45
    max_entry_r_minus_rfree: float = 0.05
    ligand_max_num_unresolved_heavy_atoms: int = 0  # except for covalent
    ligand_max_alt_count: int = 1  # misnomer: this counts number of total configurations
    ligand_min_average_occupancy: float = 0.8
    ligand_min_average_rscc: float = 0.8
    ligand_max_average_rsr: float = 0.3
    ligand_max_percent_outliers_clashes: float = 0
    ligand_max_fraction_atoms_with_crystal_contacts: float = 0
    pocket_max_num_unresolved_heavy_atoms: int = 0
    pocket_max_alt_count: int = 1  # same as above
    pocket_min_average_occupancy: float = 0.8
    pocket_min_average_rscc: float = 0.8
    pocket_max_average_rsr: float = 0.3
    pocket_max_percent_outliers_clashes: int = 100
plindex = query_index(
    columns=["system_id", "system_pass_validation_criteria"],
    splits=["train", "val", "test"],
)
data = (
    plindex.drop_duplicates("system_id")
    .sort_values(by="system_pass_validation_criteria")
    .groupby(["split", "system_pass_validation_criteria"])
    .system_id.count()
    .unstack()
)
data_percentage = data.div(data.sum(axis=1), axis=0) * 100
ax = data_percentage.plot(
    kind="bar", stacked=True, figsize=(5, 3), color=["firebrick", "darkseagreen"]
)
ax.set_xticklabels(
    [label.get_text().upper() for label in ax.get_xticklabels()], rotation=0
)
ax.set_xlabel("")
ax.set_ylabel("% passing criteria")
ax.get_legend().remove()
for container, count_data in zip(ax.containers, data.values.T):
    ax.bar_label(container, labels=count_data, label_type="edge")
../_images/e370f222b6af3acf393f4f5d7adce41b5c435682e87468da86940b8e7fbe48c3.png

Structure completeness#

Related to structure quality, another important aspect to consider when training on these structures is completeness. While the inputs for prediction may be a protein sequence and a SMILES string, the protein-ligand complex structure you get from the PDB may not have all of the protein residues or all of the ligand atoms resolved.

plindex = query_index(
    columns=[
        "system_id",
        "ligand_num_unresolved_heavy_atoms",
        "ligand_num_heavy_atoms",
        "system_protein_chains_num_unresolved_residues",
        "system_pocket_validation_num_unresolved_heavy_atoms",
        "system_protein_chains_total_length",
    ],
    splits=["train", "val", "test"],
)

Here we calculate the fraction of unresolved residues in the protein chains and the fraction of unresolved heavy atoms in the ligand, to see their distributions across the splits.

plindex[
    "system_protein_chains_total_num_unresolved_residues"
] = plindex.system_protein_chains_num_unresolved_residues.map(sum)
plindex["system_protein_chains_fraction_unresolved_residues"] = (
    plindex.system_protein_chains_total_num_unresolved_residues
    / plindex.system_protein_chains_total_length
)
plindex["ligand_fraction_unresolved_heavy_atoms"] = (
    plindex.ligand_num_unresolved_heavy_atoms / plindex.ligand_num_heavy_atoms
)
fig, axes = plt.subplots(1, 3, figsize=(10, 3), sharex=True, sharey=True)
grouped_data = plindex.drop_duplicates("system_id").groupby("split")
split_colors = {
    "train": "#ff9999",
    "test": "#66b3ff",
    "val": "#7cc27c",
}
for i, (split, data) in enumerate(grouped_data):
    data.system_protein_chains_fraction_unresolved_residues.hist(
        ax=axes[i], density=True, color=split_colors[split]
    )
    axes[i].set_title(split.upper())
    if i == 0:
        axes[i].set_ylabel("Density")
fig.suptitle(
    "Fraction of Unresolved Protein Residues (SEQRES length - no. resolved residues)"
)
plt.tight_layout()
../_images/402069cd3f7af7e6eedeebd6f7e8c4bdeee52570cf392a3e32df51d727360925.png
print("Percentage of ligands with no unresolved heavy atoms:")
for split in ["train", "val", "test"]:
    print(
        f'{split.capitalize()}: {100 * (plindex[plindex["split"] == split].ligand_fraction_unresolved_heavy_atoms == 0).sum() / plindex[plindex["split"] == split].shape[0]:.2f}%'
    )
Percentage of ligands with no unresolved heavy atoms:
Train: 82.34%
Val: 90.75%
Test: 93.80%

Pocket domains#

We annotated domains from different databases onto the protein chains of each system, and then picked the domain spanning the pocket residues of a system as the domain of the system pocket.

plindex = query_index(
    columns=["system_id", "system_pocket_ECOD_t_name", "system_pocket_kinase_name"],
)
plindex.drop_duplicates("system_id").system_pocket_ECOD_t_name.value_counts().head(10)
system_pocket_ECOD_t_name
Nucleoplasmin-like/VP (viral coat and capsid proteins)         14051
P-loop containing nucleoside triphosphate hydrolases           13463
Protein kinase                                                 11203
TIM barrels                                                     9384
Immunoglobulin/Fibronectin type III/E set domains/PapD-like     9293
NAD(P)-binding Rossmann-fold domains                            9213
Globin-like                                                     7218
Ribosomal protein L31e/gp120 outer domain                       7141
S2 subunit of coronavirus spike glycoprotein                    6701
Viral protein domain                                            6467
Name: count, dtype: int64
plindex.drop_duplicates("system_id").system_pocket_kinase_name.value_counts().head(10)
system_pocket_kinase_name
CDK2        494
p38a        413
EGFR        360
CK2a1       349
AurA        278
PIM1        211
ALK2        208
JAK2        191
IRAK4       186
MAPKAPK2    183
Name: count, dtype: int64

Ligand properties#

Molecular properties and annotations are calculated from the ligand SMILES strings:

properties = [
    "ligand_molecular_weight",
    "ligand_num_rot_bonds",
    "ligand_num_rings",
    "ligand_num_hbd",
    "ligand_num_hba",
    "ligand_num_heavy_atoms",
    "ligand_crippen_clogp",
    "ligand_qed",
    "ligand_tpsa",
    "ligand_is_kinase_inhibitor",
]
plindex = query_index(
    columns=["system_id", "ligand_instance_chain", "ligand_unique_ccd_code"]
    + properties,
    splits=["train", "val", "test"],
    filters=[
        ("ligand_is_proper", "==", True)  # focusing on non-ion, non-artifact ligands
    ],
)
plindex.head(20)
system_id ligand_instance_chain ligand_unique_ccd_code ligand_molecular_weight ligand_num_rot_bonds ligand_num_rings ligand_num_hbd ligand_num_hba ligand_num_heavy_atoms ligand_crippen_clogp ligand_qed ligand_tpsa ligand_is_kinase_inhibitor split
0 3grt__1__1.A_2.A__1.B 1.B FAD 785.157135 13 6 9 21 53 -2.42396 0.044261 362.93 False train
1 3grt__1__1.A_2.A__1.C 1.C TS2 721.288731 10 1 11 13 48 -4.04010 0.095714 313.27 False train
2 3grt__1__1.A_2.A__2.B 2.B FAD 785.157135 13 6 9 21 53 -2.42396 0.044261 362.93 False train
3 3grt__1__1.A_2.A__2.C 2.C TS2 721.288731 10 1 11 13 48 -4.04010 0.095714 313.27 False train
4 1grx__1__1.A__1.B 1.B GSH 307.083806 9 0 6 6 20 -2.20610 0.263437 158.82 True train
5 6grf__1__1.A__1.D 1.D HSR 221.089937 2 1 5 6 15 -3.07760 0.337239 119.25 False train
6 6grf__2__1.B__1.E 1.E HSR 221.089937 2 1 5 6 15 -3.07760 0.337239 119.25 False train
7 3grj__2__1.B__1.I_1.L_1.Q 1.L G14 188.058577 2 2 2 2 14 1.77490 0.754340 65.98 False train
8 3grj__1__1.A__1.F 1.F G14 188.058577 2 2 2 2 14 1.77490 0.754340 65.98 False train
9 3grj__1__1.A__1.H 1.H G14 188.058577 2 2 2 2 14 1.77490 0.754340 65.98 False train
10 8grn__1__1.A__1.C 1.C LPC 468.308466 21 0 2 6 31 4.43140 0.111161 102.29 False train
11 8grn__1__1.B__1.D 1.D LPC 468.308466 21 0 2 6 31 4.43140 0.111161 102.29 False train
12 4grc__1__1.A__1.C_1.D 1.C HEM 616.177293 8 6 2 6 43 4.73854 0.416598 96.26 False train
13 4grc__1__2.A__2.C_2.D 2.C HEM 616.177293 8 6 2 6 43 4.73854 0.416598 96.26 False train
14 4grr__1__1.B_1.C__1.K 1.K AVR 231.137162 3 2 1 4 17 1.98760 0.821636 60.39 False train
15 2grt__1__1.A_2.A__1.B 1.B FAD 785.157135 13 6 9 21 53 -2.42396 0.044261 362.93 False train
16 2grt__1__1.A_2.A__1.C 1.C GDS 612.151962 21 0 10 12 40 -3.87680 0.043742 317.64 False train
17 2grt__1__1.A_2.A__2.B 2.B FAD 785.157135 13 6 9 21 53 -2.42396 0.044261 362.93 False train
18 2grt__1__1.A_2.A__2.C 2.C GDS 612.151962 21 0 10 12 40 -3.87680 0.043742 317.64 False train
19 7grp__1__1.A__1.C 1.C VXQ 149.084064 1 2 1 2 11 1.08020 0.648103 35.25 False train
binwidths = {
    "ligand_molecular_weight": ("Molecular weight", 50),
    "ligand_num_heavy_atoms": ("Number of heavy atoms", 5),
    "ligand_num_rot_bonds": ("Number of rotatable bonds", 2),
    "ligand_num_rings": ("Number of rings", 1),
    "ligand_tpsa": ("Topological polar surface area", 10),
    "ligand_crippen_clogp": ("Crippen logP", 1),
}
fig, axes = plt.subplots(3, 2, figsize=(10, 7))
axes = axes.flatten()
for i, prop in enumerate(binwidths):
    ax = axes[i]
    plindex.groupby("split")[prop].plot(
        kind="hist",
        density=True,
        alpha=0.5,
        histtype="stepfilled",
        legend=True,
        ax=ax,
        color=split_colors,
        bins=np.arange(
            min(plindex[prop]),
            max(plindex[prop]) + binwidths[prop][1],
            binwidths[prop][1],
        ),
    )
    ax.set_xlabel(binwidths[prop][0])
    ax.set_ylabel("Density")
plt.tight_layout()
plt.show()
../_images/8b56c31c00f9544665c4702de1cb8cbccb76ff1b229797ed0a7e791a954fbfce.png

These properties were further used to categorize the ligands into different types.

ligand_types = [
    f"ligand_is_{x}"
    for x in [
        "lipinski",
        "covalent",
        "cofactor",
        "oligo",
        "ion",
        "fragment",
        "artifact",
    ]
]

These categories were also aggregated to the system level for easier filtering, with system_has_<category> as a column indicating whether any ligand in the system is of that category.
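
For instance, assuming the system_has_artifact column follows the system_has_<category> pattern just described, a minimal sketch for keeping only training systems without any artifact ligand could look like this:

plindex_no_artifacts = query_index(
    columns=["system_id"],
    filters=[
        ("system_has_artifact", "==", False),  # assumed name, per the pattern above
    ],
    splits=["train"],
)
plindex_no_artifacts.system_id.nunique()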

plindex = query_index(
    columns=["system_id", "ligand_unique_ccd_code"] + ligand_types,
    splits=["train", "val", "test"],
)
labels = [c.replace("ligand_is_", "").capitalize() for c in ligand_types]
bar_colors = plt.cm.Pastel2.colors
split_names = ["train", "val", "test"]
fig, axes = plt.subplots(1, 3, figsize=(10, 3))
for i, split in enumerate(split_names):
    ax = axes[i]
    split_data = plindex[plindex["split"] == split]
    bars = ax.bar(
        labels,
        split_data[ligand_types].mean().mul(100).to_list(),
        width=1,
        color=bar_colors,
        edgecolor="black",
        label=split,
        linewidth=1,
    )
    ax.set_xticks(np.arange(len(labels)), labels, rotation=70)
    ax.set_ylim(0, 100)
    ax.set_title(split.upper())
    counts = split_data[ligand_types].sum().to_list()
    for bar, count in zip(bars, counts):
        ax.text(
            bar.get_x() + bar.get_width() / 2.0,
            bar.get_height() + 2,
            f"{count}",
            ha="center",
            va="bottom",
            rotation=70,
            fontsize=10,
        )
../_images/9d555382b1f33a477fff8cf5084792bcd26a82ec898f52cd02aacf8dbba01c82.png

PLI-specific properties#

There are also ligand properties that are specific to the interaction with the particular protein pocket present in the system. These include:

  • The kinds of interactions the ligand may have with the protein, calculated using PLIP

  • Experimental binding affinity, pulled from BindingDB when available

  • Potential ligand-protein crystal contacts, defined as ligand-protein contacts below 5 Å which are not in the same asymmetric unit (symmetry mates) and not in the system biounit

pli_specific = [
    "system_fraction_atoms_with_crystal_contacts",
    "system_num_crystal_contacted_residues",
    "ligand_binding_affinity",
    "system_has_binding_affinity",
    "ligand_interactions",
    "system_num_interactions",
]
plindex = query_index(
    columns=["system_id"] + pli_specific,
    splits=["train", "val", "test"],
)

You could count how many systems have crystal contacts and filter them out:

(
    plindex.drop_duplicates("system_id").system_fraction_atoms_with_crystal_contacts > 0
).sum()
34467
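
To actually drop these systems, one option is to keep only those in which no ligand atom is involved in a crystal contact; a minimal sketch using the dataframe loaded above:

plindex_no_contacts = plindex[
    plindex.system_fraction_atoms_with_crystal_contacts == 0
].reset_index(drop=True)
plindex_no_contacts.system_id.nunique()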

If your model has an additional component for predicting binding affinity, you could see how much data is available for training and evaluation:

plindex.drop_duplicates("system_id").groupby(
    "split"
).system_has_binding_affinity.value_counts()
split  system_has_binding_affinity
test   False                             862
       True                              174
train  False                          261204
       True                            47936
val    False                             685
       True                              147
Name: count, dtype: int64
plindex["ligand_binding_affinity"].hist()
plt.xlabel(r"$pK_i$ or $pK_D$")
plt.ylabel("No. of system ligands")
Text(0, 0.5, 'No. of system ligands')
../_images/eff5eb8356ac25f26073b79af8dac7e470ca92ea69abc3526f3023922762f7d4.png

You could even look deeper into the kinds of protein-ligand interactions present in your training data:

plindex.groupby("split").system_num_interactions.describe()
count mean std min 25% 50% 75% max
split
test 1436.0 16.591226 9.886007 3.0 9.0 15.0 22.0 51.0
train 418381.0 14.448139 12.087511 0.0 5.0 12.0 21.0 722.0
val 1157.0 16.078652 10.204647 0.0 8.0 14.0 22.0 49.0
plindex["ligand_interactions"].values[0]
array(['1.A_36_type:hydrogen_bonds__protisdon:False__sidechain:False',
       '1.A_113_type:hydrogen_bonds__protisdon:False__sidechain:False',
       '1.A_113_type:hydrogen_bonds__protisdon:True__sidechain:False',
       '1.A_312_type:hydrogen_bonds__protisdon:False__sidechain:False',
       '1.A_320_type:hydrogen_bonds__protisdon:False__sidechain:False',
       '1.A_322_type:hydrogen_bonds__protisdon:False__sidechain:True',
       '1.A_322_type:hydrogen_bonds__protisdon:True__sidechain:False',
       '1.A_40_type:hydrogen_bonds__protisdon:True__sidechain:False',
       '1.A_40_type:hydrophobic_contacts',
       '1.A_13_type:hydrogen_bonds__protisdon:True__sidechain:False',
       '1.A_13_type:hydrogen_bonds__protisdon:True__sidechain:True',
       '1.A_11_type:hydrogen_bonds__protisdon:True__sidechain:False',
       '1.A_35_type:hydrogen_bonds__protisdon:True__sidechain:True',
       '1.A_34_type:hydrogen_bonds__protisdon:True__sidechain:False',
       '1.A_49_type:hydrogen_bonds__protisdon:True__sidechain:True',
       '1.A_180_type:hydrogen_bonds__protisdon:True__sidechain:True',
       '1.A_180_type:pi_stacks__stack_type:T',
       '1.A_41_type:hydrogen_bonds__protisdon:True__sidechain:False',
       '1.A_14_type:hydrogen_bonds__protisdon:True__sidechain:False',
       '1.A_15_type:hydrogen_bonds__protisdon:True__sidechain:False',
       '1.A_140_type:hydrogen_bonds__protisdon:True__sidechain:False',
       '1.A_314_type:hydrogen_bonds__protisdon:True__sidechain:False',
       '2.A_450_type:hydrogen_bonds__protisdon:False__sidechain:False'],
      dtype=object)
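
Each interaction string encodes the contacted residue (as <instance>.<label_asym_id>_<residue number>), followed by the interaction type and its parameters, separated by double underscores. A minimal parsing sketch based on the strings shown above (a hypothetical helper, not part of the plinder API):

def parse_interaction(interaction: str) -> dict:
    # e.g. "1.A_36_type:hydrogen_bonds__protisdon:False__sidechain:False"
    residue_part, type_part = interaction.split("_type:", 1)
    chain, residue_number = residue_part.rsplit("_", 1)
    interaction_type, *params = type_part.split("__")
    parsed = {
        "chain": chain,
        "residue_number": int(residue_number),
        "type": interaction_type,
    }
    for param in params:
        key, value = param.split(":", 1)
        parsed[key] = value
    return parsed

parse_interaction(plindex["ligand_interactions"].values[0][0])
# {'chain': '1.A', 'residue_number': 36, 'type': 'hydrogen_bonds',
#  'protisdon': 'False', 'sidechain': 'False'}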

Beyond the split#

Here we have mainly focused on systems in the train/val/test splits, since the train and val splits are the ones that can be used for training models, whether to participate in the PLINDER leaderboard or to compare your model’s performance against the methods on it.

However, PLINDER itself contains all PLI systems in the PDB (except those containing only crystallization artifacts), and you may be interested in this dataset as a whole. This can be queried with the splits=["*"] option.

plindex = query_index(
    columns=[
        "system_id",
        "entry_pdb_id",
        "system_num_protein_chains",
        "system_num_ligand_chains",
    ]
    + ligand_types,
    splits=["*"],
)
print("No. ligands:", len(plindex))
print("No. systems:", plindex.system_id.nunique())
print("No. PDB entries:", plindex.entry_pdb_id.nunique())
No. ligands: 1357906
No. systems: 990260
No. PDB entries: 143800
fig, ax = plt.subplots(figsize=(5, 3))
bars = ax.bar(
    labels,
    plindex[ligand_types].mean().mul(100).to_list(),
    width=1,
    color=bar_colors,
    edgecolor="black",
    label=split,
    linewidth=1,
)
ax.set_xticks(np.arange(len(labels)), labels, rotation=70)
ax.set_ylim(0, 100)
ax.set_title(f"No. PLINDER ligands: {len(plindex)}")
counts = plindex[ligand_types].sum().to_list()
for bar, count in zip(bars, counts):
    ax.text(
        bar.get_x() + bar.get_width() / 2.0,
        bar.get_height() + 2,
        f"{count}",
        ha="center",
        va="bottom",
        rotation=70,
        fontsize=10,
    )

print(plindex.system_num_protein_chains.value_counts())
print(plindex.system_num_ligand_chains.value_counts())
system_num_protein_chains
1     857629
2     312129
3      73237
4      19875
30     14122
       ...  
72        72
64        64
53        61
52        52
29        30
Name: count, Length: 74, dtype: int64
system_num_ligand_chains
1     809509
2     248916
3     107613
4      45856
5      17065
       ...  
84        84
39        78
74        74
67        67
46        46
Name: count, Length: 238, dtype: int64
../_images/2aef7d0cb2c9f785b17c003da0c0596e7326ed0bdcb1d9921f0210a6a7f71a7b.png

When looking at distributions across the entire dataset, you’ll notice that there are quite a lot of ion systems, as well as systems with more than 5 protein or ligand chains.
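
If such systems are not useful for your application, the same annotations can serve as filters; a minimal sketch, assuming the system_has_ion column follows the system_has_<category> pattern described earlier:

plindex_filtered = query_index(
    columns=["system_id"],
    filters=[
        ("system_num_protein_chains", "<=", 5),
        ("system_num_ligand_chains", "<=", 5),
        ("system_has_ion", "==", False),  # assumed name, per the system_has_<category> pattern
    ],
    splits=["*"],
)
plindex_filtered.system_id.nunique()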

Note

While we provide annotations for all systems, those that are not in the train and val splits may not be used for training models that participate in the MLSB PLINDER leaderboard.

Similarity clusters#

Similarity between two protein-ligand complexes can occur at various levels, including protein sequence, structural features, binding pocket characteristics, or ligand and interaction properties. We calculated a comprehensive set of similarity metrics to cover every combination of these dimensions — from identical systems, where the protein, pocket, interactions and ligand are the same, to systems that differ across all levels.

The entire set of similarity metrics is described here, and they were calculated across all pairs of systems having a Foldseek or MMseqs alignment. We then used graph clustering to group systems into clusters based on their similarity with a specific metric and threshold. There are three types of clusters available: strongly connected graph components (strong component), weakly connected graph components (weak component), and communities detected using asynchronous label propagation (community).

We can load the similarity clusters assigned to each system for different metrics, thresholds, and clustering types.
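
The cluster labels are exposed as regular index columns named after the metric, threshold, and clustering type (e.g. pocket_qcov__50__weak__component or pli_unique_qcov__50__community, both of which appear below), so they can be requested through query_index like any other annotation; a minimal sketch:

plindex_clusters = query_index(
    columns=[
        "system_id",
        "pocket_qcov__50__weak__component",
        "pli_unique_qcov__50__community",
    ],
    splits=["train", "val", "test"],
)
plindex_clusters.head()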

These cluster labels can be used in many ways with the annotations.

Example: ATP-binding pockets#

For example, maybe you are interested in seeing how many different kinds of ATP-binding pockets we have. First, let’s find all the analogs of ATP. We can do this by finding the strong component of ATP at 95% ECFP4 Tanimoto similarity and then getting the CCD codes of all the ligands in that cluster:

ligand_cluster_column = "tanimoto_similarity_max__95__strong__component"
atp_ligand_cluster = query_index(
    columns=[
        ligand_cluster_column,
    ],
    filters=[
        ("ligand_unique_ccd_code", "==", "ATP"),
        ("system_num_ligand_chains", "==", 1),
    ],
)[ligand_cluster_column].values[0]

atp_analogs = query_index(
    columns=[
        "system_id",
        "entry_pdb_id",
        "ligand_unique_ccd_code",
        "ligand_rdkit_canonical_smiles",
    ],
    filters=[
        (ligand_cluster_column, "==", atp_ligand_cluster),
        ("system_num_ligand_chains", "==", 1),
    ],
).drop_duplicates("ligand_unique_ccd_code")
atp_analogs_code_set = set(atp_analogs.ligand_unique_ccd_code.unique())
atp_analogs_code_set
{'5FA', 'ADP', 'AQP', 'ATP', 'ZF9', 'ZSF'}

Let’s see how they look:

grid = mols2grid.MolGrid(atp_analogs, smiles_col="ligand_rdkit_canonical_smiles")
grid.display(subset=["ligand_unique_ccd_code", "img"])

Now we can get the pocket clusters of all systems containing ATP analogs. Here we’re using pocket_qcov__50__weak__component, meaning a system within a cluster has some other system in that cluster with which it shares at least 50% of pocket residues when aligned.

pocket_cluster_column = "pocket_qcov__50__weak__component"
plindex = query_index(
    columns=[
        "system_id",
        "entry_pdb_id",
        "entry_release_date",
        pocket_cluster_column,
        "system_pocket_ECOD_t_name",
    ],
    filters=[
        (
            "ligand_unique_ccd_code",
            "in",
            atp_analogs_code_set,
        ),  # only ATP-binding systems
        ("ligand_num_interactions", ">", 2),  # with >2 interactions with ATP
        ("system_num_ligand_chains", "==", 1),  # and ATP is the only ligand
    ],
)
(
    "No. systems:",
    plindex.system_id.nunique(),
    "No. PDB entries:",
    plindex.entry_pdb_id.nunique(),
    "No. clusters:",
    plindex[pocket_cluster_column].nunique(),
)
('No. systems:', 5409, 'No. PDB entries:', 1484, 'No. clusters:', 154)
plindex[pocket_cluster_column].value_counts().head(10)
pocket_qcov__50__weak__component
c0      3562
c123     260
c11      259
c206     146
c204     141
c132      80
c391      75
c95       73
c205      58
c319      58
Name: count, dtype: int64

When we look at the ECOD topology names of the cluster “c11” for example, we see that these systems are GroEL equatorial domain-like domains. Interestingly, 7 systems from 2 different PDB entries don’t have ECOD annotations yet as they were recently released, but are indeed the same domain.

plindex[plindex[pocket_cluster_column] == "c11"].system_pocket_ECOD_t_name.value_counts(
    dropna=False
)
system_pocket_ECOD_t_name
GroEL equatorial domain-like    252
None                              7
Name: count, dtype: int64
plindex[
    (plindex[pocket_cluster_column] == "c11")
    & (plindex["system_pocket_ECOD_t_name"].isna())
].drop_duplicates("entry_pdb_id")
system_id entry_pdb_id entry_release_date pocket_qcov__50__weak__component system_pocket_ECOD_t_name split
527 8i6j__1__1.B__1.C 8i6j 2023-01-28 c11 None train
2756 8hki__1__1.I__1.Q 8hki 2022-11-27 c11 None train
plindex[plindex["system_pocket_ECOD_t_name"].isna()]["entry_pdb_id"].nunique()
290

Example: Diversity sampling#

The pli_unique_qcov__50__community column clusters systems such that the protein and ligand make similar interactions with the pocket. This clustering combines protein sequence and structural similarity (needed to obtain the pocket alignment), as well as the ligand-pocket interactions, making it a good proxy for unique kinds of binding present in the dataset.

Here’s an example of how one might use torch.utils.data.WeightedRandomSampler to sample training systems based on their PLI community cluster.

from torch.utils.data import WeightedRandomSampler

cluster_column = "pli_unique_qcov__50__community"

# Get train systems and their cluster labels
plindex = query_index(
    columns=["system_id", "entry_pdb_id", cluster_column], splits=["train"]
).drop_duplicates("system_id")

# Add the number of systems in each cluster to the dataframe
plindex = plindex.merge(
    plindex[cluster_column].value_counts().rename("cluster_num_systems"),
    left_on=cluster_column,
    right_index=True,
).reset_index(drop=True)

# Add the number of PDB entries in each cluster to the dataframe
plindex = plindex.merge(
    plindex.drop_duplicates("entry_pdb_id")[cluster_column]
    .value_counts()
    .rename("cluster_num_entries"),
    left_on=cluster_column,
    right_index=True,
).reset_index(drop=True)

# Ignore clusters with only one PDB entry
sample_from = plindex[plindex["cluster_num_entries"] > 1].reset_index(drop=True)

# Calculate the weight for each cluster
cluster_weights = 1.0 / sample_from.cluster_num_systems.values

# Create a WeightedRandomSampler and sample systems from the train set
sampler = WeightedRandomSampler(
    weights=cluster_weights, num_samples=len(cluster_weights)
)
sampled_indices = list(sampler)
sampled_plindex = (
    sample_from.iloc[sampled_indices]
    .reset_index(drop=True)
    .drop(columns=["cluster_num_systems", "cluster_num_entries"])
)

# Add the number of sampled systems in each cluster to the sampled dataframe
sampled_plindex = sampled_plindex.merge(
    sampled_plindex[cluster_column].value_counts().rename("cluster_num_systems"),
    left_on=cluster_column,
    right_index=True,
).reset_index(drop=True)

# Add the number of sampled PDB entries in each cluster to the sampled dataframe
sampled_plindex = sampled_plindex.merge(
    sampled_plindex.drop_duplicates("entry_pdb_id")[cluster_column]
    .value_counts()
    .rename("cluster_num_entries"),
    left_on=cluster_column,
    right_index=True,
)
print("No. of original systems: ", sample_from.system_id.nunique())
print("No. of nonredundant sampled systems: ", sampled_plindex.system_id.nunique())
print("No. of original clusters: ", sample_from[cluster_column].nunique())
print("No. of sampled clusters: ", sampled_plindex[cluster_column].nunique())
No. of original systems:  259772
No. of nonredundant sampled systems:  57807
No. of original clusters:  2723
No. of sampled clusters:  2722

Let’s see how the sampling process has affected the distribution of cluster sizes.

fig, ax = plt.subplots(1, 2, figsize=(10, 3))

cluster_sizes = (
    sample_from.drop_duplicates(cluster_column)
    .cluster_num_systems.value_counts()
    .sort_index()
)
ax[0].hist(cluster_sizes.index, weights=cluster_sizes.values, bins=20, log=True)
ax[0].set_xlabel("Cluster Size")
ax[0].set_ylabel("# Clusters (log scale)")
ax[0].grid(True, which="both", ls="-", alpha=0.2)

cluster_sizes = (
    sampled_plindex.drop_duplicates(cluster_column)
    .cluster_num_systems.value_counts()
    .sort_index()
)
ax[1].hist(cluster_sizes.index, weights=cluster_sizes.values, bins=20, log=False)
ax[1].set_xlabel("Cluster Size")
ax[1].set_ylabel("# Clusters")
Text(0, 0.5, '# Clusters')
../_images/4e4b7a4397ad7bddeb186a903a75460d4da9225e7042a8ed3ed33e8ca0a1e96a.png