MLSB/PLINDER Data Access#
The goal of this tutorial is to provide background information for the MLSB/PLINDER challenge, as well as a simple hands-on demo for how participants can access and use the PLINDER dataset.
Background information #
For background information on the rules of the challenge, see the MLSB/P(L)INDER challenge rules.
Accessing and loading data for training #
Here, we are going to demonstrate how to get the key input data:

- protein receptor FASTA sequence
- small-molecule ligand SMILES string
- access to linked apo and pred structures
In the process, we will show:

- How to download the PLINDER data
- How to query the PLINDER index and splits to select relevant data using the plinder.core API
- How to extract task-specific data one might want to use for training a task-specific ML model, e.g. one protein, one ligand
- How to use the plinder.core API to:
  - supply dataset inputs for train or val splits
  - load linked apo and pred structures
  - use diversity subsampling based on cluster labels
Download PLINDER #
To download, run: plinder_download --release 2024-06 --iteration v2 --yes

This will download and unpack all necessary files. For more information on downloading, check out the Dataset Tutorial.
Note
The dataset is hundreds of gigabytes in size; downloading and extracting should take about 40 minutes. If you want to play around with a toy example dataset, please use --iteration tutorial
%load_ext autoreload
%autoreload 2
from __future__ import annotations
import os
import pandas as pd
os.environ["GCLOUD_PROJECT"] = "plinder"
Interacting with dataset #
We recommend users interact with the dataset using the PLINDER Python API.

To install the API, run: pip install plinder[loader]

If you are using a zsh terminal, you will have to quote the package name: pip install "plinder[loader]"
from plinder.core.scores import query_index
Load the system index with selected columns from the annotations table#

For a full list of columns with descriptions, please refer to the docs.
# get plinder index with selected annotation columns specified
plindex = query_index(
columns=["system_id", "ligand_id",
"ligand_rdkit_canonical_smiles", "ligand_is_ion",
"ligand_is_artifact", "system_num_ligand_chains",
"system_num_neighboring_protein_chains",
"pli_qcov__100__strong__component"
],
filters=[
("system_type", "==", "holo"),
("system_num_neighboring_protein_chains", "<=", 5)
]
)
plindex.head()
|   | system_id | ligand_id | ligand_rdkit_canonical_smiles | ligand_is_ion | ligand_is_artifact | system_num_ligand_chains | system_num_neighboring_protein_chains | pli_qcov__100__strong__component |
|---|---|---|---|---|---|---|---|---|
| 0 | 3grt__1__1.A_2.A__1.B | 3grt__1__1.B | Cc1cc2nc3c(=O)[nH]c(=O)nc-3n(C[C@H](O)[C@H](O)... | False | False | 1 | 2 | c243140 |
| 1 | 3grt__1__1.A_2.A__1.C | 3grt__1__1.C | N[C@@H](CCC(=O)N[C@H]1CSSC[C@H](NC(=O)CC[C@H](... | False | False | 1 | 2 | c169758 |
| 2 | 3grt__1__1.A_2.A__2.B | 3grt__1__2.B | Cc1cc2nc3c(=O)[nH]c(=O)nc-3n(C[C@H](O)[C@H](O)... | False | False | 1 | 2 | c242976 |
| 3 | 3grt__1__1.A_2.A__2.C | 3grt__1__2.C | N[C@@H](CCC(=O)N[C@H]1CSSC[C@H](NC(=O)CC[C@H](... | False | False | 1 | 2 | c173553 |
| 4 | 1grx__1__1.A__1.B | 1grx__1__1.B | N[C@@H](CCC(=O)N[C@@H](CS)C(=O)NCC(=O)O)C(=O)O | False | False | 1 | 1 | c186761 |
# Display number of system neighboring protein chains
plindex.groupby("system_num_neighboring_protein_chains").system_id.count()
system_num_neighboring_protein_chains
1 406826
2 213268
3 43478
4 10835
5 1783
Name: system_id, dtype: int64
Extracting specific data using annotations #
As we can see from the data table above, a significant fraction of PLINDER systems are complex multi-protein-chain systems.
Task specific selection#
If we would like to focus on single-protein, single-ligand systems for training, we can use the annotated columns to select systems that:

- contain only one protein chain
- contain only one ligand
Remember: in PLINDER, artifacts and (single-atom) ions are also included in the index if they are part of the pocket.

We can use the columns ligand_is_ion and ligand_is_artifact to select only “proper” ligands.

Let’s find out how many annotated ligands are “proper”.
# define "proper" ligands that are not ions or artifacts
plindex["ligand_is_proper"] = (
~plindex["ligand_is_ion"] & ~plindex["ligand_is_artifact"]
)
plindex.groupby("ligand_is_proper").system_id.count()
ligand_is_proper
False 128401
True 547789
Name: system_id, dtype: int64
User choice#
The annotations table gives us the flexibility to choose the systems for training:

- One could strictly use only the data that contains single-protein, single-ligand systems
- Alternatively, one could expand the selection to include systems containing a single proper ligand, optionally ignoring the artifacts and ions in the pocket
Let’s compare the numbers of such systems!
# create mask for single receptor single ligand systems
systems_1p1l = (plindex["system_num_neighboring_protein_chains"] == 1) & (plindex["system_num_ligand_chains"] == 1)
# make count of these "proper" ligands per system
plindex["system_proper_num_ligand_chains"] = plindex.groupby("system_id")["ligand_is_proper"].transform("sum")
# create mask only for single receptor single "proper" ligand systems
systems_proper_1p1l = (plindex["system_num_neighboring_protein_chains"] == 1) & (plindex["system_proper_num_ligand_chains"] == 1) & plindex["ligand_is_proper"]
print(f"Number of single receptor single ligand systems: {sum(systems_1p1l)}")
print(f"Number of single receptor single \"proper\" ligand systems: {sum(systems_proper_1p1l)}")
Number of single receptor single ligand systems: 238228
Number of single receptor single "proper" ligand systems: 282433
As we can see, the second choice can provide up to 20% more data for training; the caveat, however, is that some of the interactions made by artifacts or ions may influence the binding pose of the “proper” ligand. The user could come up with further filtering strategies using the annotations table or external tools, but this is beyond the scope of this tutorial.
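As a sketch of one such further strategy, the snippet below drops entire systems whose pocket contains any ion, rather than just dropping the ion rows. It uses a toy DataFrame with the same columns as plindex; the system ids and values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the plindex DataFrame (hypothetical values)
toy = pd.DataFrame({
    "system_id": ["s1", "s1", "s2", "s3"],
    "ligand_is_ion": [False, True, False, False],
    "ligand_is_artifact": [False, False, False, True],
})

# Flag every row of a system that contains at least one ion,
# then keep only fully ion-free systems (s1 is excluded entirely)
system_has_ion = toy.groupby("system_id")["ligand_is_ion"].transform("any")
ion_free = toy[~system_has_ion]
print(sorted(ion_free["system_id"].unique()))  # ['s2', 's3']
```

The groupby-transform pattern is the same one used above to compute system_proper_num_ligand_chains, so it composes naturally with the existing masks.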
Using PLINDER splits#
Now, after curating the systems of interest, let’s have a look at the splits, i.e. how to use the plinder.core API to supply dataset inputs for the train or val splits.
from plinder.core import get_split
Accessing the splits#
The get_split function provides the current PLINDER split. A detailed description of this DataFrame is provided in the dataset documentation, but for our practical purposes we are mostly interested in system_id and split, which assigns each of our systems to a specific split category.
# get the current plinder split
split_df = get_split()
split_df.head()
|   | system_id | uniqueness | split | cluster | cluster_for_val_split | system_pass_validation_criteria | system_pass_statistics_criteria | system_proper_num_ligand_chains | system_proper_pocket_num_residues | system_proper_num_interactions | system_proper_ligand_max_molecular_weight | system_has_binding_affinity | system_has_apo_or_pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 101m__1__1.A__1.C_1.D | 101m__A__C_D_c188899 | train | c14 | c0 | True | True | 1 | 27 | 20 | 616.177293 | False | False |
| 1 | 102m__1__1.A__1.C | 102m__A__C_c237197 | train | c14 | c0 | True | True | 1 | 26 | 20 | 616.177293 | False | True |
| 2 | 103m__1__1.A__1.C_1.D | 103m__A__C_D_c252759 | train | c14 | c0 | False | True | 1 | 26 | 16 | 616.177293 | False | False |
| 3 | 104m__1__1.A__1.C_1.D | 104m__A__C_D_c274687 | train | c14 | c0 | False | True | 1 | 27 | 21 | 616.177293 | False | False |
| 4 | 105m__1__1.A__1.C_1.D | 105m__A__C_D_c221688 | train | c14 | c0 | False | True | 1 | 28 | 20 | 616.177293 | False | False |
Method developers working on flexible docking may also find the annotation column system_has_apo_or_pred handy: it indicates whether the system has linked apo or pred structures available (see later).
split_df.groupby(["split", "system_has_apo_or_pred"]).system_id.count()
split system_has_apo_or_pred
removed False 56876
True 41842
test False 548
True 488
train False 189703
True 119437
val False 456
True 376
Name: system_id, dtype: int64
For simplicity, let’s merge the plindex and split DataFrames into one.
# merge to a single DataFrame
plindex_split = plindex.merge(split_df, on="system_id", how="left")
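As a side note, a left merge keeps every plindex row even when a system is missing from the split table; pandas’ indicator=True flag makes such unmatched rows easy to spot. A minimal sketch on toy frames (the system ids are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for plindex and split_df (hypothetical system ids)
toy_plindex = pd.DataFrame({"system_id": ["s1", "s2", "s3"],
                            "ligand_id": ["l1", "l2", "l3"]})
toy_split = pd.DataFrame({"system_id": ["s1", "s3"],
                          "split": ["train", "val"]})

# how="left" keeps all plindex rows; _merge marks rows with no split entry
merged = toy_plindex.merge(toy_split, on="system_id", how="left", indicator=True)
print(merged["_merge"].tolist())  # ['both', 'left_only', 'both']
```

Rows flagged left_only end up with a NaN split, so it is worth checking for them before filtering by split category.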
Getting links to apo or pred structures #

For users interested in including apo and pred structures in their workflow, all the information needed can be obtained from the function query_links()
from plinder.core.scores import query_links
links_df = query_links(
columns=["reference_system_id", "id", "sort_score"],
)
Note

The table is sorted by sort_score, which is the resolution for apo structures and the pLDDT for pred structures. Whether a link is apo or pred is specified in the additionally added filename and kind columns, the latter indicating whether the structure was sourced from the PDB or AF2DB, respectively.
links_df.head()
|   | reference_system_id | id | sort_score | filename | kind |
|---|---|---|---|---|---|
| 0 | 6pl9__1__1.A__1.C | 2vb1_A | 0.65 | /home/runner/.local/share/plinder/2024-06/v2/l... | apo |
| 1 | 6ahh__1__1.A__1.G | 2vb1_A | 0.65 | /home/runner/.local/share/plinder/2024-06/v2/l... | apo |
| 2 | 5b59__1__1.A__1.B | 2vb1_A | 0.65 | /home/runner/.local/share/plinder/2024-06/v2/l... | apo |
| 3 | 3ato__1__1.A__1.B | 2vb1_A | 0.65 | /home/runner/.local/share/plinder/2024-06/v2/l... | apo |
| 4 | 6mx9__1__1.A__1.K | 2vb1_A | 0.65 | /home/runner/.local/share/plinder/2024-06/v2/l... | apo |
If a user wants to consider only one linked structure per system, we can easily drop duplicates after first sorting by sort_score. Using this priority score, pred structures will not be used unless no apo structure is available. The alternative can be achieved by sorting with ascending=False, or by filtering on the kind column with kind == "pred".
single_links_df = links_df.sort_values("sort_score", ascending=True).drop_duplicates("reference_system_id")
single_links_df.head()
|   | reference_system_id | id | sort_score | filename | kind |
|---|---|---|---|---|---|
| 0 | 6pl9__1__1.A__1.C | 2vb1_A | 0.65 | /home/runner/.local/share/plinder/2024-06/v2/l... | apo |
| 110 | 6agr__1__1.A__1.G | 2vb1_A | 0.65 | /home/runner/.local/share/plinder/2024-06/v2/l... | apo |
| 111 | 4qgz__1__1.A__1.C | 2vb1_A | 0.65 | /home/runner/.local/share/plinder/2024-06/v2/l... | apo |
| 112 | 4owa__2__1.B__1.NA | 2vb1_A | 0.65 | /home/runner/.local/share/plinder/2024-06/v2/l... | apo |
| 113 | 6wgo__1__1.A__1.E | 2vb1_A | 0.65 | /home/runner/.local/share/plinder/2024-06/v2/l... | apo |
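To illustrate the alternative mentioned above, the sketch below keeps only pred links and selects one per system with the highest pLDDT. It uses a toy links table; the ids and scores are made up for illustration.

```python
import pandas as pd

# Toy stand-in for links_df (hypothetical ids and scores)
toy_links = pd.DataFrame({
    "reference_system_id": ["sysA", "sysA", "sysB"],
    "id": ["1abc_A", "P12345_A", "P99999_A"],
    "sort_score": [1.8, 92.0, 85.0],
    "kind": ["apo", "pred", "pred"],
})

# Keep only predicted structures, then take the highest-pLDDT link per system
pred_only = (
    toy_links[toy_links["kind"] == "pred"]
    .sort_values("sort_score", ascending=False)
    .drop_duplicates("reference_system_id")
)
print(pred_only["id"].tolist())  # ['P12345_A', 'P99999_A']
```

This is the same sort-then-drop-duplicates pattern used for single_links_df below, just with the opposite sort order and a kind filter.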
Finding the relevant linked structures#
Now that we have links to apo/pred structures, we can see how many of those are available for our single-protein, single-ligand systems.
plindex_split[systems_1p1l].groupby(["split", "system_has_apo_or_pred"]).system_id.count()
split system_has_apo_or_pred
removed False 4720
True 41685
test False 59
True 487
train False 33897
True 118925
val False 26
True 374
Name: system_id, dtype: int64
plindex_split_1p1l_links = plindex_split[systems_1p1l].merge(single_links_df, left_on="system_id", right_on="reference_system_id", how="left")
# let's check how many systems have linked structures
plindex_split_1p1l_links['system_has_linked_apo_or_pred'] = ~plindex_split_1p1l_links.filename.isna()
plindex_split_1p1l_links.groupby(["split", "system_has_linked_apo_or_pred"]).system_id.count()
split system_has_linked_apo_or_pred
removed False 7098
True 39307
test False 76
True 470
train False 47118
True 105704
val False 30
True 370
Name: system_id, dtype: int64
Selecting the final dataset#

For this example, let’s select only the set that has linked structures for flexible docking.
plindex_final_df = plindex_split_1p1l_links[
(plindex_split_1p1l_links.system_has_linked_apo_or_pred) & (plindex_split_1p1l_links.split != "removed")
]
plindex_final_df.groupby(["split", "system_has_linked_apo_or_pred"]).system_id.count()
split system_has_linked_apo_or_pred
test True 470
train True 105704
val True 370
Name: system_id, dtype: int64
Loading dataset by split #
from plinder.core.loader import get_model_input_files
Note

The function get_model_input_files() accepts split = “train”, “val” or “test”
sample_dataset = get_model_input_files(
plindex_final_df,
split = "val",
max_num_sample = 10,
num_alternative_structures = 1,
)
print(f"Loaded dataset size: {len(sample_dataset)}")
Loaded dataset size: 10
Note

If the files are not already available, this downloads them to the ~/.local/share/plinder/{PLINDER_RELEASE}/{PLINDER_ITERATION} directory.
# Inspect data
sample_dataset[0]
(PosixPath('/home/runner/.local/share/plinder/2024-06/v2/systems/4cj6__1__1.A__1.B/sequences.fasta'),
'CC1=C(/C=C/C(C)=C/C=C/C(C)=C/C=O)C(C)(C)CCC1',
['/home/runner/.local/share/plinder/2024-06/v2/linked_structures/pred/4cj6__1__1.A__1.B/P12271_A/superposed.cif'])
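Each item bundles the path to the receptor FASTA file, the ligand SMILES string, and a list of linked structure paths. As a sketch, a small hypothetical helper like read_fasta_sequence below could extract the receptor sequence from such a FASTA file; it is demonstrated on a temporary toy file rather than the real dataset path.

```python
from pathlib import Path
import tempfile

def read_fasta_sequence(path):
    """Return (header, sequence) for a single-record FASTA file."""
    lines = Path(path).read_text().strip().splitlines()
    header = lines[0].lstrip(">")
    sequence = "".join(lines[1:])  # join wrapped sequence lines
    return header, sequence

# Demonstrate on a temporary toy FASTA file (in practice, pass the
# fasta path from a sample_dataset item instead)
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as fh:
    fh.write(">4cj6__1__1.A\nMKTAYIAKQR\nQISFVKSHFS\n")
header, seq = read_fasta_sequence(fh.name)
print(header, len(seq))  # 4cj6__1__1.A 20
```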
Using PLINDER clusters in sampling #
In general, diversity can be sampled using the cluster information described here.

Here, we have provided an example of how one might use the function get_diversity_samples, which is based on torch.utils.data.WeightedRandomSampler.

Note

This example function is provided for demonstration purposes; users are encouraged to come up with a sampling strategy that suits their needs.

For this example, we use the pli_qcov__100__strong__component column, which we loaded into plindex earlier, as the cluster assignment for diversity sampling.
from plinder.core.loader import get_diversity_samples
subsampled_df = get_diversity_samples(split_df=plindex_final_df,
cluster_tag="pli_qcov__100__strong__component"
)
len(plindex_final_df), len(subsampled_df)
(118309, 60330)
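The core idea behind WeightedRandomSampler-based diversity sampling is to weight each row by the inverse of its cluster size, so members of large clusters are drawn less often. A minimal sketch of the weight computation on a toy frame (the cluster tags are made up for illustration):

```python
import pandas as pd

# Toy frame with a cluster label column (hypothetical cluster tags)
toy = pd.DataFrame({
    "system_id": ["a", "b", "c", "d", "e"],
    "pli_qcov__100__strong__component": ["c1", "c1", "c1", "c2", "c3"],
})

# Inverse-cluster-size weights: each of c1's three members gets 1/3,
# while singleton clusters keep full weight
cluster_sizes = toy.groupby("pli_qcov__100__strong__component")["system_id"].transform("count")
toy["weight"] = 1.0 / cluster_sizes
print(toy["weight"].round(2).tolist())  # [0.33, 0.33, 0.33, 1.0, 1.0]
```

Such per-row weights could then be passed to torch.utils.data.WeightedRandomSampler to draw a cluster-balanced subsample.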
Tip

Currently, the PLINDER Python API checks the remote data source for consistency and downloads new data locally if there are any changes; this comes with a performance trade-off.

If you have PLINDER downloaded locally, you can use it in offline mode to save time on data queries when running production training: os.environ["PLINDER_OFFLINE"] = "true"

Unset this variable with os.environ.pop("PLINDER_OFFLINE", None) to make sure that your PLINDER data files are up to date!