Python API tutorial#
Setup#
Installation#
plinder is available on PyPI.
pip install plinder
Environment variable configuration#
We need to set environment variables to point to the release and iteration of choice.
For the sake of demonstration, these are set to point to a smaller tutorial example dataset: PLINDER_RELEASE=2024-06 and PLINDER_ITERATION=tutorial.
Note
The version used for the preprint is PLINDER_RELEASE=2024-04 and PLINDER_ITERATION=v1, while the current version with updated annotations to be used for the MLSB challenge is PLINDER_RELEASE=2024-06 and PLINDER_ITERATION=v2.
%env PLINDER_LOG_LEVEL=0
%env PLINDER_ITERATION=tutorial
env: PLINDER_LOG_LEVEL=0
env: PLINDER_ITERATION=tutorial
Alternatively, these variables can be set from the terminal via export (UNIX) or set (Windows).
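The same variables can also be set from within Python. The sketch below assumes that plinder reads them when it is imported or when its configuration is first requested, so they are set before any plinder import; that ordering is an assumption, not documented behavior.

```python
import os

# Values taken from the tutorial setup above.
os.environ["PLINDER_RELEASE"] = "2024-06"
os.environ["PLINDER_ITERATION"] = "tutorial"

# Assumption: plinder picks these up at import or first configuration
# access, so this code should run before importing plinder.
for name in ("PLINDER_RELEASE", "PLINDER_ITERATION"):
    print(f"{name}={os.environ[name]}")
```

Variables set this way only affect the current Python process and its children, unlike export or set in a shell session.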
Overview#
The user-facing subpackage of plinder is plinder.core.
It provides access to the underlying utility functions for accessing the dataset, split and annotations.
Its top-level functions include:
get_config(): access the PLINDER global configuration
query_index(): access and query the annotation table
In addition, it provides access to the data class PlinderSystem for reconstituting a PLINDER system from its system_id.
To supplement these data, plinder.core.scores provides functionality for querying metrics, such as protein/ligand similarity and cluster identity.
Getting the configuration#
First, we get the configuration to check that all parameters are set correctly. The snippet below checks whether the local and remote PLINDER paths point to the expected locations.
import plinder.core.utils.config
cfg = plinder.core.get_config()
print(f"local cache directory: {cfg.data.plinder_dir}")
print(f"remote data directory: {cfg.data.plinder_remote}")
local cache directory: /home/runner/.local/share/plinder/2024-06/tutorial
remote data directory: gs://plinder/2024-06/tutorial
Query annotations#
Query specific columns#
To query the annotations table for specific columns or filter by specific criteria, use query_index().
The function can be called without any argument to yield a pandas dataframe of system_id, entry_pdb_id, and split. By default, it only loads systems present in the train and val splits.
from plinder.core.scores import query_index
# Get system_id, entry_pdb_id, and split columns of train and val splits
query_index()
 | system_id | entry_pdb_id | split |
---|---|---|---|
0 | 3grt__1__1.A_2.A__1.B | 3grt | train |
1 | 3grt__1__1.A_2.A__1.C | 3grt | train |
2 | 3grt__1__1.A_2.A__2.B | 3grt | train |
3 | 3grt__1__1.A_2.A__2.C | 3grt | train |
4 | 1grx__1__1.A__1.B | 1grx | train |
... | ... | ... | ... |
419533 | 5lps__1__1.A__1.B | 5lps | train |
419534 | 5lps__1__2.A__2.B | 5lps | train |
419535 | 5lpv__1__1.A__1.B_1.C_1.D | 5lpv | train |
419536 | 5lpv__1__1.A__1.B_1.C_1.D | 5lpv | train |
419537 | 5lpv__1__1.A__1.B_1.C_1.D | 5lpv | train |
419538 rows × 3 columns
To retrieve other columns, pass the columns argument, which is a list of column names.
# Get specific columns from the annotation table
cols_of_interest = ["system_id", "entry_pdb_id", "entry_release_date", "entry_oligomeric_state", "entry_validation_clashscore", "entry_resolution"]
query_index(columns=cols_of_interest)
 | system_id | entry_pdb_id | entry_release_date | entry_oligomeric_state | entry_validation_clashscore | entry_resolution | split |
---|---|---|---|---|---|---|---|
0 | 3grt__1__1.A_2.A__1.B | 3grt | 1997-02-12 | dimeric | 12.90 | 2.50 | train |
1 | 3grt__1__1.A_2.A__1.C | 3grt | 1997-02-12 | dimeric | 12.90 | 2.50 | train |
2 | 3grt__1__1.A_2.A__2.B | 3grt | 1997-02-12 | dimeric | 12.90 | 2.50 | train |
3 | 3grt__1__1.A_2.A__2.C | 3grt | 1997-02-12 | dimeric | 12.90 | 2.50 | train |
4 | 1grx__1__1.A__1.B | 1grx | 1993-10-01 | monomeric | NaN | NaN | train |
... | ... | ... | ... | ... | ... | ... | ... |
419533 | 5lps__1__1.A__1.B | 5lps | 2016-08-14 | dimeric | 1.47 | 1.27 | train |
419534 | 5lps__1__2.A__2.B | 5lps | 2016-08-14 | dimeric | 1.47 | 1.27 | train |
419535 | 5lpv__1__1.A__1.B_1.C_1.D | 5lpv | 2016-08-15 | monomeric | 1.68 | 2.70 | train |
419536 | 5lpv__1__1.A__1.B_1.C_1.D | 5lpv | 2016-08-15 | monomeric | 1.68 | 2.70 | train |
419537 | 5lpv__1__1.A__1.B_1.C_1.D | 5lpv | 2016-08-15 | monomeric | 1.68 | 2.70 | train |
419538 rows × 7 columns
Query annotations with specific filters#
We can also pass additional filters, where each filter is a logical comparison of a column name with a given value.
Only rows that fulfill all conditions are returned.
See the description of pandas.read_parquet() for more information on the filter syntax.
# Query for single-ligand systems
filters = [("system_num_ligand_chains", "==", 1)]
query_index(columns=cols_of_interest, filters=filters)
 | system_id | entry_pdb_id | entry_release_date | entry_oligomeric_state | entry_validation_clashscore | entry_resolution | split |
---|---|---|---|---|---|---|---|
0 | 3grt__1__1.A_2.A__1.B | 3grt | 1997-02-12 | dimeric | 12.90 | 2.50 | train |
1 | 3grt__1__1.A_2.A__1.C | 3grt | 1997-02-12 | dimeric | 12.90 | 2.50 | train |
2 | 3grt__1__1.A_2.A__2.B | 3grt | 1997-02-12 | dimeric | 12.90 | 2.50 | train |
3 | 3grt__1__1.A_2.A__2.C | 3grt | 1997-02-12 | dimeric | 12.90 | 2.50 | train |
4 | 1grx__1__1.A__1.B | 1grx | 1993-10-01 | monomeric | NaN | NaN | train |
... | ... | ... | ... | ... | ... | ... | ... |
230141 | 3lp0__1__1.A_1.B__1.F | 3lp0 | 2010-02-04 | dimeric | 5.97 | 2.79 | train |
230142 | 3lp0__2__1.A_1.B__1.F | 3lp0 | 2010-02-04 | dimeric | 5.97 | 2.79 | train |
230143 | 3lp0__2__2.A_2.B__2.F | 3lp0 | 2010-02-04 | dimeric | 5.97 | 2.79 | train |
230144 | 5lps__1__1.A__1.B | 5lps | 2016-08-14 | dimeric | 1.47 | 1.27 | train |
230145 | 5lps__1__2.A__2.B | 5lps | 2016-08-14 | dimeric | 1.47 | 1.27 | train |
230146 rows × 7 columns
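The "all conditions must hold" behavior of a filter list can be illustrated without loading any data. The following is a minimal plain-Python sketch that emulates it on made-up rows; the helper matches() and the toy rows are hypothetical and are not how plinder or pandas evaluate filters internally.

```python
import operator

# Comparison operators usable in (column, operator, value) filter tuples.
COMPARISONS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda cell, values: cell in values,
    "not in": lambda cell, values: cell not in values,
}


def matches(row, filters):
    """Return True if the row fulfills every (column, op, value) condition."""
    return all(COMPARISONS[op](row[column], value) for column, op, value in filters)


# Hypothetical toy rows, not taken from the PLINDER index.
rows = [
    {"system_id": "toy_1", "system_num_ligand_chains": 1, "entry_resolution": 2.5},
    {"system_id": "toy_2", "system_num_ligand_chains": 2, "entry_resolution": 1.3},
    {"system_id": "toy_3", "system_num_ligand_chains": 1, "entry_resolution": 3.1},
]
filters = [
    ("system_num_ligand_chains", "==", 1),
    ("entry_resolution", "<", 3.0),
]
selected = [row["system_id"] for row in rows if matches(row, filters)]
print(selected)
```

Only toy_1 satisfies both conditions; toy_2 fails the ligand-chain filter and toy_3 fails the resolution filter.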
Query systems in test, removed, or unassigned splits#
The splits parameter is set to ["train", "val"] by default but can take one or more of ["train", "val", "test", "removed", "all"]. By querying with ["*"], we get all 1.3 million rows, including those from the test and removed splits as well as ion systems and systems with >5 protein and/or ligand chains (labelled "unassigned"):
df = query_index(columns=cols_of_interest, splits=["*"])
df.drop_duplicates("system_id")["split"].value_counts()
split
unassigned 581565
train 309140
removed 98718
val 832
test 5
Name: count, dtype: int64
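The annotation table can contain several rows with the same system_id (see the repeated 5lpv rows in the earlier output), so counting splits per row would count such systems more than once. A minimal plain-Python sketch of the deduplicate-then-count pattern used above, with made-up rows:

```python
from collections import Counter

# Hypothetical (system_id, split) rows; "sys_b" appears three times.
rows = [
    ("sys_a", "train"),
    ("sys_b", "train"),
    ("sys_b", "train"),
    ("sys_b", "train"),
    ("sys_c", "val"),
]

# Keep the first split seen for each system_id, similar in spirit to
# dropping duplicate system_id rows before counting.
first_split = {}
for system_id, split in rows:
    first_split.setdefault(system_id, split)

per_row = Counter(split for _, split in rows)
per_system = Counter(first_split.values())
print(per_row)
print(per_system)
```

Here the per-row count reports four train entries, while the per-system count reports two.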
Note
To load all columns, use the function get_plindex(), which returns the full dataframe. However, since this table has over 1.3 million rows and over 700 columns, it has a significant memory footprint (~24 GB RAM), and users are advised to query only the columns they need.
Query protein similarity#
There are three kinds of similarity datasets we provide:
Similarity between ligand-bound structures (holo)
Similarity between ligand-bound and unbound protein structures (apo)
Similarity between ligand-bound and AlphaFold-predicted structures (pred)
Any of these can be selected via the search_db argument of query_protein_similarity().
Note
With the full dataset, some similarity queries might require a large amount of memory. For example, query_protein_similarity(search_db="holo", filters=[("similarity", ">", "50")]) will use more than 500 GB of RAM.
Here, we query the protein similarity dataset to assess the protein-ligand interaction similarity between an example training set and an example test set.
from plinder.core.scores import query_protein_similarity
# Example train systems
train = ["7jxf__1__1.A_1.B__1.G", "1jtu__1__1.A_1.B__1.C_1.D",
         "8f9d__2__1.C_1.D__1.G", "6a9a__1__1.A_2.A__2.C_2.D",
         "1b5e__2__1.A_1.B__1.D"]
# Example test systems
test = ["1b5d__1__1.A_1.B__1.D", "1s2g__1__1.A_2.C__1.D",
        "4agi__1__1.C__1.W", "4n7m__1__1.A_1.B__1.C",
        "7eek__1__1.A__1.I"]
metric = "pli_unique_qcov"
threshold = 50
query_protein_similarity(
    search_db="holo",
    columns=["query_system", "target_system", "similarity"],
    filters=[
        ("query_system", "in", test),
        ("target_system", "in", train),
        ("metric", "==", metric),
        ("similarity", ">=", str(threshold)),
    ],
)
 | query_system | target_system | similarity |
---|---|---|---|
0 | 1b5d__1__1.A_1.B__1.D | 1b5e__2__1.A_1.B__1.D | 83 |
1 | 1b5d__1__1.A_1.B__1.D | 6a9a__1__1.A_2.A__2.C_2.D | 83 |
2 | 1b5d__1__1.A_1.B__1.D | 1jtu__1__1.A_1.B__1.C_1.D | 67 |
3 | 1b5d__1__1.A_1.B__1.D | 7jxf__1__1.A_1.B__1.G | 67 |
4 | 4n7m__1__1.A_1.B__1.C | 8f9d__2__1.C_1.D__1.G | 50 |
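A result like this can be summarized per test system, for example to see which test systems have a similar training system at all. The sketch below copies the rows of the table above into plain Python tuples and is only an illustration of one possible post-processing step.

```python
# Rows copied from the query result shown above.
hits = [
    ("1b5d__1__1.A_1.B__1.D", "1b5e__2__1.A_1.B__1.D", 83),
    ("1b5d__1__1.A_1.B__1.D", "6a9a__1__1.A_2.A__2.C_2.D", 83),
    ("1b5d__1__1.A_1.B__1.D", "1jtu__1__1.A_1.B__1.C_1.D", 67),
    ("1b5d__1__1.A_1.B__1.D", "7jxf__1__1.A_1.B__1.G", 67),
    ("4n7m__1__1.A_1.B__1.C", "8f9d__2__1.C_1.D__1.G", 50),
]
test = [
    "1b5d__1__1.A_1.B__1.D", "1s2g__1__1.A_2.C__1.D",
    "4agi__1__1.C__1.W", "4n7m__1__1.A_1.B__1.C",
    "7eek__1__1.A__1.I",
]

# Highest similarity to any training system, per test system.
max_similarity = {}
for query, _target, similarity in hits:
    max_similarity[query] = max(similarity, max_similarity.get(query, 0))

# Test systems without any hit at or above the threshold.
no_hit = [system_id for system_id in test if system_id not in max_similarity]
print(max_similarity)
print(no_hit)
```

Two of the five example test systems have a training system at or above the threshold of 50 for this metric; the other three have none among the five example training systems.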
Working with a PLINDER system#
A PlinderSystem is the representation of a single system.
This object provides access to all PDB entry and system level annotations, as well as the structures of the system components.
Load systems from IDs#
To reconstitute a PLINDER system directly from its ID, use the PlinderSystem class.
from plinder.core import PlinderSystem
plinder_system = PlinderSystem(system_id="4agi__1__1.C__1.W")
Users can choose the granularity level of the input: in the case above, the system was specified by its system ID, but passing PDB IDs (or their two middle characters) is also possible, which gives you all systems corresponding to the given PDB IDs.
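The system IDs used throughout this tutorial appear to follow the pattern PDB ID, biounit, receptor chains, ligand chains, separated by double underscores, with multiple chains joined by single underscores. The helper below is a minimal sketch that splits an ID under that assumption; parse_system_id and SystemIdParts are hypothetical names and not part of plinder.

```python
from typing import NamedTuple


class SystemIdParts(NamedTuple):
    pdb_id: str
    biounit_id: str
    receptor_chains: list
    ligand_chains: list


def parse_system_id(system_id: str) -> SystemIdParts:
    """Split a system ID, assuming four '__'-separated fields."""
    pdb_id, biounit_id, receptor, ligand = system_id.split("__")
    return SystemIdParts(pdb_id, biounit_id, receptor.split("_"), ligand.split("_"))


parts = parse_system_id("4agi__1__1.C__1.W")
print(parts)
# Two middle characters of the PDB ID
print(parts.pdb_id[1:3])
```

For the example system, the chain names obtained this way (1.C and 1.W) match the keys that appear in the sequence and ligand outputs later in this tutorial.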
Accessing annotations#
The PlinderSystem.entry property provides PDB entry-level annotations for that system.
Here, we will list the accessible categories of entry annotations and access the oligomeric state of a given system.
entry_annotations = plinder_system.entry
print(list(entry_annotations.keys()))
print(entry_annotations["oligomeric_state"])
['pdb_id', 'release_date', 'oligomeric_state', 'determination_method', 'keywords', 'pH', 'resolution', 'chains', 'ligand_like_chains', 'systems', 'covalent_bonds', 'chain_to_seqres', 'validation', 'pass_criteria', 'water_chains', 'symmetry_mate_contacts']
dimeric
In contrast, PlinderSystem.system returns annotations on the system level.
Here, we will extract the SMILES string of the first ligand of a given system.
system_annotations = plinder_system.system
print(list(system_annotations.keys()))
# Show ligand smiles of the first ligand of a given system
print(system_annotations["ligands"][0]["rdkit_canonical_smiles"])
['pdb_id', 'biounit_id', 'ligands', 'ligand_validation', 'pocket_validation', 'pass_criteria']
C[Se][C@@H]1O[C@@H](C)[C@@H](O)[C@@H](O)[C@@H]1O
Getting structure file paths#
The PlinderSystem also provides access to the structure files the system is based on.
This can be helpful for loading the structures for training a model or performing other calculations that require structural information.
print(plinder_system.ligand_sdfs)
print(plinder_system.smiles)
{'1.W': '/home/runner/.local/share/plinder/2024-06/tutorial/systems/4agi__1__1.C__1.W/ligand_files/1.W.sdf'}
{'1.W': 'C[Se][C@@H]1O[C@@H](C)[C@@H](O)[C@@H](O)[C@@H]1O'}
The same can be done for the receptor protein.
print(plinder_system.receptor_pdb)
print(plinder_system.receptor_cif)
print(plinder_system.sequences)
/home/runner/.local/share/plinder/2024-06/tutorial/systems/4agi__1__1.C__1.W/receptor.pdb
/home/runner/.local/share/plinder/2024-06/tutorial/systems/4agi__1__1.C__1.W/receptor.cif
{'1.C': 'MSTPGAQQVLFRTGIAAVNSTNHLRVYFQDVYGSIRESLYEGSWANGTEKNVIGNAKLGSPVAATSKELKHIRVYTLTEGNTLQEFAYDSGTGWYNGGLGGAKFQVAPYSXIAAVFLAGTDALQLRIYAQKPDNTIQEYMWNGDGWKEGTNLGGALPGTGIGATSFRYTDYNGPSIRIWFQTDDLKLVQRAYDPHKGWYPDLVTIFDRAPPRTAIAATSFGAGNSSIYMRIYFVNSDNTIWQVCWDHGKGYHDKGTITPVIQGSEVAIISWGSFANNGPDLRLYFQNGTYISAVSEWVWNRAHGSQLGRSALPPA'}
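A chain-to-sequence mapping like the one above can be passed on to sequence-based tools, for instance by writing it as FASTA. The following is a minimal sketch; to_fasta is a hypothetical helper, the header layout (system ID, a pipe, then the chain) is an arbitrary choice, and the example uses only the first 20 residues of the sequence printed above.

```python
def to_fasta(system_id, sequences, width=60):
    """Format a {chain: sequence} mapping as FASTA text."""
    lines = []
    for chain, sequence in sequences.items():
        # One header line per chain, then the sequence wrapped to `width`.
        lines.append(f">{system_id}|{chain}")
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


# First 20 residues of chain 1.C from the output above, for illustration.
example = {"1.C": "MSTPGAQQVLFRTGIAAVNS"}
fasta = to_fasta("4agi__1__1.C__1.W", example, width=10)
print(fasta)
```

With a real system, the dictionary returned by plinder_system.sequences could be passed in place of the shortened example.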