Dataset#

Dataset reference#

Directory structure#

2024-06/
|-- v2
    |-- index # Consolidated tabular annotations
    |   |-- annotation_table.parquet
    |   |-- annotation_table_nonredundant.parquet
    |-- systems  # Structure files for all systems (split by `two_char_code` and zipped)
    |   |-- {two_char_code}.zip
    |-- clusters # Pre-calculated cluster labels derived from the protein similarity dataset
    |   |-- cluster=communities
    |       |-- ...
    |   |-- cluster=components
    |       |-- ...
    |-- splits # Split files and the configs used to generate them (if available)
    |   |-- split.parquet
    |   |-- split.yaml
    |-- linked_structures # Apo and predicted structures linked to their holo systems
    |   |-- {two_char_code}.zip
    |-- links # Apo and predicted structures similarity to their holo structures
    |   |-- apo_links.parquet
    |   |-- pred_links.parquet
    |
--------------------------------------------------------------------------------
                            miscellaneous data below
--------------------------------------------------------------------------------
    |
    |-- dbs # CSVs containing the raw files and IDs in the foldseek and mmseqs sub-databases
    |   |-- subdbs
    |       |-- apo.csv
    |       |-- holo.csv
    |       |-- pred.csv
    |-- entries # Raw annotations prior to consolidation (split by `two_char_code` and zipped)
    |   |-- {two_char_code}.zip
    |-- fingerprints # Index mapping files for the ligand similarity dataset
    |   |-- ligands_per_inchikey.parquet
    |   |-- ligands_per_inchikey_ecfp4.npy
    |   |-- ligands_per_system.parquet
    |-- ligand_scores # Ligand similarity parquet dataset
    |   |-- {hashid}.parquet
    |-- ligands # Ligand data expanded from entries for computing similarity
    |   |-- {hashid}.parquet
    |-- mmp # Ligand matched molecular pairs (MMP) and series (MMS) data
    |   |-- plinder_mmp_series.parquet
    |   |-- plinder_mms.csv.gz
    |-- scores # Protein similarity parquet dataset
    |   |-- search_db=apo
    |       |-- apo.parquet
    |   |-- search_db=holo
    |       |-- {chunck_id}.parquet
    |   |-- search_db=pred
    |       |-- pred.parquet

The index, systems, clusters, splits, links and linked_structures directories are described in detail below. The remaining directories are covered in the Miscellaneous section.

Annotation tables (index/)#

Tables that list all systems along with their annotations.

  • annotation_table.parquet: Lists all systems and their annotations.

  • annotation_table_nonredundant.parquet: Subset of the annotation table with redundant systems removed.

Each column of the annotation table is documented by its Name, Type, Description, Mandatory flag and Example value.

Mandatory: The column has a non-empty, non-NaN value for all PLINDER systems.

Example: An example of a non-empty, non-NaN value for the given column in a PLINDER system.

Systems (systems/)#

This directory contains all the systems used in the dataset. The systems are grouped into zipped subdirectories named by the two penultimate characters of the PDB code (two_char_code). This grouping keeps loading and querying reasonably fast.

Each unzipped subdirectory contains folders named by system_id, which in turn contain the structure files.

|-- {two_char_code}
    |-- {system_id}
        |-- chain_mapping.json # Mapping between the chains in the receptor and the chains in the system
        |-- ligand_files # Mapping between the ligand in the receptor and the ligands in the system
        |-- receptor.cif  # Receptor mmcif file
        |-- receptor.pdb # Receptor pdb file
        |-- sequences.fasta # Receptor sequence fasta
        |-- system.cif # System mmcif file
        |-- water_mapping.json # Receptor binding site water map json file
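
As a minimal sketch of how the files of one system might be located, the snippet below derives two_char_code from a system_id and builds the expected directory. It assumes that "the two penultimate characters" means `pdb_id[-3:-1]` (the second and third characters of a four-character PDB code) and that the PDB code is the part of system_id before the first `__`. The helper names are hypothetical and not part of any PLINDER API.

```python
from pathlib import PurePosixPath


def two_char_code(system_id: str) -> str:
    """Return the grouping code of a system (assumed to be pdb_id[-3:-1])."""
    pdb_id = system_id.split("__")[0]
    return pdb_id[-3:-1]


def system_dir(root: str, system_id: str) -> PurePosixPath:
    """Directory of a system once systems/{two_char_code}.zip is unzipped in place."""
    return PurePosixPath(root) / "systems" / two_char_code(system_id) / system_id


print(two_char_code("7o00__1__1.A_1.B__1.D"))
print(system_dir("2024-06/v2", "7o04__1__1.A__1.G") / "receptor.cif")
```

Grouping by a short code keeps each archive small, so a single system can be read without unpacking the whole dataset.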

Clusters (clusters/)#

This directory contains pre-calculated cluster labels derived from the protein and pocket similarity dataset. The nested structure is as follows:

|-- cluster=communities
    |-- directed=False
        |-- metric={metric}
            |-- threshold={threshold}.parquet
|-- cluster=components
    |-- directed=False
        |-- metric={metric}
            |-- threshold={threshold}.parquet
    |-- directed=True
        |-- metric={metric}
            |-- threshold={threshold}.parquet
  • cluster: the cluster algorithm used

    • communities: clusters derived from a community detection algorithm

    • components: clusters derived from the disconnected components of the similarity graph

  • directed: type of graph used for cluster input

    • False: undirected

    • True: directed

  • metric: the similarity metrics used for generating the clusters

    • pli_qcov: Protein-ligand interaction similarity between aligned ligand-binding region (pocket) residues of two systems.

    • pli_unique_qcov: Protein-ligand interaction similarity between aligned pocket residues of two systems, taking only unique interaction types into consideration.

    • pocket_fident: Pocket region sequence identity of the ligand-binding (pocket) region of a system to a (possibly non-pocket) region of another system.

    • pocket_fident_qcov: Sequence identity between ligand binding region (pocket) of two systems.

    • pocket_lddt: Structural similarity between ligand-binding region (pocket) of a system to any region (possibly non-pocket) of another system.

    • pocket_lddt_qcov: Structural similarity between the ligand-binding regions (pockets) of two systems.

    • pocket_qcov: Query coverage between ligand-binding region of two systems.

    • protein_fident_max: Local sequence identity between components of two systems, aggregated by max score across all pairs of protein chains or ligand chains.

    • protein_fident_qcov_max: Global protein sequence identity between components of two systems multiplied by query system coverage, aggregated by max score across all pairs of protein chains or ligand chains.

    • protein_fident_qcov_weighted_max: Global protein sequence identity between components of two systems multiplied by query system coverage, aggregated by length-weighted max score across all pairs of protein chains or ligand chains.

    • protein_fident_qcov_weighted_sum: Global protein sequence identity between components of two systems multiplied by query system coverage, aggregated by length-weighted sum of scores across mapped protein or ligand chains.

    • protein_fident_weighted_max: Local sequence identity between components of two systems, aggregated by length-weighted max score across all pairs of protein chains or ligand chains.

    • protein_fident_weighted_sum: Local sequence identity between components of two systems, aggregated by length-weighted sum of scores across mapped protein or ligand chains.

    • protein_lddt_max: Local structural similarity between chains of two systems, aggregated by max score across all pairs of protein chains or ligand chains.

    • protein_lddt_qcov_max: Global protein structural similarity multiplied by query system coverage, aggregated by max score across all pairs of protein chains or ligand chains.

    • protein_lddt_qcov_weighted_max: Global protein structural similarity multiplied by query system coverage, aggregated by length-weighted max score across all pairs of protein chains or ligand chains.

    • protein_lddt_qcov_weighted_sum: Global protein structural similarity multiplied by query system coverage, aggregated by length-weighted sum of scores across mapped protein or ligand chains.

    • protein_lddt_weighted_max: Local structural similarity between chains of two systems, aggregated by length-weighted max score across all pairs of protein chains or ligand chains.

    • protein_lddt_weighted_sum: Local structural similarity between chains of two systems, aggregated by length-weighted sum of scores across mapped protein or ligand chains.

    • protein_qcov_weighted_sum: Global protein query coverage, aggregated by length-weighted sum of scores across mapped protein or ligand chains.

    • protein_seqsim_max: Global protein sequence similarity between components of two systems, aggregated by max score across all pairs of protein chains or ligand chains.

    • protein_seqsim_qcov_max: Global protein sequence similarity between components of two systems multiplied by query system coverage, aggregated by max score across all pairs of protein chains or ligand chains.

    • protein_seqsim_qcov_weighted_max: Global protein sequence similarity between components of two systems multiplied by query system coverage, aggregated by length-weighted max score across all pairs of protein chains or ligand chains.

    • protein_seqsim_qcov_weighted_sum: Global protein sequence similarity between components of two systems multiplied by query system coverage, aggregated by length-weighted sum of scores across mapped protein or ligand chains.

    • protein_seqsim_weighted_max: Global protein sequence similarity between components of two systems, aggregated by length-weighted max score across all pairs of protein chains or ligand chains.

    • protein_seqsim_weighted_sum: Global protein sequence similarity between components of two systems, aggregated by length-weighted sum of scores across mapped protein or ligand chains.

  • threshold: similarity threshold in percent.

    • 50

    • 70

    • 95

    • 100
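
The nested layout above encodes each parameter as a `key=value` path segment. The following sketch builds the path of a single cluster file from those parameters. The function name and the validation rules are assumptions read off the directory tree shown in this section (for example, that `cluster=communities` only has a `directed=False` partition). They are not taken from the PLINDER code.

```python
from pathlib import PurePosixPath

# Options read from the directory tree and threshold list above.
ALLOWED_DIRECTED = {"communities": {False}, "components": {False, True}}
THRESHOLDS = {50, 70, 95, 100}


def cluster_file(root: str, cluster: str, directed: bool, metric: str, threshold: int) -> PurePosixPath:
    """Build the path of one cluster-label file from its partition keys."""
    if cluster not in ALLOWED_DIRECTED:
        raise ValueError(f"unknown cluster type: {cluster}")
    if directed not in ALLOWED_DIRECTED[cluster]:
        raise ValueError(f"cluster={cluster} has no directed={directed} partition")
    if threshold not in THRESHOLDS:
        raise ValueError(f"unsupported threshold: {threshold}")
    return (
        PurePosixPath(root)
        / "clusters"
        / f"cluster={cluster}"
        / f"directed={directed}"
        / f"metric={metric}"
        / f"threshold={threshold}.parquet"
    )


print(cluster_file("2024-06/v2", "communities", False, "pli_qcov", 50))
```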

Splits (splits/)#

This directory contains split files and the configs used to generate them.

  • split.parquet: listing the split category for each system

  • split.yaml: the config used to generate the split

split.parquet#

| Name | Type | Description |
|---|---|---|
| system_id | str | The PLINDER system ID |
| split | str | Split category: either train (training set), test (test set), val (validation set) or removed (removed for de-leaking purposes) |
| cluster | str | Cluster label used in sampling the test set |
| cluster_for_val_split | str | Cluster label used in sampling the validation set |
| uniqueness | str | System label used to remove redundant systems from the split |
| system_pass_validation_criteria | bool | Does the system pass the crystal quality criteria for test? |
| system_pass_statistics_criteria | bool | Does the system fit the statistics criteria for test? |
| system_proper_num_ligand_chains | int | Number of ligand entries in a system that are not classified as ion or artifact (i.e. “proper” ligands) |
| system_proper_pocket_num_residues | int | Total number of pocket residues within 6 Å of the “proper” ligand(s) in a system |
| system_proper_num_interactions | int | Total number of PLI interactions to the “proper” ligand(s) in a system |
| system_proper_ligand_max_molecular_weight | float | Maximum molecular weight of the “proper” ligand(s) in a system |
| system_has_binding_affinity | bool | Does the system have a ligand with an annotated binding affinity? |
| system_has_apo_or_pred | bool | Does the system have either an apo or a pred structure linked? |
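
To illustrate how these columns might be used, the sketch below counts systems per split category and selects test systems that have a linked apo or predicted structure. The rows and system IDs are hypothetical placeholders, not values from the dataset. With the real file, the same logic would apply to a table loaded from split.parquet.

```python
from collections import Counter

# Hypothetical rows carrying a subset of the split.parquet columns.
rows = [
    {"system_id": "sys_a", "split": "train", "system_has_apo_or_pred": True},
    {"system_id": "sys_b", "split": "train", "system_has_apo_or_pred": False},
    {"system_id": "sys_c", "split": "val", "system_has_apo_or_pred": True},
    {"system_id": "sys_d", "split": "test", "system_has_apo_or_pred": True},
    {"system_id": "sys_e", "split": "removed", "system_has_apo_or_pred": False},
]

# Number of systems in each split category.
split_sizes = Counter(row["split"] for row in rows)

# Test systems that can be paired with an apo or predicted structure.
test_with_links = [
    row["system_id"]
    for row in rows
    if row["split"] == "test" and row["system_has_apo_or_pred"]
]

print(dict(split_sizes))
print(test_with_links)
```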

The content of split.yaml is described below:

split:
  graph_configs: # Similarity graph configuration
  - metric: pli_unique_qcov # Metric used to generate the base graph from which all partitioning is done.
    threshold: 30 # Threshold used to generate the base graph from which all partitioning is done.
    depth: 1 # Depth at which the neighbors are defined.
  - metric: protein_seqsim_weighted_sum # Same as above
    threshold: 30 # Same as above
    depth: 1 # Same as above
  mms_unique_quality_count: 3 # How many unique congeneric IDs passing quality to consider as MMS
  ligand_cluster_metric: Tanimoto_similarity_max # which metric to use for ligand clusters (these are added to test from removed if they are different from train/val and corresponding leaked systems are removed from train/val)
  ligand_cluster_threshold: 50 # Which threshold to use for ligand clusters

  ligand_cluster_cluster: components # Which cluster to use for ligand clusters
  test_cluster_cluster: communities # What kind of cluster to use for sampling test
  test_cluster_metric: pli_unique_qcov # Metric to use for sampling representatives from each test cluster
  test_cluster_threshold: 50  # Threshold to use for sampling representatives from each test cluster
  test_cluster_directed: false # Directed to use for sampling representatives from each test cluster
  num_test_representatives: 2 # Max number of representatives from each test cluster
  num_per_entry_pdb_id_and_unique_ccd_codes: 1 # Max number of systems to choose per entry pdb id and unique ccd codes
  min_test_cluster_size: 5 # Test should not be singletons
  min_test_leakage_count: 30  # Test should not be too unique
  max_test_leakage_count: 1000 # Test should not be in too big communities or cause too many train cases to be removed
  max_removed_fraction: 0.2 # Maximum fraction of systems that can be removed due to test set selection
  num_test: 1000 # test set size
  val_cluster_cluster: components # What kind of cluster to use for sampling val
  val_cluster_metric: pocket_qcov # Metric to use for splitting train and val
  val_cluster_threshold: 50  # Threshold to use for splitting train and val
  val_cluster_directed: false # Directed to use for splitting train and val
  num_val_representatives: 3 # Max number of representatives from each val cluster
  min_val_cluster_size: 30  # Val should not be singletons
  num_val: 1000  # Val set size
  min_max_pli: # Test/val should not have too few or too many interactions
  - 3
  - 50
  min_max_pocket: # Test/val should not have too few or too many pocket residues
  - 5
  - 100
  min_max_ligand: # Test/val should not have too small or too large ligands
  - 200
  - 800
  test_additional_criteria: # Additional criteria, given as (column, operator, value), applied when selecting test systems
  - - system_pass_validation_criteria # Indicator of whether a system is passing validation criteria
    - ==
    - 'True'
  - - system_pass_statistics_criteria # Indicator of whether a system is passing statistic criteria
    - ==
    - 'True'
  - - system_num_ligands_in_biounit # Number of ligands in the biounit.
    - <=
    - 20
  priority_columns: # Priority columns to use for scoring systems with a weight attached to each column
    system_ligand_has_cofactor: -40.0
    leakage_count: -1.0
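
As a rough sketch of how the `min_max_*` ranges in this config could be applied, the snippet below checks one system against all three ranges. The pairing between each range and a split.parquet column, and the use of inclusive bounds, are assumptions made for illustration. The actual split pipeline may apply these limits differently.

```python
# Ranges copied from the split.yaml content above.
config = {
    "min_max_pli": [3, 50],
    "min_max_pocket": [5, 100],
    "min_max_ligand": [200, 800],
}

# Assumed pairing between each range and a split.parquet column.
RANGE_COLUMNS = {
    "min_max_pli": "system_proper_num_interactions",
    "min_max_pocket": "system_proper_pocket_num_residues",
    "min_max_ligand": "system_proper_ligand_max_molecular_weight",
}


def within_ranges(system: dict, cfg: dict) -> bool:
    """Return True if every checked column lies inside its [min, max] range."""
    for key, column in RANGE_COLUMNS.items():
        low, high = cfg[key]
        if not low <= system[column] <= high:
            return False
    return True


# Hypothetical system: 12 interactions, 24 pocket residues, 350.4 Da ligand.
example = {
    "system_proper_num_interactions": 12,
    "system_proper_pocket_num_residues": 24,
    "system_proper_ligand_max_molecular_weight": 350.4,
}
print(within_ranges(example, config))
```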

Linked structures (linked_structures/)#

This directory contains the apo and predicted structures linked to PLINDER systems. These structures are intended for augmenting the PLINDER dataset, e.g. for flexible docking or pocket prediction.

The files are grouped into zipped subdirectories by the two_char_code of the system. Each unzipped subdirectory contains pred and apo subfolders, which in turn contain folders named by system_id. Inside each apo/{system_id} and pred/{system_id} folder is another directory containing a superposed system: {source_id}_{chain_id}/superposed.cif.

  • For apo structures, {source_id} is the pdb_id and {chain_id} is the source chain identifier.

  • For predicted structures, {source_id} is the uniprot_id used in AF2DB and {chain_id} is set to A.
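
The path layout described in this section can be sketched as a small helper. The function name is hypothetical, and the source identifiers in the usage lines are illustrative only: `1b49` with chain `A` is borrowed from the apo scores example later in this document, and `P12345` is a placeholder UniProt ID.

```python
from pathlib import PurePosixPath


def linked_structure_path(kind: str, system_id: str, source_id: str, chain_id: str = "A") -> PurePosixPath:
    """Path of a superposed linked structure inside an unzipped {two_char_code} folder."""
    if kind not in ("apo", "pred"):
        raise ValueError(f"kind must be 'apo' or 'pred', got {kind!r}")
    return PurePosixPath(kind) / system_id / f"{source_id}_{chain_id}" / "superposed.cif"


# Apo structure: source_id is a pdb_id, chain_id is the source chain.
print(linked_structure_path("apo", "1b5d__1__1.A_1.B__1.D", "1b49", "A"))
# Predicted structure: source_id is a UniProt ID, chain_id defaults to A.
print(linked_structure_path("pred", "1b5d__1__1.A_1.B__1.D", "P12345"))
```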

Miscellaneous#

Here we briefly describe subdirectories and their files that are not part of the main dataset but are used in the dataset processing pipeline. These files should be considered intermediate products and are not intended to be used directly, only for development purposes.

Database processed files (dbs/)#

This directory contains intermediate files listing the PDB structures that were successfully processed and scored by the Foldseek and MMseqs2 pipeline. They are used during splitting to ensure that only successfully computed systems are considered.

|-- subdbs
|   |-- apo.csv
|   |-- holo.csv
|   |-- pred.csv

Each file is a CSV with a single column: pdb_id.
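
A minimal sketch of how such a file could be used to keep only systems whose PDB entry was processed is shown below. The CSV contents are hypothetical, and the snippet assumes the PDB code is the part of a system ID before the first `__`.

```python
import csv
import io

# Hypothetical contents of subdbs/holo.csv (single column: pdb_id).
holo_csv = io.StringIO("pdb_id\n7o00\n7o04\n7o08\n")
processed = {row["pdb_id"] for row in csv.DictReader(holo_csv)}

# Candidate systems; only those with a processed PDB entry are kept.
candidates = ["7o00__1__1.A_1.B__1.D", "7o09__1__1.A__1.C"]
usable = [s for s in candidates if s.split("__")[0] in processed]

print(usable)
```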

Raw annotations (entries/)#

This directory contains intermediate raw annotation files prior to consolidation. The files are grouped into zipped subdirectories by two_char_code. Each subdirectory contains {pdb_id}.json files with raw annotations for every system found in the given pdb_id.

Small molecule fingerprints (fingerprints/)#

Tables containing the ligand fingerprints used to calculate the ligand similarity scores stored in ligand_scores.

  • ligands_per_inchikey_ecfp4.npy: numpy array of all-vs-all ECFP4 similarity.

  • ligands_per_system.parquet: table linking PLINDER systems to their ligands, including ligand ID, SMILES, InChIKey, etc.

  • ligands_per_inchikey.parquet: subset of ligands_per_system.parquet with reduced number of columns.

Small molecule data (ligands/)#

Ligand data expanded from entries for computing similarity, saved in distributed files {hashid}.parquet.

E.g.:

  pdb_id              system_id                      ligand_rdkit_canonical_smiles ligand_ccd_code                   ligand_id                    inchikeys
0   7o00  7o00__1__1.A_1.B__1.D  CC(=O)N[C@H]1CO[C@H](CO)[C@@H](OC2O[C@H](CO)[C...         HSR-HSR  7o00__1__1.A_1.B__1.D__1.D  JHPFQHGUNGJQIZ-BQBDUENHSA-N
1   7o00  7o00__1__1.A_1.B__1.E  CC(=O)N[C@@H]1[C@@H](O)[C@H](O)[C@@H](CO)O[C@H]1O             HSR  7o00__1__1.A_1.B__1.E__1.E  OVRNDRQMDRJTHS-FMDGEEDCSA-N
2   7o04      7o04__1__1.A__1.G                        CNCc1cc([N+](=O)[O-])ccc1Cl             4AV      7o04__1__1.A__1.G__1.G  YRTNCUPHKWUHMQ-UHFFFAOYSA-N
3   7o08      7o08__1__1.A__1.C  CC1(C)CCN(Cc2ccc(NCC3(O)CCN(c4cc(NCc5ccccc5)nc...             UXE      7o08__1__1.A__1.C__1.C  GTLDMCHZRAFXCB-UHFFFAOYSA-N
4   7o09      7o09__1__1.A__1.C  CC1(C)CCN(Cc2ccc(N3CCOC4(CCN(c5cc(NCc6ccccc6)n...             UXK      7o09__1__1.A__1.C__1.C  RJEWLHZZXYDBNT-UHFFFAOYSA-N
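
The identifiers in this example appear to follow a `__`-delimited pattern. The sketch below splits a system_id and a ligand_id into their parts, assuming the layout `{pdb_id}__{biounit}__{receptor chains}__{ligand chains}` with chains joined by `_`, and a ligand_id formed by appending `__{ligand chain}` to the system_id. This layout is inferred from the rows above, and the names used here are hypothetical.

```python
from typing import NamedTuple, Tuple


class SystemId(NamedTuple):
    pdb_id: str
    biounit: str
    receptor_chains: Tuple[str, ...]
    ligand_chains: Tuple[str, ...]


def parse_system_id(system_id: str) -> SystemId:
    """Split a system_id into its four assumed components."""
    pdb_id, biounit, receptors, ligands = system_id.split("__")
    return SystemId(pdb_id, biounit, tuple(receptors.split("_")), tuple(ligands.split("_")))


def split_ligand_id(ligand_id: str) -> Tuple[str, str]:
    """Split a ligand_id into (system_id, ligand chain)."""
    system_id, ligand_chain = ligand_id.rsplit("__", 1)
    return system_id, ligand_chain


print(parse_system_id("7o00__1__1.A_1.B__1.D"))
print(split_ligand_id("7o04__1__1.A__1.G__1.G"))
```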

Small molecule similarity scores (ligand_scores/)#

Tables containing the pairwise similarity scores between ligands, saved in distributed files {hashid}.parquet.

E.g.:

   query_ligand_id  target_ligand_id  tanimoto_similarity_max
0            35300              6943                      100
1            35300             35300                      100
2            35300             13911                       94
3            35300             44243                       90
4            35300             24003                       90
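
As a small illustration of querying such a table, the sketch below finds the ligands similar to a query ligand at or above a given threshold, using the five rows shown above. The function name is hypothetical, and the scores are assumed to be percentages.

```python
# Rows copied from the example table above:
# (query_ligand_id, target_ligand_id, tanimoto_similarity_max)
scores = [
    (35300, 6943, 100),
    (35300, 35300, 100),
    (35300, 13911, 94),
    (35300, 44243, 90),
    (35300, 24003, 90),
]


def similar_ligands(rows, query: int, threshold: int) -> list:
    """Targets scoring at least `threshold` against `query`, excluding the self-match."""
    return [
        target
        for q, target, score in rows
        if q == query and target != query and score >= threshold
    ]


print(similar_ligands(scores, 35300, 94))
```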

Small molecule matched molecular pairs (mmp/)#

Files containing the ligand matched molecular pairs (MMP) and matched molecular series (MMS).

  • plinder_mmp_series.parquet: matched molecular series (MMS) linked to PLINDER systems.

  • plinder_mms.csv.gz: compressed mmpdb index file containing the matched molecular pairs (MMP) of all ligands in PLINDER annotation table.

Protein similarity dataset (scores/)#

Tables containing the protein and pocket similarity scores between pairs of systems.

|-- search_db=apo
|   |-- apo.parquet
|-- search_db=holo
|   |-- {chunck_id}.parquet
|-- search_db=pred
|   |-- pred.parquet

All the parquet files have the same columns. E.g.:

                    query_system target_system protein_mapping protein_mapper  ...    source                            metric  mapping search_db
1070886    1b5d__1__1.A_1.B__1.D        1b49_A         1.A:0.A       foldseek  ...    mmseqs         protein_qcov_weighted_max  1.A:0.A       apo
1070887    1b5d__1__1.A_1.B__1.D        1b49_A         1.A:0.A       foldseek  ...    mmseqs                  protein_qcov_max  1.A:0.A       apo
1070888    1b5d__1__1.A_1.B__1.D        1b49_A         1.A:0.A       foldseek  ...      both       protein_fident_weighted_max  1.A:0.A       apo
1070889    1b5d__1__1.A_1.B__1.D        1b49_A         1.A:0.A       foldseek  ...      both                protein_fident_max  1.A:0.A       apo
1070890    1b5d__1__1.A_1.B__1.D        1b49_A         1.A:0.A       foldseek  ...    mmseqs  protein_fident_qcov_weighted_max  1.A:0.A       apo
...                          ...           ...             ...            ...  ...       ...                               ...      ...       ...
213471528      7eek__1__1.A__1.I        1uor_A         1.A:0.A       foldseek  ...  foldseek    protein_lddt_qcov_weighted_max  1.A:0.A       apo
213471529      7eek__1__1.A__1.I        1uor_A         1.A:0.A       foldseek  ...  foldseek             protein_lddt_qcov_max  1.A:0.A       apo
213471536      7eek__1__1.A__1.I        1uor_A         1.A:0.A       foldseek  ...  foldseek                       pocket_lddt     None       apo
213471540      7eek__1__1.A__1.I        6zl1_A         1.A:0.A       foldseek  ...  foldseek                       pocket_lddt     None       apo
213471541      7eek__1__1.A__1.I        6zl1_B         1.A:0.B       foldseek  ...  foldseek                       pocket_lddt     None       apo

apo.parquet columns#

| Name | Type | Description |
|---|---|---|
| query_system | str | The PLINDER system ID of the query system |
| target_system | str | The PLINDER system ID of the target system |
| protein_mapping | str | Chain mapping between the query system and the target system |
| protein_mapper | str | Alignment method used for mapping |
| similarity | int | Value of the similarity metric |
| source | str | Source of the similarity metric: foldseek, mmseqs or both |
| metric | str | Similarity metric of interest |
| mapping | str | Local region mapping between the query system and the target system |
| search_db | str | Search database type: apo, holo or pred |
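
Because the table is in long format (one row per query, target and metric), a common first step is to filter by metric and pick the best-scoring target per query. The sketch below does this with plain Python. The similarity values are hypothetical, since that column is elided in the example above, and the function name is not part of any PLINDER API.

```python
# Rows modelled on the example above; similarity values are hypothetical.
rows = [
    {"query_system": "7eek__1__1.A__1.I", "target_system": "1uor_A", "metric": "pocket_lddt", "similarity": 40},
    {"query_system": "7eek__1__1.A__1.I", "target_system": "6zl1_A", "metric": "pocket_lddt", "similarity": 85},
    {"query_system": "7eek__1__1.A__1.I", "target_system": "6zl1_B", "metric": "pocket_lddt", "similarity": 80},
    {"query_system": "7eek__1__1.A__1.I", "target_system": "1uor_A", "metric": "protein_lddt_qcov_max", "similarity": 55},
]


def best_targets(rows, metric: str) -> dict:
    """Map each query system to its (target_system, similarity) with the highest score."""
    best = {}
    for row in rows:
        if row["metric"] != metric:
            continue
        current = best.get(row["query_system"])
        if current is None or row["similarity"] > current[1]:
            best[row["query_system"]] = (row["target_system"], row["similarity"])
    return best


print(best_targets(rows, "pocket_lddt"))
```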