Dataset#

Dataset reference#

Directory structure#

2024-06/
|-- v2
    |-- index # Consolidated tabular annotations
    |   |-- annotation_table.parquet
    |   |-- annotation_table_nonredundant.parquet
    |-- systems  # Structure files for all systems (split by `two_char_code` and zipped)
    |   |-- {two_char_code}.zip
    |-- clusters # Pre-calculated cluster labels derived from the protein similarity dataset
    |   |-- cluster=communities
    |       |-- ...
    |   |-- cluster=components
    |       |-- ...
    |-- splits # Split files and the configs used to generate them (if available)
    |   |-- split.parquet
    |   |-- split.yaml
    |-- linked_structures # Apo and predicted structures linked to their holo systems
    |   |-- {two_char_code}.zip
    |-- links # Apo and predicted structures similarity to their holo structures
    |   |-- apo_links.parquet
    |   |-- pred_links.parquet
    |
--------------------------------------------------------------------------------
                            miscellaneous data below
--------------------------------------------------------------------------------
    |
    |-- dbs # CSVs containing the raw files and IDs in the foldseek and mmseqs sub-databases
    |   |-- subdbs
    |       |-- apo.csv
    |       |-- holo.csv
    |       |-- pred.csv
    |-- entries # Raw annotations prior to consolidation (split by `two_char_code` and zipped)
    |   |-- {two_char_code}.zip
    |-- fingerprints # Index mapping files for the ligand similarity dataset
    |   |-- ligands_per_inchikey.parquet
    |   |-- ligands_per_inchikey_ecfp4.npy
    |   |-- ligands_per_system.parquet
    |-- ligand_scores # Ligand similarity parquet dataset
    |   |-- {hashid}.parquet
    |-- ligands # Ligand data expanded from entries for computing similarity
    |   |-- {hashid}.parquet
    |-- mmp # Ligand matched molecular pairs (MMP) and series (MMS) data
    |   |-- plinder_mmp_series.parquet
    |   |-- plinder_mms.csv.gz
    |-- scores # Protein similarity parquet dataset
    |   |-- search_db=apo
    |       |-- apo.parquet
    |   |-- search_db=holo
    |       |-- {chunck_id}.parquet
    |   |-- search_db=pred
    |       |-- pred.parquet

The index, systems, clusters, splits, links and linked_structures directories are described in detail below; the remaining directories are covered in the Miscellaneous section.

Annotation tables (`index/`)#

Tables that list all systems along with their annotations.

annotation_table.parquet: Lists all systems and their annotations.
annotation_table_nonredundant.parquet: Subset of annotation_table.parquet with redundant systems removed.

Name	Type	Description	Mandatory	Example

Mandatory: The column has a non-empty, non-NaN value for all PLINDER systems.
Example: An example non-empty, non-NaN value for the given column in a PLINDER system.

Systems (`systems/`)#

This directory contains all the systems used in the dataset. The systems are grouped into zipped subdirectories named after the two penultimate characters of the PDB code (two_char_code). This grouping keeps loading and querying reasonably fast.

Each unzipped subdirectory contains folders named by system_id, which in turn contain the structure files.

|-- {two_char_code}
    |-- {system_id}
        |-- chain_mapping.json # Mapping between the chains in the receptor and the chains in the system
        |-- ligand_files # Mapping between the ligand in the receptor and the ligands in the system
        |-- receptor.cif  # Receptor mmcif file
        |-- receptor.pdb # Receptor pdb file
        |-- sequences.fasta # Receptor sequence fasta
        |-- system.cif # System mmcif file
        |-- water_mapping.json # Receptor binding site water map json file
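
As a rough illustration of the grouping described above, the sketch below derives the archive that should hold a given system. It assumes that `two_char_code` is the two characters preceding the last character of the PDB code (one reading of "two penultimate characters") and that the PDB code is the part of `system_id` before the first `__`. The helper names `two_char_code` and `system_archive` are hypothetical and not part of any PLINDER API.

```python
from pathlib import PurePosixPath


def two_char_code(pdb_id: str) -> str:
    """Return the two characters preceding the last one (assumed definition)."""
    return pdb_id[-3:-1]


def system_archive(system_id: str, root: str = "systems") -> PurePosixPath:
    """Build the relative path of the zip archive expected to hold a system."""
    pdb_id = system_id.split("__", 1)[0]  # assumed: PDB code precedes the first "__"
    return PurePosixPath(root) / f"{two_char_code(pdb_id)}.zip"


print(system_archive("7o00__1__1.A_1.B__1.D"))
```

Under these assumptions, the structure files for this system would sit in a folder named after the full `system_id` once the archive is unzipped.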

Clusters (`clusters/`)#

This directory contains pre-calculated cluster labels derived from the protein and pocket similarity dataset. The nested structure is as follows:

|-- cluster=communities
    |-- directed=False
        |-- metric={metric}
            |-- threshold={threshold}.parquet
|-- cluster=components
    |-- directed=False
        |-- metric={metric}
            |-- threshold={threshold}.parquet
    |-- directed=True
        |-- metric={metric}
            |-- threshold={threshold}.parquet

cluster: the clustering algorithm used
- communities: clusters derived from a community detection algorithm
- components: clusters derived from the connected components of the similarity graph
directed: type of graph used for cluster input
- False: undirected
- True: directed
metric: the similarity metrics used for generating the clusters
- pli_qcov: Protein-ligand interaction similarity between aligned ligand-binding region (pocket) residues of two systems.
- pli_unique_qcov: Protein-ligand interaction similarity between aligned pocket residues of two systems, taking only unique interaction type into consideration.
- pocket_fident: Sequence identity of the ligand-binding (pocket) region of a system to a (possibly non-pocket) region of another system.
- pocket_fident_qcov: Sequence identity between ligand binding region (pocket) of two systems.
- pocket_lddt: Structural similarity between ligand-binding region (pocket) of a system to any region (possibly non-pocket) of another system.
- pocket_lddt_qcov: Structural similarity between ligand-binding regions (pockets) of two systems.
- pocket_qcov: Query coverage between ligand-binding region of two systems.
- protein_fident_max: Local sequence identity between components of two systems, aggregated by max score across all pairs of protein chains or ligand chains.
- protein_fident_qcov_max: Global protein sequence identity between components of two systems multiplied by query system coverage, aggregated by max score across all pairs of protein chains or ligand chains.
- protein_fident_qcov_weighted_max: Global protein sequence identity between components of two systems multiplied by query system coverage, aggregated by length-weighted max score across all pairs of protein chains or ligand chains.
- protein_fident_qcov_weighted_sum: Global protein sequence identity between components of two systems multiplied by query system coverage, aggregated by length-weighted sum of scores across mapped protein or ligand chains.
- protein_fident_weighted_max: Local sequence identity between components of two systems, aggregated by length-weighted max score across all pairs of protein chains or ligand chains.
- protein_fident_weighted_sum: Local sequence identity between components of two systems, aggregated by length-weighted sum of scores across mapped protein or ligand chains.
- protein_lddt_max: Local structural similarity between chains of two systems, aggregated by max score across all pairs of protein chains or ligand chains.
- protein_lddt_qcov_max: Global protein structural similarity multiplied by query system coverage, aggregated by max score across all pairs of protein chains or ligand chains.
- protein_lddt_qcov_weighted_max: Global protein structural similarity multiplied by query system coverage, aggregated by length-weighted max score across all pairs of protein chains or ligand chains.
- protein_lddt_qcov_weighted_sum: Global protein structural similarity multiplied by query system coverage, aggregated by length-weighted sum of scores across mapped protein or ligand chains.
- protein_lddt_weighted_max: Local structural similarity between chains of two systems, aggregated by length-weighted max score across all pairs of protein chains or ligand chains.
- protein_lddt_weighted_sum: Local structural similarity between chains of two systems, aggregated by length-weighted sum of scores across mapped protein or ligand chains.
- protein_qcov_weighted_sum: Global protein query coverage, aggregated by length-weighted sum of scores across mapped protein or ligand chains.
- protein_seqsim_max: Global protein sequence similarity between components of two systems, aggregated by max score across all pairs of protein chains or ligand chains.
- protein_seqsim_qcov_max: Global protein sequence similarity between components of two systems multiplied by query system coverage, aggregated by max score across all pairs of protein chains or ligand chains.
- protein_seqsim_qcov_weighted_max: Global protein sequence similarity between components of two systems multiplied by query system coverage, aggregated by length-weighted max score across all pairs of protein chains or ligand chains.
- protein_seqsim_qcov_weighted_sum: Global protein sequence similarity between components of two systems multiplied by query system coverage, aggregated by length-weighted sum of scores across mapped protein or ligand chains.
- protein_seqsim_weighted_max: Global protein sequence similarity between components of two systems, aggregated by length-weighted max score across all pairs of protein chains or ligand chains.
- protein_seqsim_weighted_sum: Global protein sequence similarity between components of two systems, aggregated by length-weighted sum of scores across mapped protein or ligand chains.
threshold: similarity threshold in percent.
- …
- 50
- 70
- 95
- 100
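
The nested layout above uses `key=value` directory names, so the location of a cluster file can be composed from its four parameters. The following is a minimal sketch of that composition; the function name `cluster_file` is hypothetical, and the check for directed communities only reflects what the listing above shows, not a documented rule.

```python
from pathlib import PurePosixPath


def cluster_file(cluster: str, metric: str, threshold: int, directed: bool = False) -> PurePosixPath:
    """Compose the relative path of a cluster label file from its partition keys."""
    if cluster == "communities" and directed:
        # The listing above shows only directed=False under cluster=communities.
        raise ValueError("no directed=True partition is listed for communities")
    return (
        PurePosixPath(f"cluster={cluster}")
        / f"directed={directed}"
        / f"metric={metric}"
        / f"threshold={threshold}.parquet"
    )


print(cluster_file("components", "pocket_qcov", 50))
```

The returned path is relative to the `clusters/` directory.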

Splits (`splits/`)#

This directory contains split files and the configs used to generate them.

split.parquet: lists the split category for each system
split.yaml: the config used to generate the split

`split.parquet`#
Name	Type	Description
system_id	str	The PLINDER system ID
split	str	Split category: either `train` (training set), `test` (test set), `val` (validation set) or `removed` (removed for de-leaking purposes)
cluster	str	Cluster label used in sampling test set
cluster_for_val_split	str	Cluster label used in sampling validation set.
uniqueness	str	system label used to remove redundant systems from the split
system_pass_validation_criteria	bool	does a system pass the crystal quality criteria for test?
system_pass_statistics_criteria	bool	does a system fit the statistics criteria for test?
system_proper_num_ligand_chains	int	number of ligand entries in a system that are not classified as ion or artifact (i.e. “proper” ligands)
system_proper_pocket_num_residues	int	total number of pocket residues within 6 Å of any “proper” ligand in a system
system_proper_num_interactions	int	total number of protein-ligand interactions (PLI) involving the “proper” ligands in a system
system_proper_ligand_max_molecular_weight	float	maximum molecular weight among the “proper” ligands in a system
system_has_binding_affinity	bool	does the system have a ligand with an annotated binding affinity?
system_has_apo_or_pred	bool	does the system have either `apo` or `pred` structure linked?

The content of split.yaml is described below:

split:
  graph_configs: # Similarity graph configuration
  - metric: pli_unique_qcov # Metric used to generate the base graph from which all partitioning is done.
    threshold: 30 # Threshold used to generate the base graph from which all partitioning is done.
    depth: 1 # Depth at which the neighbors are defined.
  - metric: protein_seqsim_weighted_sum # Same as above
    threshold: 30 # Same as above
    depth: 1 # Same as above
  mms_unique_quality_count: 3 # How many unique congeneric IDs passing quality to consider as MMS
  ligand_cluster_metric: Tanimoto_similarity_max # which metric to use for ligand clusters (these are added to test from removed if they are different from train/val and corresponding leaked systems are removed from train/val)
  ligand_cluster_threshold: 50 # Which threshold to use for ligand clusters

  ligand_cluster_cluster: components # Which cluster to use for ligand clusters
  test_cluster_cluster: communities # What kind of cluster to use for sampling test
  test_cluster_metric: pli_unique_qcov # Metric to use for sampling representatives from each test cluster
  test_cluster_threshold: 50  # Threshold to use for sampling representatives from each test cluster
  test_cluster_directed: false # Directed to use for sampling representatives from each test cluster
  num_test_representatives: 2 # Max number of representatives from each test cluster
  num_per_entry_pdb_id_and_unique_ccd_codes: 1 # Max number of systems to choose per entry pdb id and unique ccd codes
  min_test_cluster_size: 5 # Test should not be singletons
  min_test_leakage_count: 30  # Test should not be too unique
  max_test_leakage_count: 1000 # Test should not be in too big communities or cause too many train cases to be removed
  max_removed_fraction: 0.2 # Maximum fraction of systems that can be removed due to test set selection
  num_test: 1000 # test set size
  val_cluster_cluster: components # What kind of cluster to use for sampling val
  val_cluster_metric: pocket_qcov # Metric to use for splitting train and val
  val_cluster_threshold: 50  # Threshold to use for splitting train and val
  val_cluster_directed: false # Directed to use for splitting train and val
  num_val_representatives: 3 # Max number of representatives from each val cluster
  min_val_cluster_size: 30  # Val should not be singletons
  num_val: 1000  # Val set size
  min_max_pli: # Test/val should not have too few or too many interactions
  - 3
  - 50
  min_max_pocket: # Test/val should not have too few or too many pocket residues
  - 5
  - 100
  min_max_ligand: # Test/val should not have too small or too large ligands
  - 200
  - 800
  test_additional_criteria: # Additional criteria, each given as [column, operator, value], used when selecting test systems
  - - system_pass_validation_criteria # Indicator of whether a system is passing validation criteria
    - ==
    - 'True'
  - - system_pass_statistics_criteria # Indicator of whether a system is passing statistic criteria
    - ==
    - 'True'
  - - biounit_num_ligands # Number of ligands in the biounit.
    - <=
    - 20
  priority_columns: # Priority columns to use for scoring systems with a weight attached to each column
    system_ligand_has_cofactor: -40.0
    leakage_count: -1.0
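
How these settings are applied internally is not described here. The sketch below shows only one plausible reading, in which each `test_additional_criteria` entry is a `[column, operator, value]` triple that a system must satisfy and the `priority_columns` weights are combined into a weighted sum. The functions `passes` and `priority_score`, the scoring rule, and the example row are all assumptions made for illustration.

```python
import operator

# Comparison operators that appear in the config above.
OPS = {"==": operator.eq, "<=": operator.le}

criteria = [
    ["system_pass_validation_criteria", "==", "True"],
    ["system_pass_statistics_criteria", "==", "True"],
    ["biounit_num_ligands", "<=", 20],
]
priority_columns = {"system_ligand_has_cofactor": -40.0, "leakage_count": -1.0}


def passes(row: dict, criteria: list) -> bool:
    """Check a row against every [column, operator, value] triple."""
    for column, op, value in criteria:
        actual = row[column]
        if isinstance(value, str):
            actual = str(actual)  # the config quotes 'True', so compare as text
        if not OPS[op](actual, value):
            return False
    return True


def priority_score(row: dict, weights: dict) -> float:
    """Weighted sum of the priority columns (assumed scoring rule)."""
    return sum(weight * float(row[column]) for column, weight in weights.items())


row = {  # hypothetical system, values invented for the example
    "system_pass_validation_criteria": True,
    "system_pass_statistics_criteria": True,
    "biounit_num_ligands": 4,
    "system_ligand_has_cofactor": False,
    "leakage_count": 35,
}
print(passes(row, criteria), priority_score(row, priority_columns))
```

With negative weights, a system containing a cofactor or having a high leakage count would receive a lower score under this reading.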

Linked structures (`linked_structures/`)#

This directory contains the linked apo and predicted structures for PLINDER systems. These structures are intended for augmenting the PLINDER dataset, e.g. for flexible docking or pocket prediction. The files are grouped into zipped subdirectories by the two_char_code of the system. Each unzipped subdirectory contains pred and apo subfolders, which in turn contain folders named by system_id.

Inside each apo/{system_id} and pred/{system_id} folder is another directory containing a superposed system: {source_id}_{chain_id}/superposed.cif. For apo structures, {source_id} is the pdb_id and {chain_id} is the source chain identifier. For predicted structures, {source_id} is the uniprot_id used in AF2DB and {chain_id} is set to A.
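
The folder layout described above can be sketched as a small path builder. The function name `superposed_path` is hypothetical, the identifiers in the usage line are only illustrative, and the returned path is relative to an unzipped `{two_char_code}` subdirectory.

```python
from pathlib import PurePosixPath


def superposed_path(kind: str, system_id: str, source_id: str, chain_id: str = "A") -> PurePosixPath:
    """Relative path of a superposed structure inside an unzipped subdirectory."""
    if kind not in ("apo", "pred"):
        raise ValueError("kind must be 'apo' or 'pred'")
    return PurePosixPath(kind) / system_id / f"{source_id}_{chain_id}" / "superposed.cif"


# Illustrative identifiers only; whether this pair is actually linked is not implied.
print(superposed_path("apo", "1b5d__1__1.A_1.B__1.D", "1b49", "A"))
```

The default `chain_id="A"` matches the convention stated for predicted structures; for apo structures the source chain identifier should be passed explicitly.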

Linked systems (`links/`)#

This directory contains parquet files linking PLINDER systems to their apo and predicted structures in linked_structures/.

`{apo|pred}_links.parquet`#
Name	Type	Description
reference_system_id	str	The PLINDER system ID
id	str	The PDB or AF2DB (for `apo` and `pred`, respectively) `{source_id}_{chain_id}` tag
pocket_fident	float	sequence identity for pocket residues
pocket_lddt	float	Local Distance Difference Test (lDDT) score for the pocket residue alpha carbons as returned by Foldseek.
protein_fident_qcov_weighted_sum	float	Sum of fident * qcov for all templates, weighted by the number of residues in the template
protein_fident_weighted_sum	float	Sum of fident for all templates, weighted by the number of residues in the template
protein_lddt_weighted_sum	float	Sum of lDDT for all residues, weighted by the number of residues in the template
target_id	str	apo or pred structure `{source_id}` tag
sort_score	float	Score used to sort linked structures. This is resolution for apos and plddt for preds.
receptor_file	str	intermediate aligned linked receptor file path
ligand_files	str	intermediate file path for ligands used in calculations
num_reference_ligands	int	number of ligands in reference structure
num_model_ligands	int	number of ligands in model structure
num_reference_proteins	int	number of protein chains in reference structure
num_model_proteins	int	number of protein chains in model structure
fraction_reference_ligands_mapped	float	Fraction of reference ligands that were successfully mapped to model ligands
fraction_model_ligands_mapped	float	Fraction of model ligands that were successfully mapped to reference ligands
lddt_pli_ave	float	Average lDDT score for protein-ligand interactions
lddt_pli_wave	float	Weighted average lDDT score for protein-ligand interactions
bisy_rmsd_ave	float	Average binding-site superposed symmetry-corrected RMSD between reference and model ligands
bisy_rmsd_wave	float	Weighted average binding-site superposed symmetry-corrected RMSD between reference and model ligands
lddt_lp_ave	float	Average lDDT score for ligand poses
lddt_lp_wave	float	Weighted average lDDT score for ligand poses
posebusters_mol_pred_loaded	bool	PoseBusters metric: boolean indicator of whether the predicted ligand could be loaded
posebusters_mol_cond_loaded	bool	PoseBusters metric: boolean indicator of whether the conditional ligand could be loaded
posebusters_sanitization	bool	PoseBusters metric: boolean indicator of whether the ligand could be sanitized
posebusters_all_atoms_connected	bool	PoseBusters metric: boolean indicator of whether all atoms in the ligand are connected
posebusters_bond_lengths	bool	PoseBusters metric: boolean indicator of whether all bond lengths in the ligand are within 4 standard deviations of the mean
posebusters_bond_angles	bool	PoseBusters metric: boolean indicator of whether all bond angles in the ligand are within 4 standard deviations of the mean
posebusters_internal_steric_clash	bool	PoseBusters metric: boolean indicator of whether there are no internal steric clashes in the ligand
posebusters_aromatic_ring_flatness	bool	PoseBusters metric: boolean indicator of whether all aromatic rings in the ligand are flat
posebusters_double_bond_flatness	bool	PoseBusters metric: boolean indicator of whether all double bonds in the ligand are flat
posebusters_internal_energy	bool	PoseBusters metric: boolean indicator of whether the internal energy of the ligand is below 0 kcal/mol
posebusters_protein-ligand_maximum_distance	bool	PoseBusters metric: boolean indicator of whether the maximum distance between the ligand and the protein is less than 5 Angstrom
posebusters_minimum_distance_to_protein	bool	PoseBusters metric: boolean indicator of whether the minimum distance between the ligand and the protein is greater than 1.5 Angstrom
posebusters_minimum_distance_to_organic_cofactors	float	PoseBusters metric: Minimum distance between the ligand and any organic cofactor
posebusters_minimum_distance_to_inorganic_cofactors	bool	PoseBusters metric: Minimum distance between the ligand and any inorganic cofactor
posebusters_minimum_distance_to_waters	float	PoseBusters metric: Minimum distance between the ligand and any water molecule
posebusters_volume_overlap_with_protein	float	PoseBusters metric: Fraction of ligand volume that overlaps with the protein
posebusters_volume_overlap_with_organic_cofactors	bool	PoseBusters metric: boolean indicator of whether the share of ligand volume that intersects with the organic cofactor is less than 7.5%. The volumes are defined by the van der Waals radii around the heavy atoms scaled by 0.8.
posebusters_volume_overlap_with_inorganic_cofactors	bool	PoseBusters metric: boolean indicator of whether the share of ligand volume that intersects with the inorganic cofactor is less than 7.5%. The volumes are defined by the van der Waals radii around the heavy atoms scaled by 0.8.
posebusters_volume_overlap_with_waters	bool	PoseBusters metric: boolean indicator of whether the share of ligand volume that intersects with the linked system waters is less than 7.5%. The volumes are defined by the van der Waals radii around the heavy atoms scaled by 0.8.
fraction_reference_proteins_mapped	float	Fraction of reference protein chains with corresponding model chains
fraction_model_proteins_mapped	float	Fraction of model protein chains mapped to corresponding reference chains
lddt	float	Global lDDT score calculated over all atoms in the structure
bb_lddt	float	Global lDDT score calculated over backbone atoms (N, CA, C, O) in the structure
per_chain_lddt_ave	float	Average per-chain lDDT score calculated over all atoms
per_chain_bb_lddt_ave	float	Average per-chain lDDT score calculated over backbone atoms (N, CA, C, O)

Miscellaneous#

Here we briefly describe subdirectories and their files that are not part of the main dataset but are used in the dataset processing pipeline. These files should be considered intermediate products and are not intended to be used directly, only for development purposes.

Database processed files (`dbs/`)#

This directory contains the intermediate files of PDB structures that were successfully processed and scored by the Foldseek and MMseqs2 pipeline. It is used during splitting to ensure that only successfully computed systems are included.

|-- subdbs
|   |-- apo.csv
|   |-- holo.csv
|   |-- pred.csv

Each file is a CSV with a single column: pdb_id.

Raw annotations (`entries/`)#

This directory contains intermediate raw annotation files prior to consolidation. The files are grouped into zipped subdirectories by two_char_code. Each subdirectory contains {pdb_id}.json files with raw annotations for every system found in the given pdb_id.

Small molecule fingerprints (`fingerprints/`)#

Tables that contain all the ligand fingerprints used in calculating the ligand similarity stored in ligand_scores.

ligands_per_inchikey_ecfp4.npy: numpy array of all-vs-all ECFP4 similarity.
ligands_per_system.parquet: table linking PLINDER systems to their ligands, including ligand ID, SMILES, InChIKey, etc.
ligands_per_inchikey.parquet: subset of ligands_per_system.parquet with reduced number of columns.

Small molecule data (`ligands/`)#

Ligand data expanded from entries for computing similarity, saved in distributed files {hashid}.parquet.

E.g.

  pdb_id              system_id                      ligand_rdkit_canonical_smiles ligand_ccd_code                   ligand_id                    inchikeys
 7o00  7o00__1__1.A_1.B__1.D  CC(=O)N[C@H]1CO[C@H](CO)[C@@H](OC2O[C@H](CO)[C...         HSR-HSR  7o00__1__1.A_1.B__1.D__1.D  JHPFQHGUNGJQIZ-BQBDUENHSA-N
 7o00  7o00__1__1.A_1.B__1.E  CC(=O)N[C@@H]1[C@@H](O)[C@H](O)[C@@H](CO)O[C@H]1O             HSR  7o00__1__1.A_1.B__1.E__1.E  OVRNDRQMDRJTHS-FMDGEEDCSA-N
 7o04      7o04__1__1.A__1.G                        CNCc1cc([N+](=O)[O-])ccc1Cl             4AV      7o04__1__1.A__1.G__1.G  YRTNCUPHKWUHMQ-UHFFFAOYSA-N
 7o08      7o08__1__1.A__1.C  CC1(C)CCN(Cc2ccc(NCC3(O)CCN(c4cc(NCc5ccccc5)nc...             UXE      7o08__1__1.A__1.C__1.C  GTLDMCHZRAFXCB-UHFFFAOYSA-N
 7o09      7o09__1__1.A__1.C  CC1(C)CCN(Cc2ccc(N3CCOC4(CCN(c5cc(NCc6ccccc6)n...             UXK      7o09__1__1.A__1.C__1.C  RJEWLHZZXYDBNT-UHFFFAOYSA-N

Small molecule similarity scores (`ligand_scores/`)#

Tables that contain the pairwise similarity scores between ligands, saved in distributed files {hashid}.parquet.

E.g.

   query_ligand_id  target_ligand_id  tanimoto_similarity_max
          35300              6943                      100
          35300             35300                      100
          35300             13911                       94
          35300             44243                       90
          35300             24003                       90

Small molecule matched molecular pairs (`mmp/`)#

Files that contain all the ligand matched molecular pairs (MMP) and matched molecular series (MMS).

plinder_mmp_series.parquet: matched molecular series (MMS) linked to PLINDER systems.
plinder_mms.csv.gz: compressed mmpdb index file containing the matched molecular pairs (MMP) of all ligands in PLINDER annotation table.

Protein similarity dataset (`scores/`)#

Tables that contain all the protein and pocket similarity scores between pairs of systems.

|-- search_db=apo
|   |-- apo.parquet
|-- search_db=holo
|   |-- {chunck_id}.parquet
|-- search_db=pred
|   |-- pred.parquet

All the parquet files have the same columns in the header. E.g.

                    query_system target_system protein_mapping protein_mapper  ...    source                            metric  mapping search_db
1070886    1b5d__1__1.A_1.B__1.D        1b49_A         1.A:0.A       foldseek  ...    mmseqs         protein_qcov_weighted_max  1.A:0.A       apo
1070887    1b5d__1__1.A_1.B__1.D        1b49_A         1.A:0.A       foldseek  ...    mmseqs                  protein_qcov_max  1.A:0.A       apo
1070888    1b5d__1__1.A_1.B__1.D        1b49_A         1.A:0.A       foldseek  ...      both       protein_fident_weighted_max  1.A:0.A       apo
1070889    1b5d__1__1.A_1.B__1.D        1b49_A         1.A:0.A       foldseek  ...      both                protein_fident_max  1.A:0.A       apo
1070890    1b5d__1__1.A_1.B__1.D        1b49_A         1.A:0.A       foldseek  ...    mmseqs  protein_fident_qcov_weighted_max  1.A:0.A       apo
...                          ...           ...             ...            ...  ...       ...                               ...      ...       ...
213471528      7eek__1__1.A__1.I        1uor_A         1.A:0.A       foldseek  ...  foldseek    protein_lddt_qcov_weighted_max  1.A:0.A       apo
213471529      7eek__1__1.A__1.I        1uor_A         1.A:0.A       foldseek  ...  foldseek             protein_lddt_qcov_max  1.A:0.A       apo
213471536      7eek__1__1.A__1.I        1uor_A         1.A:0.A       foldseek  ...  foldseek                       pocket_lddt     None       apo
213471540      7eek__1__1.A__1.I        6zl1_A         1.A:0.A       foldseek  ...  foldseek                       pocket_lddt     None       apo
213471541      7eek__1__1.A__1.I        6zl1_B         1.A:0.B       foldseek  ...  foldseek                       pocket_lddt     None       apo

`apo.parquet` columns#
Name	Type	Description
query_system	str	The PLINDER system ID of query system
target_system	str	The PLINDER system ID of target system
protein_mapping	str	Chain mapping between query system and target system
protein_mapper	str	Alignment method used for mapping.
similarity	int	Value of the similarity metric named in `metric`
source	str	Source of similarity metric. It could either be `foldseek`, `mmseqs` or `both`
metric	str	Similarity metric of interest
mapping	str	Local region mapping between query system and target system
search_db	str	Search database type. Could be `apo`, `holo` or `pred`
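
Because the table is stored in long format (one row per query, target and metric), it is often convenient to gather all metrics for a pair in one place. The following is a minimal, dependency-free sketch of such a regrouping; the similarity values are made up, the identifiers and metric names are copied from the preview above, and `pivot_scores` is a hypothetical helper.

```python
from collections import defaultdict

rows = [  # made-up similarity values; only the column names follow the table above
    {"query_system": "1b5d__1__1.A_1.B__1.D", "target_system": "1b49_A",
     "metric": "protein_fident_max", "similarity": 98},
    {"query_system": "1b5d__1__1.A_1.B__1.D", "target_system": "1b49_A",
     "metric": "protein_qcov_max", "similarity": 100},
    {"query_system": "7eek__1__1.A__1.I", "target_system": "1uor_A",
     "metric": "pocket_lddt", "similarity": 87},
]


def pivot_scores(rows: list[dict]) -> dict:
    """Group long-format rows into {(query, target): {metric: similarity}}."""
    table = defaultdict(dict)
    for row in rows:
        pair = (row["query_system"], row["target_system"])
        table[pair][row["metric"]] = row["similarity"]
    return dict(table)


for pair, metrics in pivot_scores(rows).items():
    print(pair, metrics)
```

For the full dataset a dataframe library would be a more practical choice; the plain-dictionary version is used here only to keep the example self-contained.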
