---
sd_hide_title: true
---

# Evaluation tools

## Evaluating docking poses across a stratified test set

The `plinder.eval` subpackage allows (1) assessing protein-ligand complex predictions against reference `plinder` systems, and
(2) correlating the performance of these predictions against the level of similarity of each test system to the corresponding training set.

Running the scripts `src/plinder/eval/docking/write_scores.py` and `src/plinder/eval/docking/stratify_test_set.py` produces the same evaluation metrics as those shown on the public leaderboard.

The output files from running `plinder-eval` will be used to populate the [MLSB leaderboard](https://www.mlsb.io/#challenge).

### Input

`predictions.csv`, with each row representing a protein-ligand pose, and the following columns:

- `id`: An identifier for the prediction (same across different ranked poses of the same prediction)
- `reference_system_id`: `plinder` system ID to use as reference
- `receptor_file`: Path to the protein CIF file. Leave blank for rigid docking; the system's receptor file will then be used.
- `rank`: The rank of the pose (1-indexed)
- `confidence`: Optional score associated with the pose
- `ligand_file`: Path to the pose SDF file, or to a directory of SDF files for multi-ligand poses

`split.parquet` with, at a minimum, `system_id` and `split` columns mapping PLINDER systems to `train` or `test`.
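
As a minimal sketch, a `predictions.csv` with the columns listed above could be written with Python's standard `csv` module. The prediction identifier, confidence values and file paths below are hypothetical placeholders, and the column order is an assumption; only the column names and the system ID come from this page.

```python
import csv
import io

# Column names as listed above; the order is an assumption.
COLUMNS = [
    "id",
    "reference_system_id",
    "receptor_file",
    "rank",
    "confidence",
    "ligand_file",
]

# Two ranked poses of the same prediction (hypothetical paths and scores).
rows = [
    {
        "id": "pred_0",
        "reference_system_id": "1a3b__1__1.B__1.D",
        "receptor_file": "",  # blank: rigid docking, the system's receptor is used
        "rank": 1,
        "confidence": 0.92,
        "ligand_file": "poses/pred_0/rank1.sdf",
    },
    {
        "id": "pred_0",  # same id for all ranked poses of one prediction
        "reference_system_id": "1a3b__1__1.B__1.D",
        "receptor_file": "",
        "rank": 2,
        "confidence": 0.55,
        "ligand_file": "poses/pred_0/rank2.sdf",
    },
]


def write_predictions(rows, handle):
    """Write prediction rows as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


# Write to an in-memory buffer so the example has no side effects.
buffer = io.StringIO()
write_predictions(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open("predictions.csv", "w", newline="")` in place of the buffer.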

### Commands

#### Write scores

```bash
plinder_eval --prediction_file tests/test_data/eval/predictions.csv --output_dir test_eval/ --num_processes 8
```

This calculates accuracy metrics for all predicted poses compared to the reference. A JSON file for each pose is stored in `test_eval/scores`, and the summary file across all poses is stored in `test_eval/scores.parquet`.

The predicted pose is compared to the reference system, and the following scores are calculated for each ligand:
- `lddt_pli`: lDDT-PLI for the matched ligand
- `bisy_rmsd`: binding-site superposed symmetry-corrected RMSD for the matched ligand
- `lddt_lp`: lDDT score for the residues in the matched ligand pocket
- `best_rmsd_matched_reference_chain`: chain tag of the best `bisy_rmsd` matched ligand chain in the reference (useful for multi-ligand systems)
- `best_pli_matched_reference_chain`: chain tag of the best `lddt_pli` matched ligand chain in the reference (useful for multi-ligand systems)

and in aggregate per system:
- `fraction_reference_ligands_mapped`: Fraction of reference ligand chains with corresponding model chains
- `fraction_model_ligands_mapped`: Fraction of model ligand chains mapped to corresponding reference chains
- `lddt_pli_ave`: average lDDT-PLI across mapped ligands
- `lddt_pli_wave`: average lDDT-PLI across mapped ligands weighted by number of atoms
- `lddt_lp_ave`: average lDDT-LP score for the residues in the matched ligand pocket
- `lddt_lp_wave`: average lDDT-LP across mapped ligands weighted by number of atoms
- `bisy_rmsd_ave`: average binding-site superposed symmetry-corrected RMSD across mapped ligands
- `bisy_rmsd_wave`: average binding-site superposed symmetry-corrected RMSD across mapped ligands weighted by number of atoms

If the `--score_receptor` flag is used, the protein in `receptor_file` is compared to the receptor file of the `reference` system and the following scores are calculated:

- `fraction_reference_proteins_mapped`: Fraction of reference protein chains with corresponding model chains
- `fraction_model_proteins_mapped`: Fraction of model protein chains mapped to corresponding reference chains
- `lddt`: all-atom lDDT
- `bb_lddt`: CA lDDT
- `per_chain_lddt_ave`: average all-atom lDDT across all mapped chains
- `per_chain_lddt_wave`: average all-atom lDDT across all mapped chains, weighted by chain length
- `per_chain_bb_lddt_ave`: average CA lDDT across all mapped chains
- `per_chain_bb_lddt_wave`: average CA lDDT across all mapped chains, weighted by chain length

For oligomeric complexes:

- `qs_global`: global QS-score
- `qs_best`: global QS-score, computed only on aligned residues
- `dockq_ave`: average of DockQ scores
- `dockq_wave`: same as `dockq_ave`, weighted by native contacts

If `score_posebusters` is True, the results of all PoseBusters checks are saved as well.

You can inspect the results in `test_eval/scores.parquet`:

```python
>>> import pandas as pd
>>> df = pd.read_parquet("test_eval/scores.parquet")
>>> df.T
                                                   0                      1
model                              1a3b__1__1.B__1.D  1ai5__1__1.A_1.B__1.D
reference                          1a3b__1__1.B__1.D  1ai5__1__1.A_1.B__1.D
num_reference_ligands                              1                      1
num_model_ligands                                  1                      1
num_reference_proteins                             1                      2
num_model_proteins                                 1                      2
fraction_reference_ligands_mapped                1.0                    1.0
fraction_model_ligands_mapped                    1.0                    1.0
lddt_pli_ave                                 0.85815               0.510695
lddt_pli_wave                                0.85815               0.510695
bisy_rmsd_ave                               1.617184               3.665143
bisy_rmsd_wave                              1.617184               3.665143
rank                                               1                      1
```
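
As a rough sketch of how such a table might be summarised, the snippet below computes the fraction of top-ranked poses whose `bisy_rmsd_ave` falls below a cutoff. The 2 Å threshold is an assumption made for illustration and is not prescribed by this page. The small DataFrame stands in for `pd.read_parquet("test_eval/scores.parquet")` and reuses the values shown above.

```python
import pandas as pd

# Toy stand-in for the scores table, using the two rows shown above.
scores = pd.DataFrame(
    {
        "reference": ["1a3b__1__1.B__1.D", "1ai5__1__1.A_1.B__1.D"],
        "rank": [1, 1],
        "lddt_pli_ave": [0.85815, 0.510695],
        "bisy_rmsd_ave": [1.617184, 3.665143],
    }
)

RMSD_THRESHOLD = 2.0  # assumed success cutoff in angstroms

# Keep only the top-ranked pose of each prediction.
top1 = scores[scores["rank"] == 1]

# Fraction of top-ranked poses under the cutoff.
success_rate = (top1["bisy_rmsd_ave"] < RMSD_THRESHOLD).mean()
print(f"Top-1 success rate: {success_rate:.2f}")
```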

#### Write test stratification data

(Users will not need to run this command; the `test_set.parquet` and `val_set.parquet` files will be provided with the split release.)

```bash
plinder_stratify --split_file split.csv --output_dir test_data
```

This creates `test_data/test_set.parquet`, which:

- Labels the maximum similarity of each test system to the training set across all the similarity metrics
- Stratifies the test set based on training set similarity into `novel_pocket_pli`, `novel_ligand_pli`, `novel_protein`, `novel_ligand`, `novel_all` and `not_novel`
- Labels which test systems are of high quality (the `passes_quality` column)

To inspect the result of the run, do:
```python
>>> import pandas as pd
>>> df = pd.read_parquet("test_data/test_set.parquet")
>>> df.T
                                                  0                      1
system_id                         1a3b__1__1.B__1.D  1ai5__1__1.A_1.B__1.D
pli_qcov                                        0.0                    0.0
protein_seqsim_qcov_weighted_sum                0.0                    0.0
protein_seqsim_weighted_sum                     0.0                    0.0
protein_fident_qcov_weighted_sum                0.0                    0.0
protein_fident_weighted_sum                     0.0                    0.0
protein_lddt_qcov_weighted_sum                  0.0                    0.0
protein_lddt_weighted_sum                       0.0                    0.0
protein_qcov_weighted_sum                       0.0                    0.0
pocket_fident_qcov                              0.0                    0.0
pocket_fident                                   0.0                    0.0
pocket_lddt_qcov                                0.0                    0.0
pocket_lddt                                     0.0                    0.0
pocket_qcov                                     0.0                    0.0
tanimoto_similarity_max                         0.0                    0.0
passes_quality                                False                  False
novel_pocket_pli                               True                   True
novel_ligand                                   True                   True
novel_protein                                  True                   True
novel_all                                      True                   True
not_novel                                     False                  False
```
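
To relate pose accuracy to training-set similarity, the scores table can be joined with the stratification table. The sketch below assumes that the `reference` column of the scores matches `system_id` in the test set table, and again uses a 2 Å cutoff as an illustrative assumption. Both DataFrames are toy stand-ins built from the values shown on this page.

```python
import pandas as pd

# Toy stand-in for test_eval/scores.parquet.
scores = pd.DataFrame(
    {
        "reference": ["1a3b__1__1.B__1.D", "1ai5__1__1.A_1.B__1.D"],
        "rank": [1, 1],
        "bisy_rmsd_ave": [1.617184, 3.665143],
    }
)

# Toy stand-in for test_set.parquet (only two of its columns).
test_set = pd.DataFrame(
    {
        "system_id": ["1a3b__1__1.B__1.D", "1ai5__1__1.A_1.B__1.D"],
        "novel_all": [True, True],
        "not_novel": [False, False],
    }
)

# Join on the system ID (assumed to be the shared key).
merged = scores.merge(
    test_set, left_on="reference", right_on="system_id", how="inner"
)

# Mark top-ranked poses under the assumed 2 angstrom cutoff as successes.
merged = merged[merged["rank"] == 1].copy()
merged["success"] = merged["bisy_rmsd_ave"] < 2.0

# Success rate within each novelty stratum.
by_novelty = merged.groupby("novel_all")["success"].mean()
print(by_novelty)
```

The same grouping can be repeated for any of the other stratification columns, such as `novel_protein` or `novel_ligand`.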