Pipeline#

The following sections outline the steps of the end-to-end pipeline conceptually. We briefly describe some of the abstractions used to orchestrate the pipeline, but these should be considered implementation details because they depend on our choice of orchestration framework for job execution.

(Figure: end-to-end pipeline workflow)

Ingestion#

The code to obtain the raw data sources used in plinder is housed in the plinder.data.pipeline.io package and is invoked in our end-to-end pipeline through task wrappers in plinder.data.pipeline.tasks; a sketch of the download step follows the task list.

  • tasks.download_rcsb_files: uses the RCSB rsync API to download the majority of the raw data

    • This is a distributed task that is called in parallel for chunks of two-character codes

    • It syncs both the next-gen cif.gz and validation xml.gz files for all entries

    • Side effects include writing the following files:

      • ingest/{two_char_code}/{full_pdb_id}/{full_pdb_id}-enrich.cif.gz

      • reports/{two_char_code}/{pdb_id}/{pdb_id}_validation.xml.gz

  • tasks.download_alternative_datasets: downloads all the alternative datasets used to enrich plinder

    • This is a task that is called once but reaches out to numerous external REST APIs

    • This could be threaded, and arguably the AlphaFold sync could be its own task

    • Side effects include writing the following files:

      • dbs/alphafold/AF-{uniprot_id}-F1-model_v4.cif

      • dbs/cofactors/cofactors.json

      • dbs/components/components.parquet

      • dbs/ecod/ecod.parquet

      • dbs/kinase/kinase_information.parquet

      • dbs/kinase/kinase_ligand_ccd_codes.parquet

      • dbs/kinase/kinase_uniprotac.parquet

      • dbs/kinase/kinase_structures.parquet

      • dbs/panther/panther_{i}.parquet

      • dbs/seqres/pdb_seqres.txt.gz
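
As a rough sketch of the download step, the snippet below enumerates all two-character codes, groups them into chunks, and rsyncs each chunk. The endpoint and remote layout shown are the classic RCSB rsync service and stand in for the next-gen archive actually used; the chunk size and rsync flags are illustrative.

import string
import subprocess
from itertools import product
from pathlib import Path

def two_char_chunks(chunk_size: int = 50) -> list[list[str]]:
    """Enumerate all two-character codes and group them into chunks."""
    alphabet = string.ascii_lowercase + string.digits
    codes = ["".join(pair) for pair in product(alphabet, repeat=2)]
    return [codes[i : i + chunk_size] for i in range(0, len(codes), chunk_size)]

def sync_chunk(codes: list[str], dest: Path) -> None:
    """Rsync the entries for one chunk of two-character codes."""
    for code in codes:
        (dest / code).mkdir(parents=True, exist_ok=True)
        # Classic archive path shown for illustration; the pipeline syncs the
        # next-gen cif.gz and validation xml.gz files instead.
        subprocess.run(
            [
                "rsync", "-rlpt", "-z", "--delete", "--port=33444",
                f"rsync.rcsb.org::ftp_data/structures/divided/mmCIF/{code}/",
                str(dest / code),
            ],
            check=True,
        )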

Database creation#

Once the raw data is downloaded, we need to create the foldseek and mmseqs databases to be used as the basis for the similarity datasets; a command sketch follows the list.

  • tasks.make_dbs: creates the foldseek and mmseqs databases

    • This is a task that is called once

    • It uses the cif.gz data to create the foldseek database

    • It uses the pdb_seqres.txt.gz data to create the mmseqs database (obtained in download_alternative_datasets)
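
A minimal sketch of this step, assuming the standard foldseek createdb and mmseqs createdb commands (paths are illustrative):

import subprocess
from pathlib import Path

def make_dbs(cif_dir: Path, seqres_fasta: Path, out: Path) -> None:
    (out / "foldseek").mkdir(parents=True, exist_ok=True)
    (out / "mmseqs").mkdir(parents=True, exist_ok=True)
    # foldseek builds a structural database directly from a directory of
    # (optionally gzipped) CIF files.
    subprocess.run(
        ["foldseek", "createdb", str(cif_dir), str(out / "foldseek" / "db")],
        check=True,
    )
    # mmseqs builds a sequence database from the seqres FASTA obtained in
    # download_alternative_datasets (pdb_seqres.txt.gz).
    subprocess.run(
        ["mmseqs", "createdb", str(seqres_fasta), str(out / "mmseqs" / "db")],
        check=True,
    )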

Annotation generation#

Once the raw data is downloaded, we can start generating the annotation data. Technically, this could run in parallel with the database creation, but this task is already heavily distributed and running the two concurrently would add complexity to the DAG. A sketch of the per-entry annotation call follows the task list.

  • tasks.make_entries: creates the raw_entries data

    • This is a distributed task that is called in parallel for chunks of PDB IDs

    • It uses the cif.gz and xml.gz data in Entry.from_cif_file

    • It additionally uses the following alternative datasets:

      • ECOD

      • Panther

      • Kinase

      • Cofactors

      • Components

    • Side effects include writing the following files:

      • raw_entries/{two_char_code}/{pdb_id}.json

      • raw_entries/{two_char_code}/{system_id}/**
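
A minimal sketch of the per-entry annotation call; the import path, the exact signature of Entry.from_cif_file (which also consumes the validation XML and alternative datasets), and the serialization method are assumptions:

from pathlib import Path

# Illustrative import path; see plinder.data for the real location of Entry.
from plinder.data.utils.annotations.aggregate_annotations import Entry

def make_entry(cif_path: Path, root: Path) -> None:
    # The real call also receives the validation XML and alternative datasets.
    entry = Entry.from_cif_file(cif_path)
    pdb_id = entry.pdb_id
    # Two-character codes are the middle two characters of the PDB ID.
    target = root / "raw_entries" / pdb_id[1:3] / f"{pdb_id}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    # Entries are pydantic-style models, so .json() is assumed here.
    target.write_text(entry.json())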

Structure quality checks#

After raw annotation generation, we run a series of quality checks on the generated data and consolidate and organize it; the consolidation step is sketched after the list.

  • tasks.structure_qc: runs the structure quality checks

    • This is a distributed task that is called in parallel for chunks of two-character codes

    • It reads in the JSON files in raw_entries

    • Side effects include writing the following files:

      • qc/index/{two_char_code}.parquet - consolidated annotations

      • qc/logs/{two_char_code}_qc_fails.csv - list of entries that failed QC

      • entries/{two_char_code}.zip - zipped JSON entries
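
A minimal sketch of the consolidation side of this task, assuming entries serialize to flat records; the actual quality checks are omitted:

import json
import zipfile
from pathlib import Path

import pandas as pd

def consolidate(two_char_code: str, root: Path) -> None:
    entry_dir = root / "raw_entries" / two_char_code
    for sub in ("qc/index", "qc/logs", "entries"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    records, failures = [], []
    for path in sorted(entry_dir.glob("*.json")):
        try:
            records.append(json.loads(path.read_text()))
        except Exception:
            failures.append(path.stem)
    pd.DataFrame(records).to_parquet(root / "qc" / "index" / f"{two_char_code}.parquet")
    pd.Series(failures, name="pdb_id").to_csv(
        root / "qc" / "logs" / f"{two_char_code}_qc_fails.csv", index=False
    )
    # Zip the per-entry JSON files into a single archive per two-character code.
    with zipfile.ZipFile(root / "entries" / f"{two_char_code}.zip", "w") as zf:
        for path in sorted(entry_dir.glob("*.json")):
            zf.write(path, arcname=path.name)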

Structure archives#

The amount of structural data generated in make_entries is large, so it is consolidated separately in its own task; a sketch follows the list.

  • tasks.make_system_archives: creates the structure archives

    • This is a distributed task that is called in parallel for chunks of two-character codes

    • It consolidates the structure files into zip archives in the same layout as for entries

    • The inner structure of each structure zip is grouped by system ID

    • Side effects include writing the following files:

      • archives/{two_char_code}.zip - zipped structure files
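
A minimal sketch of archive creation, assuming the structure files live under raw_entries/{two_char_code}/{system_id}/ as written by make_entries:

import zipfile
from pathlib import Path

def make_archive(two_char_code: str, root: Path) -> None:
    src = root / "raw_entries" / two_char_code
    out = root / "archives" / f"{two_char_code}.zip"
    out.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(out, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for system_dir in sorted(p for p in src.iterdir() if p.is_dir()):
            for path in system_dir.rglob("*"):
                if path.is_file():
                    # Keep the system ID as the top-level folder in the zip.
                    zf.write(path, arcname=path.relative_to(src))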

Ligand Similarity#

Once the plinder systems have been generated by make_entries, we can enumerate the small molecule ligands in the dataset; the fingerprint and similarity steps are sketched after the list.

  • tasks.make_ligands: creates the ligands data

    • This is a distributed task that is called in parallel for chunks of PDB IDs

    • It filters the ligands down to those acceptable for use in ligand similarity scoring

    • It uses the JSON files from raw_entries

    • Side effects include writing the following files:

      • ligands/{chunk_hash}.parquet

  • tasks.compute_ligand_fingerprints: computes the ligand fingerprints

    • This is a task that is called once

    • It uses the ligands data

    • Side effects include writing the following files:

      • fingerprints/ligands_per_inchikey.parquet

      • fingerprints/ligands_per_inchikey_ecfp4.npy

      • fingerprints/ligands_per_system.parquet

  • tasks.make_ligand_scores: creates the ligand_scores data

    • This is a distributed task that is called in parallel for chunks of ligand IDs

    • It uses the fingerprints data

    • Side effects include writing the following files:

      • ligand_scores/{fragment}.parquet
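
A minimal sketch of the fingerprint and scoring steps with RDKit; the real pipeline deduplicates ligands by InChIKey, stores the fingerprint matrix as a .npy file, and scales similarities to integer percentages (hence int8 in the schema further below):

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def ecfp4(smiles: str):
    """ECFP4 corresponds to a Morgan fingerprint with radius 2."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles}")
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

def max_tanimoto(query: str, targets: list[str]) -> int:
    """Maximum Tanimoto similarity of query against targets, as a percent."""
    sims = DataStructs.BulkTanimotoSimilarity(ecfp4(query), [ecfp4(s) for s in targets])
    return int(round(100 * max(sims)))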

Sub-databases#

Once the plinder systems have been generated, we are able to split the foldseek and mmseqs databases into sub-databases containing holo and apo systems. We additionally use the AlphaFold linked_structures to create the pred sub-database; a sketch of sub-database creation follows the list.

  • tasks.make_sub_dbs: creates the holo and apo sub-databases

    • This is a task that is called once

    • It uses the foldseek and mmseqs databases

    • Side effects include writing the following files:

      • dbs/subdbs/holo_foldseek/**

      • dbs/subdbs/apo_foldseek/**

      • dbs/subdbs/pred_foldseek/**

      • dbs/subdbs/holo_mmseqs/**

      • dbs/subdbs/apo_mmseqs/**

      • dbs/subdbs/pred_mmseqs/**
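
A minimal sketch of sub-database creation, assuming the createsubdb module shared by mmseqs and foldseek, which extracts a subset of a database given a file of database keys; deriving the holo/apo/pred ID lists from the entries is the real work and is omitted here:

import subprocess
from pathlib import Path

def make_sub_db(tool: str, ids_file: Path, src_db: Path, sub_db: Path) -> None:
    """Extract the subset of src_db listed in ids_file into sub_db."""
    sub_db.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [tool, "createsubdb", str(ids_file), str(src_db), str(sub_db)],
        check=True,
    )

# e.g. make_sub_db("foldseek", Path("holo.ids"),
#                  Path("dbs/foldseek/db"), Path("dbs/subdbs/holo_foldseek/db"))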

Protein similarity#

With the holo and apo sub-databases created, we can run protein similarity scoring for all plinder systems; a search sketch follows the list.

  • tasks.run_batch_searches: runs the foldseek and mmseqs searches for large batches

    • This is a distributed task that is called in parallel for large chunks of PDB IDs

    • It uses the JSON files from raw_entries and the holo and apo sub-databases

    • Side effects include writing the following files:

      • foldseek and mmseqs search results

  • tasks.make_batch_scores: creates the protein similarity scores

    • This is a distributed task that is called in parallel for smaller chunks of PDB IDs

    • It uses the JSON files from raw_entries and the foldseek and mmseqs search results

    • Side effects include writing the following files:

      • scores/search_db=holo/*

      • scores/search_db=apo/*

      • scores/search_db=pred/*
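
A minimal sketch of a single search, assuming the standard foldseek search and convertalis workflow (mmseqs is analogous); batching, alignment parameters, and score post-processing in the real task are more involved:

import subprocess
from pathlib import Path

def run_search(query_db: Path, target_db: Path, out_dir: Path) -> Path:
    out_dir.mkdir(parents=True, exist_ok=True)
    result_db = out_dir / "result"
    subprocess.run(
        ["foldseek", "search", str(query_db), str(target_db),
         str(result_db), str(out_dir / "tmp")],
        check=True,
    )
    # Convert the internal result database into a TSV of alignments.
    tsv = out_dir / "alignments.tsv"
    subprocess.run(
        ["foldseek", "convertalis", str(query_db), str(target_db),
         str(result_db), str(tsv)],
        check=True,
    )
    return tsv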

MMP and MMS#

This step derives matched molecular pair (MMP) and matched molecular series (MMS) data from the ligands in the dataset; the index consolidation half of the task is sketched after the list.

  • tasks.make_mmp_index: creates the mmp dataset

    • This is a task that is called once

    • It additionally consolidates the index dataset created in structure_qc into a single parquet file

    • Side effects include writing the following files:

      • mmp/plinder_mmp_series.parquet

      • mmp/plinder_mms.csv.gz
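
A minimal sketch of the index consolidation half of this task; the MMP/MMS computation itself is omitted, and the output file name is hypothetical:

from pathlib import Path

import pandas as pd

def consolidate_index(root: Path) -> Path:
    """Merge the per-two-character-code index parquets from structure_qc."""
    parts = sorted((root / "qc" / "index").glob("*.parquet"))
    index = pd.concat([pd.read_parquet(p) for p in parts], ignore_index=True)
    out = root / "index" / "annotation_table.parquet"  # hypothetical name
    out.parent.mkdir(parents=True, exist_ok=True)
    index.to_parquet(out, index=False)
    return out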

Clustering#

Once the protein similarity scores are generated, we run component and community clustering; a component-clustering sketch follows the list.

  • tasks.make_components_and_communities: creates the components and communities clusters for given metrics at given thresholds

    • This is a distributed task that is called in parallel for individual tuples of metric and threshold

    • It uses the protein similarity scores and the annotation index

    • Side effects include writing the following files:

      • clusters/**
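
A minimal sketch of component clustering, using networkx as a stand-in; the real task also handles directed graphs for asymmetric metrics and community detection, and is far more memory-conscious:

import networkx as nx
import pandas as pd

def connected_components(scores: pd.DataFrame, metric: str, threshold: int) -> pd.DataFrame:
    """Label each system with its connected component at a similarity threshold."""
    edges = scores.loc[
        (scores["metric"] == metric) & (scores["similarity"] >= threshold),
        ["query_system", "target_system"],
    ]
    graph = nx.from_pandas_edgelist(edges, "query_system", "target_system")
    rows = [
        {"metric": metric, "threshold": threshold,
         "system_id": system_id, "component": str(i)}
        for i, component in enumerate(nx.connected_components(graph))
        for system_id in component
    ]
    return pd.DataFrame(rows)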

Splits#

Armed with the clusters from the previous step, we can now split the plinder systems into train, test, and val.
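
The key idea, sketched below under heavy simplification, is that whole clusters rather than individual systems are assigned to splits, so that similar systems never straddle train and test; the real splitting logic applies many additional criteria (quality filters, test-set curation, deleaking passes).

import pandas as pd

def assign_splits(clusters: pd.DataFrame, test_fraction: float = 0.01) -> pd.DataFrame:
    """clusters has system_id and component columns (CLUSTER_DATASET_SCHEMA)."""
    sizes = clusters.groupby("component").size().sort_values()
    # Illustrative heuristic: send the smallest clusters to test.
    test_components = set(sizes.index[: max(1, int(len(sizes) * test_fraction))])
    out = clusters[["system_id", "component"]].copy()
    out["split"] = [
        "test" if c in test_components else "train" for c in out["component"]
    ]
    return out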

Leakage#

With splits in hand, we perform an exhaustive evaluation of the generated splits, quantifying their quality through leakage metrics; one such metric is sketched below.
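
A minimal sketch of one leakage metric, assuming the scores and splits datasets are loaded as dataframes: for every test system, the maximum similarity to any train system.

import pandas as pd

def max_train_similarity(scores: pd.DataFrame, splits: pd.DataFrame) -> pd.Series:
    split_of = splits.set_index("system_id")["split"]
    pairs = scores.assign(
        query_split=scores["query_system"].map(split_of),
        target_split=scores["target_system"].map(split_of),
    )
    leak = pairs[(pairs["query_split"] == "test") & (pairs["target_split"] == "train")]
    # High values indicate test systems with close neighbors in train.
    return leak.groupby("query_system")["similarity"].max()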

Technical details#

Schemas#

The scores protein similarity dataset is a collection of parquet files with the following schema:

>>> from plinder.data.schemas import PROTEIN_SIMILARITY_SCHEMA
>>> PROTEIN_SIMILARITY_SCHEMA
query_system: string
target_system: string
protein_mapping: string
mapping: string
protein_mapper: dictionary<values=string, indices=int8, ordered=0>
source: dictionary<values=string, indices=int8, ordered=1>
metric: dictionary<values=string, indices=int8, ordered=1>
similarity: int8
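
Because the files are hive-partitioned by search_db (scores/search_db=holo/*), they can be read with a partition filter; a sketch assuming pandas with the pyarrow engine:

>>> import pandas as pd
>>> holo = pd.read_parquet("scores", filters=[("search_db", "==", "holo")])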

The ligand_scores ligand similarity dataset is a collection of parquet files with the following schema:

>>> from plinder.data.schemas import TANIMOTO_SCORE_SCHEMA
>>> TANIMOTO_SCORE_SCHEMA
query_ligand_id: int32
target_ligand_id: int32
tanimoto_similarity_max: int8

The clusters clustering dataset is a collection of parquet files with the following schema:

>>> from plinder.data.schemas import CLUSTER_DATASET_SCHEMA
>>> CLUSTER_DATASET_SCHEMA
metric: string
directed: bool
threshold: int8
system_id: string
component: string

The splits split datasets are independent parquet files with the following schema:

>>> from plinder.data.schemas import SPLIT_DATASET_SCHEMA
>>> SPLIT_DATASET_SCHEMA
system_id: string
split: string
cluster: string
cluster_for_val_split: string

The linked_structures datasets are independent parquet files with the following schema:

>>> from plinder.data.schemas import STRUCTURE_LINK_SCHEMA
>>> STRUCTURE_LINK_SCHEMA
query_system: string
target_system: string
protein_qcov_weighted_sum: float
protein_fident_weighted_sum: float
pocket_fident: float
target_id: string
sort_score: float