{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# MLSB/PLINDER Data Access\n", "(mlsb-notebook-target)=\n", "\n", "The goal of this tutorial is to provide background information for the MLSB/PLINDER challenge, as well as a simple hands-on demo for how participants can access and use the _PLINDER_ dataset. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Background information \n", "\n", "For background information on the rules of the challenge, see [MLSB/P(L)INDER challenge rules](#mlsb-rules-target) for more information." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Accessing and loading data for training " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we are going to demonstrate how to get the key input data:\n", "- protein receptor fasta sequence\n", "- small molecules ligand SMILES string\n", "- access to linked _apo_ and _pred_ structure\n", "\n", "\n", "In the process, we will show:\n", "- How to download the _PLINDER_ data\n", "- How to query _PLINDER_ index and splits to select relevant data using `plinder.core` API\n", "- Extract task-specific data one might want to use for training a task-specific ML model, eg. one protein, one ligand\n", "- How to use `plinder.core` API to:\n", " - supply dataset inputs for `train` or `val` splits\n", " - load linked `apo` and `pred` structures\n", " - use diversity subsampling based on cluster labels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download _PLINDER_ \n", "\n", "To download, run: `plinder_download --release 2024-06 --iteration v2 --yes`
\n", "This will download and unpack all neccesary files. For more information on download check out [Dataset Tutorial](https://plinder-org.github.io/plinder/tutorial/dataset.html#getting-the-data)\n", "\n", ":::{note} The dataset is hundreds of gigabytes in size; downloading and extracting should take about 40 minutes. If you want to play around with a toy example dataset, please use `--iteration tutorial`\n", ":::" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Loading _PLINDER_ \n", "\n", "We recommend users interact with the dataset using _PLINDER_ Python API.\n", "\n", "To install the API run: ``pip install plinder[loader]``. If you are using `zsh` terminal, you will have to quote the package like ``\"plinder[loader]\"``\n", "\n", "NOTE: once the _PLINDER_ is downloaded locally, you can use it in offline mode to save time for data queries with: `os.environ[\"PLINDER_OFFLINE\"] = \"true\"`" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2\n", "\n", "from __future__ import annotations\n", "import os\n", "import pandas as pd\n", "\n", "# once the _PLINDER_ is downloaded you can set this to true\n", "# os.environ[\"PLINDER_OFFLINE\"] = \"true\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Load _PLINDER_ index with selected columns from annotations table\n", "For a full list with descriptions, please refer to [docs](https://plinder-org.github.io/plinder/dataset.html#annotation-tables-index)." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from plinder.core.scores import query_index" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2024-09-18 19:51:50,569 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.17s\n" ] } ], "source": [ "# get plinder index with selected annotation columns specified\n", "plindex = query_index(\n", " columns=[\"system_id\", \"ligand_id\",\n", " \"ligand_rdkit_canonical_smiles\", \"ligand_is_ion\",\n", " \"ligand_is_artifact\", \"system_num_ligand_chains\",\n", " \"system_num_protein_chains\",\n", " \"pli_qcov__100__strong__component\",\n", " \"ligand_is_proper\",\n", " \"system_proper_num_ligand_chains\",\n", " ],\n", " filters=[\n", " (\"system_type\", \"==\", \"holo\"),\n", " (\"system_num_protein_chains\", \"<=\", 5),\n", " (\"system_num_ligand_chains\", \"<=\", 5),\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
system_idligand_idligand_rdkit_canonical_smilesligand_is_ionligand_is_artifactsystem_num_ligand_chainssystem_num_protein_chainspli_qcov__100__strong__componentligand_is_propersystem_proper_num_ligand_chains
03grt__1__1.A_2.A__1.B3grt__1__1.BCc1cc2nc3c(=O)[nH]c(=O)nc-3n(C[C@H](O)[C@H](O)...FalseFalse12c243140True1
13grt__1__1.A_2.A__1.C3grt__1__1.CN[C@@H](CCC(=O)N[C@H]1CSSC[C@H](NC(=O)CC[C@H](...FalseFalse12c169758True1
23grt__1__1.A_2.A__2.B3grt__1__2.BCc1cc2nc3c(=O)[nH]c(=O)nc-3n(C[C@H](O)[C@H](O)...FalseFalse12c242976True1
33grt__1__1.A_2.A__2.C3grt__1__2.CN[C@@H](CCC(=O)N[C@H]1CSSC[C@H](NC(=O)CC[C@H](...FalseFalse12c173553True1
41grx__1__1.A__1.B1grx__1__1.BN[C@@H](CCC(=O)N[C@@H](CS)C(=O)NCC(=O)O)C(=O)OFalseFalse11c186761True1
\n", "
" ], "text/plain": [ " system_id ligand_id \\\n", "0 3grt__1__1.A_2.A__1.B 3grt__1__1.B \n", "1 3grt__1__1.A_2.A__1.C 3grt__1__1.C \n", "2 3grt__1__1.A_2.A__2.B 3grt__1__2.B \n", "3 3grt__1__1.A_2.A__2.C 3grt__1__2.C \n", "4 1grx__1__1.A__1.B 1grx__1__1.B \n", "\n", " ligand_rdkit_canonical_smiles ligand_is_ion \\\n", "0 Cc1cc2nc3c(=O)[nH]c(=O)nc-3n(C[C@H](O)[C@H](O)... False \n", "1 N[C@@H](CCC(=O)N[C@H]1CSSC[C@H](NC(=O)CC[C@H](... False \n", "2 Cc1cc2nc3c(=O)[nH]c(=O)nc-3n(C[C@H](O)[C@H](O)... False \n", "3 N[C@@H](CCC(=O)N[C@H]1CSSC[C@H](NC(=O)CC[C@H](... False \n", "4 N[C@@H](CCC(=O)N[C@@H](CS)C(=O)NCC(=O)O)C(=O)O False \n", "\n", " ligand_is_artifact system_num_ligand_chains system_num_protein_chains \\\n", "0 False 1 2 \n", "1 False 1 2 \n", "2 False 1 2 \n", "3 False 1 2 \n", "4 False 1 1 \n", "\n", " pli_qcov__100__strong__component ligand_is_proper \\\n", "0 c243140 True \n", "1 c169758 True \n", "2 c242976 True \n", "3 c173553 True \n", "4 c186761 True \n", "\n", " system_proper_num_ligand_chains \n", "0 1 \n", "1 1 \n", "2 1 \n", "3 1 \n", "4 1 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "plindex.head()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "system_num_protein_chains\n", "1 404288\n", "2 203489\n", "3 34249\n", "4 6718\n", "5 1171\n", "Name: system_id, dtype: int64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Display number of system neighboring protein chains\n", "plindex.groupby(\"system_num_protein_chains\").system_id.count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Extracting specific data using _PLINDER_ annotations \n", "As we can see just from the data tables above - a significant fraction of _PLINDER_ systems contain complex multi protein chain systems." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Task specific selection\n", "If we would like to focus on single protein and single ligand systems for training, we can use the annotated columns to filter out systems that:\n", "- contain only one protein chain\n", "- only one ligand\n", "\n", "Remember: In _PLINDER_ artifacts and (single atom) ions are also included in the index if they are part of the pocket.\n", "- `ligand_is_proper` combines columns `ligand_is_ion` and `ligand_is_artifact` to only select \"proper\" ligands.\n", "\n", "Let's find out how many annotated ligands are \"proper\"." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ligand_is_proper\n", "False 116209\n", "True 533706\n", "Name: system_id, dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "plindex.groupby(\"ligand_is_proper\").system_id.count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### User choice\n", "\n", "The annotations table gives flexibility to choose the systems for training:\n", "- One could strictly choose to use only the data that contains single protein single ligand systems\n", "- Alternatively one could expand the number of systems to include systems containing single proper ligands, and optionally ignore the artifacts and ions in the pocket\n", "\n", "Let's compare the numbers of such systems!" 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of single receptor single ligand systems: 238228\n", "Number of single receptor single \"proper\" ligand systems: 283863\n" ] } ], "source": [ "# create mask for single receptor single ligand systems\n", "systems_1p1l = (plindex[\"system_num_protein_chains\"] == 1) & (plindex[\"system_num_ligand_chains\"] == 1)\n", "\n", "# create mask only for single receptor single \"proper\" ligand systems\n", "systems_proper_1p1l = (plindex[\"system_num_protein_chains\"] == 1) & (plindex[\"system_proper_num_ligand_chains\"] == 1) & plindex[\"ligand_is_proper\"]\n", "\n", "print(f\"Number of single receptor single ligand systems: {sum(systems_1p1l)}\")\n", "print(f\"Number of single receptor single \\\"proper\\\" ligand systems: {sum(systems_proper_1p1l)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see - the second choice can provide up to 20% more data for training, however, the caveat is that some of the interactions made by artifacts or ions may influence the binding pose of the \"proper\" ligand. The user could come up with further strategies to filtering using annotations table or external tools, but this is beyond the scope of this tutorial." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Loading splits\n", "\n", "Now, after curating the systems of interest, let's have a look at the splits using _PLINDER_ API.\n", "\n", "- How to use {mod}`plinder.core` API to supply dataset inputs for `train` or `val` splits" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from plinder.core import get_split" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Getting the splits\n", "\n", "The `get_split` function provided the current _PLINDER_ split, the detailed description of this DataFrame is provide in the [dataset documentation](https://plinder-org.github.io/plinder/dataset.html#splits-splits), but for our practical purposes we are mostly interested in `system_id` and `split` that assigns each of our systems to a specific split category." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2024-09-18 19:51:51,551 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.17s\n", "2024-09-18 19:51:51,713 | plinder.core.split.utils:40 | INFO : reading /Users/tjd/.local/share/plinder/2024-06/v2/splits/split.parquet\n", "2024-09-18 19:51:51,891 | plinder.core.split.utils.get_split:24 | INFO : runtime succeeded: 0.58s\n" ] } ], "source": [ "# get the current plinder split\n", "split_df = get_split()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
system_iduniquenesssplitclustercluster_for_val_splitsystem_pass_validation_criteriasystem_pass_statistics_criteriasystem_proper_num_ligand_chainssystem_proper_pocket_num_residuessystem_proper_num_interactionssystem_proper_ligand_max_molecular_weightsystem_has_binding_affinitysystem_has_apo_or_pred
0101m__1__1.A__1.C_1.D101m__A__C_D_c188899trainc14c0TrueTrue12720616.177293FalseFalse
1102m__1__1.A__1.C102m__A__C_c237197trainc14c0TrueTrue12620616.177293FalseTrue
2103m__1__1.A__1.C_1.D103m__A__C_D_c252759trainc14c0FalseTrue12616616.177293FalseFalse
3104m__1__1.A__1.C_1.D104m__A__C_D_c274687trainc14c0FalseTrue12721616.177293FalseFalse
4105m__1__1.A__1.C_1.D105m__A__C_D_c221688trainc14c0FalseTrue12820616.177293FalseFalse
\n", "
" ], "text/plain": [ " system_id uniqueness split cluster \\\n", "0 101m__1__1.A__1.C_1.D 101m__A__C_D_c188899 train c14 \n", "1 102m__1__1.A__1.C 102m__A__C_c237197 train c14 \n", "2 103m__1__1.A__1.C_1.D 103m__A__C_D_c252759 train c14 \n", "3 104m__1__1.A__1.C_1.D 104m__A__C_D_c274687 train c14 \n", "4 105m__1__1.A__1.C_1.D 105m__A__C_D_c221688 train c14 \n", "\n", " cluster_for_val_split system_pass_validation_criteria \\\n", "0 c0 True \n", "1 c0 True \n", "2 c0 False \n", "3 c0 False \n", "4 c0 False \n", "\n", " system_pass_statistics_criteria system_proper_num_ligand_chains \\\n", "0 True 1 \n", "1 True 1 \n", "2 True 1 \n", "3 True 1 \n", "4 True 1 \n", "\n", " system_proper_pocket_num_residues system_proper_num_interactions \\\n", "0 27 20 \n", "1 26 20 \n", "2 26 16 \n", "3 27 21 \n", "4 28 20 \n", "\n", " system_proper_ligand_max_molecular_weight system_has_binding_affinity \\\n", "0 616.177293 False \n", "1 616.177293 False \n", "2 616.177293 False \n", "3 616.177293 False \n", "4 616.177293 False \n", "\n", " system_has_apo_or_pred \n", "0 False \n", "1 True \n", "2 False \n", "3 False \n", "4 False " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "split_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some specific method developers working on _flexible_ docking may also find handy the annotation column `system_has_apo_or_pred` indicating if the system has available `apo` or `pred` linked structures (see later)." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "split system_has_apo_or_pred\n", "removed False 56876\n", " True 41842\n", "test False 548\n", " True 488\n", "train False 189703\n", " True 119437\n", "val False 456\n", " True 376\n", "Name: system_id, dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "split_df.groupby([\"split\", \"system_has_apo_or_pred\"]).system_id.count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For simplicity let's merge plindex and split DataFrames into one" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# merge to a single DataFrame\n", "plindex_split = plindex.merge(split_df, on=\"system_id\", how=\"left\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Getting links to `apo` or `pred` structures \n", "\n", ":::{currentmodule} plinder.core\n", ":::\n", "\n", "For users interested in including `apo` and `pred` structures in their workflow, all the information needed can be obtained from the function {func}`query_links`" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "from plinder.core.scores import query_links" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2024-09-18 19:51:52,879 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.29s\n", "2024-09-18 19:51:53,165 | plinder.core.scores.links.query_links:24 | INFO : runtime succeeded: 0.81s\n" ] } ], "source": [ "links_df = query_links(\n", " columns=[\"reference_system_id\", \"id\", \"sort_score\"],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::{note} The table is sorted by `sort_score` that is resolution for `apo`s and `plddt` for `pred`s. 
The `apo` or `pred` is specified in the additionally added `filename` and `kind` column that specifies if the structure was sourced from PDB or AF2DB, respectively.\n", ":::" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
reference_system_ididsort_scorekind
06pl9__1__1.A__1.C2vb1_A0.65apo
16ahh__1__1.A__1.G2vb1_A0.65apo
25b59__1__1.A__1.B2vb1_A0.65apo
33ato__1__1.A__1.B2vb1_A0.65apo
46mx9__1__1.A__1.K2vb1_A0.65apo
\n", "
" ], "text/plain": [ " reference_system_id id sort_score kind\n", "0 6pl9__1__1.A__1.C 2vb1_A 0.65 apo\n", "1 6ahh__1__1.A__1.G 2vb1_A 0.65 apo\n", "2 5b59__1__1.A__1.B 2vb1_A 0.65 apo\n", "3 3ato__1__1.A__1.B 2vb1_A 0.65 apo\n", "4 6mx9__1__1.A__1.K 2vb1_A 0.65 apo" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "links_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If a user wants to consider only one linked structure per system - we can easily drop duplicates, first sorting by `sort_score`. Using this priority score, `pred` structures will not be used unless there is no `apo` available. Alternative can be achieved by sorting with `ascending=False`, or filtering by `kind==\"pred\"` column." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
reference_system_ididsort_scorekind
06pl9__1__1.A__1.C2vb1_A0.65apo
1106agr__1__1.A__1.G2vb1_A0.65apo
1114qgz__1__1.A__1.C2vb1_A0.65apo
1124owa__2__1.B__1.NA2vb1_A0.65apo
1136wgo__1__1.A__1.E2vb1_A0.65apo
\n", "
" ], "text/plain": [ " reference_system_id id sort_score kind\n", "0 6pl9__1__1.A__1.C 2vb1_A 0.65 apo\n", "110 6agr__1__1.A__1.G 2vb1_A 0.65 apo\n", "111 4qgz__1__1.A__1.C 2vb1_A 0.65 apo\n", "112 4owa__2__1.B__1.NA 2vb1_A 0.65 apo\n", "113 6wgo__1__1.A__1.E 2vb1_A 0.65 apo" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "single_links_df = links_df.sort_values(\"sort_score\", ascending=True).drop_duplicates(\"reference_system_id\")\n", "single_links_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have links to `apo` / `pred` structures, we can see how many of those are available for our single protein single ligand systems" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "split system_has_apo_or_pred\n", "removed False 4720\n", " True 41685\n", "test False 59\n", " True 487\n", "train False 33897\n", " True 118925\n", "val False 26\n", " True 374\n", "Name: system_id, dtype: int64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "plindex_split[systems_1p1l].groupby([\"split\", \"system_has_apo_or_pred\"]).system_id.count()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "plindex_split_1p1l_links = plindex_split[systems_1p1l].merge(single_links_df, left_on=\"system_id\", right_on=\"reference_system_id\", how=\"left\")" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "split system_has_linked_apo_or_pred\n", "removed False 7101\n", " True 39304\n", "test False 76\n", " True 470\n", "train False 47109\n", " True 105713\n", "val False 30\n", " True 370\n", "Name: system_id, dtype: int64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# let's check how many systems have linked structures\n", 
"plindex_split_1p1l_links['system_has_linked_apo_or_pred'] = ~plindex_split_1p1l_links.kind.isna()\n", "plindex_split_1p1l_links.groupby([\"split\", \"system_has_linked_apo_or_pred\"]).system_id.count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Selecting final dataset\n", "Let's select only the set that has linked structures for flexible docking." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "split system_has_linked_apo_or_pred\n", "test True 470\n", "train True 105713\n", "val True 370\n", "Name: system_id, dtype: int64" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "plindex_final_df = plindex_split_1p1l_links[\n", " (plindex_split_1p1l_links.system_has_linked_apo_or_pred) & (plindex_split_1p1l_links.split != \"removed\")\n", "]\n", "plindex_final_df.groupby([\"split\", \"system_has_linked_apo_or_pred\"]).system_id.count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using _PLINDER_ API to load dataset by split \n", "\n", "More to come here after revamping the data loader code in `plinder`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::{currentmodule} plinder.core\n", ":::" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::{note} If the files are not already available, this downloads them to the `~/.local/share/plinder/{PLINDER_RELEASE}/{PLINDER_ITERATION}` directory.\n", ":::" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using _PLINDER_ clusters in sampling " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Define diversity sampler function\n", ":::{currentmodule} plinder.core\n", ":::\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we have provided an example of how one might use the function `get_diversity_samples`, which is based on `torch.utils.data.WeightedRandomSampler`.\n", "\n", "NOTE: This example function is provided for demonstration purposes; users are encouraged to come up with a sampling strategy that suits their needs.
\n", "\n", "In general, diversity can be sampled using cluster information described [here](https://plinder-org.github.io/plinder/dataset.html#clusters-clusters).\n", "All cluster information can easily be added to `plindex`.
\n", "\n", "See below an example, we are going to sample based on the following cluster label:\n", "`pli_qcov__70__community`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The returned DataFrame could then be passed to {func}`get_model_input_files` the same way `plindex_final_df` was used above." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "ename": "ImportError", "evalue": "cannot import name 'get_model_input_files' from 'plinder.core.loader.loader' (/Users/tjd/projects/github/plinder/src/plinder/core/loader/loader.py)", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mImportError\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn[22], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mplinder\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mcore\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mloader\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mloader\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m get_diversity_samples\n\u001b[1;32m 3\u001b[0m cluster_column \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mpli_qcov__70__community\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 4\u001b[0m plindex_clusters \u001b[38;5;241m=\u001b[39m query_index(columns\u001b[38;5;241m=\u001b[39m[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124msystem_id\u001b[39m\u001b[38;5;124m\"\u001b[39m, cluster_column])\n", "File \u001b[0;32m~/projects/github/plinder/src/plinder/core/loader/__init__.py:4\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m# Copyright (c) 2024, Plinder Development Team\u001b[39;00m\n\u001b[1;32m 2\u001b[0m \u001b[38;5;66;03m# Distributed under the terms of the Apache License 2.0\u001b[39;00m\n\u001b[0;32m----> 4\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m 
\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mloader\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m PlinderDataset, get_diversity_samples, get_model_input_files\n\u001b[1;32m 6\u001b[0m __all__ \u001b[38;5;241m=\u001b[39m [\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mPlinderDataset\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mget_model_input_files\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mget_diversity_samples\u001b[39m\u001b[38;5;124m\"\u001b[39m]\n", "\u001b[0;31mImportError\u001b[0m: cannot import name 'get_model_input_files' from 'plinder.core.loader.loader' (/Users/tjd/projects/github/plinder/src/plinder/core/loader/loader.py)" ] } ], "source": [ "from plinder.core.loader.loader import get_diversity_samples\n", "\n", "cluster_column = \"pli_qcov__70__community\"\n", "plindex_clusters = query_index(columns=[\"system_id\", cluster_column])\n", "plindex_with_clusters = plindex_final_df.merge(plindex_clusters, on=\"system_id\", how=\"left\")\n", "sampled_df = get_diversity_samples(split_df=plindex_with_clusters, cluster_column=cluster_column)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.19" } }, "nbformat": 4, "nbformat_minor": 4 }