{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Python API tutorial\n", "\n", "## Setup\n", "\n", "### Installation\n", "\n", "`plinder` is available on *PyPI*.\n", "\n", "```\n", "pip install plinder\n", "```\n", "\n", "### Environment variable configuration\n", "\n", "We need to set environment variables to point to the release and iteration of choice.\n", "For the sake of demonstration, we point them to a smaller tutorial example dataset:\n", "`PLINDER_RELEASE=2024-06` and `PLINDER_ITERATION=tutorial`.\n", "\n", ":::{note}\n", "The version used for the preprint is `PLINDER_RELEASE=2024-04` and\n", "`PLINDER_ITERATION=v1`, while the current version with updated annotations to be used\n", "for the MLSB challenge is `PLINDER_RELEASE=2024-06` and `PLINDER_ITERATION=v2`.\n", ":::" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "from pathlib import Path\n", "\n", "# Select the dataset release and iteration used in this tutorial\n", "release = \"2024-06\"\n", "iteration = \"tutorial\"\n", "os.environ[\"PLINDER_RELEASE\"] = release\n", "os.environ[\"PLINDER_ITERATION\"] = iteration\n", "os.environ[\"PLINDER_REPO\"] = str(Path.home() / \"plinder-org/plinder\")\n", "os.environ[\"PLINDER_LOCAL_DIR\"] = str(Path.home() / \".local/share/plinder\")\n", "os.environ[\"GCLOUD_PROJECT\"] = \"plinder\"\n", "version = f\"{release}/{iteration}\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, these variables can be set from the terminal via `export` (*UNIX*) or\n", "`set` (*Windows*)." 
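] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a minimal sketch of combining both approaches, the helper below fills in the tutorial\n", "values only for variables that are not already set, so values exported in the terminal\n", "take precedence. `apply_plinder_defaults` and `TUTORIAL_DEFAULTS` are hypothetical names\n", "used for illustration and are not part of `plinder`; the sketch assumes it runs before\n", "`plinder` is imported, as in the cell above.\n", "\n", "
```python
import os

# Tutorial values quoted in the text above; treat them as an assumption
# for any other release or iteration.
TUTORIAL_DEFAULTS = {
    'PLINDER_RELEASE': '2024-06',
    'PLINDER_ITERATION': 'tutorial',
}


def apply_plinder_defaults(defaults=TUTORIAL_DEFAULTS, environ=os.environ):
    '''Set missing variables only and return the release/iteration string.'''
    for name, value in defaults.items():
        # setdefault() never overwrites, so exported values are kept as-is
        environ.setdefault(name, value)
    return environ['PLINDER_RELEASE'] + '/' + environ['PLINDER_ITERATION']


print(apply_plinder_defaults())
```
"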
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview\n", "\n", "The user-facing subpackage of `plinder` is {mod}`plinder.core`.\n", "It provides the utility functions for accessing the dataset, splits and annotations,\n", "including the following top-level functions:\n", "\n", ":::{currentmodule} plinder.core\n", ":::\n", "\n", "- {func}`get_config()`: access *PLINDER* global configuration\n", "- {func}`get_plindex()`: access full annotation table\n", "- {func}`get_split()`: access full split table\n", "\n", ":::{currentmodule} plinder\n", ":::\n", "\n", "In addition, it provides access to the data class {class}`PlinderSystem` for\n", "reconstituting a *PLINDER* system from its `system_id`.\n", "\n", "To supplement these data, {mod}`plinder.core.scores` provides functionality for\n", "querying metrics, such as protein/ligand similarity and cluster identity." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting the configuration\n", "\n", "First, we get the configuration to check that all parameters are set correctly.\n", "In the snippet below, we check whether the local and remote *PLINDER* paths point to\n", "the expected locations." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "local cache directory: /Users/yusuf/.local/share/plinder/2024-06/tutorial\n", "remote data directory: gs://plinder/2024-06/tutorial\n" ] } ], "source": [ "import plinder.core.utils.config\n", "\n", "cfg = plinder.core.get_config()\n", "print(f\"local cache directory: {cfg.data.plinder_dir}\")\n", "print(f\"remote data directory: {cfg.data.plinder_remote}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Query annotations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Query specific columns\n", "\n", ":::{currentmodule} plinder.core.scores\n", ":::\n", "\n", "To query the annotations table for specific columns or filter by specific criteria, use\n", "{func}`query_index()`.\n", "Called without any argument, the function yields a [`pandas`](https://pandas.pydata.org)\n", "dataframe with the columns `system_id` and `entry_pdb_id`.\n", "To select other columns, pass the `columns` argument, which is a list of\n", "[column names](#annotation-table-target). 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from plinder.core.scores import query_index\n", "# Get system_id and entry_pdb_id columns\n", "query_index()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get specific columns from the annotation table\n", "cols_of_interest = [\"system_id\", \"entry_pdb_id\", \"entry_release_date\", \"entry_oligomeric_state\",\n", "\"entry_clashscore\", \"entry_resolution\"]\n", "query_index(columns=cols_of_interest)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Query annotations with specific filters\n", "\n", "We could also pass additional `filters`, where each filter is a logical comparison\n", "of a column name with some given value.\n", "Only those rows, that fulfill all conditions, are returned.\n", "See the description of\n", "[`pandas.read_parquet()`]https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html\n", "for more information on the filter syntax." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Query for single-ligand systems\n", "filters = [(\"system_num_ligand_chains\", \"==\", \"1\")]\n", "query_index(columns=cols_of_interest, filters=filters)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::{note}\n", ":::{currentmodule} plinder.core\n", ":::\n", "To load all the columns, users can use the function {func}`get_plindex()` which returns all the columns in the dataframe. 
However, since this table has over 1.3 million rows and 500 columns, it has a significant memory footprint, so we advise querying only the columns you need.\n", ":::" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Query protein similarity\n", "\n", "We provide three kinds of similarity datasets:\n", "\n", "- Similarity between ligand-bound structures (`holo`)\n", "- Similarity between ligand-bound and unbound protein structures (`apo`)\n", "- Similarity between ligand-bound and AlphaFold-predicted structures (`pred`)\n", "\n", "Any of these can be selected via the `search_db` argument of\n", "{func}`query_protein_similarity()`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::{note}\n", "With the full dataset, some similarity queries might require a large amount of memory.\n", "For example, `query_protein_similarity(search_db=\"holo\", filters=[(\"similarity\", \">\", \"50\")])`\n", "will use more than 500 GB of RAM.\n", ":::" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we query the protein similarity dataset to assess the protein-ligand interaction\n", "similarity between an example training set and test set." ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2024-08-27 11:32:47,823 | plinder.core.utils.cpl.download_paths:24 | INFO : runtime succeeded: 0.00s\n", "2024-08-27 11:32:47,978 | plinder.core.scores.protein.query_protein_similarity:24 | INFO : runtime succeeded: 2.12s\n" ] }, { "data": { "text/html": [ "
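<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>query_system</th>\n", "      <th>target_system</th>\n", "      <th>similarity</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>0</th>\n", "      <td>1b5d__1__1.A_1.B__1.D</td>\n", "      <td>1b5e__2__1.A_1.B__1.D</td>\n", "      <td>83</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1</th>\n", "      <td>1b5d__1__1.A_1.B__1.D</td>\n", "      <td>6a9a__1__1.A_2.A__2.C_2.D</td>\n", "      <td>83</td>\n", "    </tr>\n", "    <tr>\n", "      <th>2</th>\n", "      <td>1b5d__1__1.A_1.B__1.D</td>\n", "      <td>1jtu__1__1.A_1.B__1.C_1.D</td>\n", "      <td>67</td>\n", "    </tr>\n", "    <tr>\n", "      <th>3</th>\n", "      <td>1b5d__1__1.A_1.B__1.D</td>\n", "      <td>7jxf__1__1.A_1.B__1.G</td>\n", "      <td>67</td>\n", "    </tr>\n", "    <tr>\n", "      <th>4</th>\n", "      <td>4n7m__1__1.A_1.B__1.C</td>\n", "      <td>8f9d__2__1.C_1.D__1.G</td>\n", "      <td>50</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>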
" ], "text/plain": [ " query_system target_system similarity\n", "0 1b5d__1__1.A_1.B__1.D 1b5e__2__1.A_1.B__1.D 83\n", "1 1b5d__1__1.A_1.B__1.D 6a9a__1__1.A_2.A__2.C_2.D 83\n", "2 1b5d__1__1.A_1.B__1.D 1jtu__1__1.A_1.B__1.C_1.D 67\n", "3 1b5d__1__1.A_1.B__1.D 7jxf__1__1.A_1.B__1.G 67\n", "4 4n7m__1__1.A_1.B__1.C 8f9d__2__1.C_1.D__1.G 50" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from plinder.core.scores import query_protein_similarity\n", "# Example train systems\n", "train = [\"7jxf__1__1.A_1.B__1.G\", \"1jtu__1__1.A_1.B__1.C_1.D\",\n", " \"8f9d__2__1.C_1.D__1.G\", \"6a9a__1__1.A_2.A__2.C_2.D\",\n", " \"1b5e__2__1.A_1.B__1.D\"]\n", "# Example test systems\n", "test = [\"1b5d__1__1.A_1.B__1.D\", \"1s2g__1__1.A_2.C__1.D\",\n", " \"4agi__1__1.C__1.W\", \"4n7m__1__1.A_1.B__1.C\",\n", " \"7eek__1__1.A__1.I\"]\n", "\n", "metric = \"pli_unique_qcov\"\n", "threshold = 50\n", "query_protein_similarity(\n", " search_db=\"holo\",\n", " columns=[\"query_system\", \"target_system\", \"similarity\"],\n", " filters=[\n", " (\"query_system\", \"in\", test),\n", " (\"target_system\", \"in\", train),\n", " (\"metric\", \"==\", metric),\n", " (\"similarity\", \">=\", str(threshold)),\n", " ],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Working with a PLINDER system\n", "\n", "A {class}`PlinderSystem` is the representation of a single System.\n", "This object provides access to all PDB entry and system level annotations, as well as\n", "the structures of the system components." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load systems from IDs\n", "\n", "To reconstitute a PLINDER system directly from its ID, use the {class}`PlinderSystem` class.\n" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "from plinder.core import PlinderSystem\n", "\n", "plinder_system = PlinderSystem(system_id=\"4agi__1__1.C__1.W\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Users can choose the granularity level of the input:\n", "in the case above, the system was specified by its system ID, but alternatively\n", "passing PDB IDs (or their two middle characters) is also possible, which gives you all\n", "systems corresponding to the given PDB IDs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Accessing annotations\n", "\n", "The `PlinderSystem.entry` property provides PDB entry-level annotations for that system.\n", "Here, we will list the accessible categories of entry annotations and access the\n", "oligomeric state of a given system." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "entry_annotations = plinder_system.entry\n", "print(list(entry_annotations.keys()))\n", "print(entry_annotations[\"oligomeric_state\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In contrast, `PlinderSystem.system` returns annotations at the system level.\n", "Here, we will extract the SMILES string of the first ligand of a given system." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "system_annotations = plinder_system.system\n", "print(list(system_annotations.keys()))\n", "# Show the SMILES string of the first ligand of the system\n", "print(system_annotations[\"ligands\"][0][\"smiles\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Getting structure file paths\n", "\n", "The `PlinderSystem` also provides access to the structure files the system is based on.\n", "This is helpful for loading the structures for training a model or performing\n", "other calculations that require structural information." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(plinder_system.ligands)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The same can be done for the receptor protein." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(plinder_system.receptor_pdb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inspect apo and predicted annotations\n", "\n", "For users interested in using apo and predicted structures in model training, the\n", "snippet below maps holo system IDs (`reference_system_id`) to apo or predicted\n", "IDs (`id`) and reports their similarity measures as well.\n", "This similarity data includes protein and pocket similarity (see description\n", "[here](/evaluation.md)), as well as all evaluation metrics calculated upon superposition\n", "and transplantation of ligands into each apo/predicted structure.\n", "The same information can also be accessed directly with {func}`query_links`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plinder_system.linked_structures" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To query the links table directly, call {func}`query_links`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from plinder.core.scores import query_links\n", "\n", "links = query_links()\n", "links" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we use this table to get the PDB and chain IDs for apo structures\n", "corresponding to a given system ID." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(links[\n", "    (links.reference_system_id == \"4agi__1__1.C__1.W\") & (links.kind == \"apo\")\n", "].id.to_list())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The structure file locations for the linked structures can also be obtained.\n", "The directories are named after the `reference_system_id` and `id` columns." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for file in plinder_system.linked_archive.glob(\"**/*.cif\"):\n", "    print(file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Working with split data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get split table\n", "\n", "The split table sorts each PLINDER system into a cluster and defines the split it is\n", "part of.\n", "To access the splits, use {func}`get_split()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from plinder.core import get_split\n", "\n", "split_df = get_split()\n", "split_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, this table can be used to get all system IDs that belong to the *test*\n", "split." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "split_df[split_df.split == \"test\"].system_id.to_list()" ] } ], "metadata": { "kernelspec": { "display_name": "plinder", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 2 }