{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Structure abstraction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from rdkit.Chem.Draw import IPythonConsole\n",
    "\n",
    "IPythonConsole.drawOptions.addAtomIndices = True"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To enable using systems and linked structure files in training deep learning models, we've implemented a number of useful functions to align, mask, and featurize proteins and ligands.\n",
    "\n",
    "For this, we convert our `PlinderSystem` to a `Structure` object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from plinder.core import PlinderSystem\n",
    "\n",
    "plinder_system = PlinderSystem(system_id=\"4agi__1__1.C__1.W\")\n",
    "\n",
    "system_structure = plinder_system.holo_structure"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "system_structure"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- We list all fields and their `FieldInfo` to show which ones are required. `id`, `protein_path` and `protein_sequence` are required. Everything else is optionally. Particularly worth mentioning is the decision to make `list_ligand_sdf_and_input_smiles` optional; this is because ligand will not be availbale in apo and predicted structures.\n",
    "- Out of these field `ligand_mols` and `protein_atom_array` is computed within the object if set to default. \n",
    "- `ligand_mols` returns a chain-mapped dictionary of of the form:\n",
    "    ```\n",
    "    {\n",
    "        \"<instance_id>.<chain_id>\": (\n",
    "            RDKit 2D mol from template SMILES of type `Chem.Mol`,\n",
    "            RDKit mol from template SMILES with random 3D conformer of type `Chem.Mol`,\n",
    "            RDKit mol of solved (holo) ligand structure of type `Chem.Mol`,\n",
    "            paired stacked arrays (template vs holo) mapping atom order by index of type `tuple[NDArray.int_, NDArray.int_]`\n",
    "        )\n",
    "\n",
    "    }\n",
    "    ```\n",
    "- While `protein_atom_array` returns [biotite AtomArray](https://www.biotite-python.org/latest/apidoc/biotite.structure.AtomArray.html) of the receptor protein structure.\n",
    "- `add_ligand_hydrogens` specifies whether to adds hydrogens to ligand\n",
    "- `structure_type`: could be `\"holo\"`, `\"apo\"` or `\"pred\"`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "system_structure.model_fields"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Ligand"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for name in system_structure.get_properties():\n",
    "    if \"ligand\" in name:\n",
    "        print(name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The ligands are provided using dictionaries.\n",
    "\n",
    "These dictionaries contain information for each ligand:\n",
    "- `input_ligand_templates`: 2D RDKit mols generated from the RDKit canonical SMILES\n",
    "- `input_ligand_conformers`: 3D (random) conformers generated for each input mol\n",
    "- `input_ligand_conformers_coords`: positional coordintates for 3D conformers\n",
    "- `resolved_ligand_mols`: RDKit mols of solved (holo) ligand structures\n",
    "- `resolved_ligand_mols_coords`: positional coordintates for holo ligand structures\n",
    "- `ligand_template2resolved_atom_order_stacks`: paired stacked arrays (template vs holo) mapping atom order by index\n",
    "- `ligand_chain_ordered`: ordered list of all ligands by their keys"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ligand atom id mapping mapping\n",
    "\n",
    "Unlike the protein sequence - there is no canonical order to ligand atoms in the molecule.\n",
    "It can be further complicated by automorphisms present in the structure due to symmetry, i.e. there is more than one match that is possible between the structures.\n",
    "\n",
    "This is important when calculating ligand structure loss, as the most optimal atom order can change between the different inference results. Typically, it is accepted to take the atom ordering resulting in the best objective score and use that for the loss calculation.\n",
    "\n",
    "Occasionally futher ambiguity arises to to part of the ligand structure being unresolved in the holo structure - this can lead to multiple available matches. We use RascalMCES algorithm from RDKit to provide all the possible matches between the atom order in the input structure (from SMILES) to the resolved holo structure.\n",
    "\n",
    "This is provided as stacks of atom order arrays that reorder the template and holo indices to provide matches. Each stack is a unique order transformation and should be iterated."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "system_structure.input_ligand_templates[system_structure.ligand_chain_ordered[0]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "system_structure.input_ligand_conformers[system_structure.ligand_chain_ordered[0]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "system_structure.resolved_ligand_mols[system_structure.ligand_chain_ordered[0]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ligand conformer coordinates\n",
    "\n",
    "As you can tell, the input 2D and 3D conformer indices match, but the resolved ligand is different.\n",
    "Thus to perform a correct comparison for their coordinates one should use atom order stacks.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "(\n",
    "    input_atom_order_stack,\n",
    "    holo_atom_order_stack,\n",
    ") = system_structure.ligand_template2resolved_atom_order_stacks[\n",
    "    system_structure.ligand_chain_ordered[0]\n",
    "]\n",
    "input_atom_order_stack, holo_atom_order_stack"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "system_structure.input_ligand_conformers_coords[\n",
    "    system_structure.ligand_chain_ordered[0]\n",
    "][input_atom_order_stack]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "system_structure.resolved_ligand_mols_coords[system_structure.ligand_chain_ordered[0]][\n",
    "    holo_atom_order_stack\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### More complicated examples"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Symmetry in ligand (automorphism) - two ways of pairwise mapping the atom order"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "structure_with_symmetry_in_ligand = PlinderSystem(\n",
    "    system_id=\"4v2y__1__1.A__1.E\"\n",
    ").holo_structure\n",
    "structure_with_symmetry_in_ligand.input_ligand_templates[\n",
    "    structure_with_symmetry_in_ligand.ligand_chain_ordered[0]\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "structure_with_symmetry_in_ligand.resolved_ligand_mols[\n",
    "    structure_with_symmetry_in_ligand.ligand_chain_ordered[0]\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "structure_with_symmetry_in_ligand.ligand_template2resolved_atom_order_stacks[\n",
    "    structure_with_symmetry_in_ligand.ligand_chain_ordered[0]\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Symmetry arises due to ligand being partially resolved - there are three template pieces that can be mapped to the resolved ground truth."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "structure_with_partly_resolved_ligand = PlinderSystem(\n",
    "    system_id=\"1ngx__1__1.A_1.B__1.E\"\n",
    ").holo_structure\n",
    "structure_with_partly_resolved_ligand.input_ligand_templates[\n",
    "    structure_with_partly_resolved_ligand.ligand_chain_ordered[0]\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "structure_with_partly_resolved_ligand.resolved_ligand_mols[\n",
    "    structure_with_partly_resolved_ligand.ligand_chain_ordered[0]\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "structure_with_partly_resolved_ligand.ligand_template2resolved_atom_order_stacks[\n",
    "    structure_with_partly_resolved_ligand.ligand_chain_ordered[0]\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Protein"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for name in system_structure.get_properties():\n",
    "    if \"protein\" in name:\n",
    "        print(name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Others properties\n",
    "This includes:\n",
    "- `num_ligands`: Number of ligand chains\n",
    "- `smiles`: Ligand smiles dictionary\n",
    "- `num_proteins`: Number of protein chains\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Masking\n",
    "The properties `protein_backbone_mask` and `protein_calpha_mask` are boolean masks that can be used to select backbone or calpha atoms from biotite `AtomArray`. The indices of `True` corresponds to backbone or calpha indices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\n",
    "    \"Total number of atoms:\",\n",
    "    len(system_structure.protein_atom_array),\n",
    ")\n",
    "print(\"Number of backbone atoms:\", system_structure.protein_backbone_mask.sum())\n",
    "print(\n",
    "    \"Number of calpha atoms:\",\n",
    "    system_structure.protein_calpha_mask.sum(),\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "calpha_atom_array = system_structure.protein_atom_array[\n",
    "    system_structure.protein_calpha_mask\n",
    "]\n",
    "calpha_atom_array.coord.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also filter by arbitrary properties of the `AtomArray` using the `filter` method. This returns a new `Structure` object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "calpha_structure = system_structure.filter(\n",
    "    property=\"atom_name\",\n",
    "    mask=\"CA\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "calpha_structure.protein_atom_array.coord.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get protein chain ordered\n",
    "This gives a list of protein chains ordered by how they are in the structure"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "system_structure.protein_chain_ordered"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get protein chains for all atoms\n",
    "The list of chain IDs in the structure. Order of how they appear not kept."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "system_structure.protein_chains"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get protein coordinates\n",
    "This property gets the 3D positions of each of the atoms in protein molecules"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "system_structure.protein_coords"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get number of atoms of protein molecule"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "system_structure.protein_n_atoms"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get protein structure atom names\n",
    "Returns all atoms names the same way they appear in the structure"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "system_structure.protein_unique_atom_names"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get protein residue names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "system_structure.protein_unique_residue_names"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get protein residues number\n",
    "Residue number as they appear in structure"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "system_structure.protein_unique_residue_ids"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get sequence from protein structure\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "system_structure.protein_sequence_from_structure"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that this is different from the canonical SEQRES sequence due to unresolved terminal residues:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "system_structure.protein_sequence"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get tokenized sequence\n",
    "Get tensor of sequence converted to integer-based amino acid tokens"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "system_structure.protein_structure_tokenized_sequence"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Linked protein input structures\n",
    "\n",
    "\n",
    "For realistic inference scenarios we need to initialize our protein structures using a linked structure (introduced above). In most cases, these will not be a perfect match to the _holo_ structure - the number of residues, residue numbering, and sometime the sequnce can be different. It's important to be able to match these structures to ensure that we can map between them. \n",
    "\n",
    "In the example below, we will take a _holo_ structure and it's linked predicted (_pred_) form with different number of residues and  match and crop the resulting structures to figure out the correspondence between their residues. For this we use the `align_common_sequence` function of the holo `Structure` object, which aligns two structures based on their shared sequences. It has the following parameters:\n",
    "\n",
    "```\n",
    "other: Structure\n",
    "    The other structure to align to\n",
    "copy: bool\n",
    "    Whether to make a copy or edit in-place\n",
    "remove_differing_atoms: bool\n",
    "    Whether to remove differing atoms between the two structure\n",
    "renumber_residues: bool [False]\n",
    "    If True, renumber residues in the two structures to match and starting from 1.\n",
    "    If False, sets the resulting residue indices to the one from the aligned sequence\n",
    "remove_differing_annotations: bool [False]\n",
    "    Whether to remove differing annotations, like b-factor, etc\n",
    "```\n",
    "In this example, we will match, make copies and crop the structures.\n",
    "\n",
    ":::{note} To use this function the proteins to be aligned must have the same chain ids. So, we first set the chain id of the predicted structure to that of the holo structure. :::\n",
    "\n",
    "```python\n",
    "plinder_system = PlinderSystem(system_id=\"4cj6__1__1.A__1.B\")\n",
    "holo = plinder_system.holo_structure\n",
    "predicted = plinder_system.alternate_structures[\"P12271_A\"]\n",
    "predicted.set_chain(holo.protein_chain_ordered[0])\n",
    "holo_cropped, predicted_cropped = holo.align_common_sequence(predicted)\n",
    "predicted_cropped_superposed, raw_rmsd, refined_rmsd = predicted_cropped.superimpose(\n",
    "    holo_cropped\n",
    ")\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}