Structure abstraction#

from rdkit.Chem.Draw import IPythonConsole

IPythonConsole.drawOptions.addAtomIndices = True

To enable using systems and linked structure files in training deep learning models, we’ve implemented a number of useful functions to align, mask, and featurize proteins and ligands.

For this, we convert our PlinderSystem to a Structure object.

from plinder.core import PlinderSystem

plinder_system = PlinderSystem(system_id="4agi__1__1.C__1.W")

system_structure = plinder_system.holo_structure

system_structure

Structure(
    (
        'id',
        '4agi__1__1.C__1.W',
    ),
    (
        'protein_path',
        /home/runner/.local/share/plinder/2024-06/v2/systems/4agi__1__1.C__1.W/receptor.cif,
    ),
    (
        'protein_sequence',
        {
            '1.C': 'MSTPGAQQVLFRTGIAAVNSTNHLRVYFQDVYGSIRESLYEGSWANGTEKNVIGNAKLGSPVAATSKELKHIRVYTLTEGNTLQEFAYDSGTGWYNGGLGGAKFQVAPYSXIAAVFLAGTDALQLRIYAQKPDNTIQEYMWNGDGWKEGTNLGGALPGTGIGATSFRYTDYNGPSIRIWFQTDDLKLVQRAYDPHKGWYPDLVTIFDRAPPRTAIAATSFGAGNSSIYMRIYFVNSDNTIWQVCWDHGKGYHDKGTITPVIQGSEVAIISWGSFANNGPDLRLYFQNGTYISAVSEWVWNRAHGSQLGRSALPPA',
        },
    ),
    (
        'ligand_sdfs',
        {
            '1.W': '/home/runner/.local/share/plinder/2024-06/v2/systems/4agi__1__1.C__1.W/ligand_files/1.W.sdf',
        },
    ),
    (
        'ligand_smiles',
        {
            '1.W': 'C[Se][C@@H]1O[C@@H](C)[C@@H](O)[C@@H](O)[C@@H]1O',
        },
    ),
    (
        'protein_atom_array',
        <class 'biotite.structure.AtomArray'> with shape (2443,),
    ),
    (
        'ligand_mols',
        {
            '1.W': (
                <rdkit.Chem.rdchem.Mol object at 0x7f3720312420>,
                <rdkit.Chem.rdchem.Mol object at 0x7f37202e2880>,
                <rdkit.Chem.rdchem.Mol object at 0x7f3720312570>,
                (
                    <class 'numpy.ndarray'> with shape (1, 12),
                    <class 'numpy.ndarray'> with shape (1, 12),
                ),
            ),
        },
    ),
    (
        'add_ligand_hydrogens',
        False,
    ),
    (
        'skip_3d_confgen',
        False,
    ),
    (
        'structure_type',
        'holo',
    ),
)

We list all fields and their FieldInfo to show which ones are required. id, protein_path and protein_sequence are required. Everything else is optionally. Particularly worth mentioning is the decision to make list_ligand_sdf_and_input_smiles optional; this is because ligand will not be availbale in apo and predicted structures.
Out of these field ligand_mols and protein_atom_array is computed within the object if set to default.

ligand_mols returns a chain-mapped dictionary of of the form:

{
    "<instance_id>.<chain_id>": (
        RDKit 2D mol from template SMILES of type `Chem.Mol`,
        RDKit mol from template SMILES with random 3D conformer of type `Chem.Mol`,
        RDKit mol of solved (holo) ligand structure of type `Chem.Mol`,
        paired stacked arrays (template vs holo) mapping atom order by index of type `tuple[NDArray.int_, NDArray.int_]`
    )

}

While protein_atom_array returns biotite AtomArray of the receptor protein structure.
add_ligand_hydrogens specifies whether to adds hydrogens to ligand
structure_type: could be "holo", "apo" or "pred"

system_structure.model_fields

{'id': FieldInfo(annotation=str, required=True),
 'protein_path': FieldInfo(annotation=Path, required=True),
 'protein_sequence': FieldInfo(annotation=Union[dict[str, str], NoneType], required=False, default=None),
 'ligand_sdfs': FieldInfo(annotation=Union[dict[str, str], NoneType], required=False, default=None),
 'ligand_smiles': FieldInfo(annotation=Union[dict[str, str], NoneType], required=False, default=None),
 'protein_atom_array': FieldInfo(annotation=Union[AtomArray, NoneType], required=False, default=None),
 'ligand_mols': FieldInfo(annotation=Union[dict[str, tuple[Mol, Mol, Mol, tuple[ndarray[Any, dtype[+_ScalarType_co]], ndarray[Any, dtype[+_ScalarType_co]]]]], NoneType], required=False, default=None),
 'add_ligand_hydrogens': FieldInfo(annotation=bool, required=False, default=False),
 'skip_3d_confgen': FieldInfo(annotation=bool, required=False, default=False),
 'structure_type': FieldInfo(annotation=str, required=False, default='holo')}

Ligand#

for name in system_structure.get_properties():
    if "ligand" in name:
        print(name)

input_ligand_conformer_atom_array
input_ligand_conformers
input_ligand_conformers_coords
input_ligand_templates
ligand_atom_array
ligand_chain_ordered
ligand_template2resolved_atom_order_stacks
resolved_ligand_mols
resolved_ligand_mols_coords

The ligands are provided using dictionaries.

These dictionaries contain information for each ligand:

input_ligand_templates: 2D RDKit mols generated from the RDKit canonical SMILES
input_ligand_conformers: 3D (random) conformers generated for each input mol
input_ligand_conformers_coords: positional coordintates for 3D conformers
resolved_ligand_mols: RDKit mols of solved (holo) ligand structures
resolved_ligand_mols_coords: positional coordintates for holo ligand structures
ligand_template2resolved_atom_order_stacks: paired stacked arrays (template vs holo) mapping atom order by index
ligand_chain_ordered: ordered list of all ligands by their keys

Ligand atom id mapping mapping#

Unlike the protein sequence - there is no canonical order to ligand atoms in the molecule. It can be further complicated by automorphisms present in the structure due to symmetry, i.e. there is more than one match that is possible between the structures.

This is important when calculating ligand structure loss, as the most optimal atom order can change between the different inference results. Typically, it is accepted to take the atom ordering resulting in the best objective score and use that for the loss calculation.

Occasionally futher ambiguity arises to to part of the ligand structure being unresolved in the holo structure - this can lead to multiple available matches. We use RascalMCES algorithm from RDKit to provide all the possible matches between the atom order in the input structure (from SMILES) to the resolved holo structure.

This is provided as stacks of atom order arrays that reorder the template and holo indices to provide matches. Each stack is a unique order transformation and should be iterated.

system_structure.input_ligand_templates[system_structure.ligand_chain_ordered[0]]

../_images/89d09a7aa9c107b12501231eaebf0e276434f0f9920743ddd902dabe13061d80.png

system_structure.input_ligand_conformers[system_structure.ligand_chain_ordered[0]]

../_images/938def1f2b08cad6356b8b5332743c39775a46abf8b8d93aa6fc195484e9861b.png

system_structure.resolved_ligand_mols[system_structure.ligand_chain_ordered[0]]

../_images/a45d4400ac54e82fd981e2b78ce3643e7090b5ffa4c6c567be5d5c8c6830cb6d.png

Ligand conformer coordinates#

As you can tell, the input 2D and 3D conformer indices match, but the resolved ligand is different. Thus to perform a correct comparison for their coordinates one should use atom order stacks.

(
    input_atom_order_stack,
    holo_atom_order_stack,
) = system_structure.ligand_template2resolved_atom_order_stacks[
    system_structure.ligand_chain_ordered[0]
]
input_atom_order_stack, holo_atom_order_stack

(array([[ 2, 10,  8,  6,  4,  5, 11,  9,  7,  3,  0,  1]]),
 array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11]]))

system_structure.input_ligand_conformers_coords[
    system_structure.ligand_chain_ordered[0]
][input_atom_order_stack]

array([[[-0.40725066, -1.47748498,  0.185955  ],
        [ 1.0643663 , -1.31706023,  0.33981289],
        [ 1.51382048, -0.06269305,  1.01334722],
        [ 0.43775803,  1.01633012,  1.02753193],
        [-0.32555831,  0.7522505 , -0.26447634],
        [-1.0585462 ,  1.94667957, -0.76103278],
        [ 1.730437  , -1.48907463, -0.8864381 ],
        [ 2.0006517 , -0.2496771 ,  2.29537409],
        [-0.4092932 ,  0.72044372,  2.10546384],
        [-1.13277182, -0.33267599,  0.01577224],
        [-2.16322047, -1.83693988, -2.49576751],
        [-0.84673963, -2.66539344, -1.30355921]]])

system_structure.resolved_ligand_mols_coords[system_structure.ligand_chain_ordered[0]][
    holo_atom_order_stack
]

array([[[48.954, 28.186, 62.068],
        [48.681, 29.643, 61.777],
        [48.556, 29.933, 60.282],
        [47.499, 28.988, 59.699],
        [47.948, 27.56 , 59.939],
        [46.992, 26.503, 59.374],
        [49.72 , 30.41 , 62.383],
        [48.255, 31.352, 60.148],
        [46.262, 29.193, 60.366],
        [48.031, 27.383, 61.339],
        [50.647, 25.895, 62.426],
        [50.853, 27.716, 61.696]]])

More complicated examples#

Symmetry in ligand (automorphism) - two ways of pairwise mapping the atom order

structure_with_symmetry_in_ligand = PlinderSystem(
    system_id="4v2y__1__1.A__1.E"
).holo_structure
structure_with_symmetry_in_ligand.input_ligand_templates[
    structure_with_symmetry_in_ligand.ligand_chain_ordered[0]
]

../_images/cc2a0f16b66590bd38cc31a07b8f2429bda6670f4a419bc48a41e193c974cd77.png

structure_with_symmetry_in_ligand.resolved_ligand_mols[
    structure_with_symmetry_in_ligand.ligand_chain_ordered[0]
]

../_images/a3ebd5cc7eed49dc537a4794e260f5e319f8a8a56e7a5583a6d995d948562606.png

structure_with_symmetry_in_ligand.ligand_template2resolved_atom_order_stacks[
    structure_with_symmetry_in_ligand.ligand_chain_ordered[0]
]

(array([[17, 16, 18,  1,  0,  2,  3,  4,  5, 10,  9,  8, 13,  7, 15, 12,
         11, 14,  6],
        [17, 16, 18,  1,  0,  2,  3,  4,  5, 11, 12, 13,  8, 15,  7,  9,
         10,  6, 14]]),
 array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
         16, 17, 18],
        [ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
         16, 17, 18]]))

Symmetry arises due to ligand being partially resolved - there are three template pieces that can be mapped to the resolved ground truth.

structure_with_partly_resolved_ligand = PlinderSystem(
    system_id="1ngx__1__1.A_1.B__1.E"
).holo_structure
structure_with_partly_resolved_ligand.input_ligand_templates[
    structure_with_partly_resolved_ligand.ligand_chain_ordered[0]
]

../_images/0f3ef822dcd990d1af3ab7afc8f8a24908b2ca6edad310b52e5762893493409a.png

structure_with_partly_resolved_ligand.resolved_ligand_mols[
    structure_with_partly_resolved_ligand.ligand_chain_ordered[0]
]

../_images/99025f64d82e9431b8bfc4fe7914743793fa50f765bb1221aa46c09b9ccb0550.png

structure_with_partly_resolved_ligand.ligand_template2resolved_atom_order_stacks[
    structure_with_partly_resolved_ligand.ligand_chain_ordered[0]
]

(array([[ 9, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
         26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37],
        [ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
         18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
        [ 5,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
         22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33]]),
 array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
         16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27],
        [ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
         16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27],
        [ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
         16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27]]))

Protein#

for name in system_structure.get_properties():
    if "protein" in name:
        print(name)

protein_backbone_mask
protein_calpha_coords
protein_calpha_mask
protein_chain_ordered
protein_chains
protein_coords
protein_n_atoms
protein_sequence_from_structure
protein_structure_b_factor
protein_structure_tokenized_sequence
protein_unique_atom_names
protein_unique_residue_ids
protein_unique_residue_names

Others properties#

This includes:

num_ligands: Number of ligand chains
smiles: Ligand smiles dictionary
num_proteins: Number of protein chains

Masking#

The properties protein_backbone_mask and protein_calpha_mask are boolean masks that can be used to select backbone or calpha atoms from biotite AtomArray. The indices of True corresponds to backbone or calpha indices.

print(
    "Total number of atoms:",
    len(system_structure.protein_atom_array),
)
print("Number of backbone atoms:", system_structure.protein_backbone_mask.sum())
print(
    "Number of calpha atoms:",
    system_structure.protein_calpha_mask.sum(),
)

Total number of atoms: 2443
Number of backbone atoms: 942
Number of calpha atoms: 314

calpha_atom_array = system_structure.protein_atom_array[
    system_structure.protein_calpha_mask
]
calpha_atom_array.coord.shape

(314, 3)

You can also filter by arbitrary properties of the AtomArray using the filter method. This returns a new Structure object.

calpha_structure = system_structure.filter(
    property="atom_name",
    mask="CA",
)

calpha_structure.protein_atom_array.coord.shape

(314, 3)

Get protein chain ordered#

This gives a list of protein chains ordered by how they are in the structure

system_structure.protein_chain_ordered

['1.C']

Get protein chains for all atoms#

The list of chain IDs in the structure. Order of how they appear not kept.

system_structure.protein_chains

['1.C']

Get protein coordinates#

This property gets the 3D positions of each of the atoms in protein molecules

system_structure.protein_coords

[array([[48.761, 52.015, 58.719],
        [48.178, 50.968, 57.827],
        [49.186, 50.593, 56.712],
        ...,
        [50.066, 42.485, 54.198],
        [47.368, 43.433, 55.544],
        [50.022, 44.674, 54.248]], dtype=float32)]

Get number of atoms of protein molecule#

system_structure.protein_n_atoms

Get protein structure atom names#

Returns all atoms names the same way they appear in the structure

system_structure.protein_unique_atom_names

['C',
 'CA',
 'CB',
 'CD',
 'CD1',
 'CD2',
 'CE',
 'CE1',
 'CE2',
 'CE3',
 'CG',
 'CG1',
 'CG2',
 'CH2',
 'CM',
 'CZ',
 'CZ2',
 'CZ3',
 'N',
 'ND1',
 'ND2',
 'NE',
 'NE1',
 'NE2',
 'NH1',
 'NH2',
 'NZ',
 'O',
 'OD1',
 'OD2',
 'OE1',
 'OE2',
 'OG',
 'OG1',
 'OH',
 'OXT',
 'SD',
 'SE',
 'SG']

Get protein residue names#

system_structure.protein_unique_residue_names

['ALA',
 'ARG',
 'ASN',
 'ASP',
 'CSD',
 'CSK',
 'GLN',
 'GLU',
 'GLY',
 'HIS',
 'ILE',
 'LEU',
 'LYS',
 'MET',
 'PHE',
 'PRO',
 'SER',
 'THR',
 'TRP',
 'TYR',
 'VAL']

Get protein residues number#

Residue number as they appear in structure

system_structure.protein_unique_residue_ids

Get sequence from protein structure#

system_structure.protein_sequence_from_structure

'STPGAQQVLFRTGIAAVNSTNHLRVYFQDVYGSIRESLYEGSWANGTEKNVIGNAKLGSPVAATSKELKHIRVYTLTEGNTLQEFAYDSGTGWYNGGLGGAKFQVAPYSXIAAVFLAGTDALQLRIYAQKPDNTIQEYMWNGDGWKEGTNLGGALPGTGIGATSFRYTDYNGPSIRIWFQTDDLKLVQRAYDPHKGWYPDLVTIFDRAPPRTAIAATSFGAGNSSIYMRIYFVNSDNTIWQVCWDHGKGYHDKGTITPVIQGSEVAIISWGSFANNGPDLRLYFQNGTYISAVSEWVWNRAHGSQLGRSALPPA'

Note that this is different from the canonical SEQRES sequence due to unresolved terminal residues:

system_structure.protein_sequence

{'1.C': 'MSTPGAQQVLFRTGIAAVNSTNHLRVYFQDVYGSIRESLYEGSWANGTEKNVIGNAKLGSPVAATSKELKHIRVYTLTEGNTLQEFAYDSGTGWYNGGLGGAKFQVAPYSXIAAVFLAGTDALQLRIYAQKPDNTIQEYMWNGDGWKEGTNLGGALPGTGIGATSFRYTDYNGPSIRIWFQTDDLKLVQRAYDPHKGWYPDLVTIFDRAPPRTAIAATSFGAGNSSIYMRIYFVNSDNTIWQVCWDHGKGYHDKGTITPVIQGSEVAIISWGSFANNGPDLRLYFQNGTYISAVSEWVWNRAHGSQLGRSALPPA'}

Get tokenized sequence#

Get tensor of sequence converted to integer-based amino acid tokens

system_structure.protein_structure_tokenized_sequence

tensor([15, 16, 14,  7,  0,  5,  5, 19, 10, 13,  1, 16,  7,  9,  0,  0, 19,  2,
        15, 16,  2,  8, 10,  1, 19, 18, 13,  5,  3, 19, 18,  7, 15,  9,  1,  6,
        15, 10, 18,  6,  7, 15, 17,  0,  2,  7, 16,  6, 11,  2, 19,  9,  7,  2,
         0, 11, 10,  7, 15, 14, 19,  0,  0, 16, 15, 11,  6, 10, 11,  8,  9,  1,
        19, 18, 16, 10, 16,  6,  7,  2, 16, 10,  5,  6, 13,  0, 18,  3, 15,  7,
        16,  7, 17, 18,  2,  7,  7, 10,  7,  7,  0, 11, 13,  5, 19,  0, 14, 18,
        15, 20,  9,  0,  0, 19, 13, 10,  0,  7, 16,  3,  0, 10,  5, 10,  1,  9,
        18,  0,  5, 11, 14,  3,  2, 16,  9,  5,  6, 18, 12, 17,  2,  7,  3,  7,
        17, 11,  6,  7, 16,  2, 10,  7,  7,  0, 10, 14,  7, 16,  7,  9,  7,  0,
        16, 15, 13,  1, 18, 16,  3, 18,  2,  7, 14, 15,  9,  1,  9, 17, 13,  5,
        16,  3,  3, 10, 11, 10, 19,  5,  1,  0, 18,  3, 14,  8, 11,  7, 17, 18,
        14,  3, 10, 19, 16,  9, 13,  3,  1,  0, 14, 14,  1, 16,  0,  9,  0,  0,
        16, 15, 13,  7,  0,  7,  2, 15, 15,  9, 18, 12,  1,  9, 18, 13, 19,  2,
        15,  3,  2, 16,  9, 17,  5, 19,  4, 17,  3,  8,  7, 11,  7, 18,  8,  3,
        11,  7, 16,  9, 16, 14, 19,  9,  5,  7, 15,  6, 19,  0,  9,  9, 15, 17,
         7, 15, 13,  0,  2,  2,  7, 14,  3, 10,  1, 10, 18, 13,  5,  2,  7, 16,
        18,  9, 15,  0, 19, 15,  6, 17, 19, 17,  2,  1,  0,  8,  7, 15,  5, 10,
         7,  1, 15,  0, 10, 14, 14,  0])

Linked protein input structures#

For realistic inference scenarios we need to initialize our protein structures using a linked structure (introduced above). In most cases, these will not be a perfect match to the holo structure - the number of residues, residue numbering, and sometime the sequnce can be different. It’s important to be able to match these structures to ensure that we can map between them.

In the example below, we will take a holo structure and it’s linked predicted (pred) form with different number of residues and match and crop the resulting structures to figure out the correspondence between their residues. For this we use the align_common_sequence function of the holo Structure object, which aligns two structures based on their shared sequences. It has the following parameters:

other: Structure
    The other structure to align to
copy: bool
    Whether to make a copy or edit in-place
remove_differing_atoms: bool
    Whether to remove differing atoms between the two structure
renumber_residues: bool [False]
    If True, renumber residues in the two structures to match and starting from 1.
    If False, sets the resulting residue indices to the one from the aligned sequence
remove_differing_annotations: bool [False]
    Whether to remove differing annotations, like b-factor, etc

In this example, we will match, make copies and crop the structures.

Note

To use this function the proteins to be aligned must have the same chain ids. So, we first set the chain id of the predicted structure to that of the holo structure. :::

plinder_system = PlinderSystem(system_id="4cj6__1__1.A__1.B")
holo = plinder_system.holo_structure
predicted = plinder_system.alternate_structures["P12271_A"]
predicted.set_chain(holo.protein_chain_ordered[0])
holo_cropped, predicted_cropped = holo.align_common_sequence(predicted)
predicted_cropped_superposed, raw_rmsd, refined_rmsd = predicted_cropped.superimpose(
    holo_cropped
)