Datasets and Dataloaders#

from plinder.core.loader.dataset import PlinderDataset, get_torch_loader

PlinderDataset provides an interface to interact with PLINDER data as a dataset. It is a subclass of torch.utils.data.Dataset, as such subclassing it and extending should be familiar to most users. Flexibility and general applicability is our top concern when designing this interface and PlinderDataset allows users to not only define their own split but to also bring their own featurizer. It can be initialized with the following parameters

Parameters
    ----------
    df : pd.DataFrame | None
        the split to use
    split : str
        the split to sample from
    split_parquet_path : str | Path, default=None
        split parquet file
    input_structure_priority : str, default="apo"
        Which alternate structure to proritize
    featurizer: Callable[
            [Structure, int], dict[str, torch.Tensor]
    ] = structure_featurizer,
        Transformation to turn structure to input tensors
    padding_value : int
        Value for padding uneven array
    **kwargs : Any
        Any other keyword args

For an example of how to write your own featurizer see Featurizer Example. The signature is shown below:

def structure_featurizer(
    structure: Structure, pad_value: int = -100
    ) -> dict[str, Any]:

The input is a Structure object and it returns dictionary of padded tensor features.

Note

This is where you may want to load a train dataset, but for the purposes of demonstration - we will start with val due to smaller memory footprint, and load only a small subset of systems containing ATP as ligand. We also set use_alternate_structures=False to prevent downloading and loading alternate structures for the docs.

val_dataset = PlinderDataset(
    split="val",
    filters=[
        ("system_num_protein_chains", "==", 1),
        ("ligand_unique_ccd_code", "in", {"ATP"}),
    ],
    use_alternate_structures=False,
)
len(val_dataset)

val_data = val_dataset[1]

2024-12-23 22:44:18,363 | plinder.core.structure.atoms:221 | WARNING : get_template_to_mol_matches: could not match template fully - retry with unmatched bonds set as UNSPECIFIED

val_loader = get_torch_loader(val_dataset)
for data in val_loader:
    test_torch = data
    break

2024-12-23 22:44:19,154 | plinder.core.structure.atoms:221 | WARNING : get_template_to_mol_matches: could not match template fully - retry with unmatched bonds set as UNSPECIFIED

2024-12-23 22:44:28,133 | plinder.core.structure.atoms:221 | WARNING : get_template_to_mol_matches: could not match template fully - retry with unmatched bonds set as UNSPECIFIED

test_torch.keys()

dict_keys(['system_ids', 'holo_structures', 'alternate_structures', 'paths', 'features_and_coords'])

test_torch["system_ids"]

['4a8s__1__1.A__1.H_1.I', '4jxz__1__1.A__1.C']

for k, v in test_torch["features_and_coords"].items():
    print(k, v.shape)

sequence_atom_mask_feature torch.Size([2, 1, 4656])
input_sequence_residue_mask_feature torch.Size([2, 1, 665])
protein_coordinates torch.Size([2, 1, 5265, 3])
protein_calpha_coordinates torch.Size([2, 1, 664, 3])
input_sequence_full_atom_feature torch.Size([2, 1, 5272, 12])
protein_structure_residue_feature torch.Size([2, 1, 5265, 21])
input_conformer_ligand_feature torch.Size([2, 2, 31, 16])
input_conformer_ligand_coordinates torch.Size([2, 2, 31, 3])
resolved_ligand_mols_feature torch.Size([2, 2, 31, 3])