{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Datasets and Dataloaders" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from plinder.core.loader.dataset import PlinderDataset, get_torch_loader" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "`PlinderDataset` provides an interface to interact with _PLINDER_ data as a dataset. It is a subclass of `torch.utils.data.Dataset`, as such subclassing it and extending should be familiar to most users. Flexibility and general applicability is our top concern when designing this interface and `PlinderDataset` allows users to not only define their own split but to also bring their own featurizer.\n", "It can be initialized with the following parameters\n", "```\n", "Parameters\n", " ----------\n", " df : pd.DataFrame | None\n", " the split to use\n", " split : str\n", " the split to sample from\n", " split_parquet_path : str | Path, default=None\n", " split parquet file\n", " input_structure_priority : str, default=\"apo\"\n", " Which alternate structure to proritize\n", " featurizer: Callable[\n", " [Structure, int], dict[str, torch.Tensor]\n", " ] = structure_featurizer,\n", " Transformation to turn structure to input tensors\n", " padding_value : int\n", " Value for padding uneven array\n", " **kwargs : Any\n", " Any other keyword args\n", "```\n", "\n", "For an example of how to write your own featurizer see [Featurizer Example](https://github.com/plinder-org/plinder/blob/c36eef9b02823ce572de905c094f6c85c03800ca/src/plinder/core/loader/featurizer.py#L16). The signature is shown below:\n", "```\n", "def structure_featurizer(\n", " structure: Structure, pad_value: int = -100\n", " ) -> dict[str, Any]:\n", "```\n", "The input is a `Structure` object and it returns dictionary of padded tensor features.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::{note}\n", "This is where you may want to load a `train` dataset, but for the purposes of demonstration - we will start with `val` due to smaller memory footprint, and load only a small subset of systems containing `ATP` as ligand. We also set `use_alternate_structures=False` to prevent downloading and loading alternate structures for the docs.\n", ":::" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "val_dataset = PlinderDataset(\n", " split=\"val\",\n", " filters=[\n", " (\"system_num_protein_chains\", \"==\", 1),\n", " (\"ligand_unique_ccd_code\", \"in\", {\"ATP\"}),\n", " ],\n", " use_alternate_structures=False,\n", ")\n", "len(val_dataset)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "val_data = val_dataset[1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "val_loader = get_torch_loader(val_dataset)\n", "for data in val_loader:\n", " test_torch = data\n", " break" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_torch.keys()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_torch[\"system_ids\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for k, v in test_torch[\"features_and_coords\"].items():\n", " print(k, v.shape)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 2 }