{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Datasets and Dataloaders"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from plinder.core.loader.dataset import PlinderDataset, get_torch_loader"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "`PlinderDataset` provides an interface to interact with _PLINDER_ data as a dataset. It is a subclass of `torch.utils.data.Dataset`, as such subclassing it and extending should be familiar to most users. Flexibility and general applicability is our top concern when designing this interface and `PlinderDataset` allows users to not only define their own split but to also bring their own featurizer.\n",
    "It can be initialized with the following parameters\n",
    "```\n",
    "Parameters\n",
    "    ----------\n",
    "    df : pd.DataFrame | None\n",
    "        the split to use\n",
    "    split : str\n",
    "        the split to sample from\n",
    "    split_parquet_path : str | Path, default=None\n",
    "        split parquet file\n",
    "    input_structure_priority : str, default=\"apo\"\n",
    "        Which alternate structure to proritize\n",
    "    featurizer: Callable[\n",
    "            [Structure, int], dict[str, torch.Tensor]\n",
    "    ] = structure_featurizer,\n",
    "        Transformation to turn structure to input tensors\n",
    "    padding_value : int\n",
    "        Value for padding uneven array\n",
    "    **kwargs : Any\n",
    "        Any other keyword args\n",
    "```\n",
    "\n",
    "For an example of how to write your own featurizer see [Featurizer Example](https://github.com/plinder-org/plinder/blob/c36eef9b02823ce572de905c094f6c85c03800ca/src/plinder/core/loader/featurizer.py#L16). The signature is shown below:\n",
    "```\n",
    "def structure_featurizer(\n",
    "    structure: Structure, pad_value: int = -100\n",
    "    ) -> dict[str, Any]:\n",
    "```\n",
    "The input is a `Structure` object and it returns dictionary of padded tensor features.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ":::{note}\n",
    "This is where you may want to load a `train` dataset, but for the purposes of demonstration - we will start with `val` due to smaller memory footprint, and load only a small subset of systems containing `ATP` as ligand. We also set `use_alternate_structures=False` to prevent downloading and loading alternate structures for the docs.\n",
    ":::"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val_dataset = PlinderDataset(\n",
    "    split=\"val\",\n",
    "    filters=[\n",
    "        (\"system_num_protein_chains\", \"==\", 1),\n",
    "        (\"ligand_unique_ccd_code\", \"in\", {\"ATP\"}),\n",
    "    ],\n",
    "    use_alternate_structures=False,\n",
    ")\n",
    "len(val_dataset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val_data = val_dataset[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val_loader = get_torch_loader(val_dataset)\n",
    "for data in val_loader:\n",
    "    test_torch = data\n",
    "    break"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_torch.keys()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_torch[\"system_ids\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for k, v in test_torch[\"features_and_coords\"].items():\n",
    "    print(k, v.shape)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}