# Datasets and Dataloaders

In [None]:
from plinder.core.loader.dataset import PlinderDataset, get_torch_loader


`PlinderDataset` provides an interface for working with _PLINDER_ data as a dataset. It is a subclass of `torch.utils.data.Dataset`, so subclassing and extending it should be familiar to most users. Flexibility and general applicability were our top concerns when designing this interface: `PlinderDataset` lets users not only define their own split but also bring their own featurizer.
It can be initialized with the following parameters:
```
Parameters
----------
df : pd.DataFrame | None
    the split to use
split : str
    the split to sample from
split_parquet_path : str | Path, default=None
    split parquet file
input_structure_priority : str, default="apo"
    Which alternate structure to prioritize
featurizer : Callable[[Structure, int], dict[str, torch.Tensor]], default=structure_featurizer
    Transformation to turn structure to input tensors
padding_value : int
    Value for padding uneven array
**kwargs : Any
    Any other keyword args
```
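Because a map-style dataset only needs `__len__` and `__getitem__`, one lightweight way to extend a dataset without touching its internals is to wrap it. The sketch below is an illustration under stated assumptions, not part of the PLINDER API: `TransformedDataset` is a hypothetical name, and a plain list of dictionaries stands in for a real dataset.

```python
from typing import Any, Callable, Sequence


class TransformedDataset:
    """Hypothetical wrapper that applies `transform` to every item of a map-style dataset."""

    def __init__(self, base: Sequence[Any], transform: Callable[[Any], Any]) -> None:
        self.base = base
        self.transform = transform

    def __len__(self) -> int:
        # Same number of items as the wrapped dataset.
        return len(self.base)

    def __getitem__(self, index: int) -> Any:
        # Fetch the original item, then post-process it.
        return self.transform(self.base[index])


# Stand-in items (assumed structure, for illustration only).
toy_items = [
    {"system_id": "a", "n_atoms": 3},
    {"system_id": "b", "n_atoms": 5},
]
wrapped = TransformedDataset(
    toy_items,
    lambda item: {**item, "is_small": item["n_atoms"] < 4},
)
```

Any object that supports indexing and `len()` can be passed as `base`, so the same pattern should carry over to a dataset instance, though that has not been verified here.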

For an example of how to write your own featurizer see [Featurizer Example](https://github.com/plinder-org/plinder/blob/c36eef9b02823ce572de905c094f6c85c03800ca/src/plinder/core/loader/featurizer.py#L16). The signature is shown below:
```
def structure_featurizer(
    structure: Structure, pad_value: int = -100
) -> dict[str, Any]:
    ...
```
The input is a `Structure` object, and the function returns a dictionary of padded tensor features.
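To make "padded" concrete, here is a minimal sketch of right-padding uneven feature rows to a common length with the default pad value of `-100`. It uses plain Python lists so that it runs without any dependencies. `pad_to_length` is a hypothetical helper, not a PLINDER function, and a real featurizer would return `torch.Tensor` objects instead.

```python
def pad_to_length(values: list[int], length: int, pad_value: int = -100) -> list[int]:
    """Right-pad `values` with `pad_value` up to `length` (hypothetical helper)."""
    if len(values) > length:
        raise ValueError("values is longer than the target length")
    return values + [pad_value] * (length - len(values))


# Two feature rows of unequal length (toy values, for illustration only).
rows = [[4, 7, 1], [2, 9]]
max_len = max(len(row) for row in rows)

# After padding, every row has the same length and can be stacked into one array.
padded = [pad_to_length(row, max_len) for row in rows]
print(padded)  # [[4, 7, 1], [2, 9, -100]]
```

A sentinel such as `-100` is convenient because downstream code can mask padded positions by comparing against it.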


:::{note}
In practice you would probably load the `train` split here. For demonstration purposes we use `val` instead, because of its smaller memory footprint, and load only the small subset of systems that contain `ATP` as a ligand. We also set `use_alternate_structures=False` so that alternate structures are not downloaded and loaded when building the docs.
:::

In [None]:
val_dataset = PlinderDataset(
    split="val",
    filters=[
        ("system_num_protein_chains", "==", 1),
        ("ligand_unique_ccd_code", "in", {"ATP"}),
    ],
    use_alternate_structures=False,
)
len(val_dataset)
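
Each entry in `filters` is a `(column, operator, value)` triple. The sketch below shows one way such triples can be evaluated against rows of a table, assuming that all filters must hold at once (logical AND). It is an illustration only: how PLINDER applies the filters internally is not shown here, and `OPS`, `matches`, and the toy rows are hypothetical.

```python
import operator
from typing import Any

# Map operator strings to predicates; "in" tests membership in a collection.
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "in": lambda value, allowed: value in allowed,
}


def matches(row: dict[str, Any], filters: list[tuple[str, str, Any]]) -> bool:
    """Return True if `row` satisfies every (column, operator, value) filter."""
    return all(OPS[op](row[column], value) for column, op, value in filters)


# Toy rows with made-up identifiers (not real PLINDER systems).
rows = [
    {"id": "sys_a", "system_num_protein_chains": 1, "ligand_unique_ccd_code": "ATP"},
    {"id": "sys_b", "system_num_protein_chains": 2, "ligand_unique_ccd_code": "ATP"},
    {"id": "sys_c", "system_num_protein_chains": 1, "ligand_unique_ccd_code": "ADP"},
]
filters = [
    ("system_num_protein_chains", "==", 1),
    ("ligand_unique_ccd_code", "in", {"ATP"}),
]
selected = [row["id"] for row in rows if matches(row, filters)]
print(selected)  # ['sys_a']
```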

In [None]:
val_data = val_dataset[1]

In [None]:
val_loader = get_torch_loader(val_dataset)
for data in val_loader:
    test_torch = data
    break

In [None]:
test_torch.keys()

In [None]:
test_torch["system_ids"]

In [None]:
for k, v in test_torch["features_and_coords"].items():
    print(k, v.shape)