Downloading the dataset

The default location for the dataset is ~/.local/share/plinder/<PLINDER_RELEASE>/<PLINDER_ITERATION>, where:

  • <PLINDER_RELEASE> is the date of the PDB sync used to generate the dataset (e.g. 2024-06)

  • <PLINDER_ITERATION> is the iteration of the source code that generated the dataset (e.g. v2)

To use a different location, set the PLINDER_MOUNT environment variable.
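
As a rough illustration of how the pieces of the path fit together, the sketch below computes the directory you would expect for a given release and iteration. It rests on an assumption not stated above: that PLINDER_MOUNT replaces the ~/.local/share prefix while the plinder/<PLINDER_RELEASE>/<PLINDER_ITERATION> suffix stays the same. The helper name expected_dataset_dir is hypothetical and not part of the plinder package.

```python
import os
from pathlib import Path


def expected_dataset_dir(release, iteration, env=None):
    """Guess the local dataset directory for a release/iteration pair.

    ASSUMPTION: PLINDER_MOUNT replaces the default ~/.local/share prefix
    and the plinder/<release>/<iteration> suffix is kept unchanged.
    This helper is illustrative only and is not part of plinder.
    """
    env = os.environ if env is None else env
    # Fall back to the documented default prefix when the variable is unset or empty.
    mount = env.get("PLINDER_MOUNT") or str(Path.home() / ".local" / "share")
    return Path(mount) / "plinder" / release / iteration


# Default location (no PLINDER_MOUNT in the supplied environment).
print(expected_dataset_dir("2024-06", "v2", env={}))

# Custom location, using a made-up mount point for illustration.
print(expected_dataset_dir("2024-06", "v2", env={"PLINDER_MOUNT": "/data"}))
```

To check the real value for your configuration, inspect the directory the package itself reports rather than relying on this guess.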

Use plinder_download to download the complete dataset

plinder_download --help
usage:
    Download the full plinder dataset for the current configuration.
    Note that even though this is wrapped in a progress bar, the estimated
    completion time can vary wildly as it iterates over larger files vs.
    smaller ones.
    

optional arguments:
  -h, --help            show this help message and exit
  --release RELEASE     plinder release
  --iteration ITERATION
                        plinder iteration
  -y, --yes             skip confirmation

Note

This takes around an hour and downloads around 1 TB of data. Once it has finished, you can set the PLINDER_OFFLINE environment variable to true to avoid downloading the data again.
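
If you prefer to trigger the download from a script, one possible approach is to assemble the command from the flags listed in the help output and run it with the standard library. This is a minimal sketch: the helper names build_download_command and run_download are hypothetical, and the release and iteration values are the examples used on this page.

```python
import shutil
import subprocess


def build_download_command(release, iteration, skip_confirmation=False):
    """Assemble a plinder_download call using only the flags shown in --help."""
    cmd = ["plinder_download", "--release", release, "--iteration", iteration]
    if skip_confirmation:
        # --yes skips the interactive confirmation prompt.
        cmd.append("--yes")
    return cmd


def run_download(release, iteration, skip_confirmation=False):
    """Run the download, failing early if the CLI is not installed."""
    cmd = build_download_command(release, iteration, skip_confirmation)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("plinder_download was not found on PATH")
    # check=True raises CalledProcessError on a non-zero exit status.
    return subprocess.run(cmd, check=True).returncode


# Only print the command here; call run_download(...) to actually start it.
print(" ".join(build_download_command("2024-06", "v2", skip_confirmation=True)))
# prints: plinder_download --release 2024-06 --iteration v2 --yes
```

Building the command separately from running it makes it easy to log or review what will be executed before committing to a download of this size.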

Use the plinder Python package to lazily access the dataset

Alternatively, if the PLINDER_OFFLINE environment variable is unset (the default), the dataset is downloaded lazily, on the fly, as you access the data. This is preferred for exploration and prototyping, because you don’t need to download the entire dataset at once and can work with just the assets you need for your use case.

import plinder.core.utils.config

cfg = plinder.core.get_config()
print(f"local directory: {cfg.data.plinder_dir}")
print(f"remote data directory: {cfg.data.plinder_remote}")
local directory: /home/runner/.local/share/plinder/2024-06/v2
remote data directory: gs://plinder/2024-06/v2
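
Before starting a long session it can help to confirm which of the two access modes your environment selects. The sketch below inspects PLINDER_OFFLINE directly. It assumes that only the value true (compared case-insensitively here) switches to offline mode and that anything else, including an unset variable, means lazy downloading; how the package itself parses the variable may differ. The helper name access_mode is hypothetical.

```python
import os


def access_mode(env=None):
    """Report 'offline' or 'lazy' based on PLINDER_OFFLINE.

    ASSUMPTION: only the value "true" enables offline mode; the actual
    parsing inside plinder may accept other values.
    """
    env = os.environ if env is None else env
    value = env.get("PLINDER_OFFLINE", "")
    return "offline" if value.strip().lower() == "true" else "lazy"


# Report the mode implied by the current process environment.
print(f"plinder access mode: {access_mode()}")
```

Passing a plain dictionary instead of the process environment, for example access_mode({"PLINDER_OFFLINE": "true"}), makes the check easy to exercise without changing your shell settings.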