Downloading the dataset#
The default location for the dataset is ~/.local/share/plinder/<PLINDER_RELEASE>/<PLINDER_ITERATION>
where:
<PLINDER_RELEASE> is the date of the PDB sync used to generate the dataset (e.g. 2024-06)
<PLINDER_ITERATION> is the iteration of the source code that generated the dataset (e.g. v2)
If you want to use a different location, set the PLINDER_MOUNT environment variable.
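As a rough illustration of how the location is composed, the sketch below rebuilds the path from its parts. The helper name `resolve_plinder_dir` is hypothetical and not part of the plinder package, and the example mount path is made up. It also assumes that PLINDER_MOUNT replaces the `~/.local/share` prefix, with the `plinder/<PLINDER_RELEASE>/<PLINDER_ITERATION>` suffix still appended. Check the value reported by the plinder configuration before relying on this layout.

```python
import os
from pathlib import Path


def resolve_plinder_dir(release: str = "2024-06", iteration: str = "v2") -> Path:
    """Hypothetical helper: guess where the dataset would live locally."""
    # Assumption: PLINDER_MOUNT stands in for the ~/.local/share prefix only.
    mount = os.environ.get("PLINDER_MOUNT") or str(Path.home() / ".local" / "share")
    return Path(mount) / "plinder" / release / iteration


# Default location when PLINDER_MOUNT is not set.
print(resolve_plinder_dir())

# Hypothetical custom mount point; set it before plinder reads its configuration.
os.environ["PLINDER_MOUNT"] = "/data/plinder_mount"
print(resolve_plinder_dir())
```

Setting the variable in the shell before starting Python has the same effect and avoids any question of import order.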
Use plinder_download to download the complete dataset#
plinder_download --help
usage:
Download the full plinder dataset for the current configuration.
Note that even though this is wrapped in a progress bar, the estimated
completion time can vary wildly as it iterates over larger files vs.
smaller ones.
optional arguments:
-h, --help show this help message and exit
--release RELEASE plinder release
--iteration ITERATION
plinder iteration
-y, --yes skip confirmation
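For scripted setups, the command can also be assembled and launched from Python. The sketch below uses only the flags shown in the help text above. The helper names `build_download_command` and `run_download` are hypothetical, and the release and iteration values are the examples used on this page. The download itself is not started unless `run_download` is called explicitly.

```python
import shutil
import subprocess


def build_download_command(release: str, iteration: str, skip_confirmation: bool = False) -> list[str]:
    """Assemble a plinder_download argument list from the documented flags."""
    command = ["plinder_download", "--release", release, "--iteration", iteration]
    if skip_confirmation:
        # -y / --yes skips the interactive confirmation prompt.
        command.append("-y")
    return command


def run_download(command: list[str]) -> None:
    """Run the command, failing early if the CLI is not on PATH."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} not found; is plinder installed?")
    subprocess.run(command, check=True)


command = build_download_command("2024-06", "v2", skip_confirmation=True)
print(" ".join(command))
# run_download(command)  # uncomment to start the full download
```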
Note
This will take around an hour to complete and downloads around 1TB of data. Once it has finished, you can set the PLINDER_OFFLINE environment variable to true to avoid downloading the data again.
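Given the size of the download, it can be worth checking free disk space first. The following is a minimal sketch using only the Python standard library. The helper name `has_room_for_dataset` is hypothetical, and the 1 TB threshold is simply the approximate figure quoted in the note, not an exact requirement.

```python
import shutil
from pathlib import Path

REQUIRED_BYTES = 10**12  # roughly 1 TB, the approximate size quoted in the note


def has_room_for_dataset(target: Path, required_bytes: int = REQUIRED_BYTES) -> bool:
    """Return True if the filesystem holding `target` has enough free space."""
    probe = target.expanduser()
    # The target directory may not exist yet, so walk up to the nearest existing ancestor.
    while not probe.exists():
        probe = probe.parent
    return shutil.disk_usage(probe).free >= required_bytes


target = Path("~/.local/share/plinder/2024-06/v2")
if has_room_for_dataset(target):
    print("Enough free space for the full download.")
else:
    print("Not enough free space; consider another location or lazy access.")
```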
Use the plinder Python package to lazily access the dataset#
Alternatively, if the PLINDER_OFFLINE environment variable is unset (the default), the dataset is downloaded lazily, on the fly, as you access the data. This is preferred for exploration and prototyping, as you don't need to download the entire dataset at once and can work with just the assets you need for your use case.
import plinder.core.utils.config
cfg = plinder.core.get_config()
print(f"local directory: {cfg.data.plinder_dir}")
print(f"remote data directory: {cfg.data.plinder_remote}")
local directory: /home/runner/.local/share/plinder/2024-06/v2
remote data directory: gs://plinder/2024-06/v2