NumpyFileDataset
- class axtreme.data.numpy_file_dataset.NumpyFileDataset(root_dir: str | Path)
Bases: Dataset[Tensor]
Helper to work with directories of .npy data.
Note
It is highly recommended to use an in-memory dataset if possible; reading from disk is typically the bottleneck.
- Using a sequential sampler will be significantly faster, because this class performs rudimentary caching.
Random sampling requires a disk read for EVERY datapoint. Instead, consider randomising the row order when the files are saved, as in the sketch below.
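A minimal sketch of that suggestion, assuming you control how the files are written: shuffle the rows once at save time, so a fast sequential read still visits datapoints in random order (the directory and file names here are illustrative)::

    from pathlib import Path

    import numpy as np

    # Illustrative data: 1000 rows of 3 features, shuffled ONCE at save time.
    rng = np.random.default_rng(0)
    data = rng.normal(size=(1000, 3))
    data = data[rng.permutation(len(data))]

    out_dir = Path("data_dir")
    out_dir.mkdir(exist_ok=True)
    # np.split produces equally sized chunks, matching the assumption below
    # that every file holds the same number of datapoints.
    for i, chunk in enumerate(np.split(data, 10)):
        np.save(out_dir / f"part_{i}.npy", chunk)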
- Assumes:
Each row is a data point.
Each file has the same number of datapoints within it (see the usage sketch below).
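A minimal usage sketch under these assumptions. The class and its root_dir parameter come from the signature above; the directory name and DataLoader settings are illustrative::

    from torch.utils.data import DataLoader

    from axtreme.data.numpy_file_dataset import NumpyFileDataset

    dataset = NumpyFileDataset("data_dir")  # directory of .npy files
    # shuffle=False gives sequential access, which benefits from the caching
    # described in the note above.
    loader = DataLoader(dataset, batch_size=64, shuffle=False)
    for batch in loader:
        ...  # each batch is a Tensor built from rows of the .npy files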
- Dev:
Based on example here: https://pytorch.org/tutorials/beginner/data_loading_tutorial.html
Todo
- This is slow compared to reading from memory: a 100k dataset takes 3 ms from memory vs 20 s with this dataloader.
This is because EVERY SINGLE datapoint requires a new read from disk.
- TODO: Consider different file types that might be more appropriate (row-specific access).
HDF5?
- TODO: Consider integration with an importance weight.
Look at the example with images; a dataset can return multiple things.
npz allows you to combine multiple arrays; there could be some logic such as: if the other array is None, there is no importance weight.
- Answered:
- Is Sampler a more appropriate way of framing this?
No. The Sampler class just takes an existing dataset and shuffles it, like shuffle in DataLoader.
- __init__(root_dir: str | Path) → None
Initialise the Dataset.
Note
Data should be loaded lazily (in __getitem__, not here); a generic sketch of this pattern follows the parameter list.
- Parameters:
root_dir (str | Path) – Directory containing the .npy files.
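A generic sketch of the lazy-loading pattern the note above describes, not the actual NumpyFileDataset implementation: __init__ only records file paths, and the disk read happens in __getitem__::

    from pathlib import Path

    import numpy as np
    import torch
    from torch.utils.data import Dataset


    class LazyNpyDataset(Dataset):
        """Illustrative only: loads rows lazily, one disk read per __getitem__."""

        def __init__(self, root_dir: str | Path) -> None:
            self.files = sorted(Path(root_dir).glob("*.npy"))
            # Peek at one file to learn rows-per-file; no data is kept in memory.
            self.rows_per_file = np.load(self.files[0], mmap_mode="r").shape[0]

        def __len__(self) -> int:
            return len(self.files) * self.rows_per_file

        def __getitem__(self, idx: int) -> torch.Tensor:
            file_idx, row_idx = divmod(idx, self.rows_per_file)
            array = np.load(self.files[file_idx])  # lazy: disk read happens here
            return torch.as_tensor(array[row_idx])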
Methods
__init__(root_dir) – Initialise the Dataset.