NumpyFileDataset

class axtreme.data.numpy_file_dataset.NumpyFileDataset(root_dir: str | Path)

Bases: Dataset[Tensor]

Helper to work with directories of .npy data.

Note

  • Highly recommended to use an in-memory dataset if possible. Reading from disk is typically a bottleneck.

  • Using this with a sequential sampler will be significantly faster because the dataset performs rudimentary caching (see the sketch below).
    • Random sampling requires a disk read for EVERY datapoint. If random ordering is needed, consider randomising the saved files instead.
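
A minimal sketch of the two access patterns, assuming an existing directory of .npy files ("npy_dir" is a placeholder path):

    from torch.utils.data import DataLoader

    from axtreme.data.numpy_file_dataset import NumpyFileDataset

    dataset = NumpyFileDataset("npy_dir")
    # Sequential order benefits from the rudimentary per-file caching.
    fast_loader = DataLoader(dataset, batch_size=64, shuffle=False)
    # Shuffled order forces a fresh disk read for every datapoint.
    slow_loader = DataLoader(dataset, batch_size=64, shuffle=True)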

Assumes:
  • Each row is a data point

  • Each file contains the same number of datapoints.
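
For illustration, a directory satisfying these assumptions could be created as follows (the file names and array shape are arbitrary):

    import numpy as np
    from pathlib import Path

    root = Path("npy_dir")
    root.mkdir(exist_ok=True)
    rng = np.random.default_rng(0)
    # Three files, each holding the same number of rows (one datapoint per row).
    for i in range(3):
        np.save(root / f"chunk_{i}.npy", rng.normal(size=(100, 5)))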

Dev:

Todo

  • This is slow compared to reading from memory: a 100k dataset takes 3 ms in memory vs 20 s with this dataloader.
    • This is because EVERY SINGLE datapoint requires a new read from disk.

    • TODO: Consider different file types that might be more appropriate (row-specific access).
      • HDF5?

    • TODO: consider integration with an importance weight (see the sketch after this list).
      • Look at the example with images; __getitem__ can return multiple items.

      • npz allows you to combine multiple arrays; logic could be added so that if the second array is None, no importance weighting is applied.
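
A hypothetical sketch of the npz idea (nothing here is implemented; the "data" and "weights" array names are invented for illustration):

    import numpy as np

    with np.load("chunk_0.npz") as arrays:
        data = arrays["data"]
        # No "weights" array saved -> fall back to no importance weighting.
        weights = arrays["weights"] if "weights" in arrays else None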

Answered:
  • Is a Sampler a more appropriate way of framing this?
    • No. The Sampler class just takes an existing dataset and shuffles it, like shuffle in DataLoader.

__init__(root_dir: str | Path) → None

Initialise the Dataset.

Note

Data should be loaded lazily (in __getitem__, not here)

Parameters:

root_dir (str | Path) – Directory with .npy files
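
A minimal construction sketch, assuming the standard Dataset protocol ("npy_dir" is a placeholder path; data is only read from disk on indexing):

    dataset = NumpyFileDataset("npy_dir")  # cheap: no data loaded here
    first = dataset[0]  # triggers the disk read; returns a torch.Tensor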

Methods

__init__(root_dir)

Initialise the Dataset.