Submitted by somebodyenjoy t3_z8otan in deeplearning

I was thinking I'd load half the data first, train on it, then load the other half and train on that. This may be slightly slower but should work in theory. I'd preprocess the data and store it in something like X1.npy and X2.npy, X1 and X2 being the first and second half of the preprocessed data. This also makes loading much quicker, though obviously slower than if we had more RAM. We can always get more RAM in the cloud, but what if we have 1000GB of images to train on? Seems like my initial intuition is correct, but what is the standard operating procedure here?


I think people normally let Keras do all the work by simply using ImageDataGenerator and feeding the path, but what if I want some control over preprocessing?
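The chunked approach described in the post can be sketched as below. This is a hypothetical illustration using stand-in data and a placeholder `train_on_chunk` function (a real run would call `model.fit` there); only one chunk is ever in RAM at a time.

```python
# Hypothetical sketch of the chunked idea: preprocess once, save each half
# to its own .npy file, then train on one chunk at a time.
import os
import tempfile

import numpy as np

tmp = tempfile.mkdtemp()

# Preprocess and save the data in two halves (stand-in data here).
data = np.arange(20, dtype=np.float32).reshape(10, 2)
np.save(os.path.join(tmp, "X1.npy"), data[:5])
np.save(os.path.join(tmp, "X2.npy"), data[5:])

def train_on_chunk(X):
    # Placeholder for model.fit(X, ...); here we just count samples.
    return len(X)

# Train on each chunk in turn; np.load brings only one chunk into RAM.
seen = 0
for name in ("X1.npy", "X2.npy"):
    X = np.load(os.path.join(tmp, name))
    seen += train_on_chunk(X)

print(seen)  # 10 samples seen across both chunks
```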

15

Comments


Alone_Bee_6221 t1_iycimeo wrote

I would probably suggest splitting the data into chunks, or you could try to implement your own dataset class to load images lazily.
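A minimal sketch of the lazy-loading dataset class idea, following the `__len__`/`__getitem__` protocol that `tf.keras.utils.Sequence` (and PyTorch's `Dataset`) expect. The class name and `load_fn` parameter are illustrative; only the files for the requested batch are read from disk.

```python
# Sketch of a lazy dataset: nothing is loaded until a batch is requested.
import math

import numpy as np

class LazyImageDataset:
    def __init__(self, paths, batch_size, load_fn=np.load):
        self.paths = paths          # one file per sample
        self.batch_size = batch_size
        self.load_fn = load_fn      # swap in cv2.imread / PIL for real images

    def __len__(self):
        # Number of batches, counting a possibly smaller final batch.
        return math.ceil(len(self.paths) / self.batch_size)

    def __getitem__(self, idx):
        batch = self.paths[idx * self.batch_size:(idx + 1) * self.batch_size]
        # Only this batch's files are read from disk here.
        return np.stack([self.load_fn(p) for p in batch])
```

Subclass `tf.keras.utils.Sequence` with these same methods and you can pass the object straight to `model.fit`.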

10

incrediblediy t1_iycjg6d wrote

You can use your own preprocessing on top of Keras's preprocessing and data loader, or you can write fully custom code for the whole pipeline.
According to https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator ,

Deprecated: tf.keras.preprocessing.image.ImageDataGenerator is not recommended for new code. Prefer loading images with tf.keras.utils.image_dataset_from_directory and transforming the output tf.data.Dataset with preprocessing layers.

You can do mini-batch training depending on available VRAM, even with a batch size of 1. I assume by RAM you actually mean VRAM, as we hardly do deep learning on CPU for image datasets.

For example, you can use a data_augmentation pipeline step to keep control over preprocessing, like this (I used this code with an older TF version, 2.4.0 or maybe 2.9.0.dev, so the function locations might need changing for newer versions, as noted above):

train_ds = tensorflow.keras.preprocessing.image_dataset_from_directory(
    image_directory,
    labels='inferred', 
    label_mode='int',
    class_names=classify_names,     
    validation_split=0.3,
    subset="training",
    shuffle=shuffle_value,
    seed=seed_value,
    image_size=image_size,
    batch_size=batch_size,
)

data_augmentation = tensorflow.keras.Sequential(
    [
        tensorflow.keras.layers.experimental.preprocessing.RandomFlip("horizontal"),
        tensorflow.keras.layers.experimental.preprocessing.RandomRotation(0.1), 
    ]
)

augmented_train_ds = train_ds.map( lambda x, y: (data_augmentation(x, training=True), y))
2

suflaj t1_iyclnwf wrote

Images are loaded from disk, perhaps with some caching.

The most efficient simple solution would be to have workers that fill up a buffer that acts like a queue for data.
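One way to realize this worker/buffer idea with only the standard library is shown below: worker threads read and preprocess samples and push them into a bounded queue that the training loop pops from. The `load_fn` here is a stand-in for "read + preprocess one sample from disk", and the bounded queue size is what caps RAM use.

```python
# Sketch: worker threads fill a bounded buffer that the training loop drains.
import queue
import threading

def start_loader(sample_ids, num_workers=2, buffer_size=8,
                 load_fn=lambda i: i * 2):
    work = queue.Queue()
    for s in sample_ids:
        work.put(s)
    out = queue.Queue(maxsize=buffer_size)  # bounded: workers block when full

    def worker():
        while True:
            try:
                s = work.get_nowait()
            except queue.Empty:
                return  # no more samples to load
            out.put(load_fn(s))  # "load + preprocess" one sample

    threads = [threading.Thread(target=worker, daemon=True)
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    return out, threads

out, threads = start_loader(range(10))
batch = [out.get() for _ in range(10)]  # the training loop consumes the buffer
for t in threads:
    t.join()
print(sorted(batch))  # the ten doubled sample ids, in order
```

In practice `tf.data.Dataset.prefetch` (or PyTorch's `DataLoader` with `num_workers`) gives you this pattern for free.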

1

IshanDandekar t1_iycnrjg wrote

How big is your RAM? Maybe you can try cloud resources to get a better machine, leverage GPUs too if it is an image dataset

0

Ttttrrrroooowwww t1_iyctkhw wrote

Normally your dataloader fetches single samples from your dataset, such as reading images one by one. In that case RAM is never a problem.

If that is not an option for you (why, I wouldn't know), then numpy memmaps might be for you. Basically an array that's read from disk, not from RAM. I use them to handle arrays with billions of values.
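A short sketch of the memmap approach: the array lives on disk and only the slices you index are actually read into RAM. The file path and shapes here are illustrative.

```python
# np.memmap: an on-disk array; slices are read on demand.
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "big.dat")

# Write a (pretend-)huge array to disk once.
arr = np.memmap(path, dtype=np.float32, mode="w+", shape=(1000, 4))
arr[:] = np.arange(4000, dtype=np.float32).reshape(1000, 4)
arr.flush()

# Later, e.g. inside a dataloader, open it read-only; nothing is loaded yet.
big = np.memmap(path, dtype=np.float32, mode="r", shape=(1000, 4))
batch = np.asarray(big[512:516])  # only this slice is actually read
print(batch.shape)  # (4, 4)
```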

2

robbsc t1_iycws54 wrote

For tensorflow, you have to learn to use tensorflow datasets. https://www.tensorflow.org/datasets

You could also save your dataset as an hdf5 file using h5py, then use the tensorflow_io from_hdf5() to load your data. https://www.tensorflow.org/io

Hdf5 is the "traditional" (for lack of a better word) way of loading numpy data that is too big to fit in memory. The downside is that it is slow at random indexing, so people don't use it as much anymore for training networks.

Pytorch datasets are a little easier in my opinion.

8

somebodyenjoy OP t1_iycyur2 wrote

I do the same using numpy files, but they only let me load the whole dataset, which is too big in the first place. TensorFlow lets us load in batches, huh. I'll look into this.

2

Rishh3112 t1_iyd0opv wrote

I would suggest splitting the dataset, saving the weights every time you finish training on one split, and training the next split starting from those saved weights.

0

HiPattern t1_iyd91t0 wrote

hdf5 files are quite nice for that. You can write your X / y datasets into the file in chunks. When you access a batch, it will only read the part of the hdf5 file where that batch is.


You can also use multiple numpy files, e.g. one for each batch, and then handle the file management in the sequence generator.
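The one-file-per-batch variant can be sketched as below, with a generator handling the file management. The file naming scheme is illustrative; in Keras you would wrap the same logic in a `tf.keras.utils.Sequence` instead of a bare generator.

```python
# Sketch: one .npy file per batch, loaded only when the generator yields it.
import os
import tempfile

import numpy as np

tmp = tempfile.mkdtemp()

# Preprocessing step: write each batch to its own file.
for i in range(3):
    np.save(os.path.join(tmp, f"batch_{i}.npy"),
            np.full((4, 2), i, dtype=np.float32))

def batch_generator(directory, n_batches):
    # Yields one batch at a time; only that file is in RAM.
    for i in range(n_batches):
        yield np.load(os.path.join(directory, f"batch_{i}.npy"))

shapes = [b.shape for b in batch_generator(tmp, 3)]
print(shapes)  # [(4, 2), (4, 2), (4, 2)]
```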

3