Submitted by IdeaEnough443 t3_zg6s6d in MachineLearning
linearmodality t1_izgkb2p wrote
How big is your dataset? The right answer will depend on the size.
IdeaEnough443 OP t1_izgr37h wrote
Greater than 700GB, potentially 10TB scale; it won't fit in a single machine's memory.
ab3rratic t1_izgrfxy wrote
Mini-batch gradient descent (the usual method) does not require the entire dataset to fit into memory -- only one batch at a time.
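For illustration, a minimal sketch (the file names, linear model, and hyperparameters are all hypothetical) of mini-batch gradient descent where the data lives on disk and only the current batch is pulled into memory:

```python
import numpy as np

# Hypothetical files: features and labels stored on disk, memory-mapped so
# nothing is loaded until a batch is actually sliced out.
X = np.load("features.npy", mmap_mode="r")   # shape: (n_samples, n_features)
y = np.load("labels.npy", mmap_mode="r")     # shape: (n_samples,)

n, d = X.shape
w = np.zeros(d)          # weights of a simple linear model
b = 0.0
lr, batch_size = 0.01, 256

for epoch in range(3):
    order = np.random.permutation(n)
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        xb = np.asarray(X[idx], dtype=np.float64)   # only this batch is in RAM
        yb = np.asarray(y[idx], dtype=np.float64)

        # Mean-squared-error gradient step on this mini-batch.
        err = xb @ w + b - yb
        w -= lr * (xb.T @ err) / len(idx)
        b -= lr * err.mean()
```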
IdeaEnough443 OP t1_izgwvp5 wrote
But wouldn't the training process be slower than with parallelization? Is mini-batch gradient descent the industry standard for handling large datasets in NN training?
ab3rratic t1_izh1s2j wrote
See "deep learning".
PassionatePossum t1_izi24ow wrote
You can still parallelize with mini-batch gradient descent. If you use, for example, the MirroredStrategy in TensorFlow, the batch is split between multiple GPUs. The only downside is that this approach doesn't scale well if you want to train on more than one machine, since the model needs to be synced after each iteration.
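Roughly what that looks like (the model and the random toy data here are placeholders, not a recommendation):

```python
import tensorflow as tf

# All GPUs visible on this machine are used; the global batch is split between
# them and the gradients are reduced after every step.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Placeholder model; build and compile inside the scope so that the
    # variables are mirrored across the GPUs.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# The global batch is divided among the replicas, so scale it with the GPU count.
global_batch = 256 * strategy.num_replicas_in_sync

# Toy in-memory data just to make the example self-contained; in practice this
# would be a tf.data pipeline streaming from disk.
x = tf.random.normal([4096, 32])
y = tf.random.uniform([4096], maxval=10, dtype=tf.int64)
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(global_batch)

model.fit(dataset, epochs=2)
```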
But you should think long and hard about whether training on multiple machines is really necessary, since that brings a whole new set of problems. 700GB is not that large; we work with datasets of that size all the time. I don’t know what kind of model you are trying to train, but we have a GPU server with 8 GPUs and I’ve never felt the need to go beyond the normal MirroredStrategy for parallelization. And should you run into the problem that you cannot fit the data onto the machine where you are training, load it over the network.
You just need to make sure that your input pipeline supports that efficiently. Shard your dataset so you can have many concurrent I/O operations.
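For example, a sketch of a sharded tf.data input pipeline (the shard pattern is hypothetical) that reads many files concurrently:

```python
import tensorflow as tf

# Hypothetical shard pattern; this could just as well point at an NFS mount
# or an object store.
files = tf.data.Dataset.list_files("/mnt/data/train-*.tfrecord", shuffle=True)

dataset = (
    files
    # Read several shards concurrently instead of one file at a time.
    .interleave(
        tf.data.TFRecordDataset,
        cycle_length=16,                 # shards read in parallel
        num_parallel_calls=tf.data.AUTOTUNE,
        deterministic=False,             # trade ordering for throughput
    )
    .shuffle(10_000)                     # shuffle buffer, not the whole dataset
    .batch(256)
    .prefetch(tf.data.AUTOTUNE)          # overlap I/O with training
)
```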
And in case scaling across machines really is important to you, may I suggest you look into Horovod?
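For reference, a rough sketch of what a Horovod-with-Keras training script tends to look like (the model and toy data are placeholders); it would be launched with something like `horovodrun -np 8 python train.py`:

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each worker process to a single GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Placeholder model; anything Keras-compatible works.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])

# Scale the learning rate with the number of workers and wrap the optimizer so
# gradients are averaged across workers with allreduce.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(1e-3 * hvd.size()))
model.compile(
    optimizer=opt,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# Toy data to keep the sketch self-contained; each worker trains on its own shard.
x = tf.random.normal([4096, 32])
y = tf.random.uniform([4096], maxval=10, dtype=tf.int64)
dataset = (
    tf.data.Dataset.from_tensor_slices((x, y))
    .shard(hvd.size(), hvd.rank())
    .batch(256)
)

callbacks = [
    # Start all workers from identical initial weights.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]
model.fit(dataset, epochs=2, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```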
SwordOfVarjo t1_izgx533 wrote
It's the industry standard for NN training, period. Your dataset isn't that big; just train on one machine.
IdeaEnough443 OP t1_izgyjq8 wrote
Our dataset takes close to a day to finish training, and with 5x the data that won't work for our application. That's why we are trying to see whether distributed training would help lower training time.