Submitted by v2thegreat t3_12285x7 in MachineLearning
theogognf t1_jdqx32u wrote
You can use whatever you want for the actual data storage. As other comments have mentioned and at least in the PyTorch space, it really just comes down to you defining a dataset object that samples or pulls data from your data storage whenever it's indexed so you aren't using all your RAM on just storing your dataset. That data storage can be a relational database or non-relational database, just files on your local system, or files on a database file system for a cloud provider; it doesn't really matter so long as you can quickly load samples into memory. With billions of images, you may want to look into using a cloud provider for at least storing your dataset (depending on their size)
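A minimal sketch of the dataset-object idea described above, assuming images sit as JPEG files in a local directory (the names `LazyImageDataset`, `image_dir`, and `transform` are illustrative, not from this thread):

```python
# Sketch of a lazily loading dataset: each image is read from storage only
# when it is indexed, so the full dataset never has to fit in RAM.
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


class LazyImageDataset(Dataset):
    def __init__(self, image_dir, transform=None):
        # Only the file paths are kept in memory, not the image data.
        self.paths = sorted(Path(image_dir).glob("*.jpg"))
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # A single image is loaded into RAM per access.
        img = Image.open(self.paths[idx]).convert("RGB")
        if self.transform:
            img = self.transform(img)
        return img
```

Wrapped in a `torch.utils.data.DataLoader`, this pulls samples on demand regardless of whether the paths point at local disk or a mounted cloud bucket.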
You can certainly preprocess your data and store it in a processed format if you think preprocessing is a bottleneck in your data loading. It sounds like you should focus on figuring out how to store your data first, though.
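If preprocessing does turn out to be the bottleneck, a one-time pass can write processed copies to disk so later epochs skip that work. A sketch, assuming raw JPEGs and a fixed target size (`preprocess`, `raw_dir`, `out_dir`, and `size` are illustrative names):

```python
# Sketch: preprocess once, store the result, and train from the processed copies.
from pathlib import Path

from PIL import Image


def preprocess(raw_dir, out_dir, size=(224, 224)):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in Path(raw_dir).glob("*.jpg"):
        # Decode, normalize to RGB, and resize up front...
        img = Image.open(path).convert("RGB").resize(size)
        # ...then persist the processed copy so this work is never repeated.
        img.save(out / path.name)
```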
Regarding ML frameworks, that just comes down to preference. I usually see PyTorch for experimentation with custom models, while I see TensorFlow for mature models that are being deployed/served.