theogognf t1_jdqx32u wrote
Reply to [D] Title: Best tools and frameworks for working with million-billion image datasets? by v2thegreat
You can use whatever you want for the actual data storage. As other comments have mentioned, at least in the PyTorch space, it really just comes down to defining a dataset object that samples or pulls data from your data storage whenever it's indexed, so you aren't using all your RAM just to hold your dataset (a sketch is below). That data storage can be a relational or non-relational database, files on your local filesystem, or files in a cloud provider's object storage; it doesn't really matter so long as you can quickly load samples into memory. With billions of images, you may want to look into using a cloud provider for at least storing your dataset (depending on their size).
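Here's a minimal sketch of that lazy-loading pattern, assuming images sit as JPEGs under a local directory; the paths, image size, and transform are hypothetical stand-ins for whatever your storage actually looks like:

```python
# Minimal sketch: a dataset that only indexes paths up front and reads one
# image from storage per __getitem__ call, so RAM usage stays proportional
# to the batch size rather than the dataset size.
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms


class LazyImageDataset(Dataset):
    def __init__(self, root: str):
        self.paths = sorted(Path(root).glob("*.jpg"))  # index only, no pixel data
        self.transform = transforms.Compose(
            [transforms.Resize((224, 224)), transforms.ToTensor()]
        )

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, idx: int):
        # Only now does the image actually get read into memory.
        with Image.open(self.paths[idx]) as img:
            return self.transform(img.convert("RGB"))


# Parallel workers overlap storage reads with model compute.
loader = DataLoader(LazyImageDataset("/data/images"), batch_size=64, num_workers=8)
```

The same `__getitem__` could just as easily issue a database query or an object-storage GET instead of opening a local file; the dataset object doesn't care where the bytes come from.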
You can certainly preprocess your data and store it in a processed format if you want to and if you think decoding/transforming is a bottleneck in your data loading; a one-time pass might look like the sketch after this paragraph. It sounds like you should focus on figuring out how to store your data first, though.
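As one example of that one-time pass, here's a minimal sketch that decodes, resizes, and saves tensors up front so the training-time dataset can just load them; the directories and image size are hypothetical:

```python
# Minimal sketch: preprocess once, so data loading later skips decode/resize.
from pathlib import Path

import torch
from PIL import Image
from torchvision import transforms

transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

src, dst = Path("/data/raw"), Path("/data/processed")
dst.mkdir(parents=True, exist_ok=True)
for path in src.glob("*.jpg"):
    with Image.open(path) as img:
        tensor = transform(img.convert("RGB"))
    # Saving decoded tensors trades disk space for faster loading later.
    torch.save(tensor, dst / f"{path.stem}.pt")
```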
Regarding ML frameworks, that just comes down to preference. I usually see PyTorch used for experimentation with custom models, while TensorFlow shows up more for mature models that are being deployed/served.
theogognf t1_jdqy28l wrote
Reply to [D] Keeping track of ML advancements by Anis_Mekacher
I stay up-to-date mainly by browsing https://paperswithcode.com/ in the morning and once a week at work. There have definitely been a good number of times that I've stumbled across some new method or repo to play around with for my main area of interest that ends up having some immediate return. I occasionally browse all topics there, but I usually only filter by my main interests. I can't imagine staying current without it or a similar site.