davidbun t1_jdr115n wrote
Reply to [D] Title: Best tools and frameworks for working with million-billion image datasets? by v2thegreat
Full disclosure: I'm one of the creators of the project, but this is exactly why we built Deep Lake, the Data Lake for Deep Learning. It addresses all your concerns. Specifically:
- Works with any framework (PyTorch, TensorFlow; you might also want to look into training models with MMDetection).
- Stores (and visualizes!) all your data, together with your metadata.
- Outperforms Zarr (we built on top of it in v1, but it constrained us so much that we had to rebuild everything from scratch), as well as various dataloaders, across a variety of use cases.
- Achieves full or near-full GPU utilization regardless of scale (battle-tested at LAION-400M scale), regardless of which cloud your images are stored on or where you train your model, e.g., streaming from EC2 to AWS SageMaker with full GPU utilization at half the cost (no GPU idle time, thanks to streaming). See the sketch below.
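For a rough idea of what the streaming-to-PyTorch path looks like (the dataset path, the `images`/`labels` tensor names, and the transform are illustrative; double-check the `.pytorch()` signature against the Deep Lake docs for your version):

```python
# Hedged sketch: streaming a Deep Lake dataset into a PyTorch dataloader.
import deeplake
from torchvision import transforms

# Stream samples directly from remote storage -- no full local copy required.
ds = deeplake.load("hub://activeloop/cifar10-train")  # illustrative dataset path

tform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])

# Wrap the dataset in a dataloader; samples are fetched and decoded on the fly,
# which is what keeps the GPU fed instead of idling on I/O.
train_loader = ds.pytorch(
    batch_size=64,
    num_workers=4,
    shuffle=True,
    transform={"images": tform, "labels": None},
)

for batch in train_loader:
    images, labels = batch["images"], batch["labels"]
    # ... training step goes here ...
    break
```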
davidbun t1_jdr1c41 wrote
Reply to comment by elnaqnely in [D] Title: Best tools and frameworks for working with million-billion image datasets? by v2thegreat
That's a good option. Or just use Deep Lake, which was built specifically for that purpose and lets you visualize, version-control, or query your image data (full disclosure: I'm one of the creators of the product/OSS project). A rough sketch of that workflow is below.
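For context, the visualize / version-control / query workflow looks roughly like this (the dataset path and the `labels` tensor are made up, and the TQL filter is only an example; treat it as a sketch rather than the exact API):

```python
# Hedged sketch: version control and querying on a Deep Lake dataset.
import deeplake

ds = deeplake.load("hub://my_org/my_image_dataset")  # hypothetical path

# Snapshot the current state of the dataset (git-like versioning).
commit_id = ds.commit("Cleaned and re-labeled a batch of images")

# Filter the dataset with Deep Lake's Tensor Query Language (TQL);
# 'labels' is a hypothetical tensor name here.
view = ds.query("select * where contains(labels, 'cat')")

# Render the dataset in a notebook / the Activeloop UI for visual inspection.
ds.visualize()
```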