Submitted by v2thegreat t3_12285x7 in MachineLearning
Hi everyone,
I'm working on a project that involves image datasets ranging from tens of thousands to millions of images. I'm looking for some advice and recommendations on the best tools and frameworks to use for this task. Here are some of the questions I have:
- What are the best tools for storing and accessing such large image datasets? I've used NetCDF and Zarr in the past, but most image-processing libraries like scikit-image or OpenCV don't support them. Do you guys just store all your images in a massive data lake?
- I'm familiar with TensorFlow, but I'm sick of its issues: a lot of functionality seems broken or abandoned (gradient checkpointing, for example), and there's little transparency into what it's doing under the hood. I know PyTorch exists, but it seems to have a steeper learning curve. Is there a Keras equivalent for PyTorch?
- Is there any way to accelerate image processing tasks using a GPU? I know GPUs are mainly used for training models, but I'm wondering if there's any benefit or possibility of using them for image processing as well. If so, how can I do that? (See the sketch after this list for the kind of thing I have in mind.)
- Is there any way to meaningfully store the image dataset as some form of a database with all of its features in one place? I'm interested in having a structured and searchable way to access the images and their metadata, such as labels, captions, annotations, etc.
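To make the GPU question concrete, here's the kind of thing I have in mind: a minimal sketch (assuming a CUDA-capable GPU and the CuPy package; the array size and filter are just placeholders) where a SciPy/scikit-image-style filter runs on the GPU instead of the CPU:

```python
import numpy as np
import cupy as cp
from cupyx.scipy import ndimage as cp_ndimage

# Pretend this is one image (or a stacked batch) loaded from disk as a NumPy array.
image = np.random.rand(2048, 2048).astype(np.float32)

gpu_image = cp.asarray(image)                              # copy host -> GPU
blurred = cp_ndimage.gaussian_filter(gpu_image, sigma=3)   # filter executes on the GPU
result = cp.asnumpy(blurred)                               # copy back to host if needed
```

Is something along these lines a reasonable approach, or is there a better way to wire GPU processing into a pipeline at this scale?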
I wanna mention that I've spent a LOT of time reading up on these things and haven't been able to find a suitable answer, so I'm posting here as a last resort.
davidbun t1_jdr115n wrote
Full disclosure: I'm one of the creators, but this is exactly why we built Deep Lake, the Data Lake for Deep Learning. It addresses all of your concerns. Specifically:
- Works with any framework (PyTorch, TensorFlow; you might also want to look into training models with MMDetection).
- Stores (and visualizes!) all your data, together with your metadata (there's a quick usage sketch after this list).
- Outperforms Zarr (we built on top of it in v1, but were constrained by it quite a bit, so we had to rebuild everything from scratch), as well as various dataloaders, across a variety of use cases.
- Achieves near-full or full GPU utilization regardless of scale (battle-tested on LAION-400M images). This holds regardless of which cloud your images live in and where you train your model, e.g., streaming from EC2 to AWS SageMaker and achieving full GPU utilization at half the cost (no GPU idle time, thanks to streaming).
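For a feel of the workflow, here's a rough sketch (the bucket path, tensor names, and parameters are illustrative; check the deeplake docs for the exact, current API): you create a dataset, append images along with labels/metadata, and stream it straight into a PyTorch dataloader:

```python
import deeplake

# Create a dataset in object storage (an S3 bucket here; local paths work too).
# The bucket path and tensor layout below are just for illustration.
ds = deeplake.empty("s3://my-bucket/animals-dataset")

with ds:
    ds.create_tensor("images", htype="image", sample_compression="jpeg")
    ds.create_tensor("labels", htype="class_label")
    # deeplake.read references the file lazily instead of decompressing it up front.
    ds.append({"images": deeplake.read("cats/cat_0001.jpg"), "labels": 0})

# Stream into PyTorch; chunks are fetched and decoded on the fly,
# which is what keeps the GPU from sitting idle during training.
train_loader = ds.pytorch(batch_size=32, shuffle=True, num_workers=4)

for batch in train_loader:
    images, labels = batch["images"], batch["labels"]
    # ... training step goes here ...
```

Because everything (images, labels, and any other metadata tensors) lives in one dataset, you can also query and visualize it in the same place instead of juggling separate files and annotation spreadsheets.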