Submitted by v2thegreat t3_12285x7 in MachineLearning
Hi everyone,
I'm working on a project that involves image datasets ranging from tens of thousands to millions of images. I'm looking for some advice and recommendations on the best tools and frameworks to use for this task. Here are some of the questions I have:
- What are the best tools for storing and accessing such large image datasets? I've used NetCDF and Zarr in the past, but most image-processing libraries like scikit-image or OpenCV don't support them. Do you guys just store all your images in a massive data lake?
- I'm familiar with TensorFlow, but I'm sick of its issues: a lot of functionality seems broken or abandoned (gradient checkpointing, for example), and there's little transparency into what it's doing under the hood. I know PyTorch exists, but it seems to have a steeper learning curve. Is there a Keras equivalent for PyTorch?
- Is there any way to accelerate image processing tasks using a GPU? I know GPUs are mainly used for training models, but I'm wondering if there's any benefit or possibility of using them for image processing as well. If so, how can I do that? (See the sketch after this list for the kind of thing I have in mind.)
- Is there any way to meaningfully store the image dataset as some form of a database with all of its features in one place? I'm interested in having a structured and searchable way to access the images and their metadata, such as labels, captions, annotations, etc.
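To make the GPU question concrete, here's the kind of thing I have in mind: a minimal sketch (assuming a CUDA-capable GPU and the CuPy package; the array size and filter are just placeholders) where a SciPy/scikit-image-style filter runs on the GPU instead of the CPU:

```python
import numpy as np
import cupy as cp
from cupyx.scipy import ndimage as cp_ndimage

# Pretend this is one image (or a stacked batch) loaded from disk as a NumPy array.
image = np.random.rand(2048, 2048).astype(np.float32)

gpu_image = cp.asarray(image)                              # copy host -> GPU
blurred = cp_ndimage.gaussian_filter(gpu_image, sigma=3)   # filter executes on the GPU
result = cp.asnumpy(blurred)                               # copy back to host if needed
```

Is something along these lines a reasonable approach, or is there a better way to wire GPU processing into a pipeline at this scale?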
I wanna mention that I've spent a LOT of time reading up on these things and haven't been able to find a suitable answer, so I'm posting here as a last resort.
davidbun t1_jdr115n wrote
Full disclosure: I'm one of the creators, but this is exactly why we built Deep Lake, the Data Lake for Deep Learning. It addresses all of your concerns. Specifically:
- Works with any framework (PyTorch, TensorFlow; you might also want to look into training models with MMDetection).
- Stores (and visualizes!) all your data, together with your metadata (there's a quick usage sketch after this list).
- Outperforms Zarr (we built on top of it in v1, but were constrained by it quite a bit, so we had to rebuild everything from scratch), as well as various dataloaders, across a variety of use cases.
- Achieves near-full or full GPU utilization regardless of scale (battle-tested on LAION-400M images). This holds regardless of which cloud your images live in and where you train your model, e.g., streaming from EC2 to AWS SageMaker and achieving full GPU utilization at half the cost (no GPU idle time, thanks to streaming).
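For a feel of the workflow, here's a rough sketch (the bucket path, tensor names, and parameters are illustrative; check the deeplake docs for the exact, current API): you create a dataset, append images along with labels/metadata, and stream it straight into a PyTorch dataloader:

```python
import deeplake

# Create a dataset in object storage (an S3 bucket here; local paths work too).
# The bucket path and tensor layout below are just for illustration.
ds = deeplake.empty("s3://my-bucket/animals-dataset")

with ds:
    ds.create_tensor("images", htype="image", sample_compression="jpeg")
    ds.create_tensor("labels", htype="class_label")
    # deeplake.read references the file lazily instead of decompressing it up front.
    ds.append({"images": deeplake.read("cats/cat_0001.jpg"), "labels": 0})

# Stream into PyTorch; chunks are fetched and decoded on the fly,
# which is what keeps the GPU from sitting idle during training.
train_loader = ds.pytorch(batch_size=32, shuffle=True, num_workers=4)

for batch in train_loader:
    images, labels = batch["images"], batch["labels"]
    # ... training step goes here ...
```

Because everything (images, labels, and any other metadata tensors) lives in one dataset, you can also query and visualize it in the same place instead of juggling separate files and annotation spreadsheets.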