Submitted by v2thegreat t3_12285x7 in MachineLearning
Hi everyone,
I'm working on a project that involves image datasets ranging from tens of thousands to millions of images. I'm looking for advice and recommendations on the best tools and frameworks for this task. Here are some of the questions I have:
- What are the best tools for storing and accessing such large image datasets? I've used NetCDF and Zarr in the past, but most image-processing libraries like scikit-image or OpenCV don't support them. Do you guys just store all your images in a massive data lake?
- I'm familiar with TensorFlow, but I'm sick of its issues: a lot of functionality seems broken or abandoned (gradient checkpointing, for example), and it isn't transparent about what it's doing under the hood. I know PyTorch exists, but it feels like a steeper learning curve. Is there a Keras equivalent for PyTorch?
- Is there any way to accelerate the image processing tasks using a GPU? I know GPUs are mainly used for training models, but I'm wondering if there is any benefit or possibility of using them for image processing as well. If so, how can I do that?
- Is there any way to meaningfully store the image dataset as some form of a database with all of its features in one place? I'm interested in having a structured and searchable way to access the images and their metadata, such as labels, captions, annotations, etc.
I wanna mention that I've spent a LOT of time reading up on these things and haven't been able to find a suitable answer, so I'm posting here as a last resort.
elnaqnely t1_jdpty1s wrote
> accelerate the image processing tasks using a GPU
You can find some working code to do simple manipulations of images (scaling, flipping, cropping) on a GPU. Search for "gpu image augmentation".
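If you end up on PyTorch, one concrete way to do this is torchvision's transforms, which accept tensors and run on whatever device the tensor lives on, so the same scaling/flipping/cropping pipeline executes on the GPU. Here's a minimal sketch, assuming torchvision and a CUDA device are available (the file name is just a placeholder); dedicated GPU-augmentation libraries like Kornia or NVIDIA DALI follow the same idea:

```python
# Minimal sketch: GPU image augmentation with torchvision (assumes PyTorch
# + torchvision are installed; "example.jpg" is a placeholder path).
import torch
import torchvision.transforms as T
from torchvision.io import read_image

device = "cuda" if torch.cuda.is_available() else "cpu"

# These transforms are nn.Modules, so they can be composed with nn.Sequential
# and they run on the same device as their input tensor.
augment = torch.nn.Sequential(
    T.Resize(256),
    T.RandomCrop(224),
    T.RandomHorizontalFlip(p=0.5),
)

img = read_image("example.jpg").to(device)  # uint8 tensor, shape (C, H, W)
out = augment(img)                          # augmentation runs on the GPU
```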
> image dataset as some form of a database
With millions of images, the metadata alone may be difficult to navigate. I recommend storing the images/metadata on a good SSD (plus a backup), with the metadata in Parquet format, partitioned by categories that are meaningful to you. That will allow the metadata to be efficiently queried using Arrow or Spark, both of which have Python wrappers (pyarrow, pyspark).
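As a small sketch of what that looks like with pyarrow (the partition columns `split` and `label` are made-up stand-ins for whatever categories are meaningful to you):

```python
# Write metadata as a partitioned Parquet dataset, then query it with pyarrow.
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

table = pa.table({
    "image_path": ["train/cat/0001.jpg", "train/dog/0002.jpg"],
    "split": ["train", "train"],
    "label": ["cat", "dog"],
    "caption": ["a cat on a couch", "a dog in a park"],
})

# Partition by the chosen categories; this creates split=.../label=... directories.
pq.write_to_dataset(table, root_path="metadata", partition_cols=["split", "label"])

# Later: filters on partition columns only read the matching directories,
# so you never scan the full metadata for a narrow query.
dataset = ds.dataset("metadata", format="parquet", partitioning="hive")
cat_rows = dataset.to_table(filter=ds.field("label") == "cat")
print(cat_rows.num_rows, cat_rows.column_names)
```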
For the images themselves, store them in a nested directory structure that mirrors the metadata partitioning, so the images are grouped by the same meaningful attributes you chose for the metadata. This should also keep the number of images per directory from becoming too large. Doing that will let you browse thumbnails with whatever file browser comes with your operating system; to rapidly page through thousands of images, I found that the default Ubuntu image viewer, Eye of GNOME, works really well.
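A small sketch of keeping the image tree in sync with those metadata partitions (paths and category names are illustrative, matching the hypothetical `split`/`label` columns above):

```python
# Copy each image into images/split=.../label=.../ so the directory layout
# mirrors the Parquet partitioning (illustrative paths and categories).
from pathlib import Path
import shutil

def place_image(src: Path, root: Path, split: str, label: str) -> Path:
    dest_dir = root / f"split={split}" / f"label={label}"
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    shutil.copy2(src, dest)
    return dest

place_image(Path("raw/0001.jpg"), Path("images"), split="train", label="cat")
```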