I have a dataset where each audio file is around 30 minutes long. I need to classify the audio files into 6 categories, and inference needs to be fast: no more than 1 second. Most of the audio classification techniques I have come across use MFCCs or Mel spectrograms. Producing an MFCC or Mel spectrogram for the entire 30 minutes is time-consuming, so I suspect I have to classify the audio file based on short clips extracted from it. The success of the classification task would then depend on how representative the short clips are of the original audio file. Maybe the short clips can be selected using audio features that aren't too expensive to compute, RMS for example. But I'm not aware of any existing work that has been done in this area. A quick Google search and a scan of Google Scholar didn't give me anything useful, so it would greatly benefit me if someone could point me towards any existing work.

Comments

aman5319 t1_ir0v0em wrote on October 4, 2022 at 3:00 PM

#28,842

I would also like to know the answer to this question.

bklawa t1_ir0wlyp wrote on October 4, 2022 at 3:11 PM

#28,908

Some ideas:

Downsample the audio to a lower sample rate (if it is 48 kHz, perhaps try 8 kHz). How far you can go really depends on the task (music, speech, other general audio recordings...).
You don't need to feed the whole 30-minute spectrogram to the model for classification. An alternative is to reduce the time axis by taking the mean or max, for example, so that you end up with a very small vector. You can also do this over 1-minute segments to keep more information. Either way, this will definitely help reduce the model size.
You can cut out the portions of the audio track that are "silent" or under a certain energy threshold before applying the steps above.
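
The three ideas above can be sketched roughly as follows. This is only a minimal illustration using NumPy: the function name `pooled_features`, the frame sizes, and the -40 dB threshold are assumptions made for the example, and the naive decimation skips the low-pass filter that a proper resampler would apply.

```python
import numpy as np

def pooled_features(audio, sr, target_sr=8000, frame_len=1024, hop=512, silence_db=-40.0):
    """Downsample, drop quiet frames, then mean/max-pool a log spectrogram over time."""
    # Naive integer-factor decimation (assumption: good enough for a sketch;
    # a real pipeline should low-pass filter first to avoid aliasing).
    step = max(1, sr // target_sr)
    x = np.asarray(audio, dtype=np.float32)[::step]

    # Overlapping frames as a strided view: shape (n_frames, frame_len).
    frames = np.lib.stride_tricks.sliding_window_view(x, frame_len)[::hop]

    # Keep only frames whose RMS is within `silence_db` of the loudest frame.
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    level_db = 20 * np.log10(rms / (rms.max() + 1e-12) + 1e-12)
    frames = frames[level_db > silence_db]

    # Log-magnitude spectrogram of the kept frames, then collapse the time axis.
    spec = np.log1p(np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1)))
    return np.concatenate([spec.mean(axis=0), spec.max(axis=0)])
```

With the defaults this returns a fixed-length vector (513 mean values followed by 513 max values) regardless of how long the file is. A completely silent input would leave no frames to pool, so that case needs separate handling.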

Hope this helps

time_waster103 OP t1_ir0xlzf wrote on October 4, 2022 at 3:17 PM

#28,962

Replying to bklawa (#28,908)

Thanks for the ideas

mrobo_5ht2a t1_ir1b24s wrote on October 4, 2022 at 4:43 PM

#29,558

You could store the audio clips as a NumPy memory map and sample from it. This way, you will be able to sample clips without loading everything into memory. One library you can use is mmap-ninja.

Disclaimer: I'm the author of mmap-ninja.
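
For illustration, here is a minimal sketch of the memory-map idea using plain `np.memmap` rather than mmap-ninja. It assumes the audio has already been decoded and written to disk as raw float32 samples; the function name `sample_clips` and its parameters are made up for the example.

```python
import numpy as np

def sample_clips(path, n_samples, clip_len, n_clips, dtype=np.float32, seed=0):
    """Read `n_clips` random windows from a raw sample file without loading it all."""
    # The mapping is lazy: only the pages that are actually sliced get read.
    audio = np.memmap(path, dtype=dtype, mode="r", shape=(n_samples,))
    rng = np.random.default_rng(seed)
    # Sorted start offsets keep the disk access pattern roughly sequential.
    starts = np.sort(rng.integers(0, n_samples - clip_len + 1, size=n_clips))
    # np.array(...) copies just the requested window out of the mapping.
    return np.stack([np.array(audio[s:s + clip_len]) for s in starts])
```

The result has shape `(n_clips, clip_len)` and can be passed to whatever feature extractor and classifier you use.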

alex_lite_21 t1_ir25awj wrote on October 4, 2022 at 7:52 PM

#30,950

1 second for a 30-minute audio file?? Why do you need it so fast? How diverse are the audio files? It depends on how different the classes are from each other and how similar the files within each class are. I guess that 1 second doesn't include the time spent loading the file, does it?

time_waster103 OP t1_ir3djh7 wrote on October 5, 2022 at 1:01 AM

#32,941

Replying to alex_lite_21 (#30,950)

Think of an application where a user uploads audio to their cloud storage and it is automatically categorised. The exact time limit hasn't been decided yet. My task is to do a literature review and experiments to find the least time needed for fairly high classification accuracy.

I said 1 second based on the amount of time beyond which the user might feel labelling the file himself is more convenient.

alex_lite_21 t1_ir3k9cb wrote on October 5, 2022 at 1:54 AM

#33,339

Replying to time_waster103 (#32,941)

Could you extract some (windowed) features during the loading process?

time_waster103 OP t1_ir3kt66 wrote on October 5, 2022 at 1:58 AM

#33,368

Replying to alex_lite_21 (#33,339)

I'm not sure how to implement it, but yes it should be allowed.
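
One way this might look, as a rough sketch only: read the file window by window and compute cheap features as each window arrives, so the work overlaps with loading. The example assumes 16-bit mono PCM WAV input and uses the standard-library `wave` module; the function name `stream_features` and the choice of RMS plus zero-crossing rate are illustrative assumptions.

```python
import wave
import numpy as np

def stream_features(wav_path, window_s=1.0):
    """Yield (rms, zero_crossing_rate) for each window while the file is being read."""
    with wave.open(wav_path, "rb") as wf:
        if wf.getsampwidth() != 2 or wf.getnchannels() != 1:
            raise ValueError("this sketch only handles 16-bit mono PCM")
        n = int(wf.getframerate() * window_s)
        while True:
            raw = wf.readframes(n)
            if len(raw) < 2 * n:  # stop at the final partial window
                break
            # Convert little-endian int16 samples to floats in [-1, 1).
            x = np.frombuffer(raw, dtype="<i2").astype(np.float32) / 32768.0
            rms = float(np.sqrt(np.mean(x ** 2)))
            # Fraction of adjacent sample pairs whose sign differs.
            zcr = float(np.mean(np.signbit(x[1:]) != np.signbit(x[:-1])))
            yield rms, zcr
```

Because it is a generator, the per-window features can be consumed (for example, to pick which windows deserve a full spectrogram) before the rest of the file has been read.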

ccrdallas t1_ir5d2pn wrote on October 5, 2022 at 1:40 PM

#36,153

If you have inexpensive features, then you could try to build a Determinantal Point Process, a model for sampling diverse subsets (with diversity defined via the features and an appropriate kernel). The downside is that this method scales poorly in the worst case (O(N^3) in the number of samples), although there is recent work in this field on speeding up inference. I can’t say a priori if it is fast enough for your task.
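
To make the idea concrete, here is a small sketch of a greedy selection that approximates the most likely subset under a DPP. It does not draw exact DPP samples. The RBF kernel, the `gamma` value, and the function name `greedy_dpp` are assumptions for the example, and the naive loop recomputes determinants from scratch, so it only suits a modest number of candidate clips.

```python
import numpy as np

def greedy_dpp(features, k, gamma=1.0):
    """Greedily pick up to k rows whose RBF-kernel submatrix has the largest determinant."""
    X = np.asarray(features, dtype=np.float64)
    # Pairwise squared distances, then an RBF similarity kernel.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    L = np.exp(-gamma * sq)
    selected = []
    for _ in range(k):
        best, best_logdet = None, -np.inf
        for i in range(len(X)):
            if i in selected:
                continue
            S = selected + [i]
            # Near-duplicate items make this block almost singular (tiny determinant).
            sign, logdet = np.linalg.slogdet(L[np.ix_(S, S)])
            if sign > 0 and logdet > best_logdet:
                best, best_logdet = i, logdet
        if best is None:
            break
        selected.append(best)
    return selected
```

Given one feature row per candidate clip (for example, RMS and zero-crossing rate), the returned indices favour clips that are dissimilar from each other, which is the "diverse subset" behaviour described above.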