Helveticus99 OP t1_j1qmmtg wrote on December 26, 2022 at 4:58 PM

Reply to comment by sigmoid_amidst_relus in [D] Classification task based on speech recordings by Helveticus99

Thank you so much for your input, u/sigmoid_amidst_relus. I will consider Mel-Spectrograms instead of MFCCs. Do you know the maximum duration, in seconds, that a single Mel-Spectrogram can reasonably cover?
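To put rough numbers on that question: the time axis of a Mel-Spectrogram grows linearly with the recording length, so the limit is memory and model input size rather than a hard maximum. Here is a back-of-the-envelope sketch. The 16 kHz sample rate, 160-sample hop (10 ms), 80 Mel bands, float32 values and centered framing are all my own assumptions, not values from this thread.

```python
def mel_frame_count(duration_s, sample_rate=16000, hop_length=160):
    """Number of spectrogram columns for a clip.

    Assumes centered framing, i.e. one frame per hop plus one.
    """
    n_samples = int(duration_s * sample_rate)
    return 1 + n_samples // hop_length


def mel_size_mb(duration_s, n_mels=80, sample_rate=16000,
                hop_length=160, bytes_per_value=4):
    """Approximate in-memory size in megabytes (float32 by default)."""
    frames = mel_frame_count(duration_s, sample_rate, hop_length)
    return frames * n_mels * bytes_per_value / 1e6


if __name__ == "__main__":
    for seconds in (10, 60, 3600):
        print(seconds, mel_frame_count(seconds), round(mel_size_mb(seconds), 2))
```

Under these assumptions a 1 hour recording comes to about 360,000 frames and roughly 115 MB for a single spectrogram, which is why splitting into shorter windows seems unavoidable.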

By mental state I'm not referring to emotions that change quickly but to a more long-term state that is reflected in the whole 1-hour recording. Thus, I think repeating the label for every frame might not work well. I might have to extract features over the full recording. That's also why I think an autoencoder can be problematic.
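One way I could imagine getting a single feature vector for the full recording is to compute features per segment and then pool them over time, for example with a mean and standard deviation per dimension. This is only a minimal sketch of that pooling idea. The function name `pool_segments` is hypothetical, and the per-segment features could come from anywhere (Mel statistics, embeddings from a pretrained model, and so on).

```python
from statistics import fmean, pstdev


def pool_segments(segment_features):
    """Collapse per-segment feature vectors into one recording-level vector.

    segment_features: list of equal-length lists of floats, one per segment.
    Returns [mean_0, ..., mean_{d-1}, std_0, ..., std_{d-1}].
    """
    if not segment_features:
        raise ValueError("need at least one segment")
    # Transpose so each tuple holds one feature dimension across all segments.
    dims = list(zip(*segment_features))
    means = [fmean(d) for d in dims]
    stds = [pstdev(d) for d in dims]  # population std over segments
    return means + stds


if __name__ == "__main__":
    # Two toy segments with two feature dimensions each.
    print(pool_segments([[1.0, 2.0], [3.0, 2.0]]))
```

The pooled vector has a fixed size regardless of recording length, so the recording-level label only has to be attached once instead of being repeated for every frame.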

I could divide the recording into frames and stack the Mel-Spectrograms of the frames (using a 3D CNN). The problem is that I will end up with a huge number of frames. The same problem arises with an RNN: I will end up with a huge time series.
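To see how large "huge" gets, here is a small sketch that lists the window boundaries for a recording. The 10 second window and the hop values are placeholder assumptions of mine, and `segment_bounds` is a hypothetical helper, not something from a library.

```python
def segment_bounds(total_s, window_s=10.0, hop_s=10.0):
    """Return (start, end) times in seconds for fixed-length windows.

    A trailing partial window is dropped. hop_s < window_s gives overlap.
    """
    if total_s < window_s:
        return []
    # Index-based computation avoids accumulating floating-point drift.
    n_windows = int((total_s - window_s) // hop_s) + 1
    return [(i * hop_s, i * hop_s + window_s) for i in range(n_windows)]


if __name__ == "__main__":
    one_hour = 3600.0
    print(len(segment_bounds(one_hour)))             # non-overlapping windows
    print(len(segment_bounds(one_hour, hop_s=5.0)))  # 50% overlap
```

With these assumed settings a 1 hour recording yields 360 non-overlapping windows, or 719 with 50% overlap, so the sequence fed to a 3D CNN or RNN is a few hundred steps long rather than hundreds of thousands of raw frames.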

Using features from a large pretrained model is interesting. Can you recommend a pretrained model that is suitable for feature extraction from long recordings?

Helveticus99 OP t1_j1qjdxl wrote on December 26, 2022 at 4:34 PM

Reply to comment by shadow_fax1024 in [D] Classification task based on speech recordings by Helveticus99

Thank you, u/shadow_fax1024. How did you handle audio files of different lengths? And how exactly did you handle the long audio files? I think creating a single Mel-Spectrogram over a long audio file won't work.

Helveticus99 OP t1_j1pu570 wrote on December 26, 2022 at 12:48 PM

Reply to comment by shadow_fax1024 in [D] Classification task based on speech recordings by Helveticus99

Thank you, u/shadow_fax1024. Did you use an RNN or a plain CNN? Did you also have audio files that long (40-60 min)? I'm not sure how such long audio files can be used in an RNN.