Submitted by Helveticus99 t3_zvb6l5 in MachineLearning
Helveticus99 OP t1_j1qmmtg wrote
Reply to comment by sigmoid_amidst_relus in [D] Classification task based on speech recordings by Helveticus99
Thank you so much for your input u/sigmoid_amidst_relus. I will consider Mel-Spectrograms instead of MFCCs. Do you know the maximum duration, in seconds, that a single Mel-Spectrogram can cover?
By mental state I don't mean emotions that change quickly but a more long-term state that is reflected in the whole 1-hour recording. Thus, I think repeating the label for every frame might not work well, and I might have to extract features over the full recording. That's also why I think an autoencoder could be problematic.
I could divide the recording into frames and stack the Mel-Spectrograms of the frames (using a 3D CNN). The problem is that I will end up with a huge number of frames. The same problem arises with an RNN: I would end up with a very long time series.
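To put a rough number on "huge", here is a back-of-the-envelope sketch. The sampling rate, hop length, number of mel bands and segment length are all assumed values chosen for illustration, not anything fixed by the task, and the frame count assumes centered framing (one frame per hop, plus one).

```python
# Rough size estimate for the Mel-Spectrogram of a 1-hour recording.
# All parameters below are assumptions for illustration only.
SAMPLE_RATE = 16_000   # Hz (assumed)
HOP_LENGTH = 160       # samples between frames, i.e. 10 ms at 16 kHz (assumed)
N_MELS = 80            # number of mel bands (assumed)
DURATION_S = 60 * 60   # the 1-hour recording
SEGMENT_S = 10         # hypothetical segment length for a CNN input


def mel_frame_count(duration_s, sample_rate=SAMPLE_RATE, hop_length=HOP_LENGTH):
    """Number of spectrogram columns with centered framing: 1 + n_samples // hop."""
    n_samples = duration_s * sample_rate
    return 1 + n_samples // hop_length


total_frames = mel_frame_count(DURATION_S)       # columns for the full hour
frames_per_segment = mel_frame_count(SEGMENT_S)  # columns per 10 s segment
n_segments = DURATION_S // SEGMENT_S             # non-overlapping segments
n_values = total_frames * N_MELS                 # entries in the full spectrogram
size_mb = n_values * 4 / 1e6                     # float32 storage in MB

print(f"frames in full recording: {total_frames}")
print(f"frames per {SEGMENT_S} s segment: {frames_per_segment}")
print(f"number of segments: {n_segments}")
print(f"full spectrogram: {n_values} values, about {size_mb:.0f} MB as float32")
```

Under these assumptions the full hour is a single time series of several hundred thousand frames, or a few hundred 10-second segments if it is chopped up first. Changing the hop length or segment length scales both numbers linearly.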
Using features from a large pretrained model is interesting. Can you recommend a pretrained model that is suitable for feature extraction from long recordings?