RainbowRedditForum

RainbowRedditForum t1_jd64o4e wrote

A CRNN is trained with logmel as input, calculated as follows:
the input audio is split in 30ms frames with 10ms hop size, and 40 logmel are calculated for each frame.
The CRNN performs a binary classification.
With this setup, are these two considerations true?

  • two consecutive output labels generated by the CRNN are associated with two overlapped audio frames (each of size 30ms (0.03s) and hop size 10ms);
  • for 10 minutes audio the CRNN should generate about 30000 output labels, each one associated with a 30ms frame with 10ms of overlap
1