RainbowRedditForum t1_jd64o4e wrote on March 22, 2023 at 2:31 AM

Reply to [D] Simple Questions Thread by AutoModerator

A CRNN is trained on log-mel features, computed as follows:
the input audio is split into 30 ms frames with a 10 ms hop size, and 40 log-mel coefficients are computed for each frame.
The CRNN performs binary classification.
With this setup, are the following two statements true?

Two consecutive output labels generated by the CRNN correspond to two overlapping audio frames (each 30 ms (0.03 s) long, with a 10 ms hop size);
For 10 minutes of audio, the CRNN should generate about 30000 output labels, each associated with a 30 ms frame with 10 ms of overlap.
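
For reference, here is a rough sketch for checking the arithmetic behind both points. It assumes the CRNN emits exactly one label per input frame (no temporal pooling or striding inside the network) and that the audio is not padded; the helper name `frame_stats` is made up for illustration. Under those assumptions the label count seems to land closer to 60000 than 30000, and neighbouring 30 ms frames share 20 ms rather than 10 ms.

```python
def frame_stats(duration_ms: int, win_ms: int = 30, hop_ms: int = 10):
    """Return (number of full frames, overlap in ms between neighbouring frames).

    Assumes no padding: only frames that fit entirely inside the audio count.
    """
    overlap_ms = win_ms - hop_ms  # part of a frame shared with the next one
    if duration_ms < win_ms:
        return 0, overlap_ms
    # One frame at t = 0, then one more for every full hop that still fits.
    n_frames = 1 + (duration_ms - win_ms) // hop_ms
    return n_frames, overlap_ms


# 10 minutes of audio, 30 ms window, 10 ms hop
n_frames, overlap_ms = frame_stats(10 * 60 * 1000)
print(n_frames, overlap_ms)  # 59998 20
```

If the network does downsample along the time axis (for example with pooling in the convolutional layers), the number of labels would shrink by that factor, so the architecture is worth checking before relying on this count.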