Submitted by KarmaCut132 t3_10mab6p in MachineLearning
I'm new to CTC. After learning about CTC and its application in end-to-end training for speech recognition, I figured that if we want to generate a target sequence (transcript) given source features, we could just use the vanilla encoder-decoder Transformer architecture (also used in T5, BART, etc.) alone, without CTC. So why do people use only CTC for end-to-end speech recognition, or a hybrid of CTC and a decoder in some papers?
Thanks.
albertzeyer t1_j65rtdq wrote
What do you mean? There are many such papers where people only use attention-based encoder-decoder (AED) for speech recognition. Some random papers:
See my PhD thesis for an overview of CTC, AED, RNN-T and other approaches: https://www-i6.informatik.rwth-aachen.de/publications/download/1223/Zeyer--2022.pdf
I call this "sequence-to-sequence architecture".
I think most people nowadays use RNN-T.
Some people use CTC just because of its simplicity. It can also be more stable to train, it behaves more sanely on long sequences where AED might break, and online streaming is simpler than with AED.
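Part of that simplicity is CTC's output rule: the model emits one label (or a blank) per frame, and the frame-level path is collapsed by merging repeats and then dropping blanks. A minimal sketch of that collapse step (the function name and the blank index 0 are my choices, not from the comment):

```python
def ctc_collapse(path, blank=0):
    """Collapse a frame-level CTC path into a label sequence:
    merge consecutive repeats, then drop blank symbols."""
    out = []
    prev = None
    for sym in path:
        # A symbol is kept only if it differs from the previous frame
        # (repeat merging) and is not the blank token.
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return out

# A blank between two identical labels separates them,
# so [0, 1, 1, 0, 1, 2, 2, 0] collapses to [1, 1, 2].
print(ctc_collapse([0, 1, 1, 0, 1, 2, 2, 0]))
```

Greedy CTC decoding is just this collapse applied to the per-frame argmax, which is why streaming with CTC is straightforward: no autoregressive decoder state is needed.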
In terms of accuracy, AED is clearly better than CTC, and RNN-T is also better than CTC.
Of course, a combination is better still: AED+CTC beats either AED or CTC alone. And ESPnet, a very popular open-source framework, has this implemented, so many people just use that.
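The hybrid CTC/attention training objective behind this combination is a simple interpolation of the two losses on a shared encoder. A minimal sketch (the weight 0.3 is a commonly used value, not something the comment specifies):

```python
def hybrid_loss(ctc_loss, attn_loss, weight=0.3):
    """Hybrid CTC/attention multi-task objective:
    L = weight * L_ctc + (1 - weight) * L_attention.
    Both losses are computed from the same shared encoder output."""
    return weight * ctc_loss + (1.0 - weight) * attn_loss

# Example: interpolate scalar losses from the two branches.
print(hybrid_loss(2.0, 1.0))
```

During training the CTC branch acts as a regularizer that encourages monotonic alignments in the encoder; at decoding time the two scores can likewise be combined for joint scoring.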