Submitted by KarmaCut132 t3_10mab6p in MachineLearning
I'm new to CTC. After learning about CTC and its application in end-to-end training for speech recognition, I figured that if we want to generate a target sequence (transcript) given source features, we could just use the vanilla encoder-decoder Transformer architecture (also used in T5, BART, etc.) alone, without CTC. So why do people use only CTC for end-to-end speech recognition, or a hybrid of CTC and a decoder in some papers?
Thanks.
albertzeyer t1_j65rtdq wrote
What do you mean? There are many such papers where people only use attention-based encoder-decoder (AED) for speech recognition. Some random papers:
See my PhD thesis for an overview of CTC, AED, RNN-T and other approaches: https://www-i6.informatik.rwth-aachen.de/publications/download/1223/Zeyer--2022.pdf
I call this "sequence-to-sequence architecture".
I think most people nowadays use RNN-T.
Some people use CTC just because of its simplicity. It can also be more stable to train, it behaves more sanely on long sequences where AED might break, and online streaming is simpler than with AED.
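Part of that simplicity is CTC's output rule: the model emits one label (or a blank) per frame, and the frame-level path is collapsed by merging repeats and then dropping blanks. A minimal sketch of that collapse step (the function name and the blank index 0 are my choices, not from the comment):

```python
def ctc_collapse(path, blank=0):
    """Collapse a frame-level CTC path into a label sequence:
    merge consecutive repeats, then drop blank symbols."""
    out = []
    prev = None
    for sym in path:
        # A symbol is kept only if it differs from the previous frame
        # (repeat merging) and is not the blank token.
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return out

# A blank between two identical labels separates them,
# so [0, 1, 1, 0, 1, 2, 2, 0] collapses to [1, 1, 2].
print(ctc_collapse([0, 1, 1, 0, 1, 2, 2, 0]))
```

Greedy CTC decoding is just this collapse applied to the per-frame argmax, which is why streaming with CTC is straightforward: no autoregressive decoder state is needed.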
In terms of accuracy, AED is clearly better than CTC, and RNN-T is also better than CTC.
Of course, a combination is better still: AED+CTC beats either AED or CTC alone. And ESPnet, a very popular open-source framework, has this implemented, so many people just use that.
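The hybrid CTC/attention training objective behind this combination is a simple interpolation of the two losses on a shared encoder. A minimal sketch (the weight 0.3 is a commonly used value, not something the comment specifies):

```python
def hybrid_loss(ctc_loss, attn_loss, weight=0.3):
    """Hybrid CTC/attention multi-task objective:
    L = weight * L_ctc + (1 - weight) * L_attention.
    Both losses are computed from the same shared encoder output."""
    return weight * ctc_loss + (1.0 - weight) * attn_loss

# Example: interpolate scalar losses from the two branches.
print(hybrid_loss(2.0, 1.0))
```

During training the CTC branch acts as a regularizer that encourages monotonic alignments in the encoder; at decoding time the two scores can likewise be combined for joint scoring.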