Intelligent_Rough_21 OP t1_j3frivp wrote
Reply to comment by geneing in [D] Looking for a dataset of Text-To-Speech audiobook-style Speech Synthesis Markup Language (SSML) files by Intelligent_Rough_21
OK, I’ll admit I’ve only used neural models, not trained them. AWS Polly was incredibly monotone the last time I used it.
geneing t1_j3g1gwa wrote
Most likely you are using the original Polly method, which is based on gluing together sounds of different phonemes. That produces monotone speech.
Try Google WaveNet. It's available through the Google Cloud API, just like Polly.
There's a neural version of Polly, but I never tried it.
Intelligent_Rough_21 OP t1_j3g2vvp wrote
Yeah, I was using neural Polly, which is equivalent to WaveNet. What I discovered is that it always renders the same sentence identically, and usually renders the same word used in the same way identically too, regardless of context clues. “My gosh.” would always come out exactly the same. It really needs paragraph- or dialogue-driven context, as well as a bit of randomization. In a book where the author has a repetitive go-to word or phrase, it’s a killer.
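For what it's worth, here is a rough sketch of the "bit of randomization" idea done at the SSML level: wrap each repeat of a sentence in a `<prosody rate="...">` tag with a slightly different speaking rate. The function name `vary_repeats`, the rate values, and the seed are all made up for illustration. It only nudges speed, so it is a crude workaround and not real context awareness, and I'm assuming the engine honours percentage rates.

```python
import random
from xml.sax.saxutils import escape


def vary_repeats(sentences, rates=("90%", "95%", "105%", "110%"), seed=0):
    """Build an SSML document in which repeated sentences get a varied rate.

    The rates and seed are arbitrary illustrative values, not tuned settings.
    """
    rng = random.Random(seed)  # seeded so the output is reproducible
    seen = {}                  # sentence -> number of times already emitted
    parts = []
    for sentence in sentences:
        count = seen.get(sentence, 0)
        seen[sentence] = count + 1
        text = escape(sentence)  # escape &, <, > so the SSML stays well-formed
        if count == 0:
            # First occurrence: leave the engine's default rendering alone.
            parts.append(f"<s>{text}</s>")
        else:
            # Repeat: pick a speaking rate that differs from the default.
            rate = rng.choice(rates)
            parts.append(f'<s><prosody rate="{rate}">{text}</prosody></s>')
    return "<speak>" + "".join(parts) + "</speak>"


ssml = vary_repeats(["My gosh.", "He ran.", "My gosh."])
```

The resulting string can be passed to whichever TTS service you use as SSML input. Two repeats can still land on the same rate by chance, so this reduces the sameness without removing it.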
geneing t1_j3hzfpy wrote
I think what you are looking for is called "expressive TTS". There have been a ton of papers in the last couple of years on the topic. Many provide code.
I've had some success with simply preserving the hidden state of the network from one sentence to the next.
SSML may not be expressive enough for your application.
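To make the hidden-state idea above concrete, here is a toy sketch. It is not a real TTS model: the recurrent cell, its weights, and the names `step`, `encode`, and `encode_book` are invented for illustration. It only shows that if the state is reset for every sentence, a repeated sentence always ends in the same state, whereas carrying the state forward makes the result depend on what came before.

```python
import math


def step(hidden, char, w_in=0.5, w_rec=0.9):
    """One toy recurrent update. The weights are arbitrary constants."""
    x = (ord(char) % 32) / 32.0  # crude numeric feature for the character
    size = len(hidden)
    return [
        math.tanh(w_rec * h + w_in * x * (i + 1) / size)
        for i, h in enumerate(hidden)
    ]


def encode(sentence, hidden=None, size=4):
    """Run the toy cell over a sentence, starting from `hidden` or zeros."""
    hidden = [0.0] * size if hidden is None else list(hidden)
    for ch in sentence:
        hidden = step(hidden, ch)
    return hidden


def encode_book(sentences, carry_state):
    """Encode sentences in order, optionally carrying state across them."""
    states = []
    hidden = None
    for sentence in sentences:
        start = hidden if carry_state else None  # None means a fresh state
        hidden = encode(sentence, start)
        states.append(hidden)
    return states


book = ["He dropped the vase.", "My gosh.", "She won the lottery.", "My gosh."]
reset = encode_book(book, carry_state=False)
carried = encode_book(book, carry_state=True)

print(reset[1] == reset[3])      # same sentence, fresh state each time
print(carried[1] == carried[3])  # same sentence, different preceding context
```

In a real expressive TTS system the carried state would feed the prosody or acoustic model, and how much it influences the output depends on the architecture.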
Intelligent_Rough_21 OP t1_j3lkkbq wrote
Thanks for the reference, I’ll look into it.