Hi all,

This is similar to some other posts about transcribing podcast episodes.

I downloaded the episodes, transcribed them with Whisper models, and then made the transcripts full-text searchable.

The architecture is simple: RSS -> Download -> Transcribe -> Segment -> Ingest to DB for search.
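A minimal sketch of the last two stages of that pipeline (ingest segments, then search), assuming SQLite's FTS5 extension as the search store. The post does not say which database the site actually uses, so the table layout and the `build_index` and `search` helpers are hypothetical, and FTS5 must be compiled into the local SQLite build for this to run.

```python
import sqlite3

def build_index(segments):
    """Create an in-memory FTS5 index from (episode, start_s, body) tuples."""
    conn = sqlite3.connect(":memory:")
    # episode and start_s are stored alongside the text but not tokenized.
    conn.execute(
        "CREATE VIRTUAL TABLE segments "
        "USING fts5(episode UNINDEXED, start_s UNINDEXED, body)"
    )
    conn.executemany("INSERT INTO segments VALUES (?, ?, ?)", segments)
    conn.commit()
    return conn

def search(conn, query):
    """Return (episode, start_s, body) rows matching the query, best first."""
    return conn.execute(
        "SELECT episode, start_s, body FROM segments "
        "WHERE segments MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()

# Made-up transcript segments, one row per segment of an episode.
segments = [
    ("ep1", 0.0, "welcome to the show about machine learning"),
    ("ep1", 30.0, "today we talk about speech recognition"),
    ("ep2", 0.0, "cooking pasta at home"),
]
conn = build_index(segments)
hits = search(conn, "speech")
```

Storing the segment start time next to the text is what would let a search hit link back to the right position in the audio.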

For episodes with a full transcript available, I also use auto highlighting to highlight important segments of podcast using Wink NLP.

Here is the URL: https://www.castdop.com

I can add around 1400 hours of content per day.

Any feedback, comments, or questions are appreciated.

P.S.: Let me know if this violates any rules. I posted only because I saw similar posts before.

Comments

Acceptable-Cress-374 t1_iyjflu4 wrote on December 1, 2022 at 9:07 PM

I've been meaning to play around with whisper, but never got the time. Does it do any kind of voice / person segmentation as well? Can it tell speakers apart, say in a high quality input such as a podcast?

forfooinbar t1_iyjzzry wrote on December 1, 2022 at 11:27 PM

Whisper doesn't do speaker diarization AFAIK. It will just be one big blob of text.

The_frozen_one t1_iykajiv wrote on December 2, 2022 at 12:46 AM

You can play around with it here: https://whisper.ggerganov.com/

It's significantly slower (approx 50 times slower) than the natively compiled version (https://github.com/ggerganov/whisper.cpp) but you can at least get a sense of accuracy using the online version.

t0mkaka OP t1_iykoh4k wrote on December 2, 2022 at 2:35 AM

Yes, there is no speaker diarization. Adding it would solve some problems with this model and would also make search better.

thundergolfer t1_iykjcgc wrote on December 2, 2022 at 1:56 AM

How do you deploy and scale the transcription? Is it on GPU, and which model variant?

I also built a Whisper transcription app: modal.com/docs/guide/whisper-transcriber. It can do serverless CPU transcription on demand. You can check it out and borrow from it if it's useful. The code is open-source.

P.S.: Yes, this does violate rule 5 (promote on weekends). I violated the same rule when I posted my Whisper app :)

t0mkaka OP t1_iyko74k wrote on December 2, 2022 at 2:33 AM

Yes, it's on GPU. I used tiny and medium. I haven't tried large because I wanted it to run fast. I tried for 3-4 days to parallelize, inspired by your post and also one by Assembly who demoed with parallelized.

Unfortunately, I was not able to parallelize. Whisper works on 30-second clips, and for each clip it passes the text of the previous 30 seconds as a prompt. Since podcasts are not cut into clean 30-second pieces, I needed to pass the prompt in any case, so I cannot transcribe the clips independently.
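To make that dependency concrete, here is a minimal sketch of the loop described above. `fake_transcribe` is a hypothetical stand-in for the model call, not Whisper's API; the point is only that each window needs the previous window's text before it can start.

```python
WINDOW_SECONDS = 30

def split_windows(duration_seconds, window=WINDOW_SECONDS):
    """Return (start, end) pairs covering the audio in fixed-size windows."""
    windows = []
    start = 0
    while start < duration_seconds:
        windows.append((start, min(start + window, duration_seconds)))
        start += window
    return windows

def fake_transcribe(window, prompt):
    """Hypothetical model call; a real one would decode audio using the prompt."""
    start, end = window
    return f"words from {start}s to {end}s"

def transcribe_sequentially(duration_seconds):
    """Transcribe window by window, feeding each result into the next call."""
    prompt = ""
    results = []
    for window in split_windows(duration_seconds):
        text = fake_transcribe(window, prompt)
        results.append((window, prompt, text))
        prompt = text  # the next window cannot start until this text exists
    return results

# A 75-second clip gives three windows, each prompted by the one before it.
result = transcribe_sequentially(75)
```

Because `prompt` is only known after the previous iteration finishes, the windows of a single episode form a chain. Separate episodes have no such link, so they can still run side by side.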

I deploy on vast.ai for cheap GPU usage by the day and run 2 models in parallel. GPU memory usage is low, around 30%, but GPU compute utilization goes to full, and speed begins to fall beyond 2 parallel models. So I run 2 inference runs per GPU. I have used only 1 GPU so far and have not scaled out, but that should not be a tough task now.
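A rough sketch of capping concurrency at two runs per GPU, using a thread pool. `transcribe_episode` is a hypothetical placeholder for one full-episode inference run, and whether threads or separate processes fit better depends on the setup, which is not specified here.

```python
from concurrent.futures import ThreadPoolExecutor

WORKERS_PER_GPU = 2  # the figure quoted above; tune for your own hardware

def transcribe_episode(path):
    """Hypothetical placeholder for one full-episode inference run."""
    return f"transcript of {path}"

def transcribe_all(paths, workers=WORKERS_PER_GPU):
    """Run at most `workers` episodes at once; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transcribe_episode, paths))

paths = ["ep1.mp3", "ep2.mp3", "ep3.mp3"]
transcripts = transcribe_all(paths)
```

Each episode is still processed start to finish by a single worker, so the sequential prompt chain within an episode is preserved.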

thundergolfer t1_iyocw2v wrote on December 2, 2022 at 10:04 PM

Thanks for the details.

> one by Assembly who demoed with parallelized.

What was this demo? Got a link?

t0mkaka OP t1_iypil3i wrote on December 3, 2022 at 3:36 AM

https://www.assemblyai.com/blog/how-to-run-openais-whisper-speech-recognition-model/

Here. Search for the same link on r/MachineLearning and you will find the original post as well.

TheMrZZ0 t1_iyou3a8 wrote on December 3, 2022 at 12:16 AM

I'm curious - what's the source of the data? Being able to add 1400h of new content every day sounds great.

t0mkaka OP t1_iypifl5 wrote on December 3, 2022 at 3:35 AM

I download the audio files from the links in the RSS feed, then generate the transcripts using Whisper. The results are not always great, but it works most of the time.
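For anyone wondering what "links in the RSS feed" looks like in practice, a minimal sketch using only the standard library. Podcast feeds normally carry the audio URL in each item's `<enclosure>` tag; the sample feed and the `audio_links` helper are made up for illustration, and a real pipeline would fetch the feed over HTTP first.

```python
import xml.etree.ElementTree as ET

# Made-up feed for illustration only.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Show</title>
    <item>
      <title>Episode 1</title>
      <enclosure url="https://example.com/ep1.mp3" type="audio/mpeg" length="123"/>
    </item>
    <item>
      <title>Episode 2</title>
      <enclosure url="https://example.com/ep2.mp3" type="audio/mpeg" length="456"/>
    </item>
  </channel>
</rss>"""

def audio_links(feed_xml):
    """Return (episode title, audio URL) pairs from an RSS feed string."""
    root = ET.fromstring(feed_xml)
    links = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        # Skip items that have no downloadable audio attached.
        if enclosure is not None and enclosure.get("url"):
            title = item.findtext("title", default="")
            links.append((title, enclosure.get("url")))
    return links

links = audio_links(SAMPLE_FEED)
```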

Chintan_Mehta t1_j1yeks2 wrote on December 28, 2022 at 8:48 AM

>I also use auto highlighting to highlight important segments of podcast using Wink NLP

Can you elaborate more on this? How do you do it, and have the results always been good? I want to do similar work on research papers for a personal project, and it would be great if I could see the code as well (I'm a student and it's difficult for me to understand everything from conversation!)