Submitted by t0mkaka t3_z9ps9k in MachineLearning

Hi all,

This is similar to some other posts about transcribing podcast episodes.

I downloaded them, transcribed them with Whisper models, and then made them full-text searchable.

The architecture is simple: RSS -> Download -> Transcribe -> Segment -> Ingest into DB for search.
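Roughly, the pipeline looks like the sketch below. The feed URL, table layout, and library choices (feedparser / requests / openai-whisper / SQLite FTS5) are placeholders for illustration, not the exact production code:

```python
# Minimal sketch of the pipeline: RSS -> Download -> Transcribe -> Segment -> DB.
# Assumes: pip install feedparser requests openai-whisper
import sqlite3
import feedparser
import requests
import whisper

FEED_URL = "https://example.com/podcast/feed.xml"  # placeholder feed

def download_episodes(feed_url):
    """Yield (title, local_path) for each audio enclosure in the feed."""
    feed = feedparser.parse(feed_url)
    for entry in feed.entries:
        for enc in entry.get("enclosures", []):
            if "audio" in enc.get("type", ""):
                path = enc.href.split("/")[-1].split("?")[0]
                with open(path, "wb") as f:
                    f.write(requests.get(enc.href, timeout=60).content)
                yield entry.title, path

def segments_of(result, max_chars=500):
    """Group Whisper's native segments into larger, searchable chunks."""
    chunk, start = "", None
    for seg in result["segments"]:
        start = seg["start"] if start is None else start
        chunk += seg["text"]
        if len(chunk) >= max_chars:
            yield start, seg["end"], chunk.strip()
            chunk, start = "", None
    if chunk:
        yield start, result["segments"][-1]["end"], chunk.strip()

db = sqlite3.connect("podcasts.db")
# An FTS5 virtual table gives full-text search out of the box.
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS transcripts "
           "USING fts5(title, start_time, end_time, text)")

model = whisper.load_model("medium")  # tiny and medium are what I actually run
for title, path in download_episodes(FEED_URL):
    result = model.transcribe(path)
    db.executemany("INSERT INTO transcripts VALUES (?, ?, ?, ?)",
                   [(title, s, e, t) for s, e, t in segments_of(result)])
    db.commit()
```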

For the full transcripts, I also use auto highlighting to highlight important segments of the podcast using Wink NLP.
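The real highlighting runs on Wink NLP, which is a JavaScript library; purely to illustrate the idea, here is a toy Python stand-in that scores sentences by word frequency and wraps the top ones. It is not the winkNLP code, just the concept:

```python
# Toy illustration only: score sentences by average word frequency and
# "highlight" the top-scoring ones. The real app uses winkNLP (JavaScript).
import re
from collections import Counter

def highlight(text, top_k=3):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(words)

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    best = set(sorted(sentences, key=score, reverse=True)[:top_k])
    # Wrap the most "important" sentences in **...** as a stand-in for highlighting.
    return " ".join(f"**{s}**" if s in best else s for s in sentences)

print(highlight("Transcripts are long. Search needs highlights. "
                "Highlights make long transcripts easy to skim."))
```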


Here is the URL: https://www.castdop.com


I can add around 1400 hours of content per day.

Any feedback / comments / questions are appreciated.

P.S.: Let me know if this violates some rules; I just posted because I saw similar posts before.

75

Comments


Acceptable-Cress-374 t1_iyjflu4 wrote

I've been meaning to play around with whisper, but never got the time. Does it do any kind of voice / person segmentation as well? Can it tell speakers apart, say in a high quality input such as a podcast?

7

forfooinbar t1_iyjzzry wrote

Whisper doesn't do speaker diarization AFAIK. It will just be one big blob of text.

4

t0mkaka OP t1_iykoh4k wrote

Yes, there is no speaker diarization. Adding it would solve some problems in this setup too and would make search better.

2

thundergolfer t1_iykjcgc wrote

How do you deploy and scale the transcription? Is it on GPU, which model variant?

I also built a whisper transcription app: modal.com/docs/guide/whisper-transcriber. It can do serverless CPU transcription on-demand. You can check it out and borrow from it if it's useful. The code is open-source.

P.S.: Yes, this does violate rule 5 (promote on weekends); I violated the same rule when I posted my whisper app :)

2

t0mkaka OP t1_iyko74k wrote

Yes, it's on GPU. I used tiny and medium; I haven't tried large because I wanted it to run fast. I tried for 3-4 days to parallelize and was inspired by your post, and also by one from Assembly that demoed a parallelized setup.

But unfortunately, I was not able to parallelize. Whisper works on 30-second clips, and for each next 30-second window it passes the text of the previous window as a prompt. Since podcasts don't break cleanly at 30-second boundaries, I need that prompt in any case, so I cannot transcribe the chunks independently.
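Roughly, this is what the dependency looks like with openai-whisper's transcribe() options (file names here are just placeholders):

```python
import whisper

model = whisper.load_model("medium")

# Within a single file, transcribe() already conditions each 30-second window
# on the text decoded from the previous one (condition_on_previous_text=True
# is the default), so the windows inside an episode form a sequential chain.
result = model.transcribe("episode.mp3", condition_on_previous_text=True)

# Splitting an episode into chunks yourself doesn't help: each chunk still
# needs the previous chunk's text as initial_prompt to keep context, e.g.
first = model.transcribe("episode_part1.mp3")
second = model.transcribe("episode_part2.mp3", initial_prompt=first["text"][-200:])
```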

I deploy on vast.ai for cheap per-day GPU usage and run 2 models in parallel. GPU memory usage is low, around 30%, but the GPU's compute usage goes to full and speed begins to fall beyond 2 parallel models, so I run 2 inference runs per GPU. I have used only 1 GPU at the moment and haven't scaled it, but that should not be a tough task now.
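A rough sketch of the 2-runs-per-GPU setup, with separate worker processes each loading its own model (paths and worker count are placeholders, not the actual deployment script):

```python
# Run N independent Whisper workers on the same GPU; each loads its own copy of
# the model. Memory is fine (~30% usage in my case), compute is the bottleneck,
# so going beyond 2 workers per GPU starts to slow everything down.
from concurrent.futures import ProcessPoolExecutor

EPISODES = ["ep1.mp3", "ep2.mp3", "ep3.mp3", "ep4.mp3"]  # placeholder queue
WORKERS = 2

def transcribe_batch(paths):
    import whisper  # import inside the worker process
    model = whisper.load_model("medium")
    return [model.transcribe(p)["text"] for p in paths]

if __name__ == "__main__":
    # Split the episode list across the workers, one model per process.
    batches = [EPISODES[i::WORKERS] for i in range(WORKERS)]
    with ProcessPoolExecutor(max_workers=WORKERS) as pool:
        for texts in pool.map(transcribe_batch, batches):
            print(len(texts), "episodes transcribed by this worker")
```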

3

TheMrZZ0 t1_iyou3a8 wrote

I'm curious - what's the source of the data? Being able to add 1400h of new content every day sounds great.

1

t0mkaka OP t1_iypifl5 wrote

I download the audio files from the links in the RSS feeds, then generate the transcripts using Whisper. The results aren't always great, but it works most of the time.

1

Chintan_Mehta t1_j1yeks2 wrote

>I also use auto highlighting to highlight important segments of the podcast using Wink NLP

Can you elaborate more on this? How do you do it, and have the results always been good? I want to do similar work for my own personal project on research papers, and it would be great if I could see the code as well (I'm a student and it's difficult for me to understand everything from conversation!)

1