Mefaso t1_j29980m wrote on December 30, 2022 at 3:58 PM

>i found that text to video problem is being actively researched and may not require as much compute as bare language models

There are always opportunities for research with little compute. Usually this means your research has to avoid training new models, or at least avoid training from scratch.

However, text-to-video models are typically very compute-intensive.

Complete-Maximum-633 t1_j2ab7zy wrote on December 30, 2022 at 8:01 PM

Anything with “video” is going to be costly.

WobblySilicon OP t1_j2d1n2x wrote on December 31, 2022 at 9:44 AM

The question is how much it costs. Can it be done with one GPU, or do I need a swarm of them?

Complete-Maximum-633 t1_j2drquz wrote on December 31, 2022 at 2:41 PM

Impossible to answer without more context.

WobblySilicon OP t1_j2ffcey wrote on December 31, 2022 at 9:42 PM

Sure! Sir!

In the coming months I will be working on the text-to-video problem. After a literature review, I got the impression that it might be compute-intensive, as in requiring a cluster of GPUs to train the models. So I asked whether it could be done with a mediocre GPU such as a 3080. I haven't really thought about the models I would use or the general architecture. I just wanted an answer, because I don't want to take up this topic and then get stuck due to compute issues.

[deleted] t1_j2a02a8 wrote on December 30, 2022 at 6:49 PM

[deleted]

WobblySilicon OP t1_j2a0oop wrote on December 30, 2022 at 6:53 PM

I do have access to an A6000 for a few days. Other resources (with less memory) are available through the university as well. By compute-expensive I mean whole clusters of GPUs...

I have difficulty wrapping my head around the text-to-video problem (particularly the newer models with many smaller components). Are there any suggestions or resources for getting acquainted with this new task? I have read recent research papers, but it seems hard to find an area where improvement could be made by technical customization of base models. Do you have any tips on this?

Finally, if I can't work on text-to-video, my other option would be deepfake detection. Can you comment on the merits or demerits of choosing this topic for my study? Both topics are very new to me. I have exposure to intermediate vision-based problems and feel confident enough to try these out. Right now it just feels like I am out of ideas for tinkering with the base models.