lucidraisin t1_jcznnvh wrote
Reply to comment by Unlucky_Excitement_2 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
actually, i'm keeping an eye on Hyena! there are, however, a number of issues i still have with the paper (i'm not going to play reviewer 2, as it is not my place nor is reddit a good forum for that), but i intend to reserve judgement and try it out on a few difficult problems like genomics and EEG later this year. the proof is in the pudding.
lucidraisin t1_jczarq8 wrote
Reply to comment by antonb90 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
that isn't for decoders. encoder only, and it still needs to be verified. the majority of research papers never work out on closer examination. just trust me, stick with flash attention for now until further notice and save yourself a lot of headache.
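for illustration, here is a minimal sketch of hitting the flash attention kernel through pytorch 2.0's built-in scaled dot product attention (the tensor shapes are made up, and i'm pinning the backend explicitly just to show it):

```python
import torch
import torch.nn.functional as F

# illustrative shapes: batch 2, 8 heads, 1024 tokens, 64 dims per head
q = torch.randn(2, 8, 1024, 64, device='cuda', dtype=torch.float16)
k = torch.randn(2, 8, 1024, 64, device='cuda', dtype=torch.float16)
v = torch.randn(2, 8, 1024, 64, device='cuda', dtype=torch.float16)

# force the flash attention backend (pytorch 2.0 otherwise picks one itself)
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # (2, 8, 1024, 64)
```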
lucidraisin t1_jcl6ecd wrote
Reply to comment by super_deap in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
no worries, thanks for running the experiments and sharing your results 🙏
lucidraisin t1_jcl2rkh wrote
Reply to comment by super_deap in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
yea, the literature is scant and all over the place in the efficient attention field. in this paper, i believe they claim the limiting factor is the query-key dimension (d_dot), but i think it should depend on the number of heads too. i don't know of any other papers that explore this topic. i just don't want people to be surprised if they fine-tune to greater context lengths and things don't work as well as they do for gpt4
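just to pin down what that quantity is, a toy calculation (the numbers are made up, and the capacity claim itself is the paper's, not mine):

```python
# hypothetical numbers, just to illustrate what "query-key dimension" refers to
d_model = 512   # model embedding dimension
n_heads = 8     # number of attention heads

# the per-head query-key dimension (what the comment above calls d_dot)
d_head = d_model // n_heads  # 64

# the paper's claim, as i read it: the context length attention can faithfully
# handle is bounded by a function of d_dot, so scaling context without scaling
# d_head (or, i suspect, the head count) may degrade quality
print(f'per-head query-key dimension: {d_head}')
```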
lucidraisin t1_jcl0y16 wrote
it is important for everyone to know that there may be a capacity limit to the context length, as explored by this paper. gpt4 may not have this limit, but smaller variants like llama may. it also depends on the task you are trying to solve. you cannot just get 'infinite context', as some would have you believe their networks can deliver. more research is needed... hopefully pytorch 2.0 leads to that
lucidraisin t1_jamtx7b wrote
Reply to comment by fmai in [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API) by minimaxir
it cannot; the compute still scales quadratically, although the memory bottleneck is now gone. however, i see everyone training at 8k or even 16k within two years, which is more than plenty for previously inaccessible problems. for context lengths at the next order of magnitude (say, genomics at a million base pairs), we will have to see if linear attention (rwkv) pans out, or if recurrent + memory architectures make a comeback.
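a quick back-of-the-envelope sketch of that quadratic scaling (constants dropped, numbers purely illustrative):

```python
# rough attention FLOP count: every query attends to every key,
# so cost grows as n^2 * d_head per head (constant factors omitted)
def attention_flops(n_tokens: int, n_heads: int = 8, d_head: int = 64) -> int:
    # qk^T scores plus the weighted sum over values, per head
    return 2 * n_heads * n_tokens * n_tokens * d_head

base = attention_flops(2048)
for n in (2048, 8192, 16384):
    # doubling the context quadruples attention compute, flash attention or not
    print(f'{n:>6} tokens -> {attention_flops(n) / base:>4.0f}x the attention compute of 2k')
```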
lucidraisin t1_jakdtf7 wrote
Reply to comment by Thunderbird120 in [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API) by minimaxir
yes
edit: it was also used to train Llama. there is no reason not to use flash attention at this point, for both training and fine-tuning / inference
lucidraisin t1_j9h8fu4 wrote
Reply to comment by Animated-AI in [P] The First Depthwise-separable Convolution Animation by Animated-AI
one for transformers, or even just multi-head attention, would be amazing! do you have a patreon?
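for anyone who wants the non-animated version in the meantime, a minimal sketch of multi-head attention (the weights and shapes here are placeholders):

```python
import torch
import torch.nn.functional as F

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads = 8):
    # x: (batch, seq, dim); weights are (dim, dim) projection matrices
    b, n, d = x.shape
    # project, then split the feature dimension into heads
    q, k, v = [(x @ w).reshape(b, n, n_heads, d // n_heads).transpose(1, 2)
               for w in (w_q, w_k, w_v)]
    # scaled dot product attention within each head
    attn = F.softmax(q @ k.transpose(-2, -1) / (d // n_heads) ** 0.5, dim = -1)
    out = (attn @ v).transpose(1, 2).reshape(b, n, d)  # merge heads back
    return out @ w_o

x = torch.randn(2, 16, 512)
weights = [torch.randn(512, 512) * 0.02 for _ in range(4)]
print(multi_head_attention(x, *weights).shape)  # torch.Size([2, 16, 512])
```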
lucidraisin t1_j61h7lf wrote
Reply to comment by currentscurrents in [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers by currentscurrents
and one more paper along the same lines! https://arxiv.org/abs/2212.07677
lucidraisin t1_j5z7z6g wrote
Reply to [P] EvoTorch 0.4.0 dropped with GPU-accelerated implementations of CMA-ES, MAP-Elites and NSGA-II. by NaturalGradient
CMA-ES! definitely playing around with this, thank you!
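for anyone else wanting to try it, a minimal sketch of what that might look like (going off my reading of the evotorch docs; the sphere objective and hyperparameters are placeholders):

```python
import torch
from evotorch import Problem
from evotorch.algorithms import CMAES
from evotorch.logging import StdOutLogger

# toy objective: minimize the sphere function
def sphere(x: torch.Tensor) -> torch.Tensor:
    return (x ** 2).sum()

# pass device='cuda' here to exercise the GPU-accelerated path from the 0.4.0 release
problem = Problem('min', sphere, solution_length=30, initial_bounds=(-5.0, 5.0))
searcher = CMAES(problem, stdev_init=1.0)
StdOutLogger(searcher)  # print progress each generation
searcher.run(100)       # run for 100 generations
```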
lucidraisin t1_j2ff2cy wrote
Reply to comment by Ronny_Jotten in An Open-Source Version of ChatGPT is Coming [News] by lambolifeofficial
right right, more work remains to be done after the new year. we will get there
lucidraisin t1_j2exfpq wrote
Reply to comment by Ronny_Jotten in An Open-Source Version of ChatGPT is Coming [News] by lambolifeofficial
my repositories are more than proofs of concept. they have led to the training of significant models, Stable Diffusion among them.
but it is still deceptive to tell the average person that a chatgpt replication is imminent. good code is just a prerequisite to begin the journey. it will take data, compute, and adventurers to actually set sail, and in the case of chatgpt, a complicated process of gathering human feedback (i will do my best to lower the activation energy by building a simple and concise app that covers all cases, assuming RLHF does not get outdated by another technique)
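for a taste of one piece of that feedback pipeline, a minimal sketch of the reward model preference loss at the heart of RLHF (a bradley-terry style objective; the tiny linear reward model and embedding shapes are placeholders):

```python
import torch
import torch.nn.functional as F

# placeholder reward model: maps a pooled sequence embedding to a scalar reward
reward_model = torch.nn.Linear(768, 1)

# embeddings of responses a human preferred vs. rejected (batch of 4, dim 768)
chosen = torch.randn(4, 768)
rejected = torch.randn(4, 768)

r_chosen = reward_model(chosen).squeeze(-1)
r_rejected = reward_model(rejected).squeeze(-1)

# bradley-terry preference loss: push the chosen reward above the rejected one
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
```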
lucidraisin t1_jczoelv wrote
Reply to comment by Unlucky_Excitement_2 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
yea, no problem, happy to chat more if you are doing research in this space. you can always reach out to me through email