lucidraisin t1_jcznnvh wrote
Reply to comment by Unlucky_Excitement_2 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
actually, i'm keeping an eye on Hyena! there are, however, a number of issues i still have with the paper (i'm not going to play reviewer 2, as it is not my place nor is reddit a good forum for that), but i intend to reserve judgement and try it out on a few difficult problems like genomics and EEG later this year. the proof is in the pudding.
lucidraisin t1_jczarq8 wrote
Reply to comment by antonb90 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
that isn't for decoders. encoder only, and it still needs to be verified. the majority of research papers never work out on closer examination. just trust me, stick with flash attention for now until further notice and save yourself a lot of headache.
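for illustration, here is a minimal sketch of hitting the flash attention kernel through pytorch 2.0's built-in scaled dot product attention (the tensor shapes are made up, and i'm pinning the backend explicitly just to show it):

```python
import torch
import torch.nn.functional as F

# illustrative shapes: batch 2, 8 heads, 1024 tokens, 64 dims per head
q = torch.randn(2, 8, 1024, 64, device='cuda', dtype=torch.float16)
k = torch.randn(2, 8, 1024, 64, device='cuda', dtype=torch.float16)
v = torch.randn(2, 8, 1024, 64, device='cuda', dtype=torch.float16)

# force the flash attention backend (pytorch 2.0 otherwise picks one itself)
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # (2, 8, 1024, 64)
```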
lucidraisin t1_jcl6ecd wrote
Reply to comment by super_deap in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
no worries, thanks for running the experiments and sharing your results 🙏
lucidraisin t1_jcl2rkh wrote
Reply to comment by super_deap in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
yea, the literature is scant and all over the place in the efficient attention field. in this paper, i believe they claim the limiting factor is the query-key dimension (d_dot), but i think it should depend on the number of heads too. i don't know of any other papers that explore this topic. i just don't want people to be surprised if they fine-tune to greater context lengths and things don't work as well as they do for gpt4
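just to pin down what that quantity is, a toy calculation (the numbers are made up, and the capacity claim itself is the paper's, not mine):

```python
# hypothetical numbers, just to illustrate what "query-key dimension" refers to
d_model = 512   # model embedding dimension
n_heads = 8     # number of attention heads

# the per-head query-key dimension (what the comment above calls d_dot)
d_head = d_model // n_heads  # 64

# the paper's claim, as i read it: the context length attention can faithfully
# handle is bounded by a function of d_dot, so scaling context without scaling
# d_head (or, i suspect, the head count) may degrade quality
print(f'per-head query-key dimension: {d_head}')
```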
lucidraisin t1_jcl0y16 wrote
it is important for everyone to know that there may be a capacity limit to the context length, as explored by this paper. gpt4 may not have this limit, but smaller variants like llama may. it also depends on the task you are trying to solve. you cannot just get 'infinite context', as some would have you believe their networks can deliver. more research is needed... hopefully pytorch 2.0 leads to that
lucidraisin t1_jamtx7b wrote
Reply to comment by fmai in [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API) by minimaxir
it cannot; the compute still scales quadratically, although the memory bottleneck is now gone. however, i see everyone training at 8k or even 16k within two years, which is more than plenty for previously inaccessible problems. for context lengths at the next order of magnitude (say, genomics at a million base pairs), we will have to see if linear attention (rwkv) pans out, or if recurrent + memory architectures make a comeback.
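a quick back-of-the-envelope sketch of that quadratic scaling (constants dropped, numbers purely illustrative):

```python
# rough attention FLOP count: every query attends to every key,
# so cost grows as n^2 * d_head per head (constant factors omitted)
def attention_flops(n_tokens: int, n_heads: int = 8, d_head: int = 64) -> int:
    # qk^T scores plus the weighted sum over values, per head
    return 2 * n_heads * n_tokens * n_tokens * d_head

base = attention_flops(2048)
for n in (2048, 8192, 16384):
    # doubling the context quadruples attention compute, flash attention or not
    print(f'{n:>6} tokens -> {attention_flops(n) / base:>4.0f}x the attention compute of 2k')
```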
lucidraisin t1_jakdtf7 wrote
Reply to comment by Thunderbird120 in [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API) by minimaxir
yes
edit: it was also used to train Llama. there is no reason not to use flash attention at this point, for both training and fine-tuning / inference
lucidraisin t1_j9h8fu4 wrote
Reply to comment by Animated-AI in [P] The First Depthwise-separable Convolution Animation by Animated-AI
one for transformers, or even just multi-head attention, would be amazing! do you have a patreon?
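for anyone who wants the non-animated version in the meantime, a minimal sketch of multi-head attention (the weights and shapes here are placeholders):

```python
import torch
import torch.nn.functional as F

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads = 8):
    # x: (batch, seq, dim); weights are (dim, dim) projection matrices
    b, n, d = x.shape
    # project, then split the feature dimension into heads
    q, k, v = [(x @ w).reshape(b, n, n_heads, d // n_heads).transpose(1, 2)
               for w in (w_q, w_k, w_v)]
    # scaled dot product attention within each head
    attn = F.softmax(q @ k.transpose(-2, -1) / (d // n_heads) ** 0.5, dim = -1)
    out = (attn @ v).transpose(1, 2).reshape(b, n, d)  # merge heads back
    return out @ w_o

x = torch.randn(2, 16, 512)
weights = [torch.randn(512, 512) * 0.02 for _ in range(4)]
print(multi_head_attention(x, *weights).shape)  # torch.Size([2, 16, 512])
```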
lucidraisin t1_j61h7lf wrote
Reply to comment by currentscurrents in [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers by currentscurrents
and one more paper along the same lines! https://arxiv.org/abs/2212.07677
lucidraisin t1_j5z7z6g wrote
Reply to [P] EvoTorch 0.4.0 dropped with GPU-accelerated implementations of CMA-ES, MAP-Elites and NSGA-II. by NaturalGradient
CMA-ES! definitely playing around with this, thank you!
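for anyone else wanting to try it, a minimal sketch of what that might look like (going off my reading of the evotorch docs; the sphere objective and hyperparameters are placeholders):

```python
import torch
from evotorch import Problem
from evotorch.algorithms import CMAES
from evotorch.logging import StdOutLogger

# toy objective: minimize the sphere function
def sphere(x: torch.Tensor) -> torch.Tensor:
    return (x ** 2).sum()

# pass device='cuda' here to exercise the GPU-accelerated path from the 0.4.0 release
problem = Problem('min', sphere, solution_length=30, initial_bounds=(-5.0, 5.0))
searcher = CMAES(problem, stdev_init=1.0)
StdOutLogger(searcher)  # print progress each generation
searcher.run(100)       # run for 100 generations
```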
lucidraisin t1_j2ff2cy wrote
Reply to comment by Ronny_Jotten in An Open-Source Version of ChatGPT is Coming [News] by lambolifeofficial
right right, more work remains to be done after the new year. we will get there
lucidraisin t1_j2exfpq wrote
Reply to comment by Ronny_Jotten in An Open-Source Version of ChatGPT is Coming [News] by lambolifeofficial
my repositories are more than proofs of concept. they have led to the training of significant models, Stable Diffusion among them.
but it is still deceptive to tell the average person that a chatgpt replication is imminent. good code is just a prerequisite to begin the journey. it will take data, compute, and adventurers to actually set sail, and in the case of chatgpt, a complicated process of gathering human feedback (i will do my best to lower the activation energy by building a simple and concise app that covers all cases, assuming RLHF does not get outdated by another technique)
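for a taste of one piece of that feedback pipeline, a minimal sketch of the reward model preference loss at the heart of RLHF (a bradley-terry style objective; the tiny linear reward model and embedding shapes are placeholders):

```python
import torch
import torch.nn.functional as F

# placeholder reward model: maps a pooled sequence embedding to a scalar reward
reward_model = torch.nn.Linear(768, 1)

# embeddings of responses a human preferred vs. rejected (batch of 4, dim 768)
chosen = torch.randn(4, 768)
rejected = torch.randn(4, 768)

r_chosen = reward_model(chosen).squeeze(-1)
r_rejected = reward_model(rejected).squeeze(-1)

# bradley-terry preference loss: push the chosen reward above the rejected one
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
```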
lucidraisin t1_jczoelv wrote
Reply to comment by Unlucky_Excitement_2 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
yea, no problem, happy to chat more if you are doing research in this space. you can always reach out to me through email