_Arsenie_Boca_ t1_je8km8c wrote
Reply to [R] LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention by floppy_llama
Very interesting work! Though I find the explanation of the concrete approach (how the additional parameters are used by the LM) to be a bit vague. Does anyone have a deeper understanding? Is it using regular adapters?
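For context, this is roughly what I mean by a "regular adapter" (a Houlsby-style bottleneck module added to a frozen LM); the dimensions and names here are just placeholders, not what the paper actually does:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Standard bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, d_model: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps the frozen LM's behaviour largely intact early in training
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```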
_Arsenie_Boca_ t1_je6ayl7 wrote
Reply to comment by RicketyCricket in [D] Alternatives to fb Hydra? by alyflex
Thanks, looks like your library isn't far behind hydra in terms of functionality. Will definitely look into it more closely the next time I set up a project.
What would you say are the pros and cons of spock compared to hydra?
_Arsenie_Boca_ t1_je5d04j wrote
Reply to comment by RicketyCricket in [D] Alternatives to fb Hydra? by alyflex
Looks interesting, a bit more lightweight than hydra. But it also misses a lot of cool features, like composing multiple yaml configs.
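For reference, the composition feature I mean looks roughly like this (the config group names and paths are made up):

```python
# conf/config.yaml would contain a defaults list that pulls in other yaml files, e.g.
#   defaults:
#     - model: resnet
#     - optimizer: adam
# with conf/model/resnet.yaml and conf/optimizer/adam.yaml holding the group-specific options.
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="conf", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # cfg is the composed config; groups can be swapped or overridden from the CLI,
    # e.g. `python main.py optimizer=sgd optimizer.lr=0.01`
    print(cfg.model, cfg.optimizer)

if __name__ == "__main__":
    main()
```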
_Arsenie_Boca_ t1_jdqy1n8 wrote
Reply to comment by Co0k1eGal3xy in Is it possible to merge transformers? [D] by seraphaplaca2
Merging model outputs also means you have to run both models. I think the best option is to merge the weights and recover performance using datasets from both domains and distillation from the respective expert model.
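A minimal sketch of what I have in mind, assuming two models with identical architectures (the interpolation weight `alpha` and the distillation temperature are placeholders):

```python
import copy
import torch
import torch.nn.functional as F

def merge_weights(model_a, model_b, alpha: float = 0.5):
    """Linearly interpolate the parameters of two same-architecture models."""
    merged = copy.deepcopy(model_a)
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    merged.load_state_dict({k: alpha * state_a[k] + (1 - alpha) * state_b[k] for k in state_a})
    return merged

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL divergence between softened student and expert-teacher distributions,
    used to recover per-domain performance after merging."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
```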
_Arsenie_Boca_ t1_jdlc2ah wrote
Reply to comment by learn-deeply in [R] Reflexion: an autonomous agent with dynamic memory and self-reflection - Noah Shinn et al 2023 Northeastern University Boston - Outperforms GPT-4 on HumanEval accuracy (0.67 --> 0.88)! by Singularian2501
Thanks! If that is really the TL;DR, I have never seen an abstract that beats about the bush so much
_Arsenie_Boca_ t1_jder0rp wrote
Reply to [N] ChatGPT plugins by Singularian2501
Bad news for langchain
_Arsenie_Boca_ t1_jdc0ko2 wrote
Reply to comment by Icko_ in [P] Open-source GPT4 & LangChain Chatbot for large PDF docs by radi-cho
True, but I'm not sure how much cheaper that would really be.
_Arsenie_Boca_ t1_jdbsl4b wrote
What are the costs for all the services? I assume GPT-4 is billed per request and Pinecone per hour?
_Arsenie_Boca_ t1_jd6u2my wrote
Reply to [R] SPDF - Sparse Pre-training and Dense Fine-tuning for Large Language Models by CS-fan-101
First time I've heard of sparse pretraining and dense finetuning. Usually it's the other way around, right? So that you get faster inference. Is it correct that you are aiming for faster pretraining through sparsity here, while keeping normal dense inference speed?
Also, could you provide an intuition on how Cerebras is able to translate unstructured sparsity into speedups? Since you pretrained a 1.3B model, I assume it runs on GPU, unlike DeepSparse?
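Just to make sure I understand the term: by unstructured sparsity I mean per-weight masking along these lines (magnitude pruning here is only for illustration, not necessarily what you do):

```python
import torch
import torch.nn as nn

def magnitude_prune_(linear: nn.Linear, sparsity: float = 0.75) -> torch.Tensor:
    """Zero out the smallest-magnitude weights in place and return the binary mask.

    The zeros follow no row/block pattern ("unstructured"), which is why dense
    hardware usually cannot turn them into speedups without special support.
    """
    weight = linear.weight.data
    k = max(1, int(weight.numel() * sparsity))
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = (weight.abs() > threshold).to(weight.dtype)
    weight.mul_(mask)
    return mask
```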
_Arsenie_Boca_ t1_jbbh5ng wrote
Reply to comment by __Maximum__ in [D] Can someone explain the discrepancy between the findings of LLaMA and Chinchilla? by __Maximum__
"Until convergence" is something we often say and hear, but strictly speaking it makes no sense: convergence is asymptotic, so it is never actually reached.
_Arsenie_Boca_ t1_jb1wjfi wrote
Reply to comment by bo_peng in [R] RWKV (100% RNN) can genuinely model ctx4k+ documents in Pile, and RWKV model+inference+generation in 150 lines of Python by bo_peng
It does help, but it certainly doesn't make everything clear. I am confident I could run inference on it, but my interest is more academic than practical.
What is the magic number 5 all about? It seems to appear all over the code without explanation.
Are the time mixing and channel mixing operations novel or were they introduced by a citable work?
How does the parallelization during training work?
_Arsenie_Boca_ t1_jb0sm2c wrote
Reply to [R] RWKV (100% RNN) can genuinely model ctx4k+ documents in Pile, and RWKV model+inference+generation in 150 lines of Python by bo_peng
I have been following your reddit posts for a while now, but I still don't think I fully understand it. Have you considered writing a paper? It might help people understand the method and could attract more of the open-source help you get.
_Arsenie_Boca_ t1_jai5zgz wrote
Reply to comment by limpbizkit4prez in [R] EvoPrompting: Language models can create novel and effective deep neural architectures. These architectures are also able to outperform those designed by human experts (with few-shot prompting) by MysteryInc152
The final evaluation is done on test metrics right? If so, why does it matter?
_Arsenie_Boca_ OP t1_j9gix7q wrote
Reply to comment by Professional_Poet489 in [D] Bottleneck Layers: What's your intuition? by _Arsenie_Boca_
If I understand you correctly, that would mean that bottlenecks are only interesting when
a) you further use the lower-dimensional features as output, as in autoencoders, or
b) you are interested in whether your features have a lower intrinsic dimension.
Neither holds in many cases, such as standard ResNets (see the sketch below for the kind of bottleneck I mean there). Could you elaborate on how you believe bottlenecks act as regularizers?
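For concreteness, the ResNet-style bottleneck I have in mind is the standard 1x1-3x3-1x1 block; the channel counts are just illustrative:

```python
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """ResNet-style bottleneck: reduce channels with 1x1, convolve at 3x3, expand back."""
    def __init__(self, channels: int = 256, reduced: int = 64):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, reduced, kernel_size=1, bias=False),
            nn.BatchNorm2d(reduced),
            nn.ReLU(inplace=True),
            nn.Conv2d(reduced, reduced, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(reduced),
            nn.ReLU(inplace=True),
            nn.Conv2d(reduced, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The skip connection means the low-dimensional path does not have to
        # carry all of the information on its own.
        return self.relu(x + self.block(x))
```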
_Arsenie_Boca_ OP t1_j9ghq1m wrote
Reply to comment by MediumOrder5478 in [D] Bottleneck Layers: What's your intuition? by _Arsenie_Boca_
That makes a lot of sense. Following that train of thought, bottlenecks would be somewhat specific to CNNs, right? Or do you see a similar reasoning in fully connected networks or transformers?
_Arsenie_Boca_ OP t1_j9gg06n wrote
Reply to comment by aMericanEthnic in [D] Bottleneck Layers: What's your intuition? by _Arsenie_Boca_
Thanks for your comment. Could you elaborate? Do you mean bottlenecks don't have any benefit? If so, why would people use them?
Submitted by _Arsenie_Boca_ t3_118cypl in MachineLearning
_Arsenie_Boca_ t1_j8a4q2t wrote
His position as a rival makes his statements look petty, and they might be. But still, I agree with most of the statements you quoted here.
_Arsenie_Boca_ OP t1_j7ommq8 wrote
Reply to comment by PassingTumbleweed in [D] Papers that inject embeddings into LMs by _Arsenie_Boca_
Yes, seamless joint training is definitely one of the perks. I will keep looking to see if I can find anything about the effectiveness of different injection/fusion mechanisms.
_Arsenie_Boca_ OP t1_j7miglb wrote
Reply to comment by PassingTumbleweed in [D] Papers that inject embeddings into LMs by _Arsenie_Boca_
Thanks, good pointer. I am particularly interested in the different mechanisms by which the embeddings might be integrated into LMs. E.g. in PaLI and SimVLM, the external embeddings (here image encodings) are simply treated as token embeddings. Others use modified attention mechanisms to potentially make better use of the information. Are you aware of any work that directly compares multiple integration mechanisms?
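The simplest variant I mean (treating external embeddings as extra token embeddings, roughly in the spirit of PaLI/SimVLM prefixing) would look something like this; the projection, shapes, and function name are just a sketch:

```python
import torch
import torch.nn as nn

def prepend_external_embeddings(
    token_embeds: torch.Tensor,      # (batch, seq_len, d_model) from the LM's embedding layer
    external_embeds: torch.Tensor,   # (batch, n_ext, d_ext), e.g. image encoder outputs
    proj: nn.Linear,                 # maps d_ext -> d_model so the LM can consume them
) -> torch.Tensor:
    """Treat external embeddings as if they were token embeddings by projecting
    them into the LM's embedding space and prepending them to the sequence."""
    prefix = proj(external_embeds)
    return torch.cat([prefix, token_embeds], dim=1)

# hypothetical usage
proj = nn.Linear(1024, 768)
tokens = torch.randn(2, 16, 768)
image_feats = torch.randn(2, 4, 1024)
inputs_embeds = prepend_external_embeddings(tokens, image_feats, proj)  # (2, 20, 768)
```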
_Arsenie_Boca_ OP t1_j7jxkxr wrote
Reply to comment by wittfm in [D] Papers that inject embeddings into LMs by _Arsenie_Boca_
Thanks for the answer, but I'm afraid the idea there is quite different. They take embeddings from LMs and finetune them, rather than aligning and injecting external embeddings.
Submitted by _Arsenie_Boca_ t3_10vwm8k in MachineLearning
_Arsenie_Boca_ t1_j72g4g4 wrote
Reply to comment by alpha-meta in [D] Why do LLMs like InstructGPT and LLM use RL to instead of supervised learning to learn from the user-ranked examples? by alpha-meta
I'm not sure if they vary the sampling hyperparameters. The point is that language modelling objectives are to some degree ill-posed, because we compute the loss on intermediate results (per-token predictions) rather than on the final output that we actually care about.
_Arsenie_Boca_ t1_j6z24n6 wrote
Reply to [D] Why do LLMs like InstructGPT and LLM use RL to instead of supervised learning to learn from the user-ranked examples? by alpha-meta
Since it wasn't mentioned so far: RL does not require the loss/reward to be differentiable. This lets us learn from complete generated sentences (LM sampling is not differentiable) rather than just at the token level.
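A minimal sketch of what I mean, using a plain REINFORCE-style update and assuming a HuggingFace-style causal LM and tokenizer; `reward_fn` is a placeholder, and actual RLHF uses PPO with a learned reward model and a KL penalty:

```python
import torch

def reinforce_step(model, tokenizer, prompts, reward_fn, optimizer, max_new_tokens=32):
    """Sample full sequences, score them with a non-differentiable reward,
    and push up the log-probability of the sampled tokens weighted by that reward."""
    enc = tokenizer(prompts, return_tensors="pt", padding=True)
    with torch.no_grad():
        # Sampling is not differentiable, which is exactly why a policy-gradient update is needed.
        sequences = model.generate(**enc, do_sample=True, max_new_tokens=max_new_tokens)
    rewards = torch.tensor(
        [reward_fn(tokenizer.decode(s, skip_special_tokens=True)) for s in sequences]
    )

    logits = model(sequences).logits[:, :-1]  # predict token t+1 from the prefix up to t
    log_probs = torch.log_softmax(logits, dim=-1)
    token_logp = log_probs.gather(-1, sequences[:, 1:].unsqueeze(-1)).squeeze(-1)
    # In practice you would mask out prompt and padding tokens here; omitted for brevity.
    seq_logp = token_logp.sum(dim=-1)  # log-prob of the whole sampled sequence

    loss = -(rewards * seq_logp).mean()  # REINFORCE: the reward itself need not be differentiable
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```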
_Arsenie_Boca_ t1_je9n0ea wrote
Reply to comment by RicketyCricket in [D] Alternatives to fb Hydra? by alyflex
Thanks, I basically use only the config part of hydra and am regularly annoyed that it's so clunky, so spock might be a good alternative. Gonna check it out :)