_Arsenie_Boca_ t1_je8km8c wrote
Reply to [R] LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention by floppy_llama
Very interesting work! Though I find the explanation of the concrete approach (how the additional parameters are used by the LM) to be a bit vague. Does anyone have a deeper understanding? Is it using regular adapters?
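For context, this is roughly what I mean by a "regular adapter" (a Houlsby-style bottleneck module added to a frozen LM); the dimensions and names here are just placeholders, not what the paper actually does:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Standard bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, d_model: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps the frozen LM's behaviour largely intact early in training
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```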
_Arsenie_Boca_ t1_je6ayl7 wrote
Reply to comment by RicketyCricket in [D] Alternatives to fb Hydra? by alyflex
Thanks, looks like your library isn't far behind hydra in terms of functionality. Will definitely look into it more closely the next time I set up a project.
What would you say are the pros and cons of spock compared to hydra?
_Arsenie_Boca_ t1_je5d04j wrote
Reply to comment by RicketyCricket in [D] Alternatives to fb Hydra? by alyflex
Looks interesting, a bit more lightweight than hydra. But it also misses a lot of cool features, like composing multiple yaml configs.
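For reference, the composition feature I mean looks roughly like this (the config group names and paths are made up):

```python
# conf/config.yaml would contain a defaults list that pulls in other yaml files, e.g.
#   defaults:
#     - model: resnet
#     - optimizer: adam
# with conf/model/resnet.yaml and conf/optimizer/adam.yaml holding the group-specific options.
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="conf", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # cfg is the composed config; groups can be swapped or overridden from the CLI,
    # e.g. `python main.py optimizer=sgd optimizer.lr=0.01`
    print(cfg.model, cfg.optimizer)

if __name__ == "__main__":
    main()
```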
_Arsenie_Boca_ t1_jdqy1n8 wrote
Reply to comment by Co0k1eGal3xy in Is it possible to merge transformers? [D] by seraphaplaca2
Merging model outputs also means you have to run both models. I think the best option is to merge the weights and recover performance using datasets from both domains and distillation from the respective expert model.
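A minimal sketch of what I have in mind, assuming two models with identical architectures (the interpolation weight `alpha` and the distillation temperature are placeholders):

```python
import copy
import torch
import torch.nn.functional as F

def merge_weights(model_a, model_b, alpha: float = 0.5):
    """Linearly interpolate the parameters of two same-architecture models."""
    merged = copy.deepcopy(model_a)
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    merged.load_state_dict({k: alpha * state_a[k] + (1 - alpha) * state_b[k] for k in state_a})
    return merged

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL divergence between softened student and expert-teacher distributions,
    used to recover per-domain performance after merging."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
```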
_Arsenie_Boca_ t1_jdlc2ah wrote
Reply to comment by learn-deeply in [R] Reflexion: an autonomous agent with dynamic memory and self-reflection - Noah Shinn et al 2023 Northeastern University Boston - Outperforms GPT-4 on HumanEval accuracy (0.67 --> 0.88)! by Singularian2501
Thanks! If that is really the TL;DR, I have never seen an abstract that beats about the bush so much
_Arsenie_Boca_ t1_jder0rp wrote
Reply to [N] ChatGPT plugins by Singularian2501
Bad news for langchain
_Arsenie_Boca_ t1_jdc0ko2 wrote
Reply to comment by Icko_ in [P] Open-source GPT4 & LangChain Chatbot for large PDF docs by radi-cho
True, but I'm not sure how much cheaper that would really be.
_Arsenie_Boca_ t1_jdbsl4b wrote
What are the costs for all the services? I assume GPT-4 is billed per request and Pinecone per hour?
_Arsenie_Boca_ t1_jd6u2my wrote
Reply to [R] SPDF - Sparse Pre-training and Dense Fine-tuning for Large Language Models by CS-fan-101
First time I've heard of sparse pretraining and dense finetuning. Usually it's the other way around, right? So that you get faster inference. Is it correct that you are aiming for faster pretraining through sparsity here, while keeping normal dense inference speed?
Also, could you provide an intuition on how Cerebras is able to translate unstructured sparsity into speedups? Since you pretrained a 1.3B model, I assume it runs on GPU, unlike DeepSparse?
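Just to make sure I understand the term: by unstructured sparsity I mean per-weight masking along these lines (magnitude pruning here is only for illustration, not necessarily what you do):

```python
import torch
import torch.nn as nn

def magnitude_prune_(linear: nn.Linear, sparsity: float = 0.75) -> torch.Tensor:
    """Zero out the smallest-magnitude weights in place and return the binary mask.

    The zeros follow no row/block pattern ("unstructured"), which is why dense
    hardware usually cannot turn them into speedups without special support.
    """
    weight = linear.weight.data
    k = max(1, int(weight.numel() * sparsity))
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = (weight.abs() > threshold).to(weight.dtype)
    weight.mul_(mask)
    return mask
```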
_Arsenie_Boca_ t1_jbbh5ng wrote
Reply to comment by __Maximum__ in [D] Can someone explain the discrepancy between the findings of LLaMA and Chinchilla? by __Maximum__
"Until convergence" is something we often say and hear, but strictly speaking it makes no sense: convergence is asymptotic, so it is never actually reached.
_Arsenie_Boca_ t1_jb1wjfi wrote
Reply to comment by bo_peng in [R] RWKV (100% RNN) can genuinely model ctx4k+ documents in Pile, and RWKV model+inference+generation in 150 lines of Python by bo_peng
It does help, but it certainly doesn't make everything clear. I am confident I could run inference on it, but my interest is more academic than practical.
What is the magic number 5 all about? It seems to appear all over the code without explanation.
Are the time mixing and channel mixing operations novel or were they introduced by a citable work?
How does the parallelization during training work?
_Arsenie_Boca_ t1_jb0sm2c wrote
Reply to [R] RWKV (100% RNN) can genuinely model ctx4k+ documents in Pile, and RWKV model+inference+generation in 150 lines of Python by bo_peng
I have been following your reddit posts for a while now, but I still don't think I fully understand it. Have you considered writing a paper? It might help people understand the method and could attract more of the open-source help you get.
_Arsenie_Boca_ t1_jai5zgz wrote
Reply to comment by limpbizkit4prez in [R] EvoPrompting: Language models can create novel and effective deep neural architectures. These architectures are also able to outperform those designed by human experts (with few-shot prompting) by MysteryInc152
The final evaluation is done on test metrics right? If so, why does it matter?
_Arsenie_Boca_ OP t1_j9gix7q wrote
Reply to comment by Professional_Poet489 in [D] Bottleneck Layers: What's your intuition? by _Arsenie_Boca_
If I understand you correctly, that would mean that bottlenecks are only interesting when
a) you further use the lower-dimensional features as output, as in autoencoders, or
b) you are interested in whether your features have a lower intrinsic dimension.
Neither holds in many cases, such as standard ResNets (see the sketch below for the kind of bottleneck I mean there). Could you elaborate on how you believe bottlenecks act as regularizers?
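For concreteness, the ResNet-style bottleneck I have in mind is the standard 1x1-3x3-1x1 block; the channel counts are just illustrative:

```python
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """ResNet-style bottleneck: reduce channels with 1x1, convolve at 3x3, expand back."""
    def __init__(self, channels: int = 256, reduced: int = 64):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, reduced, kernel_size=1, bias=False),
            nn.BatchNorm2d(reduced),
            nn.ReLU(inplace=True),
            nn.Conv2d(reduced, reduced, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(reduced),
            nn.ReLU(inplace=True),
            nn.Conv2d(reduced, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The skip connection means the low-dimensional path does not have to
        # carry all of the information on its own.
        return self.relu(x + self.block(x))
```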
_Arsenie_Boca_ OP t1_j9ghq1m wrote
Reply to comment by MediumOrder5478 in [D] Bottleneck Layers: What's your intuition? by _Arsenie_Boca_
That makes a lot of sense. Following that train of thought, bottlenecks would be somewhat specific to CNNs, right? Or do you see a similar reasoning in fully connected networks or transformers?
_Arsenie_Boca_ OP t1_j9gg06n wrote
Reply to comment by aMericanEthnic in [D] Bottleneck Layers: What's your intuition? by _Arsenie_Boca_
Thanks for your comment. Could you elaborate? Do you mean bottlenecks don't have any benefit? If so, why would people use them?
Submitted by _Arsenie_Boca_ t3_118cypl in MachineLearning
_Arsenie_Boca_ t1_j8a4q2t wrote
His position as a rival makes his statements look petty, and they might be. But still, I agree with most of the statements you quoted here.
_Arsenie_Boca_ OP t1_j7ommq8 wrote
Reply to comment by PassingTumbleweed in [D] Papers that inject embeddings into LMs by _Arsenie_Boca_
Yes, seamless joint training is definitely one of the perks. I will keep looking to see if I can find anything about the effectiveness of different injection/fusion mechanisms.
_Arsenie_Boca_ OP t1_j7miglb wrote
Reply to comment by PassingTumbleweed in [D] Papers that inject embeddings into LMs by _Arsenie_Boca_
Thanks, good pointer. I am particularly interested in the different mechanisms by which the embeddings might be integrated into LMs. E.g. in PaLI and SimVLM, the external embeddings (here image encodings) are simply treated as token embeddings. Others use modified attention mechanisms to potentially make better use of the information. Are you aware of any work that directly compares multiple integration mechanisms?
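The simplest variant I mean (treating external embeddings as extra token embeddings, roughly in the spirit of PaLI/SimVLM prefixing) would look something like this; the projection, shapes, and function name are just a sketch:

```python
import torch
import torch.nn as nn

def prepend_external_embeddings(
    token_embeds: torch.Tensor,      # (batch, seq_len, d_model) from the LM's embedding layer
    external_embeds: torch.Tensor,   # (batch, n_ext, d_ext), e.g. image encoder outputs
    proj: nn.Linear,                 # maps d_ext -> d_model so the LM can consume them
) -> torch.Tensor:
    """Treat external embeddings as if they were token embeddings by projecting
    them into the LM's embedding space and prepending them to the sequence."""
    prefix = proj(external_embeds)
    return torch.cat([prefix, token_embeds], dim=1)

# hypothetical usage
proj = nn.Linear(1024, 768)
tokens = torch.randn(2, 16, 768)
image_feats = torch.randn(2, 4, 1024)
inputs_embeds = prepend_external_embeddings(tokens, image_feats, proj)  # (2, 20, 768)
```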
_Arsenie_Boca_ OP t1_j7jxkxr wrote
Reply to comment by wittfm in [D] Papers that inject embeddings into LMs by _Arsenie_Boca_
Thanks for the answer, but I'm afraid the idea there is quite different. They take embeddings from LMs and finetune them, rather than aligning and injecting external embeddings.
Submitted by _Arsenie_Boca_ t3_10vwm8k in MachineLearning
_Arsenie_Boca_ t1_j72g4g4 wrote
Reply to comment by alpha-meta in [D] Why do LLMs like InstructGPT and LLM use RL to instead of supervised learning to learn from the user-ranked examples? by alpha-meta
I'm not sure if they vary the sampling hyperparameters. The point is that language modelling objectives are to some degree ill-posed, because we compute the loss on intermediate results (per-token predictions) rather than on the final output that we actually care about.
_Arsenie_Boca_ t1_j6z24n6 wrote
Reply to [D] Why do LLMs like InstructGPT and LLM use RL to instead of supervised learning to learn from the user-ranked examples? by alpha-meta
Since it wasn't mentioned so far: RL does not require the loss/reward to be differentiable. This lets us learn from complete generated sentences (LM sampling is not differentiable) rather than just at the token level.
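A minimal sketch of what I mean, using a plain REINFORCE-style update and assuming a HuggingFace-style causal LM and tokenizer; `reward_fn` is a placeholder, and actual RLHF uses PPO with a learned reward model and a KL penalty:

```python
import torch

def reinforce_step(model, tokenizer, prompts, reward_fn, optimizer, max_new_tokens=32):
    """Sample full sequences, score them with a non-differentiable reward,
    and push up the log-probability of the sampled tokens weighted by that reward."""
    enc = tokenizer(prompts, return_tensors="pt", padding=True)
    with torch.no_grad():
        # Sampling is not differentiable, which is exactly why a policy-gradient update is needed.
        sequences = model.generate(**enc, do_sample=True, max_new_tokens=max_new_tokens)
    rewards = torch.tensor(
        [reward_fn(tokenizer.decode(s, skip_special_tokens=True)) for s in sequences]
    )

    logits = model(sequences).logits[:, :-1]  # predict token t+1 from the prefix up to t
    log_probs = torch.log_softmax(logits, dim=-1)
    token_logp = log_probs.gather(-1, sequences[:, 1:].unsqueeze(-1)).squeeze(-1)
    # In practice you would mask out prompt and padding tokens here; omitted for brevity.
    seq_logp = token_logp.sum(dim=-1)  # log-prob of the whole sampled sequence

    loss = -(rewards * seq_logp).mean()  # REINFORCE: the reward itself need not be differentiable
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```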
_Arsenie_Boca_ t1_je9n0ea wrote
Reply to comment by RicketyCricket in [D] Alternatives to fb Hydra? by alyflex
Thanks, I basically use only the config part of hydra and am regularly annoyed that it's so clunky, so spock might be a good alternative. Gonna check it out :)