ChuckSeven

ChuckSeven t1_jeenkvs wrote

I'm happy to take evidence into account. Your results indicate that LLMs can be beneficial for translation. As I said previously, it looks interesting. But you claim, and I quote, "They re far superior to current sota", based solely on your personal, human comparison. That is an over-generalisation and not scientific. It is like a flat-earther claiming the earth is flat because "just look at it, the evidence is in plain sight".

1

ChuckSeven t1_jeeae4o wrote

Look, it doesn't matter. You can't claim that LLMs are better if you don't demonstrate it on an established benchmark with a large variety of translations. How should I know whether those Japanese anime translations are correct? For all I know, it might just be "prettier" text but a wrong translation.

It's sad to get downvoted on this subreddit for insisting on very basic academic principles.

2

ChuckSeven t1_jedsgz5 wrote

I know about this post. It is interesting, but the results are far from conclusive. The BLOOM paper also ran translation experiments, and it states: "... In the one-shot setting, BLOOM can, with the right prompt, perform competent translation, although it is behind dedicated (supervised) models such as M2M-100".

So let's maybe use some quantifiable measures instead of looking at a few cherry-picked examples and claiming otherwise?
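For instance, corpus-level BLEU is the kind of quantifiable measure that the BLOOM paper and the WMT evaluations report. A minimal sketch with the sacrebleu package (the sentences below are placeholders; a real comparison would score both systems on an established test set such as WMT or FLORES):

```python
# Minimal sketch: scoring a system's translations against references with BLEU.
# The example sentences are made up; only the API usage is the point here.
import sacrebleu

system_outputs = [
    "The cat sits on the mat.",
    "He did not go to school today.",
]
references = [[
    "The cat is sitting on the mat.",
    "He didn't go to school today.",
]]  # one reference stream; additional streams can be added as extra lists

bleu = sacrebleu.corpus_bleu(system_outputs, references)
print(bleu.score)  # a single corpus-level number on which two systems can be compared
```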

0

ChuckSeven t1_jeab590 wrote

It's funny how you mention unrelated stuff, like RLHF, which has nothing to do with the point of discussion. A bit like an LLM I reckon.

See, Google Translate models are (as far as publicly known) trained on a parallel corpus. This is supervised data, since it provides the same text in different languages. The model is trained to model, e.g., p(y=German|x=English). There is much less supervised data available, which means the models you can train will be significantly smaller. Note that translation models are usually only auto-regressive in the decoder. The encoder, which usually makes up about 50% of the parameters, is not auto-regressive.
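To make that architectural point concrete, here is a minimal PyTorch sketch (vocabulary sizes and shapes are made up): the causal mask is applied only on the target side, so the encoder reads the whole source sentence bidirectionally while the decoder generates auto-regressively.

```python
import torch
import torch.nn as nn

d_model, src_vocab, tgt_vocab = 256, 32000, 32000     # assumed sizes
src_emb = nn.Embedding(src_vocab, d_model)
tgt_emb = nn.Embedding(tgt_vocab, d_model)
seq2seq = nn.Transformer(d_model=d_model, nhead=8,
                         num_encoder_layers=6, num_decoder_layers=6)
out_proj = nn.Linear(d_model, tgt_vocab)

src = torch.randint(0, src_vocab, (20, 1))    # source (e.g. English) tokens, (seq, batch)
tgt = torch.randint(0, tgt_vocab, (15, 1))    # target prefix (e.g. German), (seq, batch)

# Causal mask only for the decoder; the encoder attends over the full source.
tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(0))

hidden = seq2seq(src_emb(src), tgt_emb(tgt), tgt_mask=tgt_mask)
logits = out_proj(hidden)                     # parameterises p(y_t | y_<t, x)
```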

LLMs tend to be >>1B-parameter models trained on billions or trillions of tokens. That vast amount of data is believed to be necessary to train such large models. They model p(x) over a corpus that is in some cases monolingual, or virtually so. An LLM trained on a vast but English-only corpus will not be capable of translating at all. LLMs trained on a multilingual corpus can be prompted to translate, but they are far inferior to actual translation models.

Lastly, modelling p(y|x) is significantly easier and thus less general than modelling p(x).

−4

ChuckSeven t1_jea2b99 wrote

Google Translate is certainly not an LLM. LLMs can do translation, but they are significantly worse than translation models trained on translation data. Translation models have an encoder-decoder architecture, since translation is a sequence-to-sequence task, not a decoder-only autoregressive architecture like LLMs have.

They are also not pretrained, afaik, since language modelling means modelling p(x) whereas translation models p(y|x).

−6

ChuckSeven t1_je55o02 wrote

The Transformer is not a universal function approximator. This is shown simply by the fact that it cannot process arbitrarily long inputs, due to its finite context length.

Your conclusion is not at all obvious or likely given your facts; they seem to have been picked in hindsight, given the strong performance of large models.

It's hard to think of ChatGPT as a very large transformer ... because we don't know how to think about very large transformers.

1

ChuckSeven t1_j9iyuc2 wrote

Hmm, not sure, but I think if you don't exponentiate, you cannot fit n targets into a d-dimensional space when n > d and still have, for each target, a vector v such that the output is a one-hot distribution (i.e. 0 loss).

Basically, if you have 10 targets but only a 2-dimensional space, you need enough non-linearity in the projection to your target space that, for each target, there exists a 2-d vector which gives 0 loss.

Edit: MNIST only has 10 classes, so you are probably fine. Furthermore, softmax of the dot product "cares exponentially more" about the angle of the prediction vector than about its scale. If you use the norm, I'd think you only care about angle, which likely leads to different representations. Whether that improves performance depends heavily on how your model relies on scale to learn certain predictions. Maybe in the case of MNIST, relying on scale worsens performance (a wild guess: because it makes "predictions more certain" simply when more pixels are set to 1).
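A toy illustration of the scale point (my own construction, not from the thread): with 10 unit-norm class vectors in 2-D, scaling the prediction's magnitude lets softmax over dot products get arbitrarily close to one-hot, whereas a purely angle-based (cosine) score is capped at 1.

```python
import math
import torch

n, d = 10, 2
angles = torch.arange(n) * (2 * math.pi / n)
classes = torch.stack([angles.cos(), angles.sin()], dim=-1)   # 10 unit vectors in 2-D
pred = classes[3].clone()                                     # prediction aimed at class 3

for scale in (1.0, 10.0, 100.0):
    logits = (scale * pred) @ classes.T      # dot products grow linearly with the scale
    p = torch.softmax(logits, dim=-1)[3].item()
    print(f"scale={scale}: softmax prob of class 3 = {p:.2f}")
# roughly 0.21, 0.77, 1.00: scale alone can push the softmax toward one-hot (0 loss),
# while cosine similarity is bounded by 1, so an angle-only score cannot do this.
```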

3

ChuckSeven t1_j0zvd2j wrote

You are saying that you and your keyboard are not capable of producing fake news?

I do get your point but I don't think that anything will be different. It might require a little less effort but with less money than what it costs to buy a gun, you can already do these things. Vilifying the tool is not the right approach. It never has been.

1

ChuckSeven t1_j0u37tj wrote

But what is confidence, really? It's a measure of how likely an outcome is given a specific model. The idea of confidence breaks down if you are not certain about your model. E.g., if you assume your error is normally distributed with a certain variance, you can make statements about whether a deviation from the expected value is noise or not. But this assumes that your normality assumption is correct! If you cannot be certain about the model, which you never really are with neural networks, then the confidence is measured against your own implicit model. And since NNs are very different from your own brain, and the models in both cases are likely computing different functions, AND the NN is not trained to predict confidence (from a human perspective), there is no meaningful way of talking about confidence.
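A toy numerical example of that dependence (my own construction, hypothetical numbers): the same observed deviation looks "significant" or like "just noise" purely depending on which variance the noise model assumes.

```python
# Two-sided tail probability of a deviation under an assumed zero-mean Gaussian
# noise model. The observation is fixed; only the model assumption changes.
from math import erf, sqrt

def prob_at_least_this_far(deviation: float, sigma: float) -> float:
    """P(|N(0, sigma^2)| >= deviation)."""
    return 1.0 - erf(deviation / (sigma * sqrt(2.0)))

observed_deviation = 2.0
for assumed_sigma in (0.5, 1.0, 2.0):
    p = prob_at_least_this_far(observed_deviation, assumed_sigma)
    print(f"sigma={assumed_sigma}: P(|noise| >= {observed_deviation}) = {p:.3f}")
# sigma=0.5 -> ~0.000 ("a real effect"), sigma=2.0 -> ~0.317 ("just noise"):
# the confidence statement flips because the assumed model changed, not the data.
```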

1