ChuckSeven

ChuckSeven t1_jeenkvs wrote

I'm happy to take evidence into account. Your results indicate that LLMs can be beneficial for translation. As I said previously, it looks interesting. But you claim, and I quote, "They re far superior to current sota", based solely on your personal, human comparison. That is an over-generalisation and not scientific. It is like a flat-earther claiming the earth is flat because "just look at it, the evidence is in plain sight".

1

ChuckSeven t1_jeeae4o wrote

Look, it doesn't matter. You can't claim that LLMs are better if you don't demonstrate it on an established benchmark with a large variety of translations. How should I know whether those Japanese anime translations are correct? For all I know, it might just be "prettier" text but a wrong translation.

It's sad to get downvoted on this subreddit for insisting on very basic academic principles.

2

ChuckSeven t1_jedsgz5 wrote

I know about this post. It is interesting, but the results are far from conclusive. The BLOOM paper also ran translation experiments, and it states: "... In the one-shot setting, BLOOM can, with the right prompt, perform competent translation, although it is behind dedicated (supervised) models such as M2M-100".

So let's maybe use some quantifiable measures instead of looking at a few cherry-picked examples and claiming otherwise?
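For instance, corpus-level BLEU is the kind of quantifiable measure that the BLOOM paper and the WMT evaluations report. A minimal sketch with the sacrebleu package (the sentences below are placeholders; a real comparison would score both systems on an established test set such as WMT or FLORES):

```python
# Minimal sketch: scoring a system's translations against references with BLEU.
# The example sentences are made up; only the API usage is the point here.
import sacrebleu

system_outputs = [
    "The cat sits on the mat.",
    "He did not go to school today.",
]
references = [[
    "The cat is sitting on the mat.",
    "He didn't go to school today.",
]]  # one reference stream; additional streams can be added as extra lists

bleu = sacrebleu.corpus_bleu(system_outputs, references)
print(bleu.score)  # a single corpus-level number on which two systems can be compared
```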

0

ChuckSeven t1_jeab590 wrote

It's funny how you mention unrelated stuff, like RLHF, which has nothing to do with the point of discussion. A bit like an LLM I reckon.

See, Google Translate models are (as far as publicly known) trained on a parallel corpus. This is supervised data, since it provides the same text in different languages. The model is trained to model, e.g., p(y=German|x=English). There is much less supervised data available, which means the models you can train will be significantly smaller. Note that translation models are usually only auto-regressive in the decoder. The encoder, which usually makes up about 50% of the parameters, is not auto-regressive.
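To make that architectural point concrete, here is a minimal PyTorch sketch (vocabulary sizes and shapes are made up): the causal mask is applied only on the target side, so the encoder reads the whole source sentence bidirectionally while the decoder generates auto-regressively.

```python
import torch
import torch.nn as nn

d_model, src_vocab, tgt_vocab = 256, 32000, 32000     # assumed sizes
src_emb = nn.Embedding(src_vocab, d_model)
tgt_emb = nn.Embedding(tgt_vocab, d_model)
seq2seq = nn.Transformer(d_model=d_model, nhead=8,
                         num_encoder_layers=6, num_decoder_layers=6)
out_proj = nn.Linear(d_model, tgt_vocab)

src = torch.randint(0, src_vocab, (20, 1))    # source (e.g. English) tokens, (seq, batch)
tgt = torch.randint(0, tgt_vocab, (15, 1))    # target prefix (e.g. German), (seq, batch)

# Causal mask only for the decoder; the encoder attends over the full source.
tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(0))

hidden = seq2seq(src_emb(src), tgt_emb(tgt), tgt_mask=tgt_mask)
logits = out_proj(hidden)                     # parameterises p(y_t | y_<t, x)
```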

LLMs tend to be >>1B-parameter models trained on billions or trillions of tokens. That vast amount of data is believed to be necessary to train such large models. They model p(x) over a corpus that is in some cases monolingual, or virtually so. An LLM trained on a vast but English-only corpus will not be capable of translating at all. LLMs trained on a multilingual corpus can be prompted to translate, but they are far inferior to actual translation models.

Lastly, modelling p(y|x) is significantly easier and thus less general than modelling p(x).

−4

ChuckSeven t1_jea2b99 wrote

Google Translate is certainly not an LLM. LLMs can do translation, but they are significantly worse than translation models trained on translation data. Translation models have an encoder-decoder architecture, since translation is a sequence-to-sequence task, not a decoder-only autoregressive architecture like LLMs have.

They are also not pretrained, afaik, since language modelling means modelling p(x) whereas translation models p(y|x).

−6

ChuckSeven t1_je55o02 wrote

The Transformer is not a universal function approximator. This is shown simply by the fact that it cannot process arbitrarily long inputs, due to its finite context length.

Your conclusion is not at all obvious or likely given your facts; they seem to have been picked in hindsight, given the strong performance of large models.

It's hard to think of ChatGPT as a very large transformer ... because we don't know how to think about very large transformers.

1

ChuckSeven t1_j9iyuc2 wrote

Hmm, not sure, but I think if you don't exponentiate, you cannot fit n targets into a d-dimensional space when n > d and still have, for each target, a vector v such that the output is a one-hot distribution (i.e. 0 loss).

Basically, if you have 10 targets but only a 2-dimensional space, you need enough non-linearity in the projection to your target space that, for each target, there exists a 2-d vector which gives 0 loss.

Edit: MNIST only has 10 classes, so you are probably fine. Furthermore, softmax of the dot product "cares exponentially more" about the angle of the prediction vector than about its scale. If you use the norm, I'd think you only care about angle, which likely leads to different representations. Whether that improves performance depends heavily on how your model relies on scale to learn certain predictions. Maybe in the case of MNIST, relying on scale worsens performance (a wild guess: because it makes "predictions more certain" simply when more pixels are set to 1).
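A toy illustration of the scale point (my own construction, not from the thread): with 10 unit-norm class vectors in 2-D, scaling the prediction's magnitude lets softmax over dot products get arbitrarily close to one-hot, whereas a purely angle-based (cosine) score is capped at 1.

```python
import math
import torch

n, d = 10, 2
angles = torch.arange(n) * (2 * math.pi / n)
classes = torch.stack([angles.cos(), angles.sin()], dim=-1)   # 10 unit vectors in 2-D
pred = classes[3].clone()                                     # prediction aimed at class 3

for scale in (1.0, 10.0, 100.0):
    logits = (scale * pred) @ classes.T      # dot products grow linearly with the scale
    p = torch.softmax(logits, dim=-1)[3].item()
    print(f"scale={scale}: softmax prob of class 3 = {p:.2f}")
# roughly 0.21, 0.77, 1.00: scale alone can push the softmax toward one-hot (0 loss),
# while cosine similarity is bounded by 1, so an angle-only score cannot do this.
```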

3

ChuckSeven t1_j0zvd2j wrote

You are saying that you and your keyboard are not capable of producing fake news?

I do get your point but I don't think that anything will be different. It might require a little less effort but with less money than what it costs to buy a gun, you can already do these things. Vilifying the tool is not the right approach. It never has been.

1

ChuckSeven t1_j0u37tj wrote

But what is confidence, really? It's a measure of how likely an outcome is given a specific model. The idea of confidence breaks down if you are not certain about your model. E.g., if you assume your error is normally distributed with a certain variance, you can make statements about whether a deviation from the expected value is noise or not. But this assumes that your normality assumption is correct! If you cannot be certain about the model, which you never really are with neural networks, then the confidence is measured against your own implicit model. And since NNs are very different from your own brain, and the models in both cases are likely computing different functions, AND the NN is not trained to predict confidence (from a human perspective), there is no meaningful way of talking about confidence.
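A toy numerical example of that dependence (my own construction, hypothetical numbers): the same observed deviation looks "significant" or like "just noise" purely depending on which variance the noise model assumes.

```python
# Two-sided tail probability of a deviation under an assumed zero-mean Gaussian
# noise model. The observation is fixed; only the model assumption changes.
from math import erf, sqrt

def prob_at_least_this_far(deviation: float, sigma: float) -> float:
    """P(|N(0, sigma^2)| >= deviation)."""
    return 1.0 - erf(deviation / (sigma * sqrt(2.0)))

observed_deviation = 2.0
for assumed_sigma in (0.5, 1.0, 2.0):
    p = prob_at_least_this_far(observed_deviation, assumed_sigma)
    print(f"sigma={assumed_sigma}: P(|noise| >= {observed_deviation}) = {p:.3f}")
# sigma=0.5 -> ~0.000 ("a real effect"), sigma=2.0 -> ~0.317 ("just noise"):
# the confidence statement flips because the assumed model changed, not the data.
```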

1