hapliniste t1_itwncro wrote
I'm interested as well. Just bought a 3090 so I have a bit more room. I think I saw an optimized GPT-J that should run on it, but I haven't tried it so far.
AuspiciousApple OP t1_itwoeep wrote
A bit jealous of all that VRAM and all those cores.
The usage example here: https://huggingface.co/google/flan-t5-xl is quite easy to follow. Getting it up and running should take you all of 5 minutes plus the time to download the model. You could probably also run the XXL model.
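Roughly, that example boils down to something like this (a minimal sketch; the fp16 and device_map bits are my own additions to keep the 3B-parameter XL variant comfortable in 24 GB, and device_map needs the accelerate package installed):

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Load tokenizer and model; fp16 roughly halves the memory footprint.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = T5ForConditionalGeneration.from_pretrained(
    "google/flan-t5-xl",
    torch_dtype=torch.float16,
    device_map="auto",  # needs accelerate; places the weights on the GPU
)

# Same prompt style as the model card's translation example.
inputs = tokenizer("translate English to German: How old are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```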
AuspiciousApple OP t1_itwohgl wrote
Would be curious to hear whether you get reasonable output with the XXL variant or with GPTJ.
utilop t1_itwvjtb wrote
I haven't read this paper so do not know the details.
However, for SNARKS, it is defined as "Determine which of two sentences is sarcastic".
I must be missing something in that case because the base rate should be 50 %.
In the paper, they seem to state that even with CoT, Flan gets 9.6 % (small), 42.7 % (base), 60.1 % (large), 60.1 % (XL), 55.1 % (XXL).
So if I am interpreting it correctly, it is not doing much better than random chance even for the larger models, and I would not expect a good CoT nor significantly better results from testing on the larger model.
Detecting sarcasm might not be the best use of this model?
Not sure how they get so much less than 50 % - perhaps it includes failures to generate a valid answer.
AuspiciousApple OP t1_itww8zh wrote
I only skimmed the paper, but I think they said that (at least for some benchmarks) they count exact matches as correct, so yes, maybe generating anything but the answer doesn't count?
I tried the example from this dataset that they use in one of their figures, and the XL variant seemed to give the correct answer most of the time, but the rationale was nonsense ~80% of the time even when the answer was correct, e.g. "Plastic containers are a thing of the past", "Plastic containers are way too precious to store food", or "Wood is not sturdy enough to store food".
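For reference, a sketch of the kind of loop I've been running to look at several rationales for one question (model and tokenizer loaded as in my snippet above; the prompt wording and sampling parameters here are only illustrative, not exactly what the paper uses):

```python
# Sample a handful of completions for one CoT-style prompt and inspect the rationales.
prompt = (
    "Q: Which statement is sarcastic?\n"
    "(a) Plastic containers are a perfectly fine way to store food.\n"
    "(b) Plastic containers are such a luxurious way to store food.\n"
    "A: Let's think step by step."
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
for _ in range(5):
    out = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,   # sampling rather than greedy decoding
        temperature=0.7,
        top_p=0.95,
    )
    print(tokenizer.decode(out[0], skip_special_tokens=True))
    print("---")
```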
utilop t1_itwxusd wrote
I think that would make sense and could see the small models - in particular with CoT - fail to produce a valid answer.
For both MMLU and BBH, they report a worse average score with CoT than the direct prompt.
I would take that as a sign that CoT is not reliably producing correct explanations, since the reasoning does not lead to better answers.
Could be that the problem is their prompt, few-shot setup, or calibration though?
Maybe for the sake of experimentation, take one of the tasks where CoT performs considerably better than the direct prompt?
AuspiciousApple OP t1_itwy8bj wrote
>Maybe for the sake of experimentation, take one of the tasks where CoT performs considerably better than the direct prompt?
That sounds like a good idea, though NLP isn't really my field, so I might not be using the correct sampling parameters or might be making subtle mistakes in formatting the question (e.g. punctuation, line breaks), which is why I was hoping someone here would know more.
Even for English to German translation, the model often generated obvious nonsense, sometimes even just repeating the English phrase, despite using the prompt as it is in the Hugging Face config/paper.
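For completeness, here's a small sketch of how I've been comparing decoding strategies on the translation prompt (again just illustrative settings, reusing the model/tokenizer from my first snippet):

```python
# Compare greedy decoding, beam search, and sampling on the same translation prompt.
prompt = "translate English to German: The weather is nice today."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = {
    "greedy": model.generate(**inputs, max_new_tokens=50),
    "beam": model.generate(**inputs, max_new_tokens=50, num_beams=4),
    "sampled": model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7),
}
for name, out in outputs.items():
    print(name, "->", tokenizer.decode(out[0], skip_special_tokens=True))
```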