Submitted by GasZealousideal8691 t3_10c9287 in MachineLearning
Hey guys. I'm running some experiments as part of a research project. The code was initially written for GPT-Neo 1.3B, but there is one baseline we want to use that only supports GPT2-XL, so I added support for it to our code (i.e., just included a clause along the lines of "if model_name == 'gpt2': model = GPT2LMHeadModel.from_pretrained('gpt2-xl')").
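Roughly, the branch I added looks something like this (a simplified sketch, not a verbatim copy of our code; the model_name argument and the tokenizer handling are paraphrased):

```python
from transformers import AutoTokenizer, GPT2LMHeadModel, GPTNeoForCausalLM

def load_model(model_name: str):
    # Hypothetical helper: model_name selects which baseline to run.
    if model_name == "gpt2":
        # "gpt2-xl" is the 1.5B-parameter GPT-2 checkpoint on the Hugging Face Hub.
        model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
        tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")
    else:
        # Original setup: GPT-Neo 1.3B.
        model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
        tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
    model.eval()
    return model, tokenizer
```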
The issue is, GPT2-XL is giving absurd results that are clearly incorrect. It's hard to explain without an in-depth walkthrough of my code, but basically I have a bunch of functions that do things like compute the probability of certain labels in a multiple-choice test.
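To give a concrete idea of the kind of thing those functions do, here's a stripped-down sketch of scoring one label with a causal LM (made-up names, not my actual code); each candidate label gets scored like this and the highest-scoring one is picked:

```python
import torch
import torch.nn.functional as F

def label_logprob(model, tokenizer, prompt: str, label: str) -> float:
    """Sum of the log-probabilities the model assigns to the label tokens
    when they follow the prompt (higher = more likely)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    label_ids = tokenizer(label, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, label_ids], dim=1)

    with torch.no_grad():
        logits = model(input_ids).logits  # shape: (1, seq_len, vocab_size)

    # Logits at position i predict token i + 1, so shift by one.
    start = prompt_ids.shape[1] - 1
    end = input_ids.shape[1] - 1
    log_probs = F.log_softmax(logits[0, start:end], dim=-1)
    return log_probs.gather(1, label_ids[0].unsqueeze(1)).sum().item()

# e.g. pick the best of several candidate labels:
# best = max(labels, key=lambda l: label_logprob(model, tokenizer, question, l))
```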
So my question is: is there any fundamental difference in how these two models are set up in Hugging Face that would cause errors like this? I'm not too familiar with Hugging Face models myself, so I'm not entirely sure. But the fact that the code runs yet produces such bad results is weird; I would have thought that if something were wrong, there would be some sort of tensor-size mismatch error somewhere...
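For what it's worth, here's a quick way to dump the two checkpoints' basic Hub-side setup next to each other (just a sketch, in case someone spots an obvious difference):

```python
from transformers import AutoConfig, AutoTokenizer

# Compare the basic configuration of the two checkpoints.
for name in ["gpt2-xl", "EleutherAI/gpt-neo-1.3B"]:
    config = AutoConfig.from_pretrained(name)
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(
        name,
        "| vocab size:", config.vocab_size,
        "| max positions:", config.max_position_embeddings,
        "| pad token:", tokenizer.pad_token,
    )
```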
CKtalon t1_j4enpew wrote
GPT-2 was trained on a different dataset, with little code in it (other than what came through from CommonCrawl). GPT-Neo uses The Pile, which contains a lot of code.