Osemwaro OP t1_j04kh8k wrote
Reply to comment by farmingvillein in [D] Why are ChatGPT's initial responses so unrepresentative of the distribution of possibilities that its training data surely offers? by Osemwaro
Ah yes, I see that the GPT-3 tutorial discusses controlling the entropy with a temperature parameter, as you described, which presumably corresponds to a softmax temperature. That sounds like a likely culprit.
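For anyone else reading who (like me) hasn't worked with this before, here's a rough sketch of what I understand temperature scaling to do: the logits are divided by the temperature before the softmax, so low temperatures concentrate probability mass on the top tokens and reduce entropy. This is just my reading of how the parameter works, not the API's actual implementation, and the function name and toy logits are made up for illustration:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Sample a token index from logits after temperature scaling (illustrative sketch)."""
    rng = rng or np.random.default_rng()
    if temperature == 0.0:
        # Greedy decoding: always pick the most likely token.
        return int(np.argmax(logits))
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()                      # subtract max for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))

logits = np.array([2.0, 1.5, 0.2])              # toy vocabulary of 3 tokens
print(sample_next_token(logits, temperature=0.2))   # nearly deterministic
print(sample_next_token(logits, temperature=1.5))   # much more varied
```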
I don't have an NLP background, so I'm not familiar with the literature, but I did some Googling and came across a recent paper called "Softmax Bottleneck Makes Language Models Unable to Represent Multi-mode Word Distributions", which says:
>In this paper, we discover that, when predicting the next word probabilities given an ambiguous context, GPT-2 is often incapable of assigning the highest probabilities to the appropriate non-synonym candidates.
The GPT-3 paper says that GPT-2 and GPT-3 "use the same model and architecture", so I wonder if the softmax bottleneck is part of the problem that I've observed too.
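From what I can tell after skimming, the "bottleneck" refers to the fact that the matrix of log-probabilities a softmax language model can produce across different contexts has rank at most d + 1, where d is the hidden dimension, which limits how many genuinely different multi-modal distributions it can represent over a large vocabulary. Here's a toy sketch of that rank argument (the sizes are arbitrary and this is just my reading of the idea, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, V, N = 4, 50, 200           # hidden size, vocab size, number of contexts (toy numbers)
H = rng.normal(size=(N, d))    # context (hidden-state) vectors
W = rng.normal(size=(V, d))    # output word embeddings

logits = H @ W.T
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

# The log-probability matrix has rank at most d + 1, no matter how many
# contexts or words there are -- that's the "softmax bottleneck".
print(np.linalg.matrix_rank(log_probs))   # prints 5 (= d + 1), far below min(N, V)
```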