Submitted by AutoModerator t3_110j0cp in MachineLearning
TrainquilOasis1423 t1_j91k8px wrote
Is the next step in LLMs to predict the entire next sentence?
From what I understand, LLMs mostly just predict the next word in a sentence. With just this, we have seen HUGE advancements and emergent behavior out of what could essentially be called level 1 of this tech. So would making a machine learning architecture that predicts the entire next sentence be the next logical step? After that, would it be entire paragraphs? What would be the challenges of building such an architecture?
trnka t1_j91sshb wrote
It doesn't look like it's headed that way, no. The set of possible next sentences is just too big to iterate over or to compute a softmax over, so prediction is broken down into words. In fact, even the set of possible words is often too big, so it's broken down further into subwords with methods like byte pair encoding and WordPiece.
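Here's a quick way to see that subword splitting, assuming the Hugging Face transformers library and GPT-2's BPE tokenizer (just as one concrete example):

```python
from transformers import AutoTokenizer

# GPT-2's tokenizer uses byte pair encoding: rare words get split into
# smaller reusable pieces, while common words usually stay whole.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

print(tokenizer.tokenize("unbelievably"))  # a handful of subword pieces
print(tokenizer.tokenize("the"))           # typically a single token
```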
The key when predicting one word or subword at a time is to model long-range dependencies well enough that the LM can still generate coherent sentences and paragraphs.
TrainquilOasis1423 t1_j91uvav wrote
Makes sense. To expand on the number of possible iterations, wouldn't it be something akin to a collapsing wave function? Like, trying to iterate through all possible responses would be impossible, but the list of probable responses shrinks as the context expands.
For example, if I just input "knock" there are too many possible sentences to search, but if I input "knock knock", the most likely response is "who's there?" A simple example, sure, but you get the point, yeah?
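Something like this rough check is what I have in mind (assuming GPT-2 through the Hugging Face transformers library, purely as a stand-in model):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Compare how concentrated the next-token distribution is for a short
# prompt vs. a longer one that narrows the context.
for prompt in ["Knock", "Knock knock."]:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]   # scores for the next token
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, k=5)
    print(prompt, [(tokenizer.decode(int(i)), round(float(p), 3))
                   for i, p in zip(top.indices, top.values)])
```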
trnka t1_j91xnym wrote
In terms of probabilities, yeah, that's right.
In the actual code, it's most common to do a softmax over the output vocabulary. In practice, that means the model computes a probability for every possible next output (whether word or subword), and then we sort them, take the argmax, or keep the top k, depending on the problem.
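A minimal numpy sketch of that last step (the vocabulary and logits here are made up, just to show the mechanics):

```python
import numpy as np

# The model emits one logit per vocabulary item; a softmax turns them
# into a probability distribution over the next token.
vocab = ["who", "the", "is", "there", "banana"]
logits = np.array([2.1, 0.3, -0.5, 1.7, -2.0])

probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Greedy decoding: take the single most likely next token.
print(vocab[int(np.argmax(probs))])

# Top-k: keep the k most likely tokens (e.g. for sampling or beam search).
k = 2
top_k = np.argsort(probs)[::-1][:k]
print([(vocab[i], float(probs[i])) for i in top_k])
```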
I think of generating one word at a time as a key part of how we search through the space of probable sentences, since we can't afford a brute-force search.
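As a toy sketch of what I mean, here the "model" is just a hardcoded table of next-word probabilities standing in for a real LM, and greedy decoding builds the sentence one token at a time instead of scoring whole sentences:

```python
# Toy greedy decoder: next-word probabilities are keyed by the full prefix,
# standing in for a real LM's forward pass (purely illustrative numbers).
NEXT_WORD_PROBS = {
    ("<s>",): {"knock": 0.7, "hello": 0.3},
    ("<s>", "knock"): {"knock": 0.9, "on": 0.1},
    ("<s>", "knock", "knock"): {"who's": 0.8, "</s>": 0.2},
    ("<s>", "knock", "knock", "who's"): {"there?": 0.9, "</s>": 0.1},
    ("<s>", "knock", "knock", "who's", "there?"): {"</s>": 1.0},
}

def generate_greedy(max_len=10):
    tokens = ["<s>"]
    for _ in range(max_len):
        probs = NEXT_WORD_PROBS.get(tuple(tokens), {"</s>": 1.0})
        best = max(probs, key=probs.get)  # argmax at each step, no lookahead
        if best == "</s>":
            break
        tokens.append(best)
    return " ".join(tokens[1:])

print(generate_greedy())  # builds "knock knock who's there?" one word at a time
```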