Screye t1_jcmpd5i wrote
Reply to comment by VarietyElderberry in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
This is derived more from extensive personal experience with prompt engineering / fine-tuning over the last 2 years.
Simply put:
- The model learns what it sees. Put differently: throw enough data of a certain type at it, and emergent properties relating to that data will show up given enough compute.
- If it has never seen data past 8k tokens (due to context window limitations), the model never needs to learn to reason over more than 8k tokens (see the chunking sketch after this list).
- The source data (human writing) is limited in the complexity of thought that can be captured within 8k tokens vs. 32k tokens.
- That's not to say that the model doesn't reason over longer windows using latent knowledge, which makes its implicit 'reasoning window' much larger than just 8k tokens. But that is fundamentally different from explicitly reasoning over a 32k window.
- The model today can only assemble a chain-of-thought prompt of 8k tokens. If there is never any human feedback or loss-landscape optimization for the cases where it fails to reason past 8k tokens, then any ability the model gains there is purely incidental.
- On the other hand, when you have chain-of-thought prompt chains that are 32k tokens long, we can naturally expect them to contain more axioms, postulates, and relationships between those postulates/axioms.
- Those completions will get evaluated against human feedback as well as plain self-supervised objectives, which should explicitly shape the loss landscape toward reasoning over far more complex logical statements.
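
To make the "never sees data past 8k tokens" point concrete, here is a minimal sketch (the helper name and numbers are illustrative, not from the thread) of the usual pretraining setup: a long token stream is split into independent blocks of at most max_seq_len tokens, so tokens farther apart than that never co-occur in a training example and no gradient ever rewards relating them.

```python
# Minimal sketch (hypothetical helper): fixed-length chunking of a pretraining
# corpus. Any dependency spanning more than max_seq_len tokens is split across
# separate examples, so the model is never optimized to reason across it.

def chunk_token_stream(token_ids, max_seq_len=8192):
    """Split one long token stream into fixed-length training examples."""
    return [
        token_ids[i : i + max_seq_len]
        for i in range(0, len(token_ids), max_seq_len)
    ]

document = list(range(20_000))           # stand-in for a ~20k-token document
examples = chunk_token_stream(document)  # -> lengths [8192, 8192, 3616]

# Token 0 and token 10_000 land in different examples, so no loss term ever
# connects them; raising max_seq_len to 32k is what changes that.
print([len(x) for x in examples])
```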
Idk if that makes sense. Our field keeps moving away from math, and as embarrassing as it is to anthropomorphize the model, it does make it easier to get the point across.