Submitted by Desi___Gigachad t3_126rgih in MachineLearning
FermiAnyon t1_jebpsg3 wrote
Reply to comment by mattsverstaps in [D] Turns out, Othello-GPT does have a world model. by Desi___Gigachad
Yeah, isotropic as in being the same in all directions. So we're probably all familiar with embedding space and the fact that the positional relationships between concepts in embedding space basically encode information about those relationships. Isotropy in language models refers to the extent to which concepts which are actually unrelated appear unrelated in embedding space.
In other words, a model without this property might have an embedding space that isn't large enough, but you're still teaching it things and the result is that you're cramming things into your embedding space that's too small, so unrelated concepts are no longer equidistant from other unrelated concepts, implying a relationship that doesn't really exist with the result being that the language model confuses things that shouldn't be confused.
Case in point: I asked chatgpt to give me an example build order for terrans in Broodwar and it proceeded to give me a reasonable sounding build order, except that it was mixing in units from Starcraft 2. Now no human familiar with the games would confuse units like that. I chalk that up to a lack of relevant training data, possibly mixed with an embedding space that's not large enough for the model to be isotropic.
That's my take anyway. I'm still learning ;) please someone chime in and fact check me :D
mattsverstaps t1_jedlch4 wrote
So is that saying that there is a kind of linear transformation happening between some space (the reality? Our personal model?) and the embedding space? I don’t know what embedding space is and I shouldn’t be here but you are saying interesting things.
FermiAnyon t1_jee34lx wrote
Glad you're here. This would be a really interesting chat for like a bar or a meetup or stunting ;)
But yeah, I'm just giving my impressions. I don't want to make any claims of authority or anything as I'm self taught with this stuff...
But yeah, I have no idea how our brains do it, but when you're building a model whether it's a neural net or you're just factoring a matrix, you'll end up with a high dimensional representation that'll get used as an input to another layer or that'll just be used straight away for classification. It may be overly broad, but I think of all of those high dimensional representations as embeddings and the dimensionality available for encoding an embedding as the embedding space.
Like if you were into sports and you wanted to organize your room so that distance represents relationships between equipment. Maybe the baseball is right next to the softball and the tennis racket is close to the table tennis paddle, but they're a little farther away from the baseball stuff, then you've got some golf clubs and they're kind of in one area of the room because they all involve hitting things with another thing. Then your kite flying stuff and your fishing stuff and your street luge stuff is kind of as far apart as possible from the other stuff because it's not obvious to me anyway that they're related. Your room is a two dimensional embedding space.
When models do it, they just do it with more dimensions and more concepts, but they learn where to put things so that the relationships are properly represented and they just learn all that from lots of cleverly crafted examples.
monks-cat t1_jefqotb wrote
Context radically changes the "distance" between concepts. So in your example isotropy isn't necessarily a desired property of an LLM. In poetry, for example, we combine two concepts that would seemingly be very far apart in the original space but should be mapped rather closely in the embedding.
The problem I see with this whole idea though is that a "concept" doesn't inherently seem to be represented by a list of features. Two concepts interacting aren't necessarily the intersection of their features.
I'll try to see if I can come up with concrete examples in language.
FermiAnyon t1_jegh3hd wrote
In this case, I'm using a fuzzy word "concept" to refer to anything that's differentiable from another thing. That includes things like context and semantics and whether a word is polysemantic and even whether things fit a rhyme scheme. Basically anything observable.
But again, I'm shooting from the hip
Pas7alavista t1_jefcp4e wrote
Embedding is a way to map the high dimension vectors in your input space to a lower dimension space.
mattsverstaps t1_jefvu45 wrote
So the extra dimensions are unnecessary? I just realised that there could be some situations in which non orthogonal dimensions are preferable. I can’t exactly think of them. Doesn’t it suggest a pattern in data if a mapping is found that reduces the dimension? Like I picture from linear algebra 101 finding a line that everything is a multiple of so one dimension would do and that line is a ‘pattern’? Sorry I’m high.
Pas7alavista t1_jeg5dhh wrote
>so the extra dimensions are unnecessary
Yes, one reason for embedding is to extract relevant features.
Also, any finite dimensional inner product space has an orthonormal basis, and the math is easiest this way so there's not much of a reason to describe a space using non orthogonal dimensions. There is also nothing stopping you from doing so though.
>Doesn't it suggest a pattern in data if a mapping is found that reduces dimension
Yeah generally you wouldn't attempt to use ML methods on data where you think there is no pattern
>Something something Linear algebra
I think you might be thinking about the span and/or a basis, but it's hard for me to interpret your question.
mattsverstaps t1_jegdsq4 wrote
Yes the span, so if we discover that a set of points is actually all in the span of a line, that line is a kind of fact or pattern about the points. So probably there is an equivalent in higher dimensions. I am seeing there is a problem whereby we introduce our own bias in creating our model.
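The "all the points lie on one line" idea can be checked numerically. Here is a minimal sketch, assuming made-up data (the `direction` vector and the 50 random scale factors are hypothetical) and using the singular values of the data matrix to count how many directions the points actually occupy. The points pass through the origin here, so no centering is done.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 points in 3-D that are all multiples of one direction.
direction = np.array([2.0, -1.0, 0.5])
scales = rng.normal(size=50)
points = np.outer(scales, direction)  # shape (50, 3), one point per row

# Each non-negligible singular value corresponds to one direction the data uses.
singular_values = np.linalg.svd(points, compute_uv=False)
effective_dim = int(np.sum(singular_values > 1e-10 * singular_values[0]))

print("singular values:", singular_values)
print("effective dimension:", effective_dim)
```

Only one singular value is meaningfully above zero, so a single coordinate along `direction` is enough to describe every point. In higher dimensions the same count tells you how many directions a linear "pattern" needs.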
Pas7alavista t1_jegu8de wrote
The span of a set of vectors is everything you can reach by combining them using addition and scalar multiplication; a spanning set for a space is a set whose span is the entire space. For example, a spanning set for the real number plane would be {(1,0), (0,1)}. This particular set is also an orthonormal basis, and you can think of the two vectors as representing two orthogonal dimensions: their dot product is 0 and each has length 1.
However, any set of two vectors that are not on the same line will span the real number plane. For example, {(1,1), (0,1)} spans the real number plane, but the two vectors are not orthogonal.
Overall though it is always important to be aware of your input space, and the features/dimensions that you use to represent it. You can easily introduce bias or just noise in a number of ways if you aren't thorough. One example would be not normalizing your data.
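The two spanning sets above can be compared directly. This is a small sketch, with helper names (`spans_plane`, `is_orthonormal`) made up for illustration: rank 2 means the pair spans the plane, and a QR factorization turns the non-orthogonal pair into an orthonormal basis for the same space.

```python
import numpy as np

# Rows are the vectors from the two examples.
standard = np.array([[1.0, 0.0], [0.0, 1.0]])  # (1,0), (0,1)
skewed = np.array([[1.0, 1.0], [0.0, 1.0]])    # (1,1), (0,1)

def spans_plane(vectors):
    # Two vectors span the plane exactly when the matrix they form has rank 2.
    return bool(np.linalg.matrix_rank(vectors) == 2)

def is_orthonormal(vectors):
    # Orthonormal rows satisfy V @ V.T == identity (unit length, zero dot products).
    return bool(np.allclose(vectors @ vectors.T, np.eye(len(vectors))))

# QR on the transposed matrix yields orthonormal columns with the same span.
q, _ = np.linalg.qr(skewed.T)
orthonormalized = q.T

for name, vecs in [("standard", standard), ("skewed", skewed), ("orthonormalized", orthonormalized)]:
    print(name, "spans plane:", spans_plane(vecs), "orthonormal:", is_orthonormal(vecs))
```

Both sets span the plane; only the first is orthonormal as given, and the third shows that an orthonormal basis is always available if you want one.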
turnip_burrito t1_jedrykf wrote
>In other words, a model without this property might have an embedding space that isn't large enough, but you're still teaching it things and the result is that you're cramming things into your embedding space that's too small, so unrelated concepts are no longer equidistant from other unrelated concepts, implying a relationship that doesn't really exist with the result being that the language model confuses things that shouldn't be confused.
So False Nearest Neighbors?
FermiAnyon t1_jee0e86 wrote
Sounds legit :)
turnip_burrito t1_jeds3mv wrote
>unrelated concepts are no longer equidistant from other unrelated concepts,
Are distances normally the same for all unrelated concepts in a very high dimensional space? Does this have to do with unrelated concepts having low correlation in coordinates, so random distances in each axis, and therefore on average the same distance between each pair of unrelated concepts as any other unrelated pair?
FermiAnyon t1_jee03oc wrote
My pretty tenuous grasp of the idea makes me think stuff like... if you're measuring Euclidean distance or cosine similarity between two points that represent concepts that are completely unrelated, what would that distance or that angle be? And that, ideally, all things that are completely unrelated, if you did a pairwise comparison, would have that distance or that angle. And that the embedding space is large enough to accommodate that. And it sounds to me like kind of a limit property that it may only be possible to approximate because there's like lots of ideas and only so many dimensions to fit them in...
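A rough way to see the "lots of ideas, only so many dimensions" tension: a d-dimensional space has room for at most d mutually orthogonal directions, so packing in more "unrelated" vectors than that forces some pairs to overlap. The sketch below assumes random Gaussian directions as a stand-in for unrelated concepts (real embeddings are learned, not random), and `max_abs_cosine` is a made-up helper name.

```python
import numpy as np

def max_abs_cosine(num_vectors, dim, seed=0):
    """Largest |cosine similarity| between any two of `num_vectors` random directions."""
    rng = np.random.default_rng(seed)
    vectors = rng.normal(size=(num_vectors, dim))
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # unit length
    cosines = vectors @ vectors.T
    np.fill_diagonal(cosines, 0.0)  # ignore each vector's similarity with itself
    return float(np.max(np.abs(cosines)))

# Same 200 "concepts", progressively roomier spaces.
for dim in (8, 64, 512):
    print(f"dim={dim:4d}  worst-case |cosine| = {max_abs_cosine(200, dim):.3f}")
```

The worst-case overlap drops as the dimension grows, which matches the intuition that a cramped space makes some unrelated things look related.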
turnip_burrito t1_jee0mso wrote
Here's what GPT4 chimed in with (it lines up with what I've heard and read before):
===========
In an N-dimensional space with N >> 1, where M random vectors are dispersed with each coordinate sampled from a uniform distribution within a bounded range, we can make a few observations about the distances between these vectors:
High-dimensional space behavior: In high-dimensional spaces, the behavior of distances between random points differs significantly from what we would expect in lower-dimensional spaces like 2D or 3D. In high-dimensional spaces, most points tend to be far away from each other, and the distances between them are often more similar than they would be in lower-dimensional spaces.
Concentration of distances: As the dimensionality N increases, the pairwise distances between the M random vectors tend to concentrate around a specific value. The distances between points in high-dimensional spaces tend to be less varied than in lower-dimensional spaces, and the majority of the distances will be relatively close to the expected value. This phenomenon is known as the "concentration of measure."
Sparse representation: In high-dimensional spaces, the points are often sparsely distributed, which means that there is a lot of empty space between them. This can lead to a counterintuitive property, where increasing the number of dimensions might actually make it easier to distinguish between the points, as they become more "isolated" in the space.
Curse of dimensionality: While the above-mentioned properties might sometimes be useful, high-dimensional spaces can also give rise to the "curse of dimensionality." This term refers to various challenges that arise when working with high-dimensional data, such as increased computational complexity, difficulty in visualizing the data, and the need for a larger number of samples to obtain statistically meaningful results.
To summarize, in an N-dimensional space with N >> 1, the M random vectors will generally be far from each other, and their pairwise distances will concentrate around a specific value. High-dimensional spaces exhibit unique properties like the concentration of measure and sparse representation, but they also come with challenges like the curse of dimensionality.
FermiAnyon t1_jee3nc1 wrote
What did you prompt it with? And what do you think of its answer?
turnip_burrito t1_jefysiz wrote
My prompt:
> Suppose I have an N>>1 dimensional space, finite in extent along any given axis, in which a set of M random vectors are dispersed (each coordinate of each vector is randomly sampled from a uniform distribution spanning some bounded range of the space). What can we say about the distances in this space between the M vectors?
I left my prompt open ended to not give it any ideas one way or another.
Its response makes sense to me. Each pairwise distance sums contributions from N independent coordinate differences, so the spread of the distances relative to their mean should shrink as the dimension N grows. If N is large, then the distribution of pairwise distances will narrow until nearly all points are roughly the same distance from each other. (The random sampling is a way to build in lack of correlation, like how you mentioned unrelated ideas.)
Of course, the reverse is also true: if the dimension N is reduced, then originally "far" points will end up relatively closer together or farther apart (which of the two is unpredictable and depends on which dimensions are removed), because the averaging over random sample fluctuations disappears.
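The narrowing of pairwise distances is easy to check with a quick simulation. This is a minimal sketch under the same simplification as the prompt (coordinates drawn uniformly from a bounded range, here assumed to be [-1, 1]); `relative_spread` is a made-up helper that reports the standard deviation of the pairwise distances divided by their mean.

```python
import numpy as np

def relative_spread(num_points, dim, seed=0):
    """Std / mean of all pairwise Euclidean distances between uniform random points."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(-1.0, 1.0, size=(num_points, dim))
    # Squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = np.sum(points**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    upper = np.triu_indices(num_points, k=1)           # each pair counted once
    dists = np.sqrt(np.maximum(sq_dists[upper], 0.0))  # clip tiny negative round-off
    return float(dists.std() / dists.mean())

for dim in (2, 10, 100, 1000):
    print(f"N={dim:5d}  std/mean of pairwise distances = {relative_spread(300, dim):.4f}")
```

The ratio falls steadily as N grows, so in relative terms nearly every pair ends up about the same distance apart.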
FermiAnyon t1_jegiycj wrote
Pretty neat stuff. Fits well with the conversation we were having. I guess a salient question is how large an embedding space you need before performance in any given task plateaus.
Except that they're not random vectors in the original context.
turnip_burrito t1_jegu7uk wrote
Yeah I made the simplification of random vectors myself just to approximate what uncorrelated "features" in an embedding space could be like.
One thing that's relevant for embedding space size is Takens's theorem: https://en.wikipedia.org/wiki/Takens%27s_theorem?wprov=sfla1
If you start with a D-dimensional system (with D measured as a correlation or information dimension, for example) and you time-delay embed data from it, you need at most 2*D+1 embedding dimensions (often fewer) to ensure there are no false nearest neighbors.
This sets an upper bound if you use time delays. Now, for a *non-*time delayed embedding, I don't know the answer. I asked GPT4 and it said no analytical method for determining embedding dimension M presently exists ahead of time. An experimental method does exist that you can perform before training a model: You need to grow the number of embedding dimensions M and calculate FNN every time M grows. Once FNN drops to near zero, then you've finally found a suitable M.
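The "grow M and recompute FNN" loop can be sketched as follows. Everything here is an assumption for illustration: it uses a time-delay embedding, a simplified ratio test in the spirit of the false-nearest-neighbors method (a neighbor counts as false if the extra coordinate pushes it more than `ratio_threshold` times farther away), and the Hénon map as a stand-in system. The function names and the threshold of 10 are not from the thread.

```python
import numpy as np

def henon_series(length, discard=100):
    """x-coordinate of the Henon map (a=1.4, b=0.3), dropping an initial transient."""
    x_prev, x = 0.0, 0.0
    values = []
    for _ in range(length + discard):
        x_prev, x = x, 1.0 - 1.4 * x * x + 0.3 * x_prev
        values.append(x)
    return np.array(values[discard:])

def delay_embed(series, dim, delay=1):
    """Stack delayed copies of a 1-D series into row vectors of length `dim`."""
    count = len(series) - (dim - 1) * delay
    return np.column_stack([series[i * delay : i * delay + count] for i in range(dim)])

def fnn_fraction(series, dim, delay=1, ratio_threshold=10.0):
    """Fraction of nearest neighbors in `dim` dimensions that separate in dim + 1."""
    higher = delay_embed(series, dim + 1, delay)
    lower = higher[:, :dim]                      # same points viewed in `dim` dimensions
    diffs = lower[:, None, :] - lower[None, :, :]
    dists = np.sqrt(np.sum(diffs**2, axis=2))    # brute-force pairwise distances
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbor
    nearest = np.argmin(dists, axis=1)
    rows = np.arange(len(lower))
    near_dist = np.maximum(dists[rows, nearest], 1e-12)
    extra_gap = np.abs(higher[rows, -1] - higher[nearest, -1])
    return float(np.mean(extra_gap > ratio_threshold * near_dist))

series = henon_series(800)
fractions = {m: fnn_fraction(series, m) for m in (1, 2, 3)}
for m, frac in fractions.items():
    print(f"M={m}: false nearest neighbor fraction = {frac:.3f}")
```

For this toy system the fraction is large at M=1 and collapses once M reaches 2, which is the "suitable M" signal described above. The brute-force distance matrix is fine for a few hundred points; larger data would want a nearest-neighbor index instead.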
One neat part about all this is that if you have some complex D-dimensional manifold or distribution with features that "poke out" into different directions in the embedding space (imagine a wheel hub with spokes), then increasing the embedding space size M will also increase the distance between the spokes. If M gets large enough, all the spokes should be nearly equal in distance from each other, but points along a single spoke are also far from each other in most directions except for just a small subset.
I don't think that making it super large would actually make learning on the data any easier though. Best to stick with close to the minimum embedding dimension M. If you get larger, then measurement noise in your data becomes more represented in the embedded distribution. These dynamics also unfold when you increase M, which means if you're trying to only predict the D-dimensional system, you'll have a harder time because now you're predicting a (D+large#) dimensional system and the obviousness of the D-dimensional system distribution gets lost in the larger distribution.