clauwen t1_j2sucur wrote
Reply to comment by Mental-Swordfish7129 in [R] Do we really need 300 floats to represent the meaning of a word? Representing words with words - a logical approach to word embedding using a self-supervised Tsetlin Machine Autoencoder. by olegranmo
Maybe I'm an idiot, but depending on precision, this isn't a much smaller encoding than what a lot of other models use, right? And none of the state-of-the-art embedding models are optimized for space at all, right?
Mental-Swordfish7129 t1_j2t17wy wrote
Idk much about other encoding systems. This works well for my purposes. It's scalable. I look at my data and ask, "How many binary features of each datum are salient, and which features are important to the model for judging similarities?" 2000 may be too many sometimes. Also, remember that a binary vector is often handled as an integer array holding the indices of the bits set to 1. If your vectors are sparse, that can be very efficient. For the AI models I build, my vectors are often quite sparse because I often use a scheme like a "slider" of activations for integer data; sort of like "one hot", but I'll set three or more consecutive bits to encode associativity.
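For readers who want to picture the "slider" scheme, here is a minimal sketch in Python. It is not the commenter's code: the function names, the default width of 3, and the choice to pad the vector so the window never wraps are all assumptions made for illustration. The encoding is stored sparsely, as the list of indices of the bits set to 1.

```python
def slider_encode(value, lo, hi, width=3):
    """Encode an integer in [lo, hi] as the indices of `width` consecutive set bits.

    Hypothetical helper: the window starts at (value - lo), so neighbouring
    integers share (width - 1) active bits.
    """
    if not lo <= value <= hi:
        raise ValueError("value out of range")
    start = value - lo
    return list(range(start, start + width))


def vector_length(lo, hi, width=3):
    """Total number of bits needed so the last window still fits (assumed layout)."""
    return (hi - lo) + width


def overlap(a, b):
    """Number of active bits two sparse encodings have in common."""
    return len(set(a) & set(b))


five = slider_encode(5, 0, 10)
six = slider_encode(6, 0, 10)
nine = slider_encode(9, 0, 10)
print(five, six, overlap(five, six), overlap(five, nine))
```

With a plain one-hot code, 5 and 6 would share no bits at all. With the slider, nearby values overlap and distant ones do not, which is one way to read "encode associativity".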
Mental-Swordfish7129 t1_j2t22qq wrote
The biggest reason I use this encoding is the latent space it creates. My AI models are of the SDM variety with a predictive processing architecture computing something very similar to active inference. This encoding allows for complete universality and the latent space provides for the generation of semantically relevant memory abstractions.
maizeq t1_j2to73g wrote
What type of predictive processing architecture exactly, if you don't mind saying?
Mental-Swordfish7129 t1_j2tubij wrote
It's pretty vanilla.
Message passing up is prediction error.
Down is prediction used as follows:
I use the bottom prediction to characterize external behavior.
Prediction at higher levels characterizes attentional masking and other alterations to the ascending error signals.
maizeq t1_j2w7p8k wrote
Is this following a pre-existing methodology in the literature or something custom for your usage? I usually see attention in PP implemented, conceptually at least, as variance parameterisation/optimisation over a continuous space. How do you achieve something similar in your binary latent space?
Sorry for all the questions!
Mental-Swordfish7129 t1_j2x3juw wrote
Idk if it's in the literature. At this point, I can't tell what I've read from what has occurred to me.
I keep track of the error each layer generates and also a brief history of its descending predictions. Then, I simply reinforce the generation of predictions which favor the highest rate of reduction in subsequent error. I think this amounts to a modulation of attention (manifested as a pattern of bit masking of the ascending error signal) which amounts to ignoring the portions of the signal which have low information and high variance.
At the bottom layer, this is implemented as choosing behaviors (moving a reticle over an image up, down, left, or right) which accomplish the same avoidance of high variance and thus high noise, while seeking high information gain.
The end result is a reticle which behaves like a curious agent attempting to track new, interesting things and study them a moment before getting bored.
The highest layers seem to be forming composite abstractions on what is happening below, but I have yet to try to understand.
I'm fine with questions.
Mental-Swordfish7129 t1_j2xqlwa wrote
The really interesting thing as of late is that if I "show" the agent its global error metric as part of its input, while also forcing it out of boredom (moving the reticle directly) toward higher information gain, I can eventually stop the forcing because it learns to force itself out of boredom. It seems to learn the association between a rapidly declining error and a shift to a more interesting input. I just have to facilitate the bootstrapping.
It eventually exhibits more and more sophisticated behavioral sequences (longer cycles before repeating), and the same happens at higher levels with the attentional changes.
All layers perform the same function. They only differ because of the very different "world" to which they are exposed.
Mental-Swordfish7129 t1_j2xrr7a wrote
>How do you achieve something similar in your binary latent space?
All data coming in is encoded into these high-dimensional binary vectors where each index in a vector corresponds to a relevant feature in the real world. Then, computing error is as simple as XOR(actual incoming data, prediction). This preserves the semantic details of how the prediction was wrong.
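A small sketch of the XOR error, assuming the vectors are stored sparsely as sets of active indices. On that representation XOR is just the symmetric difference. The function names and the split into "missed" and "false alarm" bits are illustrative assumptions, not something stated in the thread.

```python
def xor_error(actual, predicted):
    """Bitwise XOR of two sparse binary vectors, given as sets of active indices."""
    return set(actual) ^ set(predicted)


def split_error(actual, predicted):
    """Separate the error into bits that were on but not predicted,
    and bits that were predicted but not on (hypothetical breakdown)."""
    actual, predicted = set(actual), set(predicted)
    missed = actual - predicted
    false_alarms = predicted - actual
    return missed, false_alarms


actual = {1, 4, 7}      # features present in the incoming data
predicted = {1, 5, 7}   # features the layer expected
print(xor_error(actual, predicted))
print(split_error(actual, predicted))
```

Because every index stands for a specific feature, the error says which features were wrong, not just how wrong the prediction was overall.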
There is no fancy activation function, just a simple sum over all connected synapses which connect to an active element.
Synapses are binary. Connected or not. They decay over time and their permanence is increased if they're useful often enough.
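One possible reading of the synapse description, sketched in Python. Everything numeric here is an assumption: the integer permanence scale, the connection threshold, the reinforce and decay steps, and the rule that a synapse counts as "useful" when its input is active at learning time. The class name `Unit` is also hypothetical. Integers are used throughout to stay in the spirit of avoiding floating point.

```python
class Unit:
    """A single element with binary synapses backed by integer permanence values."""

    def __init__(self, permanence, threshold=5, max_perm=10):
        self.permanence = dict(permanence)  # input index -> integer permanence
        self.threshold = threshold          # assumed cutoff for "connected"
        self.max_perm = max_perm            # assumed upper bound on permanence

    def connected(self):
        """Indices whose synapse currently counts as connected."""
        return {i for i, p in self.permanence.items() if p >= self.threshold}

    def activation(self, active):
        """Plain count of connected synapses that see an active input."""
        return len(self.connected() & set(active))

    def learn(self, active, reinforce=2, decay=1):
        """Strengthen synapses on active inputs, let all others decay."""
        active = set(active)
        for i, p in self.permanence.items():
            if i in active:
                self.permanence[i] = min(self.max_perm, p + reinforce)
            else:
                self.permanence[i] = max(0, p - decay)


unit = Unit({0: 5, 1: 4, 2: 6})
print(unit.activation({0, 1, 2}))  # counts synapses 0 and 2
unit.learn({1})
unit.learn({1})
print(unit.connected())            # only synapse 1 is still connected
```

The synapse itself stays binary from the point of view of the activation sum; the permanence value only decides which side of the threshold it is on.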
Mental-Swordfish7129 t1_j2y3l18 wrote
>I usually see attention in PP implemented, conceptually at least, as variance parameterisation/optimisation over a continuous space.
Continuous spaces are simply not necessary for what I'm doing. I avoid infinite precision because there is little need for precision beyond a certain threshold.
Also, I'm just a regular guy. I do this in my limited spare time and I only have relatively weak computational resources and hardware. I'm trying to be more efficient anyway; like the brain. Binary vectors make it all very efficient because there is not a floating-point operation in sight.
Discrete space works just fine and there is no ambiguity possible for what a particular index of the space represents. In a continuous space, you'd have to worry that something has been truncated or rounded away.
Idk. Maybe my reasons are ridiculous.