maizeq t1_j2w7p8k wrote
Reply to comment by Mental-Swordfish7129 in [R] Do we really need 300 floats to represent the meaning of a word? Representing words with words - a logical approach to word embedding using a self-supervised Tsetlin Machine Autoencoder. by olegranmo
Is this following a pre-existing methodology in the literature or something custom for your usage? I usually see attention in PP implemented, conceptually at least, as variance parameterisation/optimisation over a continuous space. How do you achieve something similar in your binary latent space?
Sorry for all the questions!
Mental-Swordfish7129 t1_j2x3juw wrote
Idk if it's in the literature. At this point, I can't tell what I've read from what has occurred to me.
I keep track of the error each layer generates, along with a brief history of its descending predictions. Then I reinforce the generation of predictions that produce the fastest reduction in subsequent error. I think this amounts to a modulation of attention (manifested as a pattern of bit masking over the ascending error signal), which in effect ignores the portions of the signal that have low information and high variance.
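A rough sketch of what that masking might look like in code (all names, sizes, and constants here are my own invention, not the author's; the idea is just that mask bits passing error through get reinforced when total error drops quickly, and decay otherwise):

```python
import numpy as np

DIM = 1024
mask_score = np.full(DIM, 0.5)  # per-bit score behind a binary attention mask

def update_mask(mask_score, error_bits, error_drop, lr=0.05):
    # Credit the bits we were passing through if total error fell quickly;
    # otherwise let their scores decay toward masking them out.
    passed = (mask_score >= 0.5) & (error_bits == 1)
    if error_drop > 0:
        mask_score[passed] = np.minimum(1.0, mask_score[passed] + lr)
    else:
        mask_score[passed] = np.maximum(0.0, mask_score[passed] - lr)
    return mask_score

def apply_mask(mask_score, error_bits):
    # The mask itself is binary: a bit either passes or is ignored.
    return error_bits & (mask_score >= 0.5).astype(error_bits.dtype)
```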
At the bottom layer, this is implemented as choosing behaviors (moving a reticle over an image: up, down, left, or right) that achieve the same avoidance of high variance, and thus high noise, while seeking high information gain.
The end result is a reticle which behaves like a curious agent attempting to track new, interesting things and study them a moment before getting bored.
The highest layers seem to be forming composite abstractions on what is happening below, but I have yet to try to understand.
I'm fine with questions.
Mental-Swordfish7129 t1_j2xqlwa wrote
The really interesting thing lately: if I "show" the agent its global error metric as part of its input, while forcing it (moving the reticle directly) out of boredom toward higher information gain, I can eventually stop the forcing, because it learns to force itself out of boredom. It seems to learn the association between a rapidly declining error and a shift to a more interesting input. I just have to facilitate the bootstrapping.
It eventually exhibits more and more sophisticated behavioral sequences (longer cycles before repeating), and the same happens at higher levels with the attentional changes.
All layers perform the same function. They only differ because of the very different "world" to which they are exposed.
Mental-Swordfish7129 t1_j2xrr7a wrote
>How do you achieve something similar in your binary latent space?
All data coming in is encoded into these high-dimensional binary vectors where each index in a vector corresponds to a relevant feature in the real world. Then, computing error is as simple as XOR(actual incoming data, prediction). This preserves the semantic details of how the prediction was wrong.
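A minimal sketch of that error computation (the dimensionality and vector names are illustrative, not from the original):

```python
import numpy as np

# High-dimensional binary vectors: each index stands for a feature.
rng = np.random.default_rng(0)
actual = rng.integers(0, 2, size=2048, dtype=np.uint8)      # incoming data
prediction = rng.integers(0, 2, size=2048, dtype=np.uint8)  # layer's prediction

# XOR leaves a 1 exactly where the prediction was wrong, so the error
# vector preserves *which* features were mispredicted, not just how many.
error = np.bitwise_xor(actual, prediction)

wrong_features = np.flatnonzero(error)  # indices of mispredicted features
error_rate = error.mean()               # scalar summary, if one is needed
```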
There is no fancy activation function: just a simple sum over connected synapses that attach to an active element.
Synapses are binary: connected or not. Underneath, each has a permanence that decays over time and is increased if the synapse is useful often enough.
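Something like this, perhaps (the threshold, decay, and reward constants are made up for the example; the point is that the connection is binary while a hidden permanence does the learning):

```python
CONNECT_THRESHOLD = 0.5
DECAY = 0.02
REWARD = 0.1

class Synapse:
    def __init__(self, permanence=0.4):
        self.permanence = permanence

    @property
    def connected(self):
        # Binary: connected or not, derived from the hidden permanence.
        return self.permanence >= CONNECT_THRESHOLD

    def step(self, was_useful):
        # Useful synapses are reinforced; all synapses decay over time.
        if was_useful:
            self.permanence = min(1.0, self.permanence + REWARD)
        self.permanence = max(0.0, self.permanence - DECAY)

def activation(synapses, active):
    # No fancy activation function: count connected synapses whose
    # presynaptic element is currently active.
    return sum(1 for s, a in zip(synapses, active) if s.connected and a)
```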
Mental-Swordfish7129 t1_j2y3l18 wrote
>I usually see attention in PP implemented, conceptually at least, as variance parameterisation/optimisation over a continuous space.
Continuous spaces are simply not necessary for what I'm doing. I avoid infinite precision because there is little need for precision beyond a certain threshold.
Also, I'm just a regular guy. I do this in my limited spare time, and I only have relatively weak computational resources and hardware. I'm trying to be efficient anyway, like the brain, and it helps that there isn't a floating-point operation in sight.
Discrete space works just fine, and there is no ambiguity about what a particular index of the space represents. In a continuous space, you'd have to worry that something has been truncated or rounded away.
Idk. Maybe my reasons are ridiculous.