tdgros t1_izwkav3 wrote

ViTs keep the same token dimension throughout because of the residual connections in the transformer blocks: each skip connection adds a block's input to its output, so the two must have the same shape.

At the very end, you want to aggregate the information if you want to do classification. Because all tokens are equivalent, you just average them before further decoding; i.e., if you concatenated all the tokens before a linear layer, it would end up looking like global average pooling.
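A minimal PyTorch sketch of that mean-pooling head (module name and dimensions are illustrative, not from any specific ViT implementation):

```python
import torch
import torch.nn as nn

class MeanPoolHead(nn.Module):
    """Average all output tokens, then decode the pooled vector into class logits."""
    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, embed_dim), output of the last transformer block
        pooled = tokens.mean(dim=1)  # global average pooling over the token axis
        return self.fc(pooled)       # (batch, num_classes) logits

head = MeanPoolHead(embed_dim=768, num_classes=1000)
logits = head(torch.randn(2, 196, 768))  # -> shape (2, 1000)
```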

2

DeepGamingAI t1_izwolrw wrote

Thanks, that clarifies some things. I have also seen a parameter in the ViT head that simply returns the first token representation instead of averaging across all tokens. I never understood why that made sense, and why only the first token and not some other random token.

This also reminds me of another confusion I have about transformers: would they lose meaning if we gradually compressed the embedding size after every MLP in the transformer block?

1

tdgros t1_izwppx0 wrote

You can take all the existing tokens, average them, and decode the result into logits. But if that works, it can also work with one single token after all.

Or you can append a special learned token at some point, which gets its own decoder; I believe that's what you're describing. You can find this approach in BERT, where a CLS token is inserted before every sentence. One final, similar approach is Perceiver IO's, where the decoder is a transformer whose queries are a learned array.
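A hedged sketch of that learned-token variant, again in illustrative PyTorch: a learnable [CLS] token is prepended at position 0, which is also why decoding "the first token" isn't arbitrary; it is simply where the learned token was inserted.

```python
import torch
import torch.nn as nn

class CLSTokenHead(nn.Module):
    """Prepend a learned [CLS] token; classify from its final representation only."""
    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.fc = nn.Linear(embed_dim, num_classes)

    def add_cls(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, embed_dim); call this before the transformer
        cls = self.cls_token.expand(patch_tokens.shape[0], -1, -1)
        return torch.cat([cls, patch_tokens], dim=1)  # [CLS] becomes token 0

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: transformer output, (batch, 1 + num_patches, embed_dim)
        return self.fc(tokens[:, 0])  # decode only the first ([CLS]) token
```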

You can change the dimensionality with 1x1 convolutions in between transformer blocks; you wouldn't lose meaning, but you would lose expressivity or capacity. I'm not sure it's recommended, but it's not immoral or illegal.
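On tokens, a 1x1 convolution is equivalent to a per-token linear layer, so a sketch of such a dimension-reducing bridge could look like this (hypothetical module; `nn.TransformerEncoderLayer` stands in for the ViT blocks):

```python
import torch
import torch.nn as nn

class TokenProjection(nn.Module):
    """Per-token linear layer, equivalent to a 1x1 convolution over the token grid."""
    def __init__(self, dim_in: int, dim_out: int):
        super().__init__()
        self.proj = nn.Linear(dim_in, dim_out)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # (batch, num_tokens, dim_in) -> (batch, num_tokens, dim_out)
        return self.proj(tokens)

stage1 = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
bridge = TokenProjection(768, 384)  # dimensionality drops here, costing capacity
stage2 = nn.TransformerEncoderLayer(d_model=384, nhead=6, batch_first=True)

x = torch.randn(2, 196, 768)
y = stage2(bridge(stage1(x)))       # -> shape (2, 196, 384)
```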

1

DeepGamingAI t1_izwql0s wrote

>I'm not sure that's recommended, it's not immoral or illegal.

Humans may not consider that design choice immoral, but I don't want to offend our soon-to-be AI overlords. Maybe I'll ask ChatGPT if it will judge me for doing that.

1

tdgros t1_izwrdlr wrote

I am designing a vision transformer for image classification. How do you feel about inserting 1x1 convolutions in between transformer blocks to reduce the dimensionality? Would you feel offended if I gradually did that throughout my backbone?

As a large language model trained by OpenAI, I don't have feelings and I am not capable of being offended. I am also not able to browse the internet, so I am unable to provide specific information about using 1x1 convolutions in a vision transformer for image classification. However, in general, using 1x1 convolutions can be a useful technique for reducing the dimensionality of an input in a convolutional neural network, which can help to improve the computational efficiency of the network and reduce the number of parameters it has. Whether or not this technique is appropriate for your specific use case will depend on the details of your model and the data you are working with. It's always a good idea to experiment with different architectures and techniques to see what works best for your particular application.

1

DeepGamingAI t1_izwthb9 wrote

It's just like a girlfriend. "No, I will not be offended if you do this," but then she goes ahead and takes it personally when you do it.

1