Are there any small problems that I can use to test out transformers?

Something simple and easy to program, so that I can use a small transformer net and gauge its performance on my regular computer without the need for nutty training times.

My ultimate goal is to then modify the network and see if I can get similar/better results with some tweaks.

For example, for a normal network I do simple function approximation, like predicting y from x where the target y is actually: cos(sin(10*(x^2))^3)

But to test transformers I would probably need a finite sequence -> finite sequence testing function, and I can't think of one easily.

Comments

adam_jc t1_iqui2g0 wrote on October 3, 2022 at 4:37 AM

you can do n-digit addition of positive integers as a sequence where each digit is a token, i.e.

the problem 946 + 82 = 1028 could be made into a sequence of:

9 | 4 | 6 | + | 0 | 8 | 2 | = | 1 | 0 | 2 | 8

(you could also omit + and = tokens).
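A minimal sketch of how those sequences could be generated in plain Python. The names `encode_addition` and `sample_batch` are made up here for illustration and are not taken from minGPT; zero-padding the sum to n+1 digits is an assumption so that every sequence has the same length.

```python
import random

# Vocabulary: ten digits plus the optional '+' and '=' separator tokens.
VOCAB = list("0123456789+=")
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def encode_addition(a, b, n_digits=3):
    """Turn a + b into a list of one-character tokens.

    Operands are zero-padded to n_digits and the sum to n_digits + 1
    (the longest a sum of two n-digit numbers can be), so every
    sequence has the same length.
    """
    left = str(a).zfill(n_digits)
    right = str(b).zfill(n_digits)
    total = str(a + b).zfill(n_digits + 1)
    return list(left) + ["+"] + list(right) + ["="] + list(total)

def sample_batch(batch_size, n_digits=3, seed=None):
    """Draw random addition problems and return them as lists of token ids."""
    rng = random.Random(seed)
    hi = 10 ** n_digits - 1
    batch = []
    for _ in range(batch_size):
        a, b = rng.randint(0, hi), rng.randint(0, hi)
        tokens = encode_addition(a, b, n_digits)
        batch.append([TOKEN_TO_ID[t] for t in tokens])
    return batch

# The example from above: 946 + 82 = 1028
print(" | ".join(encode_addition(946, 82)))
# prints: 9 | 4 | 6 | + | 0 | 8 | 2 | = | 1 | 0 | 2 | 8
```

For training, the tokens up to and including `=` would be the prompt and the remaining digits the target the model has to predict.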

Andrej Karpathy uses this task in his minGPT repo.

edit: that repo also does character-level training on a tiny dataset of Shakespeare's writing

sharp7 OP t1_iquj21o wrote on October 3, 2022 at 4:47 AM

Omg perfect thanks!

IntelArtiGen t1_iqu74lu wrote on October 3, 2022 at 2:54 AM

Transformers like the one in BERT already have well-defined tasks for training themselves without labels. You can use a corpus like Universal Dependencies if you want to predict labels on words / sentences, but you can also just use any text and do tasks like "predict hidden words" or "predict next sentence", the way they are defined here: https://arxiv.org/pdf/1810.04805.pdf, or in any other way as long as it makes sense for the neural network. You can also use OPUS if you want to try translating sentences with the whole encoder-decoder architecture of the Transformer.

You probably don't need a high-end GPU to train a small transformer on a small corpus. I trained a basic transformer in 30 min with an rtx2070s on Europarl with just the masked word prediction task. If you don't have a GPU it'll be harder, though. I've never tried to train a very small Transformer, so I don't know how they scale. I guess you could try to predict masked words with ~100 sentences and a very small transformer, and train that model on CPU.

If you're only testing the architecture of the transformer and not the embeddings, you can start the model from pretrained embeddings; it should speed up the training a lot.

sharp7 OP t1_iqu81jn wrote on October 3, 2022 at 3:02 AM

Hmm interesting that it only took you 30 min for europarl and masked word prediction. Do you have any links to more information about that dataset and task? I'm not familiar with masked word prediction. But that's pretty fast. Although I only have an old GTX 1060 6GB. Not sure how much worse that is than your rtx2070.

IntelArtiGen t1_iqv7vu7 wrote on October 3, 2022 at 10:30 AM

The task is described in the paper I linked (3.1, Task #1: Masked LM). Any implementation of BERT should use it, like this one.
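Roughly, the masking step from that section could look like the sketch below. It is only an illustration: `mask_tokens` and the toy vocabulary are hypothetical names, not from any particular BERT implementation. The 15% selection rate and the 80/10/10 split (mask token / random token / unchanged) follow the paper.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (inputs, labels) for masked word prediction.

    Each token is selected with probability mask_prob. A selected token
    is replaced by [MASK] 80% of the time, by a random vocabulary token
    10% of the time, and left unchanged 10% of the time. labels holds
    the original token at selected positions and None elsewhere, so the
    loss is only computed on the selected positions.
    """
    rng = rng or random.Random()
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))
            else:
                inputs.append(tok)
        else:
            labels.append(None)
            inputs.append(tok)
    return inputs, labels

# Toy usage with a whitespace-tokenized sentence (an assumption; real
# BERT implementations use a subword tokenizer).
sentence = "the cat sat on the mat".split()
vocab = sorted(set(sentence))
inputs, labels = mask_tokens(sentence, vocab, rng=random.Random(0))
print(inputs)
print(labels)
```

The model sees `inputs` and is trained to predict the original token wherever `labels` is not None.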

sharp7 OP t1_iqz57zy wrote on October 4, 2022 at 3:58 AM

Thank you ty!!!