Submitted by asarig_ t3_10sj2qf in MachineLearning
janpf t1_j75zh5u wrote
Reply to comment by asarig_ in [R] Graph Mixer Networks by asarig_
Ha, the funny thing is that in the Google paper at least they replace the O(n^(2)) by a O(n*D_S), where D_S is a constant, so linear. But it so happens that D_S > n in their studies, so it's not really faster :) ... (edit: there is another constant in the transformers version also, but effectively the mixer was using same order of magnitute amount of TPU time to train)
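To make the cost comparison concrete, here's a minimal NumPy sketch (not the paper's code; shapes and D_S value are made up) of a token-mixing MLP for a single feature channel. The mixing step is n -> D_S -> n, so it costs on the order of n*D_S multiply-adds per channel, versus the ~n^2 token-token scores of self-attention:

    import numpy as np

    # Hypothetical sizes: if D_S > n, the "linear" mixer does more work
    # per channel than the n^2 attention scores it replaces.
    n, D_S = 64, 256

    rng = np.random.default_rng(0)
    x = rng.normal(size=n)                    # one feature channel across n tokens/nodes

    # Token-mixing MLP: n -> D_S -> n, cost ~ O(n * D_S) per channel.
    W1 = rng.normal(size=(D_S, n)) / np.sqrt(n)
    W2 = rng.normal(size=(n, D_S)) / np.sqrt(D_S)
    mixed = W2 @ np.maximum(W1 @ x, 0.0)      # ReLU here for brevity; Mixer papers use GELU

    print(mixed.shape)                        # (64,)

So the asymptotics only help when D_S stays well below n, which is exactly the point above.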
But MLP-Mixers are a very interesting proposition anyway. Other token-mixing approaches include FFT-based mixing (FNet).
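For reference, a rough sketch of FNet-style mixing (assumed shapes, not the original implementation): apply an FFT along the token axis and the hidden axis, then keep the real part, so the mixing step costs about O(n log n) along the sequence instead of O(n^2):

    import numpy as np

    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(64, 128))       # (n tokens, d hidden), made-up sizes

    # 2D DFT over both axes, keeping only the real part, as in FNet.
    mixed = np.real(np.fft.fft(np.fft.fft(tokens, axis=0), axis=1))
    print(mixed.shape)                        # (64, 128)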