Submitted by Simusid t3_1280rhi in MachineLearning
I'm making my very first attempt at ViTMAE and I'd be interested in hearing about anyone's successes, tips, failures, etc.
I'm pretraining on 850K grayscale spectrograms of birdsongs. I'm on epoch 400 out of 800 and the loss has declined from about 1.2 to 0.7. I don't really have a sense of what is "good enough", and I guess the only way I can judge is by looking at the reconstructions. I'm doing that using this notebook as a guide, and right now the reconstructions are quite bad.
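For reference, here's roughly what I'm doing to check the reconstructions, following the demo notebook (the checkpoint path is a placeholder for my own run, and since my spectrograms are single-channel this may need some tweaking):

```python
import torch
import matplotlib.pyplot as plt
from transformers import ViTImageProcessor, ViTMAEForPreTraining

# placeholder path -- wherever my own pretraining checkpoint lives
model = ViTMAEForPreTraining.from_pretrained("path/to/my-vitmae-checkpoint")
processor = ViTImageProcessor.from_pretrained("path/to/my-vitmae-checkpoint")

def show_reconstruction(image):
    """Plot original vs. masked-patch reconstruction for one spectrogram."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # logits are per-patch pixel predictions; unpatchify turns them back into an image
    recon = model.unpatchify(outputs.logits)             # (1, C, H, W)

    # build a pixel-level mask: 1 where the patch was hidden from the encoder
    mask = outputs.mask.unsqueeze(-1)
    mask = mask.repeat(1, 1, model.config.patch_size ** 2 * model.config.num_channels)
    mask = model.unpatchify(mask)                         # (1, C, H, W)

    pixels = inputs["pixel_values"]
    # paste reconstructed patches only where the encoder never saw the input
    pasted = pixels * (1 - mask) + recon * mask

    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    titles = ["original", "reconstruction", "reconstruction + visible patches"]
    for ax, img, title in zip(axes, [pixels, recon, pasted], titles):
        ax.imshow(img[0].permute(1, 2, 0).squeeze().numpy(), cmap="gray")
        ax.set_title(title)
        ax.axis("off")
    plt.show()
```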
Ideally, I want to work towards building an embedding model specifically for acoustics (whether the input is a wav-file time series or a spectrogram image). Maybe I need 80M images instead of 800K, maybe I need a DGX A100 and a month to train it. Maybe this is a complete failure. I'm not sure right now, but it's been very interesting to implement. Would love to hear your thoughts.
IntelArtiGen t1_jeguknc wrote
I've used autoencoders on spectrograms, and in theory you don't need an A100 or 80M spectrograms to get some results.
I've not used ViTMAE specifically, but I've read similar papers. I'm not sure how to interpret the value of the loss. You can use some tips that are valid for most DL projects. Can your model overfit on a smaller version of your dataset (say, 1000 spectrograms)? If yes, perhaps your model isn't large/efficient enough to process your whole dataset (though birdsongs shouldn't be that hard to learn imo). At the very least, this method lets you run more epochs faster and debug some parameters. If your model can't overfit, you may have a problem in your pre/post-processing.
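A minimal sketch of that sanity check (the model sizes are just illustrative and `spectrogram_dataset` is a placeholder for your own Dataset of (1, 224, 224) float tensors):

```python
import torch
from torch.utils.data import DataLoader, Subset
from transformers import ViTMAEConfig, ViTMAEForPreTraining

# small illustrative config for the sanity check (single-channel spectrograms)
config = ViTMAEConfig(image_size=224, num_channels=1, hidden_size=384,
                      num_hidden_layers=6, num_attention_heads=6,
                      intermediate_size=1536)
model = ViTMAEForPreTraining(config)

# placeholder: your own Dataset yielding (1, 224, 224) float tensors
small = Subset(spectrogram_dataset, range(1000))
loader = DataLoader(small, batch_size=32, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-4)

model.train()
for epoch in range(200):                                  # many passes over the same 1000 samples
    total = 0.0
    for pixel_values in loader:
        optimizer.zero_grad()
        loss = model(pixel_values=pixel_values).loss      # MAE reconstruction loss
        loss.backward()
        optimizer.step()
        total += loss.item()
    print(f"epoch {epoch}: mean loss {total / len(loader):.4f}")
```

If the loss on this tiny subset doesn't drop clearly below what you see on the full dataset, I'd suspect model capacity or the pre/post-processing rather than the amount of data.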
Do ViTMAE models need normalized inputs? Spectrograms can have large values by default, which may not be easy to process, and they can be hard to normalize. Your input and your output should be in a coherent range of values, and you should use the right layers in your model if you want that to happen. Also, fp16 training can mess with that.
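For example, a simple log-compression plus per-example standardization before patching might look like this (assuming a non-negative magnitude/power spectrogram; adjust if yours is already in dB):

```python
import numpy as np

def normalize_spectrogram(spec, eps=1e-6):
    """Log-compress a power/magnitude spectrogram and standardize it per example."""
    log_spec = np.log(spec + eps)                                # squash the huge dynamic range
    log_spec = (log_spec - log_spec.mean()) / (log_spec.std() + eps)
    return log_spec.astype(np.float32)                           # keep fp32 here; fp16 can under/overflow
```

Whatever you pick, the decoder's output range has to match it, so keep the same normalization at train and reconstruction time.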
ViTMAE isn't specifically for sounds, right? I think there have been multiple attempts to use it for sounds; this paper (https://arxiv.org/pdf/2212.09058v1.pdf) cites other papers:
>Inspired by the success of the recent visual pre-training method MAE [He et al., 2022], MSM-MAE [Niizumi et al., 2022], MaskSpec [Chong et al., 2022], MAE-AST [Baade et al., 2022] and Audio-MAE [Xu et al., 2022] learn the audio representations following the Transformer-based encoder-decoder design and reconstruction pre-training task in MAE
You can look at their results and how they made it work; these papers probably also published their code.
Be careful with how you process sounds; the pre/post-processing is different from what you'd do for images, which can introduce problems.
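For example, going from a wav file to a fixed-size input a ViT can accept might look roughly like this with torchaudio (all parameter values are illustrative, not what those papers used):

```python
import torch
import torchaudio

sample_rate = 32000                                    # illustrative; pick what suits birdsong
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=1024, hop_length=320, n_mels=128)
to_db = torchaudio.transforms.AmplitudeToDB(top_db=80)

def wav_to_model_input(path, target_size=224):
    wav, sr = torchaudio.load(path)                    # (channels, samples)
    if sr != sample_rate:
        wav = torchaudio.functional.resample(wav, sr, sample_rate)
    spec = to_db(mel(wav.mean(dim=0)))                 # mono log-mel, (n_mels, frames)
    spec = (spec - spec.mean()) / (spec.std() + 1e-6)  # same normalization as during training
    # ViT wants a fixed square input; resizing is one crude option among several
    spec = torch.nn.functional.interpolate(
        spec[None, None], size=(target_size, target_size), mode="bilinear")
    return spec.squeeze(0)                             # (1, target_size, target_size)
```

The point is just that choices like window size, hop length, and how you get to a fixed image size all change what the model sees, in a way that doesn't come up with natural images.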