Submitted by Simusid t3_1280rhi in MachineLearning
I'm making my very first attempt at ViTMAE and I'd be interested in hearing about anyone's successes, tips, failures, etc.
I'm pretraining on 850K grayscale spectrograms of birdsongs. I'm on epoch 400 of 800 and the loss has declined from about 1.2 to 0.7. I don't really have a sense of what is "good enough," and I guess the only way I can judge is by looking at the reconstructions. I'm doing that using this notebook as a guide, and right now they look quite bad.
Ideally, I want to work towards building an embedding model specifically for acoustics (whether the input is a wav-file time series or spectrogram images). Maybe I need 80M images instead of 800K; maybe I need a DGX A100 and a month to train it. Maybe this is a complete failure. I'm not sure right now, but it's been very interesting to implement. Would love to hear your thoughts.
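For anyone following along, the core MAE recipe behind ViTMAE (patchify, randomly mask ~75% of patches, reconstruct, and score the loss only on the masked patches) can be sketched in plain numpy. This is a minimal illustration, not the actual model: the 224×224 input, 16×16 patch size, and 75% mask ratio are the MAE paper defaults, and the zero "decoder output" is a placeholder for what a trained decoder would predict.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical input: one single-channel 224x224 spectrogram.
img = rng.standard_normal((224, 224)).astype(np.float32)
patch = 16
n_side = img.shape[0] // patch           # 14 patches per side
n_patches = n_side * n_side              # 196 patches total

# Patchify: (196, 256), one row per flattened 16x16 patch.
patches = (img.reshape(n_side, patch, n_side, patch)
              .transpose(0, 2, 1, 3)
              .reshape(n_patches, patch * patch))

# MAE-style random masking at 75%: the encoder only sees the kept 25%.
mask_ratio = 0.75
n_keep = int(n_patches * (1 - mask_ratio))
perm = rng.permutation(n_patches)
keep_idx, mask_idx = perm[:n_keep], perm[n_keep:]

encoder_input = patches[keep_idx]        # (49, 256) visible patches

# Placeholder "decoder output"; a real ViTMAE decoder predicts these values.
pred = np.zeros_like(patches)

# The reconstruction loss is MSE computed on the *masked* patches only,
# which is why raw loss values depend on how the inputs are normalized.
loss = np.mean((pred[mask_idx] - patches[mask_idx]) ** 2)
```

One consequence worth noting: because the loss is averaged only over masked patches and depends on input normalization, the absolute value (e.g. 0.7) isn't comparable across setups, which is why eyeballing reconstructions is a reasonable sanity check.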
nbviewerbot t1_jegobo2 wrote
I see you've posted a GitHub link to a Jupyter Notebook! GitHub doesn't render large Jupyter Notebooks, so just in case, here is an nbviewer link to the notebook:
https://nbviewer.jupyter.org/url/github.com/NielsRogge/Transformers-Tutorials/blob/master/ViTMAE/ViT_MAE_visualization_demo.ipynb
Want to run the code yourself? Here is a binder link to start your own Jupyter server and try it out!
https://mybinder.org/v2/gh/NielsRogge/Transformers-Tutorials/master?filepath=ViTMAE%2FViT_MAE_visualization_demo.ipynb