Submitted by Fine-Topic-6127 t3_119ydqv in MachineLearning
martianunlimited t1_j9sh43x wrote
Not exactly what you are asking, but there is this paper on scaling laws which shows (assuming the training data is representative of the distribution), at least for large language models, how the performance of transformers scales with the amount of data, and compares it to other network architectures... https://arxiv.org/pdf/2001.08361.pdf We don't have anything similar for other types of data.
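As a rough illustration of the kind of result in that paper: it fits test loss to a power law in dataset size. Below is a minimal sketch, assuming the power-law form and the approximate constants reported by Kaplan et al. (2020); treat the numbers as illustrative rather than exact.

```python
import numpy as np

# Data scaling law from "Scaling Laws for Neural Language Models"
# (arxiv 2001.08361):  L(D) ~ (D_c / D) ** alpha_D
# Constants below are approximate values reported in the paper (assumed here):
ALPHA_D = 0.095   # exponent for dataset-size scaling
D_C = 5.4e13      # "critical" dataset size, in tokens

def predicted_loss(num_tokens: float) -> float:
    """Predicted cross-entropy test loss (nats/token) when dataset size,
    not model size, is the limiting factor."""
    return (D_C / num_tokens) ** ALPHA_D

# Example: how the predicted loss falls as the dataset grows 10x at a time
for tokens in (1e9, 1e10, 1e11):
    print(f"{tokens:.0e} tokens -> predicted loss ~ {predicted_loss(tokens):.2f}")
```

The point of the sketch is just that, under this fit, each 10x increase in data buys a roughly constant multiplicative reduction in loss, which is why the curves in the paper look like straight lines on log-log axes.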