WigglyHypersurface OP t1_j5l49mq wrote
Reply to comment by dojoteef in [D] Embedding bags for LLMs by WigglyHypersurface
Thanks, these are helpful. Seems like "embedding bag" is used in ML libraries but not always in papers.
Edit: from a quick look, neither of these is actually just an embedding bag; rather, they're different approaches to incorporating subword information.
WigglyHypersurface OP t1_j5l3vlk wrote
Reply to comment by terath in [D] Embedding bags for LLMs by WigglyHypersurface
I have. The whole point of my post is that this limits information sharing across tokens, depending on the split.
So, for example, if the tokenizer splits the -ed off the end of a rare verb like "refactored" but not off a common verb like "calmed", it splits the representation of the verbal morphology in two, when really those -ed endings serve the same function.
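You can see what I mean with a stock BPE tokenizer like GPT-2's; the exact splits depend on the vocabulary, so the comments here are only illustrative:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# The rare verb typically gets chopped into several pieces, with the -ed as its own token,
# while the common verb often survives as a single token. Exact output depends on the vocab.
print(tok.tokenize(" refactored"))
print(tok.tokenize(" calmed"))
```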
WigglyHypersurface t1_j4gzweu wrote
Reply to comment by GasZealousideal8691 in [D] Is there any reason hugging face GPT2 would behave (fundamentally) differently from GPT-Neo? by GasZealousideal8691
The GPT2 LM is causal. If you do AutoModelForCausalLM with gpt2 it works fine.
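Something like this minimal sketch is all I mean, using the standard transformers API; swapping in a GPT-Neo checkpoint name (e.g. "EleutherAI/gpt-neo-1.3B") should be the only change:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")  # same call works for a GPT-Neo checkpoint

inputs = tokenizer("The quick brown fox", return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])  # causal LM loss
print(outputs.loss)
```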
WigglyHypersurface t1_j4gzjvd wrote
Reply to comment by GasZealousideal8691 in [D] Is there any reason hugging face GPT2 would behave (fundamentally) differently from GPT-Neo? by GasZealousideal8691
If you're messing with the weights that deeply and directly, I'm not sure. But it smells like a bug to me.
WigglyHypersurface t1_j4grftr wrote
Reply to comment by GasZealousideal8691 in [D] Is there any reason hugging face GPT2 would behave (fundamentally) differently from GPT-Neo? by GasZealousideal8691
I think those are the same, but try making both the causal version and see.
WigglyHypersurface t1_j4gr979 wrote
Reply to comment by GasZealousideal8691 in [D] Is there any reason hugging face GPT2 would behave (fundamentally) differently from GPT-Neo? by GasZealousideal8691
The amount of code in the training data might affect performance on a specific task, even if the task itself involves no code. That may be particularly the case for tasks requiring attention to long-range dependencies and abstract reasoning.
WigglyHypersurface t1_j4gpm5i wrote
Reply to comment by GasZealousideal8691 in [D] Is there any reason hugging face GPT2 would behave (fundamentally) differently from GPT-Neo? by GasZealousideal8691
What kind of head is on the models for the task?
WigglyHypersurface t1_j4f1r8b wrote
Reply to [D] Is there any reason hugging face GPT2 would behave (fundamentally) differently from GPT-Neo? by GasZealousideal8691
Did you forget to change the tokenizer?
WigglyHypersurface t1_j0igsnc wrote
There is recent work on causal forests which also reinterprets forests as a kernel method. The same group also came up with local linear forests, which can help in cases where smoothness and/or extrapolation is desired.
https://arxiv.org/pdf/1510.04342 https://arxiv.org/pdf/1807.11408
WigglyHypersurface t1_izl6mi2 wrote
If you don't want to go full Bayesian, there's always the good old bootstrap. Retrain the model on N replicates of your original data sampled with replacement (as many as you can afford), then take the variance of the error across the N replicates.
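A rough sketch of the idea (fit_and_score is a hypothetical stand-in for your own train-and-evaluate step, and X, y are NumPy arrays):

```python
import numpy as np

def bootstrap_error_variance(X, y, fit_and_score, n_replicates=100, seed=0):
    """Variance of the model's error across bootstrap replicates of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    errors = []
    for _ in range(n_replicates):
        idx = rng.integers(0, n, size=n)              # resample rows with replacement
        errors.append(fit_and_score(X[idx], y[idx]))  # retrain and score on this replicate
    return np.var(errors, ddof=1)
```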
WigglyHypersurface t1_iw7qykq wrote
Reply to comment by Used-Routine-4461 in [D] When was the last time you wrote a custom neural net? by cautioushedonist
Search for MIWAE and notMIWAE to find the papers on the technique.
If your data is small and tabular, you can't really beat Bayes. If your data is too big for Bayes but still tabular, random forest imputation is pretty good. Or, if you have specific hypotheses you know you will test, you can do MICE with SMCFCS.
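For the tabular case, a rough sketch of what I mean by random forest imputation, using scikit-learn's IterativeImputer with a random forest as a missForest-style stand-in:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Toy data with missing entries; each feature is imputed from the others by a forest.
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0], [2.0, 2.5]])
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10,
    random_state=0,
)
X_imputed = imputer.fit_transform(X)
```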
The real utility of the (M)IWAE, I think, is when you have non-tabular data with missing values. This is my use case: I have to impute a mixture of audio, string, and tabular data.
WigglyHypersurface t1_iw5xnxa wrote
Reply to comment by schwagggg in [D] When was the last time you wrote a custom neural net? by cautioushedonist
It's possible I'll use it down the line, but it's not currently in the model.
WigglyHypersurface t1_iw5lak8 wrote
Reply to comment by MontanaBananaJCabana in [D] When was the last time you wrote a custom neural net? by cautioushedonist
Importance weighted autoencoder.
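In case it helps, a minimal PyTorch sketch of the importance-weighted bound, with a toy diagonal-Gaussian encoder/decoder (nothing like the actual multi-modal model I'm using):

```python
import torch
import torch.nn as nn
from torch.distributions import Normal, Independent

class GaussianNet(nn.Module):
    """Maps an input to a diagonal Gaussian; stands in for both encoder and decoder."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.net = nn.Linear(d_in, 2 * d_out)
        self.d_out = d_out

    def forward(self, h):
        mu, log_sigma = self.net(h).split(self.d_out, dim=-1)
        return Independent(Normal(mu, log_sigma.exp()), 1)

def iwae_bound(x, encoder, decoder, prior, k=5):
    # Importance-weighted bound (Burda et al. 2016): E[ log (1/K) sum_k p(x, z_k) / q(z_k | x) ]
    q = encoder(x)                      # q(z | x)
    z = q.rsample((k,))                 # [k, batch, d_z], reparameterized samples
    log_w = decoder(z).log_prob(x) + prior.log_prob(z) - q.log_prob(z)
    return (torch.logsumexp(log_w, dim=0) - torch.log(torch.tensor(float(k)))).mean()

# Toy usage
d_x, d_z = 8, 2
encoder, decoder = GaussianNet(d_x, d_z), GaussianNet(d_z, d_x)
prior = Independent(Normal(torch.zeros(d_z), torch.ones(d_z)), 1)
x = torch.randn(16, d_x)
loss = -iwae_bound(x, encoder, decoder, prior)  # maximize the bound = minimize this loss
```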
WigglyHypersurface t1_iw5c2u1 wrote
Thankfully, I'm doing niche enough projects that I still get to. The last one was a multi-modal IWAE for imputing missing data.
WigglyHypersurface OP t1_j5ldsn7 wrote
Reply to comment by terath in [D] Embedding bags for LLMs by WigglyHypersurface
The reason I'm curious is that FastText embeddings tend to work better on small corpora. I'm wondering whether, if you took one of the small-data-efficient LLMs that you can train yourself on a few A100s (like ELECTRA) and changed the embeddings to a bag of character n-grams, you'd see further gains on small training sets.
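To make that concrete, here's roughly the kind of embedding layer I have in mind: a hypothetical FastText-style sketch where each token vector is the sum of hashed character n-gram vectors via nn.EmbeddingBag (not any existing model's actual embedding):

```python
import torch
import torch.nn as nn

class CharNgramEmbedding(nn.Module):
    """Token embedding as a sum of hashed character n-gram vectors (FastText-style)."""
    def __init__(self, n_buckets=100_000, dim=128, n_min=3, n_max=6):
        super().__init__()
        self.bag = nn.EmbeddingBag(n_buckets, dim, mode="sum")
        self.n_buckets, self.n_min, self.n_max = n_buckets, n_min, n_max

    def ngram_ids(self, token):
        s = f"<{token}>"  # boundary markers, as in FastText
        grams = [s[i:i + n] for n in range(self.n_min, self.n_max + 1)
                 for i in range(len(s) - n + 1)]
        # Hashing trick for the n-gram vocab; Python's str hash varies per process,
        # so a real implementation would want a stable hash.
        return [hash(g) % self.n_buckets for g in grams]

    def forward(self, tokens):
        ids, offsets = [], []
        for t in tokens:
            offsets.append(len(ids))
            ids.extend(self.ngram_ids(t))
        return self.bag(torch.tensor(ids), torch.tensor(offsets))

emb = CharNgramEmbedding()
vecs = emb(["refactored", "calmed"])  # [2, 128]; both share the "ed>" n-gram, so they share parameters
```

Dropping something like this in place of the wordpiece embedding table of a model like ELECTRA is the experiment I'm imagining.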