WigglyHypersurface OP t1_j5l49mq wrote
Reply to comment by dojoteef in [D] Embedding bags for LLMs by WigglyHypersurface
Thanks, these are helpful. Seems like "embedding bag" is used in ML libraries but not always in papers.
Edit: from a quick look, neither of these is actually just an embedding bag; rather, they're different approaches to incorporating subword information.
WigglyHypersurface OP t1_j5l3vlk wrote
Reply to comment by terath in [D] Embedding bags for LLMs by WigglyHypersurface
I have. The whole point of my post is that this limits information sharing across tokens, depending on the split.
So, for example, if the tokenizer splits the -ed off the end of a rare verb like "refactored" but not off a common verb like "calmed", it splits the representation of the verbal morphology in two, when really those -ed endings serve the same function.
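You can see what I mean with a stock BPE tokenizer like GPT-2's; the exact splits depend on the vocabulary, so the comments here are only illustrative:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# The rare verb typically gets chopped into several pieces, with the -ed as its own token,
# while the common verb often survives as a single token. Exact output depends on the vocab.
print(tok.tokenize(" refactored"))
print(tok.tokenize(" calmed"))
```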
WigglyHypersurface t1_j4gzweu wrote
Reply to comment by GasZealousideal8691 in [D] Is there any reason hugging face GPT2 would behave (fundamentally) differently from GPT-Neo? by GasZealousideal8691
The GPT2 LM is causal. If you do AutoModelForCausalLM with gpt2 it works fine.
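Something like this minimal sketch is all I mean, using the standard transformers API; swapping in a GPT-Neo checkpoint name (e.g. "EleutherAI/gpt-neo-1.3B") should be the only change:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")  # same call works for a GPT-Neo checkpoint

inputs = tokenizer("The quick brown fox", return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])  # causal LM loss
print(outputs.loss)
```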
WigglyHypersurface t1_j4gzjvd wrote
Reply to comment by GasZealousideal8691 in [D] Is there any reason hugging face GPT2 would behave (fundamentally) differently from GPT-Neo? by GasZealousideal8691
If you're messing with the weights that deeply and directly, I'm not sure. But it smells like a bug to me.
WigglyHypersurface t1_j4grftr wrote
Reply to comment by GasZealousideal8691 in [D] Is there any reason hugging face GPT2 would behave (fundamentally) differently from GPT-Neo? by GasZealousideal8691
I think those are the same, but try making both the causal version and see.
WigglyHypersurface t1_j4gr979 wrote
Reply to comment by GasZealousideal8691 in [D] Is there any reason hugging face GPT2 would behave (fundamentally) differently from GPT-Neo? by GasZealousideal8691
The amount of code in the training data might affect performance on a specific task, even if the task itself involves no code. That may be particularly the case for tasks requiring attention to long-range dependencies and abstract reasoning.
WigglyHypersurface t1_j4gpm5i wrote
Reply to comment by GasZealousideal8691 in [D] Is there any reason hugging face GPT2 would behave (fundamentally) differently from GPT-Neo? by GasZealousideal8691
What kind of head is on the models for the task?
WigglyHypersurface t1_j4f1r8b wrote
Reply to [D] Is there any reason hugging face GPT2 would behave (fundamentally) differently from GPT-Neo? by GasZealousideal8691
Did you forget to change the tokenizer?
WigglyHypersurface t1_j0igsnc wrote
There is recent work on causal forests which also reinterprets forests as a kernel method. The same group also came up with local linear forests, which can help in cases where smoothness and/or extrapolation is desired.
https://arxiv.org/pdf/1510.04342 https://arxiv.org/pdf/1807.11408
WigglyHypersurface t1_izl6mi2 wrote
If you don't want to go full Bayesian, there's always the good old bootstrap. Retrain the model on N replicates of your original data sampled with replacement (as many as you can afford), then take the variance of the error across the N replicates.
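A rough sketch of the idea (fit_and_score is a hypothetical stand-in for your own train-and-evaluate step, and X, y are NumPy arrays):

```python
import numpy as np

def bootstrap_error_variance(X, y, fit_and_score, n_replicates=100, seed=0):
    """Variance of the model's error across bootstrap replicates of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    errors = []
    for _ in range(n_replicates):
        idx = rng.integers(0, n, size=n)              # resample rows with replacement
        errors.append(fit_and_score(X[idx], y[idx]))  # retrain and score on this replicate
    return np.var(errors, ddof=1)
```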
WigglyHypersurface t1_iw7qykq wrote
Reply to comment by Used-Routine-4461 in [D] When was the last time you wrote a custom neural net? by cautioushedonist
Search for MIWAE and notMIWAE to find the papers on the technique.
If your data is small and tabular, you can't really beat Bayes. If your data is too big for Bayes but still tabular, random forest imputation is pretty good. Or, if you have specific hypotheses you know you will test, you can do MICE with SMCFCS.
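For the tabular case, a rough sketch of what I mean by random forest imputation, using scikit-learn's IterativeImputer with a random forest as a missForest-style stand-in:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Toy data with missing entries; each feature is imputed from the others by a forest.
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0], [2.0, 2.5]])
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10,
    random_state=0,
)
X_imputed = imputer.fit_transform(X)
```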
The real utility of the (M)IWAE, I think, is when you have non-tabular data with missing values. This is my use case: I have to impute a mixture of audio, string, and tabular data.
WigglyHypersurface t1_iw5xnxa wrote
Reply to comment by schwagggg in [D] When was the last time you wrote a custom neural net? by cautioushedonist
It's possible I'll use it down the line, but it's not currently in the model.
WigglyHypersurface t1_iw5lak8 wrote
Reply to comment by MontanaBananaJCabana in [D] When was the last time you wrote a custom neural net? by cautioushedonist
Importance weighted autoencoder.
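In case it helps, a minimal PyTorch sketch of the importance-weighted bound, with a toy diagonal-Gaussian encoder/decoder (nothing like the actual multi-modal model I'm using):

```python
import torch
import torch.nn as nn
from torch.distributions import Normal, Independent

class GaussianNet(nn.Module):
    """Maps an input to a diagonal Gaussian; stands in for both encoder and decoder."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.net = nn.Linear(d_in, 2 * d_out)
        self.d_out = d_out

    def forward(self, h):
        mu, log_sigma = self.net(h).split(self.d_out, dim=-1)
        return Independent(Normal(mu, log_sigma.exp()), 1)

def iwae_bound(x, encoder, decoder, prior, k=5):
    # Importance-weighted bound (Burda et al. 2016): E[ log (1/K) sum_k p(x, z_k) / q(z_k | x) ]
    q = encoder(x)                      # q(z | x)
    z = q.rsample((k,))                 # [k, batch, d_z], reparameterized samples
    log_w = decoder(z).log_prob(x) + prior.log_prob(z) - q.log_prob(z)
    return (torch.logsumexp(log_w, dim=0) - torch.log(torch.tensor(float(k)))).mean()

# Toy usage
d_x, d_z = 8, 2
encoder, decoder = GaussianNet(d_x, d_z), GaussianNet(d_z, d_x)
prior = Independent(Normal(torch.zeros(d_z), torch.ones(d_z)), 1)
x = torch.randn(16, d_x)
loss = -iwae_bound(x, encoder, decoder, prior)  # maximize the bound = minimize this loss
```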
WigglyHypersurface t1_iw5c2u1 wrote
Thankfully, I'm doing niche enough projects that I still get to. The last one was a multi-modal IWAE for imputing missing data.
WigglyHypersurface OP t1_j5ldsn7 wrote
Reply to comment by terath in [D] Embedding bags for LLMs by WigglyHypersurface
The reason I'm curious is that FastText embeddings tend to work better on small corpora. I'm wondering whether, if you took one of the small-data-efficient LLMs that you can train yourself on a few A100s (like ELECTRA) and changed the embeddings to a bag of character n-grams, you'd see further gains on small training sets.
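To make that concrete, here's roughly the kind of embedding layer I have in mind: a hypothetical FastText-style sketch where each token vector is the sum of hashed character n-gram vectors via nn.EmbeddingBag (not any existing model's actual embedding):

```python
import torch
import torch.nn as nn

class CharNgramEmbedding(nn.Module):
    """Token embedding as a sum of hashed character n-gram vectors (FastText-style)."""
    def __init__(self, n_buckets=100_000, dim=128, n_min=3, n_max=6):
        super().__init__()
        self.bag = nn.EmbeddingBag(n_buckets, dim, mode="sum")
        self.n_buckets, self.n_min, self.n_max = n_buckets, n_min, n_max

    def ngram_ids(self, token):
        s = f"<{token}>"  # boundary markers, as in FastText
        grams = [s[i:i + n] for n in range(self.n_min, self.n_max + 1)
                 for i in range(len(s) - n + 1)]
        # Hashing trick for the n-gram vocab; Python's str hash varies per process,
        # so a real implementation would want a stable hash.
        return [hash(g) % self.n_buckets for g in grams]

    def forward(self, tokens):
        ids, offsets = [], []
        for t in tokens:
            offsets.append(len(ids))
            ids.extend(self.ngram_ids(t))
        return self.bag(torch.tensor(ids), torch.tensor(offsets))

emb = CharNgramEmbedding()
vecs = emb(["refactored", "calmed"])  # [2, 128]; both share the "ed>" n-gram, so they share parameters
```

Dropping something like this in place of the wordpiece embedding table of a model like ELECTRA is the experiment I'm imagining.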