hysse t1_j30tpx8 wrote

Thanks for the answer. I need to train a relatively large model, so I need an efficient tokenizer.

I don't see how a tokenizer written in PyTorch (or TensorFlow) can be faster than a HuggingFace tokenizer, for example. HuggingFace tokenizers have a Rust backend that makes them faster, and I guess torchtext has an optimized backend too.

Given that the tokenizer runs on the CPU and not the GPU, how could it run faster if I wrote it in PyTorch (or even in plain Python)?

1

hysse t1_j2qqwsf wrote

Which tool is best for training a tokenizer? The HuggingFace library seems the simplest, but is it the most computationally efficient? If so, what are torchtext, NLTK, etc. useful for?
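For context on what "training a tokenizer" actually computes, here is a toy sketch (plain Python, illustrative only, not how any of these libraries is implemented) of the byte-pair-encoding merge loop that libraries like HuggingFace `tokenizers` perform in optimized Rust:

```python
from collections import Counter

def train_toy_bpe(words, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent
    symbol pair. Illustrative only; real libraries run this loop in
    optimized native code over much larger corpora."""
    # Represent each word as a tuple of symbols (characters to start).
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the corpus.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge to every word in the vocabulary.
        merged = best[0] + best[1]
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

merges = train_toy_bpe(["low", "low", "lower", "lowest"], num_merges=2)
# First merges the frequent 'l'+'o' pair, then 'lo'+'w'.
```

The inner loops touch every symbol of every word on every merge, which is exactly the kind of work that is slow in interpreted Python and fast in a Rust or C++ backend, and why the training speed of the tool matters on a large corpus.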

3