hysse t1_j30tpx8 wrote

Thanks for the answer. I need to train a relatively large model, so I need an efficient tokenizer.

I don't see how a tokenizer written in PyTorch (or TensorFlow) can be faster than a HuggingFace tokenizer, for example. HuggingFace tokenizers have a Rust backend that makes them faster, and I guess torchtext has an optimized backend too.

Given that the tokenizer runs on the CPU and not the GPU, how could it run faster if I wrote it in PyTorch (or even in plain Python)?

1

hysse t1_j2qqwsf wrote

Which tool is best for training a tokenizer? The HuggingFace library seems the simplest, but is it the most computationally efficient? If so, what are torchtext, NLTK, etc. useful for?
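For context on what "training a tokenizer" actually computes, here is a toy sketch (plain Python, illustrative only, not how any of these libraries is implemented) of the byte-pair-encoding merge loop that libraries like HuggingFace `tokenizers` perform in optimized Rust:

```python
from collections import Counter

def train_toy_bpe(words, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent
    symbol pair. Illustrative only; real libraries run this loop in
    optimized native code over much larger corpora."""
    # Represent each word as a tuple of symbols (characters to start).
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the corpus.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge to every word in the vocabulary.
        merged = best[0] + best[1]
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

merges = train_toy_bpe(["low", "low", "lower", "lowest"], num_merges=2)
# First merges the frequent 'l'+'o' pair, then 'lo'+'w'.
```

The inner loops touch every symbol of every word on every merge, which is exactly the kind of work that is slow in interpreted Python and fast in a Rust or C++ backend, and why the training speed of the tool matters on a large corpus.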

3