Why are there so many Tokenization methods in HF Transformers?

Hugging Face's transformers library is the de-facto standard for NLP. Used by practitioners worldwide, it is powerful, flexible, and easy to use. That flexibility comes from a fairly large (and complex) codebase, which often prompts the question:

"Why are there so many tokenization methods in HuggingFace transformers?"

Tokenization is the process of encoding a string of text into the integer token IDs that a transformer model can read. In this video we cover five different methods for doing this - do they all produce the same output, or are there differences between them?
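
As a quick taste, here is a minimal sketch comparing several tokenizer entry points in transformers (not necessarily the video's exact five); the "bert-base-uncased" checkpoint and the example string are assumptions for illustration:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
text = "hello world"

# 1. Calling the tokenizer directly returns a dict (input_ids, attention_mask, ...).
ids_call = tokenizer(text)["input_ids"]

# 2. encode() returns just the list of token IDs, adding special tokens by default.
ids_encode = tokenizer.encode(text)

# 3. encode_plus() is the older single-string API behind __call__.
ids_plus = tokenizer.encode_plus(text)["input_ids"]

# 4. batch_encode_plus() handles a list of strings at once.
ids_batch = tokenizer.batch_encode_plus([text])["input_ids"][0]

# 5. tokenize() + convert_tokens_to_ids() is the low-level two-step route;
#    note it does NOT add special tokens like [CLS] and [SEP].
ids_manual = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))

print(ids_call)    # e.g. [101, 7592, 2088, 102]
print(ids_encode)  # matches the direct call
print(ids_manual)  # e.g. [7592, 2088] - no special tokens
```

The first four routes should agree for a single string; the low-level route differs because special tokens are left out, which is one of the differences the video digs into.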

📙 Medium article:
https://towardsdatascience.com/why-are-there-so-many-tokenization-methods-for-transformers-a340e493b3a8

📖 Free link:
https://towardsdatascience.com/why-are-there-so-many-tokenization-methods-for-transformers-a340e493b3a8?sk=4a7e8c88d331aef9103e153b5b799ff5

🤖 70% Discount on the NLP With Transformers in Python course:
https://www.udemy.com/course/nlp-with-transformers/?couponCode=MEDIUM2

🕹️ Free AI-Powered Code Refactoring with Sourcery:
https://sourcery.ai/?utm_source=YouTub&utm_campaign=JBriggs&utm_medium=aff

python · machine learning · data science
