This is just a collection of links and observations I came across while learning about tokenisation over the past week that would otherwise have no home.
- NLTK and CLTK are two other NLP libraries from the pre-deep-learning era. CLTK focuses on classical languages, but my sense is that NLTK hasn't kept pace as much, and I don't plan to delve too deeply into where it is strong.
- Via the ArabML community I came across tkseem, which offers tokenization for Arabic. There are probably ideas to learn from in there.
- I watched this video from the MARI conference (about which more soon), which mentioned a really good example (pointing to this paper) of why tokenization matters for negation in Swahili.
- Some interesting observations on how certain languages end up with a disproportionately expensive text-to-token ratio, i.e. the same content takes many more tokens to encode. Feels like a useful area to research more; see the sketch after this list for one way to measure it.
- Apparently GPU tokenization is a thing too, though it's unclear whether this is just NVIDIA making something so that they can sell more GPUs (i.e. what is the need for it, given how fast CPU tokenization already runs?).
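
To make the text-to-token ratio point concrete, here is a minimal sketch that compares tokens per character across a few languages. It assumes the `tiktoken` package and its `cl100k_base` encoding; the sample sentences are rough stand-ins I picked for illustration, not a benchmark.

```python
# Rough sketch: compare how many tokens the same short sentence costs
# in different languages, using tiktoken's cl100k_base encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Illustrative sample sentences (roughly the same content in each language).
samples = {
    "English": "The weather is nice today and we are going for a walk.",
    "Swahili": "Hali ya hewa ni nzuri leo na tunaenda kutembea.",
    "Arabic": "الطقس جميل اليوم وسنذهب في نزهة.",
}

for lang, text in samples.items():
    tokens = enc.encode(text)
    # A higher tokens-per-character ratio means the language is more
    # "expensive" to encode: shorter effective context and higher cost.
    ratio = len(tokens) / len(text)
    print(f"{lang:8s} chars={len(text):3d} tokens={len(tokens):3d} tokens/char={ratio:.2f}")
```

Languages that are underrepresented in the tokenizer's training data tend to come out with noticeably more tokens per character, which in practice means shorter effective context windows and higher per-word API cost.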
## More Questions
And some other questions (beyond my larger questions around how to evaluate tokenisers):
- How useful (or not) is data augmentation when it comes to training a tokenizer?
- Is a list of dictionary words useful for training a tokenizer?
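
On that last question, one naive way to start probing it would be to train a small BPE tokenizer directly on a word list and inspect the merges it learns. A minimal sketch using the HuggingFace `tokenizers` library; `wordlist.txt` is a hypothetical one-word-per-line file and the vocabulary size is arbitrary.

```python
# Sketch: train a small BPE tokenizer on nothing but a dictionary word list,
# then inspect how it segments an unseen word.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(vocab_size=8000, special_tokens=["[UNK]"])
tokenizer.train(["wordlist.txt"], trainer)

# The open question: do merges learned from isolated dictionary words
# transfer to real running text, which has frequency information,
# inflection in context, punctuation, and so on?
print(tokenizer.encode("untokenizable").tokens)
```

My guess is that a bare word list throws away the frequency information that BPE normally exploits, so it would need to be weighted or mixed with running text, but that is exactly the kind of thing an experiment like this could check.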