Tokenizing Balochi with HuggingFace’s Tokenizer and FastAI/Spacy

I explore language tokenization using FastAI, Spacy, and Huggingface Tokenizers, with a special focus on the less-represented Balochi language. I share the challenges I faced due to language-specific limitations, my initiative to expand language metadata, and my plans to assess and enhance tokenization efficiency.
nlp
balochi-language-model
tokenisation
balochi
Author

Alex Strick van Linschoten

Published

June 3, 2023

In this blog I want to walk through how I trained my first tokenizer(s) on a small Balochi language corpus. I used the Huggingface Tokenizers library and FastAI / Spacy to get a sense of the interfaces involved. There’s also some naive pre-processing I did to get the corpus into a format that the tokenizer could handle. I’m not sure if this is the best way to do it, but it worked for this first iteration.

I'll get into the implementation details below, but in outline the process was:

  1. Load in our data corpus
  2. Pre-process the data (remove non-Balochi characters and numbers, etc.)
  3. Load the algorithm we want to use for tokenisation (using BPE here)
  4. Tokenise the text

I’ll go through each of these steps in turn.

# !pip install datasets
# !huggingface-cli login
# from datasets import load_dataset
# load_dataset("balochiml/balochi-language-data", data_dir="data", cache_dir="../data")

Load our text corpus

Here I walk through my .txt files and load the paths into a list. You can see we have 4294 files to work with.

import os


def get_txt_file_paths(directory):
    txt_file_paths = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith(".txt"):
                file_path = os.path.join(root, file)
                txt_file_paths.append(file_path)
    return txt_file_paths


# Replace "directory_path" with the actual path of the directory you want to search
directory_path = "../data/raw_text"
txt_paths = get_txt_file_paths(directory_path)

len(txt_paths)
4294

Pre-process the texts

I still don't have a good sense of the best ways to do this, not least because I'm not sure of the tradeoffs involved in the decisions I'm making. For example, I frequently hear that people remove punctuation during pre-processing, but I'm not sure how that's helpful. It feels like you'd be removing context more than anything else.

I had similar thoughts on the removal of numbers, but in the end I removed them along with any a-z or A-Z English-language characters. I also removed excess whitespace.

import re

def clean_text(file_path):
    # Open the file and read it into memory
    with open(file_path, "r", encoding="utf-8") as file:
        text = file.read()

    # Remove English-language characters and numbers
    text = re.sub(r"[a-zA-Z0-9]", "", text)

    # Collapse runs of spaces and tabs (but not newlines) into a single space
    text = re.sub(r"[^\S\n]+", " ", text)

    return text


for path in txt_paths:
    cleaned_text = clean_text(path)

    # write the cleaned text to a new file with an incremented filename
    # write the files all into the '../data/processed_text' directory
    with open(
        f'../data/processed_text/{path.split("/")[-1]}', "w", encoding="utf-8"
    ) as file:
        file.write(cleaned_text)

Training a Tokenizer using 🤗 Tokenizers

The process of ‘training’ a tokeniser using the Huggingface Tokenizers library was pretty straightforward. There are some nuances and parameters where – again – I’m not sure of the tradeoffs I’m making. I’ll mention those when I get to them.

from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
from tokenizers.pre_tokenizers import Whitespace

tokenizer.pre_tokenizer = Whitespace()

Here, for example, I'm pretty sure that the vocabulary size is an important hyperparameter to tune, as is the minimum frequency of tokens. The values here are the defaults in the library. I've read that a higher vocab size might be warranted for a morphologically complex language, but I don't think Balochi qualifies. A larger vocabulary might also make sense once I have a larger corpus to train on.

from tokenizers.trainers import BpeTrainer

vocab_size = 30000

trainer = BpeTrainer(
    min_frequency=2,
    vocab_size=vocab_size,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    show_progress=True,
)
# get a list of all the txt files in
# '/Users/strickvl/balochi/balochi-tokenizer/data/processed_text'

processed_files = get_txt_file_paths("../data/processed_text")
assert len(processed_files) == len(txt_paths)
len(processed_files)
4294

The training process itself was a matter of passing the files and the (configured) trainer into the .train() method. It was extremely quick to run, taking only a matter of minutes to crunch through my corpus. (For reference, I'm now up to around 2.8 million words of Balochi text in the corpus, a drop in the ocean compared to the datasets used to train English-language LLMs.)

tokenizer.train(processed_files, trainer)


tokenizer.model
<tokenizers.models.BPE at 0x108eaa830>
assert tokenizer.get_vocab_size() == vocab_size
tokenizer.get_vocab_size()
30000
# tokenizer.get_vocab()

I also saved the tokenizer to disk so that I (or others) can load it at a later date without needing the dataset. This saves a JSON file which contains all the information needed to load the tokenizer separately from the data.

tokenizer.save("../models/30k-balochi-tokenizer.json")
tokenizer = Tokenizer.from_file("../models/30k-balochi-tokenizer.json")

And here you can see the results on a sample from some Balochi text I found somewhere on the internet.

sample_text = "      آیک  جناورے اَت۔  لھتے گشیت آ سکیں کارزوالے ات کہ اگاں آزاتی دیگ بہ بیت، بازارءَ، لوگے ءَ، جاگاہ یے  ءَ،دپتر ء ُ کارگس یے  ءَ یا ھر ھما جاگاہ ءَ کہ شُت کنت مزنیں کارزوالی کنت۔گوں ھر کس ءَ جنگ ء ُ مڑ بیت۔گدء ُ پچاں  چنڈ چنڈ ء ُ راڑ راڑ کنت،کاگد ء ُ وانگیاں وارت ء ُ آدراہ کنت۔ورگی چیزاں اگاں وارت نکنت آھاں گٹ پاچیت ھراب کنت۔ایندگہ جناور چہ بندات ء َ ایشی ءِ کازوالیاں چہ وتا دیر دارگ ءِ کوشست کن اَنت۔ چیا کہ آ بازیں دگہ ھرابی ء ُ کارزوالی ھم کنت،پمیشکا کسانیں جناور  بالی مُرگ،کوہ پاچن،آسک ء ُ ایندگہ کسان کسانیں جناورچر آئی ءِ کارزوالیانی سوب ءَ آئی ءَ چہ سک باز شزار اَنت ۔".replace(
    "\xa0", ""
)
sample_sentence = sample_text.split("۔")[2]
sample_sentence
'گوں ھر کس ءَ جنگ ء ُ مڑ بیت'
tokenizer.encode(sample_sentence).tokens
['گوں', 'ھر', 'کس', 'ءَ', 'جنگ', 'ء', 'ُ', 'مڑ', 'بیت']

Judging by this tiny example, the tokenization process doesn't seem to have saved us much in terms of space: the tokens in the encoded text are basically just the words from the original text.
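
One rough way to put a number on this (just a quick sketch using the tokenizer and sample text defined above, not a proper evaluation) is to compare the token count against a naive whitespace split:

def tokens_per_word(text: str) -> float:
    # ratio of BPE tokens to whitespace-separated words; values near 1.0
    # mean the tokenizer is mostly emitting whole words rather than subwords
    n_words = len(text.split())
    n_tokens = len(tokenizer.encode(text).tokens)
    return n_tokens / n_words

tokens_per_word(sample_text)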

Training a tokenizer using Spacy and FastAI

The FastAI course and book have a whole chapter on NLP, including a section on tokenization and subwords, so I thought I'd follow that process too, both to get a sense of the higher-level API that FastAI provides and to see the implementation that Spacy provides under the hood.

When you install FastAI, you’ll probably notice that it has Spacy as a dependency. This is because it uses Spacy under the hood for tokenization (along with a lot of other NLP tasks). FastAI provides a wrapper around Spacy’s Tokenizer object along with some helper functions and other bits and pieces.

I’ll admit to not finding the FastAI interface as intuitive or useful as the 🤗 Tokenizers library, in part because it was harder to get at some of the Spacy primitives when it became necessary to do so. More on this below.

from fastai.text.all import *
# a built-in helper function from fastai
files = get_text_files("../data/processed_text")
len(files)
4294
# get some sample text from the first file
txt = files[0].open().read(); txt[:75]
'*آمیتگءِ جُستءَمکن* لچّہ: *آمیتگءِ جُستءَمکن* آ میتگءَکہ من وتی شوکیں کسانی'
# using the `SpacyTokenizer` from fastai
# see https://docs.fast.ai/text.core.html#spacytokenizer
spacy = WordTokenizer()
toks = first(spacy([txt]))
print(coll_repr(toks, 30))
(#146) ['*','آمیتگءِ','جُستءَمکن','*','لچّہ',':','*','آمیتگءِ','جُستءَمکن','*','آ','میتگءَکہ','من','وتی','شوکیں','کسانی','پیر','کُت','آ','میتگءِ','جسُتءَمکن','آ','میتگءِ','گیراں','مبو','بے','اوستیں','تاهیراں','مبو','آ'...]
tkn = Tokenizer(spacy)
print(coll_repr(tkn(txt), 31))
(#147) ['xxbos','*','آمیتگءِ','جُستءَمکن','*','لچّہ',':','*','آمیتگءِ','جُستءَمکن','*','آ','میتگءَکہ','من','وتی','شوکیں','کسانی','پیر','کُت','آ','میتگءِ','جسُتءَمکن','آ','میتگءِ','گیراں','مبو','بے','اوستیں','تاهیراں','مبو','آ'...]
txts = L(o.open().read() for o in files)
# get a sense for the subwords generated from a
# small slice of our text data
def subword(size: int):
    sp = SubwordTokenizer(vocab_sz=size)
    sp.setup(txts)
    return " ".join(first(sp([txt]))[:40])
subword(1000)
'▁* آ می تگ ءِ ▁جُست ءَ م ک ن * ▁لچّہ : ▁* آ می تگ ءِ ▁جُست ءَ م ک ن * ▁آ ▁میتگ ءَ کہ ▁من ▁وتی ▁ش وکیں ▁کس انی ▁پیر ▁کُت ▁آ ▁میتگ ءِ ▁ج'
subword(275)
'▁ * آ م ی ت گ ء ِ ▁ ج ُ س ت ء َ م ک ن * ▁ ل چ ّ ہ : ▁ * آ م ی ت گ ء ِ ▁ ج ُ س ت'
toks200 = txts[:200].map(tkn)
toks200[0]
(#147) ['xxbos','*','آمیتگءِ','جُستءَمکن','*','لچّہ',':','*','آمیتگءِ','جُستءَمکن'...]

At this point, having seen a bit of how FastAI and Spacy tokenize the text, we can move on to the numericalisation process and see what we get from our dataset.

num = Numericalize()
num.setup(toks200)
coll_repr(num.vocab,50)
"(#4096) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','ءَ','ءِ','ءُ','۔','کہ','،','انت','من','اے','نہ','وتی','بیت','”','ات','چہ','گوں','اَنت','اِنت','پہ','بہ','‘','یک','آئی','.','آ','منی','ھم',')','کنت','بلوچی','3','تو','بلے','ئے',':','کنگ','(','بوتگ','آں','کن','؟'...]"

You can see that some of the meta-tokens mentioned in my last blog are also represented here, and the rest of the words are sorted in frequency order.

We can represent a sample of text as the token ids at this point:

nums = num(toks)[:20]; nums
TensorText([ 156, 2340,    0,  156,  563,   43,  156, 2340,    0,  156,   33,
               0,   16,   19, 1490,  831,  457,  102,   33, 1031])

When we convert this back, you’ll see we get the meta-tokens as well.

' '.join(num.vocab[o] for o in nums)
'* آمیتگءِ xxunk * لچّہ : * آمیتگءِ xxunk * آ xxunk من وتی شوکیں کسانی پیر کُت آ میتگءِ'

Lessons learned

This first attempt at tokenisation was instructive in a number of respects.

I didn't show what was going on under the hood with the FastAI wrapper, but if you look at the source code you'll see that the line spacy = WordTokenizer() assumes that the base language we're dealing with is English. You can of course pass a language code into the WordTokenizer initialization, but since it uses Spacy under the hood and Balochi isn't one of the languages Spacy officially supports, you're basically out of luck. You hit an error, and you can either continue using simplistic algorithms like the ones demonstrated above (essentially splitting on word delimiters) or abandon FastAI and dive into Spacy directly.
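
One quick fallback that might work here (I haven't explored it properly, so treat this as a sketch) is Spacy's generic multi-language pipeline, registered under the language code xx, which gives you a basic rule-based tokenizer without any language-specific rules and at least doesn't choke on Balochi text:

import spacy

# Spacy's generic "multi-language" pipeline: no language-specific rules,
# just a basic rule-based tokenizer splitting on whitespace and punctuation
nlp = spacy.blank("xx")
doc = nlp("گوں ھر کس ءَ جنگ ء ُ مڑ بیت")
print([token.text for token in doc])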

At that point, you'll have to start implementing a whole bunch of things yourself before you can really get going. For example, you'll ideally want to come up with the full list of punctuation marks, stop words, stemming rules and so on that I mentioned last time. (It might well be possible to get up and running faster with Spacy for an unsupported language, but it wasn't clear to me how to do that.)
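
My (so far untested) understanding from the Spacy documentation is that the way to do this properly is to register a custom language subclass and hang your own defaults off it. The stop words below are just placeholders pulled from the frequency list above, not a curated list:

import spacy
from spacy.language import Language


class BalochiDefaults(Language.Defaults):
    # placeholder stop words taken from the most frequent tokens above,
    # not a curated Balochi stop word list
    stop_words = {"ءَ", "ءِ", "ءُ", "کہ", "انت"}
    writing_system = {"direction": "rtl", "has_case": False, "has_letters": True}


@spacy.registry.languages("bal")
class Balochi(Language):
    lang = "bal"
    Defaults = BalochiDefaults


# once registered, the language code works with spacy.blank()
nlp = spacy.blank("bal")
print([token.text for token in nlp("گوں ھر کس ءَ جنگ ء ُ مڑ بیت")])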

I do now intend to make a contribution to the Spacy repo to have Balochi represented there, and to open the door for others to contribute to the language metadata directly, but that didn't help me in the moment. You'll also notice that I didn't show how to save a serialized version of the Spacy/FastAI tokeniser: I wasn't able to figure out how to get access to the underlying Spacy object. I'm sure it's possible, since the Spacy API documentation shows which method to use, but FastAI doesn't itself expose this functionality directly.

My initial impression from working with both libraries and spending some time with their documentation is that Spacy might end up being more useful for low-resource languages, given the extent to which it supports a more complete range of old-school NLP methods and techniques. That said, the 🤗 Tokenizers library was much easier to get up and running with, and I think it's a great option for anyone who wants to get started quickly with tokenization. It supports most of the major algorithms you'd ever need, and if it doesn't, you can always implement something yourself to extend it.

Balochi Tokenizers on Huggingface Hub

I'm still working out how best to open up the core dataset (which I'm constructing as I go), but this first iteration of the tokenizer is now available over on the Huggingface Hub. Once you have the JSON file locally, you can load it with a single line:

tokenizer = Tokenizer.from_file("../models/30k-balochi-tokenizer.json")
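
If you'd rather fetch the JSON straight from the Hub, something like the following should work; note that the repo id and filename here are placeholders rather than the actual repository details:

from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

# hypothetical repo id and filename: substitute the real ones from the Hub
tokenizer_path = hf_hub_download(
    repo_id="balochiml/balochi-tokenizer",
    filename="30k-balochi-tokenizer.json",
)
tokenizer = Tokenizer.from_file(tokenizer_path)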

I created the organisation together with some Balochi colleagues who expressed an interest in collaborating on this effort. I'm really happy to have made their acquaintance and I hope I'll be able to make steady progress on this project with their help. (If you're interested in contributing, please request access to the organization and/or contact me for more information.)

While creating the tokenizer repository, I also noted that Balochi (as with Spacy) isn't among the languages recognised by the Hub's language metadata. Frustratingly, you're asked to input an ISO-639-1 two-letter code to represent the language of the model, but of course Balochi doesn't have one of those; it only has ISO-639-2 and ISO-639-3 codes. I'll have to see how we can get Balochi represented on the Hub given all this. It can't be the first time this has happened.

Next steps

Now that I have this first iteration complete, I want to reflect a bit on how to know when the tokenizer is 'good enough'. In particular, how do you evaluate tokenisers? Are there ways of benchmarking this? There must have been work done on this, and I want to understand both what the state of the art is and how to know when I've reached it.
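
One crude benchmark I could already run myself (just a sketch reusing the code from earlier in this post; the evaluation text really ought to be held out from training for it to mean anything) would be to train tokenizers at a few vocabulary sizes and compare how many tokens each needs to encode the same text:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer


def train_bpe(files: list, vocab_size: int) -> Tokenizer:
    tok = Tokenizer(BPE(unk_token="[UNK]"))
    tok.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(
        vocab_size=vocab_size, min_frequency=2, special_tokens=["[UNK]"]
    )
    tok.train(files, trainer)
    return tok


# fewer tokens for the same text means better compression, though that is
# only one (crude) axis along which to judge a tokenizer
for size in (8_000, 16_000, 30_000):
    tok = train_bpe(processed_files, size)
    print(size, len(tok.encode(sample_text).tokens))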

I also watched an extremely rewarding talk on low-resource languages (blog notes to follow!) which stressed how foundational tokenisation is to language models, and which highlighted a failure mode where bad tokenisation made a model perform very badly on a certain kind of task. All of which reinforces how much I'd like to understand how to evaluate tokenisers and when they're good enough.

I also have a grab-bag of odds and ends relating to tokenization (GPU-based tokenization! tiktoken! etc.) that I mean to write up alongside the above.