Lab: Training Tokenizers#
In this lab lecture, we will train several tokenizers on a corpus of Korean text: a Korean Wikipedia dump from the Hugging Face Hub. We will train Byte Pair Encoding (BPE), WordPiece, and Unigram tokenizers and then compare how they tokenize the same text. We will also train a tokenizer with the SentencePiece library.
Step 1: Install necessary libraries#
First, we need to install the necessary libraries. We’ll be using Hugging Face’s tokenizers library and the datasets library to load our corpus. We’ll also need the sentencepiece library. You can install them with:
%pip install tokenizers datasets sentencepiece
Step 2: Load the dataset#
We’ll be using the wiki dataset in Korean, which we can load using the datasets library.
from datasets import load_dataset
wiki = load_dataset("lcw99/wikipedia-korean-20221001")
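Before going any further, it is worth a quick look at what we loaded; the articles live in the train split and their body is in the text field, which is what we use below.
# Quick sanity check: how many articles do we have, and what do they look like?
print(wiki)
print(wiki["train"][0]["text"][:200])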
Step 3: Prepare the text#
We’ll need to extract the actual text from our dataset for training our tokenizers.
# change this to your own path
wiki_filepath = "../tmp/wiki.txt"
text = "\n".join(article["text"] for article in wiki["train"])
with open(wiki_filepath, "w", encoding="utf-8") as f:
    f.write(text)
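Writing out the full dump produces a large file and makes training slow. If you only want to experiment, a smaller corpus is usually enough; the sketch below writes a subset of articles (the 100,000-article cutoff and the wiki_small.txt path are arbitrary choices, not part of the lab setup).
# Optional: write a smaller corpus for quicker experiments.
# The 100,000-article cutoff is arbitrary; adjust it to your hardware.
small_filepath = "../tmp/wiki_small.txt"  # hypothetical path
n_articles = min(100_000, len(wiki["train"]))
with open(small_filepath, "w", encoding="utf-8") as f:
    for article in wiki["train"].select(range(n_articles)):
        f.write(article["text"] + "\n")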
Step 4: Train the tokenizers#
Now, we’ll train each of our tokenizers on our text.
Byte Pair Encoding (BPE)#
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
# change this to your own path
bpe_tokenizer_path = "../tmp/bpe_tokenizer.json"
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train([wiki_filepath])
tokenizer.save(bpe_tokenizer_path)
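Note that calling train without an explicit trainer uses the library's default trainer settings (a vocabulary of roughly 30,000 tokens); we pass an explicit trainer later when we match GPT's configuration. A quick check of what we just trained, using an arbitrary sample sentence:
# Sanity check: vocabulary size and a sample encoding.
print(tokenizer.get_vocab_size())
print(tokenizer.encode("안녕하세요. 반갑습니다.").tokens)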
WordPiece#
from tokenizers.models import WordPiece
# change this to your own path
wordpiece_tokenizer_path = "../tmp/wordpiece_tokenizer.json"
tokenizer = Tokenizer(WordPiece())
tokenizer.train([wiki_filepath])
tokenizer.save(wordpiece_tokenizer_path)
Unigram#
from tokenizers.models import Unigram
# change this to your own path
unigram_tokenizer_path = "../tmp/unigram_tokenizer.json"
tokenizer = Tokenizer(Unigram())
tokenizer.train([wiki_filepath])
tokenizer.save(unigram_tokenizer_path)
SentencePiece#
For SentencePiece, we’ll use the SentencePiece library directly.
import sentencepiece as spm
sentencepiece_tokenizer_path = "../tmp/sentencepiece_tokenizer.model"
num_threads = 1
spm.SentencePieceTrainer.train(
    "--input={} --model_prefix=sentencepiece_tokenizer --vocab_size=32000 --num_threads={}".format(wiki_filepath, num_threads)
)
!mv sentencepiece_tokenizer.* ../tmp/
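As a quick check that training produced the model where Step 5 expects it, we can load it back and print the vocabulary size.
# Verify the trained model loads and has the requested vocabulary size.
sp_check = spm.SentencePieceProcessor()
sp_check.load(sentencepiece_tokenizer_path)
print(sp_check.get_piece_size())  # should print 32000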
Step 5: Compare the tokenizers#
Now, let’s load each tokenizer and see how they tokenize a sample sentence.
from tokenizers import Tokenizer
import sentencepiece as spm
sample_sentence = "안녕하세요. 이 문장은 토크나이저를 테스트하기 위한 샘플 문장입니다."
bpe_tokenizer_path = "../tmp/bpe_tokenizer.json"
wordpiece_tokenizer_path = "../tmp/wordpiece_tokenizer.json"
unigram_tokenizer_path = "../tmp/unigram_tokenizer.json"
sentencepiece_tokenizer_path = "../tmp/sentencepiece_tokenizer.model"
# BPE
bpe = Tokenizer.from_file(bpe_tokenizer_path)
print("BPE:", bpe.encode(sample_sentence).tokens)
# WordPiece
wordpiece = Tokenizer.from_file(wordpiece_tokenizer_path)
print("WordPiece:", wordpiece.encode(sample_sentence).tokens)
# Unigram
unigram = Tokenizer.from_file(unigram_tokenizer_path)
print("Unigram:", unigram.encode(sample_sentence).tokens)
# SentencePiece
sp = spm.SentencePieceProcessor()
sp.load(sentencepiece_tokenizer_path)
print("SentencePiece:", sp.encode_as_pieces(sample_sentence))
BPE: ['안', '녕', '하', '세', '요', '.', '이', '문', '장은', '토', '크', '나이', '저', '를', '테', '스트', '하기', '위한', '샘', '플', '문', '장', '입', '니다', '.']
WordPiece: ['안', '##녕', '##하', '##세', '##요', '##.', '## ', '##이', '## ', '##문', '##장', '##은', '## ', '##토', '##크', '##나', '##이', '##저', '##를', '## ', '##테', '##스', '##트', '##하', '##기', '## ', '##위', '##한', '## ', '##샘', '##플', '## ', '##문', '##장', '##입', '##니', '##다', '##.']
SentencePiece: ['▁', '안녕하세요', '.', '▁이', '▁문장', '은', '▁토크', '나', '이', '저', '를', '▁테스트', '하기', '▁위한', '▁샘플', '▁문장', '입니다', '.']
You should see that each tokenizer breaks the sentence down differently: some keep “안녕하세요” as a single token, while others split it into smaller pieces. Part of this reflects the underlying algorithms, and part reflects how each tokenizer was configured; for example, the WordPiece and Unigram tokenizers above were trained without a pre-tokenizer, which is why whitespace ends up inside the WordPiece tokens, while SentencePiece marks word boundaries with the ▁ symbol.
Remember, there is no universally “best” tokenizer; the right choice depends on the task and the language. Try these different tokenizers in your models and see which one works best for your task!
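One simple quantitative comparison is how many tokens each tokenizer needs for the same text: fewer tokens means shorter sequences for the model, although compactness is not the only thing that matters. A sketch, using a couple of arbitrary sample sentences rather than a real evaluation set:
# Average token count per sentence for each tokenizer (lower = more compact).
samples = [
    "안녕하세요. 이 문장은 토크나이저를 테스트하기 위한 샘플 문장입니다.",
    "대한민국의 수도는 서울이다.",
]

def avg_tokens(tokenize_fn):
    return sum(len(tokenize_fn(s)) for s in samples) / len(samples)

print("BPE:", avg_tokens(lambda s: bpe.encode(s).tokens))
print("WordPiece:", avg_tokens(lambda s: wordpiece.encode(s).tokens))
print("Unigram:", avg_tokens(lambda s: unigram.encode(s).tokens))
print("SentencePiece:", avg_tokens(sp.encode_as_pieces))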
Training Tokenizers for GPT, BERT, and T5#
In this section, we will train tokenizers using the same specifications as the GPT, BERT, and T5 models. These tokenizers are Byte Pair Encoding (BPE) for GPT, WordPiece for BERT, and Unigram for T5.
Byte Pair Encoding (BPE) - GPT#
GPT models use a Byte Pair Encoding (BPE) tokenizer with a vocabulary of roughly 50,000 tokens (GPT-2 uses 50,257, including special tokens). BPE was originally designed as a data compression technique, but it has been shown to work well for tokenizing text in neural language models.
To train a BPE tokenizer with the same specifications as GPT:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
wiki_filepath = "../tmp/wiki.txt"
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(
    vocab_size=50000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.train([wiki_filepath], trainer)
tokenizer.save("../tmp/bpe_gpt_tokenizer.json")
WordPiece - BERT#
BERT models use the WordPiece tokenizer with a vocabulary of roughly 30,000 tokens (30,522 in the original English BERT). WordPiece is a data-driven subword tokenization strategy that handles out-of-vocabulary (OOV) words by breaking them into known subword pieces.
To train a WordPiece tokenizer with the same specifications as BERT:
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.normalizers import BertNormalizer
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = BertNormalizer(lowercase=False)
trainer = WordPieceTrainer(
    vocab_size=30000, special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
)
tokenizer.train([wiki_filepath], trainer)
tokenizer.save("../tmp/wordpiece_bert_tokenizer.json")
Unigram - T5#
T5 models use a Unigram tokenizer (via SentencePiece) with a vocabulary size of 32,000. Rather than merging pairs upward like BPE, the Unigram model starts from a large candidate vocabulary and iteratively prunes it to maximize the likelihood of the training text under a unigram language model; because it scores alternative segmentations probabilistically, it also enables subword regularization.
To train a Unigram tokenizer with the same specifications as T5:
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer
tokenizer = Tokenizer(Unigram())
trainer = UnigramTrainer(
    vocab_size=32000, special_tokens=["<pad>", "</s>", "<unk>", "<s>"]
)
tokenizer.train([wiki_filepath], trainer)
tokenizer.save("../tmp/unigram_t5_tokenizer.json")
In summary, we have trained three tokenizers using the same specifications as the GPT, BERT, and T5 models: BPE with a 50,000-token vocabulary for GPT, WordPiece with a 30,000-token vocabulary for BERT, and Unigram with a 32,000-token vocabulary for T5. All three are data-driven subword tokenizers; they differ in how they build and apply their vocabularies, which affects how well they handle out-of-vocabulary words and, ultimately, how well the models perform.