# Lab: Pretraining LMs - MLM

## Introduction

In this lab session, we'll delve into the core of transformer-based models, specifically focusing on the BERT model, which utilizes a unique pretraining method called Masked Language Modeling (MLM). Our objective is to learn how to train a small-scale BERT model from scratch with a subset of the Wikipedia dataset. We will accomplish this task using Google Colab, a cloud-based programming environment that offers free access to computational resources, including GPUs which are ideal for tasks like training deep learning models.

BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based machine learning technique for natural language processing (NLP) pre-training developed by Google. It was created and published in 2018 by Jacob Devlin and his colleagues from Google. Unlike the traditional approach of training language models in either a left-to-right or right-to-left direction, BERT is trained bidirectionally. This bidirectional training approach allows the model to understand the context of a word based on all of its surroundings (left and right of the word).

The major characteristic of BERT is its use of MLM during pretraining. In MLM, a random subset of the input tokens is masked, and the objective for the model is to predict the original vocabulary id of the masked word, given its context. Unlike next sentence prediction, MLM can be defined over any input sequence and doesn't require explicitly labeled pairs of sentences as input.

The procedure for training our own BERT model involves the following steps:

1. **Preparing the Environment**: Set up Google Colab and ensure it has a functioning GPU.

2. **Preparing the Dataset**: Fetch and preprocess the subset of the Wikipedia dataset.

3. **Creating a Tokenizer**: Train a tokenizer from scratch. This tokenizer will be used to split our text into tokens (words, subwords) that can be understood by our model.

4. **Formatting the Data**: Format the data to match the input expected by the BERT model, and to implement the MLM.

5. **Defining the Model**: Define a small-scale version of the BERT model architecture.

6. **Training the Model**: Train the model using the prepared data and the MLM objective.

7. **Validating the Model**: Finally, evaluate the model's performance by running it on a validation dataset.

8. **Saving and Loading the Model**: Save the model for later use and demonstrate how to load it.

By the end of this lab session, you'll gain a deeper understanding of how transformer-based models are pretrained, and how you can implement this in practice. This is a crucial skill for any data scientist or ML engineer, given the prominence and success of transformer models in a wide range of NLP tasks.


## Preparing the Environment

In this section, we will prepare our development environment on Google Colab.

1. **Accessing Google Colab**:

   Visit the Google Colab website at https://colab.research.google.com. You may need to sign in with your Google account. If you don't have one, you'll need to create an account.

2. **Creating a New Notebook**:

   Once you're signed in, create a new Python 3 notebook by clicking on 'File' > 'New notebook'.

3. **Setting Up the GPU**:

   To train our model, we need to utilize a GPU. Google Colab offers free GPU resources and we can easily set it up by doing the following:

   - Click on 'Runtime' in the menu.
   - Select 'Change runtime type'.
   - Choose 'GPU' from the dropdown menu next to 'Hardware accelerator'.
   - Click on 'Save'.

   This will ensure that our model is trained on a GPU which is much faster than using a CPU.

4. **Installing Necessary Libraries**:

   For this exercise, we will primarily need the Hugging Face Transformers and Datasets library. Hugging Face Transformers provides us with the necessary tools and models for working with BERT and other transformer models, while the Datasets library helps us handle datasets.

   To install these, we will use the following pip commands in a new cell:

   ```python
   !pip install transformers
   !pip install datasets
   ```

   After running these commands, the necessary libraries should be installed.

5. **Verifying the Environment**:

   After setting up the GPU and installing the required libraries, it is recommended to verify the setup. You can check the GPU and the installed libraries with the following commands:

   ```python
   # Check the GPU
   !nvidia-smi

   # Check the transformers version
   !pip show transformers

   # Check the datasets version
   !pip show datasets
   ```

   The `nvidia-smi` command should display information about the GPU, and the `pip show` commands should display information about the installed libraries.

Congratulations, you have now successfully prepared your Google Colab environment! We are now ready to move on to the data preparation and tokenizer training stages.


In [None]:
%pip install transformers datasets accelerate apache_beam


In [1]:
!nvidia-smi

Wed May 31 11:50:23 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA RTX A6000                Off| 00000000:01:00.0 Off |                  Off |
|  0%   29C    P8               15W / 300W|   2394MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

## Preparing the Dataset

In this section, we will prepare the subset of the Wikipedia dataset for pretraining our BERT model.

The Hugging Face's `datasets` library provides easy access to the Wikipedia dataset. We can load the dataset as follows:


In [1]:
from datasets import load_dataset

# Load a subset of the Wikipedia dataset
dataset = load_dataset("wikipedia", "20220301.en", split="train[:10%]")

# Display the first few entries in the dataset
print(dataset[:5])

  from .autonotebook import tqdm as notebook_tqdm
Found cached dataset wikipedia (/home/yj.lee/.cache/huggingface/datasets/wikipedia/20220301.en/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559)


{'id': ['12', '25', '39', '290', '303'], 'url': ['https://en.wikipedia.org/wiki/Anarchism', 'https://en.wikipedia.org/wiki/Autism', 'https://en.wikipedia.org/wiki/Albedo', 'https://en.wikipedia.org/wiki/A', 'https://en.wikipedia.org/wiki/Alabama'], 'title': ['Anarchism', 'Autism', 'Albedo', 'A', 'Alabama'], 'text': ['Anarchism is a political philosophy and movement that is sceptical of authority and rejects all involuntary, coercive forms of hierarchy. Anarchism calls for the abolition of the state, which it holds to be unnecessary, undesirable, and harmful. As a historically left-wing movement, placed on the farthest left of the political spectrum, it is usually described alongside communalism and libertarian Marxism as the libertarian wing (libertarian socialism) of the socialist movement, and has a strong historical association with anti-capitalism and socialism.\n\nHumans lived in societies without formal hierarchies long before the establishment of formal states, realms, or empire

```python

```

The '20220301.en' version of the Wikipedia dataset represents the English Wikipedia dump on March 1, 2022. The `split` argument indicates that we're taking only the first 10% of the dataset for this example.

In real-world projects, you'll likely want to use more (or possibly all) of the data. However, for the sake of time and computational resources, we're only using a subset in this instance.

Our next step is to preprocess this raw text data to a format suitable for training a BERT model. BERT expects input data in a specific format, namely tokenized into words and subwords in such a way that it can be converted into token IDs understood by the model.

Since we're training a new tokenizer as part of this lab, our preprocessing at this stage will be minimal - we'll focus on cleaning the text. More advanced preprocessing (like tokenization) will happen after we have a trained tokenizer.

Our text cleaning will include:

- Removing newline characters, as they can interfere with our model's understanding of sentence structures.
- Optionally, removing any special characters or unnecessary whitespace.

Here's how we can accomplish this:


In [2]:
def clean_text(text):
    text = text.replace("\n", " ")
    text = text.replace("\r", " ")
    text = " ".join(text.split())
    return text


# Apply the cleaning function to the dataset
dataset = dataset.map(lambda x: {"text": clean_text(x["text"])})

Loading cached processed dataset at /home/yj.lee/.cache/huggingface/datasets/wikipedia/20220301.en/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559/cache-3f340121f0986bf8.arrow


In [6]:
# change this to your own path
wiki_filepath = "../tmp/wiki.txt"

text = "\n".join(dataset["text"])
with open(wiki_filepath, "w", encoding="utf-8") as f:
    f.write(text)

This `clean_text` function removes newline and return characters, then squashes any sequences of whitespace into single spaces. We apply this function to our dataset using the `map` method, which applies a given function to each element in the dataset.

The dataset is now cleaned and ready to be used for tokenizer training.

## Creating a Tokenizer

After cleaning our dataset, we will train a tokenizer from scratch. A tokenizer breaks down text into smaller pieces, called tokens. BERT uses a special kind of tokenizer known as a WordPiece tokenizer, which breaks words into word parts.

We will use the Tokenizers library from Hugging Face, which offers a high-level API for training tokenizers.

To train a tokenizer, we first need to instantiate a `BertWordPieceTokenizer` with some initial parameters. Then, we will train it with our dataset.


In [7]:
from tokenizers import BertWordPieceTokenizer

# Initialize a tokenizer
tokenizer = BertWordPieceTokenizer(
    clean_text=False,  # We have already cleaned the text
    handle_chinese_chars=False,
    strip_accents=False,  # We keep accents
    lowercase=False,  # We keep the case info
)

# And then train
tokenizer.train(
    files=[wiki_filepath],
    vocab_size=30000,
    min_frequency=2,
    show_progress=True,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    limit_alphabet=1000,
    wordpieces_prefix="##",
)

# Save the tokenizer
tokenizer.save("../tmp/bert_base_uncased.json")






['../tmp/bert_base_uncased-vocab.txt']

In this example, we have set a vocabulary size of 30,000 and a minimum frequency of 2, which means that a token must appear at least twice in the dataset to be included in the vocabulary. The `special_tokens` parameter includes special tokens needed in the BERT architecture. The `wordpieces_prefix` parameter sets the prefix of the subwords that the tokenizer will use.

After training, we save the tokenizer into files for later usage:

```python
# Save the tokenizer
tokenizer.save("bert_base_uncased")
```

This command will generate two files: `bert_base_uncased-vocab.txt` and `bert_base_uncased-merges.txt`. The `vocab.txt` file contains the vocabulary of our tokenizer, i.e., all the tokens it can recognize. The `merges.txt` file contains the rules to split words into subwords if they're not in the vocabulary.

This completes the section on creating a tokenizer. With our tokenizer trained, we're ready to format our data for the BERT model and implement MLM.

## Formatting the Data

After training our tokenizer, we will now format the data to be compatible with the BERT model and the MLM objective. We need to take our text data and tokenize it using our new tokenizer. Additionally, we must format the tokenized data in such a way that it fits the MLM objective.

Here are the steps:

1. **Tokenizing the Data**

First, we tokenize the cleaned text data:


In [3]:
from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

# Load the trained tokenizer
tokenizer_obj = Tokenizer.from_file("../tmp/bert_base_uncased.json")

# Create a PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer_obj,
    unk_token="[UNK]",
    sep_token="[SEP]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    mask_token="[MASK]",
)
tokenizer.save_pretrained("../tmp/bert_base_uncased")

('../tmp/bert_base_uncased/tokenizer_config.json',
 '../tmp/bert_base_uncased/special_tokens_map.json',
 '../tmp/bert_base_uncased/tokenizer.json')

In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("../tmp/bert_base_uncased")
print(f"is_fast: {tokenizer.is_fast}")

is_fast: True


In [5]:
text_column = "text"


def tokenize(element):
    outputs = tokenizer(
        element[text_column],
        truncation=True,
        max_length=512,
        return_special_tokens_mask=True,
    )
    return outputs


# Tokenize the text
tokenized_dataset = dataset.map(
    tokenize, batched=True, remove_columns=[text_column], num_proc=20
)
tokenized_dataset.features

Loading cached processed dataset at /home/yj.lee/.cache/huggingface/datasets/wikipedia/20220301.en/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559/cache-fb130a6169eb8f15_*_of_00020.arrow


{'id': Value(dtype='string', id=None),
 'url': Value(dtype='string', id=None),
 'title': Value(dtype='string', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'special_tokens_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}

This will give us a dataset of tokenized texts, where each word or subword in the original text is replaced with its corresponding token ID from the tokenizer's vocabulary.

2. **Formatting for MLM**

Next, we need to format this tokenized data for the MLM objective. For MLM, we randomly mask some of the tokens and try to predict them given the non-masked tokens.

The Hugging Face's `datasets` library provides a function to prepare data for the MLM task:


In [6]:
from transformers import DataCollatorForLanguageModeling

# Define the data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,  # Activate masked language modeling
    mlm_probability=0.15,  # 15% of the tokens will be masked for prediction
)

The `DataCollatorForLanguageModeling` function takes our tokenized text data and produces batches of data ready for MLM. It randomly masks 15% of the tokens (as defined by `mlm_probability`) in each sequence, and sets up the labels necessary for the MLM objective.

Now, the dataset is fully prepared and ready to be used for model training. The tokenized data, along with the data collator, will be fed into our model during the training phase.

## Defining the Model

Now that we have prepared our dataset and tokenizer, we can define our BERT model. We will use the Hugging Face's `transformers` library, which provides an easy-to-use implementation of the BERT model. We will use a smaller variant of BERT for faster training.

Here's how to define our BERT model:


In [7]:
from transformers import BertConfig, BertForMaskedLM

# Define the configuration
config = BertConfig(
    vocab_size=30_000,  # we set this to the length of our tokenizer
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=512,
    type_vocab_size=2,
    initializer_range=0.02,
    layer_norm_eps=1e-12,
    pad_token_id=tokenizer.pad_token_id,
    gradient_checkpointing=False,
)

# Instantiate the model
model = BertForMaskedLM(config=config)

# print the model size
model_size = sum(t.numel() for t in model.parameters())
print(f"BERT size: {model_size/1000**2:.1f}M parameters")

BERT size: 109.1M parameters


This BERT model has similar architecture to the original BERT-base model, including:

- `vocab_size`: The size of the vocabulary, which is the same as the one we used to train our tokenizer.
- `hidden_size`: The size of the hidden layers in the Transformer model.
- `num_hidden_layers`: The number of hidden layers in the Transformer model.
- `num_attention_heads`: The number of attention heads in the self-attention mechanisms.
- `intermediate_size`: The size of the "intermediate" layer, i.e., the size of the output tensors after each self-attention layer in a Transformer block.
- `hidden_act`: The non-linear activation function in the encoder and pooler.
- `hidden_dropout_prob`: The dropout probability for all fully connected layers in the embeddings and encoder.
- `attention_probs_dropout_prob`: The dropout probability for the attention probabilities.
- `max_position_embeddings`: The maximum sequence length that this model might ever be used with.
- `type_vocab_size`: The vocabulary size of the `token_type_ids` passed into `BertModel`.
- `initializer_range`: The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- `layer_norm_eps`: The epsilon used by the layer normalization layers.
- `pad_token_id`: The id of the padding token.
- `gradient_checkpointing`: If True, use gradient checkpointing to save memory at the expense of slower backward pass.

With this, we've successfully defined our BERT model. In the next section, we will train this model with our dataset.

## Training the Model

Now that we have our model and our data prepared, we can move on to training our model. We will use the Hugging Face's `Trainer` class to handle the training process. This class simplifies the training process and allows us to focus on the model design and performance.

Here's how to train the model:


In [8]:
from transformers import Trainer, TrainingArguments

# Define the training arguments
training_args = TrainingArguments(
    output_dir="./tmp/bert_base_uncased",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=32,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
    fp16=True,
)

# Instantiate the trainer
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset,
)

In this example, we set the number of training epochs to 1 for faster execution. In real-world applications, you might want to increase this value to improve your model's performance. We also specify a batch size of 16, which is a reasonable choice for this kind of model. The `save_steps` and `save_total_limit` parameters control how often the model is saved during training and how many total checkpoints to keep.

To begin training, we call the `train` method on our trainer:


In [9]:
from accelerate import Accelerator

accelerator = Accelerator()
acc_state = {str(k): str(v) for k, v in accelerator.state.__dict__.items()}
device = accelerator.device

print(f"device: {device}")

trainer = accelerator.prepare(trainer)

trainer.train()

You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


device: cuda


Step,Training Loss
500,7.4609
1000,6.9992
1500,6.8658
2000,6.7688
2500,6.6994
3000,6.6533
3500,6.5965
4000,6.5429
4500,6.4963
5000,6.4609


TrainOutput(global_step=20184, training_loss=5.933551393706223, metrics={'train_runtime': 8160.2738, 'train_samples_per_second': 79.148, 'train_steps_per_second': 2.473, 'total_flos': 1.6999426712671027e+17, 'train_loss': 5.933551393706223, 'epoch': 1.0})

This will train the model on the tokenized dataset. Because we're using the MLM objective, the model will learn to understand the relationships between different words and their contexts. The duration of the training process will depend on the size of your dataset and the capabilities of your GPU.

After training, the model will be ready for evaluation and further usage.

## Validating the Model

After training our model, we need to validate it to check how well it has learned to predict masked tokens. For this, we can create a validation dataset, preprocess and tokenize it in the same way we did for the training data, and then use the `Trainer` class to evaluate the model's performance.

Here's how we can do this:

1. **Create the Validation Dataset**

First, we need a separate validation dataset. We can again use a subset of the Wikipedia dataset for this purpose. Note that this subset should be different from the training set.


In [15]:
# Load a subset of the Wikipedia dataset for validation
valid_dataset = load_dataset("wikipedia", "20220301.en", split="train[10%:12%]")

Found cached dataset wikipedia (/home/yj.lee/.cache/huggingface/datasets/wikipedia/20220301.en/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559)


2. **Clean, Tokenize and Format the Validation Data**

We then preprocess, tokenize, and format this validation dataset in the same way as we did for the training data:


In [16]:
# Clean the validation dataset
valid_dataset = valid_dataset.map(lambda x: {"text": clean_text(x["text"])})

# Tokenize the validation dataset
valid_tokenized_dataset = dataset.map(
    tokenize, batched=True, remove_columns=["text"], num_proc=20
)
valid_tokenized_dataset.features

Loading cached processed dataset at /home/yj.lee/.cache/huggingface/datasets/wikipedia/20220301.en/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559/cache-f70a50493a56c656_*_of_00020.arrow


{'id': Value(dtype='string', id=None),
 'url': Value(dtype='string', id=None),
 'title': Value(dtype='string', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'special_tokens_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}

3. **Evaluate the Model**

We can now use the `evaluate` method of the `Trainer` class to validate the model:


In [17]:
# Evaluate the model
eval_results = trainer.evaluate(valid_tokenized_dataset)

# Print the evaluation results
print(eval_results)

{'eval_loss': 4.76074743270874, 'eval_runtime': 2831.6503, 'eval_samples_per_second': 228.089, 'eval_steps_per_second': 28.511, 'epoch': 1.0}


The `evaluate` method returns a dictionary containing the evaluation results. The exact contents of this dictionary will depend on the model and the training objective, but for an MLM model like BERT, it should contain the loss on the validation data (under the key `'eval_loss'`). This value represents how well the model can predict masked tokens in data it hasn't seen during training.

This concludes the validation step. After validating your model, you can go back and fine-tune the model parameters or adjust the training process to improve the model's performance, or you can move forward and use the model for downstream tasks.

## Saving and Loading the Model

After training and validating our model, we want to save it so that we can use it in the future. Likewise, we want to know how to load the model from disk for future usage or for fine-tuning on new data.

Here's how we can do it:

### Saving the Model

To save both the model and the tokenizer we've trained, we can use the `save_model` and `save_pretrained` methods from the `Trainer` class and the `tokenizer` instance, respectively:


In [10]:
# Save the model
trainer.save_model("../tmp/bert_base_uncased")

# Save the tokenizer
tokenizer.save_pretrained("../tmp/bert_base_uncased")

('../tmp/bert_base_uncased/tokenizer_config.json',
 '../tmp/bert_base_uncased/special_tokens_map.json',
 '../tmp/bert_base_uncased/tokenizer.json')

This will create several files in the `bert_base_uncased` directory. For the model, it will create `pytorch_model.bin` (which contains the model weights) and `config.json` (which stores the configuration of the model). For the tokenizer, it will create `tokenizer_config.json`, `special_tokens_map.json`, `vocab.txt` and `merges.txt`.

## Loading the Model

If we want to load our model and tokenizer in the future, we can use the `from_pretrained` methods from the `BertForMaskedLM` and `BertWordPieceTokenizer` classes:


In [None]:
# Load the model
model = BertForMaskedLM.from_pretrained("../tmp/bert_base_uncased")

# Load the tokenizer
tokenizer = BertWordPieceTokenizer.from_pretrained("../tmp/bert_base_uncased")

These methods will automatically find the necessary files in the specified directory and load them into memory.

This concludes the saving and loading step. Once you have your model and tokenizer saved, you can use them to solve any downstream tasks, or you can share them with others.
