# Lab: Pretraining LMs - CLM

## Introduction

In this lab, we will dive into the process of pretraining a Causal Language Model (CLM) from scratch using a subset of the Wiki dataset. Specifically, we will be training a small GPT-2 model. The goal of pretraining is to learn good representations of our data before fine-tuning the model on a specific task.

Causal Language Modeling, or Autoregressive Language Modeling, is a type of language modeling where the model makes predictions of the next token in the sequence, given the tokens that came before it. GPT-2 is a transformer model that was designed for this type of task.
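To make the "predict the next token from the previous ones" idea concrete, here is a minimal sketch. It splits on whitespace instead of using a real tokenizer, and the sentence is a made-up example, so treat it as an illustration of the training signal rather than of GPT-2 itself.

```python
# Toy illustration of causal language modeling targets.
# Whitespace "tokens" are an assumption made for readability.
tokens = "the cat sat on the mat".split()

# At each position the model sees the prefix and must predict the next token
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in pairs:
    print(f"{' '.join(context):<20} -> {target}")
```

Every position in a sequence yields one (context, target) pair, which is why a single block of text provides many training examples at once.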

The steps involved in pretraining a CLM are as follows:

1. **Preparing the Environment**: Setting up the necessary libraries and frameworks for training the model.

2. **Preparing the Dataset**: Downloading and preparing the Wiki dataset for our pretraining task.

3. **Creating a Tokenizer**: Training a tokenizer from scratch on our dataset.

4. **Formatting the Data**: Preparing our dataset in a format suitable for training our GPT-2 model.

5. **Defining the Model**: Initializing a GPT-2 model configuration.

6. **Training the Model**: Training our GPT-2 model on our prepared dataset.

7. **Validating the Model**: Checking the quality of our model by generating text from a prompt and inspecting the result.

8. **Saving and Loading the Model**: Storing our trained model for later use and loading it back into memory when needed.

By the end of this lab, you will have a working GPT-2 model that has been pretrained on a subset of the Wiki dataset, and a tokenizer that has been trained on the same data. This model will be ready to be fine-tuned on a downstream task of your choice.

In the next sections, we will delve into each of these steps in more detail.


## Preparing the Environment

In this section, we will set up our working environment on Google Colab. We'll need to install the appropriate libraries and dependencies required for pretraining our GPT-2 model.

**Step 1: Update and Upgrade the System**

First, we'll update and upgrade the system to ensure that we have the latest software and libraries available.

```python
!apt-get update
!apt-get upgrade -y
```

**Step 2: Install Required Libraries**

Next, we'll install the required Python libraries: Hugging Face's `transformers`, `datasets`, and `tokenizers`, plus `accelerate`, which `Trainer` relies on for PyTorch training.

```python
!pip install transformers datasets tokenizers accelerate
```

The `transformers` library provides us with the implementation of the GPT-2 model, along with various utilities for working with transformer models.

The `datasets` library gives us access to a wide variety of datasets and also provides utilities for working with datasets in an efficient manner.

The `tokenizers` library is used for training our tokenizer from scratch.

**Step 3: Import Necessary Libraries**

Lastly, we'll import the necessary libraries into our Python environment.


In [1]:
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import (
    GPT2Tokenizer,
    GPT2LMHeadModel,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, processors



With these in place, we are now ready to proceed to the next step, which is preparing our dataset for pretraining our model.

Remember that we are working with the Google Colab environment, which comes preloaded with many necessary libraries. However, depending on your specific needs, you may need to install or import additional libraries.

## Preparing the Dataset

In this section, we will download and prepare a subset of the Wiki dataset for pretraining our GPT-2 model. The `datasets` library from Hugging Face provides an easy and efficient way to load the dataset.

**Step 1: Download the Dataset**

We will use the `load_dataset` function from the `datasets` library to download our dataset.


In [2]:
from datasets import load_dataset

# Load a subset of the Wikipedia dataset
dataset = load_dataset("wikipedia", "20220301.en", split="train[:10%]")

# Display the first few entries in the dataset
print(dataset[:5])

Found cached dataset wikipedia (/home/yj.lee/.cache/huggingface/datasets/wikipedia/20220301.en/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559)


{'id': ['12', '25', '39', '290', '303'], 'url': ['https://en.wikipedia.org/wiki/Anarchism', 'https://en.wikipedia.org/wiki/Autism', 'https://en.wikipedia.org/wiki/Albedo', 'https://en.wikipedia.org/wiki/A', 'https://en.wikipedia.org/wiki/Alabama'], 'title': ['Anarchism', 'Autism', 'Albedo', 'A', 'Alabama'], 'text': ['Anarchism is a political philosophy and movement that is sceptical of authority and rejects all involuntary, coercive forms of hierarchy. Anarchism calls for the abolition of the state, which it holds to be unnecessary, undesirable, and harmful. As a historically left-wing movement, placed on the farthest left of the political spectrum, it is usually described alongside communalism and libertarian Marxism as the libertarian wing (libertarian socialism) of the socialist movement, and has a strong historical association with anti-capitalism and socialism.\n\nHumans lived in societies without formal hierarchies long before the establishment of formal states, realms, or empire

In the above code, we are loading only 10% of the training split of the Wiki dataset to make the training process faster. Feel free to adjust this percentage according to your computational resources and needs.

**Step 2: Preprocess the Dataset**

Before we can use this dataset for training our model, we need to preprocess the text. Here that means stripping leading and trailing whitespace and converting the text to lowercase, so the resulting tokenizer and model are uncased.


In [4]:
def preprocess_function(text):
    return text.strip().lower()


# Apply preprocessing to the dataset
dataset = dataset.map(lambda x: {"text": preprocess_function(x["text"])})

                                                                      


In the above code, we define a preprocessing function that takes the text of a single article, strips leading and trailing whitespace, and lowercases it. We then apply it to every example with the `map` method. Because `map` is not batched here, the lambda receives one example at a time.

This concludes the preparation of our dataset. The dataset is now ready to be used for training our tokenizer and GPT-2 model.

In the next section, we will train a tokenizer from scratch using our preprocessed dataset.

## Creating a Tokenizer

Tokenization is the process of splitting a sequence of text into individual tokens, which are usually words, subwords, or characters. In this section, we will create and train a tokenizer from scratch using our preprocessed Wiki dataset. We will use the `tokenizers` library for this.

**Step 1: Initialize a Byte-Level BPE Tokenizer**

GPT-2 uses a Byte-Level BPE (Byte Pair Encoding) tokenizer, so we'll initialize one.


In [17]:
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# GPT-2 uses byte-level BPE
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

**Step 2: Train the Tokenizer**

Next, we need to train the tokenizer on our dataset. The trainer will take care of setting the special tokens and other configurations.


In [6]:
# Define the special tokens
special_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

# Setup a trainer
trainer = trainers.BpeTrainer(
    vocab_size=50257,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    special_tokens=special_tokens,
)

# Train the tokenizer
tokenizer.train_from_iterator(dataset["text"], trainer=trainer)






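In the above code, we first define the special tokens. Note that these are BERT-style tokens chosen for this lab; the original GPT-2 tokenizer uses a single special token, `<|endoftext|>`. Then, we set up a trainer with a target vocabulary size of 50,257 (the size of the original GPT-2 vocabulary), the byte-level initial alphabet, and the special tokens. Finally, we train the tokenizer on our dataset using the `train_from_iterator` method.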
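To give an intuition for what BPE training does, the sketch below runs three merge steps on a tiny word-frequency table. The corpus and the helper names `most_frequent_pair` and `merge_pair` are hypothetical. The `tokenizers` library works on bytes and is implemented very differently, so this only illustrates the core idea of repeatedly merging the most frequent adjacent pair.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    # Ties are resolved in favor of the pair that was seen first
    return max(counts, key=counts.get)


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


# Hypothetical toy corpus: word (as a tuple of characters) -> frequency
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(words)
```

Each merge adds one entry to the vocabulary, so training until the vocabulary holds 50,257 entries amounts to repeating this step many thousands of times on the full corpus.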

**Step 3: Save the Tokenizer**

After the tokenizer is trained, it's a good practice to save it for later use.


In [8]:
# Save the tokenizer
tokenizer.save("../tmp/gpt2_tokenizer.json")

In [10]:
from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast

# Load the trained tokenizer
tokenizer_obj = Tokenizer.from_file("../tmp/gpt2_tokenizer.json")

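# Wrap it in a PreTrainedTokenizerFast so that transformers can use it.
# [CLS] and [SEP] double as the BOS and EOS tokens, matching the model config.
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer_obj,
    unk_token="[UNK]",
    sep_token="[SEP]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    mask_token="[MASK]",
    bos_token="[CLS]",
    eos_token="[SEP]",
)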
tokenizer.save_pretrained("../tmp/gpt2_uncased")

('../tmp/gpt2_uncased/tokenizer_config.json',
 '../tmp/gpt2_uncased/special_tokens_map.json',
 '../tmp/gpt2_uncased/tokenizer.json')

We have now successfully created and trained a tokenizer from scratch using our Wiki dataset. This tokenizer will be used to preprocess our dataset into a format that our GPT-2 model can understand.

In the next section, we will format our dataset for the GPT-2 model.

## Formatting the Data

Now that we have our tokenizer ready, the next step is to prepare our dataset in a format suitable for training our GPT-2 model. This usually involves tokenizing the text and organizing it into sequences of a fixed length.

**Step 1: Tokenize the Text**

We will use our newly trained tokenizer to encode the text in our dataset. This will convert the text into sequences of tokens, where each token is replaced by its corresponding ID in the tokenizer's vocabulary.


In [22]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("../tmp/gpt2_uncased")
print(f"is_fast: {tokenizer.is_fast}")

is_fast: True


In [23]:
# Tokenization
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=False, padding=False)


# Drop every original column (id, url, title, text) so only token fields remain
tokenized_dataset = dataset.map(
    tokenize_function, batched=True, remove_columns=dataset.column_names
)


In the above code, we define a function that tokenizes a batch of examples, and then apply this function to our dataset using the `map` method.

**Step 2: Format the Dataset**

Next, we need to organize our tokenized dataset into sequences of a fixed length. For causal language modeling the labels are simply a copy of the input IDs: `GPT2LMHeadModel` shifts them by one position internally when computing the loss, so shifting them here as well would make the model predict two tokens ahead.


In [None]:
from itertools import chain

# Formatting
block_size = 128  # or any number suitable to your context


def group_texts(examples):
    # Concatenate all 'input_ids' in the batch into one long list
    concatenated_examples = list(chain.from_iterable(examples["input_ids"]))
    # Drop the incomplete tail so that every sequence has exactly block_size tokens
    total_length = (len(concatenated_examples) // block_size) * block_size
    # Organize into sequences of fixed length
    sequences = [
        concatenated_examples[i : i + block_size]
        for i in range(0, total_length, block_size)
    ]
    result = {
        "input_ids": sequences,
        # Labels are a copy of the inputs; the model shifts them internally
        "labels": [sequence.copy() for sequence in sequences],
    }
    return result


tokenized_dataset = tokenized_dataset.map(
    group_texts,
    batched=True,
    batch_size=1000,  # or any number suitable to your context
    # The number of rows changes, so all previous columns must be removed
    remove_columns=tokenized_dataset.column_names,
)

We now have our dataset properly formatted for training our GPT-2 model.
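The concatenate-and-chunk idea is easier to see on a tiny input. The following sketch uses made-up token IDs and a hypothetical helper `chunk_token_ids`. It drops the incomplete tail, which is one common choice and not the only possible one.

```python
from itertools import chain


def chunk_token_ids(batch_of_ids, block_size):
    """Concatenate tokenized documents and cut them into equal-length blocks."""
    flat = list(chain.from_iterable(batch_of_ids))
    usable = (len(flat) // block_size) * block_size  # drop the incomplete tail
    return [flat[i : i + block_size] for i in range(0, usable, block_size)]


# Hypothetical token IDs for three short documents
docs = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]
blocks = chunk_token_ids(docs, block_size=4)
print(blocks)
```

Note that a block can span a document boundary (here the first block mixes the first two documents). That is accepted in this style of pretraining because it avoids wasting compute on padding.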

In the next section, we will define our GPT-2 model configuration.

## Defining the Model

In this section, we will define the configuration for our GPT-2 model. This involves specifying the architecture of the model and its hyperparameters.

**Step 1: Import Necessary Modules**

We will need to import the GPT-2 model and its configuration from the `transformers` library.


In [None]:
from transformers import GPT2LMHeadModel, GPT2Config

**Step 2: Define the Configuration**

We will define the configuration for our GPT-2 model. We need to specify the number of heads, the number of layers, the model size, and the size of the vocabulary among other things.


In [None]:
# Define configuration
# `tokenizer` is the PreTrainedTokenizerFast loaded earlier, so we use its API
config = GPT2Config(
    vocab_size=len(tokenizer),
    bos_token_id=tokenizer.convert_tokens_to_ids("[CLS]"),
    eos_token_id=tokenizer.convert_tokens_to_ids("[SEP]"),
    pad_token_id=tokenizer.convert_tokens_to_ids("[PAD]"),
    n_positions=1024,
    n_ctx=1024,
    n_embd=768,
    n_layer=12,
    n_head=12,
)

In the above code, we have defined the configuration for a small GPT-2 model with 12 layers and 12 attention heads. The model has an embedding size of 768, and we've set the maximum number of positional embeddings to 1024.
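It can help to know roughly how large this configuration is before training. The sketch below estimates the parameter count by hand, assuming the tokenizer reached the full 50,257-entry vocabulary, that input and output embeddings are tied, and that the feed-forward layer is four times the embedding size (the GPT-2 defaults). The function name `gpt2_param_count` is made up for this example.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer):
    """Estimate GPT-2 parameters, assuming tied embeddings and a 4x MLP."""
    # Token embeddings plus learned positional embeddings
    embeddings = vocab_size * n_embd + n_positions * n_embd
    # Fused query/key/value projection plus the attention output projection
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # Two-layer feed-forward block with a hidden size of 4 * n_embd
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    # Two layer norms per block, each with a weight and a bias vector
    layer_norms = 2 * (2 * n_embd)
    per_layer = attention + mlp + layer_norms
    final_layer_norm = 2 * n_embd
    return embeddings + n_layer * per_layer + final_layer_norm


total = gpt2_param_count(vocab_size=50257, n_positions=1024, n_embd=768, n_layer=12)
print(f"{total:,} parameters")
```

The number of attention heads does not appear in the formula: splitting the 768 dimensions into 12 heads changes how attention is computed, not how many weights there are.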

**Step 3: Initialize the Model**

With the configuration defined, we can now initialize our GPT-2 model.


In [None]:
# Initialize the model
model = GPT2LMHeadModel(config)

This completes the definition of our model. In the next section, we will train this model on our prepared dataset.

## Training the Model

In this section, we will train our GPT-2 model on the prepared Wiki dataset. Training a model involves defining a training loop where the model's parameters are updated to minimize the loss on the training data.

**Step 1: Define Training Arguments**

We first need to define the training arguments which specify the training parameters such as the number of epochs, the batch size, and the learning rate among others.


In [None]:
from transformers import Trainer, TrainingArguments

# Define training arguments
training_args = TrainingArguments(
    output_dir="../tmp/gpt2",
    overwrite_output_dir=True,
    num_train_epochs=1,  # Adjust as necessary
    per_device_train_batch_size=32,  # Adjust as necessary
    save_steps=10_000,
    save_total_limit=2,
    logging_dir="../tmp/gpt2_logs",
    fp16=True,
)

In the above code, we set the output directory for our model and specify that we want to overwrite any existing output in this directory. We set the number of training epochs to 1 and the training batch size to 32. These values can be adjusted based on the available computational resources. We also specify that we want to save our model every 10,000 steps, and we want to keep a maximum of 2 saved models on disk.
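To get a feel for how these arguments interact, the sketch below works out the number of optimizer steps and checkpoint saves for one epoch. The dataset size is a hypothetical figure, and the calculation assumes a single device with no gradient accumulation.

```python
import math


def training_schedule(num_sequences, batch_size, num_epochs, save_steps):
    """Return (total optimizer steps, number of saves triggered by save_steps)."""
    steps_per_epoch = math.ceil(num_sequences / batch_size)
    total_steps = steps_per_epoch * num_epochs
    return total_steps, total_steps // save_steps


# Hypothetical dataset size; the real number depends on the corpus and block_size
total_steps, checkpoints = training_schedule(
    num_sequences=1_000_000, batch_size=32, num_epochs=1, save_steps=10_000
)
print(f"{total_steps} steps, {checkpoints} checkpoint saves")
```

With `save_total_limit=2`, only the two most recent of those checkpoints would remain on disk at the end of the run.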

**Step 2: Define a Trainer**

We will use Hugging Face's `Trainer` class to handle the training process. We need to provide it with our model, the training arguments, and our training dataset.


In [None]:
# Define a trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

**Step 3: Train the Model**

Now that we have defined our trainer, we can start the training process.


In [None]:
# Trainer uses Accelerate internally and moves the model and batches to the
# right device itself, so it does not need to be wrapped with accelerator.prepare().
print(f"device: {training_args.device}")

trainer.train()

This will start the training process. The trainer will automatically handle the batching of the data, the updating of the parameters, and the computation of the loss.
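The loss being minimized is the average cross-entropy between the model's prediction at each position and the token that actually comes next. Below is a minimal pure-Python sketch of that computation on made-up logits for a 3-token vocabulary. In `transformers`, the same one-position shift is applied inside the model when `labels` are passed.

```python
import math


def causal_lm_loss(logits, token_ids):
    """Mean next-token cross-entropy: position t is scored against token t + 1."""
    losses = []
    # The last position has no next token, and the first token is never a target
    for step_logits, target in zip(logits[:-1], token_ids[1:]):
        log_norm = math.log(sum(math.exp(x) for x in step_logits))
        losses.append(log_norm - step_logits[target])
    return sum(losses) / len(losses)


# Hypothetical 3-token vocabulary and a 3-token sequence
token_ids = [0, 2, 1]
uniform_logits = [[0.0, 0.0, 0.0]] * 3
confident_logits = [[0.0, 0.0, 5.0], [0.0, 5.0, 0.0], [0.0, 0.0, 0.0]]

print(causal_lm_loss(uniform_logits, token_ids))
print(causal_lm_loss(confident_logits, token_ids))
```

A model that guesses uniformly over a vocabulary of size V has a loss of ln(V), which is a useful sanity check for the first logged training loss of a randomly initialized model.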

In the next section, we will validate the performance of our model.

## Validating the Model

In this section, we will validate the performance of our GPT-2 model. This involves using the model to generate some text and evaluating how coherent and grammatically correct the generated text is.

**Step 1: Define a Generation Function**

We first need to define a function that uses the model to generate some text given an input prompt.


In [None]:
from transformers import pipeline


def generate_text(prompt):
    generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)
    return generator(prompt, max_length=100, do_sample=True, temperature=0.9)

In the above code, we use the `pipeline` function from the `transformers` library to create a text generation pipeline with our model and tokenizer. We set the `max_length` parameter to 100, which caps the total length of the output, prompt included, at 100 tokens. The `do_sample` parameter is set to True, which means that the function will sample from the distribution of the next token rather than always choosing the most likely token. The `temperature` parameter controls the randomness of the sampling, with higher values leading to more random outputs.
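The effect of `temperature` can be seen directly on a small set of scores. This sketch uses made-up logits and a hypothetical helper `softmax_with_temperature`. It shows only the rescaling step, not the full generation loop.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical next-token scores
for t in (0.5, 0.9, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures flatten the distribution and make unlikely tokens more probable.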

**Step 2: Generate and Evaluate Text**

We can now use our function to generate some text and evaluate its quality.


In [None]:
# Generate some text
generated_text = generate_text("The history of artificial intelligence began in the")

# Print the generated text
print(generated_text[0]["generated_text"])

This will print the generated text. You should manually evaluate the text and check for coherence and grammar.

Remember that this is a qualitative evaluation and is subject to personal judgment. A more quantitative evaluation is to compute the perplexity of the model on a held-out validation set. This needs no manual labels, because in language modeling the targets are the next tokens of the text itself.
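Perplexity is the exponential of the average negative log-probability that the model assigns to the true next tokens. The sketch below computes it from made-up per-token probabilities. In practice these would come from running the model on held-out text.

```python
import math


def perplexity(token_probabilities):
    """exp of the average negative log-probability assigned to the true tokens."""
    nll = [-math.log(p) for p in token_probabilities]
    return math.exp(sum(nll) / len(nll))


# Hypothetical probabilities a model assigned to the correct next tokens
print(perplexity([0.25, 0.25, 0.25, 0.25]))
print(perplexity([0.9, 0.8, 0.95, 0.85]))
```

A perplexity of k can be read as the model being, on average, as uncertain as if it were choosing uniformly among k tokens, so lower is better.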

In the next section, we will save and load the model.

## Saving and Loading the Model

After training and validating the model, it is crucial to save the model's weights and configuration. This allows you to load the model later without the need to retrain it.

**Step 1: Save the Model**

We can use the `save_model` method of the `Trainer` class to save the model.


In [None]:
# Save the model
trainer.save_model("../tmp/gpt2")

This will save the model's weights and configuration to the specified directory.

**Step 2: Save the Tokenizer**

Remember to also save the tokenizer, as it is crucial for preprocessing the input data in the same way as during training.


In [None]:
# Save the tokenizer
tokenizer.save_pretrained("../tmp/gpt2")

**Step 3: Load the Model**

In a new session, we can load the model and the tokenizer using the `from_pretrained` method.


In [None]:
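from transformers import AutoTokenizer, GPT2LMHeadModel

# Load the model
model = GPT2LMHeadModel.from_pretrained("../tmp/gpt2")

# Load the tokenizer. It was saved as a fast tokenizer (tokenizer.json), so use
# AutoTokenizer rather than GPT2Tokenizer, which expects vocab.json and merges.txt.
tokenizer = AutoTokenizer.from_pretrained("../tmp/gpt2")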

With these steps, we have successfully saved and loaded a trained GPT-2 model. You can now use this model to generate text, fine-tune it on a specific task, or even continue training it.
