Let's build GPT: from scratch, in code, spelled out.
Introduction
Hi everyone. By now, you’ve probably heard of ChatGPT. It has taken the world and the AI community by storm: a system that lets you interact with an AI and give it text-based tasks. For example, you can ask ChatGPT to write a small haiku about the importance of understanding AI and its potential to improve the world. Here's what the AI generated:
"AI knowledge brings, Prosperity for all to see, Embrace its power."
Although the outputs might vary slightly due to the probabilistic nature of ChatGPT, the essence remains the same. This article aims to delve under the hood of what makes ChatGPT tick, focusing on the neural network component, especially the Transformer architecture.
Transformer Architecture
Understanding the Basics
ChatGPT is underpinned by a Transformer, a neural network architecture that comes from the landmark paper, “Attention is All You Need” published in 2017. This model innovatively uses attention mechanisms to process text data and is at the core of ChatGPT.
Building a Simple Transformer
To comprehensively understand the workings of ChatGPT, this guide aims to build a basic character-level language model using the Transformer architecture. Instead of training on an extensive dataset, we will use a smaller dataset derived from Shakespeare's works. This example will serve as a foundation to understand how these models function.
Preprocessing
First, we need to preprocess our data. We read in the entire Shakespeare dataset and tokenize the text into characters. We create an encoding mechanism for translating strings of text into integers and vice versa, and split the entire dataset into training and validation sets.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# Read the raw Shakespeare text and build a character-level vocabulary.
tokens = open('input.txt').read()
vocab = sorted(set(tokens))
token2idx = {c: i for i, c in enumerate(vocab)}  # character -> integer
idx2token = {i: c for i, c in enumerate(vocab)}  # integer -> character

# Hold out the last 10% of the text for validation.
train_data = tokens[:int(len(tokens) * 0.9)]
val_data = tokens[int(len(tokens) * 0.9):]

def encode(text):
    return torch.tensor([token2idx[c] for c in text], dtype=torch.long)

def decode(indices):
    # Accepts a list of ints or a 1-D tensor of token indices.
    return ''.join(idx2token[int(i)] for i in indices)

train_data_idx = encode(train_data)
val_data_idx = encode(val_data)
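The training loop later in the article iterates over dataloaders that yield (input, target) batches, where the target is simply the input shifted one character to the right. The lecture builds batches by hand; the snippet below is one hedged way to produce equivalent dataloaders with torch.utils.data, and the block_size (context length) and batch_size values are illustrative choices rather than the lecture's.
from torch.utils.data import Dataset, DataLoader

block_size = 64   # context length (illustrative choice)
batch_size = 32   # illustrative choice

class CharDataset(Dataset):
    # Serves (input, target) pairs where the target is the input shifted by one character.
    def __init__(self, data_idx):
        self.data_idx = data_idx
    def __len__(self):
        return len(self.data_idx) - block_size
    def __getitem__(self, i):
        chunk = self.data_idx[i:i + block_size + 1]
        return chunk[:-1], chunk[1:]

train_dataloader = DataLoader(CharDataset(train_data_idx), batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(CharDataset(val_data_idx), batch_size=batch_size)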
Model Architecture
Embeddings
We start by creating token embeddings and position embeddings to provide each character with a vector that also captures its position in the sequence.
class BiGramModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Each character in the vocabulary gets a learned 32-dimensional vector...
        self.token_embedding = nn.Embedding(len(vocab), 32)
        # ...and each of the block_size positions in the context window gets one as well.
        self.position_embedding = nn.Embedding(block_size, 32)
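The class above only declares the embedding tables. A forward pass for this model (a minimal sketch, not verbatim from the lecture) would sum the two embeddings, so that every character vector also encodes where the character sits in the context:
    def forward(self, idx):
        # idx has shape (B, T): a batch of B sequences of T token indices.
        B, T = idx.shape
        tok_emb = self.token_embedding(idx)                                    # (B, T, 32)
        pos_emb = self.position_embedding(torch.arange(T, device=idx.device))  # (T, 32)
        # Broadcasting adds the same positional vector to every sequence in the batch;
        # the attention blocks described next would operate on this sum.
        return tok_emb + pos_emb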
Self-Attention Mechanism
The self-attention mechanism is the core of the Transformer architecture. It lets the tokens in a sequence interact with each other and gather contextual information efficiently. In a GPT-style decoder, a causal mask restricts each token to attend only to itself and earlier tokens, which is what allows the model to be trained to predict the next character.
class SelfAttention(nn.Module):
    def __init__(self, embed_size, head_size):
        super().__init__()
        # Project the input into query, key and value vectors of size head_size.
        self.query = nn.Linear(embed_size, head_size)
        self.key = nn.Linear(embed_size, head_size)
        self.value = nn.Linear(embed_size, head_size)

    def forward(self, x):
        Q = self.query(x)  # (B, T, head_size)
        K = self.key(x)
        V = self.value(x)
        # Scaled dot-product attention scores between every pair of positions.
        attention_scores = torch.bmm(Q, K.transpose(1, 2)) / math.sqrt(Q.size(-1))
        # Causal mask: each position may only attend to itself and earlier
        # positions, so the model can be used autoregressively for generation.
        mask = torch.tril(torch.ones(x.size(1), x.size(1), device=x.device)).bool()
        attention_scores = attention_scores.masked_fill(~mask, float('-inf'))
        attention_weights = F.softmax(attention_scores, dim=-1)
        out = torch.bmm(attention_weights, V)  # weighted sum of value vectors
        return out
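A quick shape check (with illustrative sizes) shows what a single head does to a batch of embeddings:
# Illustrative sizes: a batch of 4 sequences, 8 positions, 32-dimensional embeddings.
head = SelfAttention(embed_size=32, head_size=16)
x = torch.randn(4, 8, 32)
print(head(x).shape)  # torch.Size([4, 8, 16])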
Multi-Head Attention
Instead of a single self-attention mechanism, a Transformer uses multi-head attention: several heads run in parallel, each in a smaller subspace, and their outputs are concatenated and projected back, allowing the model to focus on different parts of the sequence at once.
class MultiHeadAttention(nn.Module):
    def __init__(self, embed_size, num_heads):
        super().__init__()
        # Each head attends over the full input but produces a smaller
        # (embed_size // num_heads)-dimensional output.
        head_size = embed_size // num_heads
        self.heads = nn.ModuleList([SelfAttention(embed_size, head_size) for _ in range(num_heads)])
        # Project the concatenated head outputs back to embed_size.
        self.linear = nn.Linear(embed_size, embed_size)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # (B, T, embed_size)
        out = self.linear(out)
        return out
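The training code in the next section instantiates a Transformer model that combines everything above. The lecture builds this up in more detail; the following is only a minimal sketch of how the pieces could be assembled, with residual connections and a final projection to vocabulary logits added here for completeness (num_heads=4 is an illustrative default):
class Transformer(nn.Module):
    # Minimal sketch, not the lecture's exact code: embeddings, a stack of
    # multi-head attention layers with residual connections, and a head that
    # maps each position back to vocabulary logits.
    def __init__(self, embed_size, num_layers, num_heads=4):
        super().__init__()
        self.token_embedding = nn.Embedding(len(vocab), embed_size)
        self.position_embedding = nn.Embedding(block_size, embed_size)
        self.layers = nn.ModuleList(
            [MultiHeadAttention(embed_size, num_heads) for _ in range(num_layers)]
        )
        self.lm_head = nn.Linear(embed_size, len(vocab))

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.token_embedding(idx) + self.position_embedding(pos)
        for layer in self.layers:
            x = x + layer(x)  # residual connection around each attention layer
        return self.lm_head(x)  # (B, T, vocab_size) logits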
Training
We then train the model: we define the optimizer and loss function, set up the training loop, and evaluate on the validation set after each epoch.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Transformer is the model assembled from the components above; embed_size,
# num_layers and num_epochs are hyperparameters assumed to be set elsewhere.
model = Transformer(embed_size, num_layers).to(device)
optimizer = optim.Adam(model.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss()
for epoch in range(num_epochs):
    for inputs, targets in train_dataloader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        out = model(inputs)  # (B, T, vocab_size) logits
        # Flatten the batch and time dimensions for cross-entropy.
        loss = criterion(out.reshape(-1, out.size(-1)), targets.reshape(-1))
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        # Average the validation loss over all validation batches.
        val_loss = sum(
            criterion(model(xb.to(device)).reshape(-1, len(vocab)), yb.to(device).reshape(-1))
            for xb, yb in val_dataloader
        ) / len(val_dataloader)
    print(f'Epoch: {epoch}, Training Loss: {loss.item():.4f}, Validation Loss: {val_loss.item():.4f}')
Generation
Finally, we implement text generation: starting from an initial token, we repeatedly feed the sequence so far into the model, sample the next character from the predicted probability distribution, and append it to the sequence.
def generate(model, start_token, max_length):
    model.eval()
    generated_sequence = [start_token]
    input_seq = torch.tensor([[start_token]], dtype=torch.long, device=device)
    for _ in range(max_length):
        with torch.no_grad():
            # Crop the context to block_size and keep only the last position's logits.
            logits = model(input_seq[:, -block_size:])[:, -1, :]
        # Sample the next character from the predicted probability distribution.
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1).item()
        generated_sequence.append(next_token)
        input_seq = torch.cat((input_seq, torch.tensor([[next_token]], device=device)), dim=1)
    return decode(generated_sequence)
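For example, generation could be started from a newline character (present in the Shakespeare text) and run for a few hundred steps:
# Illustrative usage: generate 500 characters starting from a newline.
print(generate(model, token2idx['\n'], max_length=500))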
Conclusion
In summary, we embarked on a journey to understand the workings of GPT-like models, built a basic character-level Transformer, and covered essential concepts including embeddings, self-attention, multi-head attention, and token generation. While our model is tiny compared to real-world systems like GPT-3, it should serve as a solid starting point for further exploration.
Keywords
- ChatGPT
- Transformer Architecture
- Self-Attention
- Multi-Head Attention
- Pre-training
- Fine-tuning
- Text Generation
FAQ
Q1: What is the purpose of the positional embeddings in the Transformer model?
A1: Positional embeddings introduce a notion of order in the sequence of tokens. Transformers, by design, do not have a sense of position, so positional embeddings help retain this crucial sequential information.
Q2: Why is the self-attention mechanism crucial in the Transformer architecture?
A2: The self-attention mechanism allows the model to weigh the importance of different tokens in a sequence context. This mechanism enables the model to capture relationships between tokens irrespective of their distance in the sequence.
Q3: What is the difference between single-head and multi-head attention?
A3: Single-head attention processes tokens using a single attention mechanism, while multi-head attention runs multiple attention mechanisms in parallel. This allows capturing richer and more diverse contextual information from the sequence.
Q4: How do residual connections help in training deep neural networks?
A4: Residual connections provide gradient superhighways which enable smoother and more effective backpropagation through deep networks, thus mitigating the vanishing gradients problem and facilitating better optimization.
Q5: What is the significance of dropout and layer normalization in the Transformer architecture?
A5: Dropout serves as a regularization technique that prevents overfitting by randomly disabling neurons during training. Layer normalization ensures that activations maintain a stable distribution, aiding in the effective training of deep models.
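To make the last two answers concrete, the following is a minimal sketch (not code from the lecture) of a Transformer block that combines the multi-head attention defined earlier with residual connections, layer normalization, dropout, and a small feed-forward sub-layer:
class TransformerBlock(nn.Module):
    # Illustrative sketch: pre-norm residual block with multi-head attention,
    # a feed-forward network, and dropout for regularization.
    def __init__(self, embed_size, num_heads, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_size)
        self.attn = MultiHeadAttention(embed_size, num_heads)
        self.ln2 = nn.LayerNorm(embed_size)
        self.ff = nn.Sequential(
            nn.Linear(embed_size, 4 * embed_size),
            nn.ReLU(),
            nn.Linear(4 * embed_size, embed_size),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Residual connections: each sub-layer's output is added back to its input.
        x = x + self.dropout(self.attn(self.ln1(x)))
        x = x + self.dropout(self.ff(self.ln2(x)))
        return x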