Tokenization in NLP: From Basics to Advanced Techniques
Introduction
Thank you very much for having me; it's a pleasure to be here. Before we get started, can everyone hear me okay so that we're fine logistically? Perfect, alright. If at any point the sound breaks up or anything like that, please put it in the chat. I should be able to see the chat as we proceed, but someone from the team will also be monitoring it.
We have about an hour, and I believe this session is being recorded, so if you miss anything you can always rewatch it and reach out to me later over email or LinkedIn, whichever you're more comfortable with.
Before we get started, I thought I'd give you some pointers on what we're going to discuss in the next hour. My objective is to give you a brief rundown of embeddings and how they are useful in the context of large language models, but also in techniques like RAG. If I get some time I'll cover that; otherwise maybe we can host another webinar. There is also a lot of content on RAG on Data Domain Dojo, which you may like to refer to.
I'll share all the content I'm going to present; it's mostly a GitHub repo and a notebook. I'd also advise you to go through a couple of books I used while preparing this demo and the materials. With that being said, if you have any questions, feel free to ask; I can take them along the way or cover them after the session.
So, let's get started. I'm going to show you what I've learned and developed while reading the book Build a Large Language Model (From Scratch) by Sebastian Raschka. He is a very well-known author in machine learning, and I've been following his books and his university deep learning courses for quite some time. I strongly recommend going through them.
Building Large Language Models
What we are going to do now is see how you can create embeddings from scratch. We will also learn a few of the techniques used to train any large language model. I'll walk you through the code as we go along to make the best use of this hour.
First, let's import torch. We are using PyTorch here, but TensorFlow or any other framework could be used as well. We are also using tiktoken for tokenization; this is important and will be covered later.
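A minimal setup sketch, assuming the tokenization library mentioned in the talk is tiktoken (which provides the GPT2 byte pair encoder used later):

```python
# Setup sketch: PyTorch for tensors and embedding layers, tiktoken for BPE tokenization.
from importlib.metadata import version

import torch
import tiktoken

print("torch version:", version("torch"))
print("tiktoken version:", version("tiktoken"))

# tiktoken ships the GPT2 byte pair encoding we will use later in this walkthrough.
tokenizer = tiktoken.get_encoding("gpt2")
```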
The entire lifecycle of building a large language model from scratch is:
- Data preparation
- Building the large language model
- Training the model
- Model evaluation
- Fine-tuning
- Deploying the fine-tuned model, whether for classification, as a personal assistant, or for other text-related tasks.
We are primarily focusing on text today, as embeddings for images or videos differ slightly. Assume all our tasks are text-based for today.
Why Embeddings?
Large language models cannot comprehend raw data like text, video, or audio directly. We therefore need to convert this data into arrays of numbers, because numbers and arrays are what large language models understand. Our aim today is to see how we convert text into this format and why.
Let's take a simple example. In practice, embedding vectors have very high dimensionality, but when projected into two dimensions, words that appear in similar contexts end up close to each other. The angle between the vectors representing such words is small, which means a high cosine similarity. This is how the context of words gets encoded.
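As a toy illustration of that idea, here is cosine similarity in PyTorch. The 2-D vectors below are made up for the example, not real embeddings:

```python
import torch

# Hypothetical 2-D "embeddings" for illustration only; real embeddings have hundreds of dimensions.
king  = torch.tensor([0.90, 0.80])
queen = torch.tensor([0.85, 0.90])
apple = torch.tensor([0.10, 0.95])  # a word from a different context

cos = torch.nn.CosineSimilarity(dim=0)
print(cos(king, queen))  # close to 1.0 -> small angle, similar context
print(cos(king, apple))  # noticeably smaller -> different context
```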
Step-by-Step Tokenization Process
Using a text file, our simple tokenizer splits the text so that every non-word character can serve as a token delimiter, collects the unique tokens into a vocabulary, and maps each token to an integer.
Once we have the vocabulary with unique tokens mapped to integers, we need to add special tokens: one for unknown words, to handle out-of-vocabulary words, and one for end of text, to mark boundaries when multiple documents are concatenated.
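A minimal sketch of that step, splitting on non-word characters; the file name is a placeholder for whatever training text you use:

```python
import re

# Placeholder file name; substitute your own training text.
with open("sample.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

# Split on every non-word character while keeping the delimiters as tokens,
# and drop pure-whitespace pieces.
tokens = [t for t in re.split(r"(\W)", raw_text) if t.strip()]

# Unique tokens, sorted, each mapped to an integer ID: this is the vocabulary.
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
print(len(vocab), list(vocab.items())[:5])
```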
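Continuing the sketch above, the two special tokens can be appended to the vocabulary, and encoding can fall back to the unknown token for anything it has not seen. The token names follow a common convention; your repo may use different ones:

```python
# Extend the vocabulary with special tokens (names follow a common convention).
all_tokens = sorted(set(tokens)) + ["<|endoftext|>", "<|unk|>"]
vocab = {token: idx for idx, token in enumerate(all_tokens)}

def encode(text):
    """Tokenize text and map each token to its ID, using <|unk|> for unknown tokens."""
    items = [t for t in re.split(r"(\W)", text) if t.strip()]
    return [vocab.get(t, vocab["<|unk|>"]) for t in items]

def decode(ids):
    """Map IDs back to tokens (a simple join; spacing around punctuation is not restored)."""
    inv_vocab = {idx: token for token, idx in vocab.items()}
    return " ".join(inv_vocab[i] for i in ids)

print(encode("Hello, do you like tea?"))  # unseen words map to the <|unk|> ID
```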
Subword Text Encoding with Byte Pair Encoding
With byte pair encoding (BPE), we create subword tokens that keep the vocabulary at a practical size. Using a slide-by-slide illustration taken from an IIT Madras professor's lecture notes, we repeatedly find the most frequent pair of symbols and merge it; the number of merge operations controls the final vocabulary size.
GPT2's tokenizer is used to demonstrate BPE in practice. Its vocabulary has 50,257 tokens, reflecting its comprehensiveness and the number of merges it went through.
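A condensed sketch of that core BPE loop on a toy corpus (the low/lower/newest/widest word frequencies are the classic textbook illustration, not real data):

```python
from collections import Counter

# Toy corpus: word (as a tuple of symbols plus an end marker) -> frequency.
corpus = {("l", "o", "w", "</w>"): 5,
          ("l", "o", "w", "e", "r", "</w>"): 2,
          ("n", "e", "w", "e", "s", "t", "</w>"): 6,
          ("w", "i", "d", "e", "s", "t", "</w>"): 3}

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs across the corpus and return the most frequent one."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Merge every occurrence of `pair` into a single new symbol."""
    merged = {}
    for word, freq in corpus.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

for step in range(3):  # the number of merges controls the final vocabulary size
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print(f"merge {step + 1}: {pair}")
```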
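A short sketch of using the GPT2 BPE tokenizer via tiktoken (the sample sentence is arbitrary):

```python
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
print(tokenizer.n_vocab)  # 50257 tokens in the GPT2 vocabulary

text = "Tokenization turns raw text into integer IDs."
ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(ids)
print(tokenizer.decode(ids))  # round-trips back to the original text
```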
Preparing Data for Training
Here we prepare our data by generating input-target token-ID pairs for training using a sliding-window approach. We choose a context size, tokenize the text, and slide a window over the token IDs so that each target sequence is the input sequence shifted by one position; this is how the language model learns to predict the next token.
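A simplified sketch of that sliding window as a PyTorch dataset. The class name and the sample text are made up for illustration; the actual repo's implementation may differ:

```python
import torch
from torch.utils.data import Dataset, DataLoader
import tiktoken

class NextTokenDataset(Dataset):
    """Sliding-window dataset: each sample is (input IDs, target IDs shifted by one)."""

    def __init__(self, text, tokenizer, context_size, stride):
        token_ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
        self.inputs, self.targets = [], []
        for i in range(0, len(token_ids) - context_size, stride):
            self.inputs.append(torch.tensor(token_ids[i:i + context_size]))
            self.targets.append(torch.tensor(token_ids[i + 1:i + context_size + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

tokenizer = tiktoken.get_encoding("gpt2")
sample_text = "In the beginning was the word, and the word was enough text for a demo. " * 4
dataset = NextTokenDataset(sample_text, tokenizer, context_size=4, stride=4)
loader = DataLoader(dataset, batch_size=2, shuffle=False)
inputs, targets = next(iter(loader))
print(inputs.shape, targets.shape)  # e.g. torch.Size([2, 4]) each
```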
Embedding Tokens
Embedding layers transform token IDs into dense numerical vectors. These are augmented with positional embeddings so that the order of the words is retained. One principled approach, from the original Transformer paper, encodes positions with sine and cosine functions.
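A sketch of combining a token embedding layer with sine/cosine positional encodings. The dimensions and token IDs below are arbitrary for illustration; note that GPT-style models typically learn their positional embeddings instead of using the fixed sinusoidal ones shown here:

```python
import math
import torch

vocab_size, embed_dim, context_size = 50257, 256, 4  # illustrative sizes

token_embedding = torch.nn.Embedding(vocab_size, embed_dim)

def sinusoidal_positions(seq_len, dim):
    """Fixed sine/cosine positional encodings from the original Transformer paper."""
    positions = torch.arange(seq_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    enc = torch.zeros(seq_len, dim)
    enc[:, 0::2] = torch.sin(positions * div_term)
    enc[:, 1::2] = torch.cos(positions * div_term)
    return enc

token_ids = torch.tensor([[40, 367, 2885, 1464]])  # a batch of one sequence; IDs are arbitrary
embeddings = token_embedding(token_ids)            # shape: (1, 4, 256)
input_embeddings = embeddings + sinusoidal_positions(context_size, embed_dim)
print(input_embeddings.shape)                      # torch.Size([1, 4, 256])
```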
Resources
The complete implementation is in the GitHub repo. For further learning, I also encourage you to read my blog post on semantic search with embeddings and the more practical applications with data on Bedrock.
Keywords
- Tokenization
- Embeddings
- Large Language Models
- Byte Pair Encoding (BPE)
- Positional Embeddings
- GPT2
- PyTorch
- Data Preparation
FAQ
Q1: Why do we need to convert text into arrays of numbers? A1: Large language models cannot understand raw data like text, video, or audio. Instead, they work with arrays of numbers (embeddings) that capture the meaning and context of the input.
Q2: What is byte pair encoding, and why is it useful? A2: Byte pair encoding is a technique that converts text into subword tokens by iteratively merging the most frequent symbol pairs. It keeps the vocabulary at a practical size, which makes handling large datasets efficient without losing context.
Q3: How do positional embeddings work? A3: Positional embeddings encode each token's position in the sequence, for example using sine and cosine functions, so that the order of words in the text is preserved.
Q4: How is the vocabulary size determined? A4: The vocabulary size is a trade-off between representation accuracy and computational cost. Larger vocabularies improve context understanding but require more parameters and memory.
Q5: Where can I learn more about practical applications of embeddings? A5: You can explore my GitHub repo for practical implementations and read my blog post on semantic search using embeddings.