    How to Train an LLM Using InstructLab

    Hey everyone, this is Grant. Today, I want to show you a quick demo of our InstructLab project. By the end of the demo, you should have enough information to see how InstructLab can help you at your current job, with your company, or with your open-source project.


    Step 1: Serving the Large Language Model

    The first thing we’re going to do is serve the large language model. In this case, we're using the open-source, Apache 2.0-licensed Granite model. To serve this model, we’ll use the ilab command-line tool, pass in the model path, and point it at the model we want to begin serving—specifically, the 7-billion-parameter quantized version of it.

    Quantized models are compressed formats. Think of it like taking high-quality photos with a Canon DSLR camera and later sending compressed JPEG versions to your friends. You wouldn’t use these compressed images for print ads, but they’re sufficient for casual sharing and testing. Similarly, the quantized model is good for development but not for production.
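    To make the camera analogy concrete, here is a toy sketch of symmetric int8 quantization in Python. This is illustrative only: the Granite file uses a different, more sophisticated quantization format, and all names below are made up for this example.

```python
# Toy symmetric int8 quantization: store each weight as a small integer
# plus one shared float scale, and reconstruct approximate floats on load.

def quantize(weights):
    """Map floats onto int8 values in [-127, 127] using one scale factor."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Reconstruct approximate float weights from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.82, -1.54, 0.03, 1.27]
quantized, scale = quantize(weights)
restored = dequantize(quantized, scale)

# The restored weights are close to, but not exactly, the originals --
# the same trade-off as the compressed JPEG in the camera analogy.
```

    Storing 8-bit integers instead of 32-bit floats cuts memory roughly 4x, at the cost of the small reconstruction error above—fine for development, not for production.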

    ilab serve --model-path path/to/granite-model
    

    Step 2: Chatting with the Model

    To interact with the model, we set up a chat interface using the command:

    ilab chat -m path/to/granite-model
    

    This allows us to start a conversation. For instance, when asking "What is the InstructLab project?", the model might give a completely inaccurate but confident answer. This is known as a hallucination in the LLM space.

    Step 3: Creating a Taxonomy

    To correct this, we'll create a taxonomy—a directory structure with files that teach the model specific knowledge. Here’s a look inside our taxonomy directory:

    taxonomy/
      knowledge/
        instruct-lab-overview/
          q_and_a.txt
    

    Our q_and_a.txt file consists of simple question-and-answer pairs written in plain English:

    Q: What is InstructLab?
    A: InstructLab is an open-source, community-driven initiative...
    
    Q: How do you get started with InstructLab?
    A: You can get started by...
    

    Additionally, we include a README file from GitHub to back up the knowledge. This information will be used for training the model.
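    Because the format is just plain `Q:`/`A:` lines, it's easy to see how tooling can pick the pairs out. The helper below is hypothetical—written for illustration, not part of InstructLab:

```python
def parse_qna(text):
    """Split a q_and_a.txt-style document into (question, answer) pairs."""
    pairs, question = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question is not None:
            pairs.append((question, line[2:].strip()))
            question = None
    return pairs

sample = """\
Q: What is InstructLab?
A: InstructLab is an open-source, community-driven initiative...

Q: How do you get started with InstructLab?
A: You can get started by...
"""
pairs = parse_qna(sample)
```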

    Step 4: Generating Synthetic Data

    Next, we generate synthetic data with:

    ilab generate --num-instructions 10
    

    This uses the large language model to create additional Q&A pairs based on the initial examples and the README file. These pairs will be used to train the model, enhancing its knowledge and reducing hallucinations. There’s a critic model in place to ensure the generated data is accurate.
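    Conceptually, the generate step pairs a teacher model with a critic that filters out ungrounded output. A toy stand-in for that loop (stub functions in place of real LLM calls; nothing here is InstructLab's actual pipeline) looks something like:

```python
import random

def generator(seed_pairs):
    """Stand-in for the teacher LLM: paraphrase a random seed question."""
    q, a = random.choice(seed_pairs)
    return f"Rephrased: {q}", a

def critic(question, answer, source_text):
    """Stand-in for the critic model: keep only answers grounded in the source."""
    return answer.split()[0].lower() in source_text.lower()

seeds = [("What is InstructLab?",
          "InstructLab is an open-source, community-driven initiative...")]
readme = "InstructLab is a project for tuning LLMs with taxonomies."

# Generate candidates until 10 pass the critic's grounding check.
synthetic = []
while len(synthetic) < 10:
    q, a = generator(seeds)
    if critic(q, a, readme):
        synthetic.append((q, a))
```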

    Step 5: Training the Model

    Once we have the synthetic data, we train the model with:

    ilab train
    

    This process integrates the new knowledge into the model.
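    Before training, each Q&A pair has to be rendered into a text sample the trainer can consume. Here's a minimal sketch of that formatting step; the chat-template tokens below are invented for illustration and are not InstructLab's actual template:

```python
def to_training_example(question, answer):
    """Render one Q&A pair as a single instruction-tuning text sample."""
    return (
        "<|user|>\n" + question + "\n"
        "<|assistant|>\n" + answer
    )

examples = [
    to_training_example(q, a)
    for q, a in [
        ("What is InstructLab?",
         "InstructLab is an open-source, community-driven initiative..."),
    ]
]
```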

    Step 6: Serving the Trained Model

    We then serve the newly trained model:

    ilab serve --model-path path/to/new-trained-model
    

    Step 7: Verifying and Utilizing the Model

    Restart the chat interface to verify the new knowledge:

    ilab chat -m path/to/new-trained-model
    

    Ask specific questions to ensure the model has learned correctly:

    Q: What is the InstructLab project?
    

    The newly trained model should provide more accurate and detailed answers.
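    A quick, hypothetical way to spot-check the retrained model is to assert that its answers mention the facts you taught it. The keyword check below is a crude illustration; `model_answer` is a canned string standing in for the reply a real `ilab chat` session would return:

```python
def learned(answer, required_keywords):
    """Crude check: did the answer mention every fact we trained on?"""
    answer = answer.lower()
    return all(k.lower() in answer for k in required_keywords)

# Stand-in for the reply a real chat session would return.
model_answer = (
    "InstructLab is an open-source, community-driven initiative "
    "for improving large language models with taxonomies."
)
ok = learned(model_answer, ["open-source", "community", "taxonomies"])
```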

    Additionally, you can use the model for various tasks such as drafting emails, which are now informed by the additional knowledge it has been trained on.

    For example:

    Help me write an email to my boss John asking to spend a full day investigating the InstructLab project and highlight the benefits to our company from implementing the InstructLab project.
    

    Keywords

    • InstructLab
    • Granite model
    • Large Language Model (LLM)
    • Quantized Model
    • Synthetic Data Generation
    • Model Training
    • Taxonomy
    • Hallucination

    FAQ

    1. What is InstructLab? InstructLab is an open-source community-driven initiative to build the next generation of generative AI models.

    2. What is a quantized model? A quantized model is a compressed format of a large language model, useful for development but not recommended for production.

    3. What is synthetic data generation? Synthetic data generation uses an LLM to generate additional examples based on supplied question-and-answer pairs and backup data.

    4. How do I train a model using InstructLab? You can train a model using InstructLab by creating a taxonomy of knowledge, generating synthetic data, and then running the training commands provided by InstructLab.

    5. What is a taxonomy in this context? A taxonomy in this context refers to a directory structure containing files with question-and-answer pairs and other backup data, used to train the model on specific knowledge.

    6. What happens if the model gives inaccurate answers? The critic model in InstructLab helps to filter out hallucinations, but human oversight is typically required to ensure data accuracy before finalizing the training process.
