IBM and Red Hat recently introduced InstructLab, a new AI training method for large language and code models. InstructLab, which addresses drawbacks that previously hampered the development of these models, uses a process based on an April 2024 research paper by members of the MIT-IBM Watson AI Lab and IBM. In addition to being open and model-agnostic, InstructLab increases the performance of open-source models and overcomes scaling challenges seen in traditional LLM training.
One of InstructLab’s unique features is that it puts LLM development into the hands of the open-source developer community. Just as open-source software allows developers to collectively contribute code, merge changes and rapidly iterate on a program, InstructLab allows developers to collectively contribute new skills and knowledge to any LLM, rapidly iterating on and merging contributions to produce a single model that is improved by the entire community.
This community-style approach to merging contributions is made possible by IBM’s novel Large-scale Alignment for chatBots (LAB) instruction-tuning method. LAB uses taxonomy-guided synthetic data generation and a multiphase tuning framework to assimilate new knowledge and capabilities into a foundation model without overwriting what the model has already learned.
Community Contributions Improve Base Models
Before we get into the specifics of InstructLab, it’s important to provide some context for the technology. Currently, most large language models—the popular type of foundation model behind much of today’s generative AI boom—are created by training them with large amounts of diverse data, including documents, code, JSON files, books and other sources of information. If a model’s performance is unsatisfactory after its training phase, then more data can be added and additional training can be done until it achieves the desired performance.
Specialized generative AI models are created by fine-tuning pre-trained LLMs with smaller datasets tailored to specific use cases. Adjusting the model’s weights (i.e., how it prioritizes certain topics or types of information) may also be necessary to achieve the required level of specialization. The potential downside is that, even though fine-tuning improves performance for a particular subject, the additional tuning may dilute a model’s general knowledge and applicability.
A fine-tuned model is effectively a copy of the original foundation model plus the specialized adjustments. Each fine-tuned copy is specialized for a single use case, minus some of the LLM’s original general knowledge. This means that covering multiple use cases requires fine-tuning the foundation model separately for each one, and managing the resulting collection of models is complicated and costly, requiring additional monitoring, maintenance and updates.
The recent release of Llama 3 is a good example of how multiple models proliferate for different use cases. Within a few weeks of Llama 3’s release, more than 6,000 forks of it had appeared on Hugging Face. Because forks are rarely merged back into the base model, considerable community efforts like these never improve the base model itself.
By taking a different approach, InstructLab eliminates the need to build and maintain multiple models. It can turn a single foundation model into a collaborative effort that merges new knowledge and skills back into the base model.
How InstructLab Data Is Organized
InstructLab starts with a trustworthy open-source base model, such as an IBM Granite model available through watsonx or Hugging Face. InstructLab then adds specific domain knowledge and skills to give that model the desired capabilities.
InstructLab’s data is organized in a tree structure (shown in the diagram above) consisting of three main categories that define what the model will learn. By curating which information goes into the tree, developers control the model’s expertise and capabilities.
The way data is organized in InstructLab is called a taxonomy. In this structure, each layer of data is defined as a node. These are the three categories in the InstructLab taxonomy (a short code sketch after this list shows how such a tree can be inspected on disk):
- Knowledge data is divided into document types such as subject matter books, textbooks, technical instructions and manuals.
- Foundational skills include the math, coding, language and reasoning skills that the model needs before it can acquire more knowledge. This information is readily available in public datasets.
- Compositional skills relate to jobs or questions requiring knowledge and foundational skills. Complex tasks require multiple skills that combine areas of deep technical knowledge with cognitive skills. For example, an AI stock market tool needs knowledge of finance, economic behavior and historical trends. It also needs foundational skills in mathematics and statistical analysis.
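To make the taxonomy concrete, here is a minimal Python sketch that walks a local taxonomy directory and lists its data-bearing nodes. It assumes the InstructLab convention that each such node is a directory containing a qna.yaml file; the paths shown in the comment are hypothetical examples, not the project’s actual tree.

```python
from pathlib import Path

def find_leaf_nodes(taxonomy_root: str) -> list[Path]:
    """Return each data-bearing node: a directory holding a qna.yaml file."""
    return sorted(p.parent for p in Path(taxonomy_root).rglob("qna.yaml"))

# Hypothetical output for the stock-market example above:
#   taxonomy/compositional_skills/finance/stock_analysis
#   taxonomy/foundational_skills/math/statistics
#   taxonomy/knowledge/finance/markets
for node in find_leaf_nodes("taxonomy"):
    print(node)
```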
Instructions for adding skills and knowledge to four open-source foundation models can be found on the InstructLab community page.
Synthetic Data
InstructLab is primarily a synthetic data generation pipeline. Data generation is central to InstructLab and to AI in general. Synthetic data mimics the statistical properties of real data, and creating large amounts of diverse, computer-generated synthetic data is much cheaper than collecting human-annotated data from real-world situations. Beyond the lower cost, synthetic data is particularly useful when high-quality real-world data is scarce. It also allows iterative training that accommodates community contributions without overwriting existing learning.
Adding new knowledge or a missing skill to an InstructLab model requires creating a new node (called a “leaf node” in the taxonomy). The leaf node contains a handful of human-generated examples demonstrating what the model needs to learn.
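As a hedged illustration, the sketch below writes such a leaf node as a qna.yaml file with PyYAML. The field names (version, task_description, created_by, seed_examples) follow the compositional-skill schema as documented for InstructLab in mid-2024; treat them as assumptions and check the project’s current docs, since the schema is versioned and may have changed.

```python
import yaml  # third-party dependency: pip install pyyaml

# Minimal compositional-skill leaf node. Field names are assumptions based on
# the mid-2024 qna.yaml schema; the contributor handle and Q&A pair are made up.
leaf_node = {
    "version": 2,
    "task_description": "Summarize quarterly earnings reports in plain English.",
    "created_by": "your-github-handle",
    "seed_examples": [
        {
            "question": "Summarize: revenue rose 8% to $2.1B; net income fell 3%.",
            "answer": "Sales grew solidly, but profit slipped slightly.",
        },
        # InstructLab asks contributors for a handful of such examples per node.
    ],
}

with open("qna.yaml", "w") as f:
    yaml.safe_dump(leaf_node, f, sort_keys=False)
```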
Step-By-Step InstructLab Model Creation
Putting it all together, here is the step-by-step process for creating an InstructLab model; a toy Python sketch of the pipeline’s control flow follows the list. This process improves the model’s performance and incorporates new capabilities without causing the base model to lose general knowledge and without forking thousands of base-model versions.
- A curated taxonomy of knowledge and skills is assembled, tailored to the model’s needs.
- A few human-generated examples are used to show the model what kind of instructions it should generate at scale.
- A permissively licensed, safe, open-source LLM is selected for use as a “teacher” model. Its function is to use the human-generated examples to generate millions of question-and-answer samples for the taxonomies.
- A separate “critic” model (another role of the teacher model) analyzes the data for accuracy and quality. It also scans for prohibited material such as profanity or violence.
- By this point, the process will have created a clean dataset that fits the original prescription for the model’s capabilities. The new set of vetted synthetic questions and answers can be used to fine-tune the base model, first with new knowledge and then with new skills.
- After satisfying benchmarks for safety and utility, the model can be placed in service. Future updates to its skills and knowledge can be carried out through contributions by the community.
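The control flow of that pipeline can be sketched in a few lines of Python. This is a toy illustration only: teacher_generate and critic_accept are trivial stand-ins for the teacher and critic LLMs, and nothing here is InstructLab’s actual code.

```python
import random

# One human-written seed Q&A pair (in practice these come from a taxonomy leaf node).
SEEDS = [("What does compound interest mean?",
          "Interest earned on previously accumulated interest.")]

def teacher_generate(seed, n=5):
    """Stand-in for the teacher LLM, which expands a few seeds into millions
    of varied synthetic Q&A samples (not mere copies, as here)."""
    q, a = seed
    return [(f"{q} (variant {i})", a) for i in range(n)]

def critic_accept(qa):
    """Stand-in for the critic pass, which scores accuracy and quality and
    screens for prohibited material."""
    return random.random() > 0.2

synthetic = [qa for seed in SEEDS for qa in teacher_generate(seed)]
dataset = [qa for qa in synthetic if critic_accept(qa)]
print(f"kept {len(dataset)} of {len(synthetic)} synthetic samples")
# `dataset` would then drive phased fine-tuning: new knowledge first, then new skills.
```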
How InstructLab Stacks Up Against Traditional Methods
IBM announced the Granite-13b model in September 2023. Three months later, IBM researchers achieved a model-building breakthrough using a new alignment method (the one that became InstructLab) that significantly increased Granite-13b’s performance. The new method worked so well that the Granite-13b model (13 billion parameters) matched the performance of Meta’s much larger, high-performing Llama 2 model (70 billion parameters).
Similar performance increases were obtained when IBM researchers applied the new alignment method to other IBM open-source Granite models. Once again, those models scored higher for conversation and instruction-following abilities across various use cases.
Evolution Of Different Model Versions Using Identical Prompts
InstructLab has shown benefits in many areas. One example comes from comparing the output of different models given the same prompt, since running an identical prompt on different models yields outputs that can be compared directly. With that test in mind, IBM researchers ran an identical pair of prompts on Granite-13b-chat-v1 and Granite-13b-chat-v2, each trained with traditional methods, along with a model called Labrador that was created using the InstructLab methodology.
These two prompts were used:
- What does the company IBM do?
- Re-write your response in the style of a New York gangster from the 70’s
Each Granite version generated a unique response to the prompt. The original Granite-13b-chat-v1 responded with a simple, one-sentence answer. Granite-13b-chat-v2 was more advanced and provided a three-sentence response.
However, the Labrador version—the only version created with InstructLab—produced an outcome that was 25 sentences long and imitated a streetwise gangster persona using a distinctive stylized tone. It used phrases like “big time,” “got the juice” and “not afraid to tell you” to describe IBM in a gangster’s cadence.
The InstructLab technique clearly created a differentiated and more interesting version compared to the two models trained with standard methods. Besides fulfilling the requirements laid out in the pair of prompts, the InstructLab-created model gave a pretty accurate—and amusing—take on IBM itself.
IBM Code Models
While LLMs get much of the press in the current generative AI boom, other types of models are also important. Further validating IBM’s approach, InstructLab has produced excellent results when used with code models, as shown in the chart above. On the OctoPack benchmark tests, which cover eight diverse tasks, smaller IBM Granite models edged out Meta’s Code Llama models up to twice their size. This suggests another category of real-world use in which InstructLab could save model builders significant time and money.
A Better Way To Iterate LLMs
The InstructLab project represents a breakthrough for aligning pre-trained LLMs by putting the process into the hands of the open-source developer community. Models that are aligned under the InstructLab project will have the following differentiators:
- Improved performance
- Rapid iteration of models based on developer and community-based contributions
- Universal and standardized experience
Additionally, current training methods for LLMs require thousands of GPUs and months of training, whereas InstructLab can add skills or knowledge to these models with a small number of GPUs and complete retraining in less than a week. Initially, extra care must be taken to screen community-contributed skills and knowledge because the process is new and largely untested. Establishing and maintaining high standards is critical for InstructLab’s long-term success.
With these caveats in mind, IBM’s goal is to release new versions of the InstructLab models weekly, similar to how open-source software is updated. These frequent releases will enhance base models through continuous improvement.
IBM worked closely with its Red Hat unit to develop and deliver the RHEL AI offering, which is built using IBM’s licensed Granite models and the InstructLab method. IBM will rely on Red Hat to continue building open-source communities around RHEL AI.
Overall, I believe that InstructLab brings a unique set of advantages to developer- and community-driven, open-source LLM development. InstructLab will make LLM development far more accessible and will allow models to evolve faster and solve a broader range of problems, which benefits everyone in the AI ecosystem.