InstructLab.ai is the open-source implementation of the Large-scale Alignment for chatBots (LAB) concept described in the research paper of the same name. According to the paper’s abstract, LAB intends to overcome the scalability challenges in the instruction-tuning phase of large language models (LLMs). Its approach leverages a synthetic-data-based alignment tuning method, with hand-crafted taxonomies providing the seed examples for generating the synthetic training data.
The project promises to reduce the complexity and cost of fine-tuning LLMs by reducing the reliance on human-annotated data and proprietary models. The approach allows users to customise AI models without prior ML expertise or robust infrastructure. Taxonomy-guided synthetic data generation, combined with a multiphase tuning framework, enables the assimilation of new knowledge and capabilities into a foundation model while retaining existing knowledge. Thus, it promises to be an effective solution for enhancing LLM capabilities without the drawbacks of catastrophic forgetting.
A taxonomy is the tree-like structure in which the data is organised, with each node in the tree holding the data for one topic. There are three categories:
- Knowledge data – consisting of subject matter expertise like books, technical instructions and manuals
- Foundational skills – consisting of capabilities for additional knowledge acquisition like reasoning, math and coding skills
- Compositional skills – building on the previous two, covering tasks or questions that require both knowledge and foundational skills
Knowledge nodes in the taxonomy tree consist of qna.yaml files similar to those used for skills, but with additional elements. To contribute knowledge, users must create a Git repository (for example, one hosted on GitHub) containing markdown files detailing their knowledge contributions. The qna.yaml file includes parameters that point to these repositories, facilitating the integration of user-submitted content.
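As an illustration, a knowledge node’s qna.yaml might look roughly like the sketch below. The repository URL, file names and values are hypothetical, and the field names follow common InstructLab taxonomy conventions, which can differ between schema versions:

```yaml
# Hypothetical knowledge qna.yaml sketch -- field names follow common
# InstructLab taxonomy conventions; the exact schema varies by version.
version: 3
domain: astronomy                     # subject area of this node
created_by: your-github-username
seed_examples:                        # seed Q&A pairs for synthetic generation
  - context: |
      Phobos is the larger and innermost of the two moons of Mars.
    questions_and_answers:
      - question: Which is the larger moon of Mars?
        answer: Phobos is the larger of Mars's two moons.
document_outline: Basic facts about the moons of Mars
document:                             # the additional, knowledge-only elements
  repo: https://github.com/example-org/knowledge-docs  # placeholder repo
  commit: abc1234                     # pin an exact revision (placeholder)
  patterns:
    - moons-of-mars.md                # markdown files to draw information from
```

The document block is what distinguishes knowledge contributions from skills: it points the data-generation pipeline at the markdown sources in the contributor’s repository.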
InstructLab.ai intends to deliver on this by leveraging the community’s skills and knowledge. The targeted audience is broad, as a contribution consists of just a few lines of YAML in a qna.yaml file plus an attribution.txt file citing sources.
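For comparison, a compositional-skill contribution can be as small as the following sketch; the field names again follow common InstructLab conventions and may differ between releases, and all values are invented for illustration:

```yaml
# Hypothetical compositional-skill qna.yaml sketch.
version: 2
task_description: Summarise meeting notes in one friendly sentence.
created_by: your-github-username
seed_examples:
  - question: |
      Summarise: the team fixed the login bug, reviewed the release
      checklist, and agreed to ship on Friday.
    answer: The team squashed the login bug and is on track to ship Friday.
```

The accompanying attribution.txt typically lists details such as the title, link, and licence of each source used, so contributions remain traceable.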
The training concept can be applied to any chat model, regardless of origin. The InstructLab Granite-7b models are publicly available under an open-source Apache 2.0 license.
InstructLab regularly retrains its models using organised user contributions, improving capabilities through continuous community-driven updates.
InfoQ sat down with Leslie Hawthorn, director of industry community strategy at Red Hat’s Open Source Program Office, and Máirín Duffy, InstructLab’s engineering leader, to learn more about the project’s impact.
Hawthorn: I am delighted to see the potential for community-driven innovation that InstructLab presents. The possibilities are endless – from fine-tuning existing models for specific domains to creating new ones that tackle complex problems like language understanding or text generation. And who knows? Maybe someone will come up with a game-changing breakthrough that revolutionises the field! I’m excited to participate in this journey and see what the community comes up with.
Duffy: We aim to drive adoption of our tooling and model API standard, making it easier for developers to build upon and contribute to the ecosystem.
Although LLMs are accelerators in many industries, not every model is appropriate in every scenario, and adapting general-purpose models to particular use cases can be costly and time-consuming. InstructLab.ai pledges to democratise model customisation through its community-led process, and it published the first model built by its community just before Christmas.