Table of Links
Abstract and 1 Introduction
2. Related Work
3. LLaVA-Phi and 3.1. Training
3.2. Qualitative Results
4. Experiments
5. Conclusion, Limitation, and Future Works and References
Abstract
In this paper, we introduce LLaVA-ϕ (LLaVA-Phi), an efficient multi-modal assistant that harnesses the power of the recently advanced small language model, Phi-2, to facilitate multi-modal dialogues. LLaVA-Phi marks a notable advancement in the realm of compact multi-modal models. It demonstrates that even smaller language models, with as few as 2.7B parameters, can effectively engage in intricate dialogues that integrate both textual and visual elements, provided they are trained with high-quality corpora. Our model delivers commendable performance on publicly available benchmarks that encompass visual comprehension, reasoning, and knowledge-based perception. Beyond its strong performance in multi-modal dialogue tasks, our model opens new avenues for applications in time-sensitive environments and systems that require real-time interaction, such as embodied agents. It highlights the potential of smaller language models to achieve sophisticated levels of understanding and interaction, while maintaining greater resource efficiency. The project is available at https://github.com/zhuyiche/llava-phi.
1. Introduction
Large vision-language models, including Flamingo [1], GPT-4V [30], and Gemini [33], have exhibited remarkable proficiency in following instructions, engaging in multi-turn dialogues, and handling image-based question-answering tasks. The progression of open-source vision-language models has been significantly propelled by the rapid advancement of open-source large language models such as LLaMA [34] and Vicuna [5]. These developments primarily focus on language models with at least 7B parameters, integrated with a vision encoder to enhance visual comprehension. However, this approach often incurs higher inference latency, which is less than ideal for time-sensitive or real-time interactive applications such as autonomous driving and robotics. This leads to an important question: how effectively can small vision-language assistants perform in comparison?
Gemini [33] has blazed a trail for multi-modal models in mobile technology. Its streamlined variant, Gemini-Nano, comes in 1.8B and 3.25B parameter versions and is deployable on mobile devices. However, details of the model architecture, training data, and training methodology remain proprietary and inaccessible to the public. In the realm of small language models, there have been notable advancements: TinyGSM [23], with 2.6 billion parameters, achieves over 80% accuracy on the GSM8k [7] benchmark. Additionally, models such as Phi [13] have demonstrated capabilities in language understanding, commonsense reasoning, and code generation that rival larger language models like LLaMA-2-7B. This progress underscores the significant strides being made in the efficiency and effectiveness of smaller-scale language models.
In this paper, we introduce LLaVA-Phi, a compact vision-language assistant powered by a small language model. Our work combines the powerful open-sourced multi-modal model LLaVA-1.5 [24] with one of the best-performing open-sourced small language models, Phi-2 [21]. We follow a two-stage training pipeline and leverage high-quality visual instruction tuning data from LLaVA. LLaVA-Phi was evaluated across eight diverse benchmarks. Despite possessing only 3 billion parameters, it achieves performance comparable to, and in some cases surpassing, multi-modal models up to three times its size.
Notably, LLaVA-Phi-3B demonstrates exceptional proficiency in ScienceQA [28], outperforming existing large multi-modal models. Additionally, we qualitatively demonstrate LLaVA-Phi’s strong generalization ability in handling challenging questions, generating code based on instructions, and solving mathematical problems.
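To make the described composition concrete, the following is a minimal sketch, assuming a LLaVA-1.5-style design in which patch features from a CLIP ViT-L/14-336 encoder are mapped by a two-layer MLP projector into Phi-2's token-embedding space. The checkpoint names, projector shape, and helper functions are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Minimal sketch of a LLaVA-1.5-style composition around Phi-2.
# Checkpoint names, the two-layer MLP projector, and the helpers below
# are illustrative assumptions, not the authors' released code.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer, CLIPVisionModel


class VisionProjector(nn.Module):
    """Two-layer MLP mapping CLIP patch features into the LM embedding space."""

    def __init__(self, vision_dim: int, text_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        return self.net(patch_features)


# Vision encoder (CLIP ViT-L/14-336, as in LLaVA-1.5) and the 2.7B Phi-2 language model.
vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
phi2 = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

# In a two-stage recipe, the projector is typically tuned first (feature alignment),
# then trained jointly with the language model on visual instruction data.
projector = VisionProjector(vision_encoder.config.hidden_size, phi2.config.hidden_size)


def multimodal_embeddings(pixel_values: torch.Tensor, prompt: str) -> torch.Tensor:
    """Prepend projected image tokens to the prompt's token embeddings.

    Assumes a single image/prompt pair (batch size 1).
    """
    patches = vision_encoder(pixel_values).last_hidden_state[:, 1:]  # drop [CLS] token
    visual_tokens = projector(patches)                               # (1, N, text_dim)
    text_ids = tokenizer(prompt, return_tensors="pt").input_ids
    text_embeds = phi2.get_input_embeddings()(text_ids)              # (1, T, text_dim)
    return torch.cat([visual_tokens, text_embeds], dim=1)
```

The concatenated sequence can be passed to Phi-2 through its `inputs_embeds` argument, so the language model architecture itself needs no modification; only the projector bridges the two modalities.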
Authors:
(1) Yichen Zhu, Midea Group;
(2) Minjie Zhu, Midea Group and East China Normal University;
(3) Ning Liu, Midea Group;
(4) Zhicai Ou, Midea Group;
(5) Xiaofeng Mou, Midea Group.