Researchers from several Chinese institutions fine-tuned Llama-3.2-11B-Vision-Instruct to improve its ability to solve multimodal reasoning problems by going beyond direct-response or chain-of-thought (CoT) approaches to reason step by step in a structured way. Named LLaVA-CoT, the new model outperforms its base model and a number of larger models, including Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct, on several benchmarks.
According to the researchers, one reason why vision language models (VLMs) often hallucinate or produce errors is the lack of systematic and structured reasoning:
Specifically, by systematic, we mean that the model does not generate a direct reasoning chain but instead engages in multistage reasoning. Structured, on the other hand, refers to the model’s ability to clearly identify the reasoning stage it is in and understand the primary task to be addressed at each stage.
The approach taken by the authors consists of designing LLaVA-CoT so it reasons through four stages: a summary, where the model summarizes the task at hand; a caption, which describes the relevant parts of the image; reasoning, where the model analyzes the question; and a conclusion, which provides a final response based on the reasoning stage. In other words, the model first organizes the problem and all known information, then carries out a detailed thought process, and finally derives a conclusion.
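To make the staged structure concrete, here is a hypothetical sketch of what such a response could look like, assuming each of the four stages is delimited with an explicit tag; the tag names and the content are illustrative only and are not taken from the model's actual output.

```python
# Hypothetical staged response from LLaVA-CoT. The stage tags and the example
# content are illustrative assumptions, not the model's verbatim output format.
staged_response = """
<SUMMARY> The question asks how many red apples are visible in the image. </SUMMARY>
<CAPTION> The image shows a fruit bowl containing apples, bananas, and grapes. </CAPTION>
<REASONING> Counting only the apples with red skin and ignoring the green ones gives three. </REASONING>
<CONCLUSION> There are three red apples. </CONCLUSION>
"""
```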
To make this possible, the researchers constructed a dedicated dataset, LLaVA-o1-100k, by using GPT-4o to generate responses stage by stage. The custom dataset includes data from both general-purpose visual question answering (VQA) datasets and science-targeted VQA datasets. They then used the generated dataset to perform a full-parameter, supervised fine-tuning of Llama-3.2-11B-Vision-Instruct.
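As a rough illustration of stage-by-stage generation, the sketch below builds a single training record by prompting GPT-4o once per stage through the OpenAI chat completions API. The prompts, field names, and staging protocol are assumptions made for illustration; they are not the authors' released data pipeline.

```python
# Minimal sketch of building one LLaVA-o1-100k-style record by prompting GPT-4o
# once per stage. Prompts and the record schema are illustrative assumptions.
from openai import OpenAI

client = OpenAI()
STAGES = ["summary", "caption", "reasoning", "conclusion"]

def generate_staged_answer(image_url: str, question: str) -> dict:
    record = {"image": image_url, "question": question, "stages": {}}
    history = []  # earlier stages are fed back so each stage builds on the last
    for stage in STAGES:
        prompt = (
            f"You are producing the '{stage}' stage of a structured answer.\n"
            f"Question: {question}\n"
            f"Previous stages: {history}"
        )
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }],
        )
        text = response.choices[0].message.content
        record["stages"][stage] = text
        history.append(f"{stage}: {text}")
    return record
```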
Additionally, LLaVA-CoT uses a novel approach to efficient inference-time scaling. Instead of applying beam search at the sentence level, the model applies it at the stage level, generating multiple candidate results at each stage. The best candidate is then selected to continue the generation process at the next stage. According to the authors, inference-time scaling makes it possible for the model to arrive at a concrete answer during the reasoning process and retain it for the final stage. Without it, the model might have to guess at the final stage, possibly leading to incorrect results.
Stage-level beam search, which is made possible by the structured output design of [LLaVA-CoT], is an effective and powerful approach for inference time scaling.
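The following is a minimal sketch of how such a stage-level selection loop can be organized. The generate_stage and score_candidate helpers are hypothetical stand-ins for the model's sampling and candidate-selection steps, not the authors' implementation.

```python
# Sketch of stage-level beam search: sample several candidates for one stage at a
# time, keep the best, and only then move on to the next stage.
import random

STAGES = ["summary", "caption", "reasoning", "conclusion"]

def generate_stage(context: str, stage: str) -> str:
    """Placeholder for one sampled stage continuation from the VLM."""
    return f"<{stage.upper()}> ... sampled text ... </{stage.upper()}>"

def score_candidate(context: str, candidate: str) -> float:
    """Placeholder for the selection step (e.g., a model-based comparison of candidates)."""
    return random.random()

def stage_level_beam_search(question: str, n_candidates: int = 4) -> str:
    context = question
    for stage in STAGES:
        # Sample several candidates for the current stage only...
        candidates = [generate_stage(context, stage) for _ in range(n_candidates)]
        # ...keep the best one and append it before moving to the next stage.
        best = max(candidates, key=lambda c: score_candidate(context, c))
        context += "\n" + best
    return context
```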
To assess their approach, the researchers compared LLaVA-CoT's performance to that of its base model and other models. They found that LLaVA-CoT provides notable improvements over its base model across general VQA, mathematical reasoning, scientific VQA, and hallucination-control tasks. Additionally, LLaVA-CoT appears to outperform many open-source models of similar or even larger size, such as InternVL2-8B, Ovis1.5-Gemma2-9B, MiniCPM-V2.6-8B, Llama-3.2-90B-Vision-Instruct, and VILA-1.5-40B, as well as closed-source models such as GPT-4o-mini and Gemini-1.5-pro.
LLaVA-CoT is available on Hugging Face, while the LLaVA-o1-100k dataset will be made public in the future, the authors say. A web app is also available that lets users upload an image and start chatting about it.