:::info
Authors:
(1) Hugo Laurençon, Hugging Face and Sorbonne Université (the order was chosen randomly);
(2) Léo Tronchon, Hugging Face (the order was chosen randomly);
(3) Matthieu Cord, Sorbonne Université;
(4) Victor Sanh, Hugging Face.
:::
Table of Links
Abstract and 1 Introduction
2 Terminology
3 Exploring the design space of vision-language models and 3.1 Are all pre-trained backbones equivalent for VLMs?
3.2 How does the fully autoregressive architecture compare to the cross-attention architecture?
3.3 Where are the efficiency gains?
3.4 How can one trade compute for performance?
4 Idefics2 – an open state-of-the-art vision-language foundation model and 4.1 Multi-stage pre-training
4.2 Instruction fine-tuning and 4.3 Optimizing for chat scenarios
5 Conclusion, Acknowledgement, and References
A Appendix
A.1 Further experimental details of the ablations
A.2 Details of the instruction fine-tuning
A.3 Details of the evaluations
A.4 Red-teaming
Abstract
The growing interest in vision-language models (VLMs) has been driven by improvements in large language models and vision transformers. Despite the abundance of literature on this subject, we observe that critical decisions regarding the design of VLMs are often not justified. We argue that these unsupported decisions impede progress in the field by making it difficult to identify which choices improve model performance. To address this issue, we conduct extensive experiments around pre-trained models, architecture choice, data, and training methods. Our consolidation of findings includes the development of Idefics2, an efficient foundational VLM of 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks, and is often on par with models four times its size. We release the model (base, instructed, and chat) along with the datasets created for its training.
1 Introduction
Vision-language models (VLMs), which take images and texts as inputs and output texts, are useful for many tasks, such as retrieving information from a scanned PDF (Hu et al., 2024), explaining charts or diagrams (Carbune et al., 2024), transcribing the text in an image (Blecher et al., 2023), counting objects in a picture (Goyal et al., 2017), or turning screenshots of webpages into code (Laurençon et al., 2024). The development of powerful open large language models (Touvron et al., 2023; Jiang et al., 2023; Google, 2024b) and image encoders (Zhai et al., 2023; Sun et al., 2023; Radford et al., 2021) enables researchers to build upon these unimodal pre-trained models to create advanced VLMs that solve these problems with increasing accuracy (Dai et al., 2023; Liu et al., 2023; Bai et al., 2023; Lin et al., 2024, 2023; Li et al., 2024; Wang et al., 2024). Despite the progress in the field, the literature reveals many disparate design choices which are often not justified experimentally, or only very briefly.
This situation makes it challenging to distinguish which decisions truly account for model performance, thereby making it difficult for the community to make meaningful and grounded progress. For instance, Alayrac et al. (2022) and Laurençon et al. (2023) use interleaved Transformer-based cross-attentions to fuse the image information into the language model, while Li et al. (2023) and Liu et al. (2023) concatenate the sequence of image hidden states with the sequence of text embeddings and feed the concatenated sequence to the language model. To our knowledge, this choice has not been properly ablated, and the trade-offs in terms of compute, data efficiency, and performance are poorly understood. In this work, we aim to bring experimental clarity to some of these core design choices and pose the question: What matters when building vision-language models?
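To make the contrast concrete, the sketch below illustrates the two fusion strategies in PyTorch. The module names, dimensions, and the linear connector are illustrative assumptions for exposition only; they do not correspond to any specific implementation from the paper.

```python
import torch
import torch.nn as nn


class ConcatenationFusion(nn.Module):
    """Fully autoregressive style: project the image hidden states and prepend them
    to the text embeddings, then let the language model attend over the joint
    sequence (as in Li et al., 2023; Liu et al., 2023)."""

    def __init__(self, vision_dim: int, text_dim: int):
        super().__init__()
        # A simple linear connector mapping vision features into the LM embedding space.
        self.connector = nn.Linear(vision_dim, text_dim)

    def forward(self, image_hidden_states: torch.Tensor, text_embeddings: torch.Tensor) -> torch.Tensor:
        # image_hidden_states: (batch, num_image_tokens, vision_dim)
        # text_embeddings:     (batch, num_text_tokens, text_dim)
        image_tokens = self.connector(image_hidden_states)
        # The concatenated sequence is fed to the language model as ordinary input tokens.
        return torch.cat([image_tokens, text_embeddings], dim=1)


class CrossAttentionFusion(nn.Module):
    """Cross-attention style: keep the text sequence as-is and inject image
    information through interleaved cross-attention blocks (as in Alayrac et al., 2022;
    Laurençon et al., 2023)."""

    def __init__(self, text_dim: int, vision_dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=text_dim, num_heads=num_heads, kdim=vision_dim, vdim=vision_dim, batch_first=True
        )

    def forward(self, text_hidden_states: torch.Tensor, image_hidden_states: torch.Tensor) -> torch.Tensor:
        # Text tokens attend to the image features; the LM's own self-attention layers are untouched.
        fused, _ = self.cross_attn(
            query=text_hidden_states, key=image_hidden_states, value=image_hidden_states
        )
        return text_hidden_states + fused


# Example shapes: 64 image tokens of dimension 1152, 32 text tokens of dimension 4096.
images = torch.randn(1, 64, 1152)
texts = torch.randn(1, 32, 4096)
print(ConcatenationFusion(1152, 4096)(images, texts).shape)   # (1, 96, 4096)
print(CrossAttentionFusion(4096, 1152)(texts, images).shape)  # (1, 32, 4096)
```

In the concatenation variant, image tokens lengthen the sequence seen by the language model; in the cross-attention variant, the text sequence length is unchanged but extra attention blocks are added, which is part of the compute and data-efficiency trade-off discussed above.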
We identify two areas where various works adopt different design choices: (a) model architecture, and in particular the connector modules that fuse the vision and text modalities and their impact on inference efficiency, and (b) the multimodal training procedure and its impact on training stability. For each of these areas, we rigorously compare different design choices in a controlled environment and extract experimental findings. Notably, we find that (a) the progress of vision-language models is in large part driven by the progress of pre-trained unimodal backbones, (b) the more recent fully autoregressive architecture outperforms the cross-attention architecture, although it requires modifications to the optimization procedure to ensure stable training, (c) adapting the pre-trained vision backbone and the modules connecting the text and vision modalities allows both greater efficiency at inference time and handling images in their original aspect ratio and size without harming downstream performance, and (d) modifications to the image processing enable trading inference cost for downstream performance.
Our results are complementary to those presented in Karamcheti et al. (2024), McKinzie et al. (2024), and Lin et al. (2024), which derive insights about multi-stage training, selective unfreezing of the pre-trained backbones, data repetition, and the impact of the training mixture on zero- and few-shot performance. We specifically delve into unexplored aspects such as model architecture, training methods, stability, and efficiency improvements at inference.
Learning from these insights, we train Idefics2, a foundational VLM with 8 billion parameters. Idefics2 achieves state-of-the-art performance in its size category on various benchmarks while being more efficient at inference, for both the base and the fine-tuned versions. It is on par with state-of-the-art models 4 times larger on some vision-language benchmarks and matches the performance of Gemini 1.5 Pro on some challenging benchmarks. We release the base, instructed, and chat versions of Idefics2[1] as resources for the VLM community, along with the data created to train the model.
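For readers who want to try the released checkpoints, the snippet below is a minimal sketch of loading and prompting the instructed model with the Hugging Face transformers library. The repository name HuggingFaceM4/idefics2-8b and the exact processor calls are assumptions inferred from the collection in footnote [1] and may differ from the released artifacts.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Assumed checkpoint name from the Idefics2 collection (footnote [1]).
checkpoint = "HuggingFaceM4/idefics2-8b"

processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModelForVision2Seq.from_pretrained(
    checkpoint, torch_dtype=torch.float16, device_map="auto"
)

# A chat-style prompt interleaving an image placeholder with text.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

image = Image.open("example.jpg")  # any local image
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```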
:::info
This paper is available on arxiv under CC BY 4.0 DEED license.
:::
[1] https://huggingface.co/collections/HuggingFaceM4/idefics2-661d1971b7c50831dd3ce0fe