Can Smaller AI Outperform the Giants? | HackerNoon

News Room · Published 15 June 2025

:::info
Authors:

(1) Hugo Laurençon, Hugging Face and Sorbonne Université, (the order was chosen randomly);

(2) Léo Tronchon, Hugging Face (the order was chosen randomly);

(3) Matthieu Cord, Sorbonne Université;

(4) Victor Sanh, Hugging Face.

:::

Table of Links

Abstract and 1 Introduction

2 Terminology

3 Exploring the design space of vision-language models and 3.1 Are all pre-trained backbones equivalent for VLMs?

3.2 How does the fully autoregressive architecture compare to the cross-attention architecture?

3.3 Where are the efficiency gains?

3.4 How can one trade compute for performance?

4 Idefics2 – an open state-of-the-art vision-language foundation model and 4.1 Multi-stage pre-training

4.2 Instruction fine-tuning and 4.3 Optimizing for chat scenarios

5 Conclusion, Acknowledgement, and References

A Appendix

A.1 Further experimental details of the ablations

A.2 Details of the instruction fine-tuning

A.3 Details of the evaluations

A.4 Red-teaming

Abstract

The growing interest in vision-language models (VLMs) has been driven by improvements in large language models and vision transformers. Despite the abundance of literature on this subject, we observe that critical decisions regarding the design of VLMs are often not justified. We argue that these unsupported decisions impede progress in the field by making it difficult to identify which choices improve model performance. To address this issue, we conduct extensive experiments around pre-trained models, architecture choice, data, and training methods. Our consolidation of findings includes the development of Idefics2, an efficient foundational VLM of 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks, and is often on par with models four times its size. We release the model (base, instructed, and chat) along with the datasets created for its training.

1 Introduction

Vision-language models (VLMs), which take images and text as inputs and output text, are useful for many tasks, such as retrieving information from a scanned PDF (Hu et al., 2024), explaining charts or diagrams (Carbune et al., 2024), transcribing the text in an image (Blecher et al., 2023), counting objects in a picture (Goyal et al., 2017), or turning screenshots of webpages into code (Laurençon et al., 2024). The development of powerful open large language models (Touvron et al., 2023; Jiang et al., 2023; Google, 2024b) and image encoders (Zhai et al., 2023; Sun et al., 2023; Radford et al., 2021) enables researchers to build on these unimodal pre-trained models to create advanced VLMs that solve these problems with increasing accuracy (Dai et al., 2023; Liu et al., 2023; Bai et al., 2023; Lin et al., 2024, 2023; Li et al., 2024; Wang et al., 2024). Despite this progress, the literature reveals many disparate design choices that are often not justified experimentally, or only very briefly.

This situation makes it challenging to distinguish which decisions truly account for model performance, thereby making it difficult for the community to make meaningful and grounded progress. For instance, (Alayrac et al., 2022; Laurençon et al., 2023) use interleaved Transformer-based cross-attentions to fuse the image information into the language model, while (Li et al., 2023; Liu et al., 2023) concatenate the sequence of image hidden states with the sequence of text embeddings and feed the concatenated sequence to the language model. To our knowledge, this choice has not been properly ablated, and the trade-offs in terms of compute, data efficiency, and performance are poorly understood. In this work, we aim to bring experimental clarity to some of these core design choices and pose the question: What matters when building vision-language models?
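
To make the two fusion strategies concrete, here is a minimal PyTorch sketch contrasting them. The module names, dimensions, and single-block structure are illustrative assumptions rather than the exact architectures compared in the paper, where many such layers are interleaved inside a pre-trained language model.

```python
# Illustrative sketch (assumed shapes and names) of the two fusion strategies
# discussed above: cross-attention fusion vs. fully autoregressive fusion.
import torch
import torch.nn as nn

d_model = 1024                             # hypothetical hidden size shared by both modalities
image_seq = torch.randn(1, 64, d_model)    # image hidden states from a vision encoder
text_seq = torch.randn(1, 32, d_model)     # text token embeddings

# (a) Cross-attention fusion: the text stream queries the image hidden states
# through cross-attention layers inserted into the language model.
cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
fused, _ = cross_attn(query=text_seq, key=image_seq, value=image_seq)
text_with_vision = text_seq + fused        # residual connection, as in a Transformer block

# (b) Fully autoregressive fusion: project the image hidden states, concatenate
# them with the text embeddings, and feed the combined sequence to the
# decoder-only language model as ordinary tokens.
projector = nn.Linear(d_model, d_model)    # stand-in for the modality connector
visual_tokens = projector(image_seq)
combined = torch.cat([visual_tokens, text_seq], dim=1)  # (1, 64 + 32, d_model)

print(text_with_vision.shape, combined.shape)
```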

We identify two areas where various works adopt different design choices: (a) model architecture, and in particular the connector modules that fuse the vision and text modalities and their impact on inference efficiency, and (b) the multimodal training procedure and its impact on training stability. For each of these areas, we rigorously compare different design choices in a controlled environment and extract experimental findings. Notably, we find that (a) the progress of vision-language models is in large part driven by the progress of pre-trained unimodal backbones, (b) the more recent fully autoregressive architecture outperforms the cross-attention architecture, although it requires modifications to the optimization procedure to ensure stable training, (c) adapting the pre-trained vision backbone and the modules connecting the text and vision modalities allows for more efficiency at inference time on the one hand, and for handling images in their original ratio and size without harming downstream performance on the other, and (d) modifications to the image processing enable trading inference cost for downstream performance.
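
As a rough illustration of point (d), the helper below counts how many visual tokens a ViT-style encoder would produce at a given input resolution, and how a learned pooling step would shrink that count before the tokens reach the language model. The patch size, resolutions, and pooling factor are placeholder values chosen for illustration, not the settings used for Idefics2.

```python
# Back-of-the-envelope sketch (assumed values): how image resolution and
# token pooling trade inference cost for downstream performance.
def visual_token_count(image_size: int, patch_size: int = 14, pool_factor: int = 1) -> int:
    """Number of visual tokens a ViT-style encoder hands to the language model."""
    patches = (image_size // patch_size) ** 2
    return patches // pool_factor

for size in (224, 448, 896):                       # hypothetical input resolutions
    raw = visual_token_count(size)
    pooled = visual_token_count(size, pool_factor=4)
    print(f"{size}px -> {raw} tokens raw, {pooled} after 4x pooling")

# Longer visual sequences generally help fine-grained tasks (e.g. reading text in
# images) but make every pass through the language model more expensive; pooling
# and lower resolutions move the trade-off back toward cheaper inference.
```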

Our results are complementary to those presented in (Karamcheti et al., 2024; McKinzie et al., 2024; Lin et al., 2024), which derive insights about multi-stage training, selective unfreezing of the pre-trained backbones, data repetition, and the impact of the training mixture on zero- and few-shot performance. We specifically delve into unexplored aspects such as model architecture, training methods, stability, and efficiency improvements at inference.

Learning from these insights, we train Idefics2, a foundational VLM with 8 billion parameters. Idefics2 achieves state-of-the-art performance in its size category on various benchmarks while being more efficient at inference, for both the base and the fine-tuned version. It is on par with state-of-the-art models 4 times larger on some vision-language benchmarks and matches the performance of Gemini 1.5 Pro on some challenging benchmarks. We release the base, instructed, and chat versions of Idefics2[1] as resources for the VLM community, along with the data created to train the model.
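
The released checkpoints can be loaded through the standard Hugging Face transformers vision-to-text interface. The snippet below is a minimal usage sketch; the exact model IDs, prompt format, and processor behaviour should be checked against the model cards in the collection linked in footnote [1].

```python
# Minimal usage sketch for the released checkpoints (verify model IDs and the
# expected prompt format against the Hugging Face model cards).
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceM4/idefics2-8b"   # instructed variant; base and chat variants are also released
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("chart.png")          # placeholder image path
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What does this chart show?"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```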

:::info
This paper is available on arxiv under CC BY 4.0 DEED license.

:::

[1] https://huggingface.co/collections/HuggingFaceM4/idefics2-661d1971b7c50831dd3ce0fe
