Google DeepMind released PaliGemma 2, a family of vision-language models (VLMs). PaliGemma 2 is available in three model sizes and three input image resolutions, and it achieves state-of-the-art performance on several vision-language benchmarks.
PaliGemma 2 is an update of the PaliGemma family, which was released in 2024. It uses the same SigLIP-So400m vision encoder as the original PaliGemma, but upgrades the language model to Gemma 2. The PaliGemma 2 family contains nine different models, combining LLM sizes of 2B, 9B, and 27B parameters with vision encoders that accept 224, 448, and 896 pixel-square input images. The research team evaluated PaliGemma 2 on a variety of benchmarks and set new state-of-the-art records on several of them, including optical character recognition (OCR), molecular structure recognition, and radiography report generation. According to Google:
We’re incredibly excited to see what you create with PaliGemma 2. Join the vibrant Gemma community, share your projects to the Gemmaverse, and let’s continue to explore the boundless potential of AI together. Your feedback and contributions are invaluable in shaping the future of these models and driving innovation in the field.
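To make the nine-model grid concrete, the following minimal sketch enumerates the combinations of LLM size and input resolution described above. It also estimates how many image tokens each resolution yields, assuming the SigLIP-So400m encoder splits the image into 14x14-pixel patches and emits one token per patch; the patch size is not stated in this article, so treat the `PATCH_SIZE` value and the `image_tokens` helper as illustrative assumptions.

```python
from itertools import product

# LLM sizes and input resolutions as listed in the article.
LLM_SIZES = ("2B", "9B", "27B")
RESOLUTIONS = (224, 448, 896)

# Assumption: 14x14-pixel patches, one image token per patch.
PATCH_SIZE = 14


def image_tokens(resolution: int, patch_size: int = PATCH_SIZE) -> int:
    """Number of patches (image tokens) for a square input image."""
    patches_per_side = resolution // patch_size
    return patches_per_side * patches_per_side


# Every (LLM size, resolution) pair, with its estimated image token count.
variants = [
    (llm, res, image_tokens(res)) for llm, res in product(LLM_SIZES, RESOLUTIONS)
]

for llm, res, tokens in variants:
    print(f"Gemma 2 {llm} + {res}x{res} input -> {tokens} image tokens")
```

Under this assumption, each doubling of the resolution quadruples the number of image tokens the LLM has to process, which is one reason the higher-resolution variants cost more to run.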
PaliGemma 2 combines a pre-trained SigLIP-So400m image encoder with a Gemma 2 LLM. The combined model is then further pre-trained on a multimodal dataset of 1B examples. Besides the pre-trained base models, Google also released variants that were fine-tuned on the Descriptions of Connected and Contrasting Images (DOCCI) dataset, a collection of images and corresponding detailed descriptions. The fine-tuned variants can generate long, detailed image captions, which contain “more factually aligned sentences” than those produced by other VLMs.
Google created other fine-tuned versions for benchmarking purposes. The benchmark tasks included OCR, table structure recognition, molecular structure recognition, optical music score recognition, radiography report generation, and spatial reasoning. The fine-tuned PaliGemma 2 outperformed previous state-of-the-art models on most of these tasks.
The team also evaluated performance and inference speed for quantized versions of the model running on a CPU instead of a GPU. Reducing the model weights from full 32-bit to mixed-precision quantization showed “no practical quality difference.”
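The practical appeal of quantization for CPU inference is easiest to see with some rough arithmetic on weight storage. The sketch below assumes the LLM parameter counts quoted earlier in this article and a hypothetical average of 8 bits per weight after quantization; the article does not state the actual bit widths of the mixed-precision scheme, so the 8-bit figure is only a placeholder.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    Ignores activations, caches, and any per-block quantization metadata.
    """
    return num_params * bits_per_weight / 8 / 1e9


# LLM parameter counts as quoted in the article (vision encoder excluded).
LLM_PARAMS = {"2B": 2e9, "9B": 9e9, "27B": 27e9}

# Hypothetical average bit width after quantization (an assumption).
QUANTIZED_BITS = 8

for name, num_params in LLM_PARAMS.items():
    full = weight_memory_gb(num_params, 32)
    quantized = weight_memory_gb(num_params, QUANTIZED_BITS)
    print(f"{name}: {full:.0f} GB at 32-bit, {quantized:.0f} GB at {QUANTIZED_BITS}-bit")
```

Changing `QUANTIZED_BITS` shows how the footprint scales linearly with the average bit width.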
In a Hacker News discussion about the model, one user wrote:
Paligemma proves easy to train and useful in fine-tuning. Its main drawback was not being able to handle multiple images without being partly retrained. This new version does not seem to support multiple images as input at once. Qwen2vl does. This is useful for vision RAG typically.
Gemma team member Glenn Cameron wrote about PaliGemma 2 on X. In response to a question about using it to control a robot surgeon, Cameron said:
I think it could be taught to generate robot commands. But I wouldn’t trust it with such high-stakes tasks…Notice the name of the model is PaLM (Pathways Language Model). The “Pa” in PaliGemma stands for “Pathways”. It is named that because it continues the line of PaLI (Pathways Language and Image) models in a combination with the Gemma family of language models.
InfoQ previously covered Google’s work on using VLMs for robot control, including Robotics Transformer 2 (RT-2) and PaLM-E, a combination of their PaLM and Vision Transformer (ViT) models.
The PaliGemma 2 base models, fine-tuned versions, and a script for fine-tuning the base model are available on Hugging Face. Hugging Face also hosts a web-based visual question answering demo of a fine-tuned PaliGemma 2 model.
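For readers who want to try one of the released checkpoints, the following is a minimal sketch of captioning an image with the Hugging Face `transformers` library. The checkpoint name in `MODEL_ID`, the `caption_prompt` and `caption_image` helpers, and the prompt format are assumptions for illustration and are not taken from this article; check the model cards on Hugging Face for the exact identifiers and license terms, since access to the weights may require accepting them first.

```python
# Assumed checkpoint name; verify the exact identifier on Hugging Face.
MODEL_ID = "google/paligemma2-3b-pt-224"


def caption_prompt(language: str = "en") -> str:
    """Build a captioning prompt (assumed format: image placeholder plus task prefix)."""
    return f"<image>caption {language}"


def caption_image(image_path: str, max_new_tokens: int = 30) -> str:
    """Load the model and generate a caption for one local image file."""
    # Heavy dependencies are imported lazily so the helpers above stay lightweight.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = PaliGemmaForConditionalGeneration.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16
    )
    model.eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=caption_prompt(), images=image, return_tensors="pt")
    # Cast floating-point tensors (pixel values) to the model dtype.
    inputs = inputs.to(torch.bfloat16)
    prompt_length = inputs["input_ids"].shape[-1]

    with torch.inference_mode():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)

    # Drop the prompt tokens and decode only the generated continuation.
    return processor.decode(output[0][prompt_length:], skip_special_tokens=True)
```

Calling `caption_image("photo.jpg")` with a local image path (a hypothetical file name) would download the weights on first use and return the generated caption as a string.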