Hugging Face Inc. today open-sourced SmolVLM-256M, a new vision language model with the lowest parameter count in its category.
The algorithm’s small footprint allows it to run on devices with relatively limited processing power, such as consumer laptops. According to Hugging Face, it could run in browsers as well, thanks to the model’s support for WebGPU, a technology that lets AI-powered web applications use the graphics card in the user’s computer.
SmolVLM-256M lends itself to a range of tasks that involve processing visual data. It can answer questions about scanned documents, describe videos and explain charts. Hugging Face has also developed an instruction-tuned version of the model that tailors its output to user prompts.
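For a sense of what that looks like in practice, here is a minimal inference sketch using the Transformers library. The Hub identifier HuggingFaceTB/SmolVLM-256M-Instruct, the file name and the prompt are illustrative assumptions rather than details confirmed in the announcement:

```python
# Minimal sketch: asking SmolVLM-256M a question about a scanned document.
# The model ID below is an assumption based on Hugging Face's naming
# conventions; adjust it if the published checkpoint differs.
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-256M-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

image = Image.open("scanned_invoice.png")  # any local document image

# Build a chat-style prompt that interleaves the image with a question.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What is the total amount on this invoice?"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

generated = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```

Swapping in the identifier of the larger SmolVLM-500M variant discussed below should be the only change needed to trade a bigger memory footprint for higher output quality.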
Under the hood, SmolVLM-256M features 256 million parameters, a small fraction of the hundreds of billions of parameters in the most advanced foundation models. The lower a model’s parameter count, the less memory and compute it requires, which is why SmolVLM-256M can run on devices such as laptops.
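To put that in perspective, here is a back-of-the-envelope sketch of the memory the weights alone would occupy at common numeric precisions. The figures are approximate and exclude activations and runtime overhead:

```python
# Rough memory footprint of model weights at common precisions.
# Real-world usage adds activations, caches and framework overhead,
# so treat these numbers as lower bounds.
PARAMS_256M = 256_000_000
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}

for dtype, nbytes in BYTES_PER_PARAM.items():
    gib = PARAMS_256M * nbytes / 1024**3
    print(f"{dtype:>8}: ~{gib:.2f} GiB")

# float32: ~0.95 GiB, bfloat16: ~0.48 GiB, int8: ~0.24 GiB --
# comfortably within a consumer laptop's RAM, versus roughly
# 186 GiB for a 100-billion-parameter model even at bfloat16.
```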
The algorithm is the latest in a series of open-source vision language models released by Hugging Face. Compared with the company’s earlier models, one of the main improvements in SmolVLM-256M is a new encoder. This is a software module tasked with turning the images an AI processes into embeddings, mathematical representations that neural networks can work with more easily.
SmolVLM-256M’s encoder is based on an open-source AI called SigLIP base patch-16/512. That algorithm, in turn, builds on CLIP, an image processing model that OpenAI released in 2021. The encoder includes 93 million parameters, less than one-fourth as many as Hugging Face’s previous-generation encoder, which helped the company reduce SmolVLM-256M’s hardware footprint.
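The “patch-16/512” in the encoder’s name describes how it slices images before encoding them: inputs are processed at a 512-by-512-pixel resolution and divided into 16-by-16-pixel patches, each of which becomes one visual token. A quick sketch of that arithmetic:

```python
# How a patch-16/512 vision encoder tokenizes an image: a 512x512
# input cut into 16x16 patches yields the grid of visual tokens
# the transformer attends over.
image_size = 512   # input resolution, pixels per side
patch_size = 16    # pixels per side of each square patch

patches_per_side = image_size // patch_size   # 32
num_patches = patches_per_side ** 2           # 1,024 visual tokens
print(num_patches)  # -> 1024
```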
“As a bonus, the smaller encoder processes images at a larger resolution, which (per research from Apple and Google) can often yield better visual understanding without ballooning parameter counts,” Hugging Face engineers Andres Marafioti, Miquel Farré and Merve Noyan wrote in a blog post.
The company trained the AI on an improved version of a dataset it used to develop its previous-generation vision language models. To boost SmolVLM-256M’s reasoning skills, Hugging Face expanded the dataset with a collection of handwritten mathematical expressions. It also made other additions designed to hone the model’s document understanding and image captioning skills.
In an internal evaluation, Hugging Face compared SmolVLM-256M against a multimodal model with 80 billion parameters that it released 18 months ago. The former algorithm achieved higher scores across more than a half-dozen benchmarks. In a benchmark called MathVista that includes geometry problems, SmolVLM-256M’s score was more than 10% higher.
Hugging Face is rolling out the model alongside a second, more capable algorithm called SmolVLM-500M that features 500 million parameters. It trades off some hardware efficiency for higher output quality. According to Hugging Face, SmolVLM-500M is also better at following user instructions.
“If you need more performance headroom while still keeping the memory usage low, SmolVLM-500M is our half-billion-parameter compromise,” the company’s engineers wrote.
Hugging Face has uploaded the two models’ source code to its namesake AI project hosting platform.
Image: Unsplash