The Qwen Team recently open-sourced Qwen-Image, an image foundation model. Qwen-Image supports text-to-image (T2I) generation and text-image-to-image (TI2I) editing tasks, and outperforms other models on a variety of benchmarks.
Qwen-Image uses the Qwen2.5-VL vision-language model for text inputs, a Variational AutoEncoder (VAE) for image inputs, and a Multimodal Diffusion Transformer (MMDiT) for image generation. The combined model “excels” at text rendering, including both English and Chinese text. Qwen evaluated the model on a suite of T2I and TI2I benchmarks, including DPG, GenEval, GEdit, and ImgEdit, where it achieved the highest overall score. On image understanding tasks, while not as strong as specially trained models, Qwen-Image’s performance is “remarkably close” to theirs. In addition, Qwen created AI Arena, a comparison site where human evaluators can rate pairs of generated images. Qwen-Image currently ranks third, in competition with five high-quality closed models including GPT Image 1. According to Qwen:
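To make this pipeline concrete, here is a minimal text-to-image sketch using the Hugging Face diffusers library; the "Qwen/Qwen-Image" repository id, the prompt, and the sampling settings are assumptions for illustration rather than details confirmed by the article.

```python
import torch
from diffusers import DiffusionPipeline

# Load the released weights from the Hugging Face Hub; the "Qwen/Qwen-Image"
# repo id and the settings below are assumptions, not taken from the article.
pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Text-to-image: the prompt is encoded by the text encoder (Qwen2.5-VL), the MMDiT
# denoises in the VAE's latent space, and the VAE decoder produces the final image.
prompt = 'A coffee shop storefront with a neon sign that reads "Qwen-Image"'
image = pipe(prompt=prompt, width=1328, height=1328, num_inference_steps=50).images[0]
image.save("qwen_image_sample.png")
```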
Qwen-Image is more than a state-of-the-art image generation model—it represents a paradigm shift in how we conceptualize and build multimodal foundation models. Its contributions extend beyond technical benchmarks, challenging the community to rethink the roles of generative models in perception, interface design, and cognitive modeling…As we continue to scale and refine such systems, the boundary between visual understanding and generation will blur further, paving the way for truly interactive, intuitive, and intelligent multimodal agents.
To create the model’s training dataset, the Qwen Team “collected and annotated billions of image-text pairs,” with images from four main categories: nature, design, people, and “synthetic data.” Nature images make up about 55% of the data. Design, which includes images of paintings, posters, and GUIs, accounts for about 27% and includes many images with “rich textual elements.” This initial dataset was heavily filtered to remove low-quality images. The team also designed an annotation framework to generate detailed captions and metadata for each image.
Qwen-Image Model Architecture. Image Source: Qwen-Image Tech Report
The Qwen Team designed a pre-training curriculum with multiple strategies that progressively improved the model’s output. The first strategy progressively increased the training image resolution from 256×256 to 640×640 and finally to 1328×1328 pixels. The other strategies introduced images containing rendered text, images with a more varied distribution of domains and resolutions, and synthetic images with “surrealistic styles or…extensive textual content.”
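As a rough picture of this kind of staged curriculum, the sketch below lays out a hypothetical resolution and data-mix schedule; the stage compositions are invented for illustration, and only the three resolutions come from the report.

```python
# Hypothetical sketch of a progressive pre-training schedule; the data mixes are
# invented for illustration, only the three resolutions come from the report.
CURRICULUM = [
    {"resolution": (256, 256),   "data_mix": ["nature", "design", "people"]},
    {"resolution": (640, 640),   "data_mix": ["nature", "design", "people", "rendered_text"]},
    {"resolution": (1328, 1328), "data_mix": ["nature", "design", "people", "rendered_text", "synthetic"]},
]

for i, stage in enumerate(CURRICULUM, start=1):
    width, height = stage["resolution"]
    print(f"Stage {i}: train at {width}x{height} on {', '.join(stage['data_mix'])}")
```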
Finally, the model was post-trained in two stages. The first was supervised fine-tuning (SFT) on a dataset with “meticulous human annotation” to produce detailed and realistic images. The second was reinforcement learning (RL) using two different policy optimization strategies, in which the model produced multiple images for a prompt and human judges picked the best and worst.
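The article does not name the two policy optimization strategies, but best-versus-worst human judgments are commonly turned into a training signal with a pairwise preference objective. The sketch below shows a generic DPO-style loss over the chosen best and worst samples; it illustrates that general idea rather than the exact objective Qwen used, and it assumes per-image log-likelihoods are available for both the policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(logp_best, logp_worst, ref_logp_best, ref_logp_worst, beta=0.1):
    """Generic DPO-style loss: increase the policy's margin between the human-chosen
    'best' and 'worst' images relative to a frozen reference model.

    All inputs are (batch,) tensors of per-image log-likelihoods under the policy
    and the reference; beta controls how strongly the margin is enforced."""
    policy_margin = logp_best - logp_worst
    reference_margin = ref_logp_best - ref_logp_worst
    return -F.logsigmoid(beta * (policy_margin - reference_margin)).mean()

# Toy usage with random tensors standing in for real log-likelihoods.
batch = 4
loss = pairwise_preference_loss(torch.randn(batch), torch.randn(batch),
                                torch.randn(batch), torch.randn(batch))
print(loss.item())
```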
Hacker News users generally praised the model’s performance, comparing it to gpt-image-1. One user said of the release, “this seems huge.” Another wrote:
Besides style transfer, object additions and removals, text editing, manipulation of human poses, it also supports object detection, semantic segmentation, depth/edge estimation, super-resolution and novel view synthesis (NVS) i.e. synthesizing new perspectives from a base image. It’s quite a smorgasbord! Early results indicate to me that gpt-image-1 has a bit better sharpness and clarity but I’m honestly not sure if OpenAI doesn’t simply do some basic unsharp mask or something as a post-processing step?
The Qwen-Image code is available on GitHub, and the model files can be downloaded from Hugging Face.
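For readers who want a local copy of the weights, the snippet below uses the huggingface_hub client; the "Qwen/Qwen-Image" repository id is an assumption and should be checked against the official model page.

```python
from huggingface_hub import snapshot_download

# Download all files from the model repository into the local Hugging Face cache;
# the "Qwen/Qwen-Image" repo id is an assumption, verify it on the model page.
local_path = snapshot_download(repo_id="Qwen/Qwen-Image")
print(f"Model files downloaded to: {local_path}")
```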