DeepSeek has released Janus-Pro, an updated version of its multimodal model, Janus. The new model refines the training strategy, scales up the training data, and increases model size, improving both multimodal understanding and text-to-image generation.
Janus-Pro follows an autoregressive framework that separates the visual encoding pathways for multimodal understanding and generation while maintaining a single transformer architecture, which addresses stability and performance issues. This design increases flexibility and reduces conflicts between the visual encoder’s two roles, achieving performance competitive with task-specific models while keeping a unified structure. The model also incorporates synthetic aesthetic data to enhance text-to-image generation.
Janus-Pro improves multimodal understanding and visual generation performance. Multimodal understanding is measured using the average accuracy of POPE, MME-Perception (scaled), GQA, and MMMU. Visual generation is evaluated on GenEval and DPG-Bench. Janus-Pro outperforms previous unified multimodal models and some task-specific models.
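To make the averaged understanding metric concrete, here is a minimal sketch of how such a score could be computed. The divisor of 20 for MME-Perception is an assumption (it maps a score with a maximum of 2,000 onto a 0 to 100 range), and the input scores are hypothetical placeholders, not reported results.

```python
def understanding_average(pope: float, mme_perception: float,
                          gqa: float, mmmu: float,
                          mme_scale: float = 20.0) -> float:
    """Average four benchmark scores after rescaling MME-Perception.

    mme_scale=20.0 is an assumption: it brings MME-Perception
    (maximum 2000) onto the same 0-100 range as the other three.
    """
    scaled_mme = mme_perception / mme_scale
    return (pope + scaled_mme + gqa + mmmu) / 4


# Hypothetical placeholder scores, chosen only to show the arithmetic.
score = understanding_average(pope=80.0, mme_perception=1500.0,
                              gqa=60.0, mmmu=40.0)
print(score)  # 63.75
```

Rescaling before averaging keeps one benchmark with a larger numeric range from dominating the combined figure.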
The model is based on DeepSeek-LLM-1.5B and DeepSeek-LLM-7B. The larger model performs better on benchmarks like MMBench and GenEval. It uses SigLIP-L as its vision encoder and supports 384×384 image inputs. Image generation relies on a tokenizer with a downsampling rate of 16.
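The resolution and downsampling figures imply a fixed number of image tokens. The sketch below works through that arithmetic, assuming the generation tokenizer operates on 384×384 images and that the downsampling rate applies to each spatial dimension; the function name is illustrative and not part of the Janus-Pro code.

```python
IMAGE_SIZE = 384       # pixels per side, as stated in the article
DOWNSAMPLE_RATE = 16   # tokenizer downsampling rate, as stated in the article


def token_grid(image_size: int, downsample_rate: int) -> tuple[int, int]:
    """Return (tokens per side, total tokens) for a square image."""
    if image_size % downsample_rate != 0:
        raise ValueError("image size must be divisible by the downsampling rate")
    side = image_size // downsample_rate
    return side, side * side


side, total = token_grid(IMAGE_SIZE, DOWNSAMPLE_RATE)
print(side, total)  # 24 576
```

Under these assumptions, each image corresponds to a 24×24 grid, or 576 discrete tokens for the autoregressive model to predict.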
DeepSeek’s Janus-Pro-7B and OpenAI’s DALL-E 3 are both advanced text-to-image generation models. According to DeepSeek, Janus-Pro-7B outperforms DALL-E 3 on benchmarks such as GenEval and DPG-Bench. DeepSeek attributes this to Janus-Pro-7B’s improved training process, data quality, and model size, which contribute to more stable and detailed images.
The release of Janus-Pro has generated significant discussion online. Vedang Vatsa FRSA shared:
DeepSeek’s Janus-Pro-7B is here. Outperforms DALL-E 3 & Stable Diffusion on GenEval/DPG-Bench. Separates understanding/generation, scales data/models for stable image gen. Unified, flexible, cost-efficient. Open-source win!
AI expert Huzaifa Shoukat posted:
DeepSeek’s new Janus Pro model is impressive. It’s a multimodal LLM that understands images and generates them too. The 1B model runs in the browser using WebGPU via Transformers.js.
The Janus-Pro code is available on GitHub under the MIT License, while use of the models is governed by the DeepSeek Model License. Users can refer to the repository for setup instructions.