SINGAPORE, Aug. 13, 2025 /PRNewswire/ — The Skywork AI Technology Release Week officially kicked off on August 11. From August 11 to August 15, Skywork is releasing one new model each day for five consecutive days, covering cutting-edge models for core multimodal AI scenarios. Skywork has already launched the SkyReels-A3, Matrix-Game 2.0, and Matrix-3D models.
On August 13, the Skywork UniPic 2.0 model was officially open-sourced. UniPic 2.0 is an efficient training and inference framework for unified multimodal modeling: it pairs lightweight generation and editing modules with a multimodal understanding model for joint training. This equips it with the unified core capabilities of understanding, image generation, and editing, with the goal of achieving an "efficient, high-quality, and unified" multimodal generative model.
Skywork UniPic 2.0 and its model series are now fully open-source, with model weights, inference code, and optimization strategies all released, enabling developers and researchers to rapidly deploy and build multimodal applications.
Technical report:
https://github.com/SkyworkAI/UniPic/blob/main/UniPic-2/assets/pdf/UNIPIC2.pdf
GitHub:
https://github.com/SkyworkAI/UniPic/tree/main/UniPic-2
HuggingFace Gradio:
https://huggingface.co/spaces/Skywork/UniPic2-Metaquery
HuggingFace Model:
https://huggingface.co/Skywork/UniPic2-SD3.5M-Kontext-2B; https://huggingface.co/Skywork/UniPic2-Metaquery-9B
Skywork UniPic 2.0 consists of three core modules:
Image generation & editing: Based on the SD3.5-Medium architecture, the originally text-only model has been upgraded to process both text and image inputs simultaneously. Through training on high-quality image generation and editing datasets, its functionality has evolved from standalone image generation to integrated generation and editing capabilities.
Unified model capability: By freezing the image generation/editing module and leveraging a multimodal model (Qwen2.5-VL-7B) with a pre-trained connector, the framework gains integrated understanding, generation, and editing capabilities. Joint fine-tuning of the connector and the image generation/editing module then yields a unified model that handles all three seamlessly.
Post-training for image generation & editing: To boost overall performance, we have developed a Flow-GRPO-based progressive dual-task reinforcement strategy. This approach achieves collaborative optimization of generation and editing tasks without cross-interference, yielding performance gains beyond standard pre-training.
The upgraded Skywork UniPic 2.0 delivers the following key advantages:
Lightweight yet high-performance generation module:
Built on the 2B-parameter SD3.5-Medium architecture, our generation module surpasses competitors in both image generation and editing benchmarks, including models like Bagel (7B params), OmniGen2 (4B params), UniWorld-V1 (12B params), and Flux-Kontext.
Enhanced reinforcement learning capability:
Our groundbreaking Flow-GRPO-based progressive dual-task reinforcement strategy significantly enhances the model's ability to interpret complex instructions and maintain consistency across image generation and editing tasks, all while enabling collaborative optimization without cross-task interference.
Unified architecture with scalable adaptation:
The system features seamless end-to-end integration of the Kontext image generation/editing model with multimodal architectures. Through lightweight connector fine-tuning, users can rapidly deploy unified understanding-generation-editing models while further improving both generation and editing performance.
The UniPic2-SD3.5M-Kontext model achieves remarkable performance despite its compact 2B-parameter size. In comprehensive benchmarks, it surpasses both Flux.dev (12B parameters) in image generation metrics and Flux-Kontext (12B parameters) in editing performance. Furthermore, it outperforms nearly all existing unified models, including UniWorld-V1 (19B parameters) and Bagel (14B parameters), across both generation and editing tasks.
When extended into the unified UniPic2-Metaquery architecture, the model demonstrates additional performance gains, showcasing exceptional scalability beyond its already impressive baseline capabilities.
Skywork UniPic 2.0's exceptional understanding, generation, and editing capabilities are powered by the Skywork team's groundbreaking optimizations across all training stages, from pre-training and co-training to post-training refinement.
Pre-Training (image generation/editing model)
SD3.5-Medium was initially trained to synthesize images from both textual instructions and reference images while preserving its original architecture. The system processes text inputs (encoded into instruction representations via the text encoder) and reference images (compressed into latent variables by the VAE and projected as context tokens). These components are then concatenated with the target image’s noise tokens into a unified sequence, where the model’s inherent positional encoding maintains clear differentiation between reference and target tokens. This methodology retains SD3.5M’s native structure while simultaneously enabling both text-to-image (T2I) generation and text-conditioned image editing (I2I).
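For illustration, here is a minimal sketch of that sequence construction. The module names (text_encoder, vae, patch_embed) and the diffusers-style VAE call are placeholders we assume for readability, not Skywork's released code:

```python
import torch

def build_unified_sequence(text_encoder, vae, patch_embed,
                           instruction, ref_image, noisy_latent):
    """Concatenate instruction tokens, reference-image context tokens,
    and the target image's noise tokens into one DiT input sequence."""
    # Encode the textual instruction into conditioning tokens.
    text_tokens = text_encoder(instruction)                 # (B, T_text, D)

    # Compress the reference image into VAE latents, then patchify them
    # into "context tokens" in the DiT's token space (diffusers-style
    # encode() call is an assumption about the VAE interface).
    ref_latent = vae.encode(ref_image).latent_dist.mode()   # (B, C, h, w)
    ref_tokens = patch_embed(ref_latent)                    # (B, T_ref, D)

    # The noised target latent becomes the sequence's noise tokens.
    noise_tokens = patch_embed(noisy_latent)                # (B, T_tgt, D)

    # One concatenated sequence; SD3.5M's native positional encoding keeps
    # reference and target tokens distinguishable with no architecture change.
    return torch.cat([text_tokens, ref_tokens, noise_tokens], dim=1)
```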
Joint-Training
Starting from our pre-trained image generation/editing model, we implement the Metaquery framework to achieve cross-modal alignment between Qwen2.5-VL (multimodal) and the image synthesis model, thereby creating a unified architecture. This integration is achieved through two key processes:
Connector pre-training
We substituted SD3.5M’s original T5 text encoder with Qwen2.5-VL and a Connector, maintaining frozen weights in both Qwen2.5-VL and SD3.5M’s DiT backbone. The Connector underwent pre-training on 100M+ curated image-generation samples to establish precise feature-space alignment between Qwen2.5-VL’s transformed outputs (via the Connector) and SD3.5M’s DiT input expectations.
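A minimal sketch of this freezing setup, assuming PyTorch-style modules (all names and the learning rate are illustrative, not taken from the released code):

```python
import torch

def setup_connector_pretraining(qwen_vl, connector, dit):
    # Freeze the multimodal backbone and SD3.5M's DiT; only the connector
    # learns to map Qwen2.5-VL features into the feature space the DiT
    # previously received from the T5 text encoder.
    for module, trainable in ((qwen_vl, False), (dit, False), (connector, True)):
        for p in module.parameters():
            p.requires_grad = trainable
    # Optimizer hyperparameters here are placeholders, not reported values.
    return torch.optim.AdamW(connector.parameters(), lr=1e-4)
```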
Joint SFT training
Following connector pre-training, we replaced SD3.5M with the pre-trained UniPic2-SD3.5M-Kontext model (specialized in image generation/editing), then unfroze both the connector and UniPic2-SD3.5M-Kontext parameters. Using high-quality generation and editing datasets, we jointly trained the connector and Kontext model to achieve optimal unified performance. The resulting UniPic2-Metaquery model not only preserves the base multimodal model's comprehension capabilities but also exhibits superior generation and editing performance compared to the standalone Kontext model.
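The corresponding joint SFT setup might look like the following sketch, where both components are made trainable (names and hyperparameters are again illustrative assumptions):

```python
import torch

def setup_joint_sft(connector, kontext_dit):
    # Both the connector and the Kontext model train in this stage; keeping
    # the Qwen2.5-VL backbone frozen is our assumption, consistent with the
    # text stating its comprehension capabilities are preserved.
    params = list(connector.parameters()) + list(kontext_dit.parameters())
    for p in params:
        p.requires_grad = True
    return torch.optim.AdamW(params, lr=1e-5)  # placeholder learning rate
```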
Post-training: Multi-task reinforcement learning for concurrent generation/editing enhancement
Traditional multi-task RL often faces performance trade-offs, where optimizing one task compromises another. To overcome this limitation, we pioneered a progressive Flow-GRPO-based dual-task reinforcement strategy that achieves breakthrough concurrent optimization of text-to-image generation and image editing within a unified architecture. This represents the first demonstrated instance of interference-free, synergistic task improvement in multimodal model development.
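For intuition only, here is a heavily simplified, hypothetical sketch of a Flow-GRPO-style group-relative update with a progressive dual-task schedule; the alternation rule, function signatures, and hyperparameters are our assumptions, not the published algorithm:

```python
import torch

def grpo_loss(model, prompt, sample_fn, reward_fn, group_size=8):
    # Sample a group of outputs for one prompt and score each with the
    # task-specific reward model.
    samples = [sample_fn(model, prompt) for _ in range(group_size)]
    rewards = torch.tensor([reward_fn(prompt, img) for img, _ in samples])
    # Group-relative advantage (the "GR" in GRPO): normalize within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    # REINFORCE-style surrogate over the log-probabilities of the sampled
    # flow trajectories (sample_fn must return differentiable log-probs).
    logps = torch.stack([lp for _, lp in samples])
    return -(adv.detach() * logps).mean()

def progressive_dual_task_rl(model, t2i_task, edit_task, steps, warmup):
    # t2i_task / edit_task: (prompt_iterator, sample_fn, reward_fn) triples.
    opt = torch.optim.AdamW(model.parameters(), lr=1e-6)
    for step in range(steps):
        # Progressive schedule (our assumption): stabilize text-to-image
        # first, then interleave editing so neither task degrades the other.
        task = t2i_task if step < warmup or step % 2 == 0 else edit_task
        prompts, sample_fn, reward_fn = task
        loss = grpo_loss(model, next(prompts), sample_fn, reward_fn)
        opt.zero_grad()
        loss.backward()
        opt.step()
```

In the real system, the rewards would come from task-specific reward models and the log-probabilities from the flow model's sampled denoising trajectories, both of which the sketch abstracts behind sample_fn and reward_fn.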
As a pioneer in AI technology, Skywork continues to redefine the frontiers of artificial intelligence. In recent months, we have open-sourced multiple state-of-the-art foundation models that established new industry standards, including SkyReels-V1: the first video generation model specialized for AI-driven short film production; SkyReels-V2: the world’s first unlimited-duration cinematic generation model employing a diffusion-forcing framework; and SkyReels-A3: an audio-driven portrait video generation model.
In multimodal AI development, Skywork has introduced two groundbreaking advancements: (1) the Skywork-R1V series, a 38B-parameter multimodal reasoning model that effectively bridges textual and visual reasoning while matching the performance of significantly larger proprietary models, and (2) pioneering spatial intelligence systems including the Matrix-Game 2.0 interactive world model and Matrix-3D generative world model.