🎨 From Sparse to Dense: Multi-View GRPO for Flow Models

via Augmented Condition Space

Jiazi Bu1,2,6* Pengyang Ling3* Yujie Zhou1,6* Yibin Wang4,9 Yuhang Zang6

Tianyi Wei2 Xiaohang Zhan7 Jiaqi Wang9 Tong Wu8† Xingang Pan2† Dahua Lin5,6,10

1 Shanghai Jiao Tong University    2 S-Lab, Nanyang Technological University
3 University of Science and Technology of China    4 Fudan University    5 The Chinese University of Hong Kong
6 Shanghai AI Laboratory    7 Adobe Research    8 Stanford University    9 Shanghai Innovation Institute    10 CPII under InnoHK
(* Equal Contribution   † Corresponding Author)

[Paper]      [Code (Coming Soon)]


Abstract

Group Relative Policy Optimization (GRPO) has emerged as a powerful framework for preference alignment in text-to-image (T2I) flow models. However, we observe that the standard paradigm, in which a group of generated samples is evaluated against a single condition, suffers from insufficient exploration of inter-sample relationships, constraining both alignment efficacy and performance ceilings. To address this sparse single-view evaluation scheme, we propose Multi-View GRPO (MV-GRPO), a novel approach that enhances relationship exploration by augmenting the condition space to create a dense multi-view reward mapping. Specifically, for a group of samples generated from one prompt, MV-GRPO leverages a flexible Condition Enhancer to generate semantically adjacent yet diverse captions. These captions enable multi-view advantage re-estimation, capturing diverse semantic attributes and providing richer optimization signals. By deriving the probability distribution of the original samples conditioned on these new captions, we can incorporate them into the training process without costly sample regeneration. Extensive experiments demonstrate that MV-GRPO achieves superior alignment performance over state-of-the-art methods.

Observation and Motivation

Existing flow-based GRPO methods typically operate under a ''Single-View'' paradigm: they evaluate the generated group solely against the single initial condition. This reward evaluation protocol can be reinterpreted as a sparse, one-to-many mapping from the condition space \( \mathcal{C} \) to the data space \( \mathcal{X} \), as shown in Figure 1(a). Fundamentally, this paradigm models intra-group relationships by ranking samples based on their alignment with a singular condition, ignoring the multifaceted nature of visual semantics. For instance, as illustrated in Figure 2, an SDE sample depicting a cat and a dog within a teacup may rank poorly under one condition (''A cat and a dog in a teacup.'') but highly under another similar condition specifying visual attributes like lighting, motion, or composition. Consequently, relying solely on the ranking derived from a single prompt is insufficient to gauge the nuanced relationships among samples, resulting in an inherently sparse reward mapping. In contrast, by incorporating the diverse rankings induced by novel prompts, we can effectively densify the condition-data reward signal, as depicted in Figure 1(b). This strategy serves dual purposes: (i) enabling a more comprehensive exploration of intra-group relationships from multiple perspectives, and (ii) establishing intrinsic contrasts by identifying ranking shifts across different conditions, thereby facilitating preference-aligned generation under various conditions.
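The ranking-shift observation can be illustrated with a toy example. The reward values below are hypothetical scores (e.g., from a preference reward model such as HPS-v3), not numbers from the paper; they merely show how the same group of samples can order differently under the original prompt versus an augmented caption:

```python
import numpy as np

# Toy reward matrix: rows = condition views, columns = samples in the group.
# Hypothetical scores for illustration only.
rewards = np.array([
    [0.62, 0.71, 0.55, 0.48],  # original prompt: "A cat and a dog in a teacup."
    [0.58, 0.64, 0.73, 0.51],  # augmented caption adding lighting/composition detail
])

# Per-condition ranking of samples, best first.
ranks = np.argsort(-rewards, axis=1)
print(ranks[0].tolist())  # ranking under the original prompt:  [1, 0, 2, 3]
print(ranks[1].tolist())  # ranking under the augmented view:   [2, 1, 0, 3]
```

Here sample 2 is third-best under the original prompt yet best under the augmented view — exactly the kind of inter-sample relationship a single-view evaluation cannot surface.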

Figure 1: Reward Evaluation in GRPO Training.

Figure 2: Reward Ranking Varies with Conditions.

Methodology

Given a group of SDE samples, MV-GRPO first leverages a flexible Condition Enhancer module (a pretrained VLM or LLM) to generate diverse augmented conditions. These augmented descriptors, along with the original condition, form a multi-view condition cluster for dense condition-data reward signals, facilitating comprehensive advantage estimation. We justify optimizing the policy conditioned on an augmented view using trajectories generated under the original condition in the Theoretical Perspective part of our paper.
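The multi-view advantage estimation described above can be sketched as follows. This is a minimal illustration, assuming GRPO's standard group normalization \((r - \mu)/\sigma\) applied independently per condition view and a simple average across views; the paper's exact aggregation and weighting may differ, and `multiview_advantages` is a hypothetical helper name:

```python
import numpy as np

def multiview_advantages(rewards: np.ndarray) -> np.ndarray:
    """Estimate advantages from a dense condition-data reward matrix.

    rewards: (V, G) array — V condition views (original + augmented
    captions) by G samples in the group. Each row is group-normalized
    as in standard GRPO, then the per-view advantages are averaged.
    Returns a (G,) advantage vector.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True) + 1e-8  # avoid division by zero
    per_view = (rewards - mean) / std                # (V, G) per-view advantages
    return per_view.mean(axis=0)                     # aggregate across views

# Example: 2 views, 4 samples (hypothetical reward-model scores).
rewards = np.array([
    [0.62, 0.71, 0.55, 0.48],
    [0.58, 0.64, 0.73, 0.51],
])
adv = multiview_advantages(rewards)
print(adv)  # positive for samples favored across views, near-zero mean overall
```

Because the trajectories are generated once under the original condition, the augmented views only require re-evaluating rewards and policy likelihoods on the existing samples — no regeneration — which is what keeps the densified signal cheap.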

Figure 3: Overview of MV-GRPO.

Qualitative Comparison

We present qualitative comparisons with existing flow-based GRPO methods. MV-GRPO consistently outperforms the baselines in semantic alignment, visual fidelity, and structural coherence across diverse scenarios and reward models.

Figure 4: Qualitative Comparisons with Baselines on HPS-v3.


Figure 5: Qualitative Comparisons with Baselines on UnifiedReward-v2.

Quantitative Evaluation

We report quantitative assessments of the proposed MV-GRPO against the baselines. MV-GRPO demonstrates consistent superiority under both single-reward (HPS-v3 or UnifiedReward-v2) and multi-reward (HPS-v3 + CLIP) settings. The reward curves during training further validate that MV-GRPO outperforms the baselines in both convergence speed and performance ceiling.

Figure 6: Reward Curves during Training.


Gallery of MV-GRPO

We present additional visual results of the proposed MV-GRPO. More samples can be found in our supplementary material.

Figure 7: Additional Visual Samples of MV-GRPO (1/3).


Figure 8: Additional Visual Samples of MV-GRPO (2/3).


Figure 9: Additional Visual Samples of MV-GRPO (3/3).

BibTeX

If you find this work helpful, please cite the following paper:

    
  

Project page template is borrowed from FreeScale.