🎨 From Sparse to Dense: Multi-View GRPO for Flow Models via Augmented Condition Space
Group Relative Policy Optimization (GRPO) has emerged as a powerful framework for preference alignment in text-to-image (T2I) flow models. However, we observe that the standard paradigm, in which a group of generated samples is evaluated against a single condition, suffers from insufficient exploration of inter-sample relationships, constraining both alignment efficacy and performance ceilings. To address this sparse single-view evaluation scheme, we propose Multi-View GRPO (MV-GRPO), a novel approach that enhances relationship exploration by augmenting the condition space to create a dense multi-view reward mapping. Specifically, for a group of samples generated from one prompt, MV-GRPO leverages a flexible Condition Enhancer to generate semantically adjacent yet diverse captions. These captions enable multi-view advantage re-estimation, capturing diverse semantic attributes and providing richer optimization signals. By deriving the probability distribution of the original samples conditioned on these new captions, we can incorporate them into the training process without costly sample regeneration. Extensive experiments demonstrate that MV-GRPO achieves superior alignment performance over state-of-the-art methods.
Existing flow-based GRPO methods typically operate under a "Single-View" paradigm: they evaluate the generated group solely against the single initial condition. This reward evaluation protocol can be reinterpreted as a sparse, one-to-many mapping from the condition space \( \mathcal{C} \) to the data space \( \mathcal{X} \), as shown in Figure 1(a). Fundamentally, this paradigm models intra-group relationships by ranking samples based on their alignment with a single condition, ignoring the multifaceted nature of visual semantics. For instance, as illustrated in Figure 2, given an SDE sample depicting a cat and a dog within a teacup, it may rank poorly under one condition ("A cat and a dog in a teacup.") but highly under another similar condition specifying visual attributes like lighting, motion, or composition. Consequently, relying solely on the ranking derived from a single prompt is insufficient to gauge the nuanced relationships among samples, resulting in an inherently sparse reward mapping. In contrast, by incorporating the diverse rankings induced by novel prompts, we can effectively densify the condition-data reward signal, as depicted in Figure 1(b). This strategy serves dual purposes: (i) enabling a more comprehensive exploration of intra-group relationships from multiple perspectives, and (ii) establishing intrinsic contrasts by identifying ranking shifts across different conditions, thereby facilitating preference-aligned generation under various conditions.
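The densification idea above can be sketched numerically: instead of one group-normalized advantage per sample, compute a per-view advantage under each condition and aggregate. The snippet below is a minimal illustrative sketch, assuming simple mean aggregation across views; the paper's exact aggregation rule may differ.

```python
import numpy as np

def multi_view_advantages(rewards):
    """Multi-view GRPO-style advantage estimation (illustrative sketch).

    rewards: (K, G) array -- K condition views (original caption plus
             augmented captions), G samples in the group.

    Each row is normalized with the standard GRPO group baseline
    (subtract group mean, divide by group std), then the per-view
    advantages are averaged to form a dense multi-view estimate.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True) + 1e-8  # avoid divide-by-zero
    per_view = (rewards - mean) / std                # (K, G) per-view advantages
    return per_view.mean(axis=0)                     # (G,) aggregated over views
```

Note how two views with opposite rankings partially cancel, which is exactly the "intrinsic contrast" from ranking shifts described above.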
Given a group of SDE samples, MV-GRPO first leverages a flexible Condition Enhancer module (a pretrained VLM or LLM) to generate diverse augmented conditions. These augmented descriptors, along with the original condition, form a multi-view condition cluster that yields dense condition-data reward signals and facilitates comprehensive advantage estimation. In the Theoretical Perspective part of our paper, we justify optimizing the policy conditioned on an augmented view using trajectories generated under the original condition.
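Reusing trajectories sampled under the original condition amounts to re-evaluating the stored samples' log-probabilities under each augmented caption and plugging them into a clipped policy-ratio objective, so no regeneration is needed. The sketch below is our own simplified notation, assuming a PPO/GRPO-style clipped surrogate; it is not the paper's exact loss.

```python
import numpy as np

def grpo_loss_multi_view(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate loss over reused trajectories (illustrative sketch).

    logp_new:   (K, G) log-probs of the SAME stored samples, re-evaluated
                by the current policy under each of the K captions.
    logp_old:   (G,) log-probs recorded at sampling time under the
                original condition.
    advantages: (K, G) multi-view advantages (e.g. per-view normalized).

    Returns the negative clipped surrogate objective to minimize.
    """
    ratio = np.exp(logp_new - logp_old[None, :])     # importance ratios
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -np.minimum(unclipped, clipped).mean()
```

When the current policy matches the sampling policy (ratio = 1), the loss reduces to the negative mean advantage, recovering the standard single-view behavior as a special case.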
Qualitative comparisons with existing flow-based GRPO methods. MV-GRPO consistently outperforms the baselines in semantic alignment, visual fidelity, and structural coherence across diverse scenarios and reward models.
Quantitative assessments of the proposed MV-GRPO and other baselines. MV-GRPO demonstrates consistent superiority under both single reward (HPS-v3 or UnifiedReward-v2) and multi-reward (HPS-v3 + CLIP) settings. The reward curves during training further validate that our MV-GRPO outperforms baselines in convergence speed and performance ceiling.
We present additional visual results of the proposed MV-GRPO. More samples can be found in our supplementary material.
If you find this work helpful, please cite the following paper:
Project page template is borrowed from FreeScale.