⚓️ AdaGRPO: A Plug-and-Play Adaptive Enhancement for Flow-based GRPO

Jiazi Bu1,2,6* Pengyang Ling3*§ Yujie Zhou1,6* Yibin Wang4,9 Yuhang Zang6

Tianyi Wei2 Xiaohang Zhan7 Jiaqi Wang9 Tong Wu8† Xingang Pan2† Dahua Lin5,6,10

1 Shanghai Jiao Tong University    2 S-Lab, Nanyang Technological University
3 University of Science and Technology of China    4 Fudan University    5 The Chinese University of Hong Kong
6 Shanghai AI Laboratory    7 Adobe Research    8 Stanford University    9 Shanghai Innovation Institute    10 CPII under InnoHK
(* Equal Contribution   § Project Leader   † Corresponding Author)

[Paper (Coming Soon)]      [Code (Coming Soon)]


Abstract

Group Relative Policy Optimization (GRPO) has demonstrated remarkable success in aligning text-to-image (T2I) flow models with human preferences. However, we find that the learning loop of current flow-based GRPO is fundamentally decoupled from the learner's current capability, with critical blind spots in both prompt selection and advantage estimation: (i) existing methods sample prompts at random, overlooking the substantial impact of data selection on reinforcement learning (RL) efficacy, a factor proven crucial in GRPO for large language models; (ii) they evaluate sample quality relying solely on intra-group statistics, lacking a global perspective to accurately measure true policy improvement. To address these issues, we propose Adaptive GRPO (AdaGRPO), a novel capability-aware RL algorithm tailored for flow models. AdaGRPO consists of two principal components. Online Curriculum Filtering Strategy dynamically tracks the model's proficiency and adaptively selects prompts that best match its current learning boundary. Cross-Level Advantage Fusion integrates fine-grained intra-group advantages with macro-level global advantages, providing a comprehensive and unbiased policy evaluation. As a lightweight, plug-and-play module, AdaGRPO can be seamlessly integrated with existing frameworks such as Flow-GRPO, DanceGRPO, and Flow-CPS. Extensive experiments demonstrate that AdaGRPO consistently drives performance gains while significantly stabilizing GRPO training for flow models. Our code will be released at AdaGRPO Repo.

Observation and Motivation

Current flow-based GRPO frameworks are fundamentally decoupled from the model's evolving capability during training.
(1) Prompt Selection. Existing methods sample prompts at random, without regard to difficulty. Inspired by prompt selection strategies in RL for LLMs, we investigate the impact of prompt difficulty on flow-based GRPO. As shown in Fig. 1(a), training on the "easiest" prompts (highest ODE rewards) causes severe performance degradation, while training on the "hardest" prompts (lowest ODE rewards) barely outperforms the random baseline. In contrast, medium-difficulty prompts drive notable gains, corroborating the established finding in LLM alignment. However, defining "medium" by the median reward of an isolated batch is biased, because that median is detached from the model's aggregate proficiency (e.g., the median of a universally challenging batch remains overly difficult for the model).
(2) Advantage Estimation. Current methods typically evaluate samples solely via intra-group rewards and thus exhibit severe "myopia". They erroneously assign positive advantages to subpar samples simply because these samples lie above the local intra-group mean, even if they fall below the model's global capability (false positives), while penalizing high-quality samples that fall below the local mean but actually surpass the global capability (false negatives), as shown in Fig. 1(b). Without a reliable reference to gauge absolute policy progression, these local biases inevitably obscure the true optimization direction.
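To make the false-positive and false-negative cases concrete, the toy sketch below computes standard group-normalized advantages for two groups and compares their signs against a global reference reward. The reward values, the reference `global_ref`, and the helper names are illustrative assumptions, not numbers or code from the paper.

```python
from statistics import mean, pstdev


def intra_group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (reward - group mean) / group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def find_mismatches(rewards, global_ref):
    """Label samples whose local advantage sign disagrees with a global reference."""
    labels = []
    for r, adv in zip(rewards, intra_group_advantages(rewards)):
        if adv > 0 and r < global_ref:
            labels.append("false positive")   # rewarded locally, yet below global level
        elif adv < 0 and r > global_ref:
            labels.append("false negative")   # penalized locally, yet above global level
        else:
            labels.append("consistent")
    return labels


global_ref = 0.50                          # hypothetical global capability estimate
weak_group = [0.10, 0.20, 0.30, 0.40]      # every sample is below global_ref
strong_group = [0.60, 0.70, 0.80, 0.90]    # every sample is above global_ref

print(find_mismatches(weak_group, global_ref))
print(find_mismatches(strong_group, global_ref))
```

In the weak group, the two samples above the group mean receive positive advantages despite scoring below the reference; in the strong group, the two samples below the group mean are penalized despite scoring above it.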

Figure 1: Key Observations.

Methodology

AdaGRPO is a novel capability-aware RL algorithm tailored for flow models, featuring two principal components.
(i) Online Curriculum Filtering Strategy performs prompt selection. Rooted in curriculum learning, this module maintains an Exponential Moving Average (EMA) of historical rewards to explicitly track the model's global generation proficiency, and adaptively selects candidate prompts that lie at the model's current learning boundary. Because the reference is the EMA rather than the statistics of a single batch, this removes the batch-local bias and keeps optimization focused on prompts that provide a constructive learning signal.
(ii) Cross-Level Advantage Fusion calibrates advantage estimation. By fusing intra-group local advantages with macro-level global advantages, samples are rewarded not only for outperforming their immediate peers but also for surpassing the model's past capability bounds, yielding an unbiased signal of absolute policy progression.
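The two components can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the EMA decay `decay`, the selection rule (keep the `k` candidate prompts whose mean reward is closest to the EMA), the fusion weight `lam`, and the form of the global advantage (reward minus the EMA, scaled by the group's standard deviation) are all hypothetical choices, since this page does not specify them.

```python
from statistics import mean, pstdev


class CapabilityTracker:
    """Tracks an exponential moving average (EMA) of observed rewards."""

    def __init__(self, decay=0.9):
        self.decay = decay   # assumed smoothing factor
        self.ema = None      # initialized from the first batch

    def update(self, rewards):
        batch_mean = mean(rewards)
        if self.ema is None:
            self.ema = batch_mean
        else:
            self.ema = self.decay * self.ema + (1.0 - self.decay) * batch_mean
        return self.ema


def select_prompts(prompt_rewards, ema, k):
    """Assumed rule: keep the k prompts whose mean reward is closest to the EMA.

    prompt_rewards maps each candidate prompt to a list of sample rewards.
    """
    ranked = sorted(prompt_rewards, key=lambda p: abs(mean(prompt_rewards[p]) - ema))
    return ranked[:k]


def fused_advantages(rewards, ema, lam=0.5, eps=1e-8):
    """Blend the intra-group advantage with an EMA-referenced global advantage."""
    mu, sigma = mean(rewards), pstdev(rewards)
    local = [(r - mu) / (sigma + eps) for r in rewards]     # relative to group peers
    global_ = [(r - ema) / (sigma + eps) for r in rewards]  # relative to tracked capability
    return [(1.0 - lam) * a_l + lam * a_g for a_l, a_g in zip(local, global_)]


# Usage with made-up rewards.
tracker = CapabilityTracker(decay=0.9)
tracker.update([0.5, 0.5])  # EMA starts at 0.5
candidates = {"easy": [0.9, 0.8], "medium": [0.5, 0.6], "hard": [0.1, 0.2]}
chosen = select_prompts(candidates, tracker.ema, k=1)  # ['medium']
adv = fused_advantages([0.2, 0.4], tracker.ema)        # approximately [-2.0, 0.0]
print(chosen, adv)
```

In the usage example, the second sample beats its peer (local advantage near +1) but still falls below the tracked capability (global advantage near -1), so under these assumed weights its fused advantage is roughly zero rather than positive.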

Figure 2: Overview of AdaGRPO.

Qualitative Comparison

We qualitatively compare existing flow-based GRPO methods with and without AdaGRPO. AdaGRPO consistently improves the baseline frameworks in visual fidelity, aesthetic appeal, and semantic adherence.

Figure 3: Qualitative Comparisons with Baselines on HPS-v2 (with and without AdaGRPO).


Figure 4: Qualitative Comparisons with Baselines on HPS-v3 (with and without AdaGRPO).

Quantitative Evaluation

We quantitatively assess the baseline methods with and without AdaGRPO. Under both single-reward (HPS-v2/v3) and multi-reward (HPS-v3 + CLIP) settings, AdaGRPO consistently brings substantial improvements to the prevailing baselines (Flow-GRPO, DanceGRPO, and Flow-CPS), validating its effectiveness and architecture-agnostic nature.

Figure 5: Reward Curves during Training (with and without AdaGRPO).


Gallery of AdaGRPO

We present additional visual results of the proposed AdaGRPO. More samples can be found in our appendix.

Figure 6: Additional Visual Samples of AdaGRPO (1/4).


Figure 7: Additional Visual Samples of AdaGRPO (2/4).


Figure 8: Additional Visual Samples of AdaGRPO (3/4).


Figure 9: Additional Visual Samples of AdaGRPO (4/4).

BibTeX

If you find this work helpful, please cite the following paper:

    TBD
    
  

Project page template is borrowed from FreeScale.