ICML 2026

Rooted Absorbed Prefix Trajectory Balance
with Submodular Replay for GFlowNet Training

Xi Wang¹, Wenbo Lu¹, Shengjie Wang¹

¹ Courant Institute School of Mathematics, Computing, and Data Science, New York University

Corresponding author: sw5973@nyu.edu

TL;DR: LLM-GFlowNet training tends to collapse onto a few short, near-identical samples. RapTB propagates terminal reward to every prefix via absorbed-suffix backups, and SubM refreshes the replay buffer with a submodular objective over reward, diversity, and length. Together they mitigate prefix collapse and length bias, yielding more diverse, higher-quality generations at model scales up to 32B.

Figure: RapTB compared with TB and SubTB on terminable LLM-GFlowNets. TB uses only the terminal reward log R(s_τ) (an O(1) signal) and collapses to near-identical short prefixes. SubTB adds O(N²) windowed constraints that drift the termination probability, inflating length. RapTB replaces prefix stop-rewards with suffix-absorbed targets u_j under O(N) rooted-prefix constraints, producing diverse, high-QED molecules.

Abstract

Generative Flow Networks (GFlowNets) enable fine‑tuning large language models to approximate reward‑proportional posteriors, but they remain prone to mode collapse, manifesting as prefix collapse and length bias. We attribute this to two factors: (i) weak credit assignment to early prefixes, and (ii) biased replay that induces a shifted, non‑representative training flow distribution. We propose Rooted absorbed prefix Trajectory Balance (RapTB), an objective that anchors subtrajectory supervision at the root and propagates terminal rewards to intermediate prefixes via absorbed suffix‑based backups, providing dense prefix‑level learning signals. To mitigate replay‑induced distribution shift, we further introduce SubM, a submodular replay refresh strategy that promotes both high reward and diversity. Empirically, on tasks such as molecule generation with LLMs using SMILES strings, RapTB combined with SubM consistently improves optimization performance and molecular diversity while preserving high validity.

Mode collapse, dissected

LLM‑GFlowNets are supposed to spread probability mass across many high‑reward modes in proportion to reward. In practice, two coupled, reproducible failures appear:

Prefix collapse

Early‑token entropy drops sharply; distinct terminals share near‑identical prefixes and only branch late. Root cause: terminal‑only rewards give high‑variance, ambiguous credit to intermediate steps.
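One way to quantify prefix collapse is to measure, at each depth, the entropy and top-1 mass of the empirical distribution over prefixes. The sketch below is illustrative only: the function name `prefix_diagnostics`, the survival definition (fraction of samples that reach the given depth), and the toy SMILES-like samples are assumptions, not the paper's exact diagnostics.

```python
import math
from collections import Counter

def prefix_diagnostics(samples, depth):
    """Summarize the empirical distribution over length-`depth` prefixes.

    Only samples with at least `depth` tokens are counted. Returns the
    fraction of samples that reach `depth` (assumed survival definition),
    the prefix entropy in nats, and the mass of the most common prefix.
    """
    prefixes = [tuple(s[:depth]) for s in samples if len(s) >= depth]
    if not prefixes:
        return {"survival": 0.0, "entropy": 0.0, "top1_mass": 0.0}
    counts = Counter(prefixes)
    n = len(prefixes)
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    return {"survival": n / len(samples), "entropy": entropy, "top1_mass": max(probs)}

# Hypothetical token sequences: the first set shares one prefix, the second branches early.
collapsed = [list("CCO"), list("CCN"), list("CCC"), list("CC")]
diverse = [list("CCO"), list("NCC"), list("OCN"), list("C1C")]

collapsed_d2 = prefix_diagnostics(collapsed, depth=2)
diverse_d2 = prefix_diagnostics(diverse, depth=2)
```

On these toy inputs, the collapsed set puts all depth-2 mass on a single prefix, while the diverse set spreads it uniformly over four prefixes.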

Length bias

The policy systematically favors sequences that are too short (TB) or too long (SubTB). Root cause: biased replay reinforces a narrow subset, while overlapping window constraints drift the termination logits.

Method

1  ·  RapTB: rooted, absorbed prefix credit

RapTB keeps terminal Trajectory Balance as the single exact balance constraint — the only condition whose optimum matches the reward‑proportional target — and adds a lightweight auxiliary term that delivers dense, low‑variance supervision at every prefix.

A Rooted prefix residuals

Instead of constraining all subtrajectories (as in SubTB), we constrain only those rooted at the source s_0. Subtracting the root residual cancels the global constant log Z_θ, giving a clean local consistency signal with no conflicting window boundaries:

\[ \bar\Delta_k(\xi) \;\triangleq\; \Delta_k^{\mathrm{TB}}(\xi) - \Delta_0^{\mathrm{TB}}(\xi) \]
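The cancellation of log Z_θ can be checked numerically. The sketch below assumes a particular form of the prefix residual that this page does not spell out: Δ_k^TB = log Z + Σ_{t<k} log P_F(s_{t+1} | s_t) + log P_F(stop | s_k) − u_k, with a trivial backward policy for autoregressive sequences. The function names and all numbers are hypothetical.

```python
def prefix_tb_residuals(log_z, log_pf_steps, log_p_term, u):
    """Assumed prefix residuals, one per prefix k = 0..N.

    log_pf_steps[t] : log-probability of the token taken at step t (length N)
    log_p_term[k]   : log-probability of stopping at prefix k (length N + 1)
    u[k]            : log stop-reward of prefix k (length N + 1)
    """
    residuals = []
    cumulative = 0.0  # sum of forward log-probs for steps t < k
    for k in range(len(u)):
        residuals.append(log_z + cumulative + log_p_term[k] - u[k])
        if k < len(log_pf_steps):
            cumulative += log_pf_steps[k]
    return residuals

def rooted_residuals(log_z, log_pf_steps, log_p_term, u):
    """Subtract the root residual (k = 0) from every prefix residual."""
    res = prefix_tb_residuals(log_z, log_pf_steps, log_p_term, u)
    return [r - res[0] for r in res]

# Hypothetical two-step trajectory.
steps = [-0.5, -1.0]
p_term = [-3.0, -2.0, -0.1]
u = [-4.0, -1.5, 0.2]

rooted_a = rooted_residuals(0.0, steps, p_term, u)
rooted_b = rooted_residuals(5.0, steps, p_term, u)  # a different log Z
```

Under this assumed form, `rooted_a` and `rooted_b` agree up to floating-point error: the rooted residual depends only on the forward log-probabilities, the stop log-probabilities, and the stop-rewards.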

B Absorbed suffix rewards

We build a low-variance target for each prefix by backing up rewards from the observed suffix: a max term (a lower bound on prefix credit) blended with a distance-discounted soft maximum (a log-sum-exp with sharpness β and per-step distance penalty ρ):

\[ u_k^{\mathrm{tgt}} = \alpha\,\underbrace{\max_{j\in[k,\tau]} u_j}_{u_k^{\max}} + (1-\alpha)\,\underbrace{\tfrac{1}{\beta}\log\!\textstyle\sum_{j=k}^{\tau} e^{\beta u_j - \beta\rho (j-k)}}_{u_k^{\mathrm{soft}}} \]
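The target above can be computed directly from the per-prefix log stop-rewards along one trajectory. This is a minimal sketch: the function name `absorbed_targets` and the default values of α, β, and ρ are placeholders chosen for illustration, not the paper's settings, and β is assumed to be positive.

```python
import math

def absorbed_targets(u, alpha=0.5, beta=1.0, rho=0.1):
    """Blend of suffix max and distance-discounted log-sum-exp for each prefix k.

    u[j] is the log stop-reward of prefix j along one trajectory, j = 0..tau.
    """
    targets = []
    for k in range(len(u)):
        suffix = u[k:]
        u_max = max(suffix)
        # Exponent for position j is beta * u_j - beta * rho * (j - k).
        exponents = [beta * uj - beta * rho * d for d, uj in enumerate(suffix)]
        m = max(exponents)  # shift by the max for numerical stability
        u_soft = (m + math.log(sum(math.exp(e - m) for e in exponents))) / beta
        targets.append(alpha * u_max + (1.0 - alpha) * u_soft)
    return targets

# With alpha = 1 the target reduces to the running suffix maximum.
only_max = absorbed_targets([1.0, -2.0, 0.0], alpha=1.0)
```

Two limiting cases are useful sanity checks: at the final prefix the suffix has a single element, so the target equals u_τ for any α; and a large ρ makes the soft term ignore distant suffix rewards, so it approaches u_k.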

The auxiliary loss simply re‑computes the rooted residual against this absorbed target (with the termination head detached to prevent length drift), and the final objective keeps TB as the anchor:

\[ \mathcal{L}_{\mathrm{RapTB}} = \mathbb{E}_{\xi\sim P_F^\theta}\Big[\; \underbrace{\Delta^{\mathrm{TB}}(\xi)^2}_{\textbf{Anchor (exact)}} \;+\; \eta\,\underbrace{\mathcal{L}_{\mathrm{aux}}(\xi)}_{\textbf{Partial credit (variance-reducing)}} \;\Big] \]

Fixed-point guarantee. The TB deviation of any RapTB minimizer is bounded by η·𝓛_aux(θ*) and vanishes as η → 0, so the auxiliary term regularizes without ever destroying the global TB anchor.

2  ·  SubM: submodular, coverage‑aware replay

Reward‑prioritized replay creates rich‑get‑richer dynamics. SubM instead refreshes a fixed‑size buffer by greedily maximizing a monotone submodular objective over the union of the current buffer and the new batch — jointly rewarding quality, facility‑location diversity, and length coverage:

\[ f(S) = \underbrace{\sum_{x\in S}\mathrm{static}(x)}_{\text{quality}} + \lambda_{\mathrm{div}}\underbrace{\sum_{v\in\mathcal{G}}\max_{x\in S}\mathrm{sim}(v,x)}_{\text{diversity coverage}} + \lambda_{\mathrm{len}}\underbrace{\textstyle\sum_b \alpha_b\log(1+c_b(S))}_{\text{length coverage}} \]

Greedy selection inherits the standard (1 − 1/e) approximation guarantee for monotone submodular maximization under a cardinality constraint; with cached similarities each refresh costs O(B|𝒢|), about 10 ms of overhead per update.
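The greedy refresh can be sketched as follows. This is a naive illustration that recomputes f(S) for every candidate rather than using cached similarities, so it does not reach the cost quoted above. The names `subm_objective` and `greedy_refresh`, the representation of 𝒢 as rows of a similarity matrix, the bucket weights defaulting to 1, and the toy data are all assumptions. Quality scores and similarities are taken to be non-negative so that f stays monotone.

```python
import math

def subm_objective(selected, static, sim, bucket, lam_div=1.0, lam_len=1.0, alpha=None):
    """f(S) = quality + lam_div * facility-location coverage + lam_len * length coverage."""
    alpha = alpha or {}
    quality = sum(static[x] for x in selected)
    # Each reference point (one row of sim) is credited with its most similar selected item.
    coverage = sum(max((row[x] for x in selected), default=0.0) for row in sim)
    counts = {}
    for x in selected:
        counts[bucket[x]] = counts.get(bucket[x], 0) + 1
    # log(1 + count) gives diminishing returns inside each length bucket.
    length = sum(alpha.get(b, 1.0) * math.log1p(c) for b, c in counts.items())
    return quality + lam_div * coverage + lam_len * length

def greedy_refresh(candidates, budget, static, sim, bucket, **weights):
    """Pick up to `budget` items, each time adding the one that yields the largest f(S)."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < budget:
        scored = [(subm_objective(selected + [x], static, sim, bucket, **weights), x)
                  for x in remaining]
        _, best = max(scored, key=lambda pair: pair[0])
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy pool: items 0 and 1 are near-duplicates with the two highest quality scores.
static = [0.9, 0.85, 0.5, 0.1]
sim = [[1.0, 0.95, 0.1, 0.0],
       [0.95, 1.0, 0.1, 0.0],
       [0.1, 0.1, 1.0, 0.2],
       [0.0, 0.0, 0.2, 1.0]]
bucket = [0, 0, 1, 1]  # hypothetical length-bucket id per item

chosen = greedy_refresh(range(4), 2, static, sim, bucket)
```

With a budget of two, pure reward ranking would keep the near-duplicate pair (0, 1). In this toy setup the greedy rule keeps item 0 and then prefers item 2, whose diversity and length-coverage gains outweigh its lower quality score.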

Results

+0.13: QED score over TB on SMILES (0.844 vs 0.717)
≈2×: coverage of best baseline on Expr24 (NormCov 0.209 vs 0.100)
0.97+: validity preserved where SubTB drops to 0.33
32B: consistent gains across scales 1B → 3B → 8B → 32B

Scaffold‑conditioned SMILES generation

RL baselines maximize reward and collapse to a single mode (PPO entropy ≈ 0). Among GFlowNet objectives, TB is valid but short and low‑diversity; SubTB drifts and loses validity. RapTB + SubM wins the quality–diversity trade‑off while keeping validity high.

Method         Acc ↑    Score ↑   Entropy ↑   FPDiv ↑   Len
PPO            1.000    0.604     ≈0          -         -
GRPO           0.997    0.661     0.981       0.0       -
TB             0.998    0.717     2.503       0.807     3.065
SubTB          0.328    0.755     2.127       0.836     8.354
RapTB          0.996    0.740     2.448       0.860     6.142
RapTB + SubM   0.988    0.844     2.726       0.898     7.435

Metrics on valid samples; Len is average token length. RapTB+SubM is best on Score, Entropy and FPDiv.

Figure (valid-only length histogram): Length distribution. TB piles onto lengths 0–2; SubTB onto 9–10. RapTB (+SubM) spreads mass across the horizon.

Figure (prefix-collapse diagnostics): TB shows rapid survival decay and a spiking top-1 prefix mass; RapTB sustains high prefix entropy and low concentration deep into the trajectory.

Figure (score and FP-diversity versus length): Length-stratified quality. Conditioned on length, RapTB+SubM holds the highest score and fingerprint diversity even in the long-sequence regime where TB degrades.

Beyond molecules: Expr24 & CommonGen

  Expr24 (sparse, enumerable reward)

Generate an arithmetic expression evaluating to 24. Under standard replay, TB collapses to Unique≈5. RapTB recovers diversity at near‑perfect accuracy, and RapTB+SubM doubles normalized coverage (0.209 vs 0.100) over the strongest baseline. The enumerable solution set lets us verify lower KL/JS to the true reward distribution.
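Because the solution set is enumerable, the gap between the sampler and the reward-proportional target can be computed exactly over that set. The sketch below is illustrative: the direction of the KL term (empirical against target), the helper names, and the four toy expressions with uniform reward are assumptions rather than the paper's evaluation protocol.

```python
import math
from collections import Counter

def target_distribution(rewards):
    """Normalize positive rewards over the enumerated solutions."""
    z = sum(rewards.values())
    return {x: r / z for x, r in rewards.items()}

def empirical_distribution(samples, support):
    """Frequencies of sampled solutions, restricted to the enumerated support."""
    counts = Counter(s for s in samples if s in support)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no sample falls inside the enumerated support")
    return {x: counts[x] / n for x in support}

def kl(p, q):
    """KL(p || q) in nats; requires q[x] > 0 wherever p[x] > 0."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

def js(p, q):
    """Jensen-Shannon divergence in nats, bounded above by log 2."""
    m = {x: 0.5 * (p[x] + q[x]) for x in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical enumerated solutions, all equally rewarded.
rewards = {"1*2*3*4": 1.0, "(1+2+3)*4": 1.0, "4*(1+2+3)": 1.0, "2*3*4*1": 1.0}
target = target_distribution(rewards)

collapsed = empirical_distribution(["1*2*3*4"] * 8, target)
spread = empirical_distribution(list(rewards) * 2, target)

kl_collapsed, js_collapsed = kl(collapsed, target), js(collapsed, target)
kl_spread, js_spread = kl(spread, target), js(spread, target)
```

In this toy case a sampler that collapses onto one expression sits at KL = log 4 from the uniform target, while a sampler that covers all four expressions equally has zero divergence.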

  CommonGen (real linguistic priors)

With a frozen reference LM as anchor, SubTB deviates catastrophically, saturating length (20.0) by suppressing stop logits (Δ log p_term ≈ −28). RapTB stays calibrated, and RapTB+SubM reaches BLEU-4 33.2 at natural length (11.8) with the highest entropy.

  Scaling & generality. Gains hold on AMP biological sequence generation and across Llama‑3.2 (1B/3B/8B) and Qwen3‑32B. SubTB's termination drift persists at every scale (Acc 0.31/0.39/0.80 at 3B/8B/32B), confirming the failure is structural, while RapTB+SubM gives the best quality–diversity trade‑off at all sizes.

BibTeX

@inproceedings{wang2026raptb,
  title         = {Rooted Absorbed Prefix Trajectory Balance with Submodular
                   Replay for {GFlowNet} Training},
  author        = {Wang, Xi and Lu, Wenbo and Wang, Shengjie},
  booktitle     = {Proceedings of the 43rd International Conference on
                   Machine Learning (ICML)},
  year          = {2026},
  eprint        = {2603.00454},
  archivePrefix = {arXiv},
  url           = {https://arxiv.org/abs/2603.00454}
}