RapTB: Rooted Absorbed Prefix Trajectory Balance with Submodular Replay

**Three objectives on terminable LLM‑GFlowNets.** TB uses only the terminal log‑reward log R(s_τ), an O(1) signal, and collapses to near‑identical short prefixes. SubTB adds O(N²) windowed constraints that drift the termination probability and inflate length. RapTB replaces prefix stop‑rewards with suffix‑absorbed targets u_j under O(N) rooted‑prefix constraints, producing diverse, high‑QED molecules.

Abstract

Generative Flow Networks (GFlowNets) enable fine‑tuning large language models to approximate reward‑proportional posteriors, but they remain prone to mode collapse, manifesting as prefix collapse and length bias. We attribute this to two factors: (i) weak credit assignment to early prefixes, and (ii) biased replay that induces a shifted, non‑representative training flow distribution. We propose Rooted absorbed prefix Trajectory Balance (RapTB), an objective that anchors subtrajectory supervision at the root and propagates terminal rewards to intermediate prefixes via absorbed suffix‑based backups, providing dense prefix‑level learning signals. To mitigate replay‑induced distribution shift, we further introduce SubM, a submodular replay refresh strategy that promotes both high reward and diversity. Empirically, on tasks such as molecule generation with LLMs using SMILES strings, RapTB combined with SubM consistently improves optimization performance and molecular diversity while preserving high validity.

Mode collapse, dissected

LLM‑GFlowNets are supposed to spread probability mass across many high‑reward modes in proportion to reward. In practice, two coupled, reproducible failures appear:

Prefix collapse

Early‑token entropy drops sharply; distinct terminals share near‑identical prefixes and only branch late. Root cause: terminal‑only rewards give high‑variance, ambiguous credit to intermediate steps.

Length bias

The policy systematically favors sequences that are too short (TB) or too long (SubTB). Root cause: biased replay reinforces a narrow subset, while overlapping window constraints drift the termination logits.

Method

1 · RapTB: rooted, absorbed prefix credit

RapTB keeps terminal Trajectory Balance as the single exact balance constraint — the only condition whose optimum matches the reward‑proportional target — and adds a lightweight auxiliary term that delivers dense, low‑variance supervision at every prefix.

A Rooted prefix residuals

Instead of constraining all subtrajectories (as in SubTB), we only constrain those rooted at the source s₀. Subtracting the root residual cancels the global constant log Z_θ, giving a clean local consistency signal with no conflicting window boundaries:

\[ \bar\Delta_k(\xi) \;\triangleq\; \Delta_k^{\mathrm{TB}}(\xi) - \Delta_0^{\mathrm{TB}}(\xi) \]
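The cancellation can be checked numerically with a minimal sketch. The text does not spell out Δ_k^TB, so the form used here is an assumption: log Z plus the forward log‑probability of the prefix, plus the stop log‑probability at that prefix, minus the prefix log‑reward u_k. The function names, argument layout, and toy numbers are all hypothetical.

```python
def tb_residuals(log_z, step_logps, stop_logps, log_rewards):
    """Per-prefix TB residuals Delta_k for k = 0..n.

    ASSUMED form (not given in the text):
        Delta_k = log Z + sum_{t<k} log P_F(token_t) + log P_F(stop | s_k) - u_k
    step_logps has n entries; stop_logps and log_rewards have n + 1 (one per prefix).
    """
    residuals = []
    prefix_logp = 0.0  # running sum of forward log-probs up to prefix k
    for k in range(len(stop_logps)):
        residuals.append(log_z + prefix_logp + stop_logps[k] - log_rewards[k])
        if k < len(step_logps):
            prefix_logp += step_logps[k]
    return residuals


def rooted_residuals(residuals):
    """Subtract the root residual Delta_0 from every Delta_k."""
    root = residuals[0]
    return [d - root for d in residuals]


# Toy trajectory with n = 3 tokens (all numbers are made up for illustration).
step_logps = [-0.7, -1.2, -0.4]
stop_logps = [-3.0, -2.0, -1.0, -0.1]
log_rewards = [-5.0, -3.5, -2.0, -1.0]

# Two very different values of log Z give the same rooted residuals.
rooted_a = rooted_residuals(tb_residuals(0.0, step_logps, stop_logps, log_rewards))
rooted_b = rooted_residuals(tb_residuals(7.5, step_logps, stop_logps, log_rewards))
```

Under this assumed form, `rooted_a` and `rooted_b` agree up to floating‑point error, which is the sense in which the rooted residual no longer depends on log Z_θ.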

B Absorbed suffix rewards

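We build a low‑variance target for each prefix by backing up rewards from the observed suffix: a max term (a lower bound on prefix credit) blended with a distance‑discounted soft maximum (log‑sum‑exp). Here α sets the blend, β the sharpness of the soft maximum, and ρ the per‑step distance discount: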

\[ u_k^{\mathrm{tgt}} = \alpha\,\underbrace{\max_{j\in[k,\tau]} u_j}_{u_k^{\max}} + (1-\alpha)\,\underbrace{\tfrac{1}{\beta}\log\!\textstyle\sum_{j=k}^{\tau} e^{\beta u_j - \beta\rho (j-k)}}_{u_k^{\mathrm{soft}}} \]
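The target can be computed directly from the formula. The following is a minimal sketch in plain Python, assuming `u` is the list of per‑prefix values u_0..u_τ along one observed trajectory; the function name and the default hyperparameter values are illustrative, not taken from the paper.

```python
import math


def absorbed_targets(u, alpha=0.5, beta=1.0, rho=0.1):
    """Return u_k^tgt for every prefix k, following the formula above.

    u     : per-prefix values u_0..u_tau along one trajectory
    alpha : blend between the hard max and the soft maximum
    beta  : sharpness of the soft maximum (must be > 0)
    rho   : per-step discount applied to distant suffix positions
    """
    targets = []
    for k in range(len(u)):
        suffix = u[k:]
        u_max = max(suffix)
        # Exponent for position j is beta * u_j - beta * rho * (j - k).
        scores = [beta * u_j - beta * rho * d for d, u_j in enumerate(suffix)]
        # Numerically stable log-sum-exp.
        m = max(scores)
        u_soft = (m + math.log(sum(math.exp(s - m) for s in scores))) / beta
        targets.append(alpha * u_max + (1.0 - alpha) * u_soft)
    return targets


u = [-3.0, -1.0, -2.0]                       # made-up values for illustration
hard = absorbed_targets(u, alpha=1.0)        # pure suffix max
blend = absorbed_targets(u, alpha=0.5, beta=2.0, rho=0.1)
```

Two properties follow from the formula and hold in this sketch: at the last position the target equals u_τ itself, and every target is at least u_k, because the j = k term enters both the max and the log‑sum‑exp with zero discount.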

The auxiliary loss simply re‑computes the rooted residual against this absorbed target (with the termination head detached to prevent length drift), and the final objective keeps TB as the anchor:

\[ \mathcal{L}_{\mathrm{RapTB}} = \mathbb{E}_{\xi\sim P_F^\theta}\Big[\; \underbrace{\Delta^{\mathrm{TB}}(\xi)^2}_{\textbf{Anchor (exact)}} \;+\; \eta\,\underbrace{\mathcal{L}_{\mathrm{aux}}(\xi)}_{\textbf{Partial credit (variance-reducing)}} \;\Big] \]

Fixed‑point guarantee. The TB deviation of any RapTB minimizer is bounded by η·𝓛_aux(θ*) and vanishes as η→0, so the auxiliary term regularizes without ever destroying the global TB anchor.
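One way to see where a bound of this shape comes from, sketched under assumptions the text leaves implicit: θ* is taken to be a parameter with zero TB loss (the target is realizable), \(\hat\theta_\eta\) is any minimizer of the combined objective, and 𝓛_aux is non‑negative. Writing 𝓛_TB(θ) for the expected squared TB residual, optimality of \(\hat\theta_\eta\) gives

\[ \mathcal{L}_{\mathrm{TB}}(\hat\theta_\eta) + \eta\,\mathcal{L}_{\mathrm{aux}}(\hat\theta_\eta) \;\le\; \mathcal{L}_{\mathrm{TB}}(\theta^*) + \eta\,\mathcal{L}_{\mathrm{aux}}(\theta^*) \;=\; \eta\,\mathcal{L}_{\mathrm{aux}}(\theta^*) \]

Dropping the non‑negative auxiliary term on the left leaves \(\mathcal{L}_{\mathrm{TB}}(\hat\theta_\eta) \le \eta\,\mathcal{L}_{\mathrm{aux}}(\theta^*)\), which goes to zero as η→0 provided 𝓛_aux(θ*) is finite.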

2 · SubM: submodular, coverage‑aware replay

Reward‑prioritized replay creates rich‑get‑richer dynamics. SubM instead refreshes a fixed‑size buffer by greedily maximizing a monotone submodular objective over the union of the current buffer and the new batch — jointly rewarding quality, facility‑location diversity, and length coverage:

\[ f(S) = \underbrace{\sum_{x\in S}\mathrm{static}(x)}_{\text{quality}} + \lambda_{\mathrm{div}}\underbrace{\sum_{v\in\mathcal{G}}\max_{x\in S}\mathrm{sim}(v,x)}_{\text{diversity coverage}} + \lambda_{\mathrm{len}}\underbrace{\textstyle\sum_b \alpha_b\log(1+c_b(S))}_{\text{length coverage}} \]

Greedy selection inherits a (1-1/e) near‑optimality guarantee; with cached similarities each refresh costs O(B|𝒢|) — about 10 ms of overhead per update.
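The greedy refresh can be illustrated with a small, naive sketch. It assumes the ground set 𝒢 is the candidate pool itself, that `static` scores and similarities are non‑negative (needed for f to be monotone submodular), and that unspecified bucket weights α_b default to 1. The function name and toy data are hypothetical. The loop rescans every candidate at each step, so it does not reproduce the cost quoted above.

```python
import math


def greedy_refresh(quality, sim, bucket, budget,
                   lam_div=1.0, lam_len=1.0, bucket_weight=None):
    """Greedily pick up to `budget` candidates to maximize f(S).

    quality[x] : non-negative static score of candidate x
    sim[v][x]  : non-negative similarity; G is ASSUMED to be the pool itself
    bucket[x]  : length-bucket id of candidate x
    """
    n = len(quality)
    bucket_weight = bucket_weight or {}
    selected = []
    cover = [0.0] * n  # cover[v] = max over selected x of sim[v][x]
    counts = {}        # counts[b] = c_b(S), items selected so far in bucket b
    remaining = list(range(n))
    while remaining and len(selected) < budget:
        best, best_gain = None, -math.inf
        for x in remaining:
            # Marginal gain of each of the three terms of f when adding x.
            div_gain = sum(max(sim[v][x] - cover[v], 0.0) for v in range(n))
            c = counts.get(bucket[x], 0)
            weight = bucket_weight.get(bucket[x], 1.0)
            len_gain = weight * (math.log(2 + c) - math.log(1 + c))
            gain = quality[x] + lam_div * div_gain + lam_len * len_gain
            if gain > best_gain:
                best, best_gain = x, gain
        selected.append(best)
        remaining.remove(best)
        cover = [max(cover[v], sim[v][best]) for v in range(n)]
        counts[bucket[best]] = counts.get(bucket[best], 0) + 1
    return selected


# Toy pool: candidates 0 and 1 are near-duplicates, candidate 2 is distinct.
quality = [1.0, 0.9, 0.2]
sim = [[1.0, 0.95, 0.0],
       [0.95, 1.0, 0.0],
       [0.0, 0.0, 1.0]]
bucket = [0, 0, 1]

subm_pick = greedy_refresh(quality, sim, bucket, budget=2)
reward_only_pick = greedy_refresh(quality, sim, bucket, budget=2,
                                  lam_div=0.0, lam_len=0.0)
```

On this toy pool the default weights keep candidates 0 and 2, while setting both weights to zero (pure quality ranking) keeps the near‑duplicates 0 and 1. That contrast is the rich‑get‑richer effect the diversity and length terms are meant to counter.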

Results

+0.13: QED score over TB on SMILES (0.844 vs 0.717)

2×: coverage of best baseline on Expr24 (NormCov 0.209 vs 0.100)

0.97+: validity preserved where SubTB drops to 0.33

32B: consistent gains across scales 1B → 3B → 8B → 32B

Scaffold‑conditioned SMILES generation

RL baselines maximize reward and collapse to a single mode (PPO entropy ≈ 0). Among GFlowNet objectives, TB is valid but short and low‑diversity; SubTB drifts and loses validity. RapTB + SubM wins the quality–diversity trade‑off while keeping validity high.

Method	Acc ↑	Score ↑	Entropy ↑	FPDiv ↑	Len
PPO	1.000	0.604	≈0	—	—
GRPO	0.997	0.661	0.98	—	10.0
TB	0.998	0.717	2.503	0.807	3.065
SubTB	0.328	0.755	2.127	0.836	8.354
RapTB	0.996	0.740	2.448	0.860	6.142
RapTB + SubM	0.988	0.844	2.726	0.898	7.435

Metrics on valid samples; Len is average token length. RapTB+SubM is best on Score, Entropy and FPDiv.

Valid-only length histogram — **Length distribution.** TB piles onto length 0–2; SubTB onto 9–10. RapTB (+SubM) spreads mass across the horizon.

**Prefix‑collapse diagnostics.** TB shows rapid survival decay and a spiking top‑1 prefix mass; RapTB sustains high prefix entropy and low concentration deep into the trajectory.

Score and FP-diversity versus length — **Length‑stratified quality.** Conditioned on length, RapTB+SubM holds the highest score and fingerprint diversity even in the long‑sequence regime where TB degrades.

Beyond molecules: Expr24 & CommonGen

Expr24 (sparse, enumerable reward)

Generate an arithmetic expression evaluating to 24. Under standard replay, TB collapses to Unique_✓≈5. RapTB recovers diversity at near‑perfect accuracy, and RapTB+SubM doubles normalized coverage (0.209 vs 0.100) over the strongest baseline. Because the solution set is enumerable, we can also verify directly that the KL and JS divergences to the true reward distribution are lower.

CommonGen (real linguistic priors)

With a frozen reference LM as anchor, SubTB deviates catastrophically — saturating length (20.0) by suppressing stop logits (Δlog p_term≈−28). RapTB stays calibrated, and RapTB+SubM reaches BLEU‑4 33.2 at natural length (11.8) with the highest entropy.

Scaling & generality. Gains hold on AMP biological sequence generation and across the Llama‑3 family (1B/3B/8B) and Qwen3‑32B. SubTB's termination drift persists at every scale (Acc 0.31/0.39/0.80 at 3B/8B/32B), confirming the failure is structural, while RapTB+SubM gives the best quality–diversity trade‑off at all sizes.

Key takeaways

Mode collapse in LLM‑GFlowNets is a reproducible combination of prefix collapse and length bias, driven by high‑variance terminal credit and replay‑induced distribution shift.
RapTB adds rooted‑prefix supervision with suffix‑absorbed targets — dense, low‑variance credit that never breaks the exact TB anchor.
SubM reframes replay refresh as cheap submodular maximization, broadening coverage with a greedy near‑optimality guarantee.
Together they deliver the best reward–diversity trade‑off, preserve validity, and scale to 32B across molecular, symbolic, biological, and text tasks.

BibTeX

@inproceedings{wang2026raptb,
  title         = {Rooted Absorbed Prefix Trajectory Balance with Submodular
                   Replay for {GFlowNet} Training},
  author        = {Wang, Xi and Lu, Wenbo and Wang, Shengjie},
  booktitle     = {Proceedings of the 43rd International Conference on
                   Machine Learning (ICML)},
  year          = {2026},
  eprint        = {2603.00454},
  archivePrefix = {arXiv},
  url           = {https://arxiv.org/abs/2603.00454}
}
