RAMAC: Multimodal Risk-Aware Offline Reinforcement Learning and the Role of Behavior Regularization

Kai Fukazawa¹ · Kunal Mundada² · Iman Soltani¹

¹ Dept. of Mechanical and Aerospace Engineering, UC Davis · ² Dept. of Computer Science, UC Davis

kfukazawa@ucdavis.edu · kmundada@ucdavis.edu · isoltani@ucdavis.edu


TL;DR

RAMAC couples an expressive diffusion actor with a distributional critic to learn risk-aware policies from offline data — achieving strong CVaR performance without sacrificing multimodality or incurring out-of-distribution actions.


Abstract

Offline RL in safety-critical settings must deliver high returns and avoid catastrophic tail outcomes. Existing risk-aware methods trade away policy expressiveness to achieve safety, while expressive diffusion and flow policies operate only in risk-neutral settings. RAMAC closes this gap by pairing a generative actor with a distributional critic and training jointly on a BC + CVaR objective: the BC term keeps actions on the data manifold, and the CVaR term shifts probability mass away from low-quantile outcomes. We prove that BC regularization directly bounds out-of-distribution action probability, making it a principled, risk-agnostic safety mechanism. We validate this on Stochastic-D4RL, where the diffusion instantiation RADAC attains the best $\mathrm{CVaR}_{0.1}$ on five of six tasks while maintaining strong mean returns.


The Gap

Diffusion and flow policies are the most expressive tools in offline RL — but they optimize mean return only, leaving catastrophic lower-tail outcomes entirely unaddressed. Risk-aware methods like ORAAC and CODAC handle tail risk, but only by constraining the policy class in ways that prevent capturing multimodal action distributions. No prior work achieves risk-awareness, OOD control, and expressiveness simultaneously.

Method Risk-Awareness OOD Control Expressiveness
Expressive Risk-Neutral DiffusionQL, FlowQL ✗ / ✓
Risk-Aware Restricted ORAAC, CODAC
RAMAC (Ours)

Comparison of offline RL paradigms on three desiderata. RAMAC is the only approach achieving all three simultaneously.


Our Approach

RAMAC operates in two stages. A distributional critic (IQN) learns the full conditional return distribution, giving access to calibrated lower-tail quantiles. A generative actor — instantiated as a diffusion policy (RADAC) — is trained jointly on two objectives: a BC term that keeps actions on the data manifold, and a CVaR term that shifts probability mass away from low-quantile outcomes.

Figure 1. RAMAC pipeline. The distributional critic $Z_\phi$ fits the return distribution and aggregates its lower tail into a CVaR signal. That signal is differentiated through the generative actor $\pi_\theta$ (diffusion or flow), trained with the composite objective $\mathcal{L}_\pi = \mathcal{L}_{\mathrm{BC}} + \eta\,\mathcal{L}_{\mathrm{Risk}}$ to shift mass away from low-quantile regions while staying on-manifold.
$$\mathcal{L}_\pi(\theta) = \underbrace{\mathcal{L}_{\mathrm{BC}}(\theta)}_{\text{data fidelity}} + \eta\,\underbrace{\mathcal{L}_{\mathrm{Risk}}(\theta)}_{\text{tail-risk aversion}}$$

Eq. 12. The BC term keeps actions on the data manifold and directly bounds OOD probability (Proposition 1). The CVaR term shifts mass away from the worst-α return outcomes.
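As a rough illustration of how a lower-tail CVaR signal can be read off a quantile-based critic and folded into the composite objective, the sketch below averages a quantile function over levels in (0, α) and plugs the result into a loss of the form of Eq. 12. It is a minimal sketch under stated assumptions, not our training code: the Gaussian `toy_critic`, the helper names `cvar_from_quantile_fn` and `actor_loss`, the numeric values, and the sign convention ℒRisk = −CVaR are all hypothetical choices made for the example.

```python
from statistics import NormalDist


def cvar_from_quantile_fn(quantile_fn, alpha=0.1, n_taus=1000):
    """Approximate CVaR_alpha = (1/alpha) * integral_0^alpha F^{-1}(tau) dtau.

    Uses a midpoint grid of quantile levels on (0, alpha); averaging the
    quantile values on that grid approximates the normalized integral.
    """
    taus = [alpha * (i + 0.5) / n_taus for i in range(n_taus)]
    return sum(quantile_fn(t) for t in taus) / n_taus


def actor_loss(bc_loss, cvar, eta):
    """Composite loss L_BC + eta * L_Risk, ASSUMING L_Risk = -CVaR.

    Minimizing this loss raises CVaR (better worst-case returns) while
    the BC term penalizes drifting away from the data.
    """
    return bc_loss + eta * (-cvar)


# Hypothetical stand-in for a critic: returns for one (state, action)
# pair are modeled as N(5, 2^2). A learned IQN would supply quantiles instead.
toy_critic = NormalDist(mu=5.0, sigma=2.0)

cvar = cvar_from_quantile_fn(toy_critic.inv_cdf, alpha=0.1)
loss = actor_loss(bc_loss=0.3, cvar=cvar, eta=0.5)  # illustrative values only
```

For a Gaussian the result can be checked against the closed form μ − σ·φ(Φ⁻¹(α))/α, which makes the toy critic a convenient sanity test before swapping in a learned quantile network.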


Theoretical Insight

Why It Works

Prior-anchored perturbation methods like ORAAC cannot fully eliminate out-of-distribution actions: on thin or nonconvex supports, the anchor ball inevitably overlaps the out-of-support region, leaving a strictly positive OOD probability no matter how the residual policy is trained. RAMAC instead applies BC regularization directly to the deployed generative policy — and we prove this yields a hard upper bound on per-state OOD probability.

OOD Probability Bound

For each state s, the per-state OOD probability $\delta_s(\pi_\theta)$ satisfies:

$$\delta_s(\pi_\theta) \;\leq\; 1 - \exp\!\left\{ -D_{\mathrm{KL}}\!\left(\beta(\cdot\mid s)\,\|\,\pi_\theta(\cdot\mid s)\right) \right\}$$

Shrinking the forward KL via BC therefore directly suppresses per-state OOD probability. The weight η in Eq. 12 sets how strongly the risk term competes with this BC pressure.
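The bound can be checked numerically on a small discrete example. The sketch below assumes that "OOD" means probability mass the policy places on actions outside the support of the behavior policy β; the four-action distributions and the helper names are hypothetical. Under that assumption the bound follows from the log-sum inequality, which gives D_KL(β‖π) ≥ −log π(supp β).

```python
import math


def forward_kl(beta, pi):
    """D_KL(beta || pi) for discrete distributions given as equal-length lists."""
    total = 0.0
    for b, p in zip(beta, pi):
        if b == 0.0:
            continue  # terms with beta(a) = 0 contribute nothing
        if p == 0.0:
            return math.inf  # pi misses an in-support action
        total += b * math.log(b / p)
    return total


def ood_probability(beta, pi):
    """Mass pi puts on actions that beta never takes (assumed OOD definition)."""
    return sum(p for b, p in zip(beta, pi) if b == 0.0)


# Hypothetical behavior policy supported on the first two of four actions.
beta = [0.5, 0.5, 0.0, 0.0]

# Hypothetical learned policies, from far from beta to exactly matched.
policies = {
    "far": [0.35, 0.25, 0.25, 0.15],
    "near": [0.45, 0.40, 0.10, 0.05],
    "matched": [0.5, 0.5, 0.0, 0.0],
}

results = {}
for name, pi in policies.items():
    delta = ood_probability(beta, pi)
    bound = 1.0 - math.exp(-forward_kl(beta, pi))
    results[name] = (delta, bound)
    print(f"{name:8s} delta={delta:.4f}  bound={bound:.4f}")
```

In this toy setting the bound tightens as the policy approaches β, and it collapses to zero when the two coincide.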


Experiments

We evaluate RADAC on Stochastic-D4RL: six standard D4RL locomotion datasets augmented with rare heavy-tailed penalties, covering HalfCheetah, Hopper, and Walker2d. Results are averaged over 5 seeds and 50 evaluation rollouts; we report both mean return and episodic $\mathrm{CVaR}_{0.1}$.
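For readers unfamiliar with the metric, an episodic CVaR at level 0.1 can be estimated as the mean of the worst 10% of episode returns. The sketch below shows one common empirical estimator. The function name `empirical_cvar` and the synthetic returns are hypothetical, and the exact estimator and pooling across seeds used for the reported numbers are not specified here.

```python
import math


def empirical_cvar(returns, alpha=0.1):
    """Mean of the worst ceil(alpha * n) episode returns."""
    if not returns:
        raise ValueError("need at least one return")
    n = len(returns)
    # Small epsilon guards against float round-up, e.g. 0.1 * 30.
    k = max(1, math.ceil(alpha * n - 1e-9))
    worst = sorted(returns)[:k]
    return sum(worst) / k


# Synthetic example: 50 rollouts with returns 1.0, 2.0, ..., 50.0.
episode_returns = [float(i) for i in range(1, 51)]

mean_return = sum(episode_returns) / len(episode_returns)
cvar_01 = empirical_cvar(episode_returns, alpha=0.1)  # mean of the 5 worst
```

With 50 rollouts the estimate rests on only five episodes, which is one reason tail metrics are usually averaged over several seeds.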

2-D Risky Bandit

A 2-D contextual bandit isolates exactly what makes risk-aware offline RL hard: the ground truth has a safe center mode (moderate reward, no catastrophic tail) and a risky ring (higher mean with rare large penalties). This controlled geometry makes multimodality and lower-tail hazard visible — and reveals where each method's policy mass actually concentrates.

Figure pending — see paper

Top: Ground truth consists of a safe center mode (yellow-green) and a risky ring where high-reward samples (yellow) are interspersed with catastrophic penalties (purple). Risk-neutral generative baselines concentrate on the risky ring or collapse topology. Prior-anchored perturbation methods produce samples in the low-density inter-mode region, exhibiting OOD leakage. RADAC concentrates near the safe center without losing multimodality.
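To make the geometry concrete, the sketch below builds a toy reward function with a safe center and a risky ring, then compares mean and lower-tail statistics for actions drawn from each region. Every number in it (radii, reward levels, the 5% penalty probability, the neutral reward elsewhere) is a hypothetical choice for illustration and not the configuration used in our experiments.

```python
import math
import random


def sample_reward(action, rng):
    """Toy 2-D bandit reward: safe center mode vs. risky ring (all values assumed)."""
    x, y = action
    radius = math.hypot(x, y)
    if radius < 0.5:
        # Safe center: moderate reward, small noise, no catastrophic tail.
        return 1.0 + rng.gauss(0.0, 0.1)
    if abs(radius - 2.0) < 0.25:
        # Risky ring: high reward most of the time, rare large penalty.
        return -20.0 if rng.random() < 0.05 else 3.0
    return 0.0  # neutral reward elsewhere (assumption)


def mean_and_cvar(rewards, alpha=0.1):
    """Return (mean, mean of the worst alpha-fraction) of the sampled rewards."""
    k = max(1, int(alpha * len(rewards)))
    ordered = sorted(rewards)
    return sum(rewards) / len(rewards), sum(ordered[:k]) / k


rng = random.Random(0)
center = [sample_reward((0.0, 0.0), rng) for _ in range(20000)]
ring = [sample_reward((2.0, 0.0), rng) for _ in range(20000)]

center_mean, center_cvar = mean_and_cvar(center)
ring_mean, ring_cvar = mean_and_cvar(ring)
print(f"center: mean={center_mean:.2f}  CVaR0.1={center_cvar:.2f}")
print(f"ring:   mean={ring_mean:.2f}  CVaR0.1={ring_cvar:.2f}")
```

In this toy the ring wins on mean reward but loses badly on the lower tail, which is the tension a risk-neutral objective ignores and a CVaR objective penalizes.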

Stochastic-D4RL Benchmark

Dataset         Metric  CQL       CODAC     ORAAC     FlowQL    DiffusionQL  RADAC
HalfCheetah-ME  Mean    −66.66    −0.12     796.06    844.14    −20.71       916.64
HalfCheetah-ME  CVaR    −135.39   −0.11     742.94    754.44    −76.39       805.25
Walker2d-ME     Mean    −21.52    23.96     969.62    1309.48   −32.38       1708.68
Walker2d-ME     CVaR    −64.88    −43.88    358.55    468.15    −116.19      573.22
Hopper-ME       Mean    −25.87    26.59     714.15    341.16    −279.97      130.74
Hopper-ME       CVaR    −111.37   −150.92   374.63    −8.80     −872.95      −167.29
HalfCheetah-MR  Mean    −66.21    −0.11     18.99     434.33    279.95       525.84
HalfCheetah-MR  CVaR    −127.09   −1.47     −34.09    224.73    79.93        278.65
Walker2d-MR     Mean    −16.90    33.59     126.94    411.36    96.88        615.94
Walker2d-MR     CVaR    −51.49    −52.63    −203.64   5.08      48.14        145.21
Hopper-MR       Mean    −16.25    −47.83    −18.00    373.16    −2.79        385.58
Hopper-MR       CVaR    −118.70   −160.08   −129.25   −62.24    −51.33       −8.16

Table 1. Stochastic-D4RL results over 5 seeds. Best per row bold; second-best shaded. RADAC attains the best tail return ($\mathrm{CVaR}_{0.1}$) on five of six tasks; ORAAC leads on Hopper-ME.

Full results with standard errors appear in Appendix E.2 of the paper.

OOD Action Rate

RADAC achieves lower out-of-distribution action rates than ORAAC on all three Medium-Expert tasks, supporting the claim that BC regularization on the deployed policy directly suppresses OOD visitation.

Task         RADAC (ours)   ORAAC
HalfCheetah  2.04 ± 0.80    6.15 ± 1.5
Walker2d     0.75 ± 0.54    10.84 ± 1.98
Hopper       0.77 ± 0.56    2.68 ± 1.01

Table 2. OOD action rate (% ± s.e.) on Medium-Expert (κ=3). RADAC is consistently lower, confirming that BC regularization on the deployed policy suppresses OOD visitation.


Citation

@article{fukazawa2025ramac,
  title   = {RAMAC: Multimodal Risk-Aware Offline Reinforcement Learning
             and the Role of Behavior Regularization},
  author  = {Fukazawa, Kai and Mundada, Kunal and Soltani, Iman},
  journal = {arXiv preprint arXiv:2510.02695},
  year    = {2025}
}