RL DFX Metrics

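Introduction

RL training metrics cannot stop at reward, loss, and throughput. A usable DFX framework has to account for correctness, stability, memory, performance, load balancing, and data quality at the same time.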

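1. Why RL needs dedicated DFX

rollout, reward, logprob, ref, and actor update form a multi-stage pipeline, not a single forward/backward pass.
A single step contains both inference and training, plus data rearrangement, communication, checkpointing, and asynchronous scheduling.
Matching accuracy curves does not mean the system's characteristics have been validated; many performance and stability problems are hidden by averaged metrics.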

2. Metric categories

2.1 Correctness metrics

shape checksum
mask ratio
token alignment
sample / group id alignment
NaN / Inf count
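
Three of the checks above (mask ratio, NaN / Inf count, sample id alignment) can be sketched in a few lines. This is a minimal illustration on plain Python lists, not verl code; the function names are hypothetical, and a real implementation would more likely operate on tensors at stage boundaries.

```python
import math


def mask_ratio(mask):
    """Fraction of positions kept by a 0/1 mask given as nested lists."""
    flat = [m for row in mask for m in row]
    return sum(flat) / len(flat) if flat else 0.0


def count_nonfinite(values):
    """Return (nan_count, inf_count) for a flat list of floats."""
    nan_count = sum(1 for v in values if math.isnan(v))
    inf_count = sum(1 for v in values if math.isinf(v))
    return nan_count, inf_count


def ids_aligned(ids_in, ids_out):
    """True when a stage returns the same sample ids in the same order."""
    return list(ids_in) == list(ids_out)
```

A sudden change in mask ratio between two stages, or a mismatch reported by `ids_aligned`, usually points at a data rearrangement bug rather than a modeling problem.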

2.2 Stability metrics

actor/entropy
actor/ppo_kl
actor/kl_loss
actor/grad_norm
actor/pg_clipfrac
reward mean / max / min

2.3 Performance metrics

step time
stage time
tokens/s
samples/s
MFU
SMA
pipeline bubble

2.4 Memory metrics

allocated memory
reserved memory
fragmentation ratio
activation peak
KV cache peak
communication buffer peak
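
The list does not define the fragmentation ratio. One common choice, assumed here, is the share of reserved memory that is not currently allocated. The inputs could come from counters such as `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()` on CUDA devices; the function name below is hypothetical.

```python
def fragmentation_ratio(allocated_bytes, reserved_bytes):
    """Share of reserved memory not backing live allocations.

    Assumed definition: (reserved - allocated) / reserved.
    Returns 0.0 when nothing is reserved.
    """
    if reserved_bytes <= 0:
        return 0.0
    return (reserved_bytes - allocated_bytes) / reserved_bytes
```

Under this definition, a ratio that keeps rising while reserved memory sits near the device limit means the allocator is holding memory it cannot reuse efficiently.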

2.5 Load-balancing metrics

per-rank token count
per-rank active time
per-rank idle time
queue depth
straggler rank
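
The per-rank metrics above can be reduced to a few summary numbers. The sketch below assumes per-rank values have already been gathered into one list per step, and that all ranks synchronize at the end of the step so idle time is the gap to the slowest rank. All names are hypothetical.

```python
def imbalance(per_rank_values):
    """max / mean of a per-rank quantity; 1.0 means perfectly balanced."""
    mean = sum(per_rank_values) / len(per_rank_values)
    return max(per_rank_values) / mean if mean > 0 else 1.0


def straggler_rank(per_rank_active_time):
    """Index of the rank with the longest active time."""
    return max(range(len(per_rank_active_time)),
               key=per_rank_active_time.__getitem__)


def per_rank_idle(per_rank_active_time):
    """Idle time per rank, assuming every rank waits for the slowest one."""
    slowest = max(per_rank_active_time)
    return [slowest - t for t in per_rank_active_time]
```

For example, `imbalance([100, 100, 200])` is about 1.5, which says the busiest rank processed 50% more tokens than the average rank.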

2.6 Data quality metrics

prompt length mean / max / min
response length mean / max / min
clip ratio
aborted ratio
reward distribution

3. Per-stage metric design

3.1 rollout

prefill tokens
decode tokens
request latency
TTFT / TPOT
KV cache utilization
abort / timeout ratio
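
TTFT and TPOT can be derived from three timestamps per request. The sketch below uses the usual definitions (time to first token; average time per output token after the first) and assumes the inference engine exposes those timestamps; the function and parameter names are hypothetical.

```python
def ttft(request_start, first_token_time):
    """Time to first token, in the same unit as the timestamps."""
    return first_token_time - request_start


def tpot(first_token_time, last_token_time, num_output_tokens):
    """Average time per output token after the first one.

    Returns None when fewer than two tokens were generated,
    because there is no decode interval to average over.
    """
    if num_output_tokens < 2:
        return None
    return (last_token_time - first_token_time) / (num_output_tokens - 1)
```

TTFT mostly reflects prefill cost and queueing, while TPOT reflects decode speed, so tracking them separately helps attribute rollout latency to the right phase.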

3.2 reward

reward latency
reward batch tokens
rule reward exception count
reward model throughput

3.3 old / ref logprob

micro-batch token count
forward time
activation peak
logprob shape
response mask ratio

3.4 actor update

forward time
backward time
optimizer time
communication time
grad norm
loss decomposition

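4. Draft alerting rules

Prefer composite alerts

A single abnormal metric does not necessarily mean the system is broken. For example, reward jitter may just be data variance, but a KL spike, a grad norm spike, and entropy collapse appearing together should be treated as a strong alert.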

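Training divergence: KL, grad norm, and entropy are abnormal at the same time.
Memory risk: reserved memory approaches the physical limit while the gap between allocated and reserved widens.
Dynamic batch runaway: the token count of a single step rises abnormally.
Multi-device imbalance: per-rank active time or token count deviates too much.
Inference anomaly: abort ratio, timeout ratio, or response clip ratio increases.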
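
The training-divergence rule can be sketched as a composite check over recent history. The thresholds below (3x the recent median for a spike, 0.3x for a collapse) are illustrative assumptions, not tuned values, and the helper names are hypothetical; only the metric keys come from the lists above.

```python
from statistics import median


def is_spike(history, current, factor=3.0):
    """True when current exceeds factor x the median of recent values."""
    return bool(history) and current > factor * median(history)


def is_collapse(history, current, fraction=0.3):
    """True when current falls below fraction x the median of recent values."""
    return bool(history) and current < fraction * median(history)


def divergence_alert(history, current):
    """Combine KL, grad norm, and entropy signals into one alert level.

    history: dict of metric name -> list of recent values
    current: dict of metric name -> latest value
    Returns (level, fired) where level is "strong", "weak", or "ok".
    """
    flags = {
        "kl_spike": is_spike(history["actor/ppo_kl"], current["actor/ppo_kl"]),
        "grad_norm_spike": is_spike(history["actor/grad_norm"],
                                    current["actor/grad_norm"]),
        "entropy_collapse": is_collapse(history["actor/entropy"],
                                        current["actor/entropy"]),
    }
    fired = [name for name, hit in flags.items() if hit]
    if len(fired) == len(flags):
        return "strong", fired
    return ("weak", fired) if fired else ("ok", fired)
```

Returning the list of fired conditions along with the level keeps a weak alert informative without paging anyone on a single noisy metric.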

5. Dashboard sketch

5.1 E2E timeline

Shows the time breakdown of rollout -> reward -> logprob/ref -> update -> checkpoint.
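
One way to collect the data behind such a timeline is a small stage timer wrapped around each phase of a step. This is a host-side sketch with a hypothetical `StageTimer` class; it measures wall-clock time only, so asynchronous device work would need an explicit synchronization before each stage ends to be attributed correctly.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock seconds per named stage within a step."""

    def __init__(self):
        self.seconds = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raised.
            self.seconds[name] += time.perf_counter() - start

    def breakdown(self):
        """Fraction of total recorded time spent in each stage."""
        total = sum(self.seconds.values())
        if total <= 0:
            return {}
        return {name: sec / total for name, sec in self.seconds.items()}
```

Usage would look like `with timer.stage("rollout"): ...` around each phase, followed by logging `timer.seconds` and `timer.breakdown()` once per step.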

5.2 per-rank heatmap

Shows each device's token count, active time, idle time, and communication time.

5.3 memory timeline

Shows timelines of allocated, reserved, KV cache, and activation peak.

5.4 training stability panel

Shows how reward, KL, entropy, grad norm, and clipfrac move together.

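6. Items to verify

The exact definitions of the metrics already present in current verl logs.
The MFU / SMA formulas and data sources in the Ascend environment.
Whether a shape ledger can be recorded automatically at DataProto or worker boundaries.
Whether queue depth, policy lag, and stale sample ratio can be added in asynchronous mode.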
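
If each sample can be tagged with the policy version that generated it, policy lag and stale sample ratio are simple to compute. The sketch below assumes integer policy versions and a hypothetical `max_lag` threshold; whether such a version tag is available in asynchronous mode is exactly what remains to be verified.

```python
def policy_lags(sample_versions, current_version):
    """Per-sample lag: trainer policy version minus generating version."""
    return [current_version - v for v in sample_versions]


def stale_sample_ratio(sample_versions, current_version, max_lag=1):
    """Fraction of samples whose lag exceeds max_lag (assumed threshold)."""
    lags = policy_lags(sample_versions, current_version)
    if not lags:
        return 0.0
    return sum(1 for lag in lags if lag > max_lag) / len(lags)
```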