continuous post-training · humans inside the loop

Online RLHF.
Feedback as fast as your model generates.

A continuous post-training system in which human preferences stream directly into the optimizer. Every training step pushes a group of candidates as a Flow item, gets a ranking back in seconds, and turns it into a reward signal.

the inner loop
Policy (~256 GPUs) → Generation (8 candidates / prompt) → Flow item (pairwise comparisons) → Global crowd (6K+ ann / min) → Elo / BT (ranking + CI) → Reward signal (preference pairs)
throughput: 6,320 ann/min · flow items live: 323
01 the shift

Static datasets cap out. Online loops don’t.

Traditional RLHF collects a preference dataset once, trains a reward model, and runs PPO against it for weeks. After a few hundred steps the reward model no longer matches what the policy actually produces, and the dataset stops reflecting the things your current model is getting wrong. Online RLHF closes that gap by treating each training step as its own micro-experiment: a small group of candidates, ranked by humans, fed back to the optimizer within seconds.

same wall-clock window (→ time)
Traditional RLHF (static dataset · periodic retrain): collect 100k pairs (weeks) → reward model train → PPO run → eval → next collection
Online RLHF (continuous · feedback inside the step): one training step = one flow item · ~3-8s end-to-end · hundreds in parallel
feedback latency ≤ image generation time → humans stop being the bottleneck
02 how it works

One training step = one Flow item.

For each prompt, the model emits a group of candidates, usually 4 to 16. The group is submitted as a single Flow item, pairwise comparisons are shown to human annotators in parallel, and the responses are aggregated with Elo into a ranking.
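The aggregation step can be pictured with a minimal sketch, assuming each annotator response arrives as a (winner, loser) pair and that a plain sequential Elo update is used. The function name `elo_rank`, the K-factor of 32, and the base rating of 1200 are illustrative assumptions, not details of the hosted service.

```python
from collections import defaultdict

def elo_rank(votes, k=32.0, base=1200.0):
    """Rank candidates from (winner, loser) votes with sequential Elo updates.

    k and base are assumed values chosen for illustration.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in votes:
        # Expected score of the winner under the logistic Elo model.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * (1.0 - expected)
        ratings[winner] += delta   # winner gains what the loser gives up
        ratings[loser] -= delta
    # Highest rating first.
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)

# Four hypothetical votes over three candidates.
votes = [("c1", "c2"), ("c1", "c3"), ("c2", "c3"), ("c1", "c2")]
ranking = elo_rank(votes)
```

Sequential Elo depends on the order in which votes arrive, which is one reason a batch model such as Bradley-Terry is often fitted alongside it.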

one flow item · 8 candidates · 1 prompt · step 12,408
prompt: “a serene mountain at sunset, cinematic”
candidates: c1, c2, c3, c4, c5, c6, c7, c8
C(8, 2) = 28 possible pairs · sampled adaptively to maximize information gain

aggregated ranking (Elo) · 312 / 500 votes
  • 01: 1342
  • 02: 1308
  • 03: 1284
  • 04: 1267
  • 05: 1219
  • 06: 1198
  • 07: 1175
  • 08: 1156

preference signal → training
winner: c1 · 7 preference pairs · conf 0.94
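To make the numbers in the panel concrete, here is a small worked sketch that counts the pairs and converts an Elo gap into an expected win probability. It assumes the conventional 400-point logistic Elo scale, which the panel does not state. Picking the pair with the closest ratings is only a simple stand-in for "maximize information gain"; the actual sampling scheme is not described here.

```python
from itertools import combinations

# Ratings for ranks 01..08 as shown in the panel above.
ratings = [1342, 1308, 1284, 1267, 1219, 1198, 1175, 1156]

def win_probability(r_a, r_b):
    """Probability that A is preferred over B on the assumed 400-point Elo scale."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# Every unordered pair of the 8 candidates: C(8, 2) of them.
pairs = list(combinations(range(len(ratings)), 2))

# Expected preference of the top-ranked candidate over the bottom-ranked one.
p_top_vs_bottom = win_probability(ratings[0], ratings[-1])

# Assumed heuristic: the pair with the smallest rating gap is the least
# certain, so another vote on it is likely to be the most informative.
closest = min(pairs, key=lambda ij: abs(ratings[ij[0]] - ratings[ij[1]]))
```

With these ratings the tightest gap is between ranks 03 and 04, so that comparison would be queried next under this heuristic.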
01
Generate a group

Each training step samples 8 candidates per prompt across your parallel GPUs.

02
Open a Flow item

The group is submitted as one Flow item with min / desired / max response thresholds and a ttl.

03
Aggregate at scale

Pairwise comparisons stream in from the global crowd; Elo / Bradley-Terry aggregation produces a ranking with confidence intervals (CI) in seconds.

04
Train on the signal

The winner and the full set of preference pairs become reward-modeling or DPO targets for the next gradient step.
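Steps 03 and 04 can be sketched together: fit Bradley-Terry strengths from a table of pairwise win counts, then emit winner-versus-rest pairs as (chosen, rejected) targets. This is a minimal illustration using the standard MM (Zermelo) iteration. The helper names, the toy vote counts, and the choice to emit only winner-versus-rest pairs are assumptions, and confidence intervals (for example via bootstrapping the votes) are left out.

```python
def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths with the MM (Zermelo) iteration.

    wins[i][j] is the number of votes where item i beat item j.
    Assumes every item has at least one win and one loss.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        s = sum(new_p)
        p = [x / s for x in new_p]   # normalise so strengths sum to 1
    return p

def winner_pairs(names, strengths):
    """Pair the strongest candidate with every other one as (chosen, rejected)."""
    order = sorted(range(len(names)), key=lambda i: strengths[i], reverse=True)
    best = names[order[0]]
    return [(best, names[i]) for i in order[1:]]

# Hypothetical vote counts for three candidates (10 votes per pair).
names = ["c1", "c2", "c3"]
wins = [[0, 7, 9],
        [3, 0, 6],
        [1, 4, 0]]
strengths = bradley_terry(wins)
pairs = winner_pairs(names, strengths)
```

With a group of 8 candidates the same helper yields 7 pairs, matching the "winner: c1 · 7 preference pairs" readout in the panel.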

03 in code

Drop it into your training step.

One flow stays open for the entire run. Each step pushes a batch and polls for a ranking. Because the call is non-blocking and ttl-bounded, your training loop never waits on humans: incomplete items still return partial results.

online_rlhf.py
from rapidata import RapidataClient

client = RapidataClient()

# 1. Open one ranking flow for the entire training run
flow = client.flow.create_ranking_flow(
    name="online-rlhf · image-gen",
    instruction="Which image looks better?",
)

# 2. Inside the training loop: one flow item per step.
#    train_loop, policy, prompt and optimizer stand in for your own training code.
for step in train_loop:
    candidates = policy.sample(prompt, n=8)        # 8 per prompt
    item = flow.create_new_flow_batch(
        datapoints=candidates,
        context=f"step {step}",
        time_to_live=300,                          # seconds, bounded
    )

    # Non-blocking inspection while votes accumulate
    status  = item.get_status()
    matrix  = item.get_win_loss_matrix()           # pandas DataFrame
    results = item.get_results()                   # rankings + scores

    # 3. Feed the win/loss matrix into your DPO / reward modeling
    optimizer.step(reward_signal=matrix)
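
The listing reads results immediately after submitting. A loop that prefers to give votes a short, bounded window before consuming whatever has arrived could use a polling helper along these lines. This is a generic sketch: `poll_until` and `FakeItem` are hypothetical and not part of the Rapidata SDK, and the 100-votes-per-poll behaviour exists only to make the example self-contained.

```python
import time

def poll_until(fetch, is_done, budget_s, interval_s=0.01):
    """Call fetch() until is_done(result) holds or the time budget runs out.

    Returns (result, done). The result is always the latest fetch, so the
    caller can still use partial data when done is False.
    """
    deadline = time.monotonic() + budget_s
    result = fetch()
    while not is_done(result) and time.monotonic() < deadline:
        time.sleep(interval_s)
        result = fetch()
    return result, is_done(result)

class FakeItem:
    """Stand-in for a flow item: gains 100 votes per poll (illustration only)."""
    def __init__(self, desired):
        self.votes = 0
        self.desired = desired

    def get_votes(self):
        self.votes = min(self.votes + 100, self.desired)
        return self.votes

# Enough budget: polling stops as soon as the desired vote count is reached.
item = FakeItem(desired=500)
votes, complete = poll_until(item.get_votes, lambda v: v >= 500, budget_s=1.0)

# Zero budget: one fetch only, and the partial count is returned as-is.
rushed = FakeItem(desired=500)
partial_votes, partial_complete = poll_until(
    rushed.get_votes, lambda v: v >= 500, budget_s=0.0
)
```

In a real loop, `fetch` would wrap one of the item accessors from the listing above, and the budget would be set well below the step time so that humans never become the bottleneck.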