Online RLHF.
Feedback as fast as your model generates.

A continuous post-training system where human preferences are streamed directly into the optimizer. Every training step pushes a group of candidates as a Flow item, gets a ranking back in seconds, and turns it into a reward signal.


Example setup

Generation, evaluation, and training run concurrently. Human feedback latency is comparable to image generation time, so people never become the bottleneck.

01 the shift

Static datasets cap out. Online loops don’t.

Traditional RLHF collects a preference dataset once, trains a reward model, and runs PPO against it for weeks. The reward model drifts away from the policy after a few hundred steps, and the dataset stops reflecting the things your current model is getting wrong. Online RLHF closes that gap by treating each training step as its own micro-experiment — a small group of candidates, ranked by humans, fed back to the optimizer within seconds.

same wall-clock window → time

Traditional RLHF: static dataset · periodic retrain

collect 100k pairs (weeks) → reward model train → PPO run → eval → next collection

Online RLHF: continuous · feedback inside the step

one training step = one flow item · ~3-8s end-to-end · hundreds in parallel

feedback latency ≤ image generation time → humans stop being the bottleneck

02 how it works

One training step = one Flow item.

For each prompt, the model emits a group of candidates, usually 4 to 16. The group is submitted as a single Flow item, pairwise comparisons are shown to human annotators in parallel, and the responses are aggregated with Elo into a ranking.
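As a rough illustration of the aggregation step, the sketch below turns a stream of pairwise votes into Elo ratings and a ranking. It is a generic sequential Elo update, not the platform's actual aggregation code; the starting rating of 1200, the K-factor of 32, and the function names are all assumptions made for the example.

```python
from collections.abc import Iterable


def elo_ratings(
    votes: Iterable[tuple[int, int]],
    n_candidates: int,
    k: float = 32.0,      # assumed K-factor
    base: float = 1200.0,  # assumed starting rating
) -> list[float]:
    """Sequential Elo update over (winner, loser) candidate-index pairs."""
    ratings = [base] * n_candidates
    for winner, loser in votes:
        # Expected score of the winner under the logistic Elo model.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    return ratings


def ranking(ratings: list[float]) -> list[int]:
    """Candidate indices ordered from highest to lowest rating."""
    return sorted(range(len(ratings)), key=lambda i: ratings[i], reverse=True)


# Four votes over three candidates: candidate 0 wins three times, candidate 1 once.
votes = [(0, 1), (0, 2), (1, 2), (0, 1)]
ratings = elo_ratings(votes, n_candidates=3)
order = ranking(ratings)
```

Each update moves the same number of points from loser to winner, so the total rating mass stays constant and only the relative order carries information.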

one flow item

8 candidates · 1 prompt

step 12,408

prompt: “a serene mountain at sunset, cinematic”

C(8, 2) = 28 possible pairs · sampled adaptively to maximize information gain

aggregated ranking · Elo · 312 / 500 votes

rank  Elo
01    1342
02    1308
03    1284
04    1267
05    1219
06    1198
07    1175
08    1156

preference signal → training

winner: c1 · 7 preference pairs

conf 0.94
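One plausible reading of "7 preference pairs" is that the winner is paired against each of the other 7 candidates. The sketch below shows that reading next to the full set of 28 ordered pairs implied by an 8-way ranking. The function names and the `c1`…`c8` labels are illustrative, and the interpretation itself is an assumption rather than something the panel above states.

```python
from itertools import combinations


def winner_pairs(ranking: list[str]) -> list[tuple[str, str]]:
    """(chosen, rejected) pairs that pit the top candidate against every other."""
    winner, *rest = ranking
    return [(winner, other) for other in rest]


def all_pairs(ranking: list[str]) -> list[tuple[str, str]]:
    """Every (chosen, rejected) pair implied by the full ordering."""
    # combinations() preserves input order, so the first element of each
    # pair always outranks the second.
    return list(combinations(ranking, 2))


ranking = [f"c{i}" for i in range(1, 9)]  # best first: c1 ... c8
top = winner_pairs(ranking)   # 7 pairs, all with c1 as the chosen item
full = all_pairs(ranking)     # C(8, 2) = 28 pairs
```

Winner-only pairs are the most reliable ones; the full set gives more training signal per group but includes pairs between closely rated candidates.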

Generate a group

Each training step samples 8 candidates per prompt across your parallel GPUs.

Open a Flow item

The group is submitted as one Flow item with min / desired / max response thresholds and a ttl.

Aggregate at scale

Pairwise comparisons stream in from the global crowd; an Elo / Bradley-Terry model produces a ranking with confidence intervals in seconds.

Train on the signal

The winner and full preference pairs become reward-modeling or DPO targets for the next gradient step.
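For readers who want to see what "DPO targets" means in practice, here is a minimal sketch of the standard DPO loss for a single (chosen, rejected) pair. It assumes you can compute a log-probability for each candidate under both the current policy and a frozen reference model; how to obtain those values depends on the model family and is not covered here. The function name and the default `beta` are illustrative choices.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed temperature; tune for your setup
) -> float:
    """DPO loss for one (chosen, rejected) pair of log-probabilities."""
    # How much more the policy likes each candidate than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)), written in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Policy identical to the reference: the loss is log(2).
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy already prefers the human-chosen candidate: the loss is lower.
aligned = dpo_loss(-5.0, -15.0, -10.0, -10.0)
```

The loss falls as the policy raises the chosen candidate's likelihood relative to the rejected one, measured against the reference model.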

03 in code

Drop it into your training step.

One flow stays open for the entire run. Each step pushes a batch and polls for a ranking. Because the call is non-blocking and ttl-bounded, your training loop never waits on humans: incomplete items still return partial results.


online_rlhf.py

from rapidata import RapidataClient

client = RapidataClient()

# 1. Open one ranking flow for the entire training run
flow = client.flow.create_ranking_flow(
    name="online-rlhf · image-gen",
    instruction="Which image looks better?",
)

# 2. Inside the training loop: one flow item per step.
# `train_loop`, `policy`, `prompt`, and `optimizer` are placeholders for your own training code.
for step in train_loop:
    candidates = policy.sample(prompt, n=8)        # 8 per prompt
    item = flow.create_new_flow_batch(
        datapoints=candidates,
        context=f"step {step}",
        time_to_live=300,                          # seconds, bounded
    )

    # Non-blocking inspection while votes accumulate
    status  = item.get_status()
    matrix  = item.get_win_loss_matrix()           # pandas DataFrame
    results = item.get_results()                   # rankings + scores

    # 3. Feed the win/loss matrix into your DPO / reward modeling
    optimizer.step(reward_signal=matrix)
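The snippet above passes the win/loss matrix straight to the optimizer. One possible way to turn such a matrix into per-candidate scalar rewards is sketched below. It assumes `wins[i][j]` counts how often candidate `i` beat candidate `j`; the actual layout returned by `get_win_loss_matrix()` is not documented here, so check it before reusing this. The helper names are hypothetical, and a pandas DataFrame would first need converting to nested lists, for example with `matrix.to_numpy().tolist()`.

```python
def win_rates(wins: list[list[int]]) -> list[float]:
    """Fraction of comparisons each candidate won; 0.5 if it was never compared."""
    n = len(wins)
    rates = []
    for i in range(n):
        won = sum(wins[i][j] for j in range(n) if j != i)
        lost = sum(wins[j][i] for j in range(n) if j != i)
        total = won + lost
        rates.append(won / total if total else 0.5)
    return rates


def centered_rewards(rates: list[float]) -> list[float]:
    """Subtract the group mean so rewards within one group sum to zero."""
    mean = sum(rates) / len(rates)
    return [r - mean for r in rates]


# Assumed layout: wins[i][j] = number of votes where candidate i beat candidate j.
wins = [
    [0, 3, 4],
    [1, 0, 2],
    [0, 2, 0],
]
rates = win_rates(wins)
rewards = centered_rewards(rates)
```

Centering on the group mean keeps rewards comparable across prompts of differing difficulty, since each candidate is scored only against the others sampled for the same prompt.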
