Online RLHF.
Feedback as fast as your model generates.
A continuous post-training system where human preferences are streamed directly into the optimizer. Every training step pushes a group of candidates as a Flow item, gets a ranking back in seconds, and turns it into a reward signal.
Static datasets cap out. Online loops don’t.
Traditional RLHF collects a preference dataset once, trains a reward model, and runs PPO against it for weeks. After a few hundred steps the policy has moved away from the outputs the reward model was trained on, and the dataset no longer reflects what your current model is getting wrong. Online RLHF closes that gap by treating each training step as its own micro-experiment: a small group of candidates, ranked by humans, fed back to the optimizer within seconds.
One training step = one Flow item.
For each prompt, the model emits a group of candidates, usually 4 to 16. The group is submitted as a single Flow item, pairwise comparisons are presented to human annotators in parallel, and the votes are aggregated with Elo into a ranking.
Example ranking of 8 candidates (rank, Elo score):
- 01 · 1342
- 02 · 1308
- 03 · 1284
- 04 · 1267
- 05 · 1219
- 06 · 1198
- 07 · 1175
- 08 · 1156
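The Elo aggregation described above can be sketched in a few lines. This is a minimal illustration of the standard Elo update applied to a stream of pairwise votes, not Rapidata's implementation: the function name `elo_rank`, the K-factor of 32, and the starting rating of 1200 are all assumptions chosen for the example.

```python
from collections import defaultdict


def elo_rank(votes, k=32.0, base=1200.0):
    """Aggregate pairwise votes into an Elo ranking.

    votes: iterable of (winner_id, loser_id) tuples, one per human comparison.
    k, base: illustrative defaults (assumptions), not values from the platform.
    Returns a list of (candidate_id, rating), best first.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in votes:
        # Probability that the winner was expected to win before this vote.
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        # Surprising wins move ratings more than expected ones.
        delta = k * (1.0 - expected_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    example_votes = [("a", "b"), ("a", "c"), ("b", "c")]
    for candidate, rating in elo_rank(example_votes):
        print(candidate, round(rating, 1))
```

Elo is order-dependent: the same votes in a different order give slightly different ratings, which is one reason a batch estimator such as Bradley-Terry is often used alongside it.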
Each training step samples 8 candidates per prompt across your parallel GPUs.
The group is submitted as one Flow item with min / desired / max response thresholds and a TTL (time to live).
Pairwise comparisons stream in from the global crowd; Elo / Bradley-Terry produces a ranking with confidence intervals in seconds.
The winner and the full set of preference pairs become reward-modeling or DPO targets for the next gradient step.
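To make the Bradley-Terry part of the steps above concrete, here is a minimal sketch that fits Bradley-Terry strengths from a matrix of win counts using the classic minorization-maximization (MM) update. It is an illustration under stated assumptions: the function name `bradley_terry` and the nested-list input format are hypothetical, the fitting method used by the platform is not specified in this page, and confidence intervals are left out.

```python
def bradley_terry(wins, iters=200, tol=1e-9):
    """Fit Bradley-Terry strengths with the MM algorithm.

    wins[i][j] = number of times candidate i beat candidate j (assumed layout).
    Returns a list of strengths that sums to 1; higher means preferred.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Only pairs that were actually compared contribute to the update.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i and wins[i][j] + wins[j][i] > 0
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = sum(new_p)
        new_p = [x / scale for x in new_p]
        converged = max(abs(a - b) for a, b in zip(p, new_p)) < tol
        p = new_p
        if converged:
            break
    return p


if __name__ == "__main__":
    # Three candidates, four votes per pair.
    example = [
        [0, 3, 4],
        [1, 0, 3],
        [0, 1, 0],
    ]
    print([round(x, 3) for x in bradley_terry(example)])
```

Unlike sequential Elo, this estimate does not depend on the order in which votes arrive, so it can be recomputed from the accumulated counts whenever a partial result is polled.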
Drop it into your training step.
One flow stays open for the entire run. Each step pushes a batch and polls for its ranking. Because the call is non-blocking and TTL-bounded, your training loop never waits on humans: incomplete items still return partial results.
from rapidata import RapidataClient

client = RapidataClient()

# 1. Open one ranking flow for the entire training run
flow = client.flow.create_ranking_flow(
    name="online-rlhf · image-gen",
    instruction="Which image looks better?",
)

# 2. Inside the training loop: one flow item per step
# (train_loop, policy, prompt, and optimizer come from your own training code)
for step in train_loop:
    candidates = policy.sample(prompt, n=8)  # 8 per prompt
    item = flow.create_new_flow_batch(
        datapoints=candidates,
        context=f"step {step}",
        time_to_live=300,  # seconds, bounded
    )

    # Non-blocking inspection while votes accumulate
    status = item.get_status()
    matrix = item.get_win_loss_matrix()  # pandas DataFrame
    results = item.get_results()  # rankings + scores

    # 3. Feed the win/loss matrix into your DPO / reward modeling
    optimizer.step(reward_signal=matrix)
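One way to turn a win/loss matrix into DPO targets is to keep only the pairs where annotators clearly agreed. The sketch below is a hypothetical helper, not part of the SDK: it assumes the matrix has been converted to a nested mapping where `wins[a][b]` is the number of votes for `a` over `b` (for a square pandas DataFrame laid out that way, `matrix.to_dict(orient="index")` would produce it), and the names `preference_pairs`, `min_votes`, and `min_margin` are illustrative. The actual layout of the DataFrame returned by `get_win_loss_matrix()` should be checked before relying on this.

```python
def preference_pairs(wins, min_votes=3, min_margin=0.6):
    """Extract (chosen, rejected, win_rate) triples from pairwise win counts.

    wins[a][b] = votes for candidate a over candidate b (assumed layout).
    min_votes: skip pairs with too few comparisons (e.g. partial TTL results).
    min_margin: minimum win rate for the preferred side to count as a clear win.
    """
    pairs = []
    ids = list(wins)
    for idx, a in enumerate(ids):
        for b in ids[idx + 1:]:
            a_over_b = wins[a].get(b, 0)
            b_over_a = wins[b].get(a, 0)
            total = a_over_b + b_over_a
            if total < min_votes:
                continue  # not enough signal for this pair yet
            rate = a_over_b / total
            if rate >= min_margin:
                pairs.append((a, b, rate))
            elif 1 - rate >= min_margin:
                pairs.append((b, a, 1 - rate))
            # Near-ties are dropped: they carry little preference signal.
    return pairs


if __name__ == "__main__":
    example = {
        "img_0": {"img_1": 4, "img_2": 1},
        "img_1": {"img_0": 1, "img_2": 2},
        "img_2": {"img_0": 1, "img_1": 2},
    }
    for chosen, rejected, rate in preference_pairs(example):
        print(chosen, ">", rejected, rate)
```

Filtering by vote count matters here because TTL-bounded items can return partial results; a pair decided by a single vote is a noisy training target.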