Replace the reward model with real humans

Real-time and continuous human-in-the-loop ranking

Continuous human feedback for RLHF.

Flows continuously route model outputs to human annotators and aggregate pairwise preferences into live reward signals. Collect up to 6K+ human annotations per minute to refresh or replace reward-model signals with direct human feedback. Flows are lightweight, low-latency, and TTL-bounded, so your training step never blocks.

V1 · Swap

Reward model out, humans in.

policy → flow → reward → policy

conventional rlhf

Reward Model

A neural net trained to approximate what humans would say.

swap

with rapidata flows

The humans themselves

247 humans · live

Hundreds of real evaluators score every batch, in the loop.

01 · why flows

Reward models approximate preference. Flows collect it directly.

Reward models enabled major improvements in RLHF by approximating human preference at scale, because collecting feedback from real humans has historically been too slow for continuous optimization. Flows reduce that latency enough to keep humans directly in the optimization loop.

No jobs to spin up

Persistent preference pipelines

Flows stay active across batches, allowing you to continuously stream generations into the same human feedback pipeline.

ttl-bounded

Time-bounded, not blocking

Flows return whatever human feedback is available within a configurable time window, allowing training and evaluation pipelines to continue without blocking.

any modality

Image · video · audio · text

The same flow API ranks generations across modalities. Same instruction, same win-loss matrix, same reward shape.

02 · how it works

A continuous preference pipeline in four SDK calls.

Create a flow

Define the question shown to evaluators and set your per-item response budget. Min/max thresholds bound the variance of your reward.

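from rapidata import RapidataClient

client = RapidataClient()  # authenticates with your Rapidata credentials

# Define the ranking question and the per-item response budget.
flow = client.flow.create_ranking_flow(
    name="Image Quality Ranking",
    instruction="Which image looks better?",
    max_response_threshold=200,
    min_response_threshold=50,
)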
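
To see roughly why the thresholds matter, here is a minimal sketch. It assumes every response is an independent binary vote on a single pair, which is a simplification: how the per-item budget is actually spread across pairs is not stated here. Under that assumption the standard error of an observed win rate is at most sqrt(0.25 / n). The helper name `worst_case_std_error` is illustrative and not part of the SDK.

```python
import math


def worst_case_std_error(n_responses: int) -> float:
    """Upper bound on the standard error of a win rate estimated from n votes.

    Assumption: votes are independent and binary, so the variance is
    p * (1 - p) / n, which is largest at p = 0.5.
    """
    if n_responses <= 0:
        raise ValueError("n_responses must be positive")
    return math.sqrt(0.25 / n_responses)


# Thresholds taken from the flow definition above.
print(f"min_response_threshold=50  -> std error <= {worst_case_std_error(50):.4f}")
print(f"max_response_threshold=200 -> std error <= {worst_case_std_error(200):.4f}")
```

Under these assumptions, quadrupling the budget from 50 to 200 halves the worst-case noise, from about 0.0707 to about 0.0354.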

Add a flow batch

During training, push a batch of rollouts. Optionally tag the batch with context and set a time_to_live so collection is time-bounded and never blocks the training step.

flow_item = flow.create_new_flow_batch(
    datapoints=rollouts,                  # urls, paths, or text
    context="generations from step 12k",
    time_to_live=300,                     # seconds
)

Get results

Retrieve pairwise preferences, rankings, and response statistics continuously as feedback arrives. Use get_status() to access partial results while collection is still in progress.

status  = flow_item.get_status()
results = flow_item.get_results()
matrix  = flow_item.get_win_loss_matrix()
count   = flow_item.get_response_count()
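
One way a training loop might consume these calls is to poll until enough responses have arrived or a client-side time budget runs out, then continue with whatever is available. The sketch below is an assumption-laden illustration: it assumes `get_response_count()` returns an integer, and `wait_for_responses` and `DemoItem` are hypothetical names, not part of the SDK.

```python
import time
from typing import Callable, Tuple


def wait_for_responses(
    item,
    target: int,
    budget_s: float,
    poll_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bool]:
    """Poll item.get_response_count() until target is met or budget_s elapses.

    Returns (last_count, reached_target). Never waits past the budget.
    """
    deadline = clock() + budget_s
    while True:
        count = item.get_response_count()  # assumed to return an int
        if count >= target:
            return count, True
        remaining = deadline - clock()
        if remaining <= 0:
            return count, False
        sleep(min(poll_s, remaining))


class DemoItem:
    """Hypothetical stand-in for a flow item: 40 new responses per poll."""

    def __init__(self) -> None:
        self._count = 0

    def get_response_count(self) -> int:
        self._count += 40
        return self._count


count, reached = wait_for_responses(DemoItem(), target=100, budget_s=1.0, poll_s=0.0)
print(count, reached)  # 120 True
```

With a real flow item you would pass it in place of `DemoItem()` and read the win-loss matrix once the function returns, whether or not the target was reached.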

Update flow configuration

Rewrite the instruction mid-run. Existing flow items keep their original config; only new items pick up the change.

flow.update_config(
    instruction="Which image has higher visual quality?",
)

03 · inside a running flow

What the SDK gives back.

Image Quality Ranking

flow_8f3a7c · 5 batches · 600 responses each · ttl 300s each · independent

flow live

batch-014 · step 18,200 · complete · 600/600 responses · closed
batch-015 · step 18,300 · complete · 600/600 responses · closed
batch-016 · step 18,400 · collecting · 412/600 responses · 47s until ttl
batch-017 · step 18,500 · collecting · 138/600 responses · 168s until ttl
batch-018 · step 18,600 · queued · 0/600 responses · awaiting step 18,600

Each batch is independent, with its own ttl and its own response budget. Any batch that hits its ttl returns partial results, marked incomplete.

flow_item.get_win_loss_matrix() · response_count: 312

rows and cols: gen-001 · gen-002 · gen-003 · gen-004

rows: preferred · cols: compared against · cell: # of pairwise wins
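
As a rough sketch of how such a matrix could become a reward signal, the snippet below turns pairwise win counts into per-generation win rates and then centers them. It assumes the matrix can be read as a nested list where `wins[i][j]` is the number of times item i was preferred over item j, matching the legend above. The SDK's actual return type may differ, the function names are illustrative, and the numbers are made up.

```python
from typing import List, Sequence


def win_rates(wins: Sequence[Sequence[int]], prior: float = 0.5) -> List[float]:
    """Fraction of comparisons each item won; `prior` if it was never compared."""
    n = len(wins)
    rates = []
    for i in range(n):
        won = sum(wins[i][j] for j in range(n) if j != i)   # row i: wins
        lost = sum(wins[j][i] for j in range(n) if j != i)  # column i: losses
        total = won + lost
        rates.append(won / total if total else prior)
    return rates


def centered_rewards(rates: Sequence[float]) -> List[float]:
    """Subtract the batch mean so rewards sum to roughly zero."""
    mean = sum(rates) / len(rates)
    return [r - mean for r in rates]


# Made-up counts for three generations (rows: preferred, cols: compared against).
wins = [
    [0, 8, 9],
    [2, 0, 6],
    [1, 4, 0],
]
rates = win_rates(wins)
rewards = centered_rewards(rates)
print([round(r, 2) for r in rates])    # [0.85, 0.4, 0.25]
print([round(r, 2) for r in rewards])  # [0.35, -0.1, -0.25]
```

Win rate is the simplest choice; a Bradley-Terry fit would account for which opponents each item faced, at the cost of an iterative solve.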

04 · rest of the surface

Other utilities.

When you know a training run is about to hit a high-cadence phase, you can preheat ahead of time so the first batch returns with the same latency as the hundredth. Other utilities let you retrieve flows you created earlier, list your recent flows, list all items in a flow, and delete a flow.

Tip

Call client.flow.preheat() about 5 minutes before a latency-sensitive sequence of batches.

flow_api.py

# Warm up internal resources before a hot phase
client.flow.preheat()

# Retrieve a flow you created earlier
flow = client.flow.get_flow_by_id("flow_8f3a7c...")

# List your recent flows
recent = client.flow.find_flows(amount=10)

# All flow items for a flow
all_items = flow.get_flow_items()

# Tear it down
flow.delete()

05 · use it

Enable Online RLHF loops.

Reward models approximate preference. Flows let you collect fresh human feedback at training cadence, so RLHF systems can optimize directly against the signal they were designed to model.

image

Realism · Coherence · Composition

video

Motion · Consistency · Scene stability

audio

Clarity · Naturalness · Tone

text

Alignment · Reasoning · Style