Continuous human feedback for RLHF.
Flows continuously route model outputs to human annotators and aggregate pairwise preferences into live reward signals. Collect up to 6K+ human annotations per minute to refresh or replace reward-model signals with direct human feedback. The pipeline is lightweight, low-latency, and TTL-bounded, so your training step never blocks.
Reward model out, humans in.
Reward Model
A neural net trained to approximate what humans would say.
The humans themselves
Hundreds of real evaluators score every batch, in the loop.
A trained reward model approximates what a human would say.
Hundreds of humans actually say it, every batch.
Reward models approximate preference. Flows collect it directly.
Reward models enabled major improvements in RLHF by approximating human preference at scale, because collecting feedback from real humans has historically been too slow for continuous optimization. Flows reduce that latency enough to keep humans directly in the optimization loop.
Persistent preference pipelines
Flows stay active across batches, allowing you to continuously stream generations into the same human feedback pipeline.
Time-bounded, not blocking
Flows return whatever human feedback is available within a configurable time window, allowing training and evaluation pipelines to continue without blocking.
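To make the time-bounded behaviour concrete, here is a minimal sketch of the general pattern: keep collecting until either the response budget is filled or the window closes, then return whatever has arrived. It is not the SDK's implementation. The `collect_until_deadline` helper and its `poll` callback are hypothetical names introduced for illustration, and only the `time_to_live` name is borrowed from the flow API.

```python
import time
from typing import Callable, List, Optional


def collect_until_deadline(
    poll: Callable[[], Optional[str]],  # hypothetical: returns one vote, or None if nothing new
    time_to_live: float,                # seconds the caller is willing to wait
    max_responses: int,                 # stop early once the budget is filled
    poll_interval: float = 0.01,
) -> List[str]:
    """Gather votes until the budget is filled or the time window closes."""
    deadline = time.monotonic() + time_to_live
    responses: List[str] = []
    while len(responses) < max_responses and time.monotonic() < deadline:
        vote = poll()
        if vote is None:
            time.sleep(poll_interval)  # nothing new yet; wait briefly and retry
        else:
            responses.append(vote)
    return responses  # possibly partial, never later than the deadline plus one poll


# Usage: only three votes ever arrive, so the call returns them when the window closes.
votes = iter(["A", "B", "A"])
partial = collect_until_deadline(lambda: next(votes, None), time_to_live=0.2, max_responses=5)
print(partial)
```

The caller gets a partial answer at the deadline rather than waiting for the full budget, which is what lets a training step proceed on schedule.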
Image · video · audio · text
The same flow API ranks generations across modalities. Same instruction, same win-loss matrix, same reward shape.
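One simple way to turn a win-loss matrix into per-generation rewards is sketched below. It assumes that `matrix[i][j]` counts how often generation `i` was preferred over generation `j`. That layout is an assumption for illustration, not a documented SDK format, and the helper names `win_rates` and `centered_rewards` are hypothetical.

```python
from typing import List, Sequence


def win_rates(matrix: Sequence[Sequence[int]]) -> List[float]:
    """Fraction of comparisons each item won; 0.5 if it was never compared."""
    n = len(matrix)
    rates = []
    for i in range(n):
        wins = sum(matrix[i][j] for j in range(n) if j != i)    # row i: times i beat j
        losses = sum(matrix[j][i] for j in range(n) if j != i)  # column i: times j beat i
        total = wins + losses
        rates.append(wins / total if total else 0.5)
    return rates


def centered_rewards(matrix: Sequence[Sequence[int]]) -> List[float]:
    """Win rates shifted to zero mean, a common shape for policy-gradient rewards."""
    rates = win_rates(matrix)
    mean = sum(rates) / len(rates)
    return [r - mean for r in rates]


# Assumed layout: matrix[i][j] = number of times item i was preferred over item j.
matrix = [
    [0, 8, 9],
    [2, 0, 6],
    [1, 4, 0],
]
print(win_rates(matrix))
print(centered_rewards(matrix))
```

For this matrix the three generations win 17 of 20, 8 of 20, and 5 of 20 comparisons, so the win rates are 0.85, 0.40, and 0.25. Because the computation only touches counts, it is the same for image, video, audio, or text. A Bradley-Terry fit is a common alternative when comparisons are unevenly distributed across pairs.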
A continuous preference pipeline in four SDK calls.
Create a flow
Define the question shown to evaluators and set your per-item response budget. Min/max thresholds bound the variance of your reward.
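A quick worked example of why response thresholds bound the variance: a win rate estimated from n independent votes is a Bernoulli mean, so its standard error is at most 0.5 / sqrt(n). The sketch below assumes, for simplicity, that all n responses go to a single pairwise comparison. How responses are actually spread across pairs is not specified here, and `worst_case_standard_error` is a hypothetical helper, not part of the SDK.

```python
import math


def worst_case_standard_error(num_responses: int) -> float:
    """Upper bound on the standard error of a win rate from n independent votes.

    The variance of a Bernoulli mean is p * (1 - p) / n, which peaks at p = 0.5,
    so the standard error never exceeds 0.5 / sqrt(n).
    """
    if num_responses <= 0:
        raise ValueError("num_responses must be positive")
    return 0.5 / math.sqrt(num_responses)


# Illustrative values matching a minimum of 50 and a maximum of 200 responses.
for n in (50, 200):
    print(f"{n} responses -> standard error <= {worst_case_standard_error(n):.3f}")
```

Under that assumption, a minimum of 50 responses caps the standard error at about 0.071, and a maximum of 200 caps it at about 0.035. Quadrupling the budget halves the noise.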
flow = client.flow.create_ranking_flow(
    name="Image Quality Ranking",
    instruction="Which image looks better?",
    max_response_threshold=200,
    min_response_threshold=50,
)
Add a flow batch
During training, push a batch of rollouts. Optionally tag with context and a time_to_live so the call is non-blocking.
flow_item = flow.create_new_flow_batch(
    datapoints=rollouts,  # urls, paths, or text
    context="generations from step 12k",
    time_to_live=300,  # seconds
)
Get results
Retrieve pairwise preferences, rankings, and response statistics continuously as feedback arrives. Use get_status() to access partial results on the go.
status = flow_item.get_status()
results = flow_item.get_results()
matrix = flow_item.get_win_loss_matrix()
count = flow_item.get_response_count()
Update flow configuration
Rewrite the instruction mid-run. Existing flow items keep their original config; only new items pick up the change.
flow.update_config(
    instruction="Which image has higher visual quality?",
)
What the SDK gives back.
Other utilities.
When you know a training run is about to hit a high-cadence phase, you can preheat ahead of time so the first batch returns with the same latency as the hundredth. Other utilities let you retrieve flows you created earlier, list your recent flows, and so on.
# Warm up internal resources before a hot phase
client.flow.preheat()
# Retrieve a flow you created earlier
flow = client.flow.get_flow_by_id("flow_8f3a7c...")
# List your recent flows
recent = client.flow.find_flows(amount=10)
# All flow items for a flow
all_items = flow.get_flow_items()
# Tear it down
flow.delete()
Enable Online RLHF loops.
Reward models approximate preference. Flows let you collect fresh human feedback at training cadence, so RLHF systems can optimize directly against the signal they were designed to model.