Model Rank Insights
Define a fixed prompt set, create one leaderboard per criterion, and submit outputs from your models. MRI routes pairwise comparisons to human evaluators, aggregates the results into Elo-style standings, and updates the leaderboard as new checkpoints are added.
- 01. GPT-Image 2 (OpenAI): 1223
- 02. GPT-Image 1.5 (OpenAI): 1218
- 03. Nano-Banana 2 (Google DeepMind): 1111
- 04. Seedream 4 (ByteDance): 1100
- 05. FLUX.2 [flex] (Black Forest Labs): 1090
- 06. FLUX.2 [pro] (Black Forest Labs): 1077
Compare checkpoints on the same prompt sets.
MRI compares model outputs prompt-by-prompt through human pairwise evaluations. Each evaluation criterion gets its own leaderboard, so you can track how checkpoints rank on alignment, realism, motion quality, artifacts, or overall preference.
- Active matchup selection (max info gain)
- Incremental Elo updates
- Per-criterion confidence intervals
- Historic checkpoint diff in one query
- Release_2.0
- Candidate_2.3_01
- Baseline 1
- Release_2.2
- Candidate_2.3_02
- "A serene mountain landscape"
- "A futuristic city at dusk"
- "A wise old wizard, portrait"
- "A car along a coastal road"
- …
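The page does not say how MRI derives its per-criterion confidence intervals. As one illustration of the idea, here is a minimal sketch that puts a Wilson score interval around a head-to-head win rate between two checkpoints. The function name and the example counts are hypothetical and not part of the Rapidata SDK.

```python
import math


def wilson_interval(wins: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a win rate (z=1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = wins / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Hypothetical counts: one checkpoint won 70 of 100 pairwise comparisons.
low, high = wilson_interval(wins=70, total=100)
print(f"win rate 0.70, interval [{low:.3f}, {high:.3f}]")
```

The interval narrows as more comparisons arrive, which is why a ranking between two close checkpoints needs more judgments than one between a clear winner and a clear loser.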





When automated judges fail on subtle visual differences.
MRI is commonly used to benchmark prompt following, realism, motion quality, temporal consistency, edit preservation, and artifact detection across image and video generation models.
Everything needed to run human model benchmarks
Everything below ships in the Rapidata SDK. No additional setup, no hand-rolled annotation pipelines.
Aesthetics, coherence, alignment: capture preferences that don't reduce to a single scalar in your reward model.
Active selection of comparisons maximizes information gain per response, reaching the same confidence as random pairings with about 6× fewer judgments.
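To make the idea of information-driven matchup selection concrete, here is a minimal sketch that picks the next pair by outcome entropy: the matchup whose result is hardest to predict from current ratings, discounted by how often that pair has already been compared. This is a common proxy for information gain, not a description of MRI's actual selection algorithm. All names and numbers are hypothetical.

```python
import math
from itertools import combinations


def win_probability(rating_a: float, rating_b: float) -> float:
    """Elo-style probability that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def outcome_entropy(p: float) -> float:
    """Entropy in bits of a binary outcome with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def pick_next_pair(ratings: dict[str, float], counts: dict[tuple[str, str], int]) -> tuple[str, str]:
    """Return the pair with the highest entropy per (1 + past comparisons)."""
    def score(pair: tuple[str, str]) -> float:
        a, b = pair
        return outcome_entropy(win_probability(ratings[a], ratings[b])) / (1 + counts.get(pair, 0))

    # Pairs are keyed as alphabetically sorted tuples so lookups in counts are stable.
    return max(combinations(sorted(ratings), 2), key=score)


# Hypothetical ratings for three checkpoints.
ratings = {"a": 1200.0, "b": 1190.0, "c": 1000.0}
print(pick_next_pair(ratings, counts={}))
print(pick_next_pair(ratings, counts={("a", "b"): 5}))
```

Closely rated checkpoints get compared first because those outcomes are least predictable. Once a pair has been judged many times, the discount shifts attention to other matchups.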
Submit a checkpoint and walk away. New responses arrive within minutes and the leaderboard recomputes Elo scores incrementally.
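For readers unfamiliar with incremental Elo updates, here is a minimal sketch of the textbook rule: each pairwise result moves the winner up and the loser down by the same amount, scaled by how surprising the outcome was. The K-factor, the starting rating, and the checkpoint names are assumptions for illustration. The page describes MRI's standings only as Elo-style, so its exact update rule may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    ratings: dict[str, float],
    winner: str,
    loser: str,
    k: float = 32.0,        # assumed K-factor
    initial: float = 1000.0,  # assumed starting rating for unseen models
) -> None:
    """Apply one pairwise result in place; unseen models start at `initial`."""
    r_w = ratings.setdefault(winner, initial)
    r_l = ratings.setdefault(loser, initial)
    delta = k * (1.0 - expected_score(r_w, r_l))
    ratings[winner] = r_w + delta
    ratings[loser] = r_l - delta


# Hypothetical stream of human judgments as (winner, loser) pairs.
ratings: dict[str, float] = {}
for winner, loser in [("ckpt_b", "ckpt_a"), ("ckpt_b", "ckpt_c"), ("ckpt_a", "ckpt_c")]:
    update_elo(ratings, winner, loser)

standings = sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
for name, rating in standings:
    print(f"{name}: {rating:.1f}")
```

Because each result touches only two ratings, a new checkpoint can join an existing leaderboard without replaying earlier comparisons.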
Every model you've submitted stays in the ranking, so regressions on one criterion show up the moment the run completes.
Fire-and-forget submission from a training loop. Pull results when you need them; nothing in your pipeline waits on humans.
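The fire-and-forget pattern can be sketched with the standard library alone. In the example below, `submit_checkpoint` is a hypothetical stand-in for whatever SDK call submits a checkpoint, and the step counts and file paths are made up. The training loop hands submissions to a background thread pool and never waits on them.

```python
from concurrent.futures import Future, ThreadPoolExecutor


def submit_checkpoint(name: str, media: list[str]) -> str:
    """Hypothetical stand-in for the SDK submission call; returns the name."""
    return name


def train(num_steps: int, eval_every: int, executor: ThreadPoolExecutor) -> list[Future]:
    """Run a dummy training loop, queueing a submission every `eval_every` steps."""
    pending: list[Future] = []
    for step in range(1, num_steps + 1):
        # ... one optimizer step would go here ...
        if step % eval_every == 0:
            name = f"checkpoint-step-{step}"
            media = [f"./out/{name}/p{i}.png" for i in range(1, 4)]
            # Queue the submission and keep training; nothing blocks here.
            pending.append(executor.submit(submit_checkpoint, name, media))
    return pending


with ThreadPoolExecutor(max_workers=2) as pool:
    futures = train(num_steps=10, eval_every=5, executor=pool)

# Results are collected only after training has finished.
submitted = [f.result() for f in futures]
print(submitted)
```

If the submission call is already non-blocking, the thread pool is unnecessary and the call can sit directly in the loop.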
Use Rapidata's benchmark sets or replace them with your own evaluation prompts.
Directly from the SDK.
Define a benchmark, attach criteria, submit a checkpoint. The leaderboard updates in the background. Drop it inside your training loop and let the ranking accrue.
from rapidata import RapidataClient

client = RapidataClient()

PROMPTS = [
    "A serene mountain landscape at sunset",
    "A futuristic city with flying cars",
    "A portrait of a wise old wizard",
]

# 1. Define a benchmark with your prompts
benchmark = client.mri.create_new_benchmark(
    name="text-to-image v3",
    prompts=PROMPTS,
)

# 2. Add a leaderboard: one criterion, one instruction
leaderboard = benchmark.create_leaderboard(
    name="Aesthetic preference",
    instruction="Which image looks better?",
    show_prompt=False,
)

# 3. Evaluate a checkpoint. Non-blocking.
#    One media file per prompt, in the same order as PROMPTS.
benchmark.evaluate_model(
    name="checkpoint-2026-05-08",
    media=["./out/p1.png", "./out/p2.png", "./out/p3.png"],
    prompts=PROMPTS,
)

# 4. Pull the current standings whenever you need them
standings = leaderboard.get_standings()

See which model humans actually prefer.
Send model outputs to Rapidata and get a quantified ranking on aesthetics, coherence, and alignment within the hour.
Pricing scales with throughput and task complexity. Transparent and predictable price per response.