Benchmark models with Model Rank Insights

Define a fixed prompt set, create one leaderboard per criterion, and submit outputs from your models. Model Rank Insights (MRI) routes pairwise comparisons to human evaluators, aggregates the results into Elo-style standings, and updates the leaderboard as new checkpoints are added.

  • 01 · GPT-Image 2 · 1223
  • 02 · GPT-Image 1.5 · 1218
  • 03 · Nano-Banana 2 · 1111
  • 04 · Seedream 4 · 1100
  • 05 · FLUX.2 [flex] · 1090
  • 06 · FLUX.2 [pro] · 1077
  • 100k+ responses / hour
  • Starting at $4 per 1,000 responses
  • 32M+ curated annotators worldwide
  • Global reach: filter by language, demographics, skills...
01 how it works

Compare checkpoints on the same prompt sets.

MRI compares model outputs prompt-by-prompt through human pairwise evaluations. Each evaluation criterion gets its own leaderboard, so you can track how checkpoints rank on alignment, realism, motion quality, artifacts, or overall preference.

  • Active matchup selection (max info gain)
  • Incremental Elo updates
  • Per-criterion confidence intervals
  • Historic checkpoint diff in one query
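
To make "incremental Elo updates" concrete, here is a minimal sketch of how a single pairwise judgment can move two ratings under the textbook Elo model. The K-factor of 32, the starting rating of 1000, and the function names are illustrative assumptions; this page does not specify the exact scoring MRI uses.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one pairwise judgment."""
    expected_a = expected_score(rating_a, rating_b)
    actual_a = 1.0 if a_won else 0.0
    delta = k * (actual_a - expected_a)
    # Whatever A gains, B loses, so the rating pool stays constant.
    return rating_a + delta, rating_b - delta


# Two checkpoints start level (assumed starting rating of 1000).
ratings = {"Release_2.2": 1000.0, "Baseline 1": 1000.0}

# One evaluator prefers the Release_2.2 output for some prompt.
ratings["Release_2.2"], ratings["Baseline 1"] = update_elo(
    ratings["Release_2.2"], ratings["Baseline 1"], a_won=True
)
print(ratings)
```

Because each response only touches the two models involved, standings can be refreshed one response at a time rather than recomputed from scratch.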
models (n = 5)
  • Release_2.0
  • Candidate_2.3_01
  • Baseline 1
  • Release_2.2
  • Candidate_2.3_02
prompts (benchmark · 200)
  • "A serene mountain landscape"
  • "A futuristic city at dusk"
  • "A wise old wizard, portrait"
  • "A car along a coastal road"
generated images (5 models × 200 prompts)
[Figure: mountain landscape samples from Release_2.0, Candidate_2.3_01, Baseline 1, Release_2.2, and Candidate_2.3_02]
ask · global crowd: Which image matches the description?
ask · global crowd: Which image is more coherent?
ask · global crowd: Which image do you prefer?
Alignment
  #  model              elo
  1  Release_2.2        1291
  2  Candidate_2.3_01   1237
  3  Release_2.0        1174
  4  Candidate_2.3_02   1129
  5  Baseline 1         1062
Coherence
  #  model              elo
  1  Candidate_2.3_01   1237
  2  Release_2.2        1197
  3  Candidate_2.3_02   1162
  4  Baseline 1         1100
  5  Release_2.0        1051
Preference
  #  model              elo
  1  Release_2.2        1284
  2  Candidate_2.3_02   1226
  3  Candidate_2.3_01   1181
  4  Baseline 1         1145
  5  Release_2.0        1023
live · Elo scores update on every response
02 when to use

When automated judges fail on subtle visual differences.

MRI is commonly used to benchmark prompt following, realism, motion quality, temporal consistency, edit preservation, and artifact detection across image and video generation models.

TI  Text → image: "A serene mountain landscape"
TV  Text → video: "A car driving along the coast, 4s"
II  Image → image: edit · inpaint · style transfer
IV  Image → video: animate a still · loop
03 features

Everything needed to run human model benchmarks

Everything below ships in the Rapidata SDK. No additional setup, no hand-rolled annotation pipelines.

01 Quantify soft metrics

Aesthetics, coherence, alignment: capture the preferences that don't reduce to a single scalar in your reward model.

02 Smart matchups

Active selection of comparisons maximizes information gain per response, so you need roughly 6× fewer judgments to reach the same confidence as with random pairings.
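
As a rough illustration of active matchup selection, the sketch below picks the pair whose outcome is hardest to predict from the current ratings, that is, the pair whose win probability is closest to 50%. This is a simple stand-in heuristic, not Rapidata's actual selection algorithm. A fuller information-gain criterion would also weigh how uncertain each rating is. The function names are hypothetical, and the ratings are the sample Alignment standings from this page.

```python
import math
from itertools import combinations


def win_probability(rating_a: float, rating_b: float) -> float:
    """Elo-style probability that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def outcome_entropy(p: float) -> float:
    """Shannon entropy (in bits) of a comparison with win probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def pick_next_matchup(ratings: dict) -> tuple:
    """Return the pair of models whose next comparison is most uncertain."""
    def uncertainty(pair):
        a, b = pair
        return outcome_entropy(win_probability(ratings[a], ratings[b]))

    return max(combinations(sorted(ratings), 2), key=uncertainty)


# Sample Alignment standings from this page.
alignment = {
    "Release_2.2": 1291,
    "Candidate_2.3_01": 1237,
    "Release_2.0": 1174,
    "Candidate_2.3_02": 1129,
    "Baseline 1": 1062,
}
next_pair = pick_next_matchup(alignment)
print(next_pair)
```

A comparison between two closely rated models tells you more about their order than one between the top and bottom entries, whose result is largely predictable in advance.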

03 Auto-updating rankings

Submit a checkpoint, walk away. New responses arrive within minutes and the leaderboard recomputes Elo scores incrementally.

04 Historic checkpoint tracking

Every model you've submitted stays in the ranking, so regressions on one criterion show up the moment the run completes.
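
One way to picture a per-criterion checkpoint diff is the sketch below. It uses the sample standings shown earlier on this page and compares a candidate against a reference release. The `checkpoint_diff` helper and the plain-dictionary layout are illustrative assumptions, not part of the SDK.

```python
# Sample Elo scores copied from the three example leaderboards on this page.
STANDINGS = {
    "Alignment": {
        "Release_2.2": 1291, "Candidate_2.3_01": 1237, "Release_2.0": 1174,
        "Candidate_2.3_02": 1129, "Baseline 1": 1062,
    },
    "Coherence": {
        "Candidate_2.3_01": 1237, "Release_2.2": 1197, "Candidate_2.3_02": 1162,
        "Baseline 1": 1100, "Release_2.0": 1051,
    },
    "Preference": {
        "Release_2.2": 1284, "Candidate_2.3_02": 1226, "Candidate_2.3_01": 1181,
        "Baseline 1": 1145, "Release_2.0": 1023,
    },
}


def checkpoint_diff(standings: dict, candidate: str, reference: str) -> dict:
    """Elo difference (candidate minus reference) for every criterion."""
    return {
        criterion: scores[candidate] - scores[reference]
        for criterion, scores in standings.items()
    }


diff = checkpoint_diff(STANDINGS, "Candidate_2.3_01", "Release_2.2")
regressions = [criterion for criterion, delta in diff.items() if delta < 0]
print(diff)
print(regressions)
```

In this sample data the candidate gains on coherence but trails the release on alignment and preference, which is the kind of trade-off a single aggregate score would hide.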

05 Non-blocking calls

Fire-and-forget submission from a training loop. Pull results when you need them; nothing in your pipeline waits on humans.
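
A sketch of what fire-and-forget submission from a training loop might look like. It relies only on the `evaluate_model(name=..., media=..., prompts=...)` call shown in the quickstart on this page. The `submit_checkpoint` helper, the `FakeBenchmark` stand-in, the evaluation interval, and the file paths are hypothetical and exist only so the example runs without credentials.

```python
class FakeBenchmark:
    """Hypothetical stand-in for a real benchmark object (no network calls)."""

    def __init__(self):
        self.submitted = []

    def evaluate_model(self, name, media, prompts):
        # The real call queues human evaluations; here we only record the name.
        self.submitted.append(name)


def submit_checkpoint(benchmark, step, media_paths, prompts):
    """Queue one checkpoint's outputs for evaluation and return its name."""
    if len(media_paths) != len(prompts):
        raise ValueError("expected exactly one output per prompt")
    name = f"checkpoint-step-{step}"
    benchmark.evaluate_model(name=name, media=media_paths, prompts=prompts)
    return name


PROMPTS = ["A serene mountain landscape", "A futuristic city at dusk"]
EVAL_EVERY = 1000  # assumed evaluation interval, in training steps

benchmark = FakeBenchmark()
for step in range(1, 3001):
    # ... one training step would run here ...
    if step % EVAL_EVERY == 0:
        media = [f"./out/step{step}_p{i}.png" for i in range(len(PROMPTS))]
        submit_checkpoint(benchmark, step, media, PROMPTS)

print(benchmark.submitted)
```

With the real SDK, `benchmark` would be the object returned by `client.mri.create_new_benchmark(...)`, and the loop would carry on training while responses accumulate.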

06 Bring your own prompts

Use Rapidata's benchmark sets or replace them with your own evaluation prompts.

04 how to use it

Directly from the SDK.

Define a benchmark, attach criteria, submit a checkpoint. The leaderboard updates in the background. Drop it inside your training loop and let the ranking accrue.

mri_quickstart.py
from rapidata import RapidataClient

client = RapidataClient()

PROMPTS = [
    "A serene mountain landscape at sunset",
    "A futuristic city with flying cars",
    "A portrait of a wise old wizard",
]

# 1. Define a benchmark with your prompts
benchmark = client.mri.create_new_benchmark(
    name="text-to-image v3",
    prompts=PROMPTS,
)

# 2. Add a leaderboard — one criterion, one instruction
leaderboard = benchmark.create_leaderboard(
    name="Aesthetic preference",
    instruction="Which image looks better?",
    show_prompt=False,
)

# 3. Evaluate a checkpoint. Non-blocking.
benchmark.evaluate_model(
    name="checkpoint-2026-05-08",
    media=["./out/p1.png", "./out/p2.png", "./out/p3.png"],
    prompts=PROMPTS,
)

# 4. Read the current standings whenever you need them
standings = leaderboard.get_standings()
$ pip install rapidata
get started

See which model humans actually prefer.

Send model outputs to Rapidata and get a quantified ranking on aesthetics, coherence, and alignment within the hour.

pricing: starts at $4 / 1,000 responses

Pricing scales with throughput and task complexity. Transparent and predictable price per response.
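
For a rough budget, the arithmetic can be sketched as below. It assumes the $4 per 1,000 responses floor quoted on this page, an exhaustive design in which every pair of models is compared on every prompt, and 5 responses per comparison. The last two are illustrative assumptions, and actual pricing depends on throughput and task complexity.

```python
import math


def estimate_cost(n_responses: int, price_per_1000: float = 4.0) -> float:
    """Cost in dollars at a flat price per 1,000 responses."""
    return n_responses / 1000 * price_per_1000


n_models = 5
n_prompts = 200
responses_per_comparison = 5  # assumed; choose to match your confidence target

pairs_per_prompt = math.comb(n_models, 2)  # every model against every other
n_responses = pairs_per_prompt * n_prompts * responses_per_comparison
cost = estimate_cost(n_responses)
print(n_responses, cost)
```

Under these assumptions one criterion needs 10,000 responses. Active matchup selection would typically need fewer than this exhaustive count.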

Millions of responses delivered per day.