Benchmark models with Model Rank Insights (MRI)

Define a fixed prompt set, create one leaderboard per criterion, and submit outputs from your models. MRI routes pairwise comparisons to human evaluators, aggregates the results into Elo-style standings, and updates the leaderboard as new checkpoints are added.


#   Model            Developer           Elo
01  GPT-Image 2      OpenAI              1223
02  GPT-Image 1.5    OpenAI              1218
03  Nano-Banana 2    Google DeepMind     1111
04  Seedream 4       ByteDance           1100
05  FLUX.2 [flex]    Black Forest Labs   1090
06  FLUX.2 [pro]     Black Forest Labs   1077

Elo · 95% CI ±3

100k+ responses / hour

Starting at $4 per 1,000 responses

32M+ curated annotators worldwide

Global reach: filter by language, demographics, skills...

01 how it works

Compare checkpoints on the same prompt sets.

MRI compares model outputs prompt-by-prompt through human pairwise evaluations. Each evaluation criterion gets its own leaderboard, so you can track how checkpoints rank on alignment, realism, motion quality, artifacts, or overall preference.

Active matchup selection (max info gain)
Incremental Elo updates
Per-criterion confidence intervals
Historic checkpoint diff in one query
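
As a rough illustration of what an incremental Elo update looks like, the sketch below applies the textbook Elo rule to a stream of pairwise votes. It is not MRI's implementation: the K-factor, the 400-point scale, the starting rating, and the function names are all assumptions chosen for the example.

```python
from typing import Dict, Iterable, Tuple

K_FACTOR = 32.0        # assumed step size; not taken from MRI
SCALE = 400.0          # assumed conventional Elo scale
START_RATING = 1000.0  # assumed rating for a model with no votes yet


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / SCALE))


def update(ratings: Dict[str, float], winner: str, loser: str) -> None:
    """Apply a single pairwise human judgment to the ratings in place."""
    r_w = ratings.setdefault(winner, START_RATING)
    r_l = ratings.setdefault(loser, START_RATING)
    # An expected win moves the ratings less than an upset does.
    delta = K_FACTOR * (1.0 - expected_score(r_w, r_l))
    ratings[winner] = r_w + delta
    ratings[loser] = r_l - delta


def replay(votes: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Fold a sequence of (winner, loser) votes into a ratings table."""
    ratings: Dict[str, float] = {}
    for winner, loser in votes:
        update(ratings, winner, loser)
    return ratings


# Hypothetical votes, for illustration only.
votes = [
    ("Release_2.2", "Baseline 1"),
    ("Release_2.2", "Release_2.0"),
    ("Release_2.0", "Baseline 1"),
]
ratings = replay(votes)
for name, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name:<14}{score:7.1f}")
```

Because each vote only touches two ratings, a new checkpoint can join an existing table without recomputing earlier results.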

models (n = 5)

Release_2.0
Candidate_2.3_01
Baseline 1
Release_2.2
Candidate_2.3_02

prompts (benchmark · 200)

"A serene mountain landscape"
"A futuristic city at dusk"
"A wise old wizard, portrait"
"A car along a coastal road"
…

generated images: 5 models × 200 prompts

ask · global crowd: Which image matches the description?
ask · global crowd: Which image is more coherent?
ask · global crowd: Which image do you prefer?

Alignment

#  model             elo
1  Release_2.2       1291
2  Candidate_2.3_01  1237
3  Release_2.0       1174
4  Candidate_2.3_02  1129
5  Baseline 1        1062

Coherence

#  model             elo
1  Candidate_2.3_01  1237
2  Release_2.2       1197
3  Candidate_2.3_02  1162
4  Baseline 1        1100
5  Release_2.0       1051

Preference

#  model             elo
1  Release_2.2       1284
2  Candidate_2.3_02  1226
3  Candidate_2.3_01  1181
4  Baseline 1        1145
5  Release_2.0       1023

live · Elo scores update on every response
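
To make the per-criterion comparison concrete, the sketch below diffs two checkpoints using the Elo scores from the example leaderboards above. The dictionary layout and the `diff_checkpoints` helper are hypothetical, written for this example rather than taken from the SDK, and the raw deltas ignore confidence intervals.

```python
from typing import Dict

# Elo scores copied from the example leaderboards above.
STANDINGS: Dict[str, Dict[str, int]] = {
    "Alignment": {
        "Release_2.2": 1291, "Candidate_2.3_01": 1237, "Release_2.0": 1174,
        "Candidate_2.3_02": 1129, "Baseline 1": 1062,
    },
    "Coherence": {
        "Candidate_2.3_01": 1237, "Release_2.2": 1197, "Candidate_2.3_02": 1162,
        "Baseline 1": 1100, "Release_2.0": 1051,
    },
    "Preference": {
        "Release_2.2": 1284, "Candidate_2.3_02": 1226, "Candidate_2.3_01": 1181,
        "Baseline 1": 1145, "Release_2.0": 1023,
    },
}


def diff_checkpoints(
    standings: Dict[str, Dict[str, int]], baseline: str, candidate: str
) -> Dict[str, int]:
    """Elo delta (candidate minus baseline) for every criterion."""
    return {
        criterion: scores[candidate] - scores[baseline]
        for criterion, scores in standings.items()
    }


deltas = diff_checkpoints(STANDINGS, "Release_2.2", "Candidate_2.3_01")
for criterion, delta in deltas.items():
    flag = "regression" if delta < 0 else "ok"
    print(f"{criterion:<11}{delta:+5d}  {flag}")
```

On these numbers the candidate gains on coherence but drops on alignment and preference, a trade-off a single combined score would hide.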

02 when to use

When automated judges fail on subtle visual differences.

MRI is commonly used to benchmark prompt following, realism, motion quality, temporal consistency, edit preservation, and artifact detection across image and video generation models.

Text → image: "A serene mountain landscape"

Text → video: "A car driving along the coast, 4s"

Image → image: edit · inpaint · style transfer

Image → video: animate a still · loop

03 features

Everything needed to run human model benchmarks

Everything below ships in the Rapidata SDK. No additional setup, no hand-rolled annotation pipelines.

01 Quantify soft metrics

Aesthetics, coherence, alignment: capture preferences that don't reduce to a single scalar in your reward model.

02 Smart matchups

Active selection of comparisons maximizes information gain per response, reaching the same confidence as random pairings with ~6× fewer judgments.

03 Auto-updating rankings

Submit a checkpoint and walk away. New responses arrive within minutes, and the leaderboard recomputes Elo scores incrementally.

04 Historic checkpoint tracking

Every model you've submitted stays in the ranking, so a regression on a single criterion shows up the moment the run completes.

05 Non-blocking calls

Fire-and-forget submission from a training loop. Pull results when you need them; nothing in your pipeline waits on humans.

06 Bring your own prompts

Use Rapidata's benchmark sets or replace them with your own evaluation prompts.
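
For intuition on the "Smart matchups" feature, the sketch below shows one simple way to pick an informative comparison: choose the pair whose predicted outcome is closest to a coin flip, measured by binary entropy. It is a simplified stand-in, not MRI's selection algorithm. The 400-point Elo scale and all function names are assumptions, and the ratings are the Alignment scores from the example leaderboard.

```python
import math
from itertools import combinations
from typing import Dict, Tuple

SCALE = 400.0  # assumed conventional Elo scale


def win_probability(r_a: float, r_b: float) -> float:
    """Predicted probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / SCALE))


def outcome_entropy(p: float) -> float:
    """Binary entropy in bits; highest when the outcome is a coin flip."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def pick_matchup(ratings: Dict[str, float]) -> Tuple[str, str]:
    """Return the pair whose predicted outcome is most uncertain."""
    return max(
        combinations(sorted(ratings), 2),
        key=lambda pair: outcome_entropy(
            win_probability(ratings[pair[0]], ratings[pair[1]])
        ),
    )


# Alignment scores from the example leaderboard.
alignment = {
    "Release_2.2": 1291.0,
    "Candidate_2.3_01": 1237.0,
    "Release_2.0": 1174.0,
    "Candidate_2.3_02": 1129.0,
    "Baseline 1": 1062.0,
}
print(pick_matchup(alignment))
```

A fuller scheme would also weigh how uncertain each rating still is, so that rarely compared models are sampled more often. This heuristic looks only at the rating gap.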

04 how to use it

Directly from the SDK.

Define a benchmark, attach criteria, submit a checkpoint. The leaderboard updates in the background. Drop it inside your training loop and let the ranking accrue.

mri_quickstart.py

from rapidata import RapidataClient

client = RapidataClient()

PROMPTS = [
    "A serene mountain landscape at sunset",
    "A futuristic city with flying cars",
    "A portrait of a wise old wizard",
]

# 1. Define a benchmark with your prompts
benchmark = client.mri.create_new_benchmark(
    name="text-to-image v3",
    prompts=PROMPTS,
)

# 2. Add a leaderboard — one criterion, one instruction
leaderboard = benchmark.create_leaderboard(
    name="Aesthetic preference",
    instruction="Which image looks better?",
    show_prompt=False,
)

# 3. Evaluate a checkpoint. Non-blocking.
benchmark.evaluate_model(
    name="checkpoint-2026-05-08",
    media=["./out/p1.png", "./out/p2.png", "./out/p3.png"],
    prompts=PROMPTS,
)

# 4. Pull the current standings whenever you need them
standings = leaderboard.get_standings()
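
One way to drop the submission step into a training loop is a small wrapper like the sketch below. The only SDK call it relies on is `evaluate_model` with the keyword arguments shown in the quickstart. The `submit_checkpoint` helper, the checkpoint naming scheme, and the one-output-per-prompt check are assumptions for illustration, and the recording stub exists only so the sketch runs without credentials.

```python
from typing import Any, List, Sequence, Tuple


def submit_checkpoint(
    benchmark: Any, step: int, media: Sequence[str], prompts: Sequence[str]
) -> str:
    """Submit one checkpoint's outputs and return the name it was filed under."""
    # Assumption: outputs are matched to prompts one-to-one, in order.
    if len(media) != len(prompts):
        raise ValueError("expected exactly one output per prompt")
    name = f"checkpoint-step-{step:06d}"  # hypothetical naming scheme
    benchmark.evaluate_model(name=name, media=list(media), prompts=list(prompts))
    return name


class RecordingBenchmark:
    """Offline stand-in for the SDK benchmark object; it only records calls."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, List[str], List[str]]] = []

    def evaluate_model(self, name: str, media: List[str], prompts: List[str]) -> None:
        self.calls.append((name, media, prompts))


PROMPTS = [
    "A serene mountain landscape at sunset",
    "A futuristic city with flying cars",
    "A portrait of a wise old wizard",
]

stub = RecordingBenchmark()
for step in (1000, 2000):
    # In a real loop, these paths would point at images rendered from the checkpoint.
    outputs = [f"./out/step{step}_p{i}.png" for i in range(1, len(PROMPTS) + 1)]
    submit_checkpoint(stub, step, outputs, PROMPTS)

print([name for name, _, _ in stub.calls])
```

Passing the real `benchmark` object from the quickstart in place of the stub would submit each checkpoint as it is produced.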

$ pip install rapidata

get started

See which model humans actually prefer.

Send model outputs to Rapidata and get a quantified ranking on aesthetics, coherence, and alignment within the hour.


pricing: starts at $4 / 1,000 responses

Pricing scales with throughput and task complexity. The price per response is transparent and predictable.

Millions of responses delivered per day.