Open Model Index

Benchmarks. GGUF. VRAM. What's it good at.


Data Sources

Every score is cited. Click a source name in any benchmark table to jump here.

SWE-bench

Software engineering benchmark from Princeton NLP. Measures the percentage of real GitHub issues a model can resolve. Scores depend on the agent scaffold, so we take the best score per model.

Verified
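
Because SWE-bench results vary with the agent scaffold around the model, the index keeps one number per model: the best reported score. A minimal sketch of that reduction; the record layout and names below are illustrative, not the actual pipeline schema.

# Collapse agent-dependent SWE-bench submissions to one score per model.
# Records and field names here are hypothetical.
submissions = [
    {"model": "example-model-70b", "agent": "agent-a", "resolved_pct": 38.4},
    {"model": "example-model-70b", "agent": "agent-b", "resolved_pct": 41.2},
    {"model": "other-model-32b", "agent": "agent-a", "resolved_pct": 29.0},
]

best = {}
for sub in submissions:
    prev = best.get(sub["model"])
    if prev is None or sub["resolved_pct"] > prev["resolved_pct"]:
        best[sub["model"]] = sub  # keep the highest-scoring scaffold

for model, sub in best.items():
    print(f"{model}: {sub['resolved_pct']}% (via {sub['agent']})")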

Open LLM Leaderboard

HuggingFace evaluation harness. Provides IFEval, GPQA Diamond, and MMLU-PRO scores. Note: evaluates without chain-of-thought, so reasoning models may score lower than expected.

Verified

LiveBench

Contamination-free benchmarks from Abacus.AI. Questions are refreshed monthly so models can't memorize answers. Provides coding and reasoning scores.

Verified

BFCL

Berkeley Function Calling Leaderboard. Tests tool use and function calling ability. We use the overall accuracy, after stripping the (FC)/(Prompt) mode suffixes from entry names so each model appears once.

Verified
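
BFCL lists the same model once per evaluation mode, with a "(FC)" or "(Prompt)" suffix on the entry name. A hedged sketch of the normalization step; the exact suffix set and matching rules in the real pipeline may differ.

import re

# Strip BFCL mode suffixes so both rows collapse to one model name.
# The suffix list is an assumption based on the public leaderboard.
SUFFIX_RE = re.compile(r"\s*\((FC|Prompt)\)\s*$", re.IGNORECASE)

def normalize_bfcl_name(name: str) -> str:
    return SUFFIX_RE.sub("", name).strip()

assert normalize_bfcl_name("Example-Model-70B (FC)") == "Example-Model-70B"
assert normalize_bfcl_name("Example-Model-70B (Prompt)") == "Example-Model-70B"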

llm-stats.com

TAU-bench Retail and Airline scores. Measures agent ability in realistic customer service scenarios.

Verified

Chatbot Arena

Human preference Elo ratings from blind A/B comparisons. Independent of static benchmarks, so it captures real user satisfaction. Data via a community API mirror.

Verified
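
For intuition, ratings of this kind come from pairwise comparisons via the Elo model. The classic online update below is only an illustration; Arena's published rankings are computed by fitting a statistical model over all battles, not by this incremental rule.

def elo_expected(r_a: float, r_b: float) -> float:
    # Probability that A beats B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    # Shift both ratings toward the observed outcome.
    delta = k * ((1.0 if a_won else 0.0) - elo_expected(r_a, r_b))
    return r_a + delta, r_b - delta

# A 100-point gap implies roughly a 64% expected win rate.
print(round(elo_expected(1100, 1000), 2))  # 0.64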

Artificial Analysis

Quality index: composite intelligence score aggregating 10 benchmarks (agents, coding, general, scientific reasoning). Attribution required per API terms.

Verified
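
The exact weighting behind the quality index is Artificial Analysis' own; the equal-weight composite below is only meant to show the shape of such an index, with made-up benchmark names and scores.

from statistics import mean

def quality_index(scores: dict[str, float]) -> float:
    # Illustrative equal-weight mean over scores normalized to 0-100.
    return mean(scores.values())

print(quality_index({"coding": 62.0, "agents": 48.5, "science": 55.0}))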

HuggingFace Model Cards

Benchmark scores reported by the model creator in the HF README. Parsed from HTML tables, markdown tables, or inline text. These are self-reported — not independently verified.

Reported
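
Markdown tables are the most common case. A simplified sketch of pulling one named score out of a README table row; the real parser also handles HTML tables and inline text, and its heuristics are more involved.

import re

README_SNIPPET = """
| Benchmark | Score |
|-----------|-------|
| MMLU-PRO  | 63.2  |
| IFEval    | 84.1  |
"""

def score_from_markdown_table(text: str, benchmark: str) -> float | None:
    # Scan rows of the form "| <benchmark> | <number> |".
    pattern = re.compile(
        r"^\|\s*" + re.escape(benchmark) + r"\s*\|\s*([\d.]+)\s*\|",
        re.MULTILINE | re.IGNORECASE,
    )
    m = pattern.search(text)
    return float(m.group(1)) if m else None

print(score_from_markdown_table(README_SNIPPET, "MMLU-PRO"))  # 63.2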

Model Card (Comparison)

Scores pulled from comparison tables on other models' cards. Higher risk of misattribution — the parser checks model names but unusual layouts may need manual review.

Reported
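
The misattribution guard, in sketch form: before accepting a score from a comparison column, check that the column header actually names the target model. The loose normalized-containment heuristic below is an assumption about how that check might work, not the pipeline's exact rule.

def header_matches_model(header: str, model_id: str) -> bool:
    # Compare lowercase, separator-free forms (hypothetical heuristic).
    def norm(s: str) -> str:
        return "".join(ch for ch in s.lower() if ch.isalnum())
    return norm(model_id.split("/")[-1]) in norm(header)

print(header_matches_model("Example-Model 7B Instruct", "org/Example-Model-7B-Instruct"))  # True
print(header_matches_model("Some Other Model", "org/Example-Model-7B-Instruct"))           # False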

HuggingFace API

Model metadata: parameter counts, license, architecture details, GGUF quantization files and sizes. Used for VRAM estimation.

Metadata
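
VRAM for a GGUF file is roughly the file size plus KV cache plus runtime overhead. A back-of-the-envelope sketch under simplified assumptions (full GPU offload, fp16 KV cache); the constants are illustrative, not the index's exact formula.

def estimate_vram_gb(
    gguf_file_gb: float,       # quantized weights, from the HF API file listing
    n_layers: int,             # from architecture metadata
    kv_dim: int,               # n_kv_heads * head_dim
    context: int = 8192,       # tokens of context to budget for
    overhead_gb: float = 1.0,  # runtime buffers; a rough guess
) -> float:
    # fp16 KV cache: 2 bytes * 2 (K and V) per layer, per token, per kv dim.
    kv_gb = 2 * 2 * n_layers * kv_dim * context / 1024**3
    return gguf_file_gb + kv_gb + overhead_gb

# Hypothetical 4-bit 8B model: 32 layers, 8 KV heads * 128 head dim.
print(round(estimate_vram_gb(4.9, 32, 8 * 128), 1))  # ~6.9 GB at 8k context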

Confidence Tiers

Verified — confirmed by an independent leaderboard or evaluation harness

Reported — from the model creator's own benchmarks

Missing — no data available from any source

The pipeline never overwrites higher-confidence data with lower.
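
In sketch form, with tier names mirroring the list above and a hypothetical record shape:

# Higher rank = higher confidence; a merge never downgrades.
TIER_RANK = {"missing": 0, "reported": 1, "verified": 2}

def merge_score(existing: dict | None, incoming: dict) -> dict:
    # Records look like {"value": 41.2, "tier": "verified", "source": "..."}.
    if existing is None:
        return incoming
    if TIER_RANK[incoming["tier"]] >= TIER_RANK[existing["tier"]]:
        return incoming  # same or higher tier may refresh the value
    return existing      # lower tier never replaces higher

verified = {"value": 41.2, "tier": "verified", "source": "SWE-bench"}
reported = {"value": 44.0, "tier": "reported", "source": "model card"}
print(merge_score(verified, reported))  # keeps the verified score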