LLM Comparison
Benchmarks: MMLU, HumanEval, MATH, GSM8K, IFEval, BFCL · April 2026
Benchmark Results
| Model | MMLU | HumanEval | MATH | GSM8K | IFEval | BFCL | Input | Output |
|---|---|---|---|---|---|---|---|---|
Reference models (highlighted) serve as quality baselines. Values marked "~" are approximations drawn from public sources.
Performance Visualization
Cost Analysis
| Model | MMLU $/1% | HumanEval $/1% | MATH $/1% | Verdict |
|---|---|---|---|---|
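
The "$/1%" columns read as dollars per percentage point of benchmark score. Below is a minimal Python sketch of that arithmetic, assuming the metric is a model's price divided by its score. The function names, the blend of input and output price, and the sample numbers are illustrative assumptions, not values or definitions taken from this page.

```python
def blended_price(input_price: float, output_price: float,
                  output_share: float = 0.5) -> float:
    """Combine input and output prices into one figure.

    output_share is the assumed fraction of tokens that are output
    tokens (hypothetical default of 0.5; pick one matching your workload).
    """
    if not 0.0 <= output_share <= 1.0:
        raise ValueError("output_share must be between 0 and 1")
    return (1.0 - output_share) * input_price + output_share * output_price


def cost_per_point(price_usd: float, score_pct: float) -> float:
    """Dollars paid per percentage point of benchmark score."""
    if score_pct <= 0:
        raise ValueError("score_pct must be positive")
    return price_usd / score_pct


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not real model data.
    price = blended_price(input_price=1.0, output_price=5.0)
    print(f"blended price: {price:.2f}")
    print(f"$/1%: {cost_per_point(price, score_pct=75.0):.4f}")
```

A lower value means each point of score costs less, so two models with similar scores can be ranked on price alone, while a cheaper model with a much lower score may still come out worse.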
Key Insights
Conclusions
Related Project
A lightweight CLI tool that switches Claude Code's AI backend between free cloud models served through Ollama and DeepSeek-V4 served through the DeepSeek API. It is useful for toggling between free and paid providers.
Checks dependencies
Auto-detects providers
One-command switch
Version validation
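
The dependency check and provider auto-detection listed above could look roughly like the following Python sketch. It is not the tool's actual code: the binary names `claude` and `ollama`, the environment variable `DEEPSEEK_API_KEY`, and both function names are assumptions made for illustration.

```python
import os
import shutil
from typing import Callable, Iterable, List, Mapping, Optional

Which = Callable[[str], Optional[str]]


def missing_dependencies(required: Iterable[str],
                         which: Which = shutil.which) -> List[str]:
    """Return the required executables that are not found on PATH."""
    return [name for name in required if which(name) is None]


def detect_providers(env: Mapping[str, str] = os.environ,
                     which: Which = shutil.which) -> List[str]:
    """Guess which backends are usable on this machine.

    Assumptions (hypothetical, not confirmed by the project):
    - Ollama counts as available when an `ollama` binary is on PATH.
    - DeepSeek counts as available when DEEPSEEK_API_KEY is set.
    """
    providers = []
    if which("ollama") is not None:
        providers.append("ollama")
    if env.get("DEEPSEEK_API_KEY"):
        providers.append("deepseek")
    return providers


if __name__ == "__main__":
    # `claude` is assumed here to be the Claude Code executable name.
    missing = missing_dependencies(["claude"])
    if missing:
        print("missing dependencies:", ", ".join(missing))
    print("detected providers:", detect_providers() or "none")
```

Passing `env` and `which` as parameters keeps the detection logic testable without touching the real environment.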