LLM Models Comparison

Benchmarks: MMLU, HumanEval, MATH, GSM8K, IFEval, BFCL · April 2026

Benchmark Results

Model | MMLU | HumanEval | MATH | GSM8K | IFEval | BFCL | Input | Output

Reference models (highlighted) serve as quality baselines. Values marked "~" are approximations taken from public sources.

Performance Visualization

Cost Analysis

Model | MMLU $/1% | HumanEval $/1% | MATH $/1% | Verdict
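
The "$/1%" columns appear to express price per benchmark percentage point. The page does not state the exact formula, so the following is only a minimal sketch under that assumption: it divides a single price figure by the benchmark score. The function name `cost_per_point`, the choice of one price per model, and the two sample models are hypothetical placeholders, not values from the tables here.

```python
def cost_per_point(price_usd: float, score_pct: float) -> float:
    """Dollars paid per benchmark percentage point (assumed meaning of "$/1%")."""
    if score_pct <= 0:
        raise ValueError("score_pct must be positive")
    return price_usd / score_pct


# Hypothetical placeholder figures for illustration only.
# "price" stands for whatever price figure is being compared (assumption).
models = {
    "model-a": {"price": 2.00, "mmlu": 80.0},
    "model-b": {"price": 0.50, "mmlu": 50.0},
}

# Rank models from cheapest to most expensive per MMLU point.
ranked = sorted(
    models,
    key=lambda name: cost_per_point(models[name]["price"], models[name]["mmlu"]),
)

for name in ranked:
    ratio = cost_per_point(models[name]["price"], models[name]["mmlu"])
    print(f"{name}: ${ratio:.4f} per MMLU point")
```

A lower ratio means less money spent per point of score. Note that this ratio can favor a cheap model with a low absolute score, which is presumably why the table pairs it with a separate verdict column.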

Key Insights

Conclusions

Related Project

A lightweight CLI tool that switches Claude Code's AI backend between free cloud models served via Ollama and DeepSeek-V4 served via the DeepSeek API. It is intended for toggling between free and paid providers.

Checks dependencies · Auto-detects providers · One-command switch · Version validation

Sources