🦞

PinchBench

Claw-some AI Agent Testing

PinchBench v2

OpenClaw Leaderboard

51 models · 147 tasks · 579 runs

The best models for your OpenClaw agent.

Quick Picks

Best AI models for common use cases

anthropic/claude-opus-4.8-fast

Average Score: 93.5% overall · $159.60

Highest average across benchmark runs.

Best Open-Weights

nvidia/nemotron-3-ultra-550b-a55b

Average Score: 89.9% overall · FREE

Highest open-weights average across benchmark runs.

inception/mercury-2

Best Time: 100.0% overall · FREE

Lowest observed complete benchmark runtime.

meta-llama/llama-4-scout

Best Cost: 3.2% overall · $0.243

Lowest observed non-zero benchmark run cost.

google/gemma-4-26b-a4b-it

Value Score: 77.1% overall · $0.445

Best success percentage per dollar.
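
The page does not spell out how the value score is computed. As a minimal sketch, assuming it is simply the overall score (in percent) divided by the run cost in dollars, the ranking among the paid picks above could be reproduced like this. The function name `success_per_dollar` and the exclusion of FREE runs are illustrative assumptions, not PinchBench's actual method.

```python
# Hypothetical helper: success percentage per dollar.
# Assumption: value = overall score (%) / run cost (USD).
def success_per_dollar(score_pct: float, cost_usd: float) -> float:
    if cost_usd <= 0:
        # FREE runs have no meaningful per-dollar ratio under this
        # definition; how the leaderboard treats them is not stated.
        raise ValueError("cost must be positive")
    return score_pct / cost_usd


# (overall score %, cost in USD) taken from the Quick Picks above.
picks = {
    "google/gemma-4-26b-a4b-it": (77.1, 0.445),
    "meta-llama/llama-4-scout": (3.2, 0.243),
    "anthropic/claude-opus-4.8-fast": (93.5, 159.60),
}

# Rank models from best to worst value under the assumed formula.
ranked = sorted(
    picks,
    key=lambda model: success_per_dollar(*picks[model]),
    reverse=True,
)

for model in ranked:
    score, cost = picks[model]
    print(f"{model}: {success_per_dollar(score, cost):.2f} % per $")
```

Under this assumption the cheapest run is not automatically the best value: a very low score divided by a very low cost can still lose to a moderately priced model with a high score.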

🦀

Success rate by model

Percentage of tasks completed successfully across standardized OpenClaw agent tests

Scores are graded via automated checks and an LLM judge.

Hosting and inference cost for PinchBench sponsored by Kilo, so we totally hope you try KiloClaw so we can keep the lights on around here.

Model	Badges	Best %	Avg %
🦞 `anthropic/claude-opus-4.6`	Productivity	100.0%	69.9%
🦞 `inception/mercury-2`		100.0%	39.6%
🦐 `anthropic/claude-opus-4.8-fast`	Data Analysis, Research Knowledge, Skills	94.5%	93.5%
`qwen/qwen3.7-max`		93.4%	92.5%
`x-ai/grok-build-0.1`		92.1%	88.9%
`xiaomi/mimo-v2.5`	Creative	91.9%	89.7%
`anthropic/claude-opus-4.8`	Log Analysis	91.8%	90.5%
`anthropic/claude-opus-4.7`		91.6%	76.0%
`deepseek/deepseek-v4-flash`		91.5%	81.7%
`nvidia/nemotron-3-ultra-550b-a55b`		90.6%	89.9%

All tasks and grading criteria are open source.