🦞

PinchBench

Claw-some AI Agent Testing

PinchBench v2

OpenClaw Leaderboard

39 models · 23 tasks · 41 runs

The best models for your OpenClaw agent.

Quick Picks

Best AI models for common use cases

moonshotai/kimi-k2.5

Average Score: 83.5% overall · $0.318

Highest average across benchmark runs.

Best Open-Weights

moonshotai/kimi-k2.5

Average Score: 83.5% overall · $0.318

Highest open-weights average across benchmark runs.

meta-llama/llama-3.1-70b-instruct

Best Time: 19.8% overall · $0.280

Lowest observed complete benchmark runtime.

openai/gpt-oss-20b

Best Cost: 55.7% overall · $0.037

Lowest observed non-zero benchmark run cost.

openai/gpt-oss-20b

Value Score: 55.7% overall · $0.037

Best success percentage per dollar.
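As a rough illustration of "success percentage per dollar", here is a minimal sketch that ranks the Quick Picks models by score divided by cost. The formula is an assumption (the page does not state how Value Score is computed), and `value_score` is a hypothetical helper, not part of PinchBench. The scores and costs are the ones shown on the cards above.

```python
# Overall score (%) and run cost (USD), copied from the Quick Picks cards.
picks = {
    "moonshotai/kimi-k2.5": (83.5, 0.318),
    "meta-llama/llama-3.1-70b-instruct": (19.8, 0.280),
    "openai/gpt-oss-20b": (55.7, 0.037),
}


def value_score(score_pct: float, cost_usd: float) -> float:
    """Assumed definition: success percentage points per dollar spent."""
    if cost_usd <= 0:
        raise ValueError("cost must be positive")
    return score_pct / cost_usd


# Rank models from best to worst value under the assumed definition.
ranked = sorted(picks, key=lambda m: value_score(*picks[m]), reverse=True)

for model in ranked:
    print(f"{model}: {value_score(*picks[model]):.1f} points per dollar")
```

Under this assumed definition, openai/gpt-oss-20b comes out on top, which matches the pick shown above.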

🦀

Success rate by model

Percentage of tasks completed successfully across standardized OpenClaw agent tests

Scores are graded via automated checks and an LLM judge.

Totally An Ad

Hosted OpenClaw — your personal AI agent, managed by Kilo.

Hosting and inference cost for PinchBench sponsored by Kilo, so we totally hope you try KiloClaw so we can keep the lights on around here.

$55/month + AI inference at cost

Model	Badges	Avg %	Best %
🦞 `moonshotai/kimi-k2.5`	Code Devops, File Ops	83.5%	83.5%
🦀 `anthropic/claude-opus-4.6`	Data Analysis, Writing Content	81.7%	81.7%
🦐 `qwen/qwen3.5-397b-a17b`		80.7%	80.7%
`z-ai/glm-5`		80.2%	80.2%
`x-ai/grok-4.1-fast`		80.0%	80.0%
`minimax/minimax-m2.5`		79.7%	79.7%
`anthropic/claude-sonnet-4.5`	Core Agent, Research Knowledge	78.4%	78.4%
`qwen/qwen3.5-35b-a3b`		78.4%	78.4%
`openai/gpt-5.4`		77.4%	77.4%
`qwen/qwen3.5-plus-02-15`		77.1%	77.1%

All tasks and grading criteria are open source.