🦞

PinchBench

Claw-some AI Agent Testing

PinchBench v2

OpenClaw Leaderboard


43 models · 147 tasks · 475 runs


The best models for your OpenClaw agent.

Quick Picks

Best AI models for common use cases


anthropic/claude-opus-4.7

Score: 91.6% overall · $38.04

Highest verified success rate across the benchmark.

inception/mercury-2

Best Time: 28.4% overall · $0.466

Lowest observed complete benchmark runtime.

meta-llama/llama-4-scout

Best Cost: 3.2% overall · $0.243

Lowest observed non-zero benchmark run cost.

google/gemma-4-26b-a4b-it

Value Score: 77.1% overall · $0.445

Best success percentage per dollar.
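As a rough check on the Value Score pick, the sketch below divides each Quick Pick's overall success percentage by its run cost. The `success_per_dollar` helper and the plain ratio are assumptions for illustration; the page does not state the exact formula PinchBench uses.

```python
# Quick Picks figures as listed above: (overall success %, run cost in USD).
QUICK_PICKS = {
    "anthropic/claude-opus-4.7": (91.6, 38.04),
    "inception/mercury-2": (28.4, 0.466),
    "meta-llama/llama-4-scout": (3.2, 0.243),
    "google/gemma-4-26b-a4b-it": (77.1, 0.445),
}


def success_per_dollar(success_pct: float, cost_usd: float) -> float:
    """Hypothetical value metric: success percentage points per dollar."""
    if cost_usd <= 0:
        raise ValueError("cost must be positive")
    return success_pct / cost_usd


def rank_by_value(picks):
    """Return (model, ratio) pairs sorted from best to worst value."""
    scored = [(model, success_per_dollar(s, c)) for model, (s, c) in picks.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for model, ratio in rank_by_value(QUICK_PICKS):
        print(f"{model}: {ratio:.1f} percentage points per dollar")
```

Under this assumed ratio, google/gemma-4-26b-a4b-it ranks first among the four picks and anthropic/claude-opus-4.7 ranks last, which is consistent with the Value Score pick shown here.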

🦀

Success rate by model

Percentage of tasks completed successfully across standardized OpenClaw agent tests

Scores are graded by automated checks and an LLM judge.


Hosting and inference costs for PinchBench are sponsored by Kilo.

Model	Badges	Best %	Avg %
🦞 `anthropic/claude-opus-4.7`	Log Analysis, Productivity, Skills	91.6%	74.9%
🦀 `xiaomi/mimo-v2.5`	Creative, Writing Content	91.4%	89.2%
🦐 `anthropic/claude-haiku-4.5`	Research Knowledge	90.4%	66.4%
`xiaomi/mimo-v2.5-pro`	Code Devops	89.5%	87.7%
`deepseek/deepseek-v4-flash`	Meeting Analysis	89.5%	80.1%
`openai/gpt-5.5`	Core Agent	89.0%	75.5%
`anthropic/claude-opus-4.6`		88.9%	71.7%
`openai/gpt-5.4`		88.4%	76.4%
`z-ai/glm-5-turbo`	Data Analysis	88.2%	71.0%
`z-ai/glm-5v-turbo`		86.6%	66.4%

All tasks and grading criteria are open source.
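Best % and Avg % can order models differently. The sketch below re-sorts the ten rows of the table above by Avg % and computes each model's best-minus-average gap. Reading that gap as a rough sign of run-to-run variability is an assumption, not something the leaderboard states, and the `ROWS` list and helper names are illustrative.

```python
# (model, best %, avg %) copied from the leaderboard table above.
ROWS = [
    ("anthropic/claude-opus-4.7", 91.6, 74.9),
    ("xiaomi/mimo-v2.5", 91.4, 89.2),
    ("anthropic/claude-haiku-4.5", 90.4, 66.4),
    ("xiaomi/mimo-v2.5-pro", 89.5, 87.7),
    ("deepseek/deepseek-v4-flash", 89.5, 80.1),
    ("openai/gpt-5.5", 89.0, 75.5),
    ("anthropic/claude-opus-4.6", 88.9, 71.7),
    ("openai/gpt-5.4", 88.4, 76.4),
    ("z-ai/glm-5-turbo", 88.2, 71.0),
    ("z-ai/glm-5v-turbo", 86.6, 66.4),
]


def by_average(rows):
    """Sort rows from highest to lowest Avg %."""
    return sorted(rows, key=lambda row: row[2], reverse=True)


def best_avg_gap(row):
    """Difference between a model's best run and its average, in points."""
    _, best, avg = row
    return round(best - avg, 1)


if __name__ == "__main__":
    for model, best, avg in by_average(ROWS):
        gap = best_avg_gap((model, best, avg))
        print(f"{model}: avg {avg}%, best {best}%, gap {gap}")
```

On these figures xiaomi/mimo-v2.5 leads by average, while anthropic/claude-opus-4.7, first by Best %, falls to sixth.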