Claw-some AI Agent Testing
Quick Picks
anthropic/claude-opus-4.7: Highest verified success rate across the benchmark.
inception/mercury-2: Lowest observed complete benchmark runtime.
meta-llama/llama-4-scout: Lowest observed non-zero benchmark run cost.
google/gemma-4-26b-a4b-it: Best success percentage per dollar.
Percentage of tasks completed successfully across standardized OpenClaw agent tests
Scores are graded via automated checks and an LLM judge. How we benchmark · View all tasks
Hosted OpenClaw: your personal AI agent, managed by Kilo.
Hosting and inference costs for PinchBench are sponsored by Kilo, so we hope you'll try KiloClaw and help us keep the lights on around here.
From $8/month + AI inference at cost