Claw-some AI Agent Testing
Best For
Rank models by useful benchmark quality at the lowest observed run cost, highlighting inexpensive and high-value OpenClaw agent options.
Quick Picks
Lowest observed complete benchmark runtime.
google/gemma-4-26b-a4b-it via google
stepfun/step-3.5-flash via stepfun
openai/gpt-oss-120b via openai
Budget recommendations use Value Score, defined as success percentage divided by best observed cost per run. Models without usable cost data are excluded.
Side-by-side metrics for the strongest recommendations on this page.
| Rank | Model | Overall | Use-Case Score | Cost | Avg Time |
|---|---|---|---|---|---|
| #1 | google/gemma-4-26b-a4b-it (via google) | 77.1% | 173.1 value | $0.445 | 298.2m |
| #2 | stepfun/step-3.5-flash (via stepfun) | 84.7% | 170.8 value | $0.496 | 242.8m |
| #3 | openai/gpt-oss-120b (via openai) | 47.4% | 158.9 value | $0.299 | 180.7m |
| #4 | mistralai/mistral-large-2512 (via mistralai) | 72.9% | 152.4 value | $0.479 | 265.7m |
| #5 | openai/gpt-oss-20b (via openai) | 41.8% | 132.8 value | $0.315 | 111.1m |
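
The Value Score rule stated above (success percentage divided by best observed cost per run, with models lacking usable cost data excluded) can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual implementation: the names `ModelResult`, `value_score`, and `rank_by_value` are hypothetical, the Cost column of the table is assumed to stand in for "best observed cost per run", and the `example/no-cost-model` row is invented purely to show the exclusion rule.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ModelResult:
    """One leaderboard row (hypothetical structure for illustration)."""
    model: str
    success_pct: float               # overall success percentage, e.g. 77.1
    best_cost_usd: Optional[float]   # assumed best observed cost per run; None if unknown


def value_score(result: ModelResult) -> Optional[float]:
    """Success percentage divided by cost per run; None when cost is unusable."""
    if result.best_cost_usd is None or result.best_cost_usd <= 0:
        return None
    return result.success_pct / result.best_cost_usd


def rank_by_value(results: List[ModelResult]) -> List[Tuple[ModelResult, float]]:
    """Drop models without usable cost data, then sort by Value Score, highest first."""
    scored = [(r, value_score(r)) for r in results]
    usable = [(r, s) for r, s in scored if s is not None]
    return sorted(usable, key=lambda pair: pair[1], reverse=True)


# Rows copied from the table above; the last row is a made-up model with no cost data.
rows = [
    ModelResult("openai/gpt-oss-20b", 41.8, 0.315),
    ModelResult("stepfun/step-3.5-flash", 84.7, 0.496),
    ModelResult("google/gemma-4-26b-a4b-it", 77.1, 0.445),
    ModelResult("mistralai/mistral-large-2512", 72.9, 0.479),
    ModelResult("openai/gpt-oss-120b", 47.4, 0.299),
    ModelResult("example/no-cost-model", 90.0, None),  # hypothetical, excluded from ranking
]

ranked = rank_by_value(rows)
for position, (row, score) in enumerate(ranked, start=1):
    print(f"#{position} {row.model}: {score:.1f} value")
```

Under these assumptions the sketch reproduces the table's ordering, and the hypothetical model without cost data is left out despite its higher success rate. Scores recomputed from the displayed Cost column can differ slightly from the published ones (for example, 77.1 / 0.445 is about 173.3 rather than 173.1), which would be consistent with the displayed cost being rounded or not identical to the best observed cost.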