Claw-some AI Agent Testing
Best For
Ranks models by benchmark quality per dollar of observed run cost, highlighting inexpensive, high-value OpenClaw agent options.
Quick Picks
Models with the lowest observed non-zero benchmark run cost:

- stepfun/step-3.5-flash via stepfun
- google/gemma-4-26b-a4b-it via google
- openai/gpt-oss-120b via openai
Budget recommendations use Value Score, defined as success percentage divided by best observed cost per run. Models without usable cost data are excluded.
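The Value Score definition above is a simple ratio. A minimal sketch of that computation (the function name and the one-decimal rounding are assumptions for illustration, not taken from this page's source):

```python
def value_score(success_pct: float, cost_per_run_usd: float) -> float:
    """Success percentage divided by best observed cost per run (USD)."""
    if cost_per_run_usd <= 0:
        # Models without usable (non-zero) cost data are excluded from rankings.
        raise ValueError("cost per run must be positive")
    return round(success_pct / cost_per_run_usd, 1)

# Reproducing the #1 row: 84.7% overall at $0.496 per run
print(value_score(84.7, 0.496))  # 170.8
```

Note how the ratio rewards cheap runs: gpt-oss-120b ranks third on value despite a much lower overall score, because its per-run cost is the lowest in the table.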
Side-by-side metrics for the strongest recommendations on this page.
| Rank | Model | Overall | Value Score | Cost / Run | Avg Time |
|---|---|---|---|---|---|
| #1 | stepfun/step-3.5-flash via stepfun | 84.7% | 170.8 | $0.496 | 236.4m |
| #2 | google/gemma-4-26b-a4b-it via google | 74.6% | 167.6 | $0.445 | 310.2m |
| #3 | openai/gpt-oss-120b via openai | 47.4% | 158.8 | $0.299 | 194.8m |
| #4 | mistralai/mistral-large-2512 via mistralai | 72.9% | 152.4 | $0.479 | 281.7m |
| #5 | openai/gpt-oss-20b via openai | 41.8% | 132.8 | $0.315 | 122.1m |