Search intent answer
Agent task success rate is the percentage of benchmark tasks an agent completes within their acceptance criteria. Success should capture more than test pass status: relevant diff scope, review quality, cost budget, and the absence of prohibited changes all count.
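A minimal sketch of what "more than test pass status" can look like in practice; the field names and criteria below are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass

# Hypothetical per-task result record; adapt fields to your own harness.
@dataclass
class TaskResult:
    tests_passed: bool          # functional acceptance (e.g. unit tests green)
    diff_in_scope: bool         # edits confined to the files the task allows
    review_approved: bool       # human or rubric review signed off
    cost_usd: float             # total spend for the attempt
    cost_budget_usd: float      # per-task budget agreed up front
    touched_prohibited: bool    # e.g. CI config, secrets, vendored code

def is_success(r: TaskResult) -> bool:
    """A task counts as a success only if every acceptance criterion holds."""
    return (
        r.tests_passed
        and r.diff_in_scope
        and r.review_approved
        and r.cost_usd <= r.cost_budget_usd
        and not r.touched_prohibited
    )

def success_rate(results: list[TaskResult]) -> float:
    """Share of benchmark tasks that meet all criteria, not just test pass."""
    return sum(is_success(r) for r in results) / len(results) if results else 0.0
```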
When it matters
- A team wants to know whether coding agents improve over time or degrade after updates.
- Leaders need a fair metric for comparing multiple tools across teams.
- A security or compliance reviewer asks how agent-assisted coding is measured and controlled.
How to operationalize it
- Define task-level acceptance criteria before the agent starts.
- Record pass, partial, fail, timeout, cost overrun, and unsafe-edit statuses separately (see the tally sketch after this list).
- Track success by agent, task family, repository, model version, CLI version, and prompt pack.
- Review confidence intervals or minimum sample sizes before making procurement decisions (see the interval sketch after this list).
- Use failure replay to explain major drops rather than relying only on the percentage.
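For the status-recording and breakdown items above, one way to tally runs; the status vocabulary and grouping keys are assumptions, not a fixed schema:

```python
from collections import Counter, defaultdict

# Illustrative status vocabulary; keep each outcome as its own bucket.
STATUSES = ("pass", "partial", "fail", "timeout", "cost_overrun", "unsafe_edit")

def summarize(runs: list[dict]) -> dict:
    """Count each outcome separately and break success rate down by agent,
    task family, repository, model version, CLI version, and prompt pack."""
    by_status = Counter(r["status"] for r in runs)
    by_group: dict[tuple, Counter] = defaultdict(Counter)
    for r in runs:
        key = (r["agent"], r["task_family"], r["repo"],
               r["model_version"], r["cli_version"], r["prompt_pack"])
        by_group[key][r["status"]] += 1
    rates = {
        key: counts["pass"] / sum(counts.values())
        for key, counts in by_group.items()
    }
    return {"by_status": dict(by_status), "success_rate_by_group": rates}
```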
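For the confidence-interval item, a Wilson score interval is one common choice for a success proportion; this sketch assumes a simple successes-out-of-n count:

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a success rate; z=1.96 gives roughly 95%.
    Useful for judging whether a gap between two tools is more than noise."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    margin = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, centre - margin), min(1.0, centre + margin))

# Example: 18/25 vs 42/50 looks like 72% vs 84%, but the intervals overlap
# widely, so the sample is too small to declare a winner.
print(wilson_interval(18, 25))  # roughly (0.52, 0.86)
print(wilson_interval(42, 50))  # roughly (0.71, 0.92)
```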
Common risks
- A high success rate can hide dangerous outliers if sensitive-file changes are not tracked separately.
- Tasks that are too easy inflate success rate and fail to predict production usefulness.
- Changing the benchmark set without versioning breaks trend comparisons.
How ClaudeBench Drift connects
ClaudeBench Drift calculates task success rate using a task taxonomy, versioned benchmark sets, and failure categories that make the metric actionable.