The new metrics: agent throughput, verification rate, recovery rate
How do you know the agents are actually working?
This is the question every operator asks the second week of running an agentic system in production. The old metrics don't fit. Tickets resolved assumes a human is closing them. Revenue per FTE doesn't know how to count an agent. NPS surveys aren't going to tell you that your support agent silently dropped half its tickets last Thursday.
You need a new vocabulary. Here it is, with formulas, with real numbers, and with the 20 lines of code you need to start tracking them today.
TL;DR
- The three metrics that matter most for an agentic-AI-first business are throughput, verification rate, and recovery rate. Track these three before anything else.
- Throughput = units of work produced per agent per day. The "units" are whatever the agent's job description says it ships (emails sent, posts published, tickets closed, leads qualified).
- Verification rate = the percentage of agent outputs that pass an independent check. If you don't have an independent check, the agent isn't running in production; it's running on trust.
- Recovery rate = the percentage of failures the system caught and fixed without a human in the loop. High recovery rate is the difference between a working system and a system that needs a babysitter.
- Two secondary metrics also matter: drift rate (how often the agent did something outside its authority envelope) and silent-green rate (how often the agent logged success with zero actual output).
Why the old metrics don't work
The metrics every SaaS dashboard ships with assume a human worker. Revenue per FTE was invented to measure how much output you get from a person on payroll. Tickets resolved counts the closes on a human's queue. NPS surveys ask one human about another human's behavior.
None of these work for an agent. An agent doesn't have an FTE cost the same way a person does. An agent can close a ticket without doing the work. An agent can claim a job complete and have produced nothing. The metric you need isn't a re-skinned human metric, it's a new metric that captures what an agent does and doesn't do.
We learned this the hard way. We ran our first agentic outbound stack for two weeks reporting "100% job success rate" because every scheduled task logged exit code zero. We then checked the actual outbox and found that one of the three send branches had been silently failing for nine days. The exit code was lying. The right metric would have caught it on day one.
That experience produced the three primary metrics below.
Metric 1: Throughput
Definition. Units of work produced per agent per day, where "units" are whatever the agent's job description says it ships.
Formula. throughput = count(outputs in period) / (count(agents) × count(days in period))
Real numbers from our running system. Blog Writer ships 1.0 substantive posts per agent per day, averaged over 30 days. Outreach Closer sends 14-32 cold emails per agent per day, with the variance driven by inbox health and ICP queue depth. Support Agent handles 3-9 inbound replies per day. Distributor pushes 4-6 syndication events per published asset (its unit of time is the asset, not the day, because it only works when something is published). Operator runs one full ops loop per day with about 40 sub-checks in each loop.
Why it matters. Throughput is the first thing to fall when something is wrong. If throughput drops from 32 emails per day to 8 with no obvious schedule change, something broke. You can debug from there. If you don't track throughput, the breakage hides behind "green" status checks for days.
How to instrument it. In every agent's main loop, write a single event to a shared log with the agent name, the timestamp, and the unit it just produced. In our stack, that's a one-line append to a JSON Lines file: runs.jsonl. Twenty lines of Python to set up. Read the file once a day, group by agent and date, count, divide.
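As a rough sketch of that daily read, assuming each logged row carries `ts`, `agent`, and `event_type` fields (the field names and the sample agent names here are illustrative, not a fixed schema), the group-and-count step might look like this:

```python
from collections import Counter
from datetime import datetime, timezone


def daily_throughput(rows):
    """Count "output" events per (agent, UTC date).

    `rows` is an iterable of dicts with "ts" (Unix seconds), "agent",
    and "event_type" keys. This schema is an assumption for the sketch.
    """
    counts = Counter()
    for row in rows:
        if row["event_type"] != "output":
            continue  # only shipped units count toward throughput
        day = datetime.fromtimestamp(row["ts"], tz=timezone.utc).date().isoformat()
        counts[(row["agent"], day)] += 1
    return counts


# Hypothetical rows standing in for the parsed log file.
sample = [
    {"ts": 0, "agent": "outreach_closer", "event_type": "output"},
    {"ts": 3600, "agent": "outreach_closer", "event_type": "output"},
    {"ts": 90000, "agent": "outreach_closer", "event_type": "output"},
    {"ts": 100, "agent": "blog_writer", "event_type": "success"},
]

for (agent, day), n in sorted(daily_throughput(sample).items()):
    print(agent, day, n)
```

Grouping by UTC date is a choice made for the sketch; if your agents run on a local schedule, bucket by that timezone instead so a late-night run doesn't land on the wrong day.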
Metric 2: Verification rate
Definition. The percentage of agent outputs that pass an independent check.
Formula. verification_rate = count(outputs verified pass) / count(outputs produced)
Real numbers. Our current verification rates: Blog Writer 0.94 (a separate QA agent reads each post and checks for the lint rules and the citation density before publish). Outreach Closer 0.91 (the linter checks every email against the format rules, and a second pass checks the actual SMTP send acknowledgement, not just the schedule's exit code). Support Agent 0.88 (a second agent reads every reply before send and checks for envelope compliance). Distributor 0.97 (each channel post is fetched back after publish to confirm it actually appeared).
Why it matters. This is the metric that catches the lying logs. If your agent says "I sent 40 emails" and the SMTP confirmation count is 12, your verification rate is 0.30, not 1.0. The exit code is not a verification. A verification is a separate check, by a separate process, against the actual artifact in the actual destination.
How to instrument it. Every agent has to write its claimed output to a queue. A separate verification process (another agent, a cron job, a Lambda) reads the claim, goes to the actual destination, and checks whether the artifact is there. If it is, mark verified. If it isn't, mark failed. Verification rate is verified divided by claimed.
The most important rule: the verification process must not be the same code that produced the output. If it's the same code, you're checking the agent's homework with the agent's homework.
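Here is a minimal sketch of that separation, using the 40-claimed, 12-confirmed example from above. The `artifact_exists` callable is a stand-in for whatever your verifier does to look in the real destination (an SMTP acknowledgement log, a fetch of the published URL); it and the `message_id` field are assumptions for illustration, not part of any particular library.

```python
def verification_rate(claims, artifact_exists):
    """Return verified / claimed, or None when nothing was claimed.

    `claims` is a list of dicts describing what an agent says it produced.
    `artifact_exists` is supplied by the verifier, not by the agent, and
    checks the real destination (hypothetical in this sketch).
    """
    if not claims:
        return None  # no claims means no rate, not a perfect score
    verified = sum(1 for claim in claims if artifact_exists(claim))
    return verified / len(claims)


# The agent claims 40 sends; the SMTP acknowledgement log shows 12.
claims = [{"message_id": f"msg-{i}"} for i in range(40)]
acknowledged = {f"msg-{i}" for i in range(12)}  # stand-in for the ack log

rate = verification_rate(claims, lambda c: c["message_id"] in acknowledged)
print(f"{rate:.2f}")
```

Returning `None` for an empty claim list is deliberate: an agent that claimed nothing should show up as a throughput problem, not as a verification rate of 1.0.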
Metric 3: Recovery rate
Definition. The percentage of failures the system caught and fixed without a human in the loop.
Formula. recovery_rate = count(failures auto-recovered) / count(failures)
Real numbers. Recovery rate across our running stack: 0.71 over the last 30 days. That means the system fixed 71% of the failures it saw last month without me touching anything. The other 29% escalated to me via the NEEDS_CHRISTINE queue, where I make the call.
Why it matters. Recovery rate is the difference between a system that runs the business and a system that needs you in the loop. If your recovery rate is 0.10, you're a babysitter, not an operator. If your recovery rate is 0.70+, you have a real autonomous system. If your recovery rate is 0.95+, you've either built something extraordinary or you've stopped escalating things that should be escalated. The second case is more common; check your escalation criteria before celebrating.
How to instrument it. Every agent failure has to write a row to a shared incidents log. Every recovery action (retry, fallback, alternative path, graceful degradation) writes another row referencing the original incident. Every human-handled escalation writes its own row, also referencing the original incident. Recovery rate is (incidents resolved by an agent) / (total incidents).
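One way to compute this from the incidents log, sketched under the assumption that every row has an `incident_id` and a `kind` of `failure`, `recovery`, or `escalation` (those field names and values are invented for this example):

```python
def recovery_rate(rows):
    """Share of failed incidents that were closed without a human.

    An incident that was retried but still ended up escalated is counted
    as human-handled, not as recovered.
    """
    failures = {r["incident_id"] for r in rows if r["kind"] == "failure"}
    recovered = {r["incident_id"] for r in rows if r["kind"] == "recovery"}
    escalated = {r["incident_id"] for r in rows if r["kind"] == "escalation"}
    if not failures:
        return None  # nothing failed, so there is no rate to report
    auto_recovered = (recovered - escalated) & failures
    return len(auto_recovered) / len(failures)


# Hypothetical log: four incidents, two fixed automatically.
incidents = [
    {"incident_id": "i1", "kind": "failure"},
    {"incident_id": "i1", "kind": "recovery"},
    {"incident_id": "i2", "kind": "failure"},
    {"incident_id": "i2", "kind": "recovery"},
    {"incident_id": "i3", "kind": "failure"},
    {"incident_id": "i3", "kind": "recovery"},    # retry was attempted...
    {"incident_id": "i3", "kind": "escalation"},  # ...but a human finished it
    {"incident_id": "i4", "kind": "failure"},
    {"incident_id": "i4", "kind": "escalation"},
]

print(recovery_rate(incidents))
```

Counting the retried-then-escalated incident as a human fix keeps the number honest; otherwise every failed retry would inflate the rate.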
Secondary metric 4: Drift rate
Definition. How often the agent did something outside its authority envelope.
Formula. drift_rate = count(actions outside envelope) / count(total actions)
Real numbers. Across our running stack, drift rate sits around 0.012, meaning roughly 1 in 80 actions is something we have to flag as outside the agent's defined scope. Most drift is harmless (the agent tried to send to a channel it's not allowed to write to, the action got blocked at the permission layer). Some drift is loud and would have been damaging if the envelope hadn't been enforced (the Outreach Closer once tried to email a contact that the Lead Sourcer had marked do-not-contact; the envelope check rejected the send).
Why it matters. Drift rate is your safety telemetry. A drift rate above 0.05 means the agent doesn't understand its own job description. A drift rate of zero means either you've engineered a beautifully tight envelope or your envelope check isn't running. The latter is the more common case.
How to instrument it. Every action the agent proposes goes through a permission check before execution. If the check rejects the action, log it as a drift event. The shape of the event needs to include what the agent tried to do, why it was rejected, and what the agent did instead.
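A minimal sketch of that gate, assuming the envelope can be expressed as a set of allowed action types. Real envelopes are usually richer than this (per-recipient rules, do-not-contact lists), and the function name, field names, and the "skipped" fallback below are all illustrative choices:

```python
import time


def guarded_execute(agent, action, allowed_types, execute, drift_log):
    """Run `action` only if its type is inside the agent's envelope.

    Rejected actions are appended to `drift_log` with what was attempted,
    why it was rejected, and what happened instead.
    """
    if action["type"] in allowed_types:
        return execute(action)
    drift_log.append({
        "ts": time.time(),
        "agent": agent,
        "attempted": action,
        "reason": f"{action['type']} is outside the envelope",
        "did_instead": "skipped",  # this sketch simply drops the action
    })
    return None


# Hypothetical run: three allowed sends and one blocked channel post.
drift_log = []
proposed = [
    {"type": "send_email", "to": "a@example.com"},
    {"type": "send_email", "to": "b@example.com"},
    {"type": "post_to_channel", "channel": "announcements"},
    {"type": "send_email", "to": "c@example.com"},
]
for action in proposed:
    guarded_execute("outreach_closer", action, {"send_email"},
                    execute=lambda a: "sent", drift_log=drift_log)

drift_rate = len(drift_log) / len(proposed)
print(drift_rate)
```

The important property is that the check runs before execution and that a rejection always leaves a row behind, so a drift rate of zero can be told apart from a gate that never ran.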
Secondary metric 5: Silent-green rate
Definition. How often the agent logged success with zero actual output.
Formula. silent_green_rate = count(success-logged with empty output) / count(success-logged total)
Real numbers. Our target is 0.00. Anything above 0.00 is a system bug. Last month we recorded two silent-green events (both caused by a model API rate-limit response that returned 200 with an empty body, which we then logged as a success). Both were caught by the verification check, which is exactly the redundancy we wanted.
Why it matters. This is the failure mode that destroys your trust in the system. You think the agent worked, the log says it worked, but nothing happened. If you don't measure silent-green explicitly, you'll discover it by reading a complaint email from a customer asking "where's the thing you said you sent me."
How to instrument it. When an agent logs success, also log the byte count, the row count, or the artifact pointer of what it produced. A separate process scans success logs for null, zero, or empty artifact pointers and writes them to the silent-green incident table.
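The scan itself can be very small. This sketch assumes success rows may carry `bytes`, `rows`, or `artifact` fields as evidence of output; those names are placeholders for whatever your agents actually log.

```python
def is_empty_artifact(row):
    """True when a success row carries no evidence of real output."""
    evidence = [row.get("bytes"), row.get("rows"), row.get("artifact")]
    # None, 0, and "" are all falsy, so any() is False when nothing was produced.
    return not any(evidence)


def silent_green_rate(rows):
    """Return (rate, offending_rows) over all success-logged rows."""
    successes = [r for r in rows if r["event_type"] == "success"]
    if not successes:
        return None, []
    silent = [r for r in successes if is_empty_artifact(r)]
    return len(silent) / len(successes), silent


# Hypothetical log: four logged successes, one with nothing behind it.
log_rows = [
    {"agent": "blog_writer", "event_type": "success", "bytes": 5120, "artifact": "posts/101.md"},
    {"agent": "blog_writer", "event_type": "success", "bytes": 0, "artifact": None},
    {"agent": "support_agent", "event_type": "success", "rows": 3},
    {"agent": "support_agent", "event_type": "success", "artifact": "replies/88.eml"},
    {"agent": "support_agent", "event_type": "fail"},
]

rate, offenders = silent_green_rate(log_rows)
print(rate, offenders)
```

Returning the offending rows alongside the rate makes it easy to write them straight into the silent-green incident table.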
Putting it on one dashboard
Here's the table we read every morning. Two minutes per agent.
| Agent | Throughput (30d avg) | Verification rate | Recovery rate | Drift rate | Silent-green |
|---|---|---|---|---|---|
| Blog Writer | 1.0 posts/day | 0.94 | 0.83 | 0.005 | 0.00 |
| Outreach Closer | 22 emails/day | 0.91 | 0.78 | 0.018 | 0.00 |
| Support Agent | 6 replies/day | 0.88 | 0.70 | 0.011 | 0.00 |
| Distributor | 5 events/asset | 0.97 | 0.81 | 0.004 | 0.00 |
| Lead Sourcer | 18 leads/day | 0.93 | 0.74 | 0.009 | 0.00 |
| Operator | 1 loop/day | 0.96 | 0.65 | 0.001 | 0.00 |
That's the whole dashboard. Six columns. You can read it on your phone. Anything trending down two days in a row is a flag. Anything that drops more than 20% day-over-day is an incident.
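Those two rules are simple enough to automate. The sketch below applies them to a list of daily values, oldest first, for a metric where higher is better (throughput, verification rate, recovery rate); for drift and silent-green you would flip the comparison. The function name and return labels are made up for this example.

```python
def classify(series, drop_threshold=0.20):
    """Label a daily series as "incident", "flag", or "ok".

    incident: the latest value fell more than `drop_threshold` day-over-day.
    flag:     the value went down two days in a row.
    """
    if len(series) >= 2 and series[-2] > 0:
        drop = (series[-2] - series[-1]) / series[-2]
        if drop > drop_threshold:
            return "incident"
    if len(series) >= 3 and series[-1] < series[-2] < series[-3]:
        return "flag"
    return "ok"


print(classify([32, 30, 8]))         # large one-day drop in emails sent
print(classify([0.94, 0.93, 0.92]))  # slow two-day slide in verification rate
print(classify([22, 22, 23]))        # steady
```

The incident check runs first so that a sharp drop is never downgraded to a flag just because it also happens to be the second down day.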
The 20-line implementation
Here's the minimum-viable instrumentation. Every agent appends to a shared JSON Lines file. A second agent reads it once a day, computes the metrics, writes the table.
```python
import json
import time
from pathlib import Path

LOG = Path("runs.jsonl")


def log_event(agent: str, event_type: str, **kwargs) -> None:
    """Every agent calls this. event_type in {output, success, fail, drift, recover}."""
    row = {
        "ts": time.time(),
        "agent": agent,
        "event_type": event_type,
        **kwargs,
    }
    # Append one JSON object per line so the reader can parse line by line.
    with LOG.open("a") as f:
        f.write(json.dumps(row) + "\n")


def compute_throughput(agent: str, days: int = 30) -> float:
    """Average "output" events per day for one agent over the last `days` days."""
    if not LOG.exists():
        return 0.0  # nothing has been logged yet
    cutoff = time.time() - days * 86400
    with LOG.open() as f:
        # Skip blank lines so a stray newline doesn't crash the daily read.
        rows = [json.loads(line) for line in f if line.strip()]
    outputs = [r for r in rows
               if r["agent"] == agent
               and r["event_type"] == "output"
               and r["ts"] >= cutoff]
    return len(outputs) / days
```
That's enough to start. Verification rate, recovery rate, drift rate, and silent-green rate are the same pattern with different filters. Wire them up this afternoon. You'll know within a week whether your agents are actually working.
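To make "same pattern with different filters" concrete, here is one possible generic helper. It assumes rows shaped like the ones written to the log (an `agent` and an `event_type` per row) and treats a rate as one event type counted over another; the helper name and the sample rows are illustrative.

```python
def event_ratio(rows, agent, numerator, denominator):
    """count(numerator events) / count(denominator events) for one agent.

    Returns None when there are no denominator events, so "nothing happened"
    is never reported as a perfect or a zero score.
    """
    mine = [r for r in rows if r["agent"] == agent]
    top = sum(1 for r in mine if r["event_type"] == numerator)
    bottom = sum(1 for r in mine if r["event_type"] == denominator)
    return top / bottom if bottom else None


# Hypothetical rows: four failures, three of them followed by a recovery.
events = (
    [{"agent": "support_agent", "event_type": "fail"}] * 4
    + [{"agent": "support_agent", "event_type": "recover"}] * 3
    + [{"agent": "blog_writer", "event_type": "fail"}]
)

print(event_ratio(events, "support_agent", "recover", "fail"))
```

Swap the two event types to get the other rates, for example `drift` over all proposed actions, once you log an event for each proposed action.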
What this dashboard tells you that the old one doesn't
It tells you the system is honest. It tells you which agents are pulling weight. It tells you when something silently broke. It tells you whether you've built a working autonomous system or whether you've built a slow-motion incident waiting for a customer to find it for you.
The old dashboard (uptime, revenue, NPS) tells you whether the business is alive. The new dashboard tells you whether the agents that run the business are alive. Those are different questions. You need both.
What's coming next
Tomorrow's post catalogs the specific failure modes we've caught with this dashboard: silent green exits, mocked work, fabricated outputs, schedule drift, authority creep. If this post is the dashboard, the next one is the field manual for what to do when the dashboard turns red.
Read it alongside the cornerstone definition of an agentic-AI-first business and the post on hiring and you'll have the org chart, the dashboard, and the incident book that every operator of one of these systems needs.
Want this instrumented for your agents in a week? The blueprint catalog includes a Verification Queue and an Incident Harness blueprint. Single email, single payment. Or email christine@operatoriq.io with what you're running today. Email only, no calls.