Observability and verification: how to know agents are doing the work

Your agent shipped 140 outputs last week. Your Datadog dashboard says everything is fine. Every API call returned a 200. The agent ran inside its time budget. Your CEO asked, quietly, whether you actually know if any of those 140 outputs were correct.

You don't. And that's not a Datadog problem.

Monitoring tells you the agent didn't crash. Verification tells you the agent did the right thing. Most teams have monitoring. Almost nobody has verification. The gap between the two is where every agentic-AI horror story starts.

Here's the verification queue pattern we run on top of our agent team at VentureIO. It is a week of work to build, it does not require a new platform, and it caught seven decisions in May that would have embarrassed the business if they'd shipped.

What monitoring tells you (and doesn't)

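Your monitoring stack (Datadog, Grafana, CloudWatch, whatever) measures whether the agent ran: calls went out, status codes came back clean, and each run stayed inside its time budget.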

All useful. None of these tell you whether the agent's output was correct.

Here is a real example from us. An agent we run sends scheduled drip emails to leads who haven't replied in 14 days. Last March, the agent had a 99.6% monitoring health score for the whole month. Every email sent. Every API call returned 200. Time per send was inside budget.

It was also writing emails that referenced the wrong product to about 8% of the recipients. The agent had picked up a stale field from the lead record and used the previous product name. The CEO found out from a customer reply that said "we don't even use that product." That customer was a buyer in a parallel deal. It cost us the deal.

The monitoring stack was happy. The verification layer would have caught it on the third email.

What verification actually is

Verification is a second agent reading the first agent's output and answering one question: does this output meet the criteria we set for this type of work?

That sentence contains the whole pattern. Three pieces.

One: there is a separate agent. Not the same agent grading its own output. Not a built-in "confidence score" from the model. A separate Claude or GPT call with a different system prompt, different role, and different incentives.

Two: there is a written criteria document for each type of output. The criteria are pre-defined. The verifier reads the output and the criteria, and produces a pass / fail / unsure verdict with a one-sentence reason.

Three: there's a queue. Pass goes through. Fail gets blocked. Unsure goes to a human review queue. The human queue is the founder's morning briefing, and it stays under 10 items per day if the criteria are written well.

That's it. That's the system.
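As a rough illustration of the queue step, the routing can be sketched in a few lines of Python. The `Verdict`, `Action`, and `route` names are hypothetical, and sending unrecognized verdicts to human review is an assumption rather than something our spec dictates.

```python
from enum import Enum


class Verdict(str, Enum):
    PASS = "pass"
    FAIL = "fail"
    UNSURE = "unsure"


class Action(str, Enum):
    SHIP = "ship"
    BLOCK = "block"
    HUMAN_REVIEW = "human_review"


# Pass goes through, fail gets blocked, unsure goes to the human queue.
ROUTES = {
    Verdict.PASS: Action.SHIP,
    Verdict.FAIL: Action.BLOCK,
    Verdict.UNSURE: Action.HUMAN_REVIEW,
}


def route(verdict: str) -> Action:
    """Map a verifier verdict to a queue action.

    Assumption: a verdict we do not recognize is treated like "unsure",
    so a person looks at it instead of it shipping silently.
    """
    try:
        return ROUTES[Verdict(verdict)]
    except ValueError:
        return Action.HUMAN_REVIEW
```

For example, `route("fail")` returns `Action.BLOCK`, and a garbled verdict string falls back to `Action.HUMAN_REVIEW`.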

The sampling rules

You do not verify every output. The cost is too high and the latency hits product UX. You sample. Three rules in our system:

    Rule 1: random sample
      4 outputs per agent per day, uniformly random
      catches drift the other rules miss

    Rule 2: confidence-based
      any output where the agent's self-rating is below 0.7
      catches the cases the doing-agent already suspected

    Rule 3: high-impact
      100% sample for: outbound emails to non-team addresses,
      payments, contract changes, public posts, customer-facing decisions
      catches the irreversible ones
    

The numbers are not religious. Tune them. We started with 2 random samples and moved to 4 after we missed something in April. Confidence threshold started at 0.6 and moved to 0.7 because the verifier was catching too many borderline-fine outputs.

The high-impact list is the most important one. Any action that you cannot undo with one click gets sampled at 100%. Outbound email to a customer is sampled. Slack post inside the team is not. Payment processing is sampled. Internal database read is not.

If you don't sample irreversible actions at 100% you will eventually ship one you regret.
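The three rules above can be sketched as one selection function. This is a minimal sketch, not our production service: the `AgentOutput` shape and the action names in `HIGH_IMPACT_ACTIONS` are hypothetical, and it works on a finished batch of one agent's outputs for one day. In a live system the high-impact and confidence checks would have to run inline, before the action ships.

```python
import random
from dataclasses import dataclass

# Hypothetical action labels standing in for the high-impact list.
HIGH_IMPACT_ACTIONS = {
    "outbound_email", "payment", "contract_change",
    "public_post", "customer_decision",
}
CONFIDENCE_THRESHOLD = 0.7   # Rule 2: self-rating below this gets verified
RANDOM_SAMPLES_PER_DAY = 4   # Rule 1: uniform random sample per agent per day


@dataclass(frozen=True)
class AgentOutput:
    output_id: str
    agent: str
    action: str
    self_rating: float


def select_for_verification(outputs, rng=None):
    """Return {output_id: sampled_by} for one agent's outputs from one day."""
    rng = rng or random.Random()
    selected = {}
    remainder = []
    for out in outputs:
        if out.action in HIGH_IMPACT_ACTIONS:
            selected[out.output_id] = "high_impact"   # Rule 3: always
        elif out.self_rating < CONFIDENCE_THRESHOLD:
            selected[out.output_id] = "confidence"    # Rule 2
        else:
            remainder.append(out)
    # Rule 1: uniform random sample from whatever the other rules skipped.
    k = min(RANDOM_SAMPLES_PER_DAY, len(remainder))
    for out in rng.sample(remainder, k):
        selected[out.output_id] = "random"
    return selected
```

The returned labels match the `sampled_by` values used in the queue table, so each row records which rule pulled it in.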

The verification agent's prompt

The shape of the verifier's system prompt is short. Here's a redacted version of ours for outbound-email verification:

    You are the verification agent for the Outreach Closer.

    You will receive: (1) the email draft the Outreach Closer wrote,
    (2) the lead record it was based on, (3) the criteria document.

    Your job: read the draft against the criteria, then output JSON:
    {
      "verdict": "pass" | "fail" | "unsure",
      "reason": "one sentence, concrete",
      "flags": [array of specific criteria violated]
    }

    Criteria for outbound email:
    1. Names the recipient's actual company name (not a generic placeholder)
    2. References a real public signal about that company in the last 90 days
    3. Does not promise outcomes ("we will get you 10x leads")
    4. Does not contain em-dashes, "leverage", "synergy", "delve",
       "robust", "seamless"
    5. Closes with a specific ask under 12 words
    6. Does not exceed 130 words total

    If any criterion fails, verdict is "fail". If you are not sure
    whether a criterion is met, verdict is "unsure".

    Output JSON only. No prose.
    

Six criteria. Each criterion can be checked against the draft and the lead record the verifier receives. No vague tests like "is this email good." The verifier can apply each criterion independently and the verdict is reproducible.

Writing this prompt for each agent is the actual work in the project. Plan a half-day per agent type. The verifier's prompt is a living artifact that you tune for two weeks after going live.
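Because the prompt asks for JSON only, the service that calls the verifier still has to cope with replies that are not valid JSON. A minimal sketch of that parsing step, assuming a hypothetical `parse_verdict` helper and assuming that any malformed reply should land in the human queue as "unsure":

```python
import json

VALID_VERDICTS = {"pass", "fail", "unsure"}


def parse_verdict(raw: str) -> dict:
    """Parse the verifier's JSON reply; anything malformed becomes 'unsure'."""
    fallback = {
        "verdict": "unsure",
        "reason": "verifier reply could not be parsed",
        "flags": [],
    }
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict):
        return fallback
    verdict = data.get("verdict")
    if not isinstance(verdict, str) or verdict not in VALID_VERDICTS:
        return fallback
    flags = data.get("flags") or []
    if not isinstance(flags, list):
        return fallback
    # The prompt says any failed criterion means "fail". If the reply lists
    # violated criteria but still says "pass", apply the prompt's rule here.
    if flags and verdict == "pass":
        verdict = "fail"
    return {"verdict": verdict, "reason": str(data.get("reason", "")), "flags": flags}
```

The design choice is that the parser never raises and never invents a pass: a reply it cannot read goes to a person.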

The queue table

One table:

    CREATE TABLE verification_queue (
        id BIGSERIAL PRIMARY KEY,
        doing_agent TEXT NOT NULL,
        output_id TEXT NOT NULL,
        output_payload JSONB NOT NULL,
        verdict TEXT NOT NULL,            -- pass / fail / unsure
        reason TEXT,
        flags JSONB,
        sampled_by TEXT NOT NULL,         -- random / confidence / high_impact
        ts TIMESTAMP NOT NULL DEFAULT now(),
        human_reviewed BOOLEAN DEFAULT false,
        human_verdict TEXT,
        human_notes TEXT
    );
    

Every verification result lands here. Pass rows can be deleted on a 30-day rolling window. Fail and unsure rows get human review and are kept indefinitely: they're your training corpus for tuning the criteria.
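The retention rule is small enough to sketch directly. This version uses Python's built-in `sqlite3` as a stand-in for Postgres and assumes `ts` is stored as `YYYY-MM-DD HH:MM:SS` text; the `purge_old_passes` name is hypothetical, and with a Postgres driver the placeholder syntax would differ.

```python
import sqlite3
from datetime import datetime, timedelta


def purge_old_passes(conn: sqlite3.Connection, now: datetime, keep_days: int = 30) -> int:
    """Delete pass rows older than keep_days and return how many were removed.

    Fail and unsure rows are never touched, whatever their age.
    """
    cutoff = (now - timedelta(days=keep_days)).isoformat(sep=" ", timespec="seconds")
    cur = conn.execute(
        "DELETE FROM verification_queue WHERE verdict = 'pass' AND ts < ?",
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount
```

Run it once a day from whatever scheduler the service already uses.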

The morning briefing the founder gets is a query against this table. "Show me everything from yesterday with verdict fail or unsure, ordered by impact." If that list is under 10 items, the founder spends 12 minutes on it. If it's over 25 items, the criteria need tuning because the verifier is too tight.
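Here is one way the briefing query might look, again using `sqlite3` as a stand-in for Postgres with `ts` stored as ISO text. The table has no impact column, so ordering by `sampled_by` (high-impact first, then fails before unsures) is an assumption about what "ordered by impact" means. The "watch" band between 10 and 25 items is also an assumption, since the text only names the two ends.

```python
import sqlite3
from datetime import date, timedelta

BRIEFING_SQL = """
SELECT id, doing_agent, output_id, verdict, reason, sampled_by
FROM verification_queue
WHERE verdict IN ('fail', 'unsure')
  AND ts >= ? AND ts < ?
ORDER BY
  CASE sampled_by WHEN 'high_impact' THEN 0 WHEN 'confidence' THEN 1 ELSE 2 END,
  CASE verdict WHEN 'fail' THEN 0 ELSE 1 END,
  ts
"""


def morning_briefing(conn: sqlite3.Connection, today: date) -> list:
    """Return yesterday's fail and unsure rows, highest impact first."""
    yesterday = today - timedelta(days=1)
    params = (yesterday.isoformat(), today.isoformat())
    return conn.execute(BRIEFING_SQL, params).fetchall()


def queue_health(item_count: int) -> str:
    """Classify the size of the morning queue."""
    if item_count < 10:
        return "ok"
    if item_count > 25:
        return "tune criteria"   # the verifier is too tight
    return "watch"               # assumed middle band, not from our spec
```

Calling `queue_health(len(morning_briefing(conn, date.today())))` each morning gives an early signal that the criteria need another tuning pass.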

Why the verifier must be a separate agent

People shortcut this. They put the verification logic inside the doing agent, with a "now check your own work" step at the end.

It does not work. A self-check shares the doing agent's blind spots. If the doing agent doesn't know the product name is stale, it will not know its own draft is wrong. Self-grading is bias-laundering.

The verifier is a separate agent with a different system prompt, different access scope, and ideally a different model family. We run our verifier on a smaller, cheaper model than the doing agent. Verification is a focused, criteria-bound task, and the smaller model does it well at a fraction of the cost.

Separate prompt. Separate role. Separate model. Different incentives. That's the only configuration that actually catches things.
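One way to keep that separation from eroding over time is to check it in code when the service starts. The sketch below is illustrative only: `AgentConfig`, its fields, and `separation_problems` are hypothetical names, not part of any agent framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentConfig:
    name: str
    model_family: str
    system_prompt: str
    scopes: frozenset


def separation_problems(doer: AgentConfig, verifier: AgentConfig) -> list:
    """List the ways a verifier is configured too much like its doing agent."""
    problems = []
    if doer.system_prompt == verifier.system_prompt:
        problems.append("shared system prompt")
    if doer.scopes == verifier.scopes:
        problems.append("identical access scope")
    if doer.model_family == verifier.model_family:
        # A different family is the ideal, so treat this as a warning.
        problems.append("same model family (allowed, but not ideal)")
    return problems
```

An empty list means the pair passes; anything else is worth a look before the verifier goes live.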

What this catches that monitoring doesn't

A partial list of things our verification layer caught in May 2026 that our monitoring stack would have missed completely:

Seven catches in 31 days. Each one would have shipped without verification. Each one would have caused a customer complaint, a public correction, or a hard conversation. The verification layer paid for itself in the second week.

What it costs to run

We instrumented the whole thing in seven days of engineering time. Ongoing API cost runs about $40 a month: the verifier is small-model, the queue is Postgres, the notifications are Slack. The human-review time is under 15 minutes a day for the founder.

The number to beat is the cost of one shipped mistake. One email to a buyer with the wrong company name costs more than a year of verification API calls, which at $40 a month is about $480.

If you want this verification queue built into your agent stack in 14 days, see the blueprints. The spec includes the criteria templates we use for each agent type, the verifier prompts, the SQL schema, and the morning-briefing query.

The order of operations

If you're starting from no verification today:

  1. Today. List every agent you run. Note which of their outputs are reversible and which are not.
  2. This week. For the irreversible ones, write the criteria document. Six to ten criteria each. Pre-write them, don't generate them.
  3. Next week. Build the verifier agent. One Python service, three sampling rules, one queue table, one Slack channel.
  4. Two weeks in. Tune. Watch what gets flagged. Watch what slips through. Tighten or loosen the criteria. Re-tune the random sample rate.
  5. Steady state. The founder reads the morning queue in 12 minutes. Mistakes get caught before they leave the building.

Verification is what turns an agent team from "the thing you worry about overnight" into "the thing that runs while you sleep." Build it before you scale. Build it before you ship more agents. Build it this week.


If you want the verifier prompt templates and the criteria docs we use across our 17 agents, email me at christine@operatoriq.io. Subject line: "verifier spec."

Next: how to put the right security and compliance controls around all of this so your CISO greenlights the rollout.