Agentic AI failure modes: silent green exits and other gotchas

The agent said it sent the emails. You checked the inbox. Nothing went out.

If you've shipped an agent into production, you've seen some version of that scenario. The schedule ran. The log said success. The work didn't happen. This is not a hypothetical safety problem. It's the most common production issue in agentic systems right now, and almost nobody is writing about it because almost nobody is willing to publish the war stories.

We've caught all of these in our own stack. This post catalogs them with the actual incidents and the one-line detection patterns that have kept the system honest for eight months.

What "failure mode" actually means here

In a traditional system, a failure is loud. The process crashed. The endpoint returned 500. The cron job exited non-zero. You have logs, you have alerts, and you know something broke.

An agent failure is different. It's quiet. The exit code is zero. The status is green. The schedule ran on time. The log says "complete." Everything in your monitoring tells you the system is working. The work didn't happen.

Quiet failures are the ones that matter because they're the ones you don't catch. A loud failure gets fixed within hours. A quiet failure runs for nine days before a customer asks where their thing is.

The seven modes below are all quiet by default. Each one has a detection pattern that makes it loud.

Failure mode 1: Silent green exits

What happens. The agent's scheduled run completes. The exit code is zero. The log file says "100% success." The actual work, like the email send, the post publish, or the API call, produced nothing.

Real incident. Our first Outreach Closer was reporting 100% job-success for two weeks. Every scheduled task exited zero. Every dashboard was green. I checked the actual outbox. One of the three send branches had been silently failing for nine days because of a stale OAuth token. The exit code was lying because the code path that hit the broken branch was wrapped in a try/except that swallowed the error and continued to the "log success" line.
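To make the mechanism concrete, here is a minimal sketch of how a swallowed exception turns a broken branch into a green run. This is a hypothetical reconstruction, not the original Outreach Closer code: the branch names and the PermissionError stand in for whatever the real send branches and the stale-token error looked like.

```python
# Hypothetical reconstruction, not the original Outreach Closer code.

def send_branch_ok() -> None:
    """A send branch that works."""

def send_branch_stale_token() -> None:
    """A send branch that fails the way a stale OAuth token might."""
    raise PermissionError("stale OAuth token")

BRANCHES = {"branch_a": send_branch_ok, "branch_b": send_branch_stale_token}

def run_swallowing(branches) -> int:
    """Anti-pattern: every error is swallowed, so the exit code is always 0."""
    for send in branches.values():
        try:
            send()
        except Exception:
            pass  # the failure disappears here
    print("log: success")
    return 0

def run_counting(branches) -> tuple[int, list[str]]:
    """Keep going after a failure, but record it and exit non-zero."""
    failed = []
    for name, send in branches.items():
        try:
            send()
        except Exception as exc:
            failed.append(f"{name}: {exc}")
    return (1 if failed else 0), failed

print(run_swallowing(BRANCHES))
print(run_counting(BRANCHES))
```

The second version still finishes the healthy branches, but the run can no longer report zero when one of them broke.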

Detection pattern. When the agent logs success, also log the byte count, the row count, or the artifact pointer of what it actually produced. A separate process scans success logs for null, zero, or empty pointers and flags them. If your success log doesn't include the actual output count, you can't detect silent green.

One-line check. if claimed_success and (output_count == 0 or output_pointer is None): incident("silent_green", agent, claimed_outputs)
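A slightly fuller sketch of the scanner, assuming success logs are written as one JSON object per line. The field names (status, output_count, output_pointer) and the sample pointer are assumptions for illustration; substitute whatever your agents actually log.

```python
import json
from typing import Iterable

def find_silent_green(log_lines: Iterable[str]) -> list[dict]:
    """Return records that claim success but show no evidence of output."""
    suspects = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("status") != "success":
            continue  # loud failures are handled by normal alerting
        no_count = not record.get("output_count")      # missing, None, or 0
        no_pointer = not record.get("output_pointer")  # missing, None, or ""
        if no_count or no_pointer:
            suspects.append(record)
    return suspects

# Hypothetical log lines for illustration.
sample_log = [
    '{"agent": "outreach_closer", "status": "success", "output_count": 12, "output_pointer": "outbox/batch-a"}',
    '{"agent": "outreach_closer", "status": "success", "output_count": 0, "output_pointer": null}',
    '{"agent": "distributor", "status": "failed", "output_count": 0, "output_pointer": null}',
]

flagged = find_silent_green(sample_log)
for record in flagged:
    print("silent_green:", record["agent"])
```

The scanner runs as its own process on purpose: if it lived inside the agent, the same swallowed error could silence it too.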

Failure mode 2: Mocked work

What happens. During development, the agent was wired against a mock or a stub. The code shipped to production with the mock still in the call path. The agent runs against the mock and reports success.

Real incident. Our second Support Agent was tested against a Gmail mock. The production deploy didn't replace the mock fixture cleanly. For four days, every "reply sent" was actually a write to a local JSON file. No customer saw any of those replies. We caught it with a routine inbox spot check that asked "did this customer ever get our reply?"

Detection pattern. Verification has to call the real destination, not the agent's local representation of it. Don't trust the agent's "I sent it" log. Go to the destination (the inbox, the database, the channel) and check whether the artifact exists.

One-line check. if not destination.confirms_artifact(claimed_id): incident("mocked_work", agent, claimed_id)
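One way to structure that check, as a sketch. The Destination interface and the InMemoryInbox class are hypothetical; in production the confirms_artifact method would query the live inbox, database, or channel. The in-memory version exists only so the example runs on its own.

```python
from typing import Protocol

class Destination(Protocol):
    """Anything that can answer: does this artifact exist at the real destination?"""
    def confirms_artifact(self, artifact_id: str) -> bool: ...

class InMemoryInbox:
    """Stand-in for a real destination client, used here only for the demo."""
    def __init__(self, delivered_ids: set[str]) -> None:
        self._delivered = delivered_ids

    def confirms_artifact(self, artifact_id: str) -> bool:
        return artifact_id in self._delivered

def unconfirmed_claims(destination: Destination, claimed_ids: list[str]) -> list[str]:
    """Return every claimed artifact ID the destination cannot confirm."""
    return [cid for cid in claimed_ids if not destination.confirms_artifact(cid)]

# The agent claims two replies were sent; the destination only has one.
inbox = InMemoryInbox(delivered_ids={"reply-001"})
missing = unconfirmed_claims(inbox, ["reply-001", "reply-002"])
print(missing)
```

The important design choice is that the verifier gets its own destination client, configured separately from the agent's, so a mock left in the agent's call path can't also fool the check.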

Failure mode 3: Fabricated outputs

What happens. The agent produces output that looks like real work but is not real work. A "report" full of plausible-sounding numbers that don't exist anywhere. A "summary" of a meeting that didn't happen. A "draft" that cites sources that don't exist.

Real incident. Our first version of the Analyst agent was asked to summarize last week's outreach performance. It produced a paragraph including "open rate of 38%, click rate of 4.2%, three positive replies from director-level prospects." None of those numbers came from the actual data. The agent made them up to satisfy the shape of the output it had been asked for. The Analyst now runs against a strict schema where each metric must trace to a row ID in the source data; if any number can't be traced, the agent refuses to publish the report.

Detection pattern. Every claim has to be traceable to a row, a record, a file, or a URL. The agent's job description includes "if you can't cite the source, do not publish." A second agent reads the output and checks each citation against the source. If a citation is broken or fabricated, the output is rejected.

One-line check. for claim in agent_output.claims: source_exists(claim.source_id) or incident("fabrication", agent, claim)
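A sketch of the trace-or-refuse rule, assuming each claim carries the ID of the source row it came from. The Claim shape and the row IDs are made up for illustration; the real Analyst schema isn't shown in this post.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Claim:
    text: str
    source_id: Optional[str]  # ID of the source row backing this claim, if any

def untraceable_claims(claims: list[Claim], source_row_ids: set[str]) -> list[Claim]:
    """Return claims with no source ID, or with an ID that isn't in the source data."""
    return [
        c for c in claims
        if c.source_id is None or c.source_id not in source_row_ids
    ]

def may_publish(claims: list[Claim], source_row_ids: set[str]) -> bool:
    """Publish only when every single claim traces to a real row."""
    return not untraceable_claims(claims, source_row_ids)

# Hypothetical source data and report claims.
rows = {"row-17", "row-18"}
report = [
    Claim("Replies received this week", source_id="row-17"),
    Claim("Open rate of 38%", source_id=None),          # no source at all
    Claim("Click rate of 4.2%", source_id="row-99"),    # source ID doesn't exist
]

bad = untraceable_claims(report, rows)
print([c.text for c in bad])
print(may_publish(report, rows))
```

Note that this only proves the row exists. Checking that the number in the claim matches the number in the row is a second, stricter pass.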

Failure mode 4: Schedule drift

What happens. The agent was supposed to run at 06:00 ET. It started running at 06:14, then 06:31, then 07:02, then not at all. The schedule slowly slipped off the calendar.

Real incident. The Distributor was wired with a flexible cron expression that allowed the runner to delay if other tasks were active. Over four weeks, the actual run time drifted from 06:00 to 07:15. The post-publish syndication started going out after US East Coast readers had already moved on to their morning meetings. The fix was rewiring the schedule with strict start times and a separate health-check that alerts if the actual start drifts more than 5 minutes from the scheduled start.

Detection pattern. Log the scheduled time AND the actual start time on every run. A simple daily check flags any agent whose drift exceeds the threshold for that agent. Some agents don't care about exact timing (an internal cleanup job). Some care to the minute (a daily content publish, an outreach send).

One-line check. if (delta := abs(actual_start_ts - scheduled_ts)) > drift_threshold[agent]: incident("schedule_drift", agent, delta)
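The daily check might look like the sketch below. The run-record shape, agent names, and dates are assumptions; the 5-minute Distributor threshold comes from the incident above, and the cleanup threshold is invented. The sketch also flags a run that never started, since "not at all" is where drift ends up.

```python
from datetime import datetime, timedelta

# Per-agent tolerance. Only the Distributor value comes from the post.
THRESHOLDS = {
    "distributor": timedelta(minutes=5),
    "cleanup": timedelta(hours=2),
}

def drift_incidents(runs: list[dict], thresholds: dict) -> list[tuple[str, str]]:
    """Flag runs that started too far from schedule, or never started."""
    incidents = []
    for run in runs:
        if run["actual_start"] is None:
            incidents.append((run["agent"], "missed_run"))
            continue
        delta = abs(run["actual_start"] - run["scheduled"])
        if delta > thresholds[run["agent"]]:
            minutes = int(delta.total_seconds() // 60)
            incidents.append((run["agent"], f"drift_{minutes}m"))
    return incidents

# Hypothetical run records.
runs = [
    {"agent": "distributor", "scheduled": datetime(2025, 1, 6, 6, 0), "actual_start": datetime(2025, 1, 6, 7, 15)},
    {"agent": "cleanup", "scheduled": datetime(2025, 1, 6, 2, 0), "actual_start": datetime(2025, 1, 6, 2, 40)},
    {"agent": "distributor", "scheduled": datetime(2025, 1, 7, 6, 0), "actual_start": None},
]

found = drift_incidents(runs, THRESHOLDS)
print(found)
```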


Want this instrumented for your agents this week? Our blueprint catalog includes a Verification Queue and an Incident Harness designed to catch exactly these failure modes. Single email, single payment, delivered as productized work. Or email christine@operatoriq.io and tell me what you're running. Email only, no calls.


Failure mode 5: Authority creep

What happens. The agent does something it wasn't supposed to be allowed to do. Sends an email to a do-not-contact list. Publishes a post in a channel it wasn't authorized for. Triggers a paid action when it was only supposed to suggest one.

Real incident. Our Outreach Closer once tried to email a contact that the Lead Sourcer had marked as do-not-contact two weeks earlier. The Closer's local copy of the contact list hadn't been refreshed since the mark. The send was caught at the permission check layer (the verification step ran a fresh do-not-contact lookup against the source of truth and rejected the action). Drift logged, no email sent.

Detection pattern. Every agent action runs through an authority envelope check before execution. The envelope check does a fresh read against the source of truth at the moment of the action, not against the agent's cached copy. If the check rejects the action, log a drift event with what the agent tried to do and what got blocked.

One-line check. if not envelope.permits(action, fresh_state()): block_and_log_drift(agent, action)
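Here is a sketch of the fresh-read idea, using the do-not-contact case from the incident. Every name is hypothetical, and the dictionary stands in for whatever system actually holds the list. What matters is that the guard calls the lookup at send time instead of holding a snapshot.

```python
from typing import Callable

def make_send_guard(fetch_do_not_contact: Callable[[], set[str]]) -> Callable[[str], bool]:
    """Build a permit check that re-reads the do-not-contact list on every call."""
    def permits(recipient: str) -> bool:
        return recipient not in fetch_do_not_contact()  # fresh read, no caching
    return permits

# Stand-in for the source of truth (hypothetical; a real one would be a DB or API).
source_of_truth = {"do_not_contact": set()}

# The agent snapshots the list, then the list changes underneath it.
agent_cached_copy = set(source_of_truth["do_not_contact"])
source_of_truth["do_not_contact"].add("pat@example.com")

permits = make_send_guard(lambda: source_of_truth["do_not_contact"])
drift_log: list[dict] = []

def guarded_send(agent: str, recipient: str) -> bool:
    """Return True if the send may proceed; otherwise block it and log drift."""
    if not permits(recipient):
        drift_log.append({"agent": agent, "action": "send_email", "recipient": recipient})
        return False
    return True

print("pat@example.com" in agent_cached_copy)             # the stale copy would allow it
print(guarded_send("outreach_closer", "pat@example.com"))  # the fresh check blocks it
```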

Failure mode 6: Citation hallucination

What happens. The agent cites a source that doesn't exist. A research paper that was never published. A Wikipedia article with the wrong URL. A customer quote that was never said. A statistic that was never published anywhere.

Real incident. An early version of the Blog Writer cited "a 2024 Stanford study" that didn't exist. The post made it to draft. The QA agent caught it on the citation verification pass. The Blog Writer now operates under a hard rule: every citation must be either a URL that resolves to a page containing the claim, or an internal source from our own runs.jsonl. If a citation can't be verified, the claim gets cut.

Detection pattern. Run every cited source through a real fetch. If the fetch fails (a 404, any other error status, a timeout) or the page doesn't contain the claim, the citation is invalid. This is more expensive than the other checks because it requires actual outbound HTTP, but it's necessary if you're publishing externally.

One-line check. for cite in output.citations: fetch(cite.url).contains(cite.claim) or incident("citation_hallucination", agent, cite)
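A sketch of the fetch-and-check pass using only the standard library. The containment test is a whitespace-normalized, case-insensitive substring match. That is an assumption on my part and a crude one, since it misses paraphrased claims, but it does catch sources that don't exist. The fetch function is injectable so the demo runs without network access; the URLs and page text are made up.

```python
import urllib.request
from typing import Callable, Optional

def fetch_text(url: str, timeout: float = 10.0) -> Optional[str]:
    """Fetch a page body, or return None on any HTTP, network, or URL error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):  # HTTPError and URLError are OSError subclasses
        return None

def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't matter."""
    return " ".join(text.lower().split())

def citation_is_valid(
    url: str,
    claim: str,
    fetch: Callable[[str], Optional[str]] = fetch_text,
) -> bool:
    """A citation is valid only if the page loads and contains the claim text."""
    body = fetch(url)
    if body is None:
        return False
    return _normalize(claim) in _normalize(body)

# Demo with a fake fetcher so no network call is made.
FAKE_PAGES = {"https://example.com/report": "Open   rates rose to 38% in Q2."}

def fake_fetch(url: str) -> Optional[str]:
    return FAKE_PAGES.get(url)

print(citation_is_valid("https://example.com/report", "open rates rose to 38%", fetch=fake_fetch))
print(citation_is_valid("https://example.com/missing", "a 2024 Stanford study", fetch=fake_fetch))
```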

Failure mode 7: Context-window amnesia

What happens. The agent forgets something it was told earlier in the session because the conversation exceeded its context window. The forgotten piece is usually a constraint (don't email this customer, don't publish this draft) and the agent then does the thing it was told not to do.

Real incident. A long-running planning session with our Executive agent had a constraint set early in the conversation ("we are not pursuing partnerships this quarter; cut that from any plan"). Six thousand tokens later, the agent proposed a plan that included a partnerships push. The constraint had aged out of the window. The fix was pinning constraints to a persistent state file that the agent reads at the start of every turn, instead of relying on conversation history.

Detection pattern. Constraints that matter live in a persistent state file that the agent loads explicitly. Don't rely on conversation history to carry constraints across long sessions. A separate check reads the agent's proposed actions against the constraint set and rejects any action that violates an active constraint.

One-line check. for action in proposed_actions: violates_constraint(action, persistent_constraints) and incident("amnesia", agent, action)
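A sketch of the pinned-constraints pattern, assuming the state file is a JSON list and each constraint carries a banned term. Keyword matching is a deliberately simple stand-in for whatever violation test you actually use, and the file layout and field names are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def load_constraints(path: Path) -> list[dict]:
    """Read the constraint set from disk; call this at the start of every turn."""
    return json.loads(path.read_text(encoding="utf-8"))

def violations(proposed_actions: list[str], constraints: list[dict]) -> list[tuple[str, str]]:
    """Return (action, constraint_id) pairs for every active constraint an action breaks."""
    hits = []
    for action in proposed_actions:
        for constraint in constraints:
            if constraint["active"] and constraint["banned_term"].lower() in action.lower():
                hits.append((action, constraint["id"]))
    return hits

with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "constraints.json"
    # Written once when the constraint is set; it never depends on chat history.
    state_file.write_text(json.dumps([
        {"id": "no-partnerships", "banned_term": "partnership", "active": True},
    ]), encoding="utf-8")

    plan = [
        "Launch a partnerships push with two vendors",
        "Publish the weekly outreach report",
    ]
    found = violations(plan, load_constraints(state_file))

print(found)
```

Because the check reads the file and not the conversation, it works the same on turn 3 and turn 300.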

How to roll out these checks

You don't need to instrument all seven on day one. Start with these in order.

Day 1. Silent green. Log byte counts and artifact pointers on every success. Run a daily scan for empty outputs.

Day 2. Mocked work. Verify against the real destination on every claimed send. Don't trust the agent's local "sent" log.

Day 3. Authority creep. Wire an envelope check in front of every external action. Log every blocked attempt.

Week 2. Fabrication and citation hallucination. Require traceable sources on every output that includes claims, numbers, or quotes.

Week 3. Schedule drift and amnesia. Add timing telemetry and pin constraints to a persistent state file.

By week four you have a system that catches the failure modes that bring down most production agentic deployments. The total instrumentation is under 200 lines of code. The peace of mind is the difference between a system you trust and a system you have to babysit.

The metric to watch

The single metric that summarizes all of this is verification rate, the percentage of agent claims that pass an independent check. We covered the formula and the implementation in the new metrics post. Track verification rate per agent, watch it daily, and when it drops, walk it back to which of these seven modes caused the drop.
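For readers who haven't seen that post, a minimal sketch based only on the definition above (passed checks divided by total checks, as a percentage, per agent). The input shape is an assumption, and the full formula in the metrics post may differ.

```python
from collections import defaultdict

def verification_rate(checks: list[tuple[str, bool]]) -> dict[str, float]:
    """Percentage of each agent's claims that passed an independent check."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for agent, ok in checks:
        total[agent] += 1
        if ok:
            passed[agent] += 1
    return {agent: 100.0 * passed[agent] / total[agent] for agent in total}

# Hypothetical check results: (agent, did the independent check pass?)
checks = [
    ("outreach_closer", True),
    ("outreach_closer", True),
    ("outreach_closer", False),
    ("outreach_closer", True),
    ("analyst", True),
    ("analyst", True),
]

rates = verification_rate(checks)
print(rates)
```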

What's coming next

Tomorrow's post is about pricing models when your work is autonomous: how to charge for output an agent produces, where flat productized works, where per-outcome works, and where you'll get burned. Read it alongside the cornerstone definition of an agentic-AI-first business and you'll have the full operating picture: org design, dashboard, failure modes, and pricing.


Want this catalog of checks running against your agents in days? The blueprint catalog includes the verification harness as a productized offer. Single email, single payment, delivered fast. Or email christine@operatoriq.io with the agent you're worried about. Email only, no calls.