Your general counsel calls on a Thursday afternoon.
An AI agent in your loan underwriting pipeline approved 340 applications over the prior weekend using parameters nobody signed off on. The model didn't crash. No alerts fired. Your observability dashboard showed healthy throughput and P99 latency under 200ms.
What it didn't show: the agent delegated authority it was never granted, called an external pricing API not in the approved toolset, and left no record of why it made that call.
You have metrics. You don't have proof.
This is a composite scenario based on failure patterns emerging across financial services AI deployments, not a specific institution. The dynamics it describes are real; the specifics are illustrative.
Here's what makes this scenario uncomfortable: most organizations in this position already have compliance. SOC 2 passed last quarter. GDPR signed off. Incident response playbook updated. None of it was built for an entity that reasons, delegates, and acts faster than any human reviewer can follow. SOC 2 verifies that your controls are in place. It doesn't verify that those controls cover autonomous systems making decisions under delegated authority. Those are different questions, and most governance frameworks only answer the first one.
Your governance model was designed for humans. Agents break that model.
That's the gap. According to Grant Thornton's 2026 AI Impact Survey of 950 senior business leaders across 10 industries, 78% of organizations can't close it. When Grant Thornton asked whether leaders could confidently confirm their organization would pass an independent AI governance audit within 90 days, more than three quarters said they lacked full confidence.
Three independent signals converge on this problem right now. Each one, taken alone, is a product limitation. Together, they describe a governance crisis building quietly beneath the AI agent investment boom.
Why Observability Became the Default Standard
Observability wasn't wrong. It was built for a different era.
When applications were deterministic (input A always produced output B), logging gave you meaningful forensic data. An HTTP 200 meant the call succeeded. A stack trace told you what broke. That was enough.
The tooling got sophisticated: OpenTelemetry, distributed tracing, SIEM pipelines. A generation of infrastructure engineers learned to treat "observable" as synonymous with "governed."
Then agents arrived.
An agent doesn't follow a predictable path. It reasons. It delegates. It calls tools that weren't anticipated at design time. The same prompt, run twice, may produce two different action sequences. Standard logs capture neither the reasoning nor the delegation context, just the side effects.
As the LoginRadius engineering team put it: "While observability tells you that an API call was made, auditing provides the 'Chain of Thought' and the delegation metadata required to prove that the call was authorized, safe, and aligned with human intent."
That's not a subtle distinction. It's the difference between a log and evidence.
Signal One: The Governance Numbers Are Worse Than the Headlines
The 78% headline from Grant Thornton is striking enough. The deployment gap beneath it is worse.
Nearly three in four organizations are already piloting, scaling, or running autonomous AI, but only one in five has tested a response plan for AI failures. That asymmetry is not a planning oversight. It's a structural problem: the teams shipping agents and the teams designing governance are operating on different timelines, and the agents are winning.
Think about what that means during an incident. An agent made a bad call. You have logs scattered across three cloud providers, a local checkpoint file, and a SIEM that ingested some fraction of events. None of these sources has the delegation context: which human authorized which agent, with what scope, at what point in time.
Your auditor is asking for a chain of proof. You have a chain of circumstantial evidence.
Grant Thornton's survey also found that 46% of leaders say AI underperforms specifically because controls and compliance aren't working, not because the models are bad. The governance gap isn't downstream of the technology gap. For most organizations, it is the technology gap. And 43% list regulatory and compliance uncertainty as a top concern for agentic AI. That number is going up, not down.
Signal Two: What AWS AgentCore Shows, and What It Doesn't
AWS AgentCore deserves credit for moving the conversation forward.
Its policy engine evaluates tool calls against defined policies and logs those decisions: what was permitted, what was blocked, and why. CloudTrail captures registry access and administrative actions, providing an audit trail of management-plane operations. The Agent Registry (in preview as of April 2026) offers centralized discovery and governance across deployed agents.
This is materially better than nothing.
What AgentCore doesn't provide is independent cryptographic attestation. AWS does offer options to strengthen log integrity: CloudTrail with S3 Object Lock creates tamper-evident records, and external SIEM integrations can add another verification layer. For many internal audit purposes, that configuration is sufficient.
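For reference, here is a minimal sketch of that hardening step, assuming boto3; the bucket name, key, and retention period are illustrative, and region configuration is omitted:

```python
import boto3
from datetime import datetime, timedelta, timezone

s3 = boto3.client("s3")

# Create a bucket with S3 Object Lock enabled. (Bucket name is illustrative.)
s3.create_bucket(
    Bucket="agent-audit-trail",
    ObjectLockEnabledForBucket=True,
)

# Write a log object under COMPLIANCE mode: it cannot be deleted or
# overwritten, even by an administrator, until the retain-until date passes.
s3.put_object(
    Bucket="agent-audit-trail",
    Key="cloudtrail/2026/04/events-batch-0001.json.gz",
    Body=b"...",  # the log payload
    ObjectLockMode="COMPLIANCE",
    ObjectLockRetainUntilDate=datetime.now(timezone.utc) + timedelta(days=365),
)
```

Object Lock strengthens integrity within AWS's trust boundary. The next paragraph is about what happens when the auditor sits outside it.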
The limitation shows up in externally scrutinized audits. Proving that a log is unmodified, without AWS serving as both the log source and the verifying party, is structurally difficult when one provider runs the whole architecture. For a regulatory audit in FinTech, healthcare, or any EU-regulated context where the auditor has no trust relationship with your cloud provider, "trust our logs" remains a dependency that a prepared auditor will probe.
The architecture that produces the log and the architecture that verifies it are still the same architecture. For internal governance, that's fine. For external accountability, where the whole point is to prove something independently, it creates the same conflict of interest that makes financial auditors structurally independent from the firms they audit.
Signal Three: Your LangGraph Checkpoints Are Not Your Audit Trail
LangGraph is one of the most widely used agent orchestration frameworks in production. Its checkpointing system is genuinely useful: it snapshots graph state at every execution step, enables time-travel debugging, and makes fault-tolerant execution practical.
The default checkpointer is SQLite.
LangGraph's own documentation describes the SQLite implementation as "ideal for experimentation and local workflows." That's accurate. It's also a description of why you can't use it as a governance artifact.
SQLite checkpoints are local. They're not tamper-evident. They don't capture delegation metadata: who authorized the session, what scopes were granted, whether a human override occurred. They're not designed for multi-year retention. And the format changes between LangGraph versions, which means a checkpoint from three months ago may not be queryable today without migration work.
None of this is a criticism of LangGraph. Checkpointing for state persistence and auditing for governance accountability are different problems that need different architectures. The mistake, one most teams building on LangGraph are currently making, is treating one as a substitute for the other.
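To make the separation concrete, here is a minimal sketch, assuming the langgraph-checkpoint-sqlite package and a hypothetical emit_audit_event helper: checkpointing persists graph state for recovery, while governance metadata is routed to a separate, append-only sink.

```python
import sqlite3
from datetime import datetime, timezone
from langgraph.checkpoint.sqlite import SqliteSaver  # langgraph-checkpoint-sqlite

# Checkpointing: state persistence for debugging and fault tolerance.
# This is the layer LangGraph's docs describe as "ideal for
# experimentation and local workflows" -- not a governance artifact.
checkpointer = SqliteSaver(sqlite3.connect("checkpoints.db"))
# graph = builder.compile(checkpointer=checkpointer)

def emit_audit_event(sink, *, agent_id: str, parent_identity: str,
                     scopes: list[str], tool: str, decision: str) -> None:
    """Hypothetical helper: write the governance metadata the
    checkpointer never sees to a separate, tamper-evident sink."""
    sink.append({
        "ts": datetime.now(timezone.utc).isoformat(),
        "agent_id": agent_id,                 # versioned agent identity
        "parent_identity": parent_identity,   # authorizing human or system
        "delegation_scopes": scopes,          # what the agent may do
        "tool": tool,
        "policy_decision": decision,          # permit/deny and the basis
    })
```

The point is the split, not the specific sink: the checkpointer never sees the delegation fields, so no amount of checkpoint retention produces them.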
What Auditors Actually Ask For
Pull up the NIST AI Risk Management Framework. In the "Govern" and "Map" functions, the requirement isn't a log of what happened; it's traceability: the ability to reconstruct, in sequence, the decisions that led to an AI-driven outcome, with enough metadata to determine whether those decisions were authorized and aligned with policy.
ISO/IEC 42001, the first international management system standard for AI, requires "appropriate records" of AI system performance and decision-making, with non-repudiation as a specific property. Not just retention. Cryptographic proof that the record hasn't been tampered with.
OWASP's Top 10 for LLM Applications identifies "Excessive Agency" as a critical vulnerability: agents granted too much power, or operating without sufficient oversight. The primary control against Excessive Agency is a logging architecture that captures policy decisions and tool-call context in real time, not reconstructed after the fact.
What ties these together: a verifiable reasoning trace (some documented justification for why the agent took each action), delegation metadata, and proof of authorization.
None of these appear natively in a standard observability stack.
The Math Nobody Is Running
Take a hypothetical example to make the scale concrete. A financial services team running 50 AI agents across loan processing, fraud detection, and customer communications, where each agent handles roughly 2,000 decisions per day, produces 100,000 autonomous decisions daily. These assumptions are illustrative (scale them to your actual deployment), but the audit math holds at any comparable volume.
A governance audit with a 90-day scope across that hypothetical team covers 9,000,000 decisions. Each decision needs to produce the following fields, sketched as a record type after the list:
- The agent identity and version at decision time
- The parent human identity or upstream system that authorized the session
- The delegation scope granted (what the agent was permitted to do)
- The tool call log with hashed parameters
- The policy decision (permit or deny, and the basis for it)
- The reasoning trace that justified the action
- A tamper-evident timestamp establishing order of operations
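One way to test whether your pipeline can produce all seven is to write them down as a single record type and try to populate it from your current sources. A minimal sketch; the field names are illustrative, not a prescribed schema:

```python
import hashlib
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentDecisionRecord:
    agent_id: str                 # agent identity and version at decision time
    parent_identity: str          # human or upstream system that authorized the session
    delegation_scope: list[str]   # what the agent was permitted to do
    tool_name: str
    tool_params_hash: str         # hash of call parameters, not the raw values
    policy_decision: str          # "permit" or "deny", plus the basis
    reasoning_trace: str          # documented justification for the action
    timestamp: str                # ordering; tamper-evidence comes from the store

def hash_params(params: dict) -> str:
    """Hash tool-call parameters so the record proves what was sent
    without retaining sensitive payloads in the audit trail."""
    return hashlib.sha256(
        json.dumps(params, sort_keys=True).encode()
    ).hexdigest()
```

If any of these fields can only be filled with null from your current sources, you've found the gap before the auditor does.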
Right now, most teams can surface about half of these fields from fragmented sources not designed to interoperate. The fields most often missing (delegation scope, parent identity, policy decision, reasoning trace) are exactly the fields auditors weight most heavily.
Regulations don't prescribe exact fields or architectures. But under the EU AI Act's high-risk system traceability requirements, under GDPR's accountability principle for automated decision-making, under NIST AI RMF's "Govern" function, the missing fields are exactly what auditors and regulators increasingly scrutinize. They're not cosmetic. They're the substance of the accountability question.
What a Defensible Audit Trail Actually Requires
Three properties. All three matter.
1. Immutability.
The record can't be modified after the fact: not by the agent, not by the infrastructure operator, not by an administrator with root access. This requires writing to storage that enforces immutability at the protocol layer, not just a retention policy in a mutable system.
2. Delegation lineage.
Every agent action must trace back to an authorizing human or system, with the scope of that authorization documented. If an agent calls a sub-agent, the full delegation chain must be preserved: who authorized whom to do what, in what order.
3. Reasoning trace or decision justification.
Some form of documented justification for each tool call must be logged alongside the call itself. That doesn't mean raw LLM chain-of-thought output, which carries real privacy and IP considerations; it could be a structured decision log, a summarized intent, or the output of a policy evaluation. The form depends on your risk profile and regulatory context. Without some version of this, you can prove the action occurred. You can't demonstrate whether the agent was operating within its intended boundaries when it took it.
These three properties are not native to any observability stack. They require a purpose-built audit architecture, one where the storage layer is designed to be tamper-evident, not just queryable.
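To illustrate what tamper-evident means mechanically, here is a minimal hash-chain sketch. This is a teaching example, not Akave's ledger design: each entry commits to its predecessor, so editing any past record breaks every hash that follows.

```python
import hashlib
import json

def append_entry(chain: list[dict], record: dict) -> None:
    """Append a record whose hash commits to the previous entry."""
    prev_hash = chain[-1]["entry_hash"] if chain else "0" * 64
    payload = json.dumps(record, sort_keys=True)
    entry_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    chain.append({"prev_hash": prev_hash, "record": record,
                  "entry_hash": entry_hash})

def verify_chain(chain: list[dict]) -> bool:
    """Recompute every link; any edit to a past record breaks the chain."""
    prev_hash = "0" * 64
    for entry in chain:
        payload = json.dumps(entry["record"], sort_keys=True)
        expected = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
        if entry["prev_hash"] != prev_hash or entry["entry_hash"] != expected:
            return False
        prev_hash = entry["entry_hash"]
    return True
```

A real deployment also has to anchor the chain head somewhere the operator can't rewrite, which is exactly where protocol-layer immutability comes in.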
How Akave Approaches Agent Audit Infrastructure
Akave Cloud is built on an immutable storage ledger, which means immutability is enforced at the protocol layer, not the policy layer. When an audit event is written to Akave, the cryptographic proof of that event is embedded in the ledger. No administrator, including Akave, can modify it after the fact.
This matters for the specific problem we're describing. An audit trail stored in a mutable cloud object store relies on the trustworthiness of the operator. An audit trail stored with Akave carries a verifiable proof, one that an external auditor with no relationship to Akave or your organization can use to independently confirm the record hasn't been altered.
For teams building with LangGraph, AutoGPT, or custom orchestration frameworks, Akave provides S3-compatible endpoints. The integration path is straightforward: route audit events to Akave-backed storage instead of local SQLite or a standard S3 bucket. The application layer doesn't change. The evidentiary value of the output does.
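Because the endpoints are S3-compatible, the routing change is client configuration rather than application logic. A minimal sketch with boto3; the endpoint URL, credentials, and bucket name below are placeholders, not real connection details (see docs.akave.xyz for those):

```python
import boto3

# Point the standard S3 client at an S3-compatible endpoint.
# The endpoint URL and credentials here are placeholders.
audit_store = boto3.client(
    "s3",
    endpoint_url="https://<your-akave-endpoint>",
    aws_access_key_id="<access-key>",
    aws_secret_access_key="<secret-key>",
)

# The application layer is unchanged: the same put_object call that
# wrote to a standard bucket now lands on tamper-evident storage.
audit_store.put_object(
    Bucket="agent-audit-events",
    Key="sessions/2026-04-17/decision-000123.json",
    Body=b'{"agent_id": "underwriter-v3", "policy_decision": "permit"}',
)
```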
For teams on AWS AgentCore, Akave serves as the external, cryptographically verifiable store for the delegation metadata and reasoning traces that AgentCore's CloudTrail logs don't capture. The two systems complement each other: AgentCore handles policy enforcement and operational observability; Akave handles the long-term, auditor-facing proof layer.
When Akave Fits, and When It Doesn't
Full, externally-verifiable auditability has real costs. Storing reasoning traces for millions of agent decisions daily generates meaningful data volume. Writing to an immutable storage ledger adds an infrastructure layer. The right question isn't "should we audit everything?" It's "which agent decisions carry enough regulatory or operational risk to require independent proof?"
Where external auditability matters most:
- Financial services agents making credit, pricing, or fraud decisions
- Healthcare AI with patient-facing outputs or clinical decision support
- Any cross-organizational agent workflow where you don't control all parties in the audit chain
- Regulated deployments under the EU AI Act's high-risk system classification (effective August 2, 2026)
- Multi-agent systems where sub-agent authorization chains need to be verified independently
Where your existing stack is probably sufficient:
- Internal copilots and productivity agents with no regulated outputs
- Low-stakes automation where the primary audit requirement is internal operational review
- Experimental or development deployments where iteration speed matters more than forensic retention
- Single-tenant environments where your auditor trusts your control plane and isn't requiring external verification
Akave's verifiable storage architecture is the right fit when auditability needs to survive outside your own infrastructure's trust boundary: regulatory audits, cross-organizational compliance reviews, long-term forensic retention. If you need an external party to confirm the integrity of a log without trusting your control plane, that's the problem Akave is designed to solve. If your requirements are entirely internal, a well-configured observability stack with CloudTrail and Object Lock may be all you need, and that's a legitimate answer for many workloads.
Looking Ahead
Grant Thornton's finding that 43% of organizations list regulatory and compliance uncertainty as a top concern for agentic AI is going to look like an underestimate in six months.
The EU AI Act's high-risk system provisions take full effect on August 2, 2026. SEC AI disclosure is drawing increasing regulatory attention as the Commission's posture toward autonomous systems develops. And enterprise procurement teams are beginning to require AI governance attestation as a vendor contract condition, following the same trajectory SOC 2 traced over the prior decade.
The organizations ahead of this curve share one characteristic: they treated audit infrastructure as a design requirement, not a retrofit. They didn't wait for an incident to discover their observability stack couldn't answer the questions an auditor was asking.
The difference between those organizations and the 78% isn't technical sophistication. It's the decision, made early, that "we can explain what our agents did" is not the same as "we can prove it."
That gap closes from the architecture up. Not from the governance deck down.
FAQ
What is the difference between AI observability and AI auditing?
Observability captures system state (latency, error rates, API call logs) to support debugging and performance monitoring. AI auditing captures decision accountability: which agent took which action, under whose authorization, with what justification, preserved in a tamper-evident record. Observability answers "what happened." Auditing answers "can we prove it was authorized and safe." For regulated deployments, only the second question matters to an external auditor or regulator.
Why can't I use LangGraph checkpoints as my audit trail?
LangGraph checkpoints are designed for state persistence and fault tolerance, not governance accountability. The default SQLite implementation is optimized for local development and experimentation. Checkpoints don't capture delegation metadata, aren't tamper-evident, and aren't designed for multi-year forensic retrieval. They're a debugging tool, not a compliance artifact, and LangGraph's own documentation describes them that way.
Does AWS AgentCore provide a complete audit trail for agentic AI?
AgentCore's policy engine and CloudTrail integration provide a meaningful audit foundation, and the April 2026 Agent Registry preview adds centralized governance. But the audit trail lives within AWS's control plane; there's no independent cryptographic attestation that the log hasn't been modified, which creates a "trust the operator" dependency that external auditors in regulated industries are trained to challenge. Akave can serve as the external, verifiable layer that AgentCore's architecture doesn't provide on its own.
How does Akave's immutability work for AI agent audit logs?
Akave stores data on its own immutable storage ledger, where the cryptographic proof of each stored object is embedded in the chain state at write time. Once written, that proof can be independently verified by any party without trusting Akave or your organization. For audits, this means an auditor can confirm the log you're presenting hasn't been modified since it was written, without needing a trust relationship with your infrastructure provider.
What does a defensible AI agent audit trail need to include?
Three things: immutability (the record can't be changed after the fact), delegation lineage (which human or system authorized each agent action, with what scope, in what sequence), and a reasoning trace or decision justification (some documented basis for each tool call, not just the call itself). The exact form of that third element (raw reasoning output, structured decision log, policy evaluation result) depends on your risk profile and privacy requirements. Missing any of the three creates a gap a prepared auditor will find.
How long should AI agent audit logs be retained?
Requirements vary. SOC 2 generally requires one year. HIPAA requires six. GDPR doesn't set a fixed period but requires records available on request during investigations. The EU AI Act requires providers of high-risk AI systems to retain technical documentation for 10 years after placing a system on the market (Article 18), though log retention periods for specific system types vary, check your system's classification against the Act's Annex III. Design your retention architecture around the most demanding regulatory context you operate in, not the most convenient one.
If we're already passing SOC 2 audits, why do we need a separate AI audit trail?
SOC 2 verifies that your controls are in place. AI governance audits verify that your agents operated within their authorized boundaries. These are different questions. Your existing security controls weren't designed to capture delegation metadata for non-human identities, or to preserve the reasoning traces that justify autonomous decisions. A clean SOC 2 doesn't address the evidentiary requirements for agentic AI under NIST AI RMF, ISO/IEC 42001, or the EU AI Act.
What's the practical first step for a team with no audit trail today?
Start with identity. Every agent should have a distinct, versioned identity with lifecycle governance, not a shared service account. Log every session start with the parent authorizing identity, the delegation scope granted, and the timestamp. That alone closes the most common audit gap: the inability to attribute autonomous actions to a specific authorizing human. From there, layer in reasoning traces and tool-call logs. Akave's S3-compatible endpoints mean you can route these events there from day one without changing your application layer.
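A minimal sketch of what that session-start record might look like; the field names are illustrative, not a prescribed schema:

```python
from datetime import datetime, timezone

# Session-start event: the single record that closes the most common
# audit gap, attributing autonomous actions to an authorizing human.
session_start = {
    "event": "agent_session_start",
    "agent_id": "loan-underwriter",         # distinct, non-shared identity
    "agent_version": "3.2.1",               # versioned at session time
    "parent_identity": "jdoe@example.com",  # authorizing human
    "delegation_scopes": ["read:applications", "call:pricing-api"],
    "ts": datetime.now(timezone.utc).isoformat(),
}
```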
Start Building a Defensible Audit Trail
Run your architecture against the three-property checklist above (immutability, delegation lineage, reasoning trace) and identify the first gap.
Review Akave's storage architecture at akave.com/product.
Start a free trial at akave.com/free-trial or explore the S3 integration docs at docs.akave.xyz.

