At NVIDIA GTC 2026, Fortanix demonstrated exactly what a hardware-enforced AI factory looks like. Confidential pipeline. Composite attestation. HSM-gated key release. Hopper and Blackwell GPUs running inside trusted execution environments where even the infrastructure operator can't see what's being processed. TELUS and Fortanix announced they'd extended this to Canadian data sovereignty: cryptographic proof that data never leaves Canadian jurisdiction during AI training and inference. NTT DATA brought the same architecture to India for DPDP Act compliance.
It's genuinely impressive infrastructure. And it solves exactly what it says it solves: securing the compute layer.
Now ask a different question. Ask what happens to your training dataset before it enters the enclave. Ask where the proof lives that the data your model trained on is the same data your data governance team approved, not a modified version, not a swapped version, not a version someone touched between your compliance review and the moment it crossed into the TEE.
That question doesn't have an answer in the Fortanix architecture. Not because Fortanix made a mistake. Because it's a different layer of the problem, and right now, almost nobody is building it.
What Confidential Computing Actually Guarantees, and What It Doesn't
Confidential computing is a hardware-level technique that keeps data encrypted and isolated while it is being processed, not just at rest or in transit. A trusted execution environment (TEE) creates a hardware-enforced boundary around a computation: the enclave runs on the CPU or GPU, the host operating system cannot read its memory, and remote attestation produces cryptographic evidence that the computation ran in verified hardware with verified software.
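To make the attestation idea concrete, here is a minimal sketch of the core check in Python. It is illustrative only: the function names are hypothetical, and a real verifier validates a vendor-signed quote from the hardware (Intel, AMD, or NVIDIA attestation services) rather than hashing an image locally.

```python
import hashlib
import hmac

def measure(image_path: str) -> str:
    """Hash the enclave software image; this stands in for the launch
    measurement a TEE computes over the code it loads."""
    h = hashlib.sha384()  # TEE measurement registers commonly use SHA-384
    with open(image_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_attestation(reported_measurement: str, approved_image: str) -> bool:
    """Does the measurement the enclave reported match the software stack
    we approved? A production verifier would also check the hardware
    vendor's signature over the quote; that step is elided here."""
    return hmac.compare_digest(reported_measurement, measure(approved_image))
```

Notice what the comparison covers: the code that is running, not the data it runs on.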
That attestation chain is a meaningful guarantee. It means a cloud provider, a neocloud operator, even a nation-state with physical access to the server cannot extract what the model is processing during inference. For regulated industries (healthcare, financial services, defense contractors), it addresses a real and specific problem.
What confidential computing does not guarantee: that the data entering the enclave was authoritative, unmodified, and provenance-verified at the time of ingestion. The TEE attests that the computation happened correctly. It does not attest to the integrity of its inputs before they arrived.
This is not a limitation unique to Fortanix. It's inherent to the architecture. TEEs are designed to secure in-use computation, not to verify the supply chain of the data feeding that computation. Those are different problems that require different solutions.
Why the Input Boundary Is Where AI Factories Are Actually Vulnerable
The attack surface that confidential computing can't reach is the data pipeline itself: the storage buckets, preprocessing scripts, feature stores, and dataset registries that exist between data collection and model ingestion.
Research from Columbia, NYU, and Washington University found that as few as 50,000 manipulated articles added to a public training dataset were sufficient to corrupt medical LLMs, producing systematically biased outputs that persisted even after retraining on clean data. JFrog's security research team identified approximately 100 malicious models on HuggingFace with embedded code-execution payloads that opened reverse shell connections on load; the models accumulated thousands of downloads before detection.
Neither attack vector is stopped by a TEE. Both attacks happen upstream of the enclave boundary.
A TEE running a poisoned dataset produces a clean attestation. The HSM releases keys because the hardware is legitimate and the software stack is verified. The pipeline runs exactly as designed, faithfully training on data that someone with access to a preprocessing script or a storage bucket modified days or weeks earlier.
This is the AI factory sovereignty gap: you've secured the factory floor, but you haven't secured what gets delivered to the loading dock.
How Akave's Verifiable Storage Layer Closes the Gap
What makes Akave's storage architecture different, and specifically relevant to confidential AI pipelines, is that provenance verification happens at write time, not at audit time.
When a dataset, checkpoint, or feature file is written to Akave Cloud, it receives a content identifier (CID): a cryptographic hash of the object's exact content at the moment of writing. What this means in an AI factory context: every dataset version that a compliance team or data governance function approves gets a CID at the moment of approval. When the TEE is ready to ingest that dataset, a CID comparison between the approved version and the version being loaded answers the question the TEE cannot: was this modified after approval? The CID either matches or it doesn't. The proof doesn't depend on logs, on access records, on trusting the preprocessing pipeline, or on anyone's word.
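As a sketch of what that check looks like in a pipeline, here is a minimal ingestion gate in Python. The function names are hypothetical, and a plain SHA-256 digest stands in for Akave's actual CID format; the shape of the check is the point.

```python
import hashlib

def content_id(path: str) -> str:
    """Stand-in content identifier: a SHA-256 digest over the file's exact
    bytes. Akave assigns real CIDs at write time; the plain hash here is a
    simplification for illustration."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def gate_ingestion(dataset_path: str, approved_cid: str) -> None:
    """Refuse to hand the TEE any dataset whose content no longer matches
    the version recorded at governance sign-off."""
    actual = content_id(dataset_path)
    if actual != approved_cid:
        raise RuntimeError(
            f"Dataset changed after approval: expected {approved_cid}, got {actual}"
        )
    # From here, the dataset can safely cross the enclave boundary.
```

The design point: the gate recomputes the identifier from the bytes themselves, so it holds even if every log and access record upstream has been tampered with.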
This is the data-side complement to Fortanix's compute-side confidentiality. Fortanix proves the computation was clean. Akave proves the inputs were. Together, they close the loop that neither can close alone.
AI Factory Sovereignty Is a Two-Layer Problem, and Akave Addresses the Data Side
The "sovereign AI" framing that Fortanix, NVIDIA, and NTT DATA are building around is the right frame. But sovereignty has two dimensions that the current confidential computing narrative conflates.
1. Compute sovereignty: proving that inference happened in verified hardware within a specified jurisdiction, that the operator couldn't see the data, and that the model wasn't tampered with during execution. Confidential computing with hardware attestation addresses this directly and well.
2. Data sovereignty: proving that the data the model trained on or inferred from was the authoritative version, approved by the right people, geofenced to the right jurisdiction, and unmodified from the moment of governance sign-off to the moment of ingestion. This is what verifiable storage addresses.
The gap matters most in regulated environments, the same environments driving confidential computing adoption. A healthcare organization deploying AI in a TEE needs to demonstrate to auditors not just that the inference was confidential, but that the patient dataset the model was trained on was the same dataset the IRB reviewed. A financial institution running agentic AI inside an enclave needs to prove that the market data it acted on wasn't modified between the data vendor's delivery and the model's consumption. A government agency using confidential AI for sensitive classification tasks needs a data provenance chain that survives any legal challenge to the AI's outputs.
The EU AI Act's provisions on high-risk AI systems are already pointing at this. Demonstrating compliance isn't just about how the model ran; it's about proving what it ran on. We've written in detail about how Akave's cryptographic data provenance maps directly to EU AI Act requirements; the same logic applies to every sovereignty framework being deployed alongside confidential computing.
Looking Ahead
The confidential computing market is maturing fast. Gartner has placed it among the core infrastructure technologies shaping enterprise AI over the next five years. NVIDIA's hardware attestation capabilities are in active deployment across neoclouds and on-prem AI factories. The regulatory frameworks (the EU AI Act, India's DPDP Act, Canada's PIPEDA successor, emerging US federal AI governance) are all converging on the same requirement: demonstrable trust at every layer of the AI supply chain.
Right now, that trust story has a conspicuous gap at the data layer. The narrative is "our AI runs in a verified enclave." The missing chapter is "and the data it ran on was verified before it entered."
The teams that close this loop now, building data provenance into their AI factory architecture before regulators require it, will have an auditable, demonstrable answer when the question arrives. The teams that rely on the compute layer alone will discover that "we had a TEE" is not a complete answer to "show us the data was clean."
Get Started
If you're building AI pipelines on confidential computing infrastructure and want to close the data provenance gap, start with Akave's verifiable storage architecture at akave.com/ai-ml-workloads.
Free trial and S3-compatible integration documentation at akave.com/free-trial and docs.akave.xyz.
Further Reading
- EU AI Act Compliance Made Verifiable: How Akave Cloud Delivers Cryptographic Data Provenance, how Akave's CID architecture maps directly to EU AI Act traceability requirements
- Rethinking Content Addressing: Introducing Akave's eCID, the technical foundation for Akave's encrypted, verifiable content identifiers
- Agent Memory Is Production-Grade. Agent Accountability Isn't., how the same provenance gap affects autonomous AI agents at scale
FAQ
What is the difference between confidential computing and data provenance for AI?
Confidential computing, using trusted execution environments (TEEs), protects data while it is being processed inside the AI model: inference, training computation, and activation states remain encrypted and isolated from the operator. Data provenance addresses what happened to the data before it entered the computation: whether it was the authoritative version, who approved it, when it was last modified, and whether it was tampered with between governance review and model ingestion. Both matter for AI factory sovereignty. Neither replaces the other.
Does a TEE like Fortanix's protect against training data poisoning?
No. A TEE attests that the computation inside the enclave ran correctly on verified hardware with verified software; it does not verify the integrity of the inputs before they cross the enclave boundary. If a training dataset is poisoned or modified upstream of the TEE, the enclave will faithfully process the corrupted data and produce a clean attestation. Protecting against training data poisoning requires provenance verification at the storage layer, before ingestion, which is what Akave's CID-at-write-time architecture provides.
How does Akave's CID-based provenance work in an AI pipeline?
Akave Cloud assigns a content identifier (CID), a cryptographic hash of the object's exact content, to every dataset, checkpoint, or file at the moment it is written. That CID is anchored to an on-chain audit trail on Avalanche's L1 infrastructure. When a governance team approves a dataset version, its CID captures the approved state. Before ingestion into a training or inference pipeline, a CID comparison confirms the file matches the approved version, without relying on application logs or anyone's attestation about what happened in between.
Can Akave integrate with Fortanix's Confidential AI platform?
Akave exposes a fully S3-compatible API, which means it integrates with any pipeline that reads and writes data through configurable S3 endpoints. Practically, Akave handles the data provenance layer upstream of the TEE boundary: CID verification happens before the dataset crosses into the enclave. This does not require changes to the Fortanix architecture; it's a data-layer addition at the ingestion step, as the sketch below shows. We see this as a clear partnership opportunity and are actively engaging with teams building on both platforms.
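As a concrete sketch of that integration step, here is what pointing an existing pipeline at an S3-compatible endpoint looks like with boto3. The endpoint, credentials, bucket, and object key below are placeholders, not real Akave values; see docs.akave.xyz for the actual configuration.

```python
import boto3

# Placeholder endpoint, credentials, bucket, and key: substitute the values
# provisioned for your Akave Cloud account. Any S3-compatible client works.
s3 = boto3.client(
    "s3",
    endpoint_url="https://YOUR_AKAVE_ENDPOINT",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

# Pull the approved dataset version to local disk, then run a CID check
# (like the gate_ingestion sketch earlier) before anything crosses the
# enclave boundary.
s3.download_file("approved-datasets", "train/v7.parquet", "/tmp/train.parquet")
```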
Isn't training data integrity already covered by data versioning tools like DVC or MLflow?
Data versioning tools provide lineage tracking within the application layer: they record what version of a dataset a run used, based on what the pipeline reported. That's genuinely useful for reproducibility. It does not provide an independent, tamper-evident proof that the dataset version used matches the version that was approved for use. A DVC record is written by the application and stored in application infrastructure. Akave's CID is written at the storage layer and anchored on-chain, independent of the application. The distinction is the same as the one between observability and accountability in AI agent systems: one records what the system said it did, the other proves it.
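A minimal sketch of that distinction, with hypothetical file names: the first function trusts a record the application wrote; the second recomputes the proof from the stored bytes themselves.

```python
import hashlib
import json

def reported_version(lineage_record: str) -> str:
    """What the pipeline said it used: a hash read from an application-
    written record (tools like DVC and MLflow keep comparable records)."""
    with open(lineage_record) as f:
        return json.load(f)["dataset_hash"]

def independent_version(dataset_path: str) -> str:
    """What the stored bytes actually are: a hash recomputed directly,
    with no dependence on anything the application logged."""
    with open(dataset_path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()
```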
If we already have a TEE from Fortanix and HSM-based key management, why is additional storage verification needed?
Because HSM-gated key release verifies that the environment requesting the decryption key is legitimate, meaning the hardware is genuine and the software stack is attested. It does not verify that the data being decrypted is the version your governance function approved. Two things can both be true: the environment is legitimate, and the dataset was modified between governance sign-off and ingestion. Key management and data provenance solve different problems. You need both for a complete AI factory trust model.
What regulatory frameworks require data provenance in addition to confidential compute?
The EU AI Act requires providers of high-risk AI systems to maintain data governance documentation demonstrating that training data was appropriate, traceable, and subject to access controls, which requires provenance evidence beyond compute attestation. India's DPDP Act focuses on data handling and consent, not just processing security. The NIST AI Risk Management Framework emphasizes traceability of data throughout the AI lifecycle. In each case, "we used a TEE" addresses the compute side; proving the data was authoritative and unmodified is a separate, explicitly required demonstration.

