When every customer runs arbitrary CUDA kernels on shared hardware, network isolation is existential. We describe our SPIFFE-based identity layer, eBPF firewall, and runtime threat detection.

Multi-tenant AI infrastructure presents a security challenge that traditional cloud providers have not fully solved: customers run arbitrary code on shared GPUs. Not arbitrary containers or arbitrary web requests — arbitrary CUDA kernels with direct access to GPU memory, DMA engines, and in some cases, NVLink interconnects. A malicious kernel could attempt to read residual data from GPU memory previously used by another tenant, probe the PCIe bus for adjacent devices, or exfiltrate data through side channels that exploit shared cache hierarchies. This is not theoretical — academic researchers have demonstrated GPU memory residue attacks, PCIe bus snooping, and cache-based side channels on shared NVIDIA hardware. In a conventional cloud environment, these risks are mitigated by VM isolation and hypervisor enforcement. In a GPU-native environment where the whole point is direct hardware access, those mitigations do not apply. We had to build a security model from first principles, and that model is zero-trust networking.
Zero-trust networking means that no entity — internal or external — is trusted by default. Every request, every connection, and every data flow must be authenticated, authorized, and encrypted, regardless of network location. In our architecture, this principle is implemented through three layers: identity, enforcement, and detection. The identity layer is built on SPIFFE (Secure Production Identity Framework for Everyone). Every workload in HarchOS — every container, every inference service, every training job — receives a SPIFFE Verifiable Identity Document (SVID) at startup. The SVID encodes the workload's identity (who it is), its capabilities (what it is allowed to do), and its jurisdiction (where its data may flow). Mutual TLS is mandatory for all inter-service communication, with certificate rotation every 24 hours. A workload without a valid SVID cannot establish any network connection — the enforcement layer drops its packets at the host level before they reach the network interface.
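The identity-plus-authorization check described above can be sketched in a few lines. This is a hypothetical in-memory model, not the SPIFFE API: the `Svid` record, the capability names, and the single-region jurisdiction check are all illustrative assumptions standing in for the real SVID and policy machinery.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical view of an SVID as described above: identity, capabilities,
# jurisdiction, and an expiry that enforces the 24-hour rotation window.
@dataclass
class Svid:
    spiffe_id: str          # e.g. "spiffe://harchos.io/ns/prod/sa/inference"
    capabilities: frozenset  # what the workload is allowed to do
    jurisdiction: str        # where its data may flow, e.g. "eu"
    expires_at: datetime

    def is_valid(self, now=None):
        now = now or datetime.now(timezone.utc)
        return now < self.expires_at

def authorize(svid, capability, region, now=None):
    """Admit a request only if the SVID is unexpired, grants the
    capability, and the destination region matches its jurisdiction."""
    return (svid.is_valid(now)
            and capability in svid.capabilities
            and region == svid.jurisdiction)

issued = datetime.now(timezone.utc)
svid = Svid("spiffe://harchos.io/ns/prod/sa/inference",
            frozenset({"infer"}), "eu", issued + timedelta(hours=24))
print(authorize(svid, "infer", "eu"))   # True
print(authorize(svid, "train", "eu"))   # False: capability not granted
```

An expired or missing SVID fails the first clause, which mirrors the host-level behavior: the workload simply cannot connect.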
The enforcement layer uses eBPF (extended Berkeley Packet Filter) programs loaded into the kernel of every host to implement per-connection firewall rules. When a workload attempts to establish a network connection, the eBPF program checks the workload's SVID against the connection's destination, port, and protocol. If the connection is authorized, it proceeds. If not, the packet is dropped and the event is logged. The eBPF approach has three advantages over traditional iptables-based firewalls. First, performance: eBPF programs run in the kernel without context switches, achieving packet filtering at 10+ million packets per second with sub-microsecond overhead. Second, granularity: eBPF can filter on arbitrary packet metadata, including SPIFFE identity, which iptables cannot inspect. Third, dynamism: eBPF programs can be updated without restarting the host or disrupting existing connections, enabling real-time policy changes in response to security events. The policy engine that generates eBPF rules runs as a control plane service, consuming identity and authorization data from the SPIFFE federation and emitting updated eBPF programs within 5 seconds of any policy change.
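The verdict logic an eBPF program would apply per connection can be sketched in user space. This is a Python model of the lookup, not kernel code: in production the policy engine would compile a table like this into eBPF map entries, and the rule tuples and SPIFFE IDs below are invented for illustration.

```python
# Hypothetical allow-list keyed on (source identity, destination prefix,
# port, protocol); the real system would hold this in an eBPF map.
ALLOW = {
    ("spiffe://harchos.io/ns/prod/sa/inference", "10.0.1.", 8443, "tcp"),
    ("spiffe://harchos.io/ns/prod/sa/trainer",   "10.0.2.", 9000, "tcp"),
}

def verdict(spiffe_id, dst_ip, port, proto):
    """Return 'PASS' if some rule covers this connection, else 'DROP'."""
    for sid, prefix, p, pr in ALLOW:
        if (sid == spiffe_id and dst_ip.startswith(prefix)
                and port == p and proto == pr):
            return "PASS"
    return "DROP"  # default-deny: unauthorized packets never leave the host

print(verdict("spiffe://harchos.io/ns/prod/sa/inference",
              "10.0.1.7", 8443, "tcp"))  # PASS
print(verdict("spiffe://harchos.io/ns/prod/sa/inference",
              "10.0.2.7", 9000, "tcp"))  # DROP
```

Default-deny is the important property: any connection not explicitly covered by a rule is dropped and logged, exactly as the paragraph describes.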
GPU memory isolation is the hardest problem, because GPU hardware does not provide the same memory protection guarantees as CPU hardware. When a CUDA kernel runs on a GPU, it can access the entire GPU memory space unless the driver enforces segmentation. NVIDIA's MPS (Multi-Process Service) provides memory isolation between concurrent kernels, but with a 5-15% performance overhead that is unacceptable for our inference workloads. Our solution is a combination of software and operational controls. Software: after every workload completes, the GPU driver performs a verified scrub (writing random data to all GPU memory, then reading it back to confirm the overwrite) before the GPU is reassigned to a new workload. This takes approximately 800ms for an 80GB A100, which is amortized into the scheduling overhead. Operational: workloads with different security classifications (for example, financial data versus public web data) are never scheduled on the same physical GPU, even with memory isolation, to eliminate the risk of hardware side-channel attacks. This reduces GPU utilization by approximately 12% compared to unconstrained scheduling, but it eliminates an entire category of cross-tenant data leakage.
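The scrub-then-verify step described above can be illustrated with a small simulation. This is not NVIDIA driver code; it models the idea on a host byte buffer, with the region size and contents invented for the example.

```python
import os

def scrub_and_verify(mem: bytearray) -> bool:
    """Overwrite the region with random data, then read it back to
    confirm the write took before the device is reassigned."""
    pattern = os.urandom(len(mem))  # random overwrite pattern
    mem[:] = pattern                # stand-in for the device-memory write
    return bytes(mem) == pattern    # read-back verification

region = bytearray(b"residual tenant secrets")  # illustrative residue
ok = scrub_and_verify(region)
print(ok)  # True: old contents are gone and the overwrite is confirmed
```

On real hardware the write and read-back traverse the full 80GB of device memory, which is where the roughly 800ms per A100 comes from.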
The detection layer provides runtime threat detection for the attacks that pass through identity and enforcement. We run three detectors. The first is a network anomaly detector that models normal traffic patterns using a variational autoencoder trained on 90 days of historical traffic data. Connections that deviate significantly from the learned distribution — unusual ports, unexpected destinations, atypical data volumes — are flagged for investigation. The second is a GPU behavioral monitor that tracks CUDA API call patterns from each workload. A workload that makes unusual API calls — for example, attempting to map memory allocated by a different process, or probing the PCIe configuration space — is immediately terminated and its SVID is revoked. The third is a data exfiltration detector that monitors outbound network traffic from each workload against the data volume expected for its declared function. A model inference service that suddenly begins uploading gigabytes of data to an external endpoint is almost certainly compromised, and the detector terminates the connection within 500ms of detecting the anomaly.
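The third detector's core logic, comparing observed outbound volume against a declared budget, can be sketched as follows. The workload names, per-window budget, and 10x flagging threshold are illustrative assumptions, not the production tuning.

```python
from collections import defaultdict

DECLARED_BUDGET = {"inference-api": 5_000_000}  # bytes per window (assumed)
THRESHOLD = 10                                  # flag at 10x budget (assumed)

class ExfilDetector:
    """Accumulates outbound bytes per workload and flags flows whose
    volume far exceeds what the declared function should produce."""
    def __init__(self):
        self.sent = defaultdict(int)

    def observe(self, workload, nbytes):
        """Account outbound bytes; return True if the flow should be cut."""
        self.sent[workload] += nbytes
        budget = DECLARED_BUDGET.get(workload, 0)
        return self.sent[workload] > THRESHOLD * budget

d = ExfilDetector()
print(d.observe("inference-api", 2_000_000))   # False: within budget
print(d.observe("inference-api", 60_000_000))  # True: ~12x declared volume
```

A workload with no declared budget fails immediately on any outbound traffic, which matches the default-deny posture of the rest of the stack; the real detector would additionally reset counters per window and trigger the 500ms connection teardown.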
Zero-trust networking is not a feature you add to existing infrastructure. It is a design principle that must be embedded from the foundation. Retrofitting zero-trust onto a system that was designed with implicit trust requires rewriting most of the networking and security code — which is why so few cloud providers have done it comprehensively. We had the advantage of building from scratch, and we chose to pay the engineering cost upfront rather than accumulating security debt that would need to be repaid with interest after a breach. The result is a multi-tenant AI platform where every connection is authenticated, every flow is authorized, every GPU is scrubbed between tenants, and every anomaly is detected in real time. Is it perfectly secure? No system is. But the attack surface is orders of magnitude smaller than a conventional multi-tenant GPU cloud, and the detection capability means that even a novel attack is likely to be caught before it succeeds.