A complete walkthrough of HarchOS internals — from the custom scheduler and GPU topology awareness to the SENSE/THINK/ACT pipeline that orchestrates 1,798 GPUs across three data centers. No off-the-shelf orchestrator could handle our requirements, so we built one.

HarchOS is a distributed AI operating system that manages 1,798 GPUs across three geographically distributed data centers — Dakhla, Tangier, and Dakar — with sub-12ms inference latency, federated scheduling, and sovereign data residency enforcement. It was not built because we wanted to write an operating system. It was built because no existing system could meet all four of those constraints (scale, latency, federated scheduling, and sovereignty) simultaneously. Kubernetes excels at container orchestration but has no native concept of GPU topology. Slurm handles batch scheduling but assumes a single cluster. Ray supports distributed training but ignores data sovereignty. We needed all four capabilities in one system. This article is the complete technical walkthrough of how we built it, what we got right, what we got wrong, and what we would do differently.
The architecture is organized into five layers. The Resource Layer abstracts physical hardware into logical resource objects with rich metadata: GPU model, memory capacity, NVLink topology, PCIe bus assignment, NUMA node, cooling capacity, and jurisdiction tag. Each GPU in the fleet is represented as a first-class object in the resource graph, with edges representing NVLink connections (800 GB/s bidirectional), the PCIe switch hierarchy, and the network path to every other GPU in the fleet. This graph is maintained in memory across all scheduler instances and updated in real time as hardware joins, leaves, or degrades. The query API supports topology-constrained allocation: "find me 8 GPUs on the same NVLink domain with at least 40GB VRAM each, within Moroccan jurisdiction." The scheduler resolves this query against the live topology graph and returns a placement plan that minimizes cross-domain traffic while respecting jurisdictional constraints.
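To make that query concrete, here is a minimal Python sketch of topology-constrained allocation. The `Gpu` fields and the `allocate` function are illustrative stand-ins, not HarchOS's actual API — the real resource graph also tracks PCIe, NUMA, and cooling metadata:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Gpu:
    """Illustrative stand-in for a node in the resource graph."""
    gpu_id: str
    vram_gb: int
    nvlink_domain: str   # GPUs in one domain share an NVLink fabric
    jurisdiction: str    # ISO country code, e.g. "MA" for Morocco

def allocate(fleet, count, min_vram_gb, jurisdictions):
    """Return `count` GPUs on a single NVLink domain, each with enough
    VRAM, restricted to the permitted jurisdictions; None if impossible."""
    by_domain = defaultdict(list)
    for gpu in fleet:
        if gpu.vram_gb >= min_vram_gb and gpu.jurisdiction in jurisdictions:
            by_domain[gpu.nvlink_domain].append(gpu)
    candidates = [g for g in by_domain.values() if len(g) >= count]
    if not candidates:
        return None
    # Tightest-fitting domain first, to limit fragmentation of large domains.
    return min(candidates, key=len)[:count]

fleet = (
    [Gpu(f"dakhla-{i}", 80, "dakhla-nvl0", "MA") for i in range(8)]
    + [Gpu(f"dakar-{i}", 80, "dakar-nvl0", "SN") for i in range(8)]
)
# "8 GPUs, same NVLink domain, >= 40 GB VRAM each, Moroccan jurisdiction"
placement = allocate(fleet, count=8, min_vram_gb=40, jurisdictions={"MA"})
assert placement is not None and all(g.jurisdiction == "MA" for g in placement)
```

Picking the tightest-fitting domain is one simple way to keep large NVLink domains free for bigger jobs; the production scheduler goes further and minimizes cross-domain traffic across the whole placement plan.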
The Scheduling Layer implements a two-tier architecture. The global tier runs on a dedicated scheduler cluster (3 nodes, Raft consensus) and makes placement decisions for workloads that span multiple hubs or require cross-hub coordination — distributed training jobs, federated inference, and data migration tasks. The local tier runs independently at each hub and makes placement decisions for single-hub workloads — the vast majority of inference and fine-tuning jobs. The global scheduler communicates with local schedulers through a gossip protocol that propagates capacity information every 500ms and placement decisions within 2 seconds. This design ensures that single-hub workloads are scheduled with sub-100ms latency (no cross-hub coordination required) while multi-hub workloads receive globally optimal placement within 5 seconds. The scheduling algorithm itself is a variant of weighted fair queuing with topology-aware bin-packing. Pure bin-packing maximizes utilization but causes head-of-line blocking for latency-sensitive workloads. Pure fair queuing ensures fairness but wastes capacity on fragmentation. Our hybrid approach assigns each workload a weight proportional to its latency sensitivity and priority class, then solves a constrained optimization that maximizes weighted throughput subject to topology and jurisdiction constraints. The solver runs in under 50ms for the current fleet size and scales sub-linearly with additional GPUs.
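The interaction between the fair-queuing weights and the bin-packing step is easier to see in code. This is a greedy Python sketch with hypothetical `Workload` fields; the real scheduler solves a constrained optimization rather than picking greedily, so treat this as the shape of the policy, not the algorithm itself:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Workload:
    virtual_finish: float                        # WFQ ordering key
    name: str = field(compare=False)
    gpus_needed: int = field(compare=False)

def enqueue(queue, virtual_time, name, gpus_needed, weight, service_time):
    # Classic weighted fair queuing: heavier (more latency-sensitive,
    # higher-priority) workloads accrue virtual time more slowly, so
    # they reach the front of the queue sooner.
    vf = virtual_time + service_time / weight
    heapq.heappush(queue, Workload(vf, name, gpus_needed))

def schedule_step(queue, free_gpus_by_domain):
    """Dequeue the next workload by virtual finish time, then bin-pack
    it onto the tightest NVLink domain that still fits (best fit)."""
    if not queue:
        return None
    w = heapq.heappop(queue)
    fitting = {d: g for d, g in free_gpus_by_domain.items()
               if len(g) >= w.gpus_needed}
    if not fitting:
        heapq.heappush(queue, w)   # no placement yet; retry next cycle
        return None
    domain = min(fitting, key=lambda d: len(fitting[d]))
    gpus = [free_gpus_by_domain[domain].pop() for _ in range(w.gpus_needed)]
    return (w.name, domain, gpus)
```

With weight proportional to latency sensitivity, a short latency-critical inference job gets an earlier virtual finish than a bulk training job, which is exactly the head-of-line-blocking problem the hybrid design addresses.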
The Pipeline Layer implements the SENSE-THINK-ACT architecture as a DAG execution engine. Each pipeline stage is modeled as a node in a directed acyclic graph, with edges representing data flow dependencies and latency constraints. The SENSE node ingests data at up to 10M events/second and emits windowed aggregates. The THINK node runs inference on those aggregates using models loaded from a model registry that supports hot-swapping without pipeline interruption. The ACT node translates inference outputs into actions, with configurable execution semantics: fire-and-forget for non-critical actions, confirm-and-execute for safety-critical actions that require human approval, and transactional for actions that must complete atomically with state updates. The DAG engine handles backpressure natively — if the THINK node falls behind, the SENSE node's output buffer expands dynamically up to a configurable limit, after which it applies load-shedding according to priority classes. This prevented a cascade failure during a traffic spike in November 2025 that would have taken down a naive pipeline: SENSE absorbed the spike, THINK caught up within 90 seconds, and no data was lost.
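Here is a minimal sketch of that backpressure behavior in Python, assuming three priority classes and hypothetical soft/hard limits; the real DAG engine applies this per edge, but the shedding order is the interesting part:

```python
from collections import deque

class SenseBuffer:
    """Elastic buffer on the SENSE->THINK edge: grows under load up to a
    hard limit, then load-sheds by priority class (0 = most critical)."""

    def __init__(self, soft_limit: int, hard_limit: int):
        self.soft_limit = soft_limit
        self.hard_limit = hard_limit
        self.queues = {p: deque() for p in range(3)}  # one queue per class

    def __len__(self):
        return sum(len(q) for q in self.queues.values())

    def under_pressure(self) -> bool:
        # Between the limits nothing is dropped; this flag lets SENSE
        # coarsen its windowed aggregates to slow the inflow.
        return len(self) > self.soft_limit

    def push(self, event, priority: int) -> bool:
        if len(self) >= self.hard_limit:
            # Shed the oldest event from the least-critical class that is
            # no more critical than the incoming event.
            for p in sorted(self.queues, reverse=True):
                if self.queues[p] and p >= priority:
                    self.queues[p].popleft()
                    break
            else:
                return False  # incoming event is itself the least critical
        self.queues[priority].append(event)
        return True

    def pop(self):
        for p in sorted(self.queues):       # THINK drains critical first
            if self.queues[p]:
                return self.queues[p].popleft()
        return None
```

This ordering — absorb first, shed last, lowest priority first — is what let SENSE absorb the November 2025 spike without loss: the buffer never reached its shedding threshold before THINK caught up.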
The Sovereignty Layer enforces data residency constraints at the scheduling and execution level. Every data object in HarchOS carries a jurisdiction tag — a set of countries where the data may be processed. The scheduling layer reads these tags and excludes hubs outside the permitted jurisdiction from placement consideration. At execution time, a runtime check verifies that the workload's jurisdiction tag matches the hub's location before allowing data to be loaded into GPU memory. If a jurisdiction violation is detected — which can happen if a hub's geopolitical status changes or a tag is updated — the workload is immediately paused, its memory is scrubbed, and an alert is dispatched to the operations team. This is not a soft constraint. It is a hard enforcement mechanism that guarantees compliance by design, not by policy.
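The runtime check itself is small. A sketch, with `pause`, `scrub_gpu_memory`, and `alert_ops` as hypothetical callbacks standing in for the enforcement actions described above:

```python
class JurisdictionViolation(Exception):
    """Raised when a workload's data is about to touch a non-permitted hub."""

def enforce_residency(permitted: frozenset, hub_country: str,
                      pause, scrub_gpu_memory, alert_ops) -> None:
    """Gate evaluated before data is loaded into GPU memory."""
    if hub_country in permitted:
        return  # compliant: allow the load to proceed
    # Hard enforcement: pause, scrub, alert. Never warn-and-continue.
    pause()
    scrub_gpu_memory()
    alert_ops(f"jurisdiction violation: hub {hub_country} "
              f"not in {sorted(permitted)}")
    raise JurisdictionViolation(hub_country)

# Example: data tagged for Morocco only must never load in Dakar (Senegal):
#   enforce_residency(frozenset({"MA"}), "SN", pause, scrub, alert)  # raises
```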
The Observability Layer — internally called SENTINEL — replaces Prometheus, Grafana, and Datadog with a sovereign telemetry stack. All metrics, logs, and traces are collected, stored, and visualized within Harch Intelligence's network perimeter. The collection agent is a Rust-based daemon that runs on every compute node and emits metrics via a custom protocol over QUIC, achieving sub-millisecond collection latency with minimal overhead (less than 0.3% CPU, less than 50MB RAM). Storage is a custom time-series database built on Apache Arrow, achieving 12x compression versus raw JSON and supporting sub-second queries across 90 days of metrics data. The visualization layer is a React application served from the same network perimeter. The total engineering investment in SENTINEL was approximately 8 engineer-months — a significant cost, but one that ensures no foreign company has visibility into our infrastructure's performance, capacity, or failure modes.
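For a sense of what the Arrow-backed layout buys, here is an illustrative schema in Python using pyarrow; the field names are assumptions (SENTINEL's schema is internal), but dictionary-encoding the low-cardinality tag columns is the kind of columnar technique that produces large compression wins over raw JSON:

```python
import pyarrow as pa

# Columnar layout for one metrics partition (illustrative field names).
schema = pa.schema([
    ("ts", pa.timestamp("ns")),
    ("node", pa.dictionary(pa.int32(), pa.string())),    # low-cardinality tag
    ("metric", pa.dictionary(pa.int32(), pa.string())),  # low-cardinality tag
    ("value", pa.float64()),
])

batch = pa.record_batch(
    [
        pa.array([0, 1_000_000], type=pa.timestamp("ns")),
        pa.array(["dakhla-g001", "dakhla-g001"]).dictionary_encode(),
        pa.array(["gpu.sm_util", "gpu.mem_used_gb"]).dictionary_encode(),
        pa.array([0.94, 71.3]),
    ],
    schema=schema,
)
```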
What would we do differently? Three things. First, we would have invested in topology-aware scheduling earlier — the first six months of operation used a simpler algorithm that left 30% of GPU capacity underutilized due to fragmentation. Second, we would have built SENTINEL from day one instead of starting with Prometheus and migrating later; the migration cost more engineering time than the original build. Third, we would have designed the gossip protocol with a fallback to centralized coordination for edge cases where the distributed consensus fails — we experienced a 12-minute scheduling outage in September 2025 when a network partition caused two scheduler instances to diverge. The fix was straightforward (version-vector reconciliation), but the incident exposed a fragility we should have anticipated. HarchOS is not perfect. But it works — 99.97% scheduling availability, 94% GPU utilization, and zero data sovereignty violations across 18 months of production operation. For a system this complex, that is a foundation worth building on.