The Problem
When you send a message through Poly Chat, it needs to reach the right AI model — fast. Our gateway routes requests across 8+ providers (OpenAI, Anthropic, Google, and more), and every millisecond of routing overhead is a millisecond of latency your users feel.
Traditional approaches use mutexes or read-write locks to protect shared routing state. Under high concurrency, every request contends for the same locks, and they become the bottleneck. We needed something better.
Lock-Free Architecture
We replaced all hot-path locks with lock-free data structures:
- DashMap for concurrent provider registries — sharded internally, no global lock
- ArcSwap for atomic reference updates to routing tables — readers never block
- Crossbeam SegQueue for lock-free request queues between pipeline stages
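The core idea behind the ArcSwap piece can be sketched with nothing but std atomics: routing tables are immutable once published, a writer swings a single atomic pointer to a freshly built table, and readers do one atomic load with no lock and no blocking. This is a minimal illustration, not our production code — it deliberately leaks superseded tables, whereas the real arc-swap crate handles memory reclamation safely.

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

// Illustrative routing table: immutable once published.
struct RoutingTable {
    providers: Vec<&'static str>,
}

// Global pointer to the current table. Readers never block:
// reading the table is a single atomic load.
static CURRENT: AtomicPtr<RoutingTable> = AtomicPtr::new(std::ptr::null_mut());

fn publish(table: RoutingTable) {
    // Leak the new table so any reader still holding an old pointer
    // stays valid. Real code uses arc-swap (or epoch-based reclamation)
    // to free retired tables instead of leaking them.
    let ptr = Box::into_raw(Box::new(table));
    CURRENT.store(ptr, Ordering::Release);
}

fn current() -> Option<&'static RoutingTable> {
    let ptr = CURRENT.load(Ordering::Acquire);
    // Sound here only because published tables are never freed.
    unsafe { ptr.as_ref() }
}

fn main() {
    publish(RoutingTable { providers: vec!["openai", "anthropic"] });
    println!("providers: {}", current().unwrap().providers.len());

    // A later config update swaps the pointer; in-flight readers
    // keep seeing the old snapshot until their next load.
    publish(RoutingTable { providers: vec!["openai", "anthropic", "google"] });
    println!("providers: {}", current().unwrap().providers.len());
}
```

The same publish-immutable-then-swap discipline is why readers never observe a half-updated routing table: there is no intermediate state to see.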
The result: roughly a 10x improvement under contention compared to our previous RwLock-based design.
SIMD-Accelerated Scoring
Route selection scores each provider on cost, latency, and capability match. With 8+ providers and multiple scoring dimensions, this is a perfect fit for SIMD vectorization.
Using wide::f32x8, we evaluate all providers simultaneously in a single vector operation. On AVX2-capable hardware, this gives us a 5.4x speedup over scalar scoring.
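The shape of the computation is worth showing. Providers are laid out struct-of-arrays, so each scoring dimension is one contiguous 8-lane row — exactly what an `f32x8` register holds. Below is a hedged, scalar sketch of that layout and the weighted scoring math (the weights, field names, and sample values are illustrative, not our actual tuning); with `wide::f32x8`, the loop body collapses into a few vector multiply-adds across all eight lanes at once.

```rust
// Struct-of-arrays layout: one row per scoring dimension,
// eight lanes per row -- the shape an f32x8 register holds.
const LANES: usize = 8;

struct ProviderScores {
    cost: [f32; LANES],       // normalized, higher = cheaper
    latency: [f32; LANES],    // normalized, higher = faster
    capability: [f32; LANES], // 1.0 if the provider can serve the request
}

// Illustrative weights, not production values.
const W_COST: f32 = 0.3;
const W_LATENCY: f32 = 0.5;
const W_CAPABILITY: f32 = 0.2;

// Scalar reference version. With wide::f32x8, each array maps onto one
// vector register and the three weighted terms become vector ops
// evaluated for all eight providers simultaneously.
fn best_provider(s: &ProviderScores) -> usize {
    let mut best = 0;
    let mut best_score = f32::MIN;
    for i in 0..LANES {
        let score = W_COST * s.cost[i]
            + W_LATENCY * s.latency[i]
            + W_CAPABILITY * s.capability[i];
        if score > best_score {
            best_score = score;
            best = i;
        }
    }
    best
}

fn main() {
    let s = ProviderScores {
        cost:       [0.9, 0.4, 0.7, 0.5, 0.8, 0.6, 0.3, 0.2],
        latency:    [0.5, 0.9, 0.6, 0.8, 0.4, 0.7, 0.9, 0.5],
        capability: [1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0],
    };
    println!("best provider index: {}", best_provider(&s));
}
```

Zeroing the capability lane (provider 3 above) cleanly knocks a provider out of contention without any branching in the scoring itself — a property that survives the move to SIMD.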
Results
- Routing latency: <5ms p99 (down from ~25ms)
- Throughput: 10,000+ req/s on a single node
- Thermal state detection: 0.31ns (320x better than our 100ns target)
The gateway now handles peak traffic without breaking a sweat — and without a single lock on the hot path.