A pure Go inference engine for DeepSeek V4 Flash, ported from antirez's single-file C implementation. Single static binary, hand-written AVX2 + NEON SIMD kernels, Vulkan SPIR-V and CUDA PTX GPU compute, OpenAI-compatible API server. Runs the full 128 GB Q2-quantized model via mmap: 1.27 tok/s on CPU, and 5.5× that throughput on GPU with CUDA.
The model has 43 layers, each with 256 routed MoE experts (IQ2_XXS/Q2_K) plus one shared expert (Q8_0). Attention is Multi-head Latent Attention (MLA) with LoRA Q/O projections and a compressed KV cache. Hyper-connections provide 4 parallel residual streams with Sinkhorn-normalized mixing. Safetensors weights are mmap'd directly, with no conversion step. The inference pipeline is: Token → F16 Embed → 43 layers × (HC-pre → Attn → HC-post → MoE → HC-post) → RMSNorm → Logits.
Hand-written assembly GEMM kernels from go-pherence — no cgo.
GPU compute kernels compiled at runtime. 5.5× CPU throughput on CUDA.
43 layers, 256 MoE experts, MLA attention, hyper-connections, compressed KV cache.
Drop-in /v1/chat/completions server with streaming.
No Python, no cgo, no ONNX — go build and run.
128 GB Q2 model served directly from disk — no loading step.