Project

go-ds4

active

Pure Go inference engine for DeepSeek V4 Flash — AVX2 + CUDA PTX, single static binary

Overview

A pure Go inference engine for DeepSeek V4 Flash, ported from antirez's single-file C implementation. It ships as a single static binary with hand-written AVX2 and NEON SIMD kernels, Vulkan SPIR-V and CUDA PTX GPU compute, and an OpenAI-compatible API server. It runs the full 128 GB Q2-quantized model via mmap at 1.27 tok/s on CPU, and at 5.5× the CPU throughput on CUDA.
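As a usage sketch, an OpenAI-compatible server can be queried with a plain HTTP POST. The base URL `http://localhost:8080` and the model name `deepseek-v4-flash` below are assumptions for illustration, not documented go-ds4 defaults; the request and response shapes follow the OpenAI chat completions format.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
	Stream   bool          `json:"stream"`
}

type chatResponse struct {
	Choices []struct {
		Message chatMessage `json:"message"`
	} `json:"choices"`
}

// buildChatBody encodes a single-turn, non-streaming chat request.
func buildChatBody(model, prompt string) ([]byte, error) {
	return json.Marshal(chatRequest{
		Model:    model,
		Messages: []chatMessage{{Role: "user", Content: prompt}},
	})
}

// chat posts the request to baseURL + "/v1/chat/completions" and returns
// the content of the first choice.
func chat(baseURL, model, prompt string) (string, error) {
	body, err := buildChatBody(model, prompt)
	if err != nil {
		return "", err
	}
	resp, err := http.Post(baseURL+"/v1/chat/completions", "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status: %s", resp.Status)
	}
	var out chatResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	if len(out.Choices) == 0 {
		return "", fmt.Errorf("no choices in response")
	}
	return out.Choices[0].Message.Content, nil
}

func main() {
	// Hypothetical address and model name; adjust to the running server.
	reply, err := chat("http://localhost:8080", "deepseek-v4-flash", "Hello")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println(reply)
}
```

Setting `Stream` to true would instead require reading the response as server-sent events, which this sketch does not cover.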

How it works

The model has 43 layers, each with 256 routed MoE experts (IQ2_XXS/Q2_K) plus 1 shared expert (Q8_0). Attention is Multi-head Latent Attention (MLA) with LoRA Q/O projections and a compressed KV cache. Hyper-connections provide 4 parallel residual streams with Sinkhorn-normalized mixing. Safetensors weights are mmap'd directly, with no conversion step. The inference pipeline is: Token → F16 Embed → 43 layers × (HC-pre → Attn → HC-post → MoE → HC-post) → RMSNorm → Logits.
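To make "Sinkhorn-normalized mixing" concrete, here is a minimal sketch of generic Sinkhorn normalization on a 4×4 mixing matrix. It is not taken from go-ds4: the exponentiated-logits parameterization, the iteration count, and the use of one scalar per stream (real streams are vectors) are all assumptions for illustration.

```go
package main

import (
	"fmt"
	"math"
)

const streams = 4 // number of parallel residual streams

type matrix [streams][streams]float64

// sinkhorn exponentiates the logits, then alternately normalizes rows and
// columns so the result approaches a doubly stochastic matrix.
func sinkhorn(logits matrix, iters int) matrix {
	var m matrix
	for i := 0; i < streams; i++ {
		for j := 0; j < streams; j++ {
			m[i][j] = math.Exp(logits[i][j])
		}
	}
	for it := 0; it < iters; it++ {
		for i := 0; i < streams; i++ { // make each row sum to 1
			sum := 0.0
			for j := 0; j < streams; j++ {
				sum += m[i][j]
			}
			for j := 0; j < streams; j++ {
				m[i][j] /= sum
			}
		}
		for j := 0; j < streams; j++ { // make each column sum to 1
			sum := 0.0
			for i := 0; i < streams; i++ {
				sum += m[i][j]
			}
			for i := 0; i < streams; i++ {
				m[i][j] /= sum
			}
		}
	}
	return m
}

// mix computes out[i] = sum over j of m[i][j] * x[j].
func mix(m matrix, x [streams]float64) [streams]float64 {
	var out [streams]float64
	for i := 0; i < streams; i++ {
		for j := 0; j < streams; j++ {
			out[i] += m[i][j] * x[j]
		}
	}
	return out
}

func main() {
	logits := matrix{
		{1.5, 0.2, 0.0, 0.3},
		{0.1, 1.2, 0.4, 0.0},
		{0.0, 0.5, 1.0, 0.2},
		{0.3, 0.0, 0.2, 1.4},
	}
	m := sinkhorn(logits, 50)
	for _, row := range m {
		fmt.Printf("%.3f\n", row)
	}
	fmt.Printf("%.3f\n", mix(m, [streams]float64{1, 2, 3, 4}))
}
```

Because both rows and columns sum to approximately 1, the mixing neither amplifies nor drops signal across streams, which is the usual motivation for this kind of normalization.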

Features
AVX2 + NEON SIMD

Hand-written assembly GEMM kernels from go-pherence — no cgo.

🖥
CUDA PTX + Vulkan SPIR-V

GPU compute kernels compiled at runtime. 5.5× CPU throughput on CUDA.

🦙
Full DeepSeek V4 Flash

43 layers, 256 MoE experts, MLA attention, hyper-connections, compressed KV cache.

🌐
OpenAI-compatible API

Drop-in /v1/chat/completions server with streaming.

📦
Single static binary

No Python, no cgo, no ONNX — go build and run.

🗺
mmap model loading

128 GB Q2 model served directly from disk — no loading step.

Architecture
Safetensors weights: 128 GB Q2, read via mmap
Inference engine: 43-layer pipeline
MLA Attention: LoRA + KV cache
MoE FFN: 256 experts, top-6
Hyper-conn: 4-stream mixing
Compute backends: AVX2/NEON (SIMD GEMM), CUDA PTX (GPU kernels)
OpenAI API: /v1/chat/completions
Diagram: DeepSeek V4 Flash inference in pure Go, AVX2 + CUDA PTX
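The "256 experts, top-6" figure means that only 6 of the 256 routed experts run for each token. A minimal sketch of that selection step follows. It assumes the router has already produced one non-negative score per expert; the actual gating function and weight normalization used by the model are not described here, so `topK` and `routeWeights` are illustrative helpers only.

```go
package main

import (
	"fmt"
	"sort"
)

// topK returns the indices of the k largest scores, highest first.
func topK(scores []float32, k int) []int {
	idx := make([]int, len(scores))
	for i := range idx {
		idx[i] = i
	}
	sort.SliceStable(idx, func(a, b int) bool {
		return scores[idx[a]] > scores[idx[b]]
	})
	if k > len(idx) {
		k = len(idx)
	}
	return idx[:k]
}

// routeWeights rescales the picked scores so that they sum to 1
// (an assumed normalization, chosen for illustration).
func routeWeights(scores []float32, picked []int) []float32 {
	var sum float32
	for _, e := range picked {
		sum += scores[e]
	}
	w := make([]float32, len(picked))
	if sum == 0 {
		return w
	}
	for i, e := range picked {
		w[i] = scores[e] / sum
	}
	return w
}

func main() {
	const numExperts, k = 256, 6
	scores := make([]float32, numExperts)
	for i := range scores {
		// Deterministic placeholder scores standing in for router output.
		scores[i] = float32((i*37)%101) / 100
	}
	picked := topK(scores, k)
	fmt.Println("experts:", picked)
	fmt.Println("weights:", routeWeights(scores, picked))
}
```

Since only the selected experts' weights are read for a given token, this sparsity pairs naturally with mmap: most of the expert tensors never need to be resident at once.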
Posts
Notes for May 3-10
The Local AI Moat