go-ds4 — rcarmo

Overview

A pure Go inference engine for DeepSeek V4 Flash, ported from antirez's single-file C implementation. It ships as a single static binary with hand-written AVX2 and NEON SIMD kernels, Vulkan SPIR-V and CUDA PTX GPU compute, and an OpenAI-compatible API server. It runs the full 128 GB Q2-quantized model via mmap at 1.27 tok/s on CPU, and reaches 5.5× the CPU throughput on CUDA.
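
As a rough illustration of what talking to an OpenAI-compatible server looks like from Go, the sketch below posts one chat message to /v1/chat/completions and reads back the reply. Only the endpoint path comes from this page. The model name, the `chat` helper, and the stub server are hypothetical, and the request and response fields follow the common OpenAI chat format rather than anything confirmed for go-ds4. The stub stands in for a running server so the example works on its own.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string    `json:"model"`
	Messages []message `json:"messages"`
}

type choice struct {
	Message message `json:"message"`
}

type chatResponse struct {
	Choices []choice `json:"choices"`
}

// chat sends a single user prompt and returns the first choice's text.
func chat(baseURL, model, prompt string) (string, error) {
	body, err := json.Marshal(chatRequest{
		Model:    model,
		Messages: []message{{Role: "user", Content: prompt}},
	})
	if err != nil {
		return "", err
	}
	resp, err := http.Post(baseURL+"/v1/chat/completions", "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %s", resp.Status)
	}
	var out chatResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	if len(out.Choices) == 0 {
		return "", fmt.Errorf("response contained no choices")
	}
	return out.Choices[0].Message.Content, nil
}

// stubServer is a stand-in for a real server (assumption: it only echoes
// the first message back in the OpenAI response shape).
func stubServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var req chatRequest
		if r.URL.Path != "/v1/chat/completions" ||
			json.NewDecoder(r.Body).Decode(&req) != nil || len(req.Messages) == 0 {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		reply := message{Role: "assistant", Content: "echo: " + req.Messages[0].Content}
		json.NewEncoder(w).Encode(chatResponse{Choices: []choice{{Message: reply}}})
	}))
}

func main() {
	srv := stubServer()
	defer srv.Close()
	// "deepseek-v4-flash" is a hypothetical model name.
	reply, err := chat(srv.URL, "deepseek-v4-flash", "hello")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(reply)
}
```

Against a real server, `srv.URL` would be replaced by the address the server listens on.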

How it works

The model has 43 layers, each with 256 routed MoE experts (IQ2_XXS/Q2_K) plus 1 shared expert (Q8_0). Attention is Multi-head Latent Attention (MLA) with LoRA Q/O projections and a compressed KV cache. Hyper-connections provide 4 parallel residual streams with Sinkhorn-normalized mixing. Safetensors weights are mmap'd directly, with no conversion step. The inference pipeline is: Token → F16 Embed → 43 layers × (HC-pre → Attn → HC-post → MoE → HC-post) → RMSNorm → Logits.
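
To make "Sinkhorn-normalized mixing" more concrete, here is a minimal sketch of the textbook Sinkhorn iteration on a 4×4 matrix, one row and column per residual stream. It alternately rescales rows and columns until both sum to roughly 1. The `sinkhorn` function, the iteration count, and the example weights are assumptions for illustration; this is not taken from the go-ds4 source, which may normalize differently (for example in log space).

```go
package main

import "fmt"

// sinkhorn returns a copy of the positive square matrix m after iters
// rounds of alternating row and column normalization. With enough rounds
// every row and every column sums to approximately 1.
func sinkhorn(m [][]float64, iters int) [][]float64 {
	n := len(m)
	out := make([][]float64, n)
	for i := range m {
		out[i] = append([]float64(nil), m[i]...) // copy, leave input untouched
	}
	for it := 0; it < iters; it++ {
		// Scale each row so it sums to 1.
		for i := 0; i < n; i++ {
			var sum float64
			for j := 0; j < n; j++ {
				sum += out[i][j]
			}
			for j := 0; j < n; j++ {
				out[i][j] /= sum
			}
		}
		// Scale each column so it sums to 1.
		for j := 0; j < n; j++ {
			var sum float64
			for i := 0; i < n; i++ {
				sum += out[i][j]
			}
			for i := 0; i < n; i++ {
				out[i][j] /= sum
			}
		}
	}
	return out
}

func main() {
	// Hypothetical positive mixing weights between 4 residual streams.
	weights := [][]float64{
		{1, 2, 3, 4},
		{4, 1, 1, 2},
		{2, 5, 1, 1},
		{1, 1, 6, 2},
	}
	for _, row := range sinkhorn(weights, 50) {
		fmt.Printf("%.3f\n", row)
	}
}
```

A matrix normalized this way redistributes the streams without changing their total, which is the usual motivation for constraining a mixing matrix to be doubly stochastic.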

Features

⚡

AVX2 + NEON SIMD

Hand-written assembly GEMM kernels from go-pherence — no cgo.

🖥

CUDA PTX + Vulkan SPIR-V

GPU compute kernels compiled at runtime. 5.5× CPU throughput on CUDA.

🦙

Full DeepSeek V4 Flash

43 layers, 256 MoE experts, MLA attention, hyper-connections, compressed KV cache.

🌐

OpenAI-compatible API

Drop-in /v1/chat/completions server with streaming.

📦

Single static binary

No Python, no cgo, no ONNX — go build and run.

🗺

mmap model loading

128 GB Q2 model served directly from disk — no loading step.

Posts

2026-05-11 Notes for May 3-10

2026-05-09 The Local AI Moat