A minimal tensor computation framework in pure Go with SIMD assembly and GPU compute, inspired by tinygrad. Lazy tensor DAG with elementwise fusion, pattern-matching graph rewrite, and enough infrastructure to run LLaMA and BERT models from safetensors weights — no Python, no cgo, no ONNX.
The core is a lazy tensor DAG: operations build a computation graph that is fused and optimized before execution. A tinygrad-style pattern matcher applies 16 rewrite rules for kernel fusion. SIMD GEMM kernels (AVX2 on x86, NEON on ARM) handle matrix math, and a device-agnostic DevBuf lazily transfers buffers between CPU and GPU. LLaMA decoding uses RoPE, GQA, a KV cache, and a SiLU MLP; BERT encoding reaches go-gte parity at 10.8 ms per embed with zero allocations.
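The lazy-DAG-with-fusion idea can be sketched in a few lines of Go. This is an illustrative toy, not the framework's actual API: `Node`, `Const`, and `Realize` are hypothetical names, and "fusion" here simply means a chain of elementwise ops is evaluated in one pass without materializing intermediate buffers.

```go
package main

import "fmt"

// Op identifies a node kind in the lazy computation graph.
type Op int

const (
	OpConst Op = iota
	OpAdd
	OpMul
)

// Node is one vertex of the lazy tensor DAG; nothing is computed
// until Realize walks the graph.
type Node struct {
	op   Op
	data []float32 // only set for OpConst
	src  []*Node
}

func Const(xs ...float32) *Node   { return &Node{op: OpConst, data: xs} }
func (a *Node) Add(b *Node) *Node { return &Node{op: OpAdd, src: []*Node{a, b}} }
func (a *Node) Mul(b *Node) *Node { return &Node{op: OpMul, src: []*Node{a, b}} }

// Realize evaluates the DAG. A chain of elementwise ops runs in a
// single loop over the output, standing in for kernel fusion: no
// intermediate buffer is materialized for a+b inside (a+b)*c.
func (n *Node) Realize() []float32 {
	if n.op == OpConst {
		return n.data
	}
	size := len(n.leftmostConst().data)
	out := make([]float32, size)
	for i := 0; i < size; i++ {
		out[i] = n.eval(i) // one fused pass, element by element
	}
	return out
}

func (n *Node) leftmostConst() *Node {
	for n.op != OpConst {
		n = n.src[0]
	}
	return n
}

func (n *Node) eval(i int) float32 {
	switch n.op {
	case OpConst:
		return n.data[i]
	case OpAdd:
		return n.src[0].eval(i) + n.src[1].eval(i)
	default: // OpMul
		return n.src[0].eval(i) * n.src[1].eval(i)
	}
}

func main() {
	a, b, c := Const(1, 2), Const(3, 4), Const(10, 10)
	fmt.Println(a.Add(b).Mul(c).Realize()) // [40 60]
}
```

The real framework additionally rewrites the graph before execution; the sketch only shows why laziness is a prerequisite for that.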
- **Fusion:** elementwise fusion gives 2× throughput for chained ops; a pattern matcher drives graph rewriting with 16 rules.
- **SIMD GEMM:** AVX2 VGATHERDPS on x86, NEON GEBP on ARM, ported from go-gte.
- **LLaMA:** RoPE, grouped-query attention, KV cache, SiLU MLP, RMS norm; loads safetensors and GPTQ INT4 weights.
- **BERT:** GTE-small at go-gte parity: 10.8 ms, 0 allocs per embed.
- **GPU:** device-agnostic DevBuf with lazy CPU↔GPU transfer; 8 PTX kernels compiled at runtime.
- **Dependencies:** pure Go + assembly. No Python, no cgo, no ONNX runtime.
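The pattern-matcher-plus-rewrite loop can be sketched as rules applied bottom-up to a fixed point. The two rules below (`x*1 → x`, `x+0 → x`) are illustrative stand-ins, not any of the framework's actual 16, and `Expr`/`Rule` are hypothetical names.

```go
package main

import "fmt"

// Expr is a tiny stand-in for a graph node: an op name plus children,
// or a leaf constant.
type Expr struct {
	Op   string
	Kids []*Expr
	Val  float64
}

// Rule rewrites a node if its pattern matches, else returns nil.
type Rule func(*Expr) *Expr

// mulOne: x * 1  →  x
func mulOne(e *Expr) *Expr {
	if e.Op == "mul" && e.Kids[1].Op == "const" && e.Kids[1].Val == 1 {
		return e.Kids[0]
	}
	return nil
}

// addZero: x + 0  →  x
func addZero(e *Expr) *Expr {
	if e.Op == "add" && e.Kids[1].Op == "const" && e.Kids[1].Val == 0 {
		return e.Kids[0]
	}
	return nil
}

// rewrite applies rules bottom-up until no rule fires — the same
// shape as a tinygrad-style pattern matcher.
func rewrite(e *Expr, rules []Rule) *Expr {
	for i, k := range e.Kids {
		e.Kids[i] = rewrite(k, rules)
	}
	for changed := true; changed; {
		changed = false
		for _, r := range rules {
			if out := r(e); out != nil {
				e, changed = out, true
			}
		}
	}
	return e
}

func main() {
	x := &Expr{Op: "var"}
	g := &Expr{Op: "add", Kids: []*Expr{
		{Op: "mul", Kids: []*Expr{x, {Op: "const", Val: 1}}},
		{Op: "const", Val: 0},
	}}
	fmt.Println(rewrite(g, []Rule{mulOne, addZero}).Op) // var
}
```

Fusion rules in the real framework work the same way, except the rewrite produces a fused kernel node instead of a simplified expression.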
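RoPE itself is small enough to show in full. This is the standard rotary position embedding (pairwise rotation with per-pair frequency `base^(-i/d)`, base 10000 per the usual LLaMA convention); the function name and layout are illustrative, not this repo's actual code.

```go
package main

import (
	"fmt"
	"math"
)

// applyRoPE rotates consecutive (x[i], x[i+1]) pairs of one
// query/key head in place. Pair i gets angle pos * base^(-i/d),
// so early dimensions rotate fast and late ones slowly.
func applyRoPE(x []float32, pos int, base float64) {
	d := len(x)
	for i := 0; i < d; i += 2 {
		theta := float64(pos) * math.Pow(base, -float64(i)/float64(d))
		sin, cos := math.Sincos(theta)
		x0, x1 := float64(x[i]), float64(x[i+1])
		x[i] = float32(x0*cos - x1*sin)
		x[i+1] = float32(x0*sin + x1*cos)
	}
}

func main() {
	q := []float32{1, 0, 1, 0}
	applyRoPE(q, 0, 10000) // position 0 is the identity rotation
	fmt.Println(q)         // [1 0 1 0]
	applyRoPE(q, 1, 10000) // first pair rotated by 1 radian
	fmt.Printf("%.3f %.3f\n", q[0], q[1]) // 0.540 0.841
}
```

Because the rotation depends only on the position index, keys can be rotated once and stored rotated in the KV cache.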
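The lazy CPU↔GPU transfer behind DevBuf can be sketched as a buffer that tracks where its freshest copy lives and copies only on a cross-device read. The GPU side below is faked with a second slice; names and methods are a sketch of the idea, not the framework's real DevBuf.

```go
package main

import "fmt"

// Device marks where a buffer's freshest bytes currently live.
type Device int

const (
	CPU Device = iota
	GPU
)

// DevBuf is a device-agnostic buffer: data moves only when the
// other device actually reads it.
type DevBuf struct {
	host  []float32
	dev   []float32 // stand-in for device memory
	where Device
}

func NewDevBuf(xs []float32) *DevBuf { return &DevBuf{host: xs, where: CPU} }

// OnGPU returns the device copy, uploading only if it is stale.
func (b *DevBuf) OnGPU() []float32 {
	if b.where == CPU {
		b.dev = append(b.dev[:0], b.host...) // stands in for an H2D copy
		b.where = GPU
	}
	return b.dev
}

// OnCPU downloads only if the freshest copy is on the device.
func (b *DevBuf) OnCPU() []float32 {
	if b.where == GPU {
		b.host = append(b.host[:0], b.dev...) // stands in for a D2H copy
		b.where = CPU
	}
	return b.host
}

func main() {
	b := NewDevBuf([]float32{1, 2, 3})
	g := b.OnGPU()
	g[0] = 42 // a GPU kernel writes in place
	fmt.Println(b.OnCPU()) // [42 2 3]
}
```

Chained GPU kernels thus reuse the device-resident copy with no round trip; the download happens only when the CPU finally reads the result.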