Forked from antirez/gte-pure-C. A pure Go implementation of the GTE-Small text embedding model. Produces 384-dimensional, L2-normalized embeddings suitable for similarity search and clustering.
Single static binary. At most 1 allocation per embed. Predictable flat latency. All matrix operations use hand-written SIMD assembly (AVX2+FMA on amd64, NEON on arm64): no gonum, no goroutine churn, no CGo in the default build.
This seemed like a great follow-up to asterisk and drove me to try out Go assembly. I'm not overly pleased with non-GPU performance, but I did learn a lot from contrasting SIMD and NVIDIA support that eventually led to go-pherence, even if my potato RTX3060 provides barely any speedup.
I'm pretty happy with it, and it was a great learning experience even if I haven't yet sorted out the vector database I will be using it with.
Like Salvatore's original, it loads a converted GTE-Small model at startup, tokenises input with a bundled WordPiece tokeniser, runs the matrix multiplications through hand-tuned SIMD kernels, and returns mean-pooled, L2-normalized embeddings as a Go slice. After a few days of banging on the profiler, the hot path has zero goroutine overhead and generates ~700 bytes/s of GC pressure at 100 qps, over 10,000× less than a gonum BLAS equivalent (which was what AI suggested I use, without considering all the angles).
Returns L2-normalized 384-dimensional vectors compatible with GTE-Small's training distribution — plug directly into cosine similarity, FAISS, or any vector store.
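Because the vectors are already unit length, cosine similarity reduces to a plain dot product, with no division by norms. The sketch below ranks a toy corpus that way; `Dot` and `Rank` are hypothetical helper names, not part of this package's API, and 2-D vectors stand in for the 384-dimensional embeddings.

```go
package main

import (
	"fmt"
	"sort"
)

// Dot returns the inner product of two equal-length vectors. For
// L2-normalized embeddings this equals their cosine similarity.
func Dot(a, b []float32) float32 {
	var s float32
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

// Rank returns the indices of corpus ordered by descending similarity
// to query.
func Rank(query []float32, corpus [][]float32) []int {
	idx := make([]int, len(corpus))
	scores := make([]float32, len(corpus))
	for i, v := range corpus {
		idx[i] = i
		scores[i] = Dot(query, v)
	}
	sort.SliceStable(idx, func(x, y int) bool {
		return scores[idx[x]] > scores[idx[y]]
	})
	return idx
}

func main() {
	// Toy unit vectors; real embeddings would come from the model.
	query := []float32{1, 0}
	corpus := [][]float32{{0, 1}, {0.6, 0.8}, {1, 0}}
	fmt.Println(Rank(query, corpus)) // prints [2 1 0]
}
```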
AVX2+FMA on amd64, NEON on arm64 — non-temporal matmul, fused dot products, and packed transpose kernels written in Go assembly. No CGo required.
The entire inference path allocates at most once per embed, and only when tokens containing uppercase characters have to be lowercased. For all-lowercase input: 0 allocations. GC pressure is ~700 B/s vs 13 MB/s with gonum BLAS.
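A quick way to see the lowercase fast path is to count allocations with `testing.AllocsPerRun`. The sketch below assumes the lowering behaves like the standard library's `strings.ToLower`, which returns its argument unchanged when there is nothing to lower; it does not call this package's tokeniser, and `AllocsForLowering` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// sink keeps the result alive so the compiler cannot discard the call.
var sink string

// AllocsForLowering reports the average number of heap allocations made
// by one strings.ToLower call on s.
func AllocsForLowering(s string) float64 {
	return testing.AllocsPerRun(100, func() {
		sink = strings.ToLower(s)
	})
}

func main() {
	// All-lowercase input takes the no-copy fast path.
	fmt.Println("lowercase: ", AllocsForLowering("hello world"))
	// Mixed-case input needs a new string for the lowered copy.
	fmt.Println("mixed case:", AllocsForLowering("Hello World"))
}
```

The same `testing.AllocsPerRun` call can wrap a full embed call in a benchmark to check the allocation count end to end.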
No goroutine churn means predictable p50/p99. Batching reduces jitter 3–5× further. Remaining spikes are Go runtime background work, not inference code.
The default make target produces a fully self-contained static binary with no C dependencies, portable to any amd64 or arm64 target.