Forked from antirez/gte-pure-C. A pure Go implementation of the GTE-Small text embedding model. Produces 384-dimensional, L2-normalized embeddings suitable for similarity search and clustering.
Single static binary. One allocation per embed. Predictable, flat latency. All matrix operations use hand-written SIMD assembly (AVX2+FMA on amd64, NEON on arm64) — no gonum, no goroutine churn, no CGo in the default build.
Loads a converted GTE-Small model at startup, tokenizes input with a bundled WordPiece tokenizer, runs the matrix multiplications through hand-tuned SIMD kernels, and returns mean-pooled, L2-normalized embeddings as a Go slice. The hot path has zero goroutine overhead and generates ~700 bytes/s of GC pressure at 100 qps — 10,000× less than a gonum BLAS equivalent.
Returns L2-normalized 384-dimensional vectors compatible with GTE-Small's training distribution — plug directly into cosine similarity, FAISS, or any vector store.
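Because the returned vectors are already unit length, cosine similarity reduces to a plain dot product — no per-query norms needed. A sketch of how a caller might compare two embeddings:

```go
package main

import "fmt"

// dot returns the dot product of a and b. For two L2-normalized
// vectors this value IS their cosine similarity.
func dot(a, b []float32) float32 {
	var s float32
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

func main() {
	// Toy unit vectors standing in for two 384-dim embeddings.
	a := []float32{0.6, 0.8}
	b := []float32{0.8, 0.6}
	fmt.Println(dot(a, b)) // ≈ 0.96: high similarity
}
```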
AVX2+FMA on amd64, NEON on arm64 — non-temporal matmul, fused dot products, and packed transpose kernels written in Go assembly. No CGo required.
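A scalar reference for what those kernels compute: multiply with the second operand pre-transposed so both inner loops stream memory sequentially, which is the access pattern the packed-transpose step sets up and the FMA instructions then vectorize. This sketch is illustrative and not the repo's actual code:

```go
package main

import "fmt"

// matMulBT computes C = A × B where bt holds B already transposed
// (bt[j] is column j of B). Reading bt[j] left to right keeps the
// inner loop sequential in memory instead of striding across rows.
func matMulBT(a, bt [][]float32) [][]float32 {
	c := make([][]float32, len(a))
	for i := range a {
		c[i] = make([]float32, len(bt))
		for j := range bt {
			var s float32
			for k := range a[i] {
				s += a[i][k] * bt[j][k] // one fused multiply-add per lane in the SIMD path
			}
			c[i][j] = s
		}
	}
	return c
}

func main() {
	a := [][]float32{{1, 2}, {3, 4}}
	bt := [][]float32{{5, 7}, {6, 8}} // B = [[5, 6], [7, 8]] stored transposed
	fmt.Println(matMulBT(a, bt))     // [[19 22] [43 50]]
}
```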
The entire inference path allocates once, for uppercase→lowercase token lowering; all-lowercase input runs with zero allocations. GC pressure is ~700 B/s, versus ~13 MB/s with gonum BLAS.
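That single allocation exists because lowering only copies when it actually finds an uppercase character; already-lowercase input is returned untouched. A sketch of the pattern (illustrative, not the repo's exact tokenizer code):

```go
package main

import (
	"fmt"
	"strings"
)

// lowerIfNeeded returns s unchanged — zero allocations — when it
// contains no uppercase ASCII; otherwise it allocates once for the
// lowered copy. This is the one allocation on the embed hot path.
func lowerIfNeeded(s string) string {
	for i := 0; i < len(s); i++ {
		if s[i] >= 'A' && s[i] <= 'Z' {
			return strings.ToLower(s)
		}
	}
	return s
}

func main() {
	fmt.Println(lowerIfNeeded("all lowercase input")) // passes through as-is
	fmt.Println(lowerIfNeeded("Mixed CASE input"))    // triggers the one copy
}
```

In a test, `testing.AllocsPerRun` from the standard library can confirm the zero-allocation claim for the lowercase branch.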
No goroutine churn means predictable p50 and p99 latencies. Batching reduces jitter a further 3–5×; the remaining spikes come from Go runtime background work, not the inference code.
The default make target produces a fully self-contained static binary with no C dependencies, portable to any amd64 or arm64 target.