Project

go-gte

active

Pure Go GTE-Small text embeddings — hand-written SIMD, 1 allocation per embed, predictable flat latency.

Overview

Forked from antirez/gte-pure-C. A pure Go implementation of the GTE-Small text embedding model. Produces 384-dimensional, L2-normalized embeddings suitable for similarity search and clustering.

Single static binary. 1 allocation per embed. Predictable flat latency. All matrix operations use hand-written SIMD assembly (AVX2+FMA on amd64, NEON on arm64) — no gonum, no goroutine churn, no CGo in the default build.

How it works

Loads a converted GTE-Small model at startup, tokenises input with a bundled WordPiece tokeniser, runs matrix multiplications through hand-tuned SIMD kernels, and returns mean-pooled L2-normalized embeddings as a Go slice. The hot path has zero goroutine overhead and generates ~700 bytes/s of GC pressure at 100 qps — 10,000× less than a gonum BLAS equivalent.

Features
🔢
384-dim embeddings

Returns L2-normalized 384-dimensional vectors compatible with GTE-Small's training distribution — plug directly into cosine similarity, FAISS, or any vector store.

Hand-written SIMD

AVX2+FMA on amd64, NEON on arm64 — non-temporal matmul, fused dot products, and packed transpose kernels written in Go assembly. No CGo required.

🧊
1 allocation per embed

The entire inference path allocates once (uppercase→lowercase token lowering). For all-lowercase input: 0 allocations. GC pressure is ~700 B/s vs 13 MB/s with gonum BLAS.

📉
Flat latency

No goroutine churn means predictable p50/p99. Batching reduces jitter 3–5× further. Remaining spikes are Go runtime background work, not inference code.

📦
Static binary

Default make produces a fully self-contained static binary with no C dependencies — portable to any amd64 or arm64 target.

Architecture
Input text query or document GGUF weights 23 MB model file Embed pipeline single static binary Tokenizer WordPiece 6L Attention BERT encoder Mean pool L2 normalize 384-dim vector similarity · clustering Pure Go GTE-small — single binary, 1 alloc per embed, flat latency
Posts
GTE-Small in Go