Project

gte-go

active

Pure Go GTE-Small text embeddings — hand-written SIMD, 1 allocation per embed, predictable flat latency.

Overview

Forked from antirez/gte-pure-C. A pure Go implementation of the GTE-Small text embedding model. Produces 384-dimensional, L2-normalized embeddings suitable for similarity search and clustering.

Single static binary. 1 allocation per embed. Predictable flat latency. All matrix operations use hand-written SIMD assembly (AVX2+FMA on amd64, NEON on arm64) — no gonum, no goroutine churn, no CGo in the default build.

How it works

Loads a converted GTE-Small model at startup, tokenizes input with a bundled WordPiece tokenizer, runs matrix multiplications through hand-tuned SIMD kernels, and returns mean-pooled, L2-normalized embeddings as a Go []float32 slice. The hot path has zero goroutine overhead and generates ~700 bytes/s of GC pressure at 100 qps, over 10,000× less than a gonum BLAS equivalent.

Features
🔢
384-dim embeddings

Returns L2-normalized 384-dimensional vectors compatible with GTE-Small's training distribution — plug directly into cosine similarity, FAISS, or any vector store.

Hand-written SIMD

AVX2+FMA on amd64, NEON on arm64 — non-temporal matmul, fused dot products, and packed transpose kernels written in Go assembly. No CGo required.

🧊
1 allocation per embed

The entire inference path allocates once, when lowercasing input that contains uppercase characters. For all-lowercase input it allocates nothing. GC pressure is ~700 B/s, versus ~13 MB/s with gonum BLAS.

📉
Flat latency

No goroutine churn means predictable p50/p99. Batching reduces jitter 3–5× further. Remaining spikes are Go runtime background work, not inference code.

📦
Static binary

The default make target produces a fully self-contained static binary with no C dependencies, portable to any amd64 or arm64 target.

Architecture
[Diagram: Go string (text input) → WordPiece tokenizer (in Go) → SIMD kernels (AVX2+FMA / NEON; NT matmul, dot, pack; 1 alloc, 0 goroutines) → []float32 (384-dim, L2-normalized). Model input: .gtemodel weights. Annotations: ~10 ms/embed on amd64; ~700 B/s GC pressure; pure Go + SIMD assembly with no CGo, no gonum, and no goroutine churn.]
Posts
GTE-Small in Go