Forked from antirez/gte-pure-C. A pure Go implementation of the GTE-Small text embedding model. Produces 384-dimensional, L2-normalized embeddings suitable for similarity search and clustering.
Single static binary. One allocation per embed. Predictable, flat latency. All matrix operations use hand-written SIMD assembly (AVX2+FMA on amd64, NEON on arm64) — no gonum, no goroutine churn, no CGo in the default build.
Loads a converted GTE-Small model at startup, tokenizes input with a bundled WordPiece tokenizer, runs the matrix multiplications through hand-tuned SIMD kernels, and returns mean-pooled, L2-normalized embeddings as a Go slice. The hot path has zero goroutine overhead and generates ~700 bytes/s of GC pressure at 100 qps — 10,000× less than a gonum BLAS equivalent.
Returns L2-normalized 384-dimensional vectors compatible with GTE-Small's training distribution — plug directly into cosine similarity, FAISS, or any vector store.
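Because the returned vectors are already unit length, cosine similarity reduces to a plain dot product — no per-query norms needed. A sketch of how a caller might compare two embeddings:

```go
package main

import "fmt"

// dot returns the dot product of a and b. For two L2-normalized
// vectors this value IS their cosine similarity.
func dot(a, b []float32) float32 {
	var s float32
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

func main() {
	// Toy unit vectors standing in for two 384-dim embeddings.
	a := []float32{0.6, 0.8}
	b := []float32{0.8, 0.6}
	fmt.Println(dot(a, b)) // ≈ 0.96: high similarity
}
```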
AVX2+FMA on amd64, NEON on arm64 — non-temporal matmul, fused dot products, and packed transpose kernels written in Go assembly. No CGo required.
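A scalar reference for what those kernels compute: multiply with the second operand pre-transposed so both inner loops stream memory sequentially, which is the access pattern the packed-transpose step sets up and the FMA instructions then vectorize. This sketch is illustrative and not the repo's actual code:

```go
package main

import "fmt"

// matMulBT computes C = A × B where bt holds B already transposed
// (bt[j] is column j of B). Reading bt[j] left to right keeps the
// inner loop sequential in memory instead of striding across rows.
func matMulBT(a, bt [][]float32) [][]float32 {
	c := make([][]float32, len(a))
	for i := range a {
		c[i] = make([]float32, len(bt))
		for j := range bt {
			var s float32
			for k := range a[i] {
				s += a[i][k] * bt[j][k] // one fused multiply-add per lane in the SIMD path
			}
			c[i][j] = s
		}
	}
	return c
}

func main() {
	a := [][]float32{{1, 2}, {3, 4}}
	bt := [][]float32{{5, 7}, {6, 8}} // B = [[5, 6], [7, 8]] stored transposed
	fmt.Println(matMulBT(a, bt))     // [[19 22] [43 50]]
}
```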
The entire inference path allocates once, for uppercase→lowercase token lowering; all-lowercase input runs with zero allocations. GC pressure is ~700 B/s, versus ~13 MB/s with gonum BLAS.
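That single allocation exists because lowering only copies when it actually finds an uppercase character; already-lowercase input is returned untouched. A sketch of the pattern (illustrative, not the repo's exact tokenizer code):

```go
package main

import (
	"fmt"
	"strings"
)

// lowerIfNeeded returns s unchanged — zero allocations — when it
// contains no uppercase ASCII; otherwise it allocates once for the
// lowered copy. This is the one allocation on the embed hot path.
func lowerIfNeeded(s string) string {
	for i := 0; i < len(s); i++ {
		if s[i] >= 'A' && s[i] <= 'Z' {
			return strings.ToLower(s)
		}
	}
	return s
}

func main() {
	fmt.Println(lowerIfNeeded("all lowercase input")) // passes through as-is
	fmt.Println(lowerIfNeeded("Mixed CASE input"))    // triggers the one copy
}
```

In a test, `testing.AllocsPerRun` from the standard library can confirm the zero-allocation claim for the lowercase branch.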
No goroutine churn means predictable p50 and p99 latencies. Batching reduces jitter a further 3–5×; the remaining spikes come from Go runtime background work, not the inference code.
The default make target produces a fully self-contained static binary with no C dependencies, portable to any amd64 or arm64 target.