go-gte — rcarmo

Overview

A pure Go implementation of the GTE-Small text embedding model, forked from antirez/gte-pure-C. It produces 384-dimensional, L2-normalized embeddings suitable for similarity search and clustering.

Single static binary. At most one allocation per embed. Predictable, flat latency. All matrix operations use hand-written SIMD assembly (AVX2+FMA on amd64, NEON on arm64): no gonum, no goroutine churn, and no CGo in the default build.

Motivation

This seemed like a great follow-up to asterisk and drove me to try out Go assembly. I'm not overly pleased with non-GPU performance, but I did learn a lot from contrasting SIMD and NVIDIA support, which eventually led to go-pherence, even if my potato RTX 3060 provides barely any speedup.

I'm pretty happy with it, and it was a great learning experience even if I haven't yet sorted out the vector database I will be using it with.

How it works

Like Salvatore's original, it loads a converted GTE-Small model at startup, tokenises input with a bundled WordPiece tokeniser, runs the matrix multiplications through hand-tuned SIMD kernels, and returns mean-pooled, L2-normalized embeddings as a Go slice. After banging on the profiler for a few days, the hot path has zero goroutine overhead and generates ~700 bytes/s of GC pressure at 100 qps. That is over 10,000× less than the ~13 MB/s of a gonum BLAS equivalent (which was what AI suggested I use, without considering all the angles).

Features

🔢

384-dim embeddings

Returns L2-normalized 384-dimensional vectors compatible with GTE-Small's training distribution — plug directly into cosine similarity, FAISS, or any vector store.

⚡

Hand-written SIMD

AVX2+FMA on amd64, NEON on arm64 — non-temporal matmul, fused dot products, and packed transpose kernels written in Go assembly. No CGo required.

🧊

1 allocation per embed

The entire inference path allocates at most once per embed, when lowercasing tokens that contain uppercase characters. All-lowercase input needs 0 allocations. At 100 qps, GC pressure is ~700 B/s vs 13 MB/s with gonum BLAS.

📉

Flat latency

No goroutine churn means predictable p50/p99. Batching reduces jitter 3–5× further. Remaining spikes are Go runtime background work, not inference code.

📦

Static binary

The default make target produces a fully self-contained static binary with no C dependencies, portable to any amd64 or arm64 target.
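The allocation figures above are the kind of thing that is easy to check with the standard library. As a rough sketch, the snippet below uses testing.AllocsPerRun to count heap allocations for a lowercasing step. It assumes the lowering behaves like strings.ToLower, which returns its input unchanged when an ASCII string has no uppercase letters; whether go-gte's tokeniser actually uses strings.ToLower is an assumption, and lowerAllocs is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// sink keeps each result reachable so the compiler cannot discard the call.
var sink string

// lowerAllocs reports the average number of heap allocations per call
// when lowercasing s. strings.ToLower stands in for the tokeniser's
// lowering step (an assumption, not go-gte's confirmed implementation).
func lowerAllocs(s string) float64 {
	return testing.AllocsPerRun(1000, func() {
		sink = strings.ToLower(s)
	})
}

func main() {
	fmt.Println("all-lowercase input:", lowerAllocs("hello world"))
	fmt.Println("mixed-case input:   ", lowerAllocs("Hello World"))
}
```

The same helper pattern can wrap a full embed call to confirm the per-embed allocation count on a given machine and Go version.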

Posts

2026-05-03 Notes for April 27 – May 3

2025-03-22 GTE-Small in Go