ai, notes, intuition

embeddings, explained

a plain-english walkthrough of what embeddings actually are, why they work, and the intuitions that carry from word2vec all the way to modern llms.

every modern ai system, under all the cleverness, is doing one thing first: turning messy input into a list of numbers. words, pixels, audio clips, pdf pages — they all get converted into a vector before the model is allowed to touch them.

those vectors are embeddings. and once you understand what they really are, a lot of things in ml stop feeling like magic.

this is the post i wish i had when i first ran into the word.

#the core idea in one sentence

an embedding is a point in space that represents some piece of meaning. similar things live near each other. different things live far apart.

that's it. the rest of this post is just unpacking why that's useful and how it actually works.

#why points in space at all

computers don't understand words. they understand numbers. so before anything else, we need a way to turn a word like cat into something a neural network can do math on.

the naive version is one-hot encoding: pick a vocabulary of, say, 50,000 words, and represent each word as a vector with 49,999 zeros and a single 1 somewhere.

cat  = [0, 0, 1, 0, 0, ..., 0]
dog  = [0, 1, 0, 0, 0, ..., 0]
car  = [0, 0, 0, 0, 1, ..., 0]

this works, technically. but it's terrible for two reasons:

  • it's huge. every word is a 50,000-dimensional vector of mostly zeros.
  • nothing is close to anything. cat and dog are just as far apart as cat and car. the representation throws away the very thing we care about — similarity.

we want a representation where the geometry tells you something. where cat and dog end up near each other, and car ends up somewhere else entirely.

the trick is to give up on "one dimension per word" and instead pick a small, fixed number of dimensions (say, 300) and learn where each word should sit in that space.
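
to make the contrast concrete, here's a tiny numpy sketch. the vocabulary and the dimension count are toy numbers, and the "learned" table is random here; training (next section) is what puts the rows in meaningful places.

import numpy as np

vocab = {"cat": 0, "dog": 1, "car": 2}               # toy vocabulary
vocab_size, dim = len(vocab), 4                      # real systems: ~50k words, ~300 dims

# one-hot: one dimension per word, all similarity information thrown away
one_hot = np.eye(vocab_size)
cat_one_hot = one_hot[vocab["cat"]]                  # [1., 0., 0.]

# embedding: a dense lookup table, one small learned row per word
embedding_table = np.random.randn(vocab_size, dim)   # random stand-in for learned weights
cat_vec = embedding_table[vocab["cat"]]              # something like [ 0.3, -1.2, 0.8, 0.1]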

#how you actually learn them

the classic answer is word2vec (mikolov et al., 2013). the setup is almost embarrassingly simple:

you shall know a word by the company it keeps. — j.r. firth, 1957

take a huge pile of text. for every word, look at the words around it. train a small neural network to predict context from word (or word from context). the weights it ends up with are the embeddings.

# skip-gram, pseudo-code
for sentence in corpus:
    for word, context_word in pairs_within_window(sentence, w=5):
        # push the embedding of `word` closer to `context_word`
        # push it away from random other words
        update(embeddings, word, context_word)

the network doesn't "know" what cat means. it just knows that cat tends to show up near purr, litter, vet, kitten. and dog shows up near bark, leash, vet, puppy. they share a lot of neighbors.

after enough passes, the optimizer has pushed cat and dog into nearby regions of the space — not because anyone told it to, but because that's the cheapest way to compress the co-occurrence patterns it's seeing.
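
if you want to poke at this yourself, gensim wraps the whole loop. a minimal sketch, with an obviously-too-small corpus; point it at real text and the neighborhoods start to make sense.

from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "purred", "at", "the", "vet"],
    ["the", "dog", "barked", "at", "the", "vet"],
]
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)  # sg=1 → skip-gram
print(model.wv["cat"])                        # a 100-dimensional vector
print(model.wv.most_similar("cat", topn=3))   # nearest neighbors in the learned space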

this is the key move to internalize: meaning falls out of the structure of the data. you didn't define it. the geometry discovered it.

#the famous arithmetic

once you have embeddings, you can do things like this, which feels uncanny the first time you see it work:

vector("king") - vector("man") + vector("woman") ≈ vector("queen")

what's happening: the direction from man to woman encodes "gender". the direction from king to queen encodes the same thing. so if you start at king and walk in the "man → woman" direction, you end up near queen.
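
you can try this yourself with pretrained vectors. a sketch using gensim's downloader (the google news vectors are a roughly 1.6 gb one-time download):

import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")   # pretrained word2vec vectors
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' typically lands at or near the top of the list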

other directions in the space turn out to encode things like:

  • singular → plural
  • country → capital
  • present tense → past tense
  • company → ceo

nobody labeled any of this. these directions emerged from the training objective. this is the part of word embeddings that made a lot of people take deep learning seriously in the mid-2010s.

#cosine similarity: the default way to measure "close"

once words are points in space, you need a way to ask: how similar are these two?

euclidean distance works, but it's sensitive to vector length. cosine similarity is the standard choice. it only looks at the angle between two vectors:

import numpy as np

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

  • 1.0 = same direction (very similar)
  • 0.0 = orthogonal (unrelated)
  • -1.0 = opposite (rare in practice)

when you hear people talk about "vector search" or "semantic search" — the beating heart of rag systems, recommendation engines, duplicate detection — it's almost always cosine similarity over learned embeddings.

#beyond words: everything becomes a vector

here's where it gets fun. the same trick works for anything.

  • sentences and documents. pool the token vectors, or (better) run them through a transformer and use a dedicated [CLS] token's final representation. now you can ask "find me the doc most similar to this query" in one dot product.
  • images. run them through a cnn or vision transformer, chop off the last layer, and treat the penultimate activations as an embedding. images of cats cluster together. images of sunsets cluster together. clip goes a step further and trains image and text embeddings into the same space so you can search images with a sentence.
  • users and items. in recommenders, every user gets an embedding, every product gets an embedding, and "likely to click" is just a dot product. this is how netflix, spotify, and every feed-based app work under the hood.
  • code. functions, commits, even entire repos get embedded. vector search over a codebase is the first half of almost every "ai ide" feature.
  • proteins, molecules, graphs. the same idea shows up in alphafold (residue embeddings), drug discovery (molecular fingerprints as vectors), and graph neural networks.

the pattern is always the same: pick a task whose loss forces similar things to end up near each other in the vector space. the embedding falls out as a byproduct.
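
to see this move in code, here's a sketch with the sentence-transformers library. the model name is just one common default; any embedding model exposes roughly the same interface.

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = [
    "how do i reset my password?",
    "steps for changing your login credentials",
    "best hiking trails near the city",
]
vecs = model.encode(docs)                 # shape: (3, 384)
print(cos_sim(vecs[0], vecs[1]))          # high: same meaning, different words
print(cos_sim(vecs[0], vecs[2]))          # low: unrelated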

#contextual embeddings: where llms come in

word2vec has one big limitation. the word bank gets exactly one vector — even though "river bank" and "bank account" are very different concepts.

modern embeddings are contextual. the same word gets a different vector depending on the surrounding text. this is what transformers unlocked (see attention is all you need — a walkthrough): every token's representation is computed as a function of every other token in the sequence.

"i sat on the river bank"        →  bank ≈ [earth, shore, water, ...]
"i deposited it at the bank"     →  bank ≈ [money, account, vault, ...]

when people say "llm embeddings" today, they almost always mean this — the output of a transformer stack, taken at some specific layer, used as a representation of a chunk of text.

modern embedding models (openai's text-embedding-3, cohere's embed, bge, e5, nomic, voyage, etc.) are just transformers that have been fine-tuned so that similar passages end up with similar vectors, using contrastive training. the idea hasn't really changed since word2vec. the execution has.
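
at the api level this is a one-liner. a sketch with openai's python client (model name from their docs; cohere, voyage, and the open-source models look roughly the same):

from openai import OpenAI

client = OpenAI()                       # reads OPENAI_API_KEY from the environment
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["i sat on the river bank", "i deposited it at the bank"],
)
vec_a = resp.data[0].embedding          # a plain list of floats
vec_b = resp.data[1].embedding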

#where embeddings show up in real systems

if you're building with ai today, you're using embeddings whether you realize it or not. the three most common places:

  1. retrieval. rag systems embed your documents, embed the user's query, and use cosine similarity to find the top-k chunks to stuff into an llm's context window. the quality of the whole system often lives or dies on the quality of the embeddings (a minimal sketch of the lookup follows this list).
  2. classification. instead of training a fresh classifier, embed your inputs and run a cheap model (logistic regression, knn) on the vectors. often works surprisingly well with very little data.
  3. clustering and dedup. embed a pile of documents, cluster them, and you've got topics. embed user messages and you can detect near-duplicates, spam, or repeated support tickets.
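
the lookup in item 1 is less exotic than it sounds. at small scale it's a normalized matrix multiply; a sketch is below, where embed() is a stand-in for whatever embedding model you're using, and real systems swap the brute-force scan for a vector index (faiss, hnsw, etc.) once the corpus gets big.

import numpy as np

def top_k_chunks(query_vec, chunk_vecs, k=5):
    # normalize so that the dot product is cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q                        # one cosine score per chunk
    return np.argsort(-scores)[:k]        # indices of the k most similar chunks

# chunk_vecs = embed(all_document_chunks)     # (n_chunks, dim), from any embedding model
# best = top_k_chunks(embed(user_query), chunk_vecs)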

and more subtly — every chatbot, every image generator, every recommender, every code-completion tool is shuttling embeddings around inside its forward pass. the user just never sees them.

#the mental model to keep

if you take one thing from this post, let it be this:

a neural network is, mostly, a machine for learning embeddings. the last layer of a classifier is a linear projection on top of embeddings. attention is a weighted average over embeddings. rag is a nearest-neighbor lookup over embeddings. fine-tuning is moving embeddings around. multimodal models are aligning embeddings across input types.

once you internalize that the whole field is "turn stuff into vectors, then do geometry", a lot of the jargon becomes navigable.

if you want to go deeper:

  • "efficient estimation of word representations in vector space" — the word2vec paper (arxiv.org/abs/1301.3781). short and readable.
  • "the illustrated word2vec" by jay alammar — the best visual explanation i've seen.
  • sentence-bert — the paper that made modern sentence embeddings practical.
  • openai's embeddings guide — for the pragmatic api-level view.

embeddings are the thing beneath the thing. once you see them everywhere, you stop being surprised by how much of ai is "just" clever geometry on learned vectors.

which, honestly, is the most surprising part.
