Embeddings are everywhere in modern ML. Words, sentences, images, users, products, songs… anything can become an embedding. But what are they actually, and why do we need them?
The core idea
An embedding converts something discrete (like a word) into a list of numbers (a vector) where similar things end up close together.
Cat → [0.2, -0.5, 0.8, ...]
Dog → [0.3, -0.4, 0.7, ...]
Car → [-0.8, 0.3, -0.2, ...]
Cat and dog vectors are close to each other (both are pets). Car is far from both. The embedding has captured meaning!
Interactive demo: Embeddings Animation - see how similar items cluster together in embedding space.
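To make “close” concrete, here's a quick check using just the three dimensions shown above (real embeddings have hundreds, so treat this as a toy sketch):
import numpy as np
def cosine(a, b):
    # cosine similarity: 1 = same direction, 0 = unrelated, negative = opposite
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
cat = np.array([0.2, -0.5, 0.8])
dog = np.array([0.3, -0.4, 0.7])
car = np.array([-0.8, 0.3, -0.2])
print(cosine(cat, dog))  # high - cat and dog point the same way
print(cosine(cat, car))  # low (negative here) - car points elsewhere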
Why not just use one-hot encoding?
The naive approach - one-hot encoding:
cat = [1, 0, 0, 0, ...] # 10,000 dims for 10,000 word vocabulary
dog = [0, 1, 0, 0, ...]
car = [0, 0, 1, 0, ...]
This has serious problems:
- Huge vectors - vocabulary of 50k words = 50k dimensions
- All words equally distant - “cat” is as far from “dog” as from “refrigerator”
- No semantic meaning - the model can’t see that cat and dog are related
- Wasteful - 99.99% zeros
Embeddings solve ALL of these. You get dense 100-1000 dimensional vectors that capture meaning.
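Concretely, a learned embedding is just a trainable lookup table that maps a token index to a dense vector. A minimal PyTorch sketch (the vocabulary size, dimension, and indices are made up here):
import torch
import torch.nn as nn
vocab_size, embedding_dim = 10_000, 128  # vs. 10,000-dim one-hot vectors
embedding = nn.Embedding(vocab_size, embedding_dim)  # trainable lookup table
word_ids = torch.tensor([42, 1337, 7])  # hypothetical indices for cat, dog, car
vectors = embedding(word_ids)  # shape [3, 128], learned during training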
Word embeddings: where it started
Word2Vec, GloVe, and FastText showed the world that embeddings work. The key insight: words appearing in similar contexts have similar meanings.
“The ___ sat on the mat” → probably cat, dog, baby, etc.
Training on billions of sentences, the model learns that these words should have similar vectors.
from gensim.models import Word2Vec
# sentences: a tokenized corpus (list of lists of words), assumed to be loaded already
model = Word2Vec(sentences, vector_size=100, window=5, min_count=5)
# Similar words cluster together
model.wv.most_similar('king')
# [('queen', 0.8), ('prince', 0.7), ('monarch', 0.6), ...]
The famous limitation: one vector per word. “Bank” (river) and “bank” (financial institution) get the same embedding. Context-aware models like BERT fix this.
Sentence and document embeddings
Individual words are great, but often you need to represent entire sentences or documents.
Simple approach: average the word vectors
import numpy as np
sentence_vec = np.mean([word_vec(w) for w in sentence], axis=0)  # axis=0 averages across words, keeping the vector shape
Works okay for some tasks, but loses word order completely. “Dog bites man” = “Man bites dog” - not ideal!
Better approach: models trained specifically for sentence similarity
- Sentence-BERT
- Universal Sentence Encoder
- E5, BGE (strong, more recent open models)
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(['This is sentence one', 'Another sentence'])
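The resulting vectors can be compared directly. For example, sentence-transformers ships a cosine-similarity helper (using the two embeddings from above):
from sentence_transformers import util
score = util.cos_sim(embeddings[0], embeddings[1])  # closer to 1 = more similar
print(score)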
Contextual embeddings
BERT and friends give different vectors based on context.
“I sat by the river bank” → bank_vector_1
“I went to the bank to deposit money” → bank_vector_2
Different vectors! Context matters.
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')
# each token gets context-dependent vector
outputs = model(**tokenizer("hello world", return_tensors='pt'))
embeddings = outputs.last_hidden_state # [1, seq_len, 768]
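To see the “bank” example in code, here's a minimal sketch reusing the tokenizer and model loaded above (it assumes “bank” stays a single WordPiece token, which it does for bert-base-uncased):
import torch
def token_vector(sentence, word):
    # return the contextual vector of the first occurrence of `word` in `sentence`
    inputs = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # [seq_len, 768]
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0].tolist())
    return hidden[tokens.index(word)]
bank_river = token_vector("I sat by the river bank", "bank")
bank_money = token_vector("I went to the bank to deposit money", "bank")
print(torch.cosine_similarity(bank_river, bank_money, dim=0))  # clearly below 1: same word, different vectors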
Image embeddings
A CNN or Vision Transformer extracts features; the output of the last layer before the classification head is the image embedding.
import torch
from torchvision.models import resnet50, ResNet50_Weights
model = resnet50(weights=ResNet50_Weights.DEFAULT)
# remove classification head, keep everything up to global average pooling
model = torch.nn.Sequential(*list(model.children())[:-1])
model.eval()
# image: preprocessed tensor of shape [1, 3, 224, 224] → 2048-dim vector
with torch.no_grad():
    embedding = model(image).squeeze()
Or use CLIP for multi-modal embeddings (images and text in same space).
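For example, with the Hugging Face implementation of CLIP (a sketch; the image path and prompt are placeholders):
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')
image = Image.open('cat.jpg')  # placeholder image file
inputs = processor(text=['a photo of a cat'], images=image, return_tensors='pt', padding=True)
# text and image land in the same embedding space, so they can be compared directly
text_emb = model.get_text_features(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'])
image_emb = model.get_image_features(pixel_values=inputs['pixel_values'])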
Using embeddings
Similarity search
Find nearest neighbors in embedding space.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# find the k most similar embeddings to the query
similarities = cosine_similarity([query_emb], all_embeddings)
top_k = np.argsort(similarities[0])[-k:][::-1]  # indices of the k nearest, most similar first
Clustering
Group similar items.
from sklearn.cluster import KMeans
# assign each embedding to one of 10 clusters
clusters = KMeans(n_clusters=10).fit_predict(embeddings)
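A quick way to inspect the result, assuming items is a list of whatever was embedded, in the same order as the embeddings:
import numpy as np
for c in range(10):
    members = np.where(clusters == c)[0][:3]  # first few members of each cluster
    print(f"cluster {c}:", [items[i] for i in members])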
Embeddings make sense now? Give ML Animations a star ⭐ and share this post with your ML friends!