Embeddings are everywhere now. Words, sentences, images, users, products… anything can be an embedding. But what are they actually?

The basic idea

Convert discrete things into continuous vectors where similar things are close together.

Cat → [0.2, -0.5, 0.8, …]
Dog → [0.3, -0.4, 0.7, …]
Car → [-0.8, 0.3, -0.2, …]

Cat and dog vectors are close. Car is far from both. That’s the goal.
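"Close" is usually measured with cosine similarity. A quick numpy check on the (truncated) toy vectors above:

import numpy as np

def cos(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

cat = [0.2, -0.5, 0.8]
dog = [0.3, -0.4, 0.7]
car = [-0.8, 0.3, -0.2]

print(cos(cat, dog))  # ≈ 0.99: close
print(cos(cat, car))  # ≈ -0.56: far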

(Interactive demo: embedding space animation)

Why not one-hot?

One-hot encoding for words:

cat = [1, 0, 0, 0, ...]  # 10000 dims for 10000 words
dog = [0, 1, 0, 0, ...]
car = [0, 0, 1, 0, ...]

Problems:

  • Huge dimensionality
  • All words equally distant
  • No semantic information
  • Sparse (wasteful)

Embeddings fix all of these.
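The "equally distant" problem is easy to verify: any two distinct one-hot vectors are orthogonal, so "cat" is exactly as far from "dog" as from "car". A minimal numpy sketch:

import numpy as np

def one_hot(i, n=10000):
    v = np.zeros(n)
    v[i] = 1.0
    return v

cat, dog, car = one_hot(0), one_hot(1), one_hot(2)

print(cat @ dog, cat @ car)       # 0.0 0.0 (all pairs orthogonal)
print(np.linalg.norm(cat - dog))  # ≈ 1.414, i.e. sqrt(2), for every distinct pair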

Word embeddings

The first big success: Word2Vec, GloVe, FastText.

Trained on the distributional hypothesis: “words appearing in similar contexts have similar meanings.”

from gensim.models import Word2Vec

# sentences: an iterable of tokenized sentences (lists of words)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

# similar words have similar vectors
model.wv.most_similar('king')
# [('queen', 0.8), ('prince', 0.7), ...]
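The famous vector-arithmetic trick falls out of the same geometry; a sketch using gensim's analogy API (scores are illustrative):

# king - man + woman ≈ queen
model.wv.most_similar(positive=['king', 'woman'], negative=['man'])
# [('queen', 0.7), ...]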

Limitation: one vector per word. “Bank” (river) and “bank” (financial) get the same vector.

Sentence embeddings

Words are nice, but you often need sentence- or document-level embeddings.

Simple approach: average word embeddings

import numpy as np

# word_vec(w) looks up the embedding for word w; average over axis 0
sentence_vec = np.mean([word_vec(w) for w in sentence], axis=0)

Works okayish but loses word order.
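Since averaging is order-invariant, any reshuffling of the same words produces the identical vector (toy sketch, reusing word_vec from above):

s1 = "dog bites man".split()
s2 = "man bites dog".split()

v1 = np.mean([word_vec(w) for w in s1], axis=0)
v2 = np.mean([word_vec(w) for w in s2], axis=0)

assert np.allclose(v1, v2)  # identical vectors, opposite meanings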

Better: models trained for sentence similarity

  • Sentence-BERT
  • Universal Sentence Encoder
  • E5, BGE (newer, strong on retrieval benchmarks)

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(['This is sentence one', 'Another sentence'])
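Comparing them is then one call; sentence-transformers ships a cosine helper (util.cos_sim):

from sentence_transformers import util

# cosine similarity between the two sentence vectors
print(util.cos_sim(embeddings[0], embeddings[1]))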

Contextual embeddings

BERT and friends give different vectors based on context.

“I sat by the river bank” → bank_vector_1
“I went to the bank to deposit money” → bank_vector_2

Different vectors! Context matters.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

# each token gets a context-dependent vector
with torch.no_grad():
    outputs = model(**tokenizer("hello world", return_tensors='pt'))
embeddings = outputs.last_hidden_state  # [1, seq_len, 768]
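The bank example in code: a sketch reusing the tokenizer and model above ('bank' is a single token in bert-base-uncased's vocabulary):

import torch.nn.functional as F

def bank_vector(sentence):
    enc = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # [seq_len, 768]
    # position of the 'bank' token in this sentence
    idx = (enc['input_ids'][0] == tokenizer.convert_tokens_to_ids('bank')).nonzero()[0]
    return hidden[idx]

v1 = bank_vector("I sat by the river bank")
v2 = bank_vector("I went to the bank to deposit money")

# same word, different contexts → similarity well below 1.0
print(F.cosine_similarity(v1, v2))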

Image embeddings

A CNN or Vision Transformer extracts features; the activations just before the classification head serve as the image embedding.

import torch
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.DEFAULT)
# remove the classification head
model = torch.nn.Sequential(*list(model.children())[:-1])
model.eval()

# image: a preprocessed tensor of shape [1, 3, 224, 224] → 2048-dim vector
with torch.no_grad():
    embedding = model(image).squeeze()

Or use CLIP for multi-modal embeddings (images and text in same space).
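A minimal CLIP sketch via Hugging Face transformers, using the standard openai/clip-vit-base-patch32 checkpoint (pil_image is assumed to be a PIL image you've loaded elsewhere):

from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')

# text and images land in the same embedding space
text_emb = model.get_text_features(**processor(text=['a photo of a cat'], return_tensors='pt'))
image_emb = model.get_image_features(**processor(images=pil_image, return_tensors='pt'))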

Using embeddings

Similarity search

Find nearest neighbors in embedding space.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# find the k most similar embeddings to the query
similarities = cosine_similarity([query_emb], all_embeddings)
top_k = np.argsort(similarities[0])[-k:][::-1]  # indices, most similar first

Clustering

Group similar items.

from sklearn.cluster import KMeans

# assign each embedding to one of 10 clusters
clusters = KMeans(n_clusters=10, random_state=0).fit_predict(embeddings)