Before Word2Vec, representing words for ML was painful. One-hot encoding treated every word as equally different. TF-IDF captured some statistics but no meaning. Then in 2013, Tomas Mikolov and his team at Google published Word2Vec, and suddenly we could do things like:

king - man + woman ≈ queen

Wait, what? Word arithmetic that actually works? Let’s see how.

The brilliant insight

Words that appear in similar contexts have similar meanings.

Think about it: “dog” and “cat” both appear near words like “pet”, “cute”, “fur”, “feed”. The word “laptop” appears near totally different words. So if we train a model to predict context, “dog” and “cat” vectors will naturally end up close together.

The embeddings aren’t the goal - they’re a “side effect” of training a simple prediction task. And that side effect turned out to be incredibly useful.

Interactive demo: Word2Vec Animation - see how word vectors organize themselves during training.

Two training flavors

Skip-gram: Given a center word, predict the surrounding context words

"The cat sat on the mat"
Center word: "sat"
Predict: "The", "cat", "on", "the"

CBOW (Continuous Bag of Words): Given context words, predict the center

Context: "The", "cat", "on", "the"  
Predict: "sat"

Which is better? Skip-gram works better for smaller datasets and rare words. CBOW is faster and handles frequent words well. In practice, Skip-gram is more popular.
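
To make the pairs concrete, here's a tiny sketch in plain Python of how both flavors slice a sentence with a window of 2 (the helper code is illustrative, not from the original Word2Vec implementation):

sentence = ["the", "cat", "sat", "on", "the", "mat"]
window = 2

skipgram_pairs = []   # (center, context) pairs: skip-gram predicts each context word
cbow_pairs = []       # (context_list, center): CBOW predicts the center word

for i, center in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    skipgram_pairs.extend((center, c) for c in context)
    cbow_pairs.append((context, center))

print([p for p in skipgram_pairs if p[0] == "sat"])
# [('sat', 'the'), ('sat', 'cat'), ('sat', 'on'), ('sat', 'the')]
print(cbow_pairs[2])
# (['the', 'cat', 'on', 'the'], 'sat')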

The architecture (surprisingly simple)

It’s basically just one hidden layer, and that layer is purely linear (no activation function):

  • Input: one-hot encoded word (vocabulary size V)
  • Hidden: embedding dimension (typically 100-300)
  • Output: vocabulary size V, with softmax

Input (V) → Hidden (D) → Output (V)

The input-to-hidden weight matrix IS your embedding table: row i is the vector for word i. That’s it!
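
To make the shapes concrete, here's a rough NumPy sketch of the forward pass (the dimensions and the names W_in / W_out are my own, and real implementations batch this):

import numpy as np

V, D = 10_000, 100                      # toy sizes: vocabulary and embedding dimension
W_in = np.random.randn(V, D) * 0.01     # input→hidden weights: the embedding table
W_out = np.random.randn(V, D) * 0.01    # hidden→output weights

def forward(word_index):
    # multiplying a one-hot vector by W_in just selects a row: the word's embedding
    hidden = W_in[word_index]           # shape (D,)
    scores = W_out @ hidden             # shape (V,): one score per vocabulary word
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()              # softmax: P(context word | center word)

probs = forward(42)                     # distribution over possible context words

The expensive part is that softmax over all V words, which is exactly what the next trick avoids.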

Negative sampling

Computing the full softmax over all V vocabulary words for every training example is expensive, so don’t. Instead:

  • Real context pairs: positive examples
  • Random word pairs: negative examples

Only update weights for these few words per example.

# pseudo-code: one training pair plus k negative samples
for center, context in training_data:
    center_vec, context_vec = W_in[center], W_out[context]
    # positive pair: push its score up
    loss = -log(sigmoid(dot(center_vec, context_vec)))
    # negative pairs: push random words' scores down
    for neg_word in sample_negatives(k):
        loss += -log(sigmoid(-dot(center_vec, W_out[neg_word])))

Much faster. Quality nearly as good.
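
Here's a runnable NumPy version of that pseudo-code for a single (center, context) pair, reusing the W_in / W_out matrices from the sketch above (the learning rate and function name are illustrative, not gensim's internals):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_update(W_in, W_out, center, context, negatives, lr=0.025):
    v = W_in[center]                      # center word's embedding
    grad_v = np.zeros_like(v)
    # positive pair gets label 1, each negative sample gets label 0
    for word, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        u = W_out[word]
        g = sigmoid(v @ u) - label        # gradient of the loss w.r.t. the score v·u
        grad_v += g * u
        W_out[word] -= lr * g * v         # only these k+1 output rows are touched
    W_in[center] -= lr * grad_v

# e.g. sgns_update(W_in, W_out, center=2, context=1, negatives=[857, 23, 4511])  # arbitrary indices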

The famous analogies

“king - man + woman = queen”

This actually works (roughly). Relationships like gender (man→woman, king→queen) show up as roughly consistent offsets between vectors.

# assuming a trained gensim model (see "Training your own" below)
result = model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
print(result[0][0])  # 'queen' on a well-trained model

Practical considerations

Window size

  • Smaller (2-5): captures syntactic similarity
  • Larger (5-10): captures topic/semantic similarity

Embedding dimension

  • More dimensions = more capacity, but slower training and more data needed
  • Common: 100-300 for most applications

Minimum count

Words appearing < N times get filtered. Rare words don’t have enough context to learn good vectors.

Training your own

from gensim.models import Word2Vec

sentences = [["the", "cat", "sat"], ["the", "dog", "ran"]]

model = Word2Vec(
    sentences,
    vector_size=100,   # embedding dimension
    window=5,          # context window size
    min_count=1,       # keep every word (fine for a toy corpus; raise it on real data)
    sg=1               # 1 = skip-gram, 0 = CBOW
)

# get vector
cat_vec = model.wv['cat']

# similar words
model.wv.most_similar('cat')
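
Once training is done you usually only need the vectors, not the full model. A small follow-up sketch (the file name is arbitrary):

from gensim.models import KeyedVectors

# save just the lightweight word vectors and reload them later
model.wv.save("word2vec.wordvectors")
wv = KeyedVectors.load("word2vec.wordvectors")
print(wv.most_similar('cat'))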

Word2Vec intuition unlocked? Consider starring ML Animations and spreading the word on social media!