Before Word2Vec, representing words for ML was rough. One-hot encoding, TF-IDF… none of them captured meaning. Then Mikolov and team at Google published Word2Vec in 2013 and everything changed.

The big idea

Words that appear in similar contexts have similar meanings. “Dog” and “cat” both appear near “pet”, “cute”, “fur”. So their vectors should be close.

Train a simple neural network on a prediction task. The “side effect” is that word vectors learn to encode meaning.

Word2Vec Training


Two flavors

Skip-gram: Given center word, predict context words

“The cat sat on the mat”

  • Center: “sat”
  • Predict: “The”, “cat”, “on”, “the”

CBOW (Continuous Bag of Words): Given context, predict center

  • Context: “The”, “cat”, “on”, “the”
  • Predict: “sat”

Skip-gram works better for smaller datasets and rare words. CBOW is faster and works well with frequent words.
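The two objectives differ only in how training pairs are cut out of a sentence. A minimal sketch of both (the helper names here are illustrative, not from any particular library):

```python
def skipgram_pairs(tokens, window=2):
    """One (center, context) pair per context word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """One (all context words, center) pair per position."""
    pairs = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((context, center))
    return pairs

tokens = "the cat sat on the mat".split()
print(skipgram_pairs(tokens, window=2)[:3])  # [('the', 'cat'), ('the', 'sat'), ('cat', 'the')]
```

Skip-gram turns each position into several one-word predictions; CBOW turns it into a single prediction from the averaged context, which is why CBOW gets through a corpus faster.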

The architecture

Surprisingly simple:

  • Input layer: one-hot encoded word
  • Hidden layer: embedding dimension (typically 100-300)
  • Output layer: vocabulary size, softmax

Input (V) → Hidden (D) → Output (V)

The magic happens in the hidden layer. Those weights become your word vectors.
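A quick numpy sketch of the forward pass, with toy sizes. Note that multiplying a one-hot vector by the input weight matrix is just a row lookup, which is why those rows end up being the word vectors:

```python
import numpy as np

V, D = 5, 3                       # toy vocab size and embedding dimension
W_in = np.random.randn(V, D)      # input -> hidden weights: the word vectors
W_out = np.random.randn(D, V)     # hidden -> output weights

word_id = 2
one_hot = np.zeros(V)
one_hot[word_id] = 1.0

hidden = one_hot @ W_in           # identical to W_in[word_id]: a plain row lookup
scores = hidden @ W_out           # one score per vocabulary word (pre-softmax)

assert np.allclose(hidden, W_in[word_id])
```

In practice no implementation materializes the one-hot vector; it indexes the embedding matrix directly.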

Negative sampling

Don’t compute full softmax. Instead:

  • Real context pairs: positive examples
  • Random word pairs: negative examples

Only update weights for these few words per example.

# pseudo-code: W_in rows are center-word vectors, W_out rows are context vectors
for center, context in training_data:
    center_vec, context_vec = W_in[center], W_out[context]
    loss = -log(sigmoid(dot(center_vec, context_vec)))    # real pair: push together
    # add k negative samples drawn from the noise distribution
    for neg_word in sample_negatives(k):
        neg_vec = W_out[neg_word]
        loss += -log(sigmoid(-dot(center_vec, neg_vec)))  # fake pair: push apart

Much faster. Quality nearly as good.
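The negatives aren't drawn uniformly, either: the original paper samples from the unigram distribution raised to the 3/4 power, which boosts rare words relative to their raw frequency. A small sketch with made-up counts:

```python
import numpy as np

counts = np.array([100.0, 10.0, 1.0])  # toy word frequencies
probs = counts ** 0.75                 # the 3/4-power smoothing from the paper
probs /= probs.sum()

# raw frequencies: ~[0.90, 0.09, 0.01]; smoothed: ~[0.83, 0.15, 0.03]
negatives = np.random.choice(len(counts), size=5, p=probs)
```

Frequent words still dominate, just less overwhelmingly, so rare words get enough negative updates to learn from.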

The famous analogies

“king - man + woman = queen”

This actually works (roughly). Vectors capture semantic relationships.

# assuming we have word vectors
result = model['king'] - model['man'] + model['woman']
# nearest-neighbor search, excluding the three input words
# (otherwise 'king' itself is usually the closest vector)
most_similar = find_nearest(result)  # returns 'queen'

Practical considerations

Window size

  • Smaller (2-5): captures syntactic similarity
  • Larger (5-10): captures topic/semantic similarity

Embedding dimension

More dimensions = more capacity, but slower and needs more data.
Common: 100-300 for most applications.

Minimum count

Words appearing < N times get filtered. Rare words don’t have enough context to learn good vectors.

Training your own

from gensim.models import Word2Vec

sentences = [["the", "cat", "sat"], ["the", "dog", "ran"]]

model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimension
    window=5,
    min_count=1,      # keep every word (fine for a toy corpus)
    sg=1              # 1 = skip-gram, 0 = CBOW
)

# get vector
cat_vec = model.wv['cat']

# similar words
model.wv.most_similar('cat')