Before Word2Vec, representing words for ML was rough. One-hot encoding, TF-IDF… none of them captured meaning. Then Mikolov and team at Google published Word2Vec in 2013 and everything changed.
The big idea
Words that appear in similar contexts have similar meanings. “Dog” and “cat” both appear near “pet”, “cute”, “fur”. So their vectors should be close.
Train a simple neural network on a prediction task. The “side effect” is that word vectors learn to encode meaning.
See it in action: Word2Vec Animation
Two flavors
Skip-gram: Given center word, predict context words
“The cat sat on the mat”
- Center: “sat”
- Predict: “The”, “cat”, “on”, “the”
CBOW (Continuous Bag of Words): Given context, predict center
- Context: “The”, “cat”, “on”, “the”
- Predict: “sat”
Skip-gram works better for smaller datasets and rare words. CBOW is faster and works well with frequent words.
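The skip-gram side of this is easy to sketch in code. A minimal pair generator, assuming a window size of 2 (the function name and window choice are illustrative, not from the original):

```python
# Sketch: generate skip-gram (center, context) training pairs.
# Window size of 2 is an illustrative assumption.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "cat", "sat", "on", "the", "mat"])
# e.g. ("sat", "cat") and ("sat", "on") are among the pairs
```

CBOW training data is the same pairs viewed the other way around: the context words jointly predict the center.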
The architecture
Surprisingly simple:
- Input layer: one-hot encoded word
- Hidden layer: embedding dimension (typically 100-300)
- Output layer: vocabulary size, softmax
Input (V) → Hidden (D) → Output (V)
The magic happens in the hidden layer. Those weights become your word vectors.
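Why the hidden weights *are* the vectors: multiplying a one-hot input by the input-to-hidden matrix just selects one row of it. A tiny NumPy sketch (vocabulary size, dimension, and values are made up for illustration):

```python
import numpy as np

V, D = 5, 3  # toy vocabulary size and embedding dimension (illustrative)
rng = np.random.default_rng(0)
W = rng.normal(size=(V, D))  # input-to-hidden weights: one row per word

word_id = 2
one_hot = np.zeros(V)
one_hot[word_id] = 1.0

# A one-hot vector times W is exactly row `word_id` of W:
hidden = one_hot @ W
assert np.allclose(hidden, W[word_id])
```

So in practice nobody materializes the one-hot vector; training is implemented as a row lookup into W.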
Negative sampling
Don’t compute full softmax. Instead:
- Real context pairs: positive examples
- Random word pairs: negative examples
Only update weights for these few words per example.
```
# pseudo-code
for center, context in training_data:
    loss = -log(sigmoid(dot(center_vec, context_vec)))
    # add k negative samples
    for neg_word in sample_negatives(k):
        loss += -log(sigmoid(-dot(center_vec, embedding[neg_word])))
```
Much faster. Quality nearly as good.
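The loss above can be made runnable with NumPy. This is a sketch of a single training example, with random toy vectors and an illustrative choice of k (no gradient step shown):

```python
import numpy as np

D, k = 50, 5  # embedding dimension and negatives per example (illustrative)
rng = np.random.default_rng(42)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

center_vec = rng.normal(scale=0.1, size=D)
context_vec = rng.normal(scale=0.1, size=D)   # positive example
neg_vecs = rng.normal(scale=0.1, size=(k, D)) # sampled negatives

# positive pair: push the dot product up
loss = -np.log(sigmoid(center_vec @ context_vec))
# negative pairs: push the dot products down
loss += -np.log(sigmoid(-(neg_vecs @ center_vec))).sum()
```

Only the center vector and these k+1 context vectors get gradient updates, versus all V output weights under a full softmax.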
The famous analogies
“king - man + woman = queen”
This actually works (roughly). Vectors capture semantic relationships.
```
# with a trained gensim model: vector arithmetic via most_similar
result = model.wv.most_similar(positive=['king', 'woman'], negative=['man'])
# result[0][0] is typically 'queen' (with good embeddings)
```
Practical considerations
Window size
- Smaller (2-5): captures syntactic similarity
- Larger (5-10): captures topic/semantic similarity
Embedding dimension
- More dimensions = more capacity, but slower training and more data needed
- Common: 100-300 for most applications
Minimum count
Words appearing < N times get filtered. Rare words don’t have enough context to learn good vectors.
Training your own
```
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat"], ["the", "dog", "ran"]]
model = Word2Vec(
    sentences,
    vector_size=100,
    window=5,
    min_count=1,
    sg=1,  # skip-gram (sg=0 for CBOW)
)

# get vector
cat_vec = model.wv['cat']

# similar words
model.wv.most_similar('cat')
```