Before embeddings there was Bag of Words. Still useful, still relevant for some tasks. And understanding it makes it clear why newer methods are better.

What is it?

Represent each document as a vector of word counts. Ignore word order completely.

Document 1: “The cat sat on the mat”
Document 2: “The dog sat on the log”

Vocabulary: [the, cat, sat, on, mat, dog, log]

Document 1: [2, 1, 1, 1, 1, 0, 0]
Document 2: [2, 0, 1, 1, 0, 1, 1]

That’s it. Count each word.

Bag of Words Process

See it visualized: Bag of Words Animation

Building it

from collections import Counter

def bag_of_words(documents):
    # build the vocabulary: every distinct word across all documents
    # (lowercased, so "The" and "the" count as the same word)
    vocab = set()
    for doc in documents:
        vocab.update(doc.lower().split())
    vocab = sorted(vocab)

    # vectorize: one count per vocabulary word, in vocabulary order
    vectors = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        vec = [counts.get(w, 0) for w in vocab]
        vectors.append(vec)

    return vectors, vocab
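
Using it on the two example documents (note the vocabulary comes out sorted, so the column order differs from the walkthrough above):

docs = ["The cat sat on the mat", "The dog sat on the log"]
vectors, vocab = bag_of_words(docs)

print(vocab)     # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(vectors)   # [[1, 0, 0, 1, 1, 1, 2], [0, 1, 1, 0, 1, 1, 2]]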

Or just use sklearn:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)
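
A quick look at what it produces (get_feature_names_out needs scikit-learn 1.0+; older versions call it get_feature_names):

docs = ["The cat sat on the mat", "The dog sat on the log"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
# ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
print(X.toarray())   # same counts as the hand-rolled version
# [[1 0 0 1 1 1 2]
#  [0 1 1 0 1 1 2]]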

The problems

Ignores word order

“Dog bites man” and “Man bites dog” have identical BoW vectors. Completely different meaning.
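
A quick check of that claim (a tiny sketch reusing CountVectorizer):

from sklearn.feature_extraction.text import CountVectorizer

X = CountVectorizer().fit_transform(["Dog bites man", "Man bites dog"])
print(X.toarray())
# [[1 1 1]
#  [1 1 1]]   <- identical vectors, opposite meanings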

Sparse and high dimensional

A 10,000-word vocabulary means 10,000-dimensional vectors. Mostly zeros.

No semantic similarity

“Happy” and “joyful” are as distant as “happy” and “angry”. No meaning captured.

Common words dominate

“The”, “is”, and “a” appear everywhere, so they don’t help distinguish documents.

TF-IDF to the rescue

Term Frequency - Inverse Document Frequency

Weight words by:

  • How often they appear in this document (TF)
  • How rare they are across all documents (IDF)

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$

$$\text{IDF}(t) = \log\frac{N}{|\{d : t \in d\}|}$$

where $N$ is the total number of documents and the denominator is the number of documents containing term $t$.

Words appearing in every document get low weight. Rare, distinctive words get high weight.
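
A minimal sketch of the raw formula (scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each row, so its exact numbers differ):

import math
from collections import Counter

def tf_idf(documents):
    # tokenize: lowercase and split on whitespace
    docs = [doc.lower().split() for doc in documents]
    N = len(docs)
    vocab = sorted(set(w for d in docs for w in d))

    # IDF(t) = log(N / number of documents containing t)
    idf = {t: math.log(N / sum(1 for d in docs if t in d)) for t in vocab}

    # TF-IDF(t, d) = count of t in d, times IDF(t)
    vectors = []
    for d in docs:
        counts = Counter(d)
        vectors.append([counts[t] * idf[t] for t in vocab])
    return vectors, vocab

On the two example documents, "the" appears in both, so its IDF is log(2/2) = 0 and it vanishes; "cat" appears in only one, so it gets log(2/1) ≈ 0.69. In practice, use sklearn: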

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

N-grams

Capture some word order by including consecutive word pairs (bigrams) and triples (trigrams).

“The cat sat” with bigrams:

  • Unigrams: [the, cat, sat]
  • Bigrams: [the_cat, cat_sat]

vectorizer = CountVectorizer(ngram_range=(1, 2))

Vocabulary explodes but captures more structure.
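
For example (a small sketch; note scikit-learn joins n-grams with a space, not an underscore):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(["the cat sat", "the dog sat"])
print(vectorizer.get_feature_names_out())
# ['cat' 'cat sat' 'dog' 'dog sat' 'sat' 'the' 'the cat' 'the dog']

Four distinct words turn into eight features on just two three-word sentences.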

When BoW still works

  • Document classification (news categories, spam)
  • Search and information retrieval (with TF-IDF)
  • Baseline for comparison
  • When you need interpretability
  • Small datasets

When it fails

  • Sentiment analysis (word order and negation matter)
  • Question answering
  • Anything requiring understanding
  • Short texts (not enough words)