In this follow-up to our earlier post on N-Grams and Bag of Words, we’ll deconstruct two more fundamental Natural Language Processing (NLP) concepts: TF-IDF and Word2Vec.

What is TF-IDF?

Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic that reflects how important a word is to a document in a corpus. Unlike the Bag of Words technique, which simply counts the words present in a document, TF-IDF assigns a weight to each word. This weight helps identify which words in a document are important relative to the rest of the corpus.

In TF-IDF, TF stands for ‘Term Frequency’, i.e., how frequently a term appears in a document, while IDF stands for ‘Inverse Document Frequency’, which diminishes the weight of terms that appear in many documents and increases the weight of terms that appear rarely.

Here is a simple example of how you can compute TF-IDF vectors in Python using scikit-learn’s TfidfVectorizer.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

# Suppose we have the following texts
texts = ["I love to play football", "He loves to play basketball", "Basketball is his favorite"]

# Learn the vocabulary and IDF weights from the corpus
vectorizer.fit(texts)

# Transform a new document into its TF-IDF vector
print(vectorizer.transform(["I love to play basketball"]).toarray())
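
To see which column of the resulting vector corresponds to which word, you can inspect the fitted vectorizer. This is a minimal sketch assuming scikit-learn 1.0 or newer (older versions use get_feature_names instead of get_feature_names_out):

# Vocabulary learned from the corpus, in column order
# (single-character tokens such as "I" are dropped by the default token pattern)
print(vectorizer.get_feature_names_out())

# The IDF weight learned for each vocabulary term
print(vectorizer.idf_)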

Understanding the TF-IDF Formula

TF-IDF stands for Term Frequency - Inverse Document Frequency. It consists of two components:

  1. Term Frequency (TF)
  2. Inverse Document Frequency (IDF)

Before we move forward, let’s understand these components in detail:

Term Frequency (TF)

This calculates the number of times a word appears in a document divided by the total number of words in that document. Every document has its own term frequency. The formula is:

$$\text{TF}(t) = \frac{\text{Number of times term } t \text{ appears in a document}}{\text{Total number of terms in the document}}$$

Let’s say we have a document containing 100 words where the word ‘cat’ appears 3 times. The term frequency for ‘cat’ is then (3 / 100) = 0.03.
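
As a quick sanity check, here is that calculation in plain Python (the 100-word document is just the toy example above, not a real dataset):

document = ["cat"] * 3 + ["other"] * 97   # toy 100-word document in which 'cat' appears 3 times

def term_frequency(term, doc_tokens):
    # Occurrences of the term divided by the total number of tokens in the document
    return doc_tokens.count(term) / len(doc_tokens)

print(term_frequency("cat", document))  # 0.03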

Inverse Document Frequency (IDF)

This measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as “is”, “of”, and “that”, may appear many times yet carry little importance. Thus we need to weigh down terms that appear in many documents while scaling up the rare ones, by computing the following:

$$\text{IDF}(t) = \log_{e}\left(\frac{\text{Total number of documents}}{\text{Number of documents with term } t \text{ in it}}\right)$$

For example, let’s assume we have 10,000 documents and the word ‘cat’ appears in 100 of these. Then, the Inverse Document Frequency, i.e., IDF, of ‘cat’ is log_e(10,000 / 100) = log_e(100) ≈ 4.6.
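
The same calculation in Python, using the natural logarithm from the formula above and the toy counts from the example:

import math

def inverse_document_frequency(total_docs, docs_containing_term):
    # Natural log of (total documents / documents containing the term)
    return math.log(total_docs / docs_containing_term)

print(inverse_document_frequency(10_000, 100))  # ~4.605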

Combining these together

The TF-IDF weight is the product of these quantities:

$$\text{TF-IDF}(t) = \text{TF}(t) \times \text{IDF}(t)$$

So, for a term ’t’ in a document, the TF-IDF score is the product of its term frequency in the document and its inverse document frequency across the entire document corpus.

This numeric score reflects how important a word is to a document in the corpus. The higher the TF-IDF score, the more important that term is to that document.
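
Putting the two pieces together for the ‘cat’ example gives a minimal from-scratch sketch (the helper name tf_idf is just illustrative):

import math

def tf_idf(term_count, doc_length, total_docs, docs_with_term):
    tf = term_count / doc_length                  # Term Frequency
    idf = math.log(total_docs / docs_with_term)   # Inverse Document Frequency (natural log)
    return tf * idf

# 'cat' appears 3 times in a 100-word document, and in 100 of 10,000 documents
print(tf_idf(3, 100, 10_000, 100))  # ~0.138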

What is Word2Vec?

Word2Vec is a method for learning word embeddings, i.e., dense vector representations of words. The embeddings can be learned with either of two neural-network architectures: Skip-gram and Continuous Bag of Words (CBOW).

Word2Vec represents words in a vector space where similar words are grouped together. This means words sharing similar contexts are located close to each other in the space. The beauty of Word2Vec is that it captures the semantic relationship between words.

Word2Vec models normally need a lot of text to learn useful vectors, so in practice you would often reach for a pre-trained model, such as the one Google trained on the Google News dataset. For illustration, though, we’ll train a tiny model on a toy corpus with gensim.

from gensim.models import Word2Vec
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

sentences = [['I', 'love', 'to', 'play', 'football'],
             ['He', 'loves', 'to', 'play', 'basketball'],
             ['Basketball', 'is', 'his', 'favorite']]

# train model (min_count=1 keeps every word; the default architecture is CBOW, pass sg=1 for Skip-gram)
model = Word2Vec(sentences, min_count=1)

# collect the vocabulary and its vectors (gensim 4.x API)
words = list(model.wv.index_to_key)
X = model.wv[words]

# fit a 2d PCA model to the vectors
pca = PCA(n_components=2)
result = pca.fit_transform(X)

# create a scatter plot of the projection
plt.scatter(result[:, 0], result[:, 1])
for i, word in enumerate(words):
    plt.annotate(word, xy=(result[i, 0], result[i, 1]))
plt.show()
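
Once a model is trained (or a pre-trained one is loaded), querying it for similar words uses the same API. With the three-sentence toy corpus above the neighbours are essentially random, so the commented-out lines sketch how one could load Google’s pre-trained News vectors via gensim’s downloader instead (a download of roughly 1.6 GB):

# Nearest neighbours of 'play' in the toy model (not meaningful on such a small corpus)
print(model.wv.most_similar('play', topn=3))

# For meaningful neighbours, load pre-trained vectors instead:
# import gensim.downloader as api
# wv = api.load('word2vec-google-news-300')   # Google News pre-trained vectors
# print(wv.most_similar('basketball', topn=5))
# print(wv.similarity('football', 'basketball'))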

Conclusion

In summary, both TF-IDF and Word2Vec provide robust and more sophisticated ways to represent human language numerically. By recognizing the significance of words (TF-IDF) or by understanding the context and semantic relationships between words (Word2Vec), these techniques offer more than just a simple bag of words. They form the backbone of various advanced NLP tasks such as semantic search, sentiment analysis, and even dialogue systems for conversational AI.