Word2Vec problem: what about misspellings? What about “unhappiness” when you only trained on “happy”? What about German compound words?

FastText fixes this by using subword information. Facebook AI released it in 2016.

The core idea

Instead of one vector per word, break words into character n-grams.

“where” with n=3:

  • “<wh”, “whe”, “her”, “ere”, “re>”
  • Plus the word itself: “<where>”

Word vector = sum of all its n-gram vectors.

def get_ngrams(word, min_n=3, max_n=6):
    word = '<' + word + '>'  # boundary markers
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(word) - n + 1):
            ngrams.append(word[i:i+n])
    return ngrams

get_ngrams('cat')
# ['<ca', 'cat', 'at>', '<cat', 'cat>', '<cat>']
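Here's a toy sketch of that sum. The `bucket_vectors` table is made up: the real library hashes n-grams into roughly 2 million buckets of trained vectors (using an FNV hash), while random vectors and `crc32` stand in here just to show the mechanics.

```python
import zlib

import numpy as np

def get_ngrams(word, min_n=3, max_n=6):
    word = '<' + word + '>'  # boundary markers
    return [word[i:i+n]
            for n in range(min_n, max_n + 1)
            for i in range(len(word) - n + 1)]

# Stand-in bucket table: random vectors instead of trained ones,
# and far fewer buckets than real fastText uses.
rng = np.random.default_rng(0)
num_buckets, dim = 50_000, 100
bucket_vectors = rng.normal(size=(num_buckets, dim))

def word_vector(word):
    # Word vector = sum of its n-gram bucket vectors.
    ids = [zlib.crc32(g.encode()) % num_buckets for g in get_ngrams(word)]
    return bucket_vectors[ids].sum(axis=0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Even with untrained random bucket vectors, shared n-grams
# already create similarity:
print(cosine(word_vector('where'), word_vector('wheres')))  # high
print(cosine(word_vector('where'), word_vector('banana')))  # near 0
```

Note that `'wheres'` scores high purely because it shares n-grams with `'where'`; training then sharpens these vectors further.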

Interactive demo: FastText Animation

Why this works for OOV

Never seen “unhappyness” (misspelled)?

Break it into n-grams. Some of those n-grams appeared in:

  • “happy”
  • “unhappy”
  • “happiness”
  • “sadness”

The vector is built from n-grams the model did see during training. Not perfect, but way better than the missing or random vector Word2Vec gives you for an unseen word.
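You can make the overlap concrete by intersecting n-gram sets (the helper from earlier is repeated here so the snippet runs on its own):

```python
def get_ngrams(word, min_n=3, max_n=6):
    word = '<' + word + '>'  # boundary markers
    return {word[i:i+n]
            for n in range(min_n, max_n + 1)
            for i in range(len(word) - n + 1)}

misspelled = get_ngrams('unhappyness')
for known in ['happy', 'unhappy', 'happiness', 'sadness']:
    shared = misspelled & get_ngrams(known)
    print(f'{known:10s} shares {len(shared)} n-grams')
```

“unhappy” contributes the prefix grams, “happiness” and “sadness” the suffix grams, so the misspelled word lands near its relatives anyway.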

Practical advantages

Morphologically rich languages

Finnish, Turkish, German… words have many forms. FastText handles this naturally because related forms share n-grams.

“playing”, “played”, “plays” all share “play” n-grams.
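A quick check with the same n-gram helper (duplicated so the snippet is standalone) shows exactly which grams the three forms share:

```python
def get_ngrams(word, min_n=3, max_n=6):
    word = '<' + word + '>'  # boundary markers
    return {word[i:i+n]
            for n in range(min_n, max_n + 1)
            for i in range(len(word) - n + 1)}

common = get_ngrams('playing') & get_ngrams('played') & get_ngrams('plays')
print(sorted(common))
# ['<pl', '<pla', '<play', 'lay', 'pla', 'play']
```

All six shared grams cover the stem “play”, so the three inflected forms pull toward the same region of the vector space.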

Typos and variations

“learning”, “leanring”, “lerning” end up with similar vectors because they still share most of their n-grams.

Rare words

Word appearing once? In Word2Vec its vector is garbage: a single occurrence isn't enough signal to learn from. In FastText, its n-grams have already been seen many times in other words.

Using FastText

Official library:

import fasttext

# train
model = fasttext.train_unsupervised(
    'data.txt',
    model='skipgram',  # or 'cbow'
    dim=100,
    minn=3,  # min n-gram
    maxn=6,  # max n-gram
)

# get vector (works for any word!)
vec = model.get_word_vector('somemadeupword')

# similar words
model.get_nearest_neighbors('cat')

With Gensim:

from gensim.models import FastText

model = FastText(
    sentences,
    vector_size=100,
    window=5,
    min_count=1,
    min_n=3,
    max_n=6,
)

Pretrained vectors

Facebook released pretrained FastText vectors for 157 languages, trained on Wikipedia + Common Crawl.