ML models don’t read text. They need numbers. Tokenization bridges that gap. Seems simple but lots of nuance.

The basic question

How do you split “Don’t tokenize carelessly!”?

Options:

  • Character level: [“D”, “o”, “n”, “’”, “t”, …]
  • Word level: [“Don’t”, “tokenize”, “carelessly”, “!”]
  • Subword: [“Don”, “’t”, “token”, “ize”, “care”, “less”, “ly”, “!”]

Each has tradeoffs.

Tokenization Methods

See different methods: Tokenization Animation

Character-level

Every character is a token.

Pros:

  • Small vocabulary (just alphabet + symbols)
  • No OOV (out of vocabulary) words
  • Works for any language

Cons:

  • Very long sequences
  • Harder to learn word-level meaning
  • More compute needed

Rarely used alone now.
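
The whole idea fits in a few lines of Python. A toy sketch (real character tokenizers also handle unseen characters and special tokens):

# Toy character-level tokenizer: every character seen in the text becomes an id.
text = "Don't tokenize carelessly!"

vocab = sorted(set(text))                          # tiny vocabulary
char_to_id = {ch: i for i, ch in enumerate(vocab)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

ids = [char_to_id[ch] for ch in text]              # encode: one id per character
decoded = "".join(id_to_char[i] for i in ids)      # decoding is lossless

print(len(vocab), len(ids))                        # small vocab, long sequence
assert decoded == text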

Word-level

Split on whitespace and punctuation.

text = "Hello, world!"
tokens = text.split()  # ["Hello,", "world!"]
# or with regex
import re
tokens = re.findall(r'\w+|[^\w\s]', text)  # ["Hello", ",", "world", "!"]

Pros:

  • Intuitive
  • Short sequences
  • Each token meaningful

Cons:

  • Huge vocabulary (every word form)
  • OOV problems (“unbelievable” not in vocab?)
  • Morphology issues (run, runs, running = 3 tokens)

Subword tokenization

The sweet spot. Break unknown words into known pieces.

“unhappiness” → [“un”, “happi”, “ness”]

Model can understand new words from known components.

BPE - Byte Pair Encoding

Start with individual characters. Repeatedly merge the most frequent adjacent pair of symbols.

Corpus: "low low low lower lowest"

Initial: l o w </w> l o w </w> l o w </w> l o w e r </w> l o w e s t </w>

Iteration 1: merge "l o" → "lo"
Iteration 2: merge "lo w" → "low"
... continue until vocabulary size reached

GPT-2/3/4 use BPE.
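
The merge loop is short enough to sketch in plain Python. This is a toy trainer on the corpus above, not how production BPE works (GPT-2's operates on bytes and is heavily optimized):

from collections import Counter

corpus = "low low low lower lowest"

# Represent each word as a tuple of symbols, with </w> marking the word end.
words = Counter(tuple(w) + ("</w>",) for w in corpus.split())

def most_frequent_pair(words):
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge(words, pair):
    a, b = pair
    merged = Counter()
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(a + b)   # replace the pair with a single merged symbol
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] += freq
    return merged

merges = []
for _ in range(10):                 # in practice: until the target vocab size
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge(words, pair)

print(merges[:3])   # first merges: ('l', 'o'), ('lo', 'w'), ...
print(words)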

WordPiece

Similar to BPE, but merges are chosen by how much they increase the likelihood of the training data rather than by raw pair frequency.

BERT uses WordPiece. Tokens that continue a word rather than start one are prefixed with ##.

“unhappiness” → [“un”, “##happi”, “##ness”]
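
Training WordPiece is more involved, but applying a trained vocabulary is just greedy longest-match-first lookup. A sketch with a made-up three-piece vocabulary:

# Greedy longest-match-first lookup over a WordPiece-style vocabulary.
# The vocabulary here is made up for illustration.
vocab = {"un", "##happi", "##ness"}

def wordpiece_tokenize(word, vocab):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:              # try the longest remaining substring first
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub        # continuation pieces get the ## prefix
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]            # nothing matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

print(wordpiece_tokenize("unhappiness", vocab))  # ['un', '##happi', '##ness']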

SentencePiece

Treats whitespace as just another symbol (▁), so it runs on raw text without a separate pre-tokenization step. Language agnostic.

“hello world” → [“▁hello”, “▁world”]

Used by T5, LLaMA.
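
The ▁ marker is what makes detokenization trivial: you never have to guess where spaces go, you just concatenate.

# Because spaces live inside the tokens as ▁, decoding is plain string joining.
tokens = ["▁hello", "▁world"]
text = "".join(tokens).replace("▁", " ").lstrip()
print(text)   # "hello world"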

In practice

Using transformers library:

from transformers import AutoTokenizer

# BERT (WordPiece)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize("unhappiness")
# ['un', '##happiness']

# GPT-2 (BPE)
tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokens = tokenizer.tokenize("unhappiness")
# ['un', 'happiness']

# T5 (SentencePiece)
tokenizer = AutoTokenizer.from_pretrained('t5-base')
tokens = tokenizer.tokenize("unhappiness")
# ['▁un', 'happiness']
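
The tokens above are still strings; what the model actually consumes are integer ids. The same tokenizer objects handle that conversion (ids not shown, since they depend on the vocabulary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')

tokens = tokenizer.tokenize("unhappiness")        # token strings
ids = tokenizer.convert_tokens_to_ids(tokens)     # integer ids for the model
print(ids)

# Or do both steps at once, and go back again:
ids = tokenizer.encode("unhappiness")
print(tokenizer.decode(ids))                      # "unhappiness"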

Vocabulary size

Bigger vocab:

  • Shorter sequences
  • Each token more meaningful
  • More parameters in embedding layer

Smaller vocab:

  • Longer sequences
  • Better generalization
  • Less memory

Common sizes:

  • BERT: ~30,000 (30,522 for bert-base-uncased)
  • GPT-2: 50,257
  • LLaMA: 32,000
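
You can read these numbers straight off the tokenizers. Note that len() also counts any tokens added on top of the base vocabulary, so it can differ slightly from vocab_size:

from transformers import AutoTokenizer

for name in ['bert-base-uncased', 'gpt2', 't5-base']:
    tokenizer = AutoTokenizer.from_pretrained(name)
    # vocab_size: base vocabulary; len(): base vocabulary plus added tokens
    print(name, tokenizer.vocab_size, len(tokenizer))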