Qualitative feature encoding

Introduction

The format and quality of the input data have a large impact on the performance of machine learning models. Categorical data, which is common in real-world datasets, must be converted into a numerical representation before most algorithms can use it. This conversion is known as encoding.

Dummy encoding with Pandas

Pandas’ ‘get_dummies’ method is an easy and convenient way to convert categorical variable(s) into dummy/indicator variables. It creates one binary column per category and returns a DataFrame (dense by default; a sparse-backed result can be requested with sparse=True).

It has a drawback, however: if the categorical variable has many categories, the number of columns, and therefore the memory required, grows quickly.

import pandas as pd

# Sample data
data = {'color': ['blue', 'green', 'green', 'red']}
df = pd.DataFrame(data)

# Using get_dummies; drop_first=True drops the first category ('blue')
# to avoid a redundant, perfectly collinear column
df_dummies = pd.get_dummies(df, columns=['color'], drop_first=True)

print(df_dummies)
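
If memory becomes a concern with many categories, get_dummies also accepts a sparse=True argument that backs the indicator columns with pandas sparse arrays. A minimal sketch on the same toy data:

import pandas as pd

# Same toy data as above
data = {'color': ['blue', 'green', 'green', 'red']}
df = pd.DataFrame(data)

# sparse=True backs each indicator column with a pandas SparseArray,
# which saves memory when a column has many rare categories
df_sparse = pd.get_dummies(df, columns=['color'], sparse=True)

print(df_sparse.dtypes)          # sparse dtypes for the indicator columns
print(df_sparse.memory_usage())  # compare against the dense version above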

DictVectorizer

DictVectorizer is a scikit-learn transformer that converts lists of feature-value mappings (Python dicts) into feature vectors. It is useful when feature extraction naturally produces dictionaries and we want to turn them into NumPy arrays or SciPy sparse matrices.

Like get_dummies, DictVectorizer can consume considerably more memory as the number of distinct categories grows, especially when a dense output is requested.

from sklearn.feature_extraction import DictVectorizer

# Sample data
data_dict = [{'Red': 2, 'Blue': 4}, 
             {'Red': 4, 'Blue': 3}, 
             {'Red': 1, 'Yellow': 2}, 
             {'Red': 2, 'Yellow': 2}]

# Create the DictVectorizer object: dv
dv = DictVectorizer(sparse=False)

# Apply dv to data_dict: df_encoded
df_encoded = dv.fit_transform(data_dict)

print(df_encoded)
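
A useful property of DictVectorizer is that it remembers which column corresponds to which feature, so the encoding can be inspected and reversed. A short follow-up on the dv and df_encoded objects from the snippet above (get_feature_names_out assumes a recent scikit-learn; older releases expose the same information as dv.feature_names_):

# Columns are ordered by feature name, e.g. Blue, Red, Yellow
print(dv.get_feature_names_out())

# Map the first encoded row back to a feature dictionary
print(dv.inverse_transform(df_encoded[:1]))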

One-hot encoding

One-hot encoding generates new binary columns that indicate the presence of each possible value of the original feature. While it is a powerful tool, it can lead to very high dimensionality if the categorical variable has many categories.

from sklearn.preprocessing import OneHotEncoder

# Sample data
data = [['blue'], ['green'], ['green'], ['red']]

# sparse_output=False returns a dense NumPy array
# (on scikit-learn versions older than 1.2 the parameter is called sparse)
onehot_encoder = OneHotEncoder(sparse_output=False)

onehot_encoded = onehot_encoder.fit_transform(data)
print(onehot_encoded)
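
Data seen at prediction time often contains categories that were absent during training. OneHotEncoder can be told to ignore such values instead of raising an error. A minimal sketch, again assuming a recent scikit-learn (older versions use sparse rather than sparse_output):

from sklearn.preprocessing import OneHotEncoder

# Training data
data = [['blue'], ['green'], ['green'], ['red']]

# handle_unknown='ignore' encodes unseen categories as an all-zeros row
# instead of raising an error at transform time
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoder.fit(data)

# 'yellow' was never seen during fit, so its row is all zeros
print(encoder.transform([['yellow'], ['red']]))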

Hashing trick

The hashing trick is useful when a categorical feature has a very large number of categories and we want to limit the dimensionality. It lets you fix the number of output features in advance. It may, however, produce collisions, where different strings map to the same feature index.

from sklearn.feature_extraction import FeatureHasher

# Sample data
data_dict = [{'Red': 2, 'Blue': 4}, 
             {'Red': 4, 'Blue': 3}, 
             {'Red': 1, 'Yellow': 2}, 
             {'Red': 2, 'Yellow': 2}]

# Create a FeatureHasher object: hasher
hasher = FeatureHasher(n_features=4)

# Apply hasher to data_dict: hashed_features
hashed_features = hasher.transform(data_dict)

print(hashed_features.toarray())
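
FeatureHasher can also hash raw strings rather than dicts, which is convenient when the categorical column is simply a list of labels. A small sketch using input_type='string' on the colour data from earlier; note that with a small n_features different labels may collide on the same column:

from sklearn.feature_extraction import FeatureHasher

# One list of string tokens per sample
samples = [['blue'], ['green'], ['green'], ['red']]

# input_type='string' hashes each token directly into one of n_features columns
string_hasher = FeatureHasher(n_features=4, input_type='string')
hashed = string_hasher.transform(samples)

print(hashed.toarray())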

Conclusion

Choosing the right encoding approach for your categorical data can dramatically boost the performance of your machine learning models. The best choice depends on several factors, including the number of categories, what the categories mean, and the type of machine learning algorithm used.

Further Reading

Here are some resources to help you learn more about categorical feature encoding techniques:

  1. Scikit-Learn’s Preprocessing Guide
  2. Kaggle’s Intermediate Machine Learning Course
  3. An Overview of Categorical Input Handling for Neural Networks by Keras