Raw data is often messy and missing values are a common issue that data scientists have to deal with. Ignoring these missing values or handling them improperly may lead to biased or incorrect results. Therefore, understanding and correctly handling missing data is a critical step in the data preparation process.
Strategies for Handling Missing Values
There are several strategies for handling missing values. Each has its own advantages and disadvantages, and the right choice depends on the problem at hand. Here are some of the most common strategies:
- Ignore the feature: If a feature has too many missing values, it might be best to drop the feature altogether.
- Drop the rows: If only a few rows have missing values, it might be best to drop those rows.
- Apply a calculation: For numerical features, you might replace the missing values with the mean, median, or mode. For categorical features, you might replace the missing values with the most common category. If rows fall into related groups (for example, records sharing a category), you might impute within each group rather than across the whole column.
- Create a new column: Create a new binary feature that indicates whether or not the value was missing. This might be useful if the fact that the value is missing is informative.
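The strategies above can be sketched in a few lines of pandas. The column names and the 50% drop threshold below are illustrative choices, not part of the original text:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [25, np.nan, 40, 35],
    'city': ['NY', 'LA', None, 'NY'],
    'mostly_empty': [np.nan, np.nan, np.nan, 1.0],
})

# Ignore the feature: drop columns that are mostly missing
df_cols = df.drop(columns=[c for c in df.columns if df[c].isnull().mean() > 0.5])

# Drop the rows: remove rows containing any missing value
df_rows = df.dropna()

# Apply a calculation: fill numeric gaps with the median,
# categorical gaps with the most common category
df_filled = df.copy()
df_filled['age'] = df_filled['age'].fillna(df_filled['age'].median())
df_filled['city'] = df_filled['city'].fillna(df_filled['city'].mode()[0])

# Create a new column: flag which rows were originally missing
df_filled['age_was_missing'] = df['age'].isnull()
```

Note that the indicator column is computed from the original `df`, before any filling, so the information about which values were missing is preserved.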
Visualizing Missing Values
It’s often useful to visualize missing data to understand its extent and pattern. Libraries like missingno make this straightforward:
```python
import missingno as msno
import pandas as pd

# Assuming df is your DataFrame
msno.matrix(df)
```
Imputing Numerical Values
For numerical values, you can use the mean, median or mode to replace missing values. You might also consider more sophisticated methods, like regression imputation or multiple imputation.
Here is an example using pandas:
```python
# Imputing with the mean
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())

# Imputing with the median
df['column_name'] = df['column_name'].fillna(df['column_name'].median())
```

(Assigning the result back is preferred over `inplace=True` on a column, which behaves unreliably with chained indexing in recent pandas versions.)
Imputing Categorical Values
For categorical values, you can replace missing values with the most common category. You can also consider using a placeholder value that indicates a missing value.
```python
# Imputing with the most common category
# (mode() returns a Series, so take its first value)
df['column_name'] = df['column_name'].fillna(df['column_name'].mode()[0])
```
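The placeholder approach takes a single line. The label `'Missing'` below is an arbitrary choice for illustration, not a convention from the original:

```python
import pandas as pd

s = pd.Series(['red', None, 'blue', None])

# Treat missingness as its own category rather than guessing a value
s_filled = s.fillna('Missing')
```

This keeps the fact of missingness visible to downstream models, which can be useful when values are not missing at random.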
Advanced Imputation Methods
More sophisticated imputation methods take into account the relationship between the feature with missing values and other features.
- Regression Imputation: In this method, we use a regression model to predict the missing values based on other data.
```python
from sklearn.linear_model import LinearRegression

# 'column_with_missing_values' is the column to fill;
# other_columns is a list of fully observed predictor columns
known = df['column_with_missing_values'].notnull()

lr = LinearRegression()
lr.fit(df.loc[known, other_columns],
       df.loc[known, 'column_with_missing_values'])

df.loc[~known, 'column_with_missing_values'] = lr.predict(
    df.loc[~known, other_columns])
```
- Multiple Imputation: This method is similar to regression imputation, but instead of filling in a single value, it fills in the missing values several times, producing several completed datasets. The analysis is then run on each dataset and the results are pooled. This gives a more realistic estimate of the uncertainty caused by missing values.
```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(max_iter=10, random_state=0)
df_imputed = imputer.fit_transform(df)
```
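The pooling step can be sketched by drawing several stochastic imputations with `sample_posterior=True` and combining a statistic across them. The toy DataFrame and the choice of five draws below are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({'a': [1.0, 2.0, np.nan, 4.0],
                   'b': [2.0, 4.0, 6.0, 8.0]})

# Draw several imputations; sample_posterior=True makes each draw stochastic
estimates = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, max_iter=10,
                               random_state=seed)
    completed = imputer.fit_transform(df)
    estimates.append(completed[:, 0].mean())  # analysis step per dataset

# Pooling step: combine the per-imputation results
pooled_mean = float(np.mean(estimates))
spread = float(np.std(estimates))  # reflects uncertainty due to missingness
```

The spread across draws is what a single-imputation approach hides: it quantifies how much the missing values themselves contribute to uncertainty in the final estimate.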
Handling missing data is an important step in the data preparation process. In this post, we have reviewed various strategies to handle missing data. The choice of strategy depends on the problem at hand, the amount of missing data and the nature of the data.
To further improve your skills in handling missing data, I recommend the following resources: