Methods to Handle Missing Data in Machine Learning

Techniques for handling missing data in machine learning: a walkthrough in Python

Srivignesh Rajan
Analytics Vidhya


Why is handling missing data important?

Missing data is prevalent in most research areas and causes several problems.

  • First, missing data reduces the power of statistical methods.
  • Second, missing data can introduce bias into the model.
  • Third, many machine learning packages in Python do not accept missing data; the missing values must be handled first.

Each of these problems may lead to spurious conclusions that reduce the reliability of the model.

Missing data mechanisms

  • Missing Completely At Random (MCAR): Values are MCAR if the missingness is unrelated to both the observed and the missing data.
  • Missing At Random (MAR): Values are MAR if the missingness is related to the observed data but not to the missing data.
  • Missing Not At Random (MNAR): Data that is neither MCAR nor MAR. This implies that the missingness is related to both the observed and the missing data.

Handling Missing data

  1. Dropping Variables
  2. Partial Deletion
  3. Data Imputation

1. Dropping Variables

Delete a column if more than about 70% of its values are missing; otherwise, data imputation is preferable to deletion. The more information the model receives, the more reliable its results.
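A minimal sketch of this rule using pandas, with a made-up toy DataFrame and the 70% threshold from above:

```python
import numpy as np
import pandas as pd

# Toy data for illustration: "salary" is 80% missing
df = pd.DataFrame({
    "age": [25, np.nan, 30, 28, np.nan],
    "salary": [np.nan, np.nan, np.nan, 50000, np.nan],
    "city": ["NY", "LA", None, "SF", "NY"],
})

# Drop any column whose fraction of missing values exceeds 0.7
threshold = 0.7
missing_fraction = df.isnull().mean()
df_reduced = df.drop(columns=missing_fraction[missing_fraction > threshold].index)
```

Only "salary" exceeds the threshold, so it is the only column dropped.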

2. Partial Deletion

2.1 Listwise Deletion

Listwise deletion is a technique in which the rows that contain missing values are deleted.

Disadvantages: Listwise deletion reduces the power of any statistical tests performed, because dropping every row that contains a missing value shrinks the sample size, and larger samples generally yield more reliable analyses. Listwise deletion can also introduce bias when the data are not missing completely at random.
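In pandas, listwise deletion is a single `dropna` call; the toy data here is made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "height": [170, np.nan, 160, 175],
    "weight": [65, 70, np.nan, 80],
})

# Listwise deletion: drop every row containing at least one missing value
df_complete = df.dropna(axis=0)
```

Two of the four rows contain a missing value, so the dataset shrinks by half, illustrating the loss of sample size described above.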

3. Data Imputation

3.1 Single Imputation

Single imputation replaces each missing value with a single value, as opposed to multiple imputation, which replaces the missing data with multiple values.


3.1.1 Single Imputation for Numeric columns

3.1.1.1 Mean Imputation

  • Mean imputation replaces the missing values with the mean of the variable; it can be applied only to numeric columns.

Disadvantage: Mean imputation is likely to introduce bias into the model.
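Mean imputation is one line with scikit-learn's `SimpleImputer`; the toy column below is made up for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [3.0], [np.nan], [5.0]])

# Replace each missing entry with the column mean: (1 + 3 + 5) / 3 = 3.0
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
```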

3.1.1.2 Regression Imputation

  • A regression model is fitted in which the predictors are the features without missing values and the target is the feature with missing values.
  • The missing values are then replaced with the model's predictions. Regression imputation is less likely to introduce bias than mean imputation.
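A minimal sketch of regression imputation with scikit-learn, assuming made-up toy columns (`experience` fully observed, `salary` partly missing):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data: salary = 10 * experience, with two salaries missing
df = pd.DataFrame({
    "experience": [1.0, 2.0, 3.0, 4.0, 5.0],
    "salary": [10.0, 20.0, np.nan, 40.0, np.nan],
})

# Fit the model only on rows where the target is observed
observed = df["salary"].notna()
model = LinearRegression()
model.fit(df.loc[observed, ["experience"]], df.loc[observed, "salary"])

# Replace the missing targets with the model's predictions
df.loc[~observed, "salary"] = model.predict(df.loc[~observed, ["experience"]])
```

Because the observed points lie exactly on a line, the imputed salaries land on that line as well (30 and 50).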

3.1.2 Single Imputation for Categorical columns

3.1.2.1 Mode Imputation

  • Mode imputation replaces the missing values with the mode (most frequent value) of the variable; it can be applied only to categorical columns.

Disadvantage: Like mean imputation, mode imputation is likely to introduce bias into the model.
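`SimpleImputer` also covers mode imputation via the `most_frequent` strategy; the toy categorical column is made up for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([["red"], ["blue"], ["red"], [np.nan]], dtype=object)

# Replace missing entries with the most frequent category ("red")
imputer = SimpleImputer(strategy="most_frequent")
X_imputed = imputer.fit_transform(X)
```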

3.2 Multiple Imputation

  • In multiple imputation, the missing data is imputed with multiple values multiple times, producing several imputed datasets.

3.2.1 MICE (Multiple Imputation by Chained Equation)

MICE Algorithm:

  • Step 1: Initially, the dataset is imputed with the mean, which acts as a “placeholder”.
  • Step 2: One feature with missing values is chosen and treated as the target; all the other features act as predictors. A regression model is trained on these predictors and the target.
  • Step 3: The missing values of the target are then replaced with the predictions (imputations) from the regression model.
  • Step 4: Steps 2–3 are repeated for each feature that has missing data. At the end of one cycle, all the missing values have been imputed with predictions.
  • Step 5: Steps 2–4 are repeated for a specified number of cycles.
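The cycle above maps closely to scikit-learn's `IterativeImputer` (still marked experimental, hence the extra enabling import), which is inspired by MICE but returns a single imputed dataset; the toy array below is made up for illustration:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data roughly following y = 2x, with one missing entry per column
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [np.nan, 8.0],
])

# Each feature with missing values is modelled from the others, cycling
# for up to max_iter rounds; initial placeholders come from the mean
imputer = IterativeImputer(max_iter=10, initial_strategy="mean", random_state=0)
X_imputed = imputer.fit_transform(X)
```

By default the per-feature model is a Bayesian ridge regressor; any scikit-learn regressor can be passed via the `estimator` parameter.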

MICE Algorithm for Categorical data:

Categorical columns must be encoded numerically before the MICE algorithm above (steps 1–5) can be applied:

  • Step 1: Ordinal-encode the non-null values.
  • Step 2: Run the MICE algorithm on the ordinal-encoded data, using mode imputation instead of mean imputation as the initial strategy and a Gradient Boosting Classifier as the model.
  • Step 3: Convert the imputed ordinal values back to categorical values.
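A minimal single-cycle sketch of this recipe, assuming made-up toy data and column names (`score`, `grade`); a full MICE run would repeat the predict step for every incomplete column over several cycles:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Toy data: a numeric feature and a categorical column with gaps
df = pd.DataFrame({
    "score": [0.1, 0.2, 0.8, 0.9, 0.15, 0.85],
    "grade": ["low", "low", "high", "high", None, None],
})

# Step 1: ordinal-encode the non-null categories (missing -> NaN)
categories = pd.Categorical(df["grade"])
codes = pd.Series(categories.codes, index=df.index).astype(float)
codes[codes == -1] = np.nan

# Step 2: fit a classifier on the observed rows, predict the missing codes
observed = codes.notna()
clf = GradientBoostingClassifier(random_state=0)
clf.fit(df.loc[observed, ["score"]], codes[observed].astype(int))
codes.loc[~observed] = clf.predict(df.loc[~observed, ["score"]])

# Step 3: convert the ordinal codes back to category labels
df["grade"] = pd.Categorical.from_codes(codes.astype(int), categories.categories)
```

Low scores sit next to observed "low" rows and high scores next to "high" rows, so the classifier fills the two gaps accordingly.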

Find this post in my Kaggle notebook: https://www.kaggle.com/srivignesh/techniques-for-handling-the-missing-data


Connect with me on LinkedIn, Twitter!

Happy Machine Learning!

Thank you!
