Data Preprocessing Pipeline in Machine Learning

A data preprocessing pipeline for machine learning using the Kaggle House Price Prediction competition's data: a walkthrough in Python

Srivignesh Rajan
The Startup



Data Preprocessing:

Data preprocessing is a predominant step in machine learning for yielding highly accurate and insightful results. The greater the quality of the data, the greater the reliability of the produced results. Incomplete, noisy, and inconsistent data are the inherent nature of real-world datasets. Data preprocessing helps increase the quality of the data by filling in missing values, smoothing out noise, and resolving inconsistencies.

  • Incomplete data can occur for many reasons. Appropriate data may not be persisted due to a misunderstanding, or because of instrument defects and malfunctions.
  • Noisy data (data with incorrect feature values) can also occur for a number of reasons. The instruments used for data collection might be faulty, data entry may contain human or instrument errors, and data transmission errors might occur as well.

There are many stages involved in data preprocessing,

  1. Data Cleaning
  2. Data Integration
  3. Data Transformation
  4. Data Reduction

Data cleaning attempts to impute missing values, smooth out noise, resolve inconsistencies, and remove outliers in the data.

Data integration integrates data from a multitude of sources into a single data warehouse.

Data transformations, such as normalization, may be applied. For example, normalization may improve the accuracy and efficiency of mining algorithms involving distance measurements.

Data reduction can reduce the data size by dropping out redundant features. Feature selection and feature extraction techniques can be used.

Import the required libraries and load the dataset for training and testing
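
A minimal sketch of this step, assuming the competition files train.csv and test.csv have been downloaded from Kaggle into the working directory:

```python
import numpy as np
import pandas as pd

# File names are assumptions: the CSVs from the Kaggle competition page.
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

print(train.shape, test.shape)
```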

1. Data Cleaning

1.1 Find the missing percentage of each column in the training set.
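
One way this could be computed with pandas, continuing from the train DataFrame loaded above:

```python
# Percentage of missing values per column, highest first.
missing_percentage = train.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_percentage.head(10))
```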

1.2 Drop the columns which have more than 70% of missing values
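
A sketch of the same idea; the 70% threshold follows the text, and the column handling is illustrative:

```python
# Drop every column whose share of missing values exceeds 70%.
cols_to_drop = train.columns[train.isnull().mean() > 0.70]
train = train.drop(columns=cols_to_drop)
test = test.drop(columns=cols_to_drop, errors='ignore')
```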

1.3 MICE (Multiple Imputation by Chained Equations)

Imputation of missing values can be done using two techniques,

  • Single Imputation: the missing data is imputed with a single value, as opposed to multiple imputation, which replaces the missing data with multiple values.
  • Multiple Imputation: the missing data is imputed with multiple values multiple times, producing a multitude of imputed datasets.

MICE Algorithm:

  • Step 1: Initially, every missing value is imputed with the column mean; these act as “placeholders”.
  • Step 2: One variable with missing values is chosen as the target, and the remaining variables act as features. A regression model is trained on these features and the target.
  • Step 3: The missing values of the target are then replaced with predictions (imputations) from the regression model.
  • Step 4: Steps 2–3 are repeated for each feature that has unobserved data. At the end of a single cycle, all the missing values have been imputed with predictions.
  • Step 5: Steps 2–4 are repeated for the specified number of cycles.
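
scikit-learn's experimental IterativeImputer follows this chained-equation scheme (it performs a single imputation per run rather than full multiple imputation). A sketch for the numeric columns, with GradientBoostingRegressor as an assumed choice of estimator:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import GradientBoostingRegressor

numeric_cols = train.select_dtypes(include=np.number).columns

# Mean initialisation (step 1), then a regression model per column with
# missing values (steps 2-4), repeated for max_iter cycles (step 5).
mice_imputer = IterativeImputer(
    estimator=GradientBoostingRegressor(),
    initial_strategy='mean',
    max_iter=10,
    random_state=0,
)
train[numeric_cols] = mice_imputer.fit_transform(train[numeric_cols])
```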

MICE Algorithm for Categorical data:

Before going through steps 1 to 5 of the MICE algorithm, the following steps must be done in order to impute categorical data.

  • Step 1: Ordinal encode the non-null values.
  • Step 2: Use MICE imputation with a Gradient Boosting Classifier to impute the ordinal-encoded data.
  • Step 3: Convert the imputed ordinal values back to categorical values.
  • Step 4: When following steps 1 to 5 of the MICE algorithm, use mode imputation instead of mean imputation as the initial strategy.
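
A sketch of the categorical variant, again using IterativeImputer as a stand-in for MICE; the encode/decode handling here is illustrative, not the author's exact code:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import GradientBoostingClassifier

cat_cols = train.select_dtypes(include='object').columns

# Step 1: ordinal encode the non-null values, remembering the categories for decoding.
categories = {c: train[c].astype('category').cat.categories for c in cat_cols}
encoded = train[cat_cols].apply(lambda s: s.astype('category').cat.codes).replace(-1, np.nan)

# Steps 2 and 4: chained-equation imputation with a classifier and mode initialisation.
imputer = IterativeImputer(
    estimator=GradientBoostingClassifier(),
    initial_strategy='most_frequent',
    max_iter=5,
    random_state=0,
)
imputed = pd.DataFrame(imputer.fit_transform(encoded), columns=cat_cols, index=train.index)

# Step 3: convert the ordinal codes back to the original categorical labels.
for c in cat_cols:
    train[c] = categories[c][imputed[c].round().astype(int).to_numpy()]
```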

2. Data Visualization
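
A minimal plotting sketch, assuming matplotlib and the preprocessed train DataFrame from the steps above:

```python
import matplotlib.pyplot as plt

# Plot a histogram for every numeric column to inspect distributions and skew.
train.select_dtypes(include='number').hist(bins=30, figsize=(16, 12))
plt.tight_layout()
plt.show()
```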

Sample output: histograms of the numeric columns.

3. Data Transformation

3.1 Skewed data:

Skewness is the distortion of a distribution away from normality. Heavily skewed values can act as outliers and produce unreliable results, so skewed data should be transformed back toward a normal distribution.

  • The distribution is highly skewed if the skewness is less than -1 or greater than 1.
  • The distribution is moderately skewed if the skewness is between -1 and -0.5 or between 0.5 and 1
  • The distribution is approximately symmetric if the skewness is between -0.5 and 0.5.
  • The distribution is symmetric if the skewness is 0.

3.1.1 Positively skewed data:

Log transformation (when the data is highly skewed)

  • Use log(X) if no zero values are present; use log(C + X) if zero values are present.
  • C is a constant added so that the smallest value becomes equal to 1.

Square root transformation (when the data is moderately skewed)

  • sqrt(X)
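
A sketch of these transformations applied per column; the skewness thresholds follow the rules of thumb above, and the shift constant C is computed so the minimum becomes 1:

```python
import numpy as np

skewness = train.select_dtypes(include='number').skew()

for col in skewness.index:
    if skewness[col] > 1:            # highly right-skewed: log transform
        # C shifts the column so its smallest value is 1 when zeros/negatives exist.
        shift = 1 - train[col].min() if (train[col] <= 0).any() else 0
        train[col] = np.log(train[col] + shift)
    elif skewness[col] > 0.5:        # moderately right-skewed: square root transform
        train[col] = np.sqrt(train[col].clip(lower=0))
```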

3.1.2 Negatively skewed data:

Reflect and Log transformation

  • Use log(K - X), where K is a constant from which the values are subtracted so that the smallest value of (K - X) is 1.
  • Reflecting the data with (K - X) makes large values small and small values large, so the negatively skewed data becomes positively skewed.

Reflect and Square root transformation

  • sqrt(K - X)
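
And a corresponding sketch for left-skewed columns, reflecting with K = max + 1 before applying the transform:

```python
import numpy as np

skewness = train.select_dtypes(include='number').skew()

for col in skewness.index:
    K = train[col].max() + 1         # constant so the smallest value of K - X is 1
    if skewness[col] < -1:           # highly left-skewed: reflect and log
        train[col] = np.log(K - train[col])
    elif skewness[col] < -0.5:       # moderately left-skewed: reflect and square root
        train[col] = np.sqrt(K - train[col])
```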

3.2 Categorical Encoding

3.2.1 Ordinal Encoding:

Ordinal columns are the ones that have ordinality, or an inherent order, in their values. Examples: ratings and feedback such as excellent, good, fair, poor.

Various ordinal encoding techniques are:

  • Label Encoding
  • Binary Encoding

3.2.2 Nominal Encoding:

Nominal columns are the ones that do not have any ordinality or inherent order in their values. Examples: country names, gender (male, female).

The various nominal encoding techniques available are:

  • Frequency Encoding
  • Target Encoding
  • M-Estimate Encoding
  • Leave One Out Encoding
  • One-Hot Encoding

3.2.2.1 Target Encoding:

Target encoding is the process of encoding a qualitative/categorical value with the mean of the target variable for that category.
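
A minimal pandas sketch of target encoding; Neighborhood and SalePrice are real columns in this dataset, but the choice of column here is only illustrative:

```python
# Replace each neighborhood label by the mean sale price observed for that neighborhood.
target_means = train.groupby('Neighborhood')['SalePrice'].mean()
train['Neighborhood'] = train['Neighborhood'].map(target_means)
```

In practice the encoding should be fit on the training data only (and often smoothed or computed out-of-fold) to avoid target leakage.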

3.3 Normalization:

Normalization is also called feature scaling. It scales the values of features into a certain interval, e.g. [0, 1].
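
A sketch using scikit-learn's MinMaxScaler; excluding the SalePrice target from scaling is an assumption:

```python
from sklearn.preprocessing import MinMaxScaler

# Scale every numeric feature (but not the target) into the [0, 1] interval.
feature_cols = train.select_dtypes(include='number').columns.drop('SalePrice')
scaler = MinMaxScaler()
train[feature_cols] = scaler.fit_transform(train[feature_cols])
```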

4. Data Modeling

Fit an XGBoost regressor to the preprocessed data.
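
A sketch of the modeling step, assuming all features are numeric after the preprocessing above; the hyperparameters and the hold-out split are illustrative, not tuned values from the original notebook:

```python
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split

# Separate features and target, holding out 20% of the data for validation.
X = train.drop(columns=['SalePrice'])
y = train['SalePrice']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit an XGBoost regressor on the preprocessed features.
model = XGBRegressor(n_estimators=500, learning_rate=0.05, random_state=0)
model.fit(X_train, y_train)
print('Validation R^2:', model.score(X_valid, y_valid))
```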

Find this post in my Kaggle Notebook: https://www.kaggle.com/srivignesh/data-preprocessing-for-house-price-prediction

Data Source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data


Connect with me on LinkedIn, Twitter!

Thank you!
