Hands-on Categorical Feature Encoding in Machine Learning

Categorical Feature Encoding: A Walkthrough in Python!

Srivignesh Rajan
The Startup


Categorical Feature Encoding

Categorical feature encoding is an important step in data preprocessing: it is the process of converting categorical features into numeric features. Categorical variables are also called qualitative variables. The results produced by a model vary when different encoding techniques are used.

Two types of categorical features exist:

  • Ordinal Features
  • Nominal Features

Ordinal Features:

  • Ordinal features are features that have an inherent ordering.
  • E.g., ratings such as Good and Bad, where Good > Bad.

Nominal Features:

  • Nominal features are features that have no inherent ordering, as opposed to ordinal features.
  • E.g., names of persons, gender, or yes/no values.

Need for categorical feature encoding

  • Categorical features must be encoded before being fed to a model, because many machine learning algorithms do not accept categorical features as input.
  • Most machine learning and deep learning algorithms operate only on numerical (quantitative) variables.

Ordinal Encoding Techniques

  • Label Encoding or Ordinal Encoding

Nominal Encoding Techniques

  • Frequency Encoding
  • Target Encoding
  • One-hot Encoding
  • Leave One Out Encoding
  • M-Estimate Encoding

Load the required libraries
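The notebook's import cell is not reproduced here; a minimal set covering the sketches below might look like this (category_encoders is a third-party package, installable with pip install category_encoders):

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
    import category_encoders as ce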

Ordinal Encoding

1. Ordinal Encoding using OrdinalEncoder in scikit-learn

OrdinalEncoder assigns integer values to the categories of an ordinal feature.
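A minimal sketch, assuming the imports above (the 'rating' column and its values are made up for illustration):

    # Toy frame with an ordinal 'rating' column.
    df = pd.DataFrame({'rating': ['Bad', 'Good', 'Good', 'Bad']})

    # Pass the category order explicitly so that Bad -> 0 and Good -> 1.
    ordinal_encoder = OrdinalEncoder(categories=[['Bad', 'Good']])
    df['rating_encoded'] = ordinal_encoder.fit_transform(df[['rating']]).ravel()
    print(df)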

2. Ordinal Encoding using LabelEncoder in scikit-learn

LabelEncoder produces the same integer codes as OrdinalEncoder, although it is designed for encoding target labels and handles only one column at a time.
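Continuing with the same toy frame:

    # LabelEncoder works on a single 1-D column; categories are numbered
    # in sorted order, so Bad -> 0 and Good -> 1 here as well.
    label_encoder = LabelEncoder()
    df['rating_label'] = label_encoder.fit_transform(df['rating'])
    print(label_encoder.classes_)  # ['Bad' 'Good']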

Nominal Encoding

1. Frequency Encoding

In frequency encoding, each category in a feature is replaced with the frequency of that category, as sketched in the snippet below.

The formula for frequency encoding:

Category refers to each of the unique values in a feature.

  • FrequencyEncoding(category) = Frequency(category) / Size(data)
  • Frequency(category) = number of rows belonging to that category
  • Size(data) = number of rows in the entire dataset

Disadvantage: If two categories have the same frequency, they become indistinguishable after encoding.
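A minimal pandas sketch (the 'city' column is made-up data; the notebook's exact code may differ):

    # Frequency encoding with plain pandas.
    df = pd.DataFrame({'city': ['NY', 'NY', 'SF', 'LA', 'NY', 'SF']})

    # value_counts(normalize=True) computes Frequency(category) / Size(data).
    frequencies = df['city'].value_counts(normalize=True)
    df['city_freq'] = df['city'].map(frequencies)
    print(df)  # NY -> 0.5, SF -> 0.333..., LA -> 0.166...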

2. Target Encoding

In target encoding, each category is replaced with the mean of the target variable over the rows belonging to that category. Target encoding is one of the most widely used categorical encoding techniques on Kaggle.

  • Target Encoding = mean(target of a category)

Disadvantage: Tends to overfit the data when some categories have only a few occurrences.
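A minimal sketch with pandas, using a made-up 'city' column and a binary target:

    # Target encoding: map each category to the mean of the target
    # within that category.
    df = pd.DataFrame({'city':   ['NY', 'NY', 'SF', 'LA', 'NY', 'SF'],
                       'target': [1, 0, 1, 0, 1, 1]})
    df['city_target'] = df.groupby('city')['target'].transform('mean')
    print(df)  # NY -> 0.667, SF -> 1.0, LA -> 0.0

Note that in practice the category means should be computed on the training split only and then mapped onto the validation/test data; otherwise the target leaks into the features.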

3. One-Hot Encoding

One-hot encoding replaces a categorical feature with binary indicator columns: if a feature has 'N' unique values, 'N' new features are created (see the sketch after the list of drawbacks).

Disadvantages:

  • Tree-based algorithms tend to perform poorly on one-hot encoded data, since it produces a large, sparse matrix.
  • When a feature contains many unique values, an equally large number of columns is created, which may lead to overfitting.
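A quick sketch with pandas.get_dummies (made-up data):

    # One-hot encoding: each unique value in 'city' becomes its own
    # 0/1 indicator column.
    df = pd.DataFrame({'city': ['NY', 'SF', 'LA', 'NY']})
    one_hot = pd.get_dummies(df['city'], prefix='city', dtype=int)
    print(one_hot)
    #    city_LA  city_NY  city_SF
    # 0        0        1        0
    # 1        0        0        1
    # 2        1        0        0
    # 3        0        1        0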

4. Leave One Out Encoding

Leave One Out Encoding (LOOE) is very similar to target encoding; the difference is that LOOE excludes the current row when computing the target mean for that row's category.

Disadvantage: Tends to overfit to the data.
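A sketch using the LeaveOneOutEncoder from the category_encoders package (an assumption; the notebook may implement it differently):

    # Each row is encoded with the target mean of its category,
    # computed over all *other* rows of that category.
    X = pd.DataFrame({'city': ['NY', 'NY', 'SF', 'LA', 'NY', 'SF']})
    y = pd.Series([1, 0, 1, 0, 1, 1])

    looe = ce.LeaveOneOutEncoder(cols=['city'])
    X_encoded = looe.fit_transform(X, y)
    print(X_encoded)  # the first NY row gets mean(0, 1) = 0.5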

5. M-Estimate Encoding

M-Estimate encoding, also known as additive smoothing, mitigates the overfitting of target encoding by blending each category's target mean with the overall prior, controlled by a smoothing factor m.

The formula for M-Estimate encoding:

  • MEstimate(category) = (Sum(target of category) + m × prior) / (Count(category) + m)
  • prior = mean of the target over the entire dataset
  • m = smoothing factor; larger values pull rare categories toward the prior
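A sketch using the MEstimateEncoder from category_encoders (an assumption); its m parameter is the smoothing factor in the formula above:

    X = pd.DataFrame({'city': ['NY', 'NY', 'SF', 'LA', 'NY', 'SF']})
    y = pd.Series([1, 0, 1, 0, 1, 1])

    # Larger m pulls rare categories more strongly toward the prior.
    m_estimate = ce.MEstimateEncoder(cols=['city'], m=2.0)
    X_encoded = m_estimate.fit_transform(X, y)
    print(X_encoded)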

Find this post in my Kaggle notebook: https://www.kaggle.com/srivignesh/categorical-feature-encoding-techniques


Connect with me on LinkedIn, Twitter!

Happy Machine Learning!!

Thank you!
