Techniques of Feature Selection in Machine Learning

Facing issues with overfitting and low accuracy? Feature selection comes to the rescue.

Srivignesh Rajan
Analytics Vidhya


Dimensionality Reduction

  • Dimensionality reduction is the process of reducing the number of features in a dataset.
  • Applying a model directly to the full set of features can lead to spurious predictions and poor generalization, which in turn makes the model unreliable.
  • Dimensionality reduction is applied to prevent these issues.

Need for dimensionality reduction

Dimensionality reduction helps prevent overfitting.

  • Overfitting occurs when a model memorizes the training data and fails to generalize to unseen data. It can be caused by highly flexible models (such as decision trees) as well as by high-dimensional data.
  • An overfitted model cannot be applied to real-world problems because it does not generalize.

Types of Dimensionality Reduction

  • Feature Selection: Feature selection methods reduce dimensionality by discarding the least important features.
  • Feature Extraction: Feature extraction methods reduce dimensionality by combining the original features and transforming them into a specified number of new features.

Feature Selection

  1. Filter methods
  2. Wrapper methods
  3. Embedded methods
  4. Feature Importance

Import the required libraries
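
A minimal set of imports that covers the techniques used below (assuming pandas and scikit-learn are installed):

import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression, chi2
from sklearn.feature_selection import RFE, RFECV, SelectFromModel
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor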

Load the preprocessed data

The training data has already been preprocessed. The preprocessing steps involved are:

  1. MICE Imputation
  2. Log transformation
  3. Square root transformation
  4. Ordinal Encoding
  5. Target Encoding
  6. Z-Score Normalization

For the detailed implementation of the above-mentioned steps refer to my Kaggle notebook on data preprocessing:

Notebook Link
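
As a minimal sketch, the preprocessed data can be loaded as follows; the file name and the target column name are hypothetical placeholders:

# Load the already-preprocessed training data (hypothetical file name)
train = pd.read_csv('preprocessed_train.csv')

# Separate the predictors from the target column (column name is an assumption)
X = train.drop(columns=['target'])
y = train['target']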

Feature Selection

1. Filter Methods

Filter methods select features independently of the model used. They can use measures such as the following to select a useful set of features:

  • Correlation for numeric columns
  • Chi-squared (chi2) test of association for categorical columns

SelectKBest in scikit-learn
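
A minimal sketch of SelectKBest, assuming X and y are the preprocessed predictors and target loaded above; f_regression scores numeric features, while chi2 would be passed instead for non-negative categorical features:

from sklearn.feature_selection import SelectKBest, f_regression

# Keep the 10 highest-scoring features according to the F-test (k is arbitrary here)
selector = SelectKBest(score_func=f_regression, k=10)
X_selected = selector.fit_transform(X, y)

# Names of the retained columns
print(X.columns[selector.get_support()])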

F_Regression:

F_Regression is used for numeric variables and consists of two steps:

  • The correlation between each feature and the target is computed.
  • The correlation is then converted to an F score and then to a p-value.
Correlation formula: r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[ Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² ]

Chi2:

  • Chi2 is used for testing the association between categorical variables.
Chi2 formula: χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ, where Oᵢ is the observed frequency and Eᵢ is the expected frequency.

2. Wrapper methods

Wrapper methods make use of an estimator to select a useful set of features. The available techniques are:

  • Recursive Feature Elimination
  • Recursive Feature Elimination Cross-Validation

Recursive Feature Elimination (RFE)

  • The estimator provided to RFE assigns weights to the features (e.g., the coefficients); RFE recursively eliminates the features that are assigned the lowest weights.
  • The estimator is first trained on the initial set of features, and the weight of each feature is obtained from an attribute such as coef_ or feature_importances_.
  • The least weighted features are then removed from the current set, and the procedure is repeated on the pruned set until the specified number of features to select is reached, as sketched below.
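
A sketch of RFE, assuming a linear model as the estimator; the estimator and the number of features to keep are arbitrary choices here:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Recursively drop the least-weighted features until 10 remain
rfe = RFE(estimator=LinearRegression(), n_features_to_select=10, step=1)
rfe.fit(X, y)

print(X.columns[rfe.support_])  # features that were kept
print(rfe.ranking_)             # rank 1 marks the selected features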

Recursive Feature Elimination Cross-Validation (RFECV)

  • RFECV is very similar to RFE, but it uses cross-validation to score each training phase and finally outputs the optimal number of features to select, as sketched below.
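
RFECV follows the same pattern; a sketch assuming 5-fold cross-validation with an R² scorer:

from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Score every elimination step with cross-validation and keep the best subset
rfecv = RFECV(estimator=LinearRegression(), step=1, cv=5, scoring='r2')
rfecv.fit(X, y)

print(rfecv.n_features_)          # optimal number of features found
print(X.columns[rfecv.support_])  # the selected features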

3. Embedded Methods

Embedded methods select features during the training process itself.

  • The coefficient of a feature is driven to zero when the importance of that feature is low, so that feature is not used to make predictions.

LASSO regression

LASSO stands for Least Absolute Shrinkage and Selection Operator

LASSO cost function: Σᵢ (yᵢ − ŷᵢ)² + λ Σⱼ |βⱼ|, i.e., the residual sum of squares plus a penalty on the absolute size of the coefficients.

λ = Penalty (Tuning Parameter)

When λ = 0 no parameters are eliminated and the model is equivalent to ordinary linear regression; as λ increases, more and more coefficients are shrunk to exactly zero.

  • The parameter estimates are found by minimizing this cost function.
  • When a coefficient estimate falls below λ/2 in magnitude, that coefficient becomes zero, as sketched below.
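
A sketch of LASSO-based selection, assuming the features are already standardized; the alpha value (scikit-learn's name for λ) is an arbitrary choice:

from sklearn.linear_model import Lasso

# Fit LASSO; coefficients of unimportant features are shrunk exactly to zero
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

# Keep only the features whose coefficients survived the shrinkage
print(X.columns[lasso.coef_ != 0])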

4. Feature Importance

Feature importances are calculated by fitting a model on the entire set of features; the fitted model assigns a weight to each feature.

  • The model exposes attributes such as coef_ or feature_importances_ that help to select a subset of features; using these, the least important features are pruned.

SelectFromModel in scikit-learn
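
A sketch of SelectFromModel, assuming a random forest supplies the feature_importances_ attribute; the estimator and threshold are arbitrary choices:

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor

# Fit the estimator, then prune features whose importance falls below the median
sfm = SelectFromModel(RandomForestRegressor(n_estimators=100, random_state=0),
                      threshold='median')
sfm.fit(X, y)

print(X.columns[sfm.get_support()])  # the retained features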

Find this post in my Kaggle notebook: https://www.kaggle.com/srivignesh/feature-selection-techniques


Connect with me on LinkedIn, Twitter!

Thank you!
