Cost functions for Regression and their Optimization Techniques in Machine Learning

Probing deep into the cost functions of Regression and their Optimization Techniques: A Walkthrough in Python

Srivignesh Rajan
Towards Data Science


Cost Function

A Cost function is used to gauge the performance of a Machine Learning model: it compares the predicted values with the actual values and quantifies how far off the model's predictions are. A Machine Learning model without a Cost function has no way to measure, and therefore no way to improve, its performance. An appropriate choice of Cost function contributes to the credibility and reliability of the model.

Loss function vs. Cost function

  • A function that is defined on a single data instance is called a Loss function, e.g. the absolute loss in Regression.
  • A function that is defined over the entire dataset is called a Cost function, e.g. the Mean Absolute Error in Regression.

Cost functions of Regression

Regression tasks deal with continuous data. Cost functions available for Regression are,

  • Mean Absolute Error
  • Mean Squared Error
  • Root Mean Squared Error
  • Root Mean Squared Logarithmic Error

Mean Absolute Error

Mean Absolute Error (MAE) is the mean absolute difference between the actual values and the predicted values.

  • MAE is more robust to outliers than squared-error metrics because it does not disproportionately penalize the large errors that outliers cause.
  • The drawback of MAE is that it is not differentiable at zero, and many cost function optimization algorithms rely on differentiation to find optimal values for the parameters.
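
A minimal sketch of MAE in NumPy (not code from the original notebook; scikit-learn's mean_absolute_error yields the same result):

    import numpy as np

    def mean_absolute_error(y_true, y_pred):
        # Mean of the absolute differences between actual and predicted values.
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        return np.mean(np.abs(y_true - y_pred))

    print(mean_absolute_error([3.0, 5.0, 2.5], [2.5, 5.0, 4.0]))  # ~0.667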

Mean Squared Error

Mean Squared Error (MSE) is the mean squared difference between the actual and predicted values. Squaring the errors penalizes large errors heavily, and it gives a cost that is differentiable everywhere, which helps optimization algorithms find the optimal values for the parameters.

  • The drawback of MSE is that it is very sensitive to outliers. When the large errors caused by outliers in the target are squared, they become even larger and can dominate the cost.
  • MSE can be used in situations where large errors are particularly undesirable.
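
A minimal NumPy sketch of MSE (again, not code from the original notebook):

    import numpy as np

    def mean_squared_error(y_true, y_pred):
        # Mean of the squared differences; squaring amplifies large, outlier-driven errors.
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        return np.mean((y_true - y_pred) ** 2)

    print(mean_squared_error([3.0, 5.0, 2.5], [2.5, 5.0, 4.0]))  # ~0.833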

Root Mean Squared Error

Root Mean Squared Error (RMSE) is the square root of the mean squared difference between the actual and predicted values. RMSE can be used in situations where we want to penalize large errors, but not as heavily as MSE does.

  • RMSE is also sensitive to outliers, but the square root brings the error back to the scale of the target, so large errors are penalized less than they are under MSE.
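
RMSE is simply the square root of MSE; a minimal sketch:

    import numpy as np

    def root_mean_squared_error(y_true, y_pred):
        # Square root of the mean squared difference, so the error is on the target's own scale.
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        return np.sqrt(np.mean((y_true - y_pred) ** 2))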

Root Mean Squared Logarithmic Error

Root Mean Squared Logarithmic Error (RMSLE) is very similar to RMSE, but the log is applied to the actual and predicted values before the difference is taken. Because of the log, large and small errors are treated more evenly; the metric responds to relative rather than absolute differences. RMSLE can be used in situations where the target is not normalized or scaled.

  • RMSLE is less sensitive to outliers as compared to RMSE. It relaxes the penalization of high errors due to the presence of the log.
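
A minimal sketch of RMSLE, using log1p to avoid taking the log of zero (this assumes the targets and predictions are non-negative):

    import numpy as np

    def root_mean_squared_log_error(y_true, y_pred):
        # RMSE computed on log(1 + y) for both actual and predicted values.
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))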

Cost function Optimization Algorithms

Cost function optimization algorithms attempt to find the optimal values for the model parameters by driving the cost function towards its global minimum. The algorithms covered here are,

  • Gradient Descent
  • RMS Prop
  • Adam

Load the preprocessed data

The data you feed to the ANN must be preprocessed thoroughly to yield reliable results. The training data has been preprocessed already. The preprocessing steps involved are,

  • MICE Imputation
  • Log transformation
  • Square root transformation
  • Ordinal Encoding
  • Target Encoding
  • Z-Score Normalization

For the detailed implementation of the above-mentioned steps, refer to my Kaggle notebook on data preprocessing. Notebook Link
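
To follow along, the preprocessed features and target just need to be loaded into memory. A minimal sketch, assuming hypothetical file names (the actual files live in the linked Kaggle notebook):

    import pandas as pd

    # Hypothetical file names; substitute the output of the preprocessing notebook.
    X_train = pd.read_csv("preprocessed_train_features.csv")
    y_train = pd.read_csv("preprocessed_train_target.csv")
    print(X_train.shape, y_train.shape)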

Train the model with ANN

Refer to my Kaggle notebook on Introduction to ANN in Tensorflow for more details.
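
A minimal sketch of what such a model might look like in TensorFlow/Keras; the layer sizes, epochs, and batch size below are placeholder assumptions rather than the configuration from the notebook, and X_train and y_train come from the loading step above:

    import tensorflow as tf

    # A small feed-forward ANN for regression; layer sizes are placeholders.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(X_train.shape[1],)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1),  # single continuous output for regression
    ])

    # Any of the optimizers discussed below ("sgd", "rmsprop", "adam") can be passed here.
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    model.fit(X_train, y_train, epochs=100, batch_size=32, validation_split=0.2)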

Gradient Descent

The Gradient Descent algorithm makes use of the gradients of the cost function to find optimal values for the parameters. Gradient descent is an iterative algorithm: it repeatedly moves the parameters in the direction that decreases the cost, attempting to reach the global minimum of the cost function.

On each iteration t,

  • The cost of the data is found.
  • The partial differentiation of cost function with respect to weights and bias is computed.
  • The weights and bias are then updated by making use of the gradients of the cost function and the learning rate 𝛼. The value of 𝛼 typically ranges from 0.0 to 1.0; the greater the value of 𝛼, the larger each update step, so a value that is too large can overshoot the minimum while a value that is too small makes convergence slow.
  • Continue the above steps until the specified number of iterations is completed or the minimum of the cost function is reached; a minimal sketch of this update loop follows below.
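
A minimal NumPy sketch of these steps for a linear regression model with an MSE cost (a simpler setting than the ANN above, but the same update rule, parameter minus learning rate times gradient, applies):

    import numpy as np

    def gradient_descent(X, y, alpha=0.01, n_iters=1000):
        # Batch gradient descent for linear regression with an MSE cost.
        n_samples, n_features = X.shape
        w, b = np.zeros(n_features), 0.0
        for _ in range(n_iters):
            error = X @ w + b - y                 # predictions minus actual values
            dw = (2 / n_samples) * (X.T @ error)  # partial derivative of the cost w.r.t. weights
            db = (2 / n_samples) * error.sum()    # partial derivative of the cost w.r.t. bias
            w -= alpha * dw                       # update scaled by the learning rate
            b -= alpha * db
        return w, b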

RMS Prop (Root Mean Squared Prop)

RMS Prop is an optimization algorithm very similar to Gradient Descent, but it keeps an exponentially weighted average of the squared gradients and divides each update by its square root, which helps reach the minimum of the cost function sooner.

On each iteration t,

  • The cost of the data is found.
  • The partial differentiation of cost function with respect to weights and bias is computed.
  • The squared gradients are smoothed with an exponentially weighted average, and the weights and bias are then updated using the gradients scaled by this average and the learning rate 𝛼.
  • Continue the above steps until the specified number of iterations is completed or the minimum is reached (see the sketch after this list).
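
A minimal NumPy sketch of a single RMS Prop parameter update (the variable names and the default hyperparameters are illustrative assumptions):

    import numpy as np

    def rmsprop_update(w, dw, cache, alpha=0.001, beta=0.9, eps=1e-8):
        # cache holds an exponentially weighted average of the squared gradients.
        cache = beta * cache + (1 - beta) * dw ** 2
        # Dividing by the root of the cache damps parameters whose gradients are large or oscillating.
        w = w - alpha * dw / (np.sqrt(cache) + eps)
        return w, cache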

Adam (Adaptive Moment Estimation)

Adam (Adaptive Moment Estimation) is an algorithm that emerged by combining Gradient Descent with momentum and RMS Prop.

On each iteration t,

  • The cost of the data is found.
  • The partial differentiation of cost function with respect to weights and bias is computed.
  • The gradients are smoothed with both the momentum technique of Gradient Descent with momentum and the squared-gradient technique of RMS Prop, and the weights and bias are then updated using these smoothed estimates and the learning rate 𝛼.
  • Continue the above steps until the specified number of iterations is completed or the minimum is reached; a sketch of the update follows below.
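
A minimal NumPy sketch of a single Adam parameter update, combining the two running estimates (again, variable names and default hyperparameters are illustrative assumptions; the iteration counter t starts at 1):

    import numpy as np

    def adam_update(w, dw, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        m = beta1 * m + (1 - beta1) * dw           # momentum term (first moment of the gradient)
        v = beta2 * v + (1 - beta2) * dw ** 2      # RMS Prop term (second moment of the gradient)
        m_hat = m / (1 - beta1 ** t)               # bias correction for the early iterations
        v_hat = v / (1 - beta2 ** t)
        w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
        return w, m, v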

Summary

  • Mean Absolute Error is robust to outliers, whereas Mean Squared Error is sensitive to outliers.
  • Gradient descent algorithm attempts to find the optimal values for parameters such that the global minimum of the cost function is found.
  • The algorithms like RMS Prop and Adam can be thought of as variants of Gradient descent algorithm.

Connect with me on LinkedIn, Twitter!

Happy Machine Learning!

Thank you!
