How to improve a Linear Regression model’s performance using Regularization?
When we talk about supervised machine learning, Linear Regression is the most basic algorithm everyone learns in data science. Let's first try to understand the term Regression.
Regression is a statistical technique used to predict the value of a desired target quantity when that quantity is continuous. Regression analysis is a form of predictive modelling that investigates the relationship between a dependent (target) variable and one or more independent variables (predictors). It is used for forecasting, time series modelling and finding causal-effect relationships between variables. For instance, if you want to study the relationship between road accidents and careless driving, there is no better technique than regression analysis for the job. It plays a very important role in both analyzing and modelling data. This is done by fitting a line or curve to the data points in such a way that the distances of the points from the line or curve are minimized.
Following are the benefits of Regression analysis:
- It indicates the significant relationships between the dependent variable and the independent variables.
- It indicates the strength of the impact of multiple independent variables on the dependent variable.
Now, let's try to understand the term Linear Regression. Linear Regression establishes a relationship between the dependent variable (Y) and one or more independent variables (X) using a best-fit straight line (also known as the regression line). It is represented by the equation
Y = a + b*X + e
where a is the intercept, b is the slope of the line and e is the error term. This equation can be used to predict the value of the target variable based on the given predictor variable(s).
The loss measures how far the predictions made with our estimates of a and b are from the actual values. Our goal is to minimize this loss to obtain the most accurate values of a and b. We use the Mean Squared Error (MSE) function to calculate the loss.
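As a quick illustration (the data values and the coefficients a and b below are made up for the example), the MSE can be computed like this:

```python
import numpy as np

# Illustrative data: X is the predictor, Y the target (values are made up)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# Assumed example coefficients for Y = a + b*X
a, b = 0.1, 2.0

Y_pred = a + b * X                 # predictions of the line
mse = np.mean((Y - Y_pred) ** 2)   # Mean Squared Error: the loss we want to minimize
print(f"MSE: {mse:.4f}")
```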
Gradient descent is an iterative optimization algorithm for finding the minimum of a function. Here that function is our loss function, which is minimized to obtain the best-fit regression line.
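Here is a minimal sketch of gradient descent for the line Y = a + b*X on the same illustrative data; the learning rate and iteration count are arbitrary choices, not prescribed values:

```python
import numpy as np

# Same illustrative data as the MSE example
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

a, b = 0.0, 0.0   # start with a zero intercept and slope
lr = 0.01         # learning rate (assumed value)

for _ in range(5000):
    error = (a + b * X) - Y
    # Gradients of the MSE with respect to a and b
    grad_a = 2 * np.mean(error)
    grad_b = 2 * np.mean(error * X)
    a -= lr * grad_a
    b -= lr * grad_b

print(f"a = {a:.3f}, b = {b:.3f}")   # should approach the least-squares fit
```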
The parameters of the model (the coefficients, often called beta) must be estimated from a sample of observations drawn from the domain. There are several ways to estimate these parameters, such as:
- Least Squares Optimization: an approach that seeks the set of parameters giving the smallest squared error between the model's predictions and the actual outputs, averaged over all examples in the dataset, i.e. the mean squared error (see the sketch after this list).
- Maximum Likelihood Estimation: a frequentist probabilistic framework that seeks the set of parameters that maximizes a likelihood function.
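As a minimal sketch of the least-squares approach (the data values are made up, and NumPy's `lstsq` is just one convenient solver, not the only option), the parameters a and b can be estimated directly:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# Design matrix with a column of ones for the intercept a
A = np.column_stack([np.ones_like(X), X])

# Least-squares solution minimizing the sum of squared residuals
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
a, b = coef
print(f"a = {a:.3f}, b = {b:.3f}")
```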
The following are four assumptions that a Linear Regression model makes (a short residual-check sketch follows the list):
- Linear functional form: The response variable y should be linearly related to the explanatory variables X.
- Residual errors should be i.i.d.: After fitting the model on the training data set, the residual errors of the model should be independent and identically distributed random variables.
- Residual errors should be normally distributed: The residual errors should be normally distributed.
- Residual errors should be homoscedastic: The residual errors should have constant variance.
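These assumptions are usually checked on the residuals after fitting. The sketch below, on made-up data, shows two informal checks: a Shapiro-Wilk test for normality and a comparison of residual spread across the range of X for homoscedasticity (the data and the way the split is done are illustrative only):

```python
import numpy as np
from scipy import stats

# Illustrative data: a roughly linear relationship with noise
rng = np.random.default_rng(3)
X = np.linspace(0, 10, 50)
Y = 1.0 + 2.0 * X + rng.normal(scale=0.5, size=50)

# Fit the line and compute the residual errors
b, a = np.polyfit(X, Y, deg=1)      # slope, intercept
residuals = Y - (a + b * X)

# Normality check: a high p-value is consistent with normally distributed residuals
stat, p = stats.shapiro(residuals)
print("Shapiro-Wilk p-value:", round(p, 3))

# Rough homoscedasticity check: residual spread for small X vs. large X
half = len(X) // 2
print("std (low X): ", round(residuals[:half].std(), 3))
print("std (high X):", round(residuals[half:].std(), 3))
```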
Now, let's discuss some important concepts such as Bias and Variance. A given dataset is divided into train and test subsets. The first subset is used to fit the model and is referred to as the training dataset. The second subset is not used to train the model; instead, its input elements are given to the model, and the predictions are compared to the expected values.
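A common way to create these two subsets is scikit-learn's `train_test_split`; the 80/20 split and the synthetic data below are just illustrative choices:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Synthetic data: 100 samples, 3 features (values are made up)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# 80% for training, 20% held out for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print("Train R^2:", model.score(X_train, y_train))
print("Test  R^2:", model.score(X_test, y_test))
```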
Bias is the difference between the average prediction of our model and the correct value we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the relationship. It leads to high error on both training and test data. This situation is called underfitting.
Variance is the variability of the model's prediction for a given data point, which tells us how spread out our predictions are. A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before. As a result, such models perform very well on training data but have high error rates on test data. Such a situation is called overfitting.
Imagine a bulls-eye diagram in which the center of the target is a model that perfectly predicts the correct values. As we move away from the bulls-eye, our predictions get worse and worse. We can repeat our process of model building to get separate hits on the target.
If our model is too simple and has very few parameters, it may have high bias and low variance. On the other hand, if our model has a large number of parameters, it is likely to have high variance and low bias. So we need to find the right balance without overfitting or underfitting the data. This trade-off in complexity is why there is a trade-off between bias and variance: an algorithm can't be more complex and less complex at the same time.
To build a good model, we need to find a good balance between bias and variance such that it minimizes the total error.
An optimal balance of bias and variance would never overfit or underfit the model.
How are we going to balance bias and variance to avoid overfitting and underfitting? With the increasing size of datasets, one of the most important factors that prevents us from achieving an optimal balance is overfitting. One option is to reduce the number of features or variables in the model. While this would increase the degrees of freedom of the model, information would be lost by discarding features, and the model would not have the benefit of everything that was available. Regularization can be used in such cases: it allows us to keep all the features while reducing the magnitude of their coefficients. Regularization also works to reduce the impact of higher-order polynomial terms in the model. Thus, in a way, it provides a trade-off between the accuracy and the generalizability of a model.
Now, let's discuss how we can achieve an optimally balanced model using Regularization, which regularizes or shrinks the coefficient estimates towards zero. In other words, this technique discourages learning a more complex or flexible model, so as to avoid the risk of overfitting. Intuitively, when a regularization parameter is used, the learning model is constrained to choose from only a limited set of model parameters. Instead of choosing parameters from a discrete grid, regularization chooses values from a continuum, thereby lending a smoothing effect. This smoothing effect is what helps the model capture the signal well (signal is generally smooth) and filter out the noise (noise is never smooth), thereby fighting overfitting successfully.
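To see the shrinking effect concretely, here is a small sketch on synthetic data (the `alpha` values, which play the role of the regularization parameter, are arbitrary) showing how a Ridge model's coefficients move towards zero as the penalty grows:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (values are illustrative)
rng = np.random.default_rng(7)
X = rng.normal(size=(100, 4))
y = X @ np.array([5.0, -3.0, 2.0, 0.0]) + rng.normal(scale=0.5, size=100)

# Larger alpha (regularization strength) pulls the coefficients closer to zero
for alpha in [0.01, 1.0, 10.0, 100.0]:
    coefs = Ridge(alpha=alpha).fit(X, y).coef_
    print(f"alpha={alpha:6.2f} -> coefficients: {np.round(coefs, 3)}")
```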
There are three types of Regularization techniques. Two of the commonly used techniques are L1 or Lasso regularization and L2 or Ridge regularization. Both these techniques impose a penalty on the model to achieve dampening of the magnitude as mentioned earlier. In the case of L1, the sum of the absolute values of the weights is imposed as a penalty while in the case of L2, the sum of the squared values of weights is imposed as a penalty. There is a hybrid type of regularization called Elastic Net that is a combination of L1 and L2.
Following are the important features of the different types of Regularization.
Lasso Regression:
- Lasso regression is a regularization technique used to reduce the complexity of the model. Lasso stands for Least Absolute Shrinkage and Selection Operator.
- The penalty term contains the sum of the absolute values of the weights.
- Since it penalizes absolute values, it can shrink coefficients all the way to 0.
- As a result, some of the features in the dataset are completely ignored by the model.
- It can help us reduce overfitting in the model and also performs feature selection.
- It is also called L1 regularization. Its loss function is
Loss = Σ(Y_i − Ŷ_i)² + λ * Σ|b_j|, where λ is the tuning parameter (a short code sketch follows this list).
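A minimal scikit-learn sketch of Lasso on synthetic data (the `alpha` argument plays the role of λ; its value here is an arbitrary choice):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data where only the first two features matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5)   # alpha plays the role of λ
lasso.fit(X, y)

# Uninformative features are typically shrunk exactly to zero (feature selection)
print("Coefficients:", lasso.coef_)
```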
Ridge Regression:
- It is used to reduce the complexity of the model by shrinking the coefficients. It is also called L2 regularization. Ridge regression is a type of linear regression in which a small amount of bias is introduced so that we can get better long-term predictions. Its loss function is
Loss = Σ(Y_i − Ŷ_i)² + λ * Σ(b_j)², where λ is the tuning parameter.
- In this technique, the cost function is altered by adding a penalty term to it. The amount of bias added to the model is called the Ridge Regression penalty. It is calculated by multiplying λ by the sum of the squared weights of the individual features.
- A general linear or polynomial regression may fail if there is high collinearity between the independent variables; Ridge regression can be used to solve such problems, as the sketch below illustrates.
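A minimal sketch of Ridge on synthetic, highly collinear data (again, `alpha` plays the role of λ and its value is arbitrary); note how the Ridge coefficients stay stable where the plain least-squares ones can blow up:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data with two highly correlated predictors (collinearity)
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)   # nearly identical to x1
X = np.column_stack([x1, x2])
y = 2.0 * x1 + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)   # alpha plays the role of λ

print("OLS coefficients:  ", ols.coef_)    # can be large and unstable under collinearity
print("Ridge coefficients:", ridge.coef_)  # shrunk and far more stable
```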
Elastic Net Regression:
- It combines both Lasso (L1) and Ridge (L2) regularization in order to shrink unnecessary coefficients while keeping the informative ones.
- In terms of handling bias, Elastic Net is considered better than Ridge or Lasso regression alone, since even a small bias in a coefficient can disturb the predictions that depend on that variable. Elastic Net therefore handles collinearity better than Ridge or Lasso regression on their own.
- When it comes to model complexity, Elastic Net performs better than Ridge and Lasso regression, because with Ridge and Lasso alone the number of variables is often not significantly reduced, and this inability to reduce variables can lower model accuracy.
- Its loss function is
Loss = Σ(Y_i − Ŷ_i)² + λ * [α * Σ|b_j| + (1 − α) * Σ(b_j)²], where α is the mixing parameter between ridge (α = 0) and lasso (α = 1) and λ is again the tuning parameter (a short code sketch follows this list).
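A minimal scikit-learn sketch of Elastic Net on synthetic data; in scikit-learn the mixing parameter corresponds to `l1_ratio` (1.0 = pure Lasso, 0.0 = pure Ridge) while `alpha` plays the role of λ, and both values below are arbitrary choices:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Synthetic data: a few informative features plus a correlated noise feature
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6))
X[:, 3] = X[:, 0] + rng.normal(scale=0.05, size=200)   # correlated copy of feature 0
y = 4.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

# alpha ~ λ, l1_ratio ~ the mixing parameter α
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("Elastic Net coefficients:", enet.coef_)
```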
In conclusion, to improve the performance of our models we need to find an optimal balance between bias and variance. For this, we can use Regularization, which reduces overfitting, one of the most important factors hindering a model's performance.