How to improve the accuracy of a classification model?
Everyone in data science, from practitioners to experts, wants higher accuracy from their models. Broadly, we perform two kinds of modelling: regression modelling and classification modelling. Since highly accurate models are always in demand, we need to know the techniques that can increase a model's accuracy. In this article, we are going to discuss ways to increase the accuracy of a classification model. The major points to be discussed in the article are listed below.
Table of contents
- About classification models
- Working on the data side
- Method 1: Acquire more data
- Method 2: Missing value treatment
- Method 3: Outlier treatment
- Method 4: Feature engineering
- Working on the model side
- Method 1: Hyperparameter tuning
- Method 2: Applying different models
- Method 3: Ensembling methods
- Method 4: Cross-validation
Let’s start with understanding classification models.
About classification models
This article assumes that the reader has some knowledge of classification models; for readers who do not, this section gives a general introduction to them.
In data science, classification data is data in which some attribute divides the records into classes. For example, in data about people from a city, the gender recorded for each person splits the records into classes.
A classification model can be defined as an algorithm that tells us the class of a data point using the other information in the data. For example, in the data of people from a city, we may have information such as name, education, work, and marital status, and a trained model can use this information to predict the gender of the person.
Building such a model, one that can tell us the class of any data point from the other information that data point carries, is called classification modelling. Anyone practising classification modelling, or expert in it, knows that we cannot build a model with no error. There will always be some error, but as modellers it is our responsibility to reduce that error, that is, to increase the accuracy of the model. A minimal sketch of fitting a classifier is given below.
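To make this concrete, here is a minimal sketch of classification modelling with scikit-learn. The synthetic dataset stands in for tabular data like the city example above; none of the values are from a real dataset.

```python
# A minimal sketch: train a classifier that predicts a class label
# from the other columns of the data, then check its accuracy.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for tabular data such as the "people of a city" example
X, y = make_classification(n_samples=500, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)                      # learn the class boundary
print(accuracy_score(y_test, model.predict(X_test)))
```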
In this article, we are going to discuss some of the methods that can increase the accuracy of a classification model; equivalently, we can think of them as methods to reduce the error of the model. Since acquiring data is one of the first steps in the modelling lifecycle, we start our discussion with the methods that can be applied on the data side.
Working on the data side
Various methods can be applied to the data to increase the accuracy of the model. Some of them are as follows:
Method 1: Acquire more data
One thing classification modelling always requires is letting the data tell the model as much about itself as possible. Problems arise when the amount of data is very small. In such a scenario we need techniques or sources that can provide us with more data.
However, if we are practising modelling or participating in competitions, getting more data is difficult; one option is to duplicate (resample) existing records in the training data to boost accuracy, as sketched below. If we are working on a company project, we can ask the source for more data where possible. This is one of the most basic methods that can lead us to higher accuracy than before. Let's move to our second method.
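As a hedged sketch of the duplicate-records idea, the snippet below oversamples the minority class with scikit-learn's `resample`; the toy DataFrame and its `label` column are placeholders for illustration, not part of any real project.

```python
# Oversampling: duplicate minority-class records until classes balance.
import pandas as pd
from sklearn.utils import resample

# Imbalanced toy data: 8 records of class 0, 2 of class 1
df = pd.DataFrame({"feature": range(10),
                   "label":   [0] * 8 + [1] * 2})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Sample the minority class with replacement until both classes match
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
df_balanced = pd.concat([majority, minority_up])
print(df_balanced["label"].value_counts())
```

Note that duplicated records add no new information; they only rebalance what the model sees, so this is a stopgap compared with genuinely acquiring more data.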
Method 2: Missing value treatment
There are various reasons why missing values appear in data. Here we are not concerned with how they arise but with how to treat them. One thing is very clear: missing values in the data can lead the modelling procedure to disaster. A biased model and inaccurate predictions can be the results of modelling with missing values in the data. Take a look at the table below.
| Age | Name    | Gender | Working status |
|-----|---------|--------|----------------|
| 18  | Rahul   | M      | N              |
| 20  | Archana | F      | Y              |
| 29  | Ashok   | M      | Y              |
| 18  | Anamika |        | N              |
| 27  | Ankit   | M      | Y              |
Now take a look at the table below, which shows the percentage of working people by gender.
| Gender | Percentage working |
|--------|--------------------|
| Female | 100%               |
| Male   | 66.66%             |
Here we can see that a missing value makes the records show that 100% of the females in the data are working. If we fill the missing value with F, the results change as follows.
| Gender | Percentage working |
|--------|--------------------|
| Female | 50%                |
| Male   | 66.66%             |
Here we can see the effect a single missing value has on the final figures.
There are various ways to deal with missing values; some of them are as follows (a short code sketch follows the list):
- Mean, median, and mode: This method deals with missing values in continuous data (age in our case). We can fill the missing values with the mean, median, or mode of the continuous variable.
- Separation into a class: This method treats the data points with missing values as their own class, and is suited to categorical data.
- Missing value prediction: This method uses another model to predict the missing values. After filling them in using that model, we can continue with the main modelling.
- KNN imputation: This method fills a missing value by finding the data points whose attributes are most similar to the record with the missing value, and copies in the information available in those similar data points.
- Filling the closest value: This method fills in the nearest available value in place of the missing value. It can work with continuous data but is best suited to time series data.
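Here is a small sketch of three of these treatments, using pandas and scikit-learn. The DataFrame loosely mirrors the toy table from this section, with one age blanked out for illustration.

```python
# Three imputation strategies on a toy column with one missing age.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

df = pd.DataFrame({"age":     [18, 20, 29, np.nan, 27],
                   "working": [0, 1, 1, 0, 1]})

# 1) Mean (or median/mode) imputation for a continuous column
df["age_mean"] = SimpleImputer(strategy="mean").fit_transform(df[["age"]]).ravel()

# 2) KNN imputation: fill from the rows with the most similar attributes
knn = KNNImputer(n_neighbors=2)
df["age_knn"] = knn.fit_transform(df[["age", "working"]])[:, 0]

# 3) Filling the closest previous value, most natural for time series
df["age_ffill"] = df["age"].ffill()
print(df)
```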
Method 3: Outlier treatment
Fitting a classification model can also be thought of as fitting a line or a region to the data points. If the data points are close to each other, fitting a model gives better results because the prediction area is dense. If some data points are sparse, the model can become inaccurate and biased toward those sparse points. To increase the accuracy of the classification model, we therefore need ways to treat outliers. Some of them are as follows (sketched in code after the list):
- Deleting outliers: By plotting the data points we can detect values that lie far from the dense area, and we can delete those sparse points if we have a very large amount of data. With little data, this way of treating outliers is not a good idea.
- Data transformation: Transforming the data can help us get rid of the influence of outliers in modelling. One way is to take the log of the data values to reduce their spread. Binning is another transformation, and algorithms such as decision trees handle outliers through binning of the data.
- Mean, median, mode: This method is similar to the one we discussed for missing values. Before using it, we need to check whether the detected outlier is natural or artificial. If the value is artificial, we can replace the outlier with the mean, median, or mode of the other data points.
- Separation: If the number of outliers is unusually large, we can separate them from the main data and fit a model on them separately.
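Below is a brief sketch of two of these treatments: deleting points outside the common 1.5×IQR fences, and a log transform. The series and its values are invented for illustration.

```python
# Detect an outlier with the IQR rule, then show two treatments:
# deletion and a log transform that compresses extreme values.
import numpy as np
import pandas as pd

s = pd.Series([12, 14, 13, 15, 14, 13, 120])   # 120 is a clear outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
in_fence = s.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(s[in_fence])      # deleting outliers: keep only in-fence points
print(np.log1p(s))      # transformation: log shrinks the influence of 120
```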
Method 4: Feature engineering
Feature engineering is one of the best ways to increase the accuracy of a classification model, since it lets the model work with only those variables that are strongly related to the target variable. We can also think of this method as forming a hypothesis about what drives the model's accuracy. We can perform feature engineering in three steps (a combined sketch follows this list):
- Transformation of features: At the core, feature transformation includes two main processes that can be applied one after the other; in some cases only one of them is needed.
- Scaling data: In this process, we normalize the data so that it lies on a common scale, for example between 0 and 1. If we have three variables measured in grams, kilograms, and tonnes, it is always advisable to normalize the data before fitting the model to improve its accuracy.
- Removing skewness: Many models give higher accuracy on data that is close to normally distributed, so before fitting a classification model it is worth removing as much skewness from the data as possible.
- Creation of features: This process creates new features out of the old features in the data, and can help us understand the data and generate new insights from it. Suppose the hour of the day extracted from a timestamp has no relationship with daily website traffic, but the minute does; creating a minute feature can then improve the accuracy of the model.
- Feature selection: This process tells us how each feature relates to the target variable. Using it, we generally reduce the number of features fed into the model; since only the best features are modelled, this improves the results of the model. Various approaches help in selecting the best features:
- Knowledge: Based on domain knowledge, we can say which variables are most important to model. For example, for the daily sales of a shop, the day and the amount of material are important, but the name of the customer is not.
- Parameters: Statistics such as the p-value help determine the strength of the relationship between variables. Using them, we can separate the features that matter for modelling from those that do not.
- Dimension reduction: Dimensionality reduction algorithms project the data into a lower dimension while also helping us understand the inherent relationships in the data.
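The sketch below strings these steps together: min-max scaling, selecting the strongest features with a statistical test, and dimension reduction with PCA. The synthetic data and the choice of `k=5` are illustrative assumptions, not prescriptions.

```python
# Scaling, statistical feature selection, and PCA on synthetic data.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

X_scaled = MinMaxScaler().fit_transform(X)            # rescale to [0, 1]
X_best = SelectKBest(f_classif, k=5).fit_transform(X_scaled, y)  # keep 5 strongest
X_2d = PCA(n_components=2).fit_transform(X_scaled)    # project to 2 dimensions
print(X_scaled.shape, X_best.shape, X_2d.shape)
```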
Working on the model side
In the sections above, we discussed methods that apply changes and techniques to the data. After making the data fit for modelling, we may need to make some changes on the model side to improve the accuracy of the classification modelling process. Some of these techniques are as follows:
Method 1: Hyperparameter tuning
At the core of a model are settings that drive how it produces its final results. Values learned from the data are called parameters, while values we choose before training are called hyperparameters. For example, the impurity measure (such as Gini impurity) that a decision tree uses to split the data into branches is a hyperparameter.
Since the way the data is split has a large impact on the accuracy of a decision tree, we need to find the best values for such settings. The search for optimal hyperparameter values is known as hyperparameter tuning.
Hyperparameter tuning can have a large impact on the performance and accuracy of a model, and various packages help with it, some in a fully automated way. A simple manual sketch is given below.
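As an illustration, here is a minimal grid search over a decision tree's splitting criterion and depth with scikit-learn's `GridSearchCV`; the grid values are arbitrary choices for the sketch.

```python
# Grid search over two decision-tree hyperparameters with 5-fold CV.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

grid = {"criterion": ["gini", "entropy"],   # impurity measure for splits
        "max_depth": [3, 5, 10, None]}      # how deep the tree may grow
search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```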
Method 2: Applying different models
There is every chance that a single model, even after much hyperparameter tuning, will not boost the accuracy of the procedure. In such a scenario, applying different classification models to the data can help increase accuracy. Hyperparameter tuning can then be applied to the finalized model for even better results.
Various packages help in finding an optimal model and its hyperparameters; a minimal comparison loop is sketched below.
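A minimal version of this idea, assuming nothing beyond scikit-learn, is to compare a few candidate classifiers with cross-validated accuracy before committing to one:

```python
# Compare several classifiers on the same data with cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

for model in [LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=0),
              SVC()]:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(type(model).__name__, round(score, 3))
```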
Method 3: Ensembling methods
In machine learning, ensemble methods differ from ordinary modelling in that they fit several weak models to the data and combine their results; combining the results is what makes them more accurate. We can place ensembling methods into two categories (sketched in code after the list):
- Averaging methods: These combine the results of different models by averaging them, and we can usually expect them to do better than a single model applied to the data. Examples are bagging meta-estimators and random forests.
- Boosting methods: These reduce the bias of the combined estimator by applying the base models sequentially. They are very powerful in terms of performance and accuracy. Examples are AdaBoost and gradient tree boosting.
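The sketch below contrasts one method from each category, a bagging meta-estimator and AdaBoost, on the same synthetic data; the estimator counts are illustrative.

```python
# An averaging ensemble (bagging) versus a boosting ensemble (AdaBoost).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

bagging = BaggingClassifier(n_estimators=50, random_state=0)    # averaging
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)  # boosting

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```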
Method 4: Cross-validation
Although this method touches both the data side and the model side, since we fit the model several times, we can consider cross-validation a method that works on the modelling side. A model that appears accurate is not necessarily accurate; cross-validation is a way to verify the accuracy of the model.
It works by applying the trained model to data points that were held out from training. We do this by dividing the whole data into folds of similar size, and at each training run we change which fold is held out. From the scores across folds we can infer whether the model is genuinely accurate.
Cross-validation is mainly useful when overfitting is a problem in the modelling. There are various techniques of cross-validation, such as K-fold, leave-one-group-out, and leave-P-groups-out. A K-fold sketch is given below.
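Here is a minimal K-fold sketch with scikit-learn; a large gap between fold scores is one informal sign of overfitting.

```python
# K-fold cross-validation: the model is trained K times, each time
# validated on a fold it did not see during training.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)
print(scores, scores.mean())
```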
Final words
In this article, we discussed methods that can be applied to increase the accuracy of a classification model. Some of them are universal and can be used with all types of modelling. Using these techniques, we can improve and verify the performance and accuracy of our classification models.