Data pre-processing is one of the prerequisites for real-world data mining problems. Real-world data is susceptible to noise, often contains missing values and vague information, and tends to be very large. These factors degrade the quality of the data, and if the data is of low quality, the results obtained from mining or modeling it will also be of low quality. So, before mining or modeling, the data must be passed through a series of quality-improving techniques collectively called data pre-processing. Thus, data pre-processing can be defined as the process of applying various techniques to raw (or low-quality) data in order to make it suitable for further processing (i.e. mining or modeling).
Once we know what data pre-processing does, the next question is: how is it actually done? The answer is that there is a series of techniques and algorithms for this task, and we can choose among them depending on our requirements and constraints. The most common techniques, used in almost every situation and covered in this article, are data cleaning, data merging (or data integration), data reduction, and data transformation.
Let’s discuss each topic individually:
Let’s first understand what data cleaning is. As the name suggests, data cleaning is the process of cleaning the data; here, cleaning refers to filling in missing values and removing noise and outliers. In most cases, this is the first step of data pre-processing. Generally, data cleaning includes:
iii. Filling missing values with central values (mean/median): This technique works far better than the ones mentioned above. Here, we insert the mean or median of the respective attribute in place of the missing values. For better results, we first group the data on the basis of similar attributes and then apply this technique within each group (see the sketch below).
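A minimal sketch of this idea with pandas, assuming a made-up dataset where the `age` column has missing values and `department` is used to group similar rows (both column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: 'age' has missing values, 'department' groups similar rows
df = pd.DataFrame({
    "department": ["HR", "HR", "IT", "IT", "IT"],
    "age": [25, np.nan, 30, np.nan, 40],
})

# Simple version: fill every missing value with the overall mean of the column
df["age_overall"] = df["age"].fillna(df["age"].mean())

# Grouped version: fill with the mean of similar rows (same department)
df["age_grouped"] = df["age"].fillna(
    df.groupby("department")["age"].transform("mean")
)
print(df)
```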
One of the most popular methods for smoothing (noise removal) is the binning method. This method smooths a sorted value by looking at its neighborhood: the sorted values are distributed into a number of bins (groups or buckets). This is also called local smoothing, as it consults the neighbors of a value to remove noise.
Let’s see what binning actually means with an example.
We have the sorted data: 7, 9, 14, 15, 17, 19, 22, 25
Bin 1 = 7, 9, 14, 15
Bin 2 = 17, 19, 22, 25
Smoothing by bin means: We replace each member of a bin with the mean of that bin. It can be shown as:
Bin 1 = 11.25, 11.25, 11.25, 11.25
Bin 2 = 20.75, 20.75, 20.75, 20.75
Smoothing by bin boundaries: We replace each value with the nearest boundary value (minimum or maximum) of its bin. It can be shown as:
Bin 1 = 7, 7, 15, 15
Bin 2 = 17, 17, 25, 25
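To make the two smoothing variants concrete, here is a small Python sketch (plain lists, no libraries) that reproduces the example above:

```python
# Equal-frequency binning with smoothing by bin means and by bin boundaries
data = [7, 9, 14, 15, 17, 19, 22, 25]
bin_size = 4

bins = [data[i:i + bin_size] for i in range(0, len(data), bin_size)]

# Smoothing by bin means: every value becomes its bin's mean
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the nearer of its bin's min/max
by_boundaries = [
    [min(b) if abs(v - min(b)) <= abs(v - max(b)) else max(b) for v in b]
    for b in bins
]

print(by_means)        # [[11.25, 11.25, 11.25, 11.25], [20.75, 20.75, 20.75, 20.75]]
print(by_boundaries)   # [[7, 7, 15, 15], [17, 17, 25, 25]]
```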
Smoothing can also be done by removing outliers. When similar values are clustered (grouped) together, the values that fall outside any cluster are called outliers.
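Cluster-based detection, as described above, needs a full clustering step; a simpler, commonly used substitute is the IQR (interquartile range) rule, sketched below on a made-up list that extends the earlier example with one extreme value:

```python
import numpy as np

# Made-up values: the earlier example plus one extreme point
values = np.array([7, 9, 14, 15, 17, 19, 22, 25, 90])

# IQR rule: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is flagged as an outlier
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers)  # expected to flag the value 90
```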
When dealing with real-world data, we might not find all the required data in a single dataset. In that case, we need to collect data from different sources and merge it into a single dataset. This process of merging or integrating data collected from different sources is called data merging or data integration.
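A minimal sketch of such integration with pandas, assuming two hypothetical sources that share a `customer_id` key (all names and values are made up):

```python
import pandas as pd

# Source 1: customer details; Source 2: purchase records
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Asha", "Bikram", "Chen"],
})
purchases = pd.DataFrame({
    "customer_id": [1, 1, 3],
    "amount": [250, 120, 600],
})

# Merge (integrate) the two sources on their shared key
merged = customers.merge(purchases, on="customer_id", how="left")
print(merged)
```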
While merging data, the most common issue we encounter is redundancy. Redundancy can be detected by correlation analysis: for nominal data (e.g. names of people or places) we use the chi-squared test, and for numerical data we can use the correlation coefficient and covariance.
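Here is a hedged sketch of both checks using pandas and SciPy on a small made-up dataset (the columns `city`, `region`, `income`, and `net_worth` are hypothetical):

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    "city":      ["Pokhara", "Kathmandu", "Pokhara", "Kathmandu", "Pokhara"],
    "region":    ["West", "Central", "West", "Central", "West"],
    "income":    [40, 55, 42, 60, 39],
    "net_worth": [400, 700, 430, 760, 390],
})

# Nominal attributes: chi-squared test on their contingency table
chi2, p, dof, _ = chi2_contingency(pd.crosstab(df["city"], df["region"]))
print(p)  # a small p-value would suggest the attributes are related (possibly redundant)

# Numeric attributes: correlation coefficient and covariance
print(df["income"].corr(df["net_worth"]))  # close to +1 or -1 suggests redundancy
print(df["income"].cov(df["net_worth"]))
```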
Data reduction is defined as the process of reducing the data by adopting certain strategies such that analysis of the reduced data produces essentially the same information as analysis of the original data.
Some of the data reduction strategies include dimensionality reduction, numerosity reduction, and data compression, among many others.
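As an illustration of one of these strategies (dimensionality reduction), here is a sketch using scikit-learn's PCA on synthetic data; the specific settings are illustrative, not prescriptive:

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up dataset: 100 samples, 5 attributes, the last 3 nearly determined by the first 2
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3)) + 0.01 * rng.normal(size=(100, 3))])

# Keep enough principal components to explain ~95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)  # far fewer columns, most information retained
```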
In this process, we try to change the nature or form of the data using certain strategies so that we can extract important information from it.
Some of the techniques for data transformation are:
iii. Attribute construction / feature engineering: First, let’s understand what feature engineering is. Feature engineering is the process of constructing (engineering) new attributes/features by observing the available features and the relations between them. This technique helps generate extra information from vague data, and it can be especially helpful when we have few features but they still contain hidden information to extract (see the sketch below).
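A minimal sketch of attribute construction with pandas, assuming a hypothetical housing dataset where ratios of existing columns may carry extra signal:

```python
import pandas as pd

# Hypothetical raw features about houses
df = pd.DataFrame({
    "price":     [300000, 450000, 210000],
    "area_sqft": [1500,   2500,   900],
    "rooms":     [3,      4,      2],
})

# Constructed attributes: combinations of existing features that may be more informative
df["price_per_sqft"] = df["price"] / df["area_sqft"]
df["area_per_room"] = df["area_sqft"] / df["rooms"]
print(df)
```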
Let’s understand it with an example. Suppose we are building a predictive model using a dataset that contains the net worth of the citizens of a country. In this dataset, we find a large variation in the values. If we feed this data directly to train a model, it may produce undesirable results. To get rid of that, we opt for normalization.
Some normalization techniques are min-max normalization, z-score normalization, and normalization by decimal scaling. In decimal scaling, each value v is replaced by v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.
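A small Python sketch of these three techniques on made-up "net worth" values with a large spread (the data is invented; the 10^j computation follows the definition above):

```python
import numpy as np

# Made-up 'net worth' values with a large spread, as in the example above
v = np.array([1_200.0, 45_000.0, 380_000.0, 9_500_000.0])

# Min-max normalization: rescale into [0, 1]
min_max = (v - v.min()) / (v.max() - v.min())

# Z-score normalization: zero mean, unit standard deviation
z_score = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, j = smallest integer with max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
decimal_scaled = v / 10 ** j

print(min_max, z_score, decimal_scaled, sep="\n")
```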
I have written this article with reference to the book Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber, and Jian Pei.