Data cleaning is the process of preparing data for analysis by modifying or removing data that is inaccurate, incomplete, meaningless, duplicated, or improperly formatted. Such data is rarely useful for analysis: it can disrupt later processing steps and produce inaccurate or irrelevant results. Several cleaning methods exist, and the right choice depends on the nature of the data.
The purpose of data cleaning is not merely to erase information to make room for fresh data, but to improve the accuracy of a data set without necessarily deleting information.
For one, data cleaning involves more than removing data: it also includes fixing spelling and syntax errors, standardizing data sets, correcting mistakes such as empty fields and missing codes, and identifying duplicate data points. Data cleaning is considered a foundational element of data science because it plays an important role in analysis and in uncovering the reliable answers hidden inside the data.
What are the techniques of Data Cleaning?
a) Handling (filling) missing values: Missing values in data can be handled using any of the techniques below:
i. Ignoring/Dropping: In some cases, it is better to ignore or drop a tuple that contains a missing value than to fill it. This is generally done with large datasets, where excluding a few tuples does not affect the information conveyed by the data. It is discouraged for small datasets, where it may lead to the loss of important information.
ii. Filling missing values manually: You can also fill in missing values by hand, based on your understanding of the data. This is usually practical only for small datasets, because it becomes too time-consuming for large ones.
iii. Filling central values (mean/median) in missing values: Unlike dropping, this technique keeps every tuple, and unlike manual filling, it scales to large datasets. Here we replace each missing value with the mean or median of the respective attribute. For better results, first group the data by similarity of attributes and then apply the technique within each group.
iv. Interpolation: This is one of the more reliable and principled ways of filling missing values. We first model the relationships among the attributes and then use that model to predict the most probable value for each missing entry.
This can be achieved with regression, Bayesian formulation, or decision tree induction.
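The techniques above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the records, the field names (dept, years, salary), and the helper function names are all hypothetical, and a simple least-squares line stands in for the regression option.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical records; None marks a missing value.
rows = [
    {"dept": "A", "years": 1, "salary": 30},
    {"dept": "A", "years": 2, "salary": None},
    {"dept": "A", "years": 3, "salary": 50},
    {"dept": "B", "years": 4, "salary": 60},
    {"dept": "B", "years": 5, "salary": None},
    {"dept": "B", "years": 6, "salary": 80},
]

def drop_missing(rows, key):
    """Ignoring/dropping: keep only the tuples where `key` is present."""
    return [r for r in rows if r[key] is not None]

def fill_central(rows, key, stat=mean):
    """Fill missing `key` values with the mean (or median) of the known ones."""
    fill = stat([r[key] for r in rows if r[key] is not None])
    return [{**r, key: fill if r[key] is None else r[key]} for r in rows]

def fill_by_group(rows, key, group_key, stat=mean):
    """Group similar tuples first, then fill within each group.
    Assumes every group has at least one known value."""
    known = defaultdict(list)
    for r in rows:
        if r[key] is not None:
            known[r[group_key]].append(r[key])
    fills = {group: stat(vals) for group, vals in known.items()}
    return [{**r, key: fills[r[group_key]] if r[key] is None else r[key]}
            for r in rows]

def fill_by_regression(rows, key, predictor):
    """Fit a least-squares line key ~ predictor on the complete tuples,
    then predict the missing values from it."""
    known = [(r[predictor], r[key]) for r in rows if r[key] is not None]
    x_bar = mean([x for x, _ in known])
    y_bar = mean([y for _, y in known])
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in known)
             / sum((x - x_bar) ** 2 for x, _ in known))
    intercept = y_bar - slope * x_bar
    return [{**r, key: intercept + slope * r[predictor] if r[key] is None else r[key]}
            for r in rows]

print([r["salary"] for r in fill_central(rows, "salary", stat=median)])
print([r["salary"] for r in fill_by_group(rows, "salary", "dept")])
print([r["salary"] for r in fill_by_regression(rows, "salary", "years")])
```

On this toy data the overall mean puts the same value in both gaps, while the group-wise and regression fills give each gap a value that follows its own department or its years of experience.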
b) Removing noise (smoothing) from data: What is noise in data? Noise is any random error or variance in a measured attribute. Outliers in the data can also be regarded as noise. Noise can strongly affect our mining results (that is, our predictions), so noisy data is not considered good data for mining purposes and should be removed as far as possible.
Before we remove noise, how can we detect it in our data? There are many noise detection techniques, but one of the most informative is visualization: plotting the different attributes of the data as graphs. Useful plots include scatter plots and box plots.
One of the most popular methods for smoothing (noise removal) is the binning method. Binning smooths a sorted value by consulting its neighborhood, that is, the values around it. The sorted values are distributed into a number of bins (groups or buckets). Because it consults neighboring values, this is also called local smoothing.
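A box plot conventionally flags points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles, and the same rule can be applied without drawing anything. The sketch below assumes that convention; the function name and the sample readings (including the injected value 120) are hypothetical.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values a box plot would draw as outlier points,
    using the conventional k * IQR fences around the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # first, second, third quartile
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Hypothetical readings; 120 is an injected bad measurement.
readings = [7, 9, 14, 15, 17, 19, 22, 25, 120]
print(iqr_outliers(readings))
```

The multiplier k is a tuning choice: a larger value flags fewer points as noise.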
Let’s see what binning means with an example.
Suppose we have the sorted data 7, 9, 14, 15, 17, 19, 22, 25 and split it into two bins of four values each:
Bin 1 = 7, 9, 14, 15
Bin 2 = 17, 19, 22, 25
Smoothing by bin means: We replace each member of a bin with the mean of that bin. This gives:
Bin 1 = 11.25, 11.25, 11.25, 11.25
Bin 2 = 20.75, 20.75, 20.75, 20.75
Smoothing by bin boundaries: We replace each value with the nearest boundary value (the minimum or maximum) of its bin. This gives:
Bin 1 = 7, 7, 15, 15
Bin 2 = 17, 17, 25, 25
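The binning example can be reproduced with a short Python sketch. The function names are hypothetical, the bins are assumed to be of equal size, and a value exactly halfway between the two boundaries is assumed to go to the lower one (the text does not say how ties are broken).

```python
def equal_size_bins(sorted_values, n_bins):
    """Split already-sorted values into n_bins bins of equal size."""
    if len(sorted_values) % n_bins != 0:
        raise ValueError("this sketch assumes the values divide evenly into bins")
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every member of a bin with the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every member with the nearer of its bin's min and max.
    Ties go to the lower boundary (an assumption of this sketch)."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are min and max
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

data = [7, 9, 14, 15, 17, 19, 22, 25]
bins = equal_size_bins(data, 2)
print(smooth_by_means(bins))       # [[11.25, 11.25, 11.25, 11.25], [20.75, 20.75, 20.75, 20.75]]
print(smooth_by_boundaries(bins))  # [[7, 7, 15, 15], [17, 17, 25, 25]]
```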
Smoothing can also be done by removing outliers. When similar values are clustered (grouped), the values that fall outside the clusters are called outliers.
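As a rough illustration of outlier removal by clustering, the sketch below groups one-dimensional values whenever neighboring values are close together and treats very small groups as outliers. The gap-based grouping, the max_gap and min_size parameters, and the sample value 60 are all assumptions made for this example; real projects typically use a dedicated clustering algorithm.

```python
def cluster_by_gap(values, max_gap):
    """Group sorted values; start a new cluster whenever the gap to the
    previous value exceeds max_gap. Assumes `values` is non-empty."""
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters

def remove_outliers(values, max_gap, min_size=2):
    """Keep values in clusters of at least min_size; the rest are outliers."""
    clusters = cluster_by_gap(values, max_gap)
    kept = [v for c in clusters if len(c) >= min_size for v in c]
    outliers = [v for c in clusters if len(c) < min_size for v in c]
    return kept, outliers

# Hypothetical data: the earlier sorted values plus one distant point.
kept, outliers = remove_outliers([7, 9, 14, 15, 17, 19, 22, 25, 60], max_gap=5)
print(kept)
print(outliers)
```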