How to Handle Missing Data | Machine learning | Data science

Mahesh Jadhav
3 min read · Jun 5, 2023
Handling Missing Data

Handling missing data is a core skill for a data scientist. Real-world datasets almost always contain missing values, and dealing with them requires careful planning to avoid bias and to keep analysis accurate and model training efficient.

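Missing data can be categorized into three types:

1. MCAR (Missing Completely at Random): the chance that a value is missing does not depend on any data, observed or unobserved.
2. MAR (Missing at Random): the chance that a value is missing depends only on other observed variables.
3. MNAR (Missing Not at Random): the chance that a value is missing depends on the missing value itself or on something unobserved.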
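
To make the three types concrete, here is a minimal simulation sketch. The columns (age, income), the probabilities, and the 60,000 cut-off are made-up assumptions chosen only for illustration; they do not come from any real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical toy data: "age" is always observed, "income" will get missing values.
df = pd.DataFrame({
    "age": rng.integers(20, 70, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
})

# MCAR: every income value has the same 10% chance of being missing.
mcar_mask = rng.random(n) < 0.10

# MAR: the chance depends on an observed column (age), not on income itself.
mar_mask = rng.random(n) < np.where(df["age"] > 50, 0.30, 0.05)

# MNAR: the chance depends on the value that goes missing (high incomes are hidden more often).
mnar_mask = rng.random(n) < np.where(df["income"] > 60_000, 0.40, 0.05)

# Series.mask replaces the values with NaN wherever the mask is True.
income_mcar = df["income"].mask(mcar_mask)
income_mar = df["income"].mask(mar_mask)
income_mnar = df["income"].mask(mnar_mask)
```

In the MNAR case the observed incomes are systematically lower than the true ones, which is why this type is the hardest to handle.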

There are two main ways to handle missing data: remove the missing values, or impute them, that is, replace them with values estimated from the rest of the data.

Remove missing values:

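The rows or columns that contain missing values are simply removed from the dataset. We use this method if the missing data is MCAR or MAR and constitutes less than 5% of the total available data. Deletion is safest under MCAR; with MAR data it can still introduce some bias, so it is worth comparing the results before and after removal.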
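
Before removing anything, it helps to measure how much is missing. The sketch below assumes a pandas DataFrame with made-up columns (age, city) and applies the 5% rule of thumb mentioned above; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100 rows, 3% of "age" and 20% of "city" missing.
df = pd.DataFrame({
    "age": np.arange(100, dtype=float),
    "city": ["Pune"] * 100,
})
df.loc[[3, 40, 77], "age"] = np.nan
df.loc[0:19, "city"] = np.nan  # .loc slices include the end label

# Percentage of missing values per column.
missing_pct = df.isna().mean() * 100

# Columns that have missing values, but fewer than 5% of them.
small_cols = missing_pct[(missing_pct > 0) & (missing_pct < 5)].index.tolist()

# Remove only the rows that are missing a value in one of those columns.
df_small_removed = df.dropna(subset=small_cols)
```

Here only "age" falls under the 5% threshold, so the three rows with a missing age are dropped, while "city" is left for imputation.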

Advantages and Disadvantages:
1. Easy to implement.
2. Preserves the distribution of the data if it is MCAR.
3. The excluded rows may carry important information, and losing them can reduce model performance.
4. Cannot handle missing values that appear in production, because a prediction is still expected for those rows.

  1. Listwise deletion:
    Also known as CCA (Complete Case Analysis), it discards every row that has a missing value in any column.
  2. Pairwise deletion:
    Also known as ACA (Available Case Analysis), it loses less data than listwise deletion. Each statistic is computed from all rows that have values for the variables involved. For example, the correlation between two variables uses every row where both are present, even if other columns in those rows are missing.
  3. Dropping Column:
    If a column contains a large proportion of missing values and shows no correlation with the target variable, then instead of removing rows, we can drop the entire column to simplify the dataset.
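
The three deletion strategies above can be sketched with pandas as follows. The data and the 50% cut-off for dropping a column are assumptions made for this example; the sketch also checks only the share of missing values and does not check the correlation with the target.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in three feature columns.
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "b": [2.0, np.nan, 6.0, 8.0, 10.0, 12.0],
    "c": [np.nan, np.nan, np.nan, np.nan, 1.0, 0.0],
    "target": [0, 1, 0, 1, 0, 1],
})

# 1. Listwise deletion: keep only rows with no missing value in any column.
listwise = df.dropna()

# 2. Pairwise deletion: DataFrame.corr() computes each pairwise correlation
#    from the rows where both columns of that pair are present.
pairwise_corr = df[["a", "b", "target"]].corr()

# 3. Dropping columns: remove columns where more than 50% of the values are missing
#    (the 50% threshold is an arbitrary choice for this sketch).
missing_share = df.isna().mean()
reduced = df.drop(columns=missing_share[missing_share > 0.5].index)
```

Listwise deletion keeps only two of the six rows here, while the pairwise correlation between "a" and "b" still uses the four rows where both are present.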

Imputing missing values:

In this technique, we replace each missing value with the most appropriate estimate, obtained using different statistical methods.

Univariate Imputation:

Missing values are estimated using only the information available within that specific variable. The method used depends on the data type of the variable.

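Advantages and Disadvantages:
1. Avoids the data loss caused by deletion.
2. May change the shape of the data's distribution.
3. May change the covariance and correlation between variables.
4. May create spurious outliers: filling many gaps with a single value shrinks the variance, so legitimate values can start to look extreme.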

Numerical Imputations:

  1. Mean:
    Missing values are replaced with the mean of the column. Works best on normally distributed data.
  2. Median:
    Missing values are replaced with the median of the column. Works best on skewed distribution.
  3. End of Distribution:
    Missing values are replaced with far-end or extreme far-end values. The formula depends on the distribution:
    Normal: (mean - 3σ) or (mean + 3σ)
    Skewed: (Q1 - 1.5 × IQR) or (Q3 + 1.5 × IQR)
  4. Random:
    Missing values are replaced with values drawn at random from the observed values in that column. This technique helps preserve the variance and distribution of the data but can be memory-heavy during deployment, because the observed training values must be kept available to sample from.
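
The four numerical strategies can be sketched with pandas as below. The series values are made up for illustration, and the end-of-distribution fill uses the upper skewed bound only as one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing values and one large value (40).
s = pd.Series([12.0, 15.0, np.nan, 14.0, 13.0, np.nan, 40.0, 16.0], name="value")

# 1. Mean and 2. Median imputation (pandas ignores NaN when computing both).
mean_filled = s.fillna(s.mean())
median_filled = s.fillna(s.median())

# 3. End of distribution: compute a far-end value and use it as the fill.
upper_normal = s.mean() + 3 * s.std()          # for roughly normal data
q1, q3 = s.quantile(0.25), s.quantile(0.75)
upper_skewed = q3 + 1.5 * (q3 - q1)            # for skewed data
end_filled = s.fillna(upper_skewed)

# 4. Random sample imputation: draw replacements from the observed values.
observed = s.dropna()
n_missing = int(s.isna().sum())
draws = observed.sample(n=n_missing, replace=True, random_state=0)
random_filled = s.copy()
random_filled.loc[s.isna()] = draws.to_numpy()
```

Because of the value 40, the mean (about 18.3) sits well above the median (14.5), which is why the median is usually the safer fill for skewed columns.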

Categorical Imputations:

  1. Mode:
    Here, missing values are replaced with the most frequent value, which is the mode of that specific column.
  2. Arbitrary:
    If missing values are MNAR or constitute more than 5% of the total data, then we can replace them with custom text like “MISSING”.
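
Both categorical strategies are one-liners in pandas. The column name and values below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing values.
colour = pd.Series(["red", "blue", np.nan, "red", np.nan, "green"], name="colour")

# 1. Mode imputation: mode() can return several values on a tie, so take the first.
mode_filled = colour.fillna(colour.mode().iloc[0])

# 2. Arbitrary imputation: mark the gaps with their own explicit category.
arbitrary_filled = colour.fillna("MISSING")
```

The arbitrary label keeps the fact that a value was missing visible to the model, which is useful when the missingness itself carries information.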

Multivariate Imputation:

In this technique, missing values are estimated from the values in other columns, using relationships such as correlation, covariance, and the Euclidean distance between rows.

Advantages and Disadvantages:
1. Usually gives more accurate estimates than univariate imputation, because it uses information from other columns.
2. Requires more computation, which may slow down the process.
3. Memory-heavy when deployed to production.

  1. KNN Imputer:
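    Missing values are estimated with the K-nearest neighbors algorithm, where k is the number of nearest neighbors used in the calculation. Each missing value is typically filled with the average of that feature over the k nearest rows. Distance is measured with a Euclidean distance that skips missing coordinates:
    Euclidean distance(x, y) = sqrt(weight × squared distance over the coordinates present in both x and y)
    where weight = total number of coordinates / number of present coordinates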
  2. Iterative Imputer:
    Also known as MICE (Multivariate Imputation by Chained Equations), this technique predicts missing values using a machine learning (ML) model. First, the missing values are filled with an initial guess using SimpleImputer with any chosen strategy. Then a feature column with missing values is designated as the output variable y, while the other feature columns are treated as input variables X. A regressor is fitted on (X, y) for the rows where y is known and is then used to predict the missing values of y. This is done for each feature in turn, and the whole cycle is repeated for a maximum of max_iter imputation rounds. The results obtained from the final imputation round are returned.
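
Both imputers are available in scikit-learn, and a minimal sketch looks like this. The small matrix is illustrative only, and n_neighbors=2, max_iter=10 and random_state=0 are arbitrary choices for this example. Note that IterativeImputer is still marked experimental in scikit-learn, so it needs the extra enable import.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, must come before the IterativeImputer import
from sklearn.impute import IterativeImputer, KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

# Illustrative data: rows are samples, columns are features.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Distance between rows 0 and 1: only the first two coordinates are present in both,
# so weight = 3 / 2 and squared distance = (1 - 3)^2 + (2 - 4)^2 = 8.
d01 = nan_euclidean_distances(X[[0]], X[[1]])[0, 0]

# KNN imputation: each gap is filled with the mean of that feature
# over the 2 nearest rows that have the feature.
knn = KNNImputer(n_neighbors=2)
X_knn = knn.fit_transform(X)

# Iterative imputation: each feature with gaps is regressed on the others,
# for at most max_iter rounds.
iterative = IterativeImputer(max_iter=10, random_state=0)
X_iter = iterative.fit_transform(X)
```

For the first row, the two nearest rows are the second and third, so the missing third feature becomes the mean of 3 and 5. In both cases the observed values are left unchanged and only the gaps are filled.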
