This article was published as a part of the Data Science Blogathon.
Introduction

In my previous article, I talked about the theoretical concepts of outliers and tried to answer the question: "When should we drop outliers, and when should we keep them?". To get a better understanding of this article, first read that one, so that you have a clear idea of outlier analysis in Data Science projects. In this article, we will try to answer the following questions, along with a Python implementation:

👉 How to treat outliers?
👉 How to detect outliers?
👉 What are the techniques for outlier detection and removal?

Let's get started.

How to treat outliers?

👉 Trimming: We simply exclude the outlier values from our analysis. The dataset becomes thinner when many outliers are present, but the main advantage of this technique is that it is the fastest.

👉 Capping: In this technique, we cap our outliers at a limit, i.e., above a particular value or below another value, all observations are treated as outliers and replaced by that capping value. For example, if you're working on an income feature, you might find that people above a certain income level behave in the same way as those with a slightly lower income. In this case, you can cap the income at a level that keeps that behaviour intact and treat the outliers accordingly.

👉 Treat outliers as missing values: We assume outliers are missing observations and treat them accordingly, i.e., with the same techniques used for missing values. You can refer to the missing value article here.

👉 Discretization: In this technique, we form groups and include the outliers in a particular group, forcing them to behave in the same manner as the other points in that group. This technique is also known as Binning. You can learn more about discretization here.

How to detect outliers?

👉 For Normal distributions: Use the empirical rule of the Normal distribution.
– Data points that fall below mean − 3*(sigma) or above mean + 3*(sigma) are outliers, where mean and sigma are the average value and standard deviation of a particular column.
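As a minimal sketch of this rule on made-up numbers (the values below are hypothetical and not from the dataset used later in this article):

```python
import pandas as pd

# Toy data: 30 ordinary values plus one extreme value (hypothetical numbers).
s = pd.Series([48, 49, 50, 51, 52] * 6 + [120])

# Empirical-rule boundaries: mean ± 3 standard deviations.
upper = s.mean() + 3 * s.std()
lower = s.mean() - 3 * s.std()

# Anything outside [lower, upper] is flagged as an outlier.
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # -> [120]
```

Note that with very small samples a single extreme value inflates the standard deviation so much that it may not cross the 3-sigma fence, so this rule is more reliable on reasonably large columns.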
👉 For Skewed distributions: Use the Inter-Quartile Range (IQR) proximity rule.
– Data points that fall below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are outliers, where Q1 and Q3 are the 25th and 75th percentiles of the dataset respectively, and IQR is the inter-quartile range, given by Q3 − Q1.
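A small sketch of the IQR rule on hypothetical skewed data (the numbers are made up for illustration):

```python
import pandas as pd

# Skewed toy data: mostly small values, one very large one (hypothetical).
s = pd.Series([1, 2, 2, 3, 3, 4, 4, 5, 6, 40])

# Quartiles and the inter-quartile range.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Tukey fences: 1.5 * IQR beyond each quartile.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # -> [40]
```

Because quartiles are resistant to extreme values, the fences here are not dragged outward by the single large observation, unlike the mean and standard deviation in the 3-sigma rule.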
👉 For Other distributions: Use a percentile-based approach. For example, data points above the 99th percentile or below the 1st percentile can be considered outliers.
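A quick sketch of the percentile approach on hypothetical data, showing both trimming and capping at the 1st and 99th percentiles:

```python
import pandas as pd

# Toy data: 100 ordinary values plus two extremes (hypothetical numbers).
s = pd.Series(list(range(1, 101)) + [-500, 500])

# Percentile-based boundaries.
lower, upper = s.quantile(0.01), s.quantile(0.99)

# Trimming: keep only the values inside the boundaries.
trimmed = s[(s >= lower) & (s <= upper)]

# Capping: pull extreme values back to the boundaries instead of dropping them.
capped = s.clip(lower, upper)
```

`Series.clip` is a convenient built-in for the capping variant: it replaces everything below `lower` with `lower` and everything above `upper` with `upper`, leaving the row count unchanged.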
Techniques for outlier detection and removal:

👉 Z-score treatment:

Assumption: The features are normally or approximately normally distributed.

Step-1: Importing necessary dependencies

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

Step-2: Read and load the dataset

```python
df = pd.read_csv('placement.csv')
df.sample(5)
```

Step-3: Plot the distribution plots for the features

```python
import warnings
warnings.filterwarnings('ignore')

plt.figure(figsize=(16, 5))
plt.subplot(1, 2, 1)
sns.distplot(df['cgpa'])
plt.subplot(1, 2, 2)
sns.distplot(df['placement_exam_marks'])
plt.show()
```

Step-4: Finding the boundary values

```python
print("Highest allowed", df['cgpa'].mean() + 3*df['cgpa'].std())
print("Lowest allowed", df['cgpa'].mean() - 3*df['cgpa'].std())
```

Output:

Highest allowed 8.808933625397177
Lowest allowed 5.113546374602842

Step-5: Finding the outliers

```python
df[(df['cgpa'] > 8.80) | (df['cgpa'] < 5.11)]
```

Step-6: Trimming of outliers

```python
new_df = df[(df['cgpa'] < 8.80) & (df['cgpa'] > 5.11)]
new_df
```

Step-7: Capping on outliers

```python
upper_limit = df['cgpa'].mean() + 3*df['cgpa'].std()
lower_limit = df['cgpa'].mean() - 3*df['cgpa'].std()
```

Step-8: Now, apply the capping

```python
df['cgpa'] = np.where(
    df['cgpa'] > upper_limit,
    upper_limit,
    np.where(
        df['cgpa'] < lower_limit,
        lower_limit,
        df['cgpa']
    )
)
```

Step-9: Now see the statistics using the describe() function

```python
df['cgpa'].describe()
```

Output:

count    1000.000000
mean        6.961499
std         0.612688
min         5.113546
25%         6.550000
50%         6.960000
75%         7.370000
max         8.808934
Name: cgpa, dtype: float64

This completes our Z-score based technique!

👉 IQR based filtering: Used when our data distribution is skewed.
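The z-score trimming above can also be written more compactly with `scipy.stats.zscore`, which standardizes a column in one call. The DataFrame below is a hypothetical stand-in for the placement data, not the real dataset:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in for the 'cgpa' column: 20 typical values plus one extreme.
df = pd.DataFrame({"cgpa": [6.8, 6.9, 7.0, 7.1, 7.2] * 4 + [2.0]})

# Absolute z-scores (scipy uses the population standard deviation by default).
z = np.abs(stats.zscore(df["cgpa"]))

# Keep only the rows within 3 standard deviations of the mean.
clean = df[z < 3]
```

This is equivalent in spirit to computing `mean ± 3*std` boundaries by hand, just expressed as a single boolean filter per column.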
Step-1: Import necessary dependencies

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

Step-2: Read and load the dataset

```python
df = pd.read_csv('placement.csv')
df.head()
```

Step-3: Plot the distribution plots for the features

```python
plt.figure(figsize=(16, 5))
plt.subplot(1, 2, 1)
sns.distplot(df['cgpa'])
plt.subplot(1, 2, 2)
sns.distplot(df['placement_exam_marks'])
plt.show()
```

Step-4: Form a box-plot for the skewed feature

```python
sns.boxplot(df['placement_exam_marks'])
```

Step-5: Finding the IQR

```python
percentile25 = df['placement_exam_marks'].quantile(0.25)
percentile75 = df['placement_exam_marks'].quantile(0.75)
iqr = percentile75 - percentile25
```

Step-6: Finding the upper and lower limits

```python
upper_limit = percentile75 + 1.5 * iqr
lower_limit = percentile25 - 1.5 * iqr
```

Step-7: Finding outliers

```python
df[df['placement_exam_marks'] > upper_limit]
df[df['placement_exam_marks'] < lower_limit]
```

Step-8: Trimming

```python
new_df = df[df['placement_exam_marks'] < upper_limit]
new_df.shape
```

Step-9: Compare the plots after trimming

```python
plt.figure(figsize=(16, 8))
plt.subplot(2, 2, 1)
sns.distplot(df['placement_exam_marks'])
plt.subplot(2, 2, 2)
sns.boxplot(df['placement_exam_marks'])
plt.subplot(2, 2, 3)
sns.distplot(new_df['placement_exam_marks'])
plt.subplot(2, 2, 4)
sns.boxplot(new_df['placement_exam_marks'])
plt.show()
```

Step-10: Capping

```python
new_df_cap = df.copy()
new_df_cap['placement_exam_marks'] = np.where(
    new_df_cap['placement_exam_marks'] > upper_limit,
    upper_limit,
    np.where(
        new_df_cap['placement_exam_marks'] < lower_limit,
        lower_limit,
        new_df_cap['placement_exam_marks']
    )
)
```

Step-11: Compare the plots after capping

```python
plt.figure(figsize=(16, 8))
plt.subplot(2, 2, 1)
sns.distplot(df['placement_exam_marks'])
plt.subplot(2, 2, 2)
sns.boxplot(df['placement_exam_marks'])
plt.subplot(2, 2, 3)
sns.distplot(new_df_cap['placement_exam_marks'])
plt.subplot(2, 2, 4)
sns.boxplot(new_df_cap['placement_exam_marks'])
plt.show()
```

This completes our IQR based technique!
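The IQR steps above can be wrapped into a small reusable helper. The `marks` column and its values below are hypothetical stand-ins for the `placement_exam_marks` feature:

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) Tukey fences for a numeric Series."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical marks data with one obvious outlier.
df = pd.DataFrame({"marks": [10, 12, 14, 15, 16, 18, 20, 22, 90]})

low, high = iqr_bounds(df["marks"])

# Trimming: drop rows outside the fences.
trimmed = df[df["marks"].between(low, high)]

# Capping: pull values outside the fences back to the fence values.
capped = df.assign(marks=df["marks"].clip(low, high))
```

Raising `k` (e.g., to 3.0) makes the fences more tolerant, flagging only very extreme points; this is a common knob when 1.5 trims too aggressively on heavy-tailed data.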
👉 Percentile-based approach:

– This technique works by setting particular threshold values, which are decided based on our problem statement.
– When we remove the outliers by capping at these thresholds, the method is known as Winsorization.
– Here we always maintain symmetry on both sides: if we remove the top 1%, we also drop the bottom 1%.

Step-1: Import necessary dependencies

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

Step-2: Read and load the dataset

```python
df = pd.read_csv('weight-height.csv')
df.sample(5)
```

Step-3: Plot the distribution plot of the "Height" feature

```python
sns.distplot(df['Height'])
```

Step-4: Plot the box-plot of the "Height" feature

```python
sns.boxplot(df['Height'])
```

Step-5: Finding the upper and lower limits

```python
upper_limit = df['Height'].quantile(0.99)
lower_limit = df['Height'].quantile(0.01)
```

Step-6: Apply trimming

```python
new_df = df[(df['Height'] <= upper_limit) & (df['Height'] >= lower_limit)]
```

Step-7: Compare the distribution and box-plot after trimming

```python
sns.distplot(new_df['Height'])
sns.boxplot(new_df['Height'])
```

👉 Winsorization:

Step-8: Apply capping (Winsorization)

```python
df['Height'] = np.where(df['Height'] >= upper_limit, upper_limit,
               np.where(df['Height'] <= lower_limit, lower_limit,
                        df['Height']))
```

Step-9: Compare the distribution and box-plot after capping

```python
sns.distplot(df['Height'])
sns.boxplot(df['Height'])
```

This completes our percentile-based technique!

End Notes

Thanks for reading! If you liked this and want to know more, go visit my other articles on Data Science and Machine Learning by clicking on the Link. Please feel free to contact me on LinkedIn or Email. Something not mentioned, or want to share your thoughts? Feel free to comment below, and I'll get back to you.

About the author

Chirag Goyal

Currently, I am pursuing my Bachelor of Technology (B.Tech) in Computer Science and Engineering from the Indian Institute of Technology, Jodhpur (IITJ). I am very enthusiastic about Machine Learning, Deep Learning, and Artificial Intelligence.
The media shown in this article are not owned by Analytics Vidhya and are used at the Author's discretion.