This article was published as a part of the Data Science Blogathon.
Introduction

In my previous article, I talked about the theoretical concepts of outliers and tried to answer the question: "When should we drop outliers, and when should we keep them?". To get a better understanding of this article, first read that one, so that you have a clear idea of outlier analysis in Data Science projects. In this article, we will try to answer the following questions, along with a Python implementation:

👉 How to treat outliers?
👉 How to detect outliers?
👉 What are the techniques for outlier detection and removal?

Let's get started.

How to treat outliers?

👉 Trimming: We simply exclude the outlier values from our analysis. The dataset becomes thinner when many outliers are present, but the main advantage of this technique is that it is the fastest.

👉 Capping: In this technique, we cap our outliers at a limit, i.e., above a particular value or below another value, all observations are treated as outliers and replaced by that capping value. For example, if you're working on an income feature, you might find that people above a certain income level behave in the same way as those with a slightly lower income. In this case, you can cap the income at a level that keeps that behaviour intact and treat the outliers accordingly.

👉 Treat outliers as missing values: We assume outliers are missing observations and treat them accordingly, i.e., with the same techniques used for missing values. You can refer to the missing value article here.

👉 Discretization: In this technique, we form groups and include the outliers in a particular group, forcing them to behave in the same manner as the other points in that group. This technique is also known as Binning. You can learn more about discretization here.

How to detect outliers?

👉 For Normal distributions: Use the empirical rule of the Normal distribution.
– Data points that fall below mean − 3*(sigma) or above mean + 3*(sigma) are outliers, where mean and sigma are the average value and standard deviation of a particular column.
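As a minimal sketch of this rule on made-up numbers (the values below are hypothetical and not from the dataset used later in this article):

```python
import pandas as pd

# Toy data: 30 ordinary values plus one extreme value (hypothetical numbers).
s = pd.Series([48, 49, 50, 51, 52] * 6 + [120])

# Empirical-rule boundaries: mean ± 3 standard deviations.
upper = s.mean() + 3 * s.std()
lower = s.mean() - 3 * s.std()

# Anything outside [lower, upper] is flagged as an outlier.
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # -> [120]
```

Note that with very small samples a single extreme value inflates the standard deviation so much that it may not cross the 3-sigma fence, so this rule is more reliable on reasonably large columns.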
👉 For Skewed distributions: Use the Inter-Quartile Range (IQR) proximity rule.
– Data points that fall below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are outliers, where Q1 and Q3 are the 25th and 75th percentiles of the dataset respectively, and IQR is the inter-quartile range, given by Q3 − Q1.
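A small sketch of the IQR rule on hypothetical skewed data (the numbers are made up for illustration):

```python
import pandas as pd

# Skewed toy data: mostly small values, one very large one (hypothetical).
s = pd.Series([1, 2, 2, 3, 3, 4, 4, 5, 6, 40])

# Quartiles and the inter-quartile range.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Tukey fences: 1.5 * IQR beyond each quartile.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # -> [40]
```

Because quartiles are resistant to extreme values, the fences here are not dragged outward by the single large observation, unlike the mean and standard deviation in the 3-sigma rule.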
👉 For Other distributions: Use a percentile-based approach. For example, data points above the 99th percentile or below the 1st percentile can be considered outliers.
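A quick sketch of the percentile approach on hypothetical data, showing both trimming and capping at the 1st and 99th percentiles:

```python
import pandas as pd

# Toy data: 100 ordinary values plus two extremes (hypothetical numbers).
s = pd.Series(list(range(1, 101)) + [-500, 500])

# Percentile-based boundaries.
lower, upper = s.quantile(0.01), s.quantile(0.99)

# Trimming: keep only the values inside the boundaries.
trimmed = s[(s >= lower) & (s <= upper)]

# Capping: pull extreme values back to the boundaries instead of dropping them.
capped = s.clip(lower, upper)
```

`Series.clip` is a convenient built-in for the capping variant: it replaces everything below `lower` with `lower` and everything above `upper` with `upper`, leaving the row count unchanged.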
Techniques for outlier detection and removal:

👉 Z-score treatment:

Assumption: The features are normally or approximately normally distributed.

Step-1: Importing necessary dependencies

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

Step-2: Read and load the dataset

```python
df = pd.read_csv('placement.csv')
df.sample(5)
```

Step-3: Plot the distribution plots for the features

```python
import warnings
warnings.filterwarnings('ignore')

plt.figure(figsize=(16, 5))
plt.subplot(1, 2, 1)
sns.distplot(df['cgpa'])
plt.subplot(1, 2, 2)
sns.distplot(df['placement_exam_marks'])
plt.show()
```

Step-4: Finding the boundary values

```python
print("Highest allowed", df['cgpa'].mean() + 3*df['cgpa'].std())
print("Lowest allowed", df['cgpa'].mean() - 3*df['cgpa'].std())
```

Output:

Highest allowed 8.808933625397177
Lowest allowed 5.113546374602842

Step-5: Finding the outliers

```python
df[(df['cgpa'] > 8.80) | (df['cgpa'] < 5.11)]
```

Step-6: Trimming of outliers

```python
new_df = df[(df['cgpa'] < 8.80) & (df['cgpa'] > 5.11)]
new_df
```

Step-7: Capping on outliers

```python
upper_limit = df['cgpa'].mean() + 3*df['cgpa'].std()
lower_limit = df['cgpa'].mean() - 3*df['cgpa'].std()
```

Step-8: Now, apply the capping

```python
df['cgpa'] = np.where(
    df['cgpa'] > upper_limit,
    upper_limit,
    np.where(
        df['cgpa'] < lower_limit,
        lower_limit,
        df['cgpa']
    )
)
```

Step-9: Now see the statistics using the describe() function

```python
df['cgpa'].describe()
```

Output:

count    1000.000000
mean        6.961499
std         0.612688
min         5.113546
25%         6.550000
50%         6.960000
75%         7.370000
max         8.808934
Name: cgpa, dtype: float64

This completes our Z-score based technique!

👉 IQR based filtering: Used when our data distribution is skewed.
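The z-score trimming above can also be written more compactly with `scipy.stats.zscore`, which standardizes a column in one call. The DataFrame below is a hypothetical stand-in for the placement data, not the real dataset:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in for the 'cgpa' column: 20 typical values plus one extreme.
df = pd.DataFrame({"cgpa": [6.8, 6.9, 7.0, 7.1, 7.2] * 4 + [2.0]})

# Absolute z-scores (scipy uses the population standard deviation by default).
z = np.abs(stats.zscore(df["cgpa"]))

# Keep only the rows within 3 standard deviations of the mean.
clean = df[z < 3]
```

This is equivalent in spirit to computing `mean ± 3*std` boundaries by hand, just expressed as a single boolean filter per column.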
Step-1: Import necessary dependencies

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

Step-2: Read and load the dataset

```python
df = pd.read_csv('placement.csv')
df.head()
```

Step-3: Plot the distribution plots for the features

```python
plt.figure(figsize=(16, 5))
plt.subplot(1, 2, 1)
sns.distplot(df['cgpa'])
plt.subplot(1, 2, 2)
sns.distplot(df['placement_exam_marks'])
plt.show()
```

Step-4: Form a box-plot for the skewed feature

```python
sns.boxplot(df['placement_exam_marks'])
```

Step-5: Finding the IQR

```python
percentile25 = df['placement_exam_marks'].quantile(0.25)
percentile75 = df['placement_exam_marks'].quantile(0.75)
iqr = percentile75 - percentile25
```

Step-6: Finding the upper and lower limits

```python
upper_limit = percentile75 + 1.5 * iqr
lower_limit = percentile25 - 1.5 * iqr
```

Step-7: Finding outliers

```python
df[df['placement_exam_marks'] > upper_limit]
df[df['placement_exam_marks'] < lower_limit]
```

Step-8: Trimming

```python
new_df = df[df['placement_exam_marks'] < upper_limit]
new_df.shape
```

Step-9: Compare the plots after trimming

```python
plt.figure(figsize=(16, 8))
plt.subplot(2, 2, 1)
sns.distplot(df['placement_exam_marks'])
plt.subplot(2, 2, 2)
sns.boxplot(df['placement_exam_marks'])
plt.subplot(2, 2, 3)
sns.distplot(new_df['placement_exam_marks'])
plt.subplot(2, 2, 4)
sns.boxplot(new_df['placement_exam_marks'])
plt.show()
```

Step-10: Capping

```python
new_df_cap = df.copy()
new_df_cap['placement_exam_marks'] = np.where(
    new_df_cap['placement_exam_marks'] > upper_limit,
    upper_limit,
    np.where(
        new_df_cap['placement_exam_marks'] < lower_limit,
        lower_limit,
        new_df_cap['placement_exam_marks']
    )
)
```

Step-11: Compare the plots after capping

```python
plt.figure(figsize=(16, 8))
plt.subplot(2, 2, 1)
sns.distplot(df['placement_exam_marks'])
plt.subplot(2, 2, 2)
sns.boxplot(df['placement_exam_marks'])
plt.subplot(2, 2, 3)
sns.distplot(new_df_cap['placement_exam_marks'])
plt.subplot(2, 2, 4)
sns.boxplot(new_df_cap['placement_exam_marks'])
plt.show()
```

This completes our IQR based technique!
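The IQR steps above can be wrapped into a small reusable helper. The `marks` column and its values below are hypothetical stand-ins for the `placement_exam_marks` feature:

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) Tukey fences for a numeric Series."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical marks data with one obvious outlier.
df = pd.DataFrame({"marks": [10, 12, 14, 15, 16, 18, 20, 22, 90]})

low, high = iqr_bounds(df["marks"])

# Trimming: drop rows outside the fences.
trimmed = df[df["marks"].between(low, high)]

# Capping: pull values outside the fences back to the fence values.
capped = df.assign(marks=df["marks"].clip(low, high))
```

Raising `k` (e.g., to 3.0) makes the fences more tolerant, flagging only very extreme points; this is a common knob when 1.5 trims too aggressively on heavy-tailed data.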
👉 Percentile-based approach:

– This technique works by setting particular threshold values, which are decided based on our problem statement.
– When we remove the outliers by capping at these thresholds, the method is known as Winsorization.
– Here we always maintain symmetry on both sides: if we remove the top 1%, we also drop the bottom 1%.

Step-1: Import necessary dependencies

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

Step-2: Read and load the dataset

```python
df = pd.read_csv('weight-height.csv')
df.sample(5)
```

Step-3: Plot the distribution plot of the "Height" feature

```python
sns.distplot(df['Height'])
```

Step-4: Plot the box-plot of the "Height" feature

```python
sns.boxplot(df['Height'])
```

Step-5: Finding the upper and lower limits

```python
upper_limit = df['Height'].quantile(0.99)
lower_limit = df['Height'].quantile(0.01)
```

Step-6: Apply trimming

```python
new_df = df[(df['Height'] <= upper_limit) & (df['Height'] >= lower_limit)]
```

Step-7: Compare the distribution and box-plot after trimming

```python
sns.distplot(new_df['Height'])
sns.boxplot(new_df['Height'])
```

👉 Winsorization:

Step-8: Apply capping (Winsorization)

```python
df['Height'] = np.where(df['Height'] >= upper_limit, upper_limit,
               np.where(df['Height'] <= lower_limit, lower_limit,
                        df['Height']))
```

Step-9: Compare the distribution and box-plot after capping

```python
sns.distplot(df['Height'])
sns.boxplot(df['Height'])
```

This completes our percentile-based technique!

End Notes

Thanks for reading! If you liked this and want to know more, go visit my other articles on Data Science and Machine Learning by clicking on the Link. Please feel free to contact me on LinkedIn or Email. Something not mentioned, or want to share your thoughts? Feel free to comment below, and I'll get back to you.

About the author

Chirag Goyal

Currently, I am pursuing my Bachelor of Technology (B.Tech) in Computer Science and Engineering from the Indian Institute of Technology, Jodhpur (IITJ). I am very enthusiastic about Machine Learning, Deep Learning, and Artificial Intelligence.
The media shown in this article are not owned by Analytics Vidhya and are used at the Author's discretion.