Master Class for Outlier Analysis


What is an Outlier?

Before continuing, we should define what an outlier is. An outlier is any observation that stands out from the rest of the data points. Outliers tend to distort the distribution of the data. They are sometimes deleted, but this isn't always the best practice; it is often worthwhile to examine these points because they can provide meaningful information.

The dataset used for this article is the exam scores of students from two Portuguese schools in the subjects of math and Portuguese. The dataset, along with test scores, contains other demographic information such as the student’s age, size of the family, parents’ education, health, study time, etc.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Set the figure size and plot theme used throughout the article
plt.figure(figsize=(10, 5))
sns.set_theme(style="whitegrid", palette="Set2")

# The file is semicolon-separated
df = pd.read_csv("../data/student/student-por.csv", sep=";")

Initially, the dataset was published to check whether the above-mentioned factors had any impact on the final test score of the students, which is the variable named G3.

I want to put a little spin on this and look at outliers in the students' performance with respect to school, home, and lifestyle. For the sake of simplicity, we only consider the test scores in the Portuguese subject.

For a dataset like this, outliers are expected to be students who perform well without much support or students who perform poorly despite having all the resources available. We will employ unsupervised learning because we do not have a target variable, that is, a variable that indicates whether a student's performance is anomalous.

Boxplots
Let us analyze the performance of the students who receive support from their families and check for any outliers. For this example, we shall use boxplots.

# Total score is the sum of the three test scores
df['Total_score'] = df['G1'] + df['G2'] + df['G3']

# Schools are on the x-axis and total scores on the y-axis
sns.boxplot(x="school", y="Total_score", hue="famsup", data=df, palette=["m", "g"]).set(
    xlabel="School",
    ylabel="Total Score"
);
plt.legend(title="Family Support");
plt.title("Overall test scores of students with respect to schools");

In school GP, the median score is higher for students who receive family support, while the opposite holds for students at school MS. There are 6 points outside the whiskers: 5 students score below the lower whisker and one scores above the upper whisker. These points are the outliers.
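The whisker rule behind these points can be sketched in a few lines. The snippet below is a minimal illustration, assuming the default seaborn/matplotlib convention that whiskers reach at most 1.5 times the interquartile range (IQR) beyond the quartiles. The `iqr_bounds` helper and the toy scores are hypothetical and are not taken from the dataset.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) limits beyond which a point is drawn as an outlier."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy total scores (made up for illustration, not taken from the dataset)
scores = np.array([30, 32, 33, 35, 36, 37, 38, 40, 41, 5])

lower, upper = iqr_bounds(scores)
outliers = scores[(scores < lower) | (scores > upper)]
print(lower, upper, outliers)
```

Applied to one group of the real data (for example, the total scores of one school and one family-support category), the same bounds should correspond to the points drawn beyond the whiskers.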

Z-Scores
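Another way to check for outliers is to use z-scores. A z-score indicates how many standard deviations a score is from the mean. The formula for the z-score is

$$z = \frac{x - \mu}{\sigma}$$

where $x$ is the score, $\mu$ is the mean of the scores, and $\sigma$ is their standard deviation.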

plt.hist(df['G3'], bins=15, facecolor='blue', alpha=0.5)
plt.xlabel('Score')
plt.ylabel('Students')
plt.title('Overall distribution of scores in Final Test (G3)');

The distribution plot shows a left tail, meaning that some students performed poorly. However, this is the overall distribution of scores. Let us look specifically at the scores of students who do not receive support from school or family and do not take additional paid classes.

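# Keep only students with no school support, no family support, and no paid classes
df2 = df.loc[(df["schoolsup"] == "no") & (df["famsup"] == "no") & (df["paid"] == "no"),
             ["G3"]].copy()  # copy so columns can be added later without a pandas warning
plt.hist(df2['G3'], bins=15, facecolor='blue', alpha=0.5)
plt.xlabel('Score')
plt.ylabel('Students')
plt.title('Distribution of scores of students in Final Test (G3) without support');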

The distribution is slightly skewed, with more scores concentrated near 10 and 15, and a few students who scored less than 2 stand out as outliers from the rest. To compute the z-score, we will use a built-in function from the scipy library.

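import scipy.stats as stats

df2.rename(columns={'G3': 'Final Test Score (G3)'}, inplace=True)
# z-score of each final score (scipy uses the population standard deviation by default)
df2["z_score"] = stats.zscore(df2["Final Test Score (G3)"])
df2.head()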

As noted above, the z-score indicates how many standard deviations a point is from the mean. The further a point is from the center, the more extreme it is. Common rule-of-thumb thresholds are 2, 2.5, 3, and 3.5. In our case, the threshold is set at 3; that is, any point that is more than three standard deviations from the mean is considered an outlier.

df2["outlier"] = (abs(df2["z_score"]) > 3).astype(int)
df2[df2["outlier"] == 1]
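To make the z-score rule concrete, here is a small sketch that computes the scores by hand instead of calling scipy. It assumes the population standard deviation (`ddof=0`), which is also scipy's default; the `z_scores` helper and the toy scores are hypothetical and not taken from the dataset.

```python
import numpy as np

def z_scores(values):
    """Standardize values using the population standard deviation (ddof=0)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Toy final scores (made up for illustration): 19 typical scores and one zero
scores = [10, 11, 12, 13, 14] * 3 + [11, 12, 13, 12, 0]

z = z_scores(scores)
# Positions of scores more than 3 standard deviations from the mean
flagged = np.flatnonzero(np.abs(z) > 3)
print(flagged, z[flagged])
```

Only the zero score is flagged here. One caveat with this rule: with the population standard deviation, a sample of ten or fewer values can never contain a point with an absolute z-score above 3, so the threshold is only meaningful for reasonably large groups.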

Isolation Forests
Another way to check for outliers is the isolation forest. Isolation forests rely on the fact that anomalous points are few and different from the rest, which makes them easier to isolate with random splits. I have already performed the cleaning step, in which I encoded the variables.
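The encoding step itself is not shown in this article. As a rough idea of what it could look like, the sketch below one-hot encodes text columns with `pd.get_dummies`. This is an assumption about one reasonable approach, not the encoding actually used here; the toy rows borrow three column names from the dataset, but the values are made up.

```python
import pandas as pd

# Toy rows using three of the dataset's column names; the values are made up
toy = pd.DataFrame({
    "school": ["GP", "MS", "GP", "MS"],
    "famsup": ["yes", "no", "no", "yes"],
    "G3": [12, 9, 15, 11],
})

# One-hot encode the text columns; drop_first keeps one column per binary variable
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded)
```

After a step like this every column is numeric, which is what the isolation forest needs as input.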

We first construct the isolation forest, setting the contamination parameter to 0.09 (9%). Contamination is the proportion of the data we expect to be anomalous.

from sklearn.ensemble import IsolationForest

cols = df.columns[:]  # use every (already encoded) column as a feature
clf = IsolationForest(n_estimators=100, max_samples='auto', contamination=0.09,
                      max_features=1.0, bootstrap=False, n_jobs=-1, random_state=42, verbose=0)
clf.fit(df[cols])
# predict returns -1 for anomalies and 1 for normal points
pred = clf.predict(df[cols])
df['anomaly'] = pred
outliers = df.loc[df['anomaly'] == -1]
outlier_index = list(outliers.index)
print(df['anomaly'].value_counts())

Our model has flagged 59 points as anomalies. Projecting the data onto its principal components with PCA lets us visualize where these points lie.
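One way such a PCA view might be produced is sketched below. It is only an illustration under stated assumptions: the data is synthetic (random numbers standing in for the encoded student table), the forest is refit on that synthetic data, and `plot_projection` is a hypothetical helper, not code from this article.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the encoded student data: 200 rows, 5 numeric features
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))

# Flag roughly 9% of the rows, mirroring the contamination value used above
labels = IsolationForest(contamination=0.09, random_state=42).fit_predict(X)

# Project the features onto the first two principal components for plotting
coords = PCA(n_components=2).fit_transform(X)

def plot_projection(coords, labels):
    """Scatter the 2-D projection, drawing flagged points (label -1) in red."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting
    is_anomaly = labels == -1
    plt.scatter(coords[~is_anomaly, 0], coords[~is_anomaly, 1], c="green", label="normal")
    plt.scatter(coords[is_anomaly, 0], coords[is_anomaly, 1], c="red", label="anomaly")
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.legend()
    plt.show()

print((labels == -1).sum(), "of", len(labels), "rows flagged")
```

Calling `plot_projection(coords, labels)` draws the scatter plot. With the real data, `X` would be the encoded feature columns and `labels` the predictions stored in the `anomaly` column.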

To recap, we have looked at different ways to analyze outliers in a dataset: boxplots, z-scores, and finally the isolation forest.
