Classifier Accuracy Measures
Evaluating Classification Algorithms
Utkarsh Sharma
Asst. Prof. CSE Dept.
JUET(M.P.) India
Contents
• Why do we need evaluation
• Metrics for evaluation
• Confusion Matrix
Why Evaluation?
• How to evaluate the performance of a model?
• How to obtain reliable estimates?
• How to compare the relative performance among competing models?
• Given two equally performing models,
which one should we prefer?
Need
• Evaluating the quality of our machine learning model is extremely
important for continuing to improve the model until it performs as well
as it can.
• For classification problems, evaluation metrics compare the
expected class label to the predicted class label or interpret the
predicted probabilities for the class labels.
• For example, suppose you used data from previous sales to train a
classifier to predict customer purchasing behavior. You would like an
estimate of how accurately the classifier can predict the purchasing
behavior of future customers, that is, future customer data on which
the classifier has not been trained.
Metrics for Performance Evaluation
1. Confusion Matrix
2. Precision
3. Recall
4. Accuracy
5. Specificity
6. F1-Score
Confusion Matrix
• The confusion matrix is a useful tool for analyzing how well your classifier
can recognize tuples of different classes.
• For binary classification, the confusion matrix is a 2 x 2 table covering the
four combinations of predicted and actual values.
• Given m classes, a confusion matrix is a table of at least size m by m. An
entry CM(i, j) in the first m rows and m columns indicates the number of
tuples of class i that were labeled by the classifier as class j.
• For a classifier to have good accuracy, ideally most of the tuples would be
represented along the diagonal of the confusion matrix, from entry CM(1, 1)
to entry CM(m, m), with the rest of the entries being close to zero.
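• A minimal Python sketch of this counting rule (the labels below are hypothetical and only illustrate how each entry CM(i, j) is tallied):

    # Build an m-by-m confusion matrix: cm[i][j] counts tuples of
    # actual class i that the classifier labeled as class j.
    actual    = [0, 0, 1, 1, 2, 2, 2]   # hypothetical true labels
    predicted = [0, 1, 1, 1, 2, 0, 2]   # hypothetical classifier output
    m = 3                               # number of classes
    cm = [[0] * m for _ in range(m)]
    for a, p in zip(actual, predicted):
        cm[a][p] += 1
    for row in cm:
        print(row)   # diagonal entries are the correctly classified tuples

  Libraries such as scikit-learn produce the same table via sklearn.metrics.confusion_matrix(actual, predicted).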
Confusion Matrix Example (Binary Classification)
True Positives (TP): The number of times our
model predicted YES and the actual output
was also YES.
True Negatives (TN): The number of times
our model predicted NO and the actual
output was NO.
False Positives (FP): The number of times our
model predicted YES and the actual output
was NO. This is known as a Type 1 Error.
False Negatives (FN): The number of times
our model predicted NO and the actual
output was YES. This is known as a Type 2
Error.
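• These four counts can be tallied directly from paired actual/predicted labels. A minimal sketch, assuming YES is encoded as 1 and NO as 0 (the labels and encoding are illustrative, not from the slides):

    actual    = [1, 1, 0, 0, 1, 0]   # hypothetical ground truth (1 = YES, 0 = NO)
    predicted = [1, 0, 0, 1, 1, 0]   # hypothetical model output
    TP = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    TN = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
    FP = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))  # Type 1 error
    FN = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))  # Type 2 error
    print(TP, TN, FP, FN)   # -> 2 2 1 1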
Example: Reading a Confusion Matrix
TP = 100
TN = 50
FP = 10
FN = 5
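Placed in the usual binary layout, these counts read as:

                  Predicted YES   Predicted NO
    Actual YES    TP = 100        FN = 5
    Actual NO     FP = 10         TN = 50

So 150 of the 165 tuples fall on the diagonal (correct predictions) and 15 fall off it.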
Accuracy
• Accuracy answers the question: out of all the classifications, how many did
we classify correctly? This can be represented mathematically as:
      Accuracy = (Number of Correct Predictions) / (Total Number of Predictions)
• Using our confusion matrix terms, this equation is written as:
      Accuracy = (TP + TN) / (TP + TN + FP + FN)
• We want the accuracy score to be as high as possible. It is important to note that accuracy may
not always be the best metric to use, especially for a class-imbalanced data set, i.e., when the
data is not distributed equally across the classes.
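• Worked example with the counts from the earlier slide (TP = 100, TN = 50, FP = 10, FN = 5):
      Accuracy = (100 + 50) / (100 + 50 + 10 + 5) = 150 / 165 ≈ 0.91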
Sometimes Accuracy is not Enough
• Consider a 2-class problem
• Number of Class 0 examples = 9990
• Number of Class 1 examples = 10
• If the model predicts everything to be class 0, its accuracy is 9990/10000 = 99.9 %
• Accuracy is misleading here because the model does not detect a single class 1 example
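• A minimal sketch of this degenerate case, treating class 1 as the positive class; the point is that 99.9 % accuracy hides a 0 % detection rate for class 1:

    TP, TN, FP, FN = 0, 9990, 0, 10   # "always predict class 0" on the data above
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    class1_detection = TP / (TP + FN)   # recall for class 1, defined on a later slide
    print(accuracy, class1_detection)   # -> 0.999 0.0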
Precision
• Precision is the fraction of relevant instances among the retrieved
instances. It answers the question "What proportion of positive
identifications was actually correct?" The formula is as follows:
      Precision = (True Positives) / (Total Predicted Positives)
• In terms of our confusion matrix, the equation can be represented as:
      Precision = TP / (TP + FP)
• Precision expresses the proportion of the data points our model labeled as relevant that actually
were relevant.
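• Worked example with the counts from the earlier slide:
      Precision = 100 / (100 + 10) = 100 / 110 ≈ 0.91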
Recall
• Recall, also known as sensitivity, answers the question "What proportion
of actual positives was classified correctly?" It can be represented by the
following equation:
      Recall = (True Positives) / (Total Actual Positives)
• In our confusion matrix, it is represented by:
      Recall = TP / (TP + FN)
• Recall expresses how many of the relevant instances in the data set the model actually finds. It is
important to examine both precision and recall when evaluating a model because they often have
an inverse relationship: when precision increases, recall tends to decrease, and vice versa.
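• Worked example with the counts from the earlier slide:
      Recall = 100 / (100 + 5) = 100 / 105 ≈ 0.95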
Specificity
• Specificity (SP) is calculated as the number of correct negative
predictions divided by the total number of negatives. It is also called the
true negative rate (TNR). The best specificity is 1.0, the worst 0.0.
      Specificity = TN / (TN + FP)
• Specificity is the negative-class counterpart of recall: recall (sensitivity)
measures the true positive rate, while specificity measures the true negative rate.
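• Worked example with the counts from the earlier slide:
      Specificity = 50 / (50 + 10) = 50 / 60 ≈ 0.83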
F1-Score
• The F1 score is a function of precision and recall: their harmonic mean. It
is used to find the right balance between the two metrics and indicates how
many instances the model classifies correctly without missing a significant
number of instances. The score is given by the following equation:
      F1 = 2 × (Precision × Recall) / (Precision + Recall)
• An imbalance between precision and recall, such as high precision with low
recall, can give you a model that looks extremely accurate yet misclassifies
the difficult instances. We want the F1 score to be as high as possible for
the best performance of our model.
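• With the example counts from the earlier slide, Precision ≈ 0.91 and Recall ≈ 0.95, so F1 = 2 × (0.91 × 0.95) / (0.91 + 0.95) ≈ 0.93. A minimal Python sketch computing all five metrics from those counts (variable names are illustrative):

    TP, TN, FP, FN = 100, 50, 10, 5

    accuracy    = (TP + TN) / (TP + TN + FP + FN)                 # ≈ 0.91
    precision   = TP / (TP + FP)                                  # ≈ 0.91
    recall      = TP / (TP + FN)                                  # ≈ 0.95 (sensitivity)
    specificity = TN / (TN + FP)                                  # ≈ 0.83 (true negative rate)
    f1          = 2 * precision * recall / (precision + recall)   # ≈ 0.93

    print(f"Accuracy={accuracy:.2f}  Precision={precision:.2f}  Recall={recall:.2f}  "
          f"Specificity={specificity:.2f}  F1={f1:.2f}")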