Naïve Bayes Classifier
Quiz: Probability Basics
• We have two six-sided dice. When they are rolled, consider the following events: (A) die 1 lands on side “3”, (B) die 2 lands on side “1”, and (C) the two dice sum to eight. Answer the following questions:
1) P(A) = ?
2) P(B) = ?
3) P(C) = ?
4) P(A|B) = ?
5) P(C|A) = ?
6) P(A,B) = ?
7) P(A,C) = ?
8) Is P(A,C) equal to P(A)P(C)?








Probabilistic Classification
• Establishing a probabilistic model for classification
– Discriminative model
$P(C|\mathbf{X})$,  $C = c_1, \ldots, c_L$,  $\mathbf{X} = (X_1, \ldots, X_n)$
[Figure: a discriminative probabilistic classifier takes the input vector $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ and outputs the posteriors $P(c_1|\mathbf{x}), P(c_2|\mathbf{x}), \ldots, P(c_L|\mathbf{x})$.]
What is a discriminative probabilistic classifier?
Example
•C1 – benign mole
•C2 – cancer
Probabilistic Classification
Vectors of random variables
• Establishing a probabilistic model for classification
(cont.)
– Generative model
$P(\mathbf{X}|C)$,  $C = c_1, \ldots, c_L$,  $\mathbf{X} = (X_1, \ldots, X_n)$
[Figure: one generative probabilistic model per class; the model for class $c_i$ takes the input vector $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ and outputs the likelihood $P(\mathbf{x}|c_i)$, e.g. the probability that this fruit is an apple versus the probability that this fruit is an orange.]
Background: Methods to create classifiers
• There are three methods to establish a classifier
a) Model a classification rule directly
Examples: k-NN, decision trees, perceptron, SVM
b) Model the probability of class memberships given input data
Example: Perceptron with the cross-entropy cost
c) Make a probabilistic model of data within each class
Examples: Naive Bayes, model-based classifiers
• a) and b) are examples of discriminative classification
• c) is an example of generative classification
• b) and c) are both examples of probabilistic classification
GOOD NEWS: You can create your own hardware/software classifiers!
Probability Basics
• We defined prior, conditional and joint probability for random variables
– Prior probability: $P(X)$
– Conditional probability: $P(X_1|X_2)$, $P(X_2|X_1)$
– Joint probability: $\mathbf{X} = (X_1, X_2)$, $P(\mathbf{X}) = P(X_1, X_2)$
– Relationship: $P(X_1, X_2) = P(X_2|X_1)P(X_1) = P(X_1|X_2)P(X_2)$
– Independence: $P(X_2|X_1) = P(X_2)$, $P(X_1|X_2) = P(X_1)$, $P(X_1, X_2) = P(X_1)P(X_2)$
• Bayesian Rule
$$\underbrace{P(C|\mathbf{X})}_{\text{Posterior}} = \frac{\overbrace{P(\mathbf{X}|C)}^{\text{Likelihood}}\;\overbrace{P(C)}^{\text{Prior}}}{\underbrace{P(\mathbf{X})}_{\text{Evidence}}}$$
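A quick worked check (added here, not on the original slide) applies the Bayesian rule to the dice events from the quiz, where $P(A) = 1/6$, $P(C) = 5/36$ (the five outcomes (2,6), (3,5), (4,4), (5,3), (6,2)), and $P(C|A) = 1/6$ (given die 1 shows 3, die 2 must show 5):
$$P(A|C) = \frac{P(C|A)\,P(A)}{P(C)} = \frac{(1/6)(1/6)}{5/36} = \frac{1}{5}$$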


Method: Probabilistic Classification with MAP
• MAP classification rule
– MAP: Maximum A Posteriori
– Assign $\mathbf{x}$ to $c^*$ if
$$P(C = c^*|\mathbf{X} = \mathbf{x}) > P(C = c|\mathbf{X} = \mathbf{x}), \quad c \neq c^*, \; c = c_1, \ldots, c_L$$
• Generative classification with the MAP rule
1. Apply the Bayesian rule to convert the priors and likelihoods into posterior probabilities:
$$P(C = c_i|\mathbf{X} = \mathbf{x}) = \frac{P(\mathbf{X} = \mathbf{x}|C = c_i)\,P(C = c_i)}{P(\mathbf{X} = \mathbf{x})} \propto P(\mathbf{X} = \mathbf{x}|C = c_i)\,P(C = c_i), \quad i = 1, 2, \ldots, L$$
2. Then apply the MAP rule.
We use this rule in many applications.
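As an illustration (added; not part of the original slides), here is a minimal R sketch of the MAP decision, assuming we already have a vector of class priors and a vector of likelihoods for one observation x; the function name map_classify and the example numbers are made up.

# MAP decision sketch: pick the class with the largest unnormalized posterior.
map_classify <- function(likelihood, prior) {
  unnormalized_posterior <- likelihood * prior   # proportional to P(C = c_i | x)
  names(which.max(unnormalized_posterior))       # class label with the largest posterior
}

# Hypothetical numbers for the benign-mole vs. cancer example:
prior      <- c(benign = 0.9, cancer = 0.1)
likelihood <- c(benign = 0.02, cancer = 0.40)    # P(x | class), made up for illustration
map_classify(likelihood, prior)                  # "cancer": 0.40*0.1 > 0.02*0.9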
Naïve Bayes
• Bayes classification
$$P(C|\mathbf{X}) \propto P(\mathbf{X}|C)\,P(C) = P(X_1, \ldots, X_n|C)\,P(C)$$
Difficulty: learning the joint probability $P(X_1, \ldots, X_n|C)$.
• Naïve Bayes classification
– Assumption that all input attributes are conditionally independent given the class:
$$P(X_1, X_2, \ldots, X_n|C) = P(X_1|X_2, \ldots, X_n, C)\,P(X_2, \ldots, X_n|C) = P(X_1|C)\,P(X_2, \ldots, X_n|C) = P(X_1|C)\,P(X_2|C) \cdots P(X_n|C)$$
For each class, the previous generative model thus decomposes into $n$ generative models of a single input each: a product of individual probabilities.
– MAP classification rule: for $\mathbf{x} = (x_1, x_2, \ldots, x_n)$, assign $c^*$ if
$$[P(x_1|c^*) \cdots P(x_n|c^*)]\,P(c^*) > [P(x_1|c) \cdots P(x_n|c)]\,P(c), \quad c \neq c^*, \; c = c_1, \ldots, c_L$$
Naïve Bayes Algorithm
• The Naïve Bayes algorithm (for discrete input attributes) has two phases
– 1. Learning Phase: Given a training set S,
For each target value $c_i$ ($c_i = c_1, \ldots, c_L$):
$\hat{P}(C = c_i) \leftarrow$ estimate $P(C = c_i)$ with examples in S;
For every attribute value $x_{jk}$ of each attribute $X_j$ ($j = 1, \ldots, n$; $k = 1, \ldots, N_j$):
$\hat{P}(X_j = x_{jk}|C = c_i) \leftarrow$ estimate $P(X_j = x_{jk}|C = c_i)$ with examples in S;
Output: conditional probability tables; for $X_j$, $N_j \times L$ elements.
Learning is easy: just create probability tables.
– 2. Test Phase: Given an unknown instance $\mathbf{X}' = (a_1', \ldots, a_n')$,
look up the tables to assign the label $c^*$ to $\mathbf{X}'$ if
$$[\hat{P}(a_1'|c^*) \cdots \hat{P}(a_n'|c^*)]\,\hat{P}(c^*) > [\hat{P}(a_1'|c) \cdots \hat{P}(a_n'|c)]\,\hat{P}(c), \quad c \neq c^*, \; c = c_1, \ldots, c_L$$
Classification is easy: just multiply probabilities.
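A minimal R sketch of the learning phase (added; not part of the original slides), assuming a data frame train_df whose columns are factors and whose class column is named "Class" (both names are hypothetical):

# Build the prior and the conditional probability tables from discrete data.
learn_nb <- function(train_df, class_col = "Class") {
  y     <- train_df[[class_col]]
  prior <- prop.table(table(y))                        # P-hat(C = c_i)
  attrs <- setdiff(names(train_df), class_col)
  # one table per attribute: rows = attribute values, columns = classes
  cond  <- lapply(train_df[attrs],
                  function(x) prop.table(table(x, y), margin = 2))
  list(prior = prior, cond = cond)
}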
Tennis Example
• Example: Play Tennis
The learning phase for the tennis example
Outlook Play=Yes Play=No
Sunny 2/9 3/5
Overcast 4/9 0/5
Rain 3/9 2/5
Temperature Play=Yes Play=No
Hot 2/9 2/5
Mild 4/9 2/5
Cool 3/9 1/5
Humidity Play=Yes Play=No
High 3/9 4/5
Normal 6/9 1/5
Wind Play=Yes Play=No
Strong 3/9 3/5
Weak 6/9 2/5
P(Play=Yes) = 9/14
P(Play=No) = 5/14
We have four variables; for each one we calculate the conditional probability table.
Formulation of a Classification Problem
•Given the data as found in the last slide:
•Find, for a new point in the space (a vector of attribute values), to which group it belongs (classify).
The test phase for the tennis example
• Test Phase
– Given a new instance of variable values,
x’=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Given the calculated look-up tables
– Use the MAP rule to decide Yes or No
P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=No) = 3/5
P(Play=No) = 5/14
P(Outlook=Sunny|Play=Yes) = 2/9
P(Temperature=Cool|Play=Yes) = 3/9
P(Humidity=High|Play=Yes) = 3/9
P(Wind=Strong|Play=Yes) = 3/9
P(Play=Yes) = 9/14
P(Yes|x’): [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)]P(Play=Yes) = 0.0053
P(No|x’): [P(Sunny|No) P(Cool|No)P(High|No)P(Strong|No)]P(Play=No) = 0.0206
Since P(Yes|x’) < P(No|x’), we label x’ as “No”.
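The same arithmetic can be checked in a few lines of R (added; this sketch simply hard-codes the table entries from the slides above):

# Unnormalized posteriors for x' = (Sunny, Cool, High, Strong).
p_yes <- (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # P(x'|Yes) * P(Yes)
p_no  <- (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # P(x'|No)  * P(No)
round(c(Yes = p_yes, No = p_no), 4)               # Yes 0.0053, No 0.0206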
Issues Relevant to Naïve Bayes
1. Violation of Independence Assumption
2. Zero conditional probability problem
Issues Relevant to Naïve Bayes
1. Violation of Independence Assumption
• For many real-world tasks, the attributes are correlated, so
$$P(X_1, \ldots, X_n|C) \neq P(X_1|C) \cdots P(X_n|C)$$
• Nevertheless, naïve Bayes works surprisingly well anyway!
Issues Relevant to Naïve Bayes
2. Zero conditional probability problem
– The problem arises when no training example contains the attribute value $X_j = a_{jk}$ for some class, so that $\hat{P}(X_j = a_{jk}|C = c_i) = 0$.
– In this circumstance, during test the whole product collapses:
$$\hat{P}(x_1|c_i) \cdots \hat{P}(a_{jk}|c_i) \cdots \hat{P}(x_n|c_i) = 0$$
– As a remedy, conditional probabilities are estimated with the m-estimate:
$$\hat{P}(X_j = a_{jk}|C = c_i) = \frac{n_c + mp}{n + m}$$
where
$n_c$: number of training examples for which $X_j = a_{jk}$ and $C = c_i$
$n$: number of training examples for which $C = c_i$
$p$: prior estimate (usually $p = 1/t$ for $t$ possible values of $X_j$)
$m$: weight given to the prior (number of "virtual" examples, $m \geq 1$)
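A small R sketch of this remedy (added; the function name m_estimate and the example counts are illustrative):

# m-estimate of a smoothed conditional probability.
#   n_c: examples with X_j = a_jk and C = c_i;  n: examples with C = c_i
#   p:   prior estimate (usually 1 / number of possible values of X_j)
#   m:   weight given to the prior ("virtual" examples)
m_estimate <- function(n_c, n, p, m = 1) {
  (n_c + m * p) / (n + m)
}

# Example: Outlook = Overcast given Play = No was 0/5 in the tennis tables.
m_estimate(n_c = 0, n = 5, p = 1/3, m = 3)   # 0.125, no longer zero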
Another Problem: Continuous-valued Input Attributes
• What to do in such a case?
– An attribute can take innumerable (continuous) values
– The conditional probability is then modeled with the normal distribution:
$$\hat{P}(X_j|C = c_i) = \frac{1}{\sqrt{2\pi}\,\sigma_{ji}} \exp\!\left(-\frac{(X_j - \mu_{ji})^2}{2\sigma_{ji}^2}\right)$$
$\mu_{ji}$: mean (average) of the attribute values $X_j$ of the examples for which $C = c_i$
$\sigma_{ji}$: standard deviation of the attribute values $X_j$ of the examples for which $C = c_i$
– Learning Phase: for $\mathbf{X} = (X_1, \ldots, X_n)$, $C = c_1, \ldots, c_L$
Output: $n \times L$ normal distributions and $P(C = c_i)$, $i = 1, \ldots, L$
– Test Phase: for $\mathbf{X}' = (X_1', \ldots, X_n')$
1. Calculate conditional probabilities with all the normal distributions
2. Apply the MAP rule to make a decision
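A minimal R sketch of this Gaussian estimate (added; fit_gaussian, cond_density and x_in_class are hypothetical names):

# Fit the class-conditional normal distribution of one continuous attribute.
# x_in_class: numeric vector of that attribute's values within one class c_i.
fit_gaussian <- function(x_in_class) {
  list(mu = mean(x_in_class), sigma = sd(x_in_class))
}
# Class-conditional density of a new value, i.e. P-hat(X_j = x_new | C = c_i).
cond_density <- function(x_new, fit) {
  dnorm(x_new, mean = fit$mu, sd = fit$sigma)
}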




Naïve Bayes Training
• Now that we’ve decided to use a Naïve Bayes classifier, we need to
train it with some data:
Naïve Bayes Training
• Training in Naïve Bayes is easy:
• Estimate P(Y=v) as the fraction of records with Y=v
• Estimate P(Xi=u|Y=v) as the fraction of records with Y=v for which
Xi=u
Naïve Bayes Training
• In practice, some of these counts can be zero
• Fix this by adding “virtual” counts, as in the m-estimate above
• This is called Smoothing
Naïve Bayes Training
• For binary digits, training amounts to averaging all of the training
fives together and all of the training sixes together.
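A hedged R sketch of that averaging step (added; the objects pixels, a 0/1 matrix with one row per training image, and digit, the label vector, are hypothetical):

# Per-class pixel averages, i.e. P-hat(pixel = 1 | class), for binary images.
avg_five <- colMeans(pixels[digit == 5, , drop = FALSE])
avg_six  <- colMeans(pixels[digit == 6, , drop = FALSE])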
Conclusion on classifiers
• Naïve Bayes is based on the independence assumption
• Training is very easy and fast; it only requires considering each attribute in each class separately
• Testing is straightforward; it only requires looking up tables or calculating conditional probabilities with normal distributions
• Naïve Bayes is a popular generative classifier model
1. The performance of naïve Bayes is competitive with most state-of-the-art classifiers, even when the independence assumption is violated
2. It has many successful applications, e.g., spam mail filtering
3. It is a good candidate as a base learner in ensemble learning
4. Apart from classification, naïve Bayes can do more…
https://www.datasciencecentral.com/profiles/blogs/naive-bayes-in-one-picture-1
R Example (http://uc-r.github.io/naive_bayes)
• To illustrate the naïve Bayes classifier we will use the attrition data
that has been included in the rsample package. The goal is to predict
employee attrition.
library(rsample) # data splitting
library(dplyr) # data transformation
library(ggplot2) # data visualization
library(caret) # implementing with caret
library(h2o) # implementing with h2o
# convert some numeric variables to factors
attrition <- attrition %>%
mutate(
JobLevel = factor(JobLevel),
StockOptionLevel = factor(StockOptionLevel),
TrainingTimesLastYear = factor(TrainingTimesLastYear)
)
# Create training (70%) and test (30%) sets for the attrition data.
# Use set.seed for reproducibility
set.seed(123)
split <- initial_split(attrition, prop = .7, strata = "Attrition")
train <- training(split)
test <- testing(split)
# distribution of Attrition rates across train & test set
table(train$Attrition) %>% prop.table()
No Yes
0.838835 0.161165
table(test$Attrition) %>% prop.table()
No Yes
0.8386364 0.1613636
• With naïve Bayes, we assume that the predictor variables are
conditionally independent of one another given the response value.
This is an extremely strong assumption. We can see quickly that our
attrition data violates this as we have several moderately to strongly
correlated variables.
library(corrplot)
train %>%
filter(Attrition == "Yes") %>%
select_if(is.numeric) %>%
cor() %>%
corrplot::corrplot()
• However, by making this assumption we can simplify our calculation such that
the posterior probability is simply the product of the probability distribution for
each individual variable conditioned on the response category.
• For categorical variables, this computation is quite simple as you just use the
frequencies from the data. However, when including continuous predictor
variables often an assumption of normality is made so that we can use the
probability from the variable’s probability density function. If we pick a handful of our numeric features, we quickly see that the assumption of normality is not always fair.
train %>%
select(Age, DailyRate, DistanceFromHome, HourlyRate, MonthlyIncome,
MonthlyRate) %>%
gather(metric, value) %>%
ggplot(aes(value, fill = metric)) +
geom_density(show.legend = FALSE) +
facet_wrap(~ metric, scales = "free")
• Some numeric features may be normalized with a Box-Cox transformation;
however, we can also use non-parametric kernel density estimators to try to get a more accurate representation of continuous variable probabilities.
Ultimately, transforming the distributions and selecting an estimator is
part of the modeling development and tuning process.
• Laplace Smoother
• We run into a serious problem when new data includes a feature value
that never occurs for one or more levels of a response class. What results
is P(xi|Ck)=0 for this individual feature and this zero will ripple through the
entire multiplication of all features and will always force the posterior
probability to be zero for that class.
• A solution to this problem involves using the Laplace smoother. The
Laplace smoother adds a small number to each of the counts in the
frequencies for each feature, which ensures that each feature has a
nonzero probability of occurring for each class. Typically, a value of one to
two for the Laplace smoother is sufficient, but this is a tuning parameter to
incorporate and optimize with cross validation.
• First, we apply a naïve Bayes model with 10-fold cross validation, which gets 83%
accuracy. Considering about 83% of our observations in our training set do not attrit, our
overall accuracy is no better than if we just predicted “No” attrition for every observation.
• There are several packages to apply naïve Bayes (e.g., e1071, klaR, naivebayes, bnclassify).
library(klaR)
# create response and feature data
features <- setdiff(names(train), "Attrition")
x <- train[, features]
y <- train$Attrition
# set up 10-fold cross validation procedure
train_control <- trainControl(
method = "cv",
number = 10
)
# train model
nb.m1 <- train(
x = x,
y = y,
method = "nb",
trControl = train_control
)
# results
confusionMatrix(nb.m1)
Cross-Validated (10 fold) Confusion Matrix
(entries are percentual average cell counts across resamples)
Reference
Prediction No Yes
No 75.2 8.1
Yes 8.6 8.1
Accuracy (average) : 0.833
• We can tune the few hyperparameters that a naïve Bayes model has.
• The usekernel parameter allows us to use a kernel density estimate for continuous variables versus a Gaussian density estimate,
• adjust allows us to adjust the bandwidth of the kernel density (larger
numbers mean more flexible density estimate),
• fL allows us to incorporate the Laplace smoother.
• If we just tuned our model with the above parameters we are able to
lift our accuracy to 85%; however, by incorporating some
preprocessing of our features (normalize with Box Cox, standardize
with center-scaling, and reducing with PCA) we actually get about
another 2% lift in our accuracy.
# set up tuning grid
search_grid <- expand.grid(
usekernel = c(TRUE, FALSE),
fL = 0:5,
adjust = seq(0, 5, by = 1)
)
# train model
nb.m2 <- train(
x = x,
y = y,
method = "nb",
trControl = train_control,
tuneGrid = search_grid,
preProc = c("BoxCox", "center", "scale", "pca")
)
# top 5 models
nb.m2$results %>%
top_n(5, wt = Accuracy) %>%
arrange(desc(Accuracy))
usekernel fL adjust Accuracy Kappa AccuracySD KappaSD
1 TRUE 1 3 0.8738671 0.4249084 0.02902693 0.1516638
2 TRUE 2 3 0.8690688 0.4555071 0.03590654 0.1593024
3 TRUE 3 4 0.8690408 0.4404052 0.03069306 0.1545643
4 TRUE 0 2 0.8680792 0.4177358 0.03513233 0.1592905
5 TRUE 3 5 0.8680133 0.3697180 0.02361187 0.1284321
# plot search grid results
plot(nb.m2)
# results for best model
confusionMatrix(nb.m2)
Cross-Validated (10 fold) Confusion Matrix
(entries are percentual average cell counts across resamples)
Reference
Prediction No Yes
No 81.2 9.9
Yes 2.7 6.2
Accuracy (average) : 0.8738
We can assess the accuracy on our final holdout test set.
pred <- predict(nb.m2, newdata = test)
confusionMatrix(pred, test$Attrition)
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 349 41
Yes 20 30
Accuracy : 0.8614
95% CI : (0.8255, 0.8923)
No Information Rate : 0.8386
P-Value [Acc > NIR] : 0.10756
Kappa : 0.4183
Mcnemar's Test P-Value : 0.01045
Sensitivity : 0.9458
Specificity : 0.4225
Pos Pred Value : 0.8949
Neg Pred Value : 0.6000
Prevalence : 0.8386
Detection Rate : 0.7932
Detection Prevalence : 0.8864
Balanced Accuracy : 0.6842
'Positive' Class : No
Some Other Sources
• https://web.stanford.edu/class/cs124/lec/naivebayes.pptx
• https://cse.sc.edu/~rose/587/PPT/NaiveBayes.ppt
• http://studentnet.cs.manchester.ac.uk/ugt/COMP24111/materials/
slides/Naive-Bayes.ppt
• http://www.cs.kent.edu/~jin/DM11/Naive-Bayes.ppt
