Cross-project Defect Prediction Using a Connectivity-based Unsupervised Classifier
Feng Zhang, Quan Zheng, Ying Zou, Ahmed E. Hassan
Defect prediction
Training: past data to build the model
Target: new data
Within-project defect prediction
Target
Past data to build the model
Historical data may not be available
Other projects as training data
Target
Cross-project defect prediction
[Diagram: a supervised classifier is trained on the software metrics and defect data of the training project, then applied to the software metrics of the target project to predict defect proneness]
Heterogeneity across projects
(ICSM 2013)
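A toy illustration of why heterogeneity can hurt a cross-project model. It assumes heterogeneity means that metric distributions differ between projects; the one-metric threshold "classifier", the function names, and all numbers are hypothetical and only stand in for a real supervised classifier.

```python
from statistics import mean

def train_threshold(metric_values, is_defective):
    """Learn a cut-off halfway between the class means of one metric."""
    defective = [m for m, d in zip(metric_values, is_defective) if d]
    clean = [m for m, d in zip(metric_values, is_defective) if not d]
    return (mean(defective) + mean(clean)) / 2.0

def predict(metric_values, threshold):
    """Flag every entity whose metric exceeds the learned cut-off."""
    return [m > threshold for m in metric_values]

# Hypothetical training project: small files, defects in the larger ones.
train_loc = [20, 40, 60, 150, 180, 200]
train_defective = [False, False, False, True, True, True]
threshold = train_threshold(train_loc, train_defective)

# Hypothetical target project: every file is far larger than in training.
target_loc = [300, 400, 500, 1500, 1800, 2000]
target_predictions = predict(target_loc, threshold)
```

In this sketch the threshold separates the training project perfectly, yet it flags every file of the target project as defective, because the target's metric scale is shifted.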
Our Previous Solution
(MSR 2014)
How About Using Unsupervised Classifiers?
[Diagram: the same pipeline with an unsupervised classifier in place of the supervised one]
Heterogeneity
Initial attempts using K-means were not very successful.
Short distance
Long distance
Connections
Social network
Far away in distance but may be connected!
Connection is more important than distance.
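A small sketch of the intuition that two points can be far apart yet connected. The neighbourhood graph, the `radius` parameter, and the point coordinates are assumptions made for illustration, not something taken from the talk.

```python
from collections import deque
from math import dist

def connected(points, start, goal, radius):
    """Breadth-first search over the graph linking points within `radius`."""
    seen = {start}
    queue = deque([start])
    while queue:
        i = queue.popleft()
        if i == goal:
            return True
        for j in range(len(points)):
            if j not in seen and dist(points[i], points[j]) <= radius:
                seen.add(j)
                queue.append(j)
    return False

# Ten points in a row, one unit apart: the two ends are far from each other...
chain = [(float(x), 0.0) for x in range(10)]
end_to_end = dist(chain[0], chain[-1])

# ...but each point is close to its neighbour, so the ends are connected.
ends_linked = connected(chain, 0, 9, radius=1.5)

# A point with no close neighbour is not connected to anything.
with_outlier = chain + [(4.5, 50.0)]
outlier_linked = connected(with_outlier, 0, 10, radius=1.5)
```

A purely distance-based method would treat the two ends of the chain as dissimilar, while a connectivity view places them in the same group.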
Are defective software entities
connected to each other?
Within-community and cross-community connections
Within-community connections: stronger
Cross-community connections: weaker
Defective entities tend to connect to other defective entities.
Our connectivity-based
unsupervised approach
Consider each entity (file/class) as a node
Step 1. Compute software metrics
Step 2. Build a graph based on the similarity
Step 3. Make a bipartition on the graph
Step 4. Label the defective cluster
Defective Clean
17 lines of R code are provided in the paper
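The four steps can be sketched in a few lines. This is not the paper's R code. It is a minimal Python sketch under assumptions the slides do not state: metrics are z-scored, similarity is a Gaussian kernel with a hypothetical width `sigma`, the bipartition uses the sign of the second-smallest eigenvector of the normalized graph Laplacian (standard spectral clustering), and the cluster with the larger average metric values is labelled defective. The function name `spectral_defect_labels` is illustrative.

```python
import numpy as np

def spectral_defect_labels(metrics, sigma=1.0):
    """Return a boolean array: True where an entity is labelled defective."""
    X = np.asarray(metrics, dtype=float)

    # Step 1 (assumed preprocessing): z-score each metric column.
    std = X.std(axis=0)
    std[std == 0] = 1.0              # leave constant metrics unscaled
    Z = (X - X.mean(axis=0)) / std

    # Step 2: similarity graph over entities (Gaussian kernel is an assumption).
    sq_dist = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)         # no self-loops

    # Step 3: bipartition by the sign of the second-smallest eigenvector
    # of the symmetric normalized Laplacian I - D^(-1/2) W D^(-1/2).
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    laplacian = np.eye(len(Z)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(laplacian)   # eigenvalues in ascending order
    side = eigvecs[:, 1] * d_inv_sqrt > 0

    # Step 4 (assumed heuristic): the cluster with larger metric values
    # is the defective one.
    if side.all() or not side.any():
        return np.zeros(len(Z), dtype=bool)  # degenerate split: flag nothing
    row_sum = Z.sum(axis=1)
    if row_sum[side].mean() < row_sum[~side].mean():
        side = ~side
    return side

# Hypothetical entities described by (lines of code, complexity).
small = [[10, 1], [12, 2], [11, 1], [13, 2]]
large = [[200, 20], [220, 22], [210, 21], [230, 23]]
labels = spectral_defect_labels(small + large)
```

No defect data is used anywhere in the sketch: the labels come only from the metrics of the entities being classified, which is what makes the approach applicable when the target project has no history.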
Looks simple? Does it really work?
Research questions
RQ1. How does the spectral clustering based
classifier perform in cross-project defect
prediction?
RQ2. Does the spectral clustering based
classifier perform well in within-project
defect prediction?
Subject projects (Total: 26)
AEEEM (5 projects): Equinox, JDT, Lucene, Mylyn, PDE
NASA (11 projects): CM1, JM1, KC3, MC1, MC2, MW1, PC1, PC2, PC3, PC4, PC5
PROMISE (10 projects): Ant, Camel, Ivy, Jedit, Log4j, Lucene, POI, Tomcat, Xalan, Xerces
Classifiers for comparison (Total: 9)
Unsupervised
1. K-means clustering (KM)
2. Partition around medoids (PAM)
3. Fuzzy C-means (FCM)
4. Neural-gas (NG)
Supervised
1. Random forest (RF)
2. Naïve Bayes (NB)
3. Logistic regression (LR)
4. Decision tree (DT)
5. Logistic model tree (LMT)
RQ1. How does the spectral clustering
based classifier perform in cross-project
defect prediction?
NASA, AEEEM, PROMISE: average AUC on each dataset
Rank classifiers (Scott-Knott Test)
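For readers unfamiliar with the measure, AUC can be computed directly from its pairwise definition: the probability that a randomly chosen defective entity receives a higher score than a randomly chosen clean one, with ties counted as half. The sketch below is a plain-Python illustration; the function name and the example scores are hypothetical, and it does not cover the Scott-Knott ranking step.

```python
from statistics import mean

def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one defective and one clean entity")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predictions on two target projects.
auc_a = auc([0.9, 0.8, 0.3, 0.1], [True, True, False, False])
auc_b = auc([0.9, 0.4, 0.6, 0.1], [True, True, False, False])

# Average AUC of one classifier over the projects of a dataset.
average_auc = mean([auc_a, auc_b])
```

An AUC of 0.5 corresponds to random ordering and 1.0 to a perfect ordering of defective entities ahead of clean ones.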
RQ1. Results (cross-project)
[Figure: classifiers grouped into Rank 1 to Rank 4; red text marks unsupervised classifiers, blue text marks supervised classifiers]
Our approach can compete with the
supervised classifiers under study,
and is sometimes even better.
RQ2. Does the spectral clustering based
classifier perform well in within-project
defect prediction?
[Diagram: each project is split 50% / 50%; one half is used for training and the other for testing to obtain an AUC, then the two halves swap roles to obtain a second AUC]
(500 random splits, thus 1,000 evaluations)
Rank classifiers (Scott-Knott Test)
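The evaluation protocol can be sketched as a generator of train/test index pairs. This reading assumes each random split contributes two evaluations because the halves swap roles, which matches the stated 500 splits and 1,000 evaluations; the function name, the seed, and the use of index lists are illustrative choices.

```python
import random

def two_fold_splits(n_entities, n_repeats=500, seed=0):
    """Yield (train_indices, test_indices) pairs for repeated 50/50 splits."""
    rng = random.Random(seed)
    indices = list(range(n_entities))
    half = n_entities // 2
    for _ in range(n_repeats):
        rng.shuffle(indices)
        first, second = indices[:half], indices[half:]
        yield first, second   # train on the first half, test on the second
        yield second, first   # swap the roles of the two halves

evaluations = list(two_fold_splits(n_entities=10))
```

Each pair would then be passed to a classifier and scored with AUC on the test half; the resulting 1,000 AUC values per classifier are what the ranking step compares.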
RQ2. Results (within-project)
Rank 1 (Gold): Random forest
Rank 2 (Silver): Logistic regression, Spectral clustering, Logistic model tree, Naïve Bayes
Rank 3 (Bronze): Fuzzy C-means
Our approach can achieve performance
similar to that of the supervised classifiers,
except random forest.
Summary
Feng Zhang
(feng@cs.queensu.ca) (http://www.feng-zhang.com)
