Cross-project Defect Prediction Using A Connectivity-based Unsupervised Classifier

Feng Zhang, Quan Zheng, Ying Zou, Ahmed E. Hassan

Defect prediction
Training: past data to build the model
Target: new data

Within-project defect prediction: training data and target come from the same project

Historical data may not be available for the target

Cross-project defect prediction: other projects serve as training data for the target

[Diagram: Training project (software metrics, defect data) → Supervised classifier; Target project (software metrics) → Supervised classifier → Defect proneness]

Heterogeneity across projects (ICSM 2013)

Our Previous Solution (MSR 2014)
How About Using Unsupervised Classifiers?

[Diagram: unsupervised classifier workflow; labels: Training project, Software metrics, Defect data, Target project, Defect proneness, Heterogeneity]

Initial attempts using K-means were not very successful.

[Figure labels: Short distance; Long distance; Connections]

Far away in distance but may be connected!

Connection is more important
than distance.

Are defective software entities
connected to each other?

Within-community and cross-community connections

[Figure: connections labeled Stronger, Weaker, Stronger]

Defective entities tend to connect to other defective entities.

Our connectivity-based
unsupervised approach

Consider each entity (file/class) as a node

Step 1. Compute software metrics

Step 2. Build a graph based on the similarity

Step 3. Make a bipartition on the graph

Step 4. Label the defective cluster
Defective Clean

17 lines of R code are provided in the paper.
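The four steps above can be sketched in Python with NumPy. This is a minimal illustration, not the R code from the paper. The slides do not state the details, so the following are assumptions: metrics are z-scored per column, the similarity between two entities is the dot product of their normalized metric vectors with negative values and self-loops dropped, the bipartition uses the sign of the second-smallest eigenvector of the symmetric normalized Laplacian, and the cluster with the larger average row sum of normalized metrics is labeled defective. The function name `spectral_defect_labels` is hypothetical.

```python
import numpy as np


def spectral_defect_labels(metrics):
    """Label each entity (row of `metrics`) as defective (True) or clean (False)."""
    X = np.asarray(metrics, dtype=float)

    # Step 1: metrics are given as input; z-score each column (assumed preprocessing).
    std = X.std(axis=0)
    std[std == 0] = 1.0  # guard: a constant metric would otherwise divide by zero
    Z = (X - X.mean(axis=0)) / std

    # Step 2: similarity graph. Edge weight = dot product of normalized metrics
    # (assumption); negative similarities and self-loops are removed.
    W = Z @ Z.T
    W[W < 0] = 0.0
    np.fill_diagonal(W, 0.0)

    # Step 3: bipartition by the sign of the second-smallest eigenvector of the
    # symmetric normalized Laplacian I - D^(-1/2) W D^(-1/2).
    degree = W.sum(axis=1)
    safe_degree = np.where(degree > 0, degree, 1.0)  # guard for isolated nodes
    d_inv_sqrt = 1.0 / np.sqrt(safe_degree)
    laplacian = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    fiedler = d_inv_sqrt * eigvecs[:, 1]
    defective = fiedler > 0

    # Step 4: label as defective the cluster with the larger mean row sum of
    # normalized metrics (assumed heuristic: larger metric values, more defect-prone).
    if defective.any() and not defective.all():
        row_sums = Z.sum(axis=1)
        if row_sums[defective].mean() < row_sums[~defective].mean():
            defective = ~defective
    return defective


if __name__ == "__main__":
    # Hypothetical toy data: 4 entities, 2 metrics each.
    toy = [[10, 6], [6, 10], [2, 7], [2, 1]]
    print(spectral_defect_labels(toy))
```

No defect data is used anywhere in the function, which is what makes the approach unsupervised. If the similarity graph is disconnected, the second eigenvector is not unique and the split from this sketch should not be trusted.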

Looks simple? Does it really work?

Research questions
RQ1. How does the spectral clustering based
classifier perform in cross-project defect
prediction?
RQ2. Does the spectral clustering based
classifier perform well in within-project
defect prediction?

Subject projects (Total: 26)
AEEEM (5 projects): Equinox, JDT, Lucene, Mylyn, PDE
NASA (11 projects): CM1, JM1, KC3, MC1, MC2, MW1, PC1, PC2, PC3, PC4, PC5
PROMISE (10 projects): Ant, Camel, Ivy, Jedit, Log4j, Lucene, POI, Tomcat, Xalan, Xerces

Classifiers for comparison (Total: 9)

Unsupervised
1. K-means clustering (KM)
2. Partition around medoids (PAM)
3. Fuzzy C-means (FCM)
4. Neural-gas (NG)
Supervised
1. Random forest (RF)
2. Naïve Bayes (NB)
3. Logistic regression (LR)
4. Decision tree (DT)
5. Logistic model tree (LMT)

RQ1. How does the spectral clustering based classifier perform in cross-project defect prediction?

[Diagram: for each dataset (NASA, AEEEM, PROMISE), compute the average AUC of each classifier, then rank the classifiers with the Scott-Knott test]
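The evaluation above reports average AUC. As a minimal sketch of that measure, AUC can be computed as the probability that a randomly chosen defective entity receives a higher score than a randomly chosen clean one, with ties counted as one half. The helper names `auc` and `average_auc` are hypothetical, and how the slides' averages are grouped per dataset is not shown here.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    defective = [s for s, y in zip(scores, labels) if y]
    clean = [s for s, y in zip(scores, labels) if not y]
    if not defective or not clean:
        raise ValueError("AUC needs at least one defective and one clean entity")
    wins = 0.0
    for d in defective:
        for c in clean:
            if d > c:
                wins += 1.0
            elif d == c:
                wins += 0.5
    return wins / (len(defective) * len(clean))


def average_auc(runs):
    """Mean AUC over an iterable of (scores, labels) pairs, one pair per run."""
    values = [auc(scores, labels) for scores, labels in runs]
    return sum(values) / len(values)
```

The pairwise form is quadratic in the number of entities, which is acceptable for a sketch; a rank-based computation gives the same value faster.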

RQ1. Results (cross-project)
[Figure: classifiers placed in Rank 1 through Rank 4; red text = unsupervised, blue text = supervised]
Our approach can compete with the supervised classifiers under study, and sometimes is even better.

RQ2. Does the spectral clustering based classifier perform well in within-project defect prediction?

[Diagram: each project is randomly split 50% / 50%; each half is used once for training and once for testing, and AUC is computed on the testing half; classifiers are then ranked with the Scott-Knott test]
(500 random splits, thus 1,000 evaluations)
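The within-project protocol above can be sketched as a generator of index splits. This assumes one reading of the slide: each of the 500 random 50/50 splits is used twice, once with each half as the training set, which gives the 1,000 evaluations. The function name `two_way_half_splits` and the `seed` parameter are hypothetical.

```python
import random


def two_way_half_splits(n_entities, n_repeats=500, seed=0):
    """Yield (train_indices, test_indices) pairs: two per random 50/50 split."""
    rng = random.Random(seed)
    indices = list(range(n_entities))
    half = n_entities // 2
    for _ in range(n_repeats):
        rng.shuffle(indices)
        first = sorted(indices[:half])
        second = sorted(indices[half:])
        yield first, second  # train on the first half, test on the second
        yield second, first  # swap the roles of the two halves
```

Each yielded pair would feed one train-and-test run whose AUC is recorded; an unsupervised classifier would simply ignore the training indices.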

RQ2. Results (within-project)
Gold (rank 1): Random forest
Silver (rank 2): Logistic regression, Spectral clustering, Logistic model tree, Naïve Bayes
Bronze (rank 3): Fuzzy C-means
Our approach can achieve performance similar to that of the supervised classifiers, except random forest.

Feng Zhang
(feng@cs.queensu.ca) (http://www.feng-zhang.com)
