Software Analytics: 
Towards Software Mining that Matters 
Tao Xie 
Department of Computer Science 
University of Illinois at Urbana-Champaign, USA 
taoxie@illinois.edu 
In Collaboration with Microsoft Research
Machine Learning that Matters 
“The basic argument in her paper is that machine learning 
might be in danger of losing its impact because the 
community as a whole has become quite self-referential. 
People are probably solving real-world problems using ML 
methods, but there is little sharing of these results within 
the community. Instead, people focus on existing 
benchmarks which might have originally had some 
connection to real-world problems which has been long 
forgotten, however.” 
“She proposes a number of tasks like $100M solved 
through ML based decision making or a human life saved 
through a diagnosis or an intervention recommended by 
an ML system to get ML back on track.” 
ICML’12 
http://icml.cc/2012/papers/298.pdf 
http://blog.mikiobraun.de/2012/06/is-machine-learning-losing-impact.html
2012 NSF Workshop on Formal Methods 
• Goal: to identify the future directions in research in 
formal methods and its transition to industrial 
practice. 
• Success examples mentioned by the attendees 
– SLAM/SDV 
– ASTREE 
– SMT-based tools 
– … 
http://goto.ucsd.edu/~rjhala/NSFWorkshop/
“What Happened to the Promise 
of Software Tools?” – Jim Larus 
http://www.srl.inf.ethz.ch/workshop2014/eth-larus.pdf 
https://www.youtube.com/watch?v=kO9OYnkeRTM
Software Analytics 
Software analytics is to enable software 
practitioners to perform data exploration and 
analysis in order to obtain insightful and 
actionable information for data-driven tasks 
around software and services. 
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software 
Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011 
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf
http://research.microsoft.com/en-us/groups/sa/ 
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx
http://research.microsoft.com/en-us/groups/sa/stackmine_icse2012.pdf 
http://research.microsoft.com/en-us/groups/sa/ieeesoft13-softanalytics.pdf 
StackMine 
Performance debugging in the large via 
mining millions of stack traces
Performance debugging in the large
[Figure: workflow diagram, built up over four slides. Components: Trace collection, Network, Trace Storage, Trace analysis, Pattern Matching, Problematic Pattern Repository, Bug filing, Bug update, Bug Database. Callouts: "Key to issue discovery", "Bottleneck of scalability", "How many issues are still unknown?", "Which trace file should I investigate first?"]
Technical highlights 
• Data mining for software domain 
– Discovery of problematic execution patterns formulated as 
callstack mining & clustering 
– Domain knowledge incorporated systematically 
• Interactive performance analysis system 
– Parallel mining infrastructure based on HPC + MPI 
– Visualization aided interactive exploration
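The formulation above (discovering problematic execution patterns by mining and clustering call stacks) can be illustrated with a small sketch. This is not StackMine's algorithm. It is a simplified stand-in, written in Java, that greedily groups call stacks whose frame sets overlap, using Jaccard similarity. The class name, method names, threshold, and frame names are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: greedy clustering of call stacks by frame overlap.
// A simplified illustration only, not the StackMine mining algorithm.
class CallStackClustering {

    // Jaccard similarity between the sets of frames in two call stacks.
    static double similarity(List<String> a, List<String> b) {
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            return 1.0;
        }
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return (double) common.size() / union.size();
    }

    // Put each stack into the first cluster whose representative (its first
    // member) is at least `threshold` similar; otherwise start a new cluster.
    static List<List<List<String>>> cluster(List<List<String>> stacks, double threshold) {
        List<List<List<String>>> clusters = new ArrayList<>();
        for (List<String> stack : stacks) {
            List<List<String>> home = null;
            for (List<List<String>> candidate : clusters) {
                if (similarity(candidate.get(0), stack) >= threshold) {
                    home = candidate;
                    break;
                }
            }
            if (home == null) {
                home = new ArrayList<>();
                clusters.add(home);
            }
            home.add(stack);
        }
        return clusters;
    }
}
```

With a threshold of 0.5, two stacks that share three of four distinct frames land in the same cluster, while a stack that shares only the entry frame starts a new one. An analyst could then inspect one representative per cluster instead of every trace, which is the kind of prioritization the slide's open questions ask for.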
Impact: Debugging Productivity Boost 
“We believe that the MSRA tool is highly valuable and much more 
efficient for mass trace (100+ traces) analysis. For 1000 traces, we 
believe the tool saves us 4-6 weeks of time to create new signatures, 
which is quite a significant productivity boost.” 
Highly effective new issue discovery on Windows 
mini-hang 
Continuous impact on future Windows 
versions
http://research.microsoft.com/en-us/groups/sa/xiao_acsac12_camerareadyfinal.pdf 
XIAO 
Scalable code clone analysis 
2012
XIAO: Code Clone Analysis 
• Motivation 
– Copy-and-paste is a common developer behavior 
– A real tool widely adopted internally and externally 
• XIAO enables code clone analysis in the following way 
– High tunability 
– High scalability 
– High compatibility 
– High explorability
High tunability – what you tune is what you get 
• Intuitive similarity metric 
– Effective control of the degree of syntactical differences between two code snippets 
• Tunable at fine granularity 
– Statement similarity 
– % of inserted/deleted/modified statements 
– Balance between code structure and disordered statements 
for (i = 0; i < n; i++) {
    a++;
    b++;
    c = foo(a, b);
    d = bar(a, b, c);
    e = a + c;
}
for (i = 0; i < n; i++) {
    c = foo(a, b);
    a++;
    b++;
    d = bar(a, b, c);
    e = a + d;
    e++;
}
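The idea of a tunable similarity threshold can be sketched as follows. This is not XIAO's metric. It is a minimal illustration, in Java, that assumes statements are compared by exact text and that similarity is one minus the statement-level edit distance divided by the length of the longer body. The class, method, and constant names are hypothetical, and the sketch does not model token-level statement similarity or tolerance for reordered statements.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: statement-level edit distance with a tunable threshold.
// A simplified illustration only, not the similarity metric used by XIAO.
class CloneSimilarity {

    // Levenshtein distance over whole statements: inserting, deleting, or
    // modifying one statement each costs 1.
    static int editDistance(List<String> a, List<String> b) {
        int[][] d = new int[a.size() + 1][b.size() + 1];
        for (int i = 0; i <= a.size(); i++) {
            d[i][0] = i;
        }
        for (int j = 0; j <= b.size(); j++) {
            d[0][j] = j;
        }
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                int modify = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + modify,
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return d[a.size()][b.size()];
    }

    // One minus the edit distance normalized by the longer body; in [0, 1].
    static double similarity(List<String> a, List<String> b) {
        int longest = Math.max(a.size(), b.size());
        return longest == 0 ? 1.0 : 1.0 - (double) editDistance(a, b) / longest;
    }

    // The tuning knob: a pair is reported only if it meets minSimilarity.
    static boolean isClone(List<String> a, List<String> b, double minSimilarity) {
        return similarity(a, b) >= minSimilarity;
    }

    // The two loop bodies from the slide, one statement per element.
    static final List<String> LEFT = Arrays.asList(
            "a++", "b++", "c = foo(a, b)", "d = bar(a, b, c)", "e = a + c");
    static final List<String> RIGHT = Arrays.asList(
            "c = foo(a, b)", "a++", "b++", "d = bar(a, b, c)", "e = a + d", "e++");
}
```

Under these assumptions the two loop bodies are four statement edits apart, so they are reported as clones at a loose threshold such as 0.3 but not at 0.5. Because this order-sensitive distance counts the moved `c = foo(a, b)` as one deletion plus one insertion, a metric that tolerates disordered statements, as the slide describes, would rate the pair as more similar.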
High explorability
[Figure: first screenshot, with numbered callouts 1-7]
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
[Figure: second screenshot, with numbered callouts 1-6]
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
Scenarios & Solutions 
Quality gates at milestones 
• Architecture refactoring 
• Code clone clean up 
• Bug fixing 
Post-release maintenance 
• Security bug investigation 
• Bug investigation for sustained engineering 
Development and testing 
• Checking for similar issues before check-in 
• Reference info for code review 
• Supporting tool for bug triage 
Online code clone search 
Offline code clone analysis
Impact: Benefiting developer community 
Available in Visual Studio 2012 RC 
Searching for similar snippets
to fix a bug once
Finding refactoring
opportunities
Impact: More secure Microsoft products 
Code Clone Search service integrated into 
workflow of Microsoft Security Response Center 
Over 590 million lines of code indexed across 
multiple products 
Real security issues proactively identified and 
addressed
Example – MS Security Bulletin MS12-034 
Combined Security Update for Microsoft Office, Windows, .NET Framework, and 
Silverlight, published: Tuesday, May 08, 2012 
Involved 3 publicly disclosed vulnerabilities and 7 privately reported ones. Specifically, 1 was
exploited by the Duqu malware to execute arbitrary code when a user opened a
malicious Office document
Insufficient bounds check within the font parsing subsystem of win32k.sys
Cloned copies in gdiplus.dll, ogl.dll (Office), Silverlight, Windows Journal Viewer
Microsoft Technet Blog about this bulletin 
However, we wanted to be sure to address the vulnerable code wherever it appeared 
across the Microsoft code base. To that end, we have been working with Microsoft 
Research to develop a “Cloned Code Detection” system that we can run for every 
MSRC case to find any instance of the vulnerable code in any shipping product. This 
system is the one that found several of the copies of CVE-2011-3402 that we are 
now addressing with MS12-034.
http://research.microsoft.com/apps/pubs/?id=202451 
SAS 
Incident management of online services
Motivation 
• Online services are increasingly popular & important 
• High service quality is the key 
Incident Management (IcM) is a critical task to 
assure service quality
Incident Management: Workflow 
1. Detect a service issue
2. Alert On-Call Engineers (OCEs)
3. Investigate the problem
4. Restore the service
5. Fix root cause via postmortem analysis
SAS: Incident management of online services 
SAS was developed and deployed to effectively reduce MTTR
(Mean Time To Restore) by automatically analyzing
monitoring data
• Design principles of SAS
– Automating analysis
– Handling heterogeneity
– Accumulating knowledge
– Supporting human-in-the-loop (HITL)
Techniques Overview 
• System metrics 
– Identifying Incident Beacons 
• Transaction logs 
– Mining Suspicious Execution Patterns 
• Historical incidents 
– Mining Historical Workaround Solutions
Industry Impact of SAS 
Deployment 
• SAS deployed to 
worldwide datacenters for 
Service X (serving 
hundreds of millions of 
users) since June 2011 
• OCEs now heavily depend 
on SAS 
Usage 
• SAS helped successfully 
diagnose ~76% of the 
service incidents assisted 
with SAS
http://web.engr.illinois.edu/~taoxie/publications/icse13see-pex4fun.pdf 
Coding Duels (Code Hunt/Pex4Fun) 
Teaching/Learning Programming/Software Engineering via 
Interactive Gaming
Code Hunt Competition for Students 
https://www.codehunt.com/ 
Precursor: http://www.pex4fun.com/
A Fun and Engaging Game – Win by Writing Code
Supports Java and C#
Adapts to competitions as well as individual play
Users: 1,181,152
User Programs: 7,079,497
www.codehunt.com
Behind the Scene of Coding Duel 
Secret Implementation
class Secret {
    public static int Puzzle(int x) {
        if (x <= 0) return 1;
        return x * Puzzle(x - 1);
    }
}
Player Implementation
class Player {
    public static int Puzzle(int x) {
        return x;
    }
}
class Test {
    public static void Driver(int x) {
        if (Secret.Puzzle(x) != Player.Puzzle(x))
            throw new Exception("Mismatch");
    }
}
Behavior: Secret Impl == Player Impl
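The driver above throws as soon as some input makes the two implementations disagree, so the work behind a duel is finding such an input. Pex does this with dynamic symbolic execution. The Java sketch below swaps in a plain brute-force scan over a small input range to show the idea. The class name `DuelSketch`, the method `firstMismatch`, and the chosen ranges are assumptions made for illustration, not part of Code Hunt or Pex.

```java
import java.util.OptionalInt;
import java.util.function.IntUnaryOperator;

// Hypothetical sketch: search for an input on which player and secret differ.
// Brute force stands in here for the test generation that Pex performs.
class DuelSketch {

    // Secret implementation from the slide: factorial, with 1 for x <= 0.
    static int secret(int x) {
        if (x <= 0) {
            return 1;
        }
        return x * secret(x - 1);
    }

    // The player's first attempt from the slide.
    static int player(int x) {
        return x;
    }

    // Scan lo..hi in order and return the first input where the results differ.
    static OptionalInt firstMismatch(IntUnaryOperator secretImpl,
                                     IntUnaryOperator playerImpl,
                                     int lo, int hi) {
        for (int x = lo; x <= hi; x++) {
            if (secretImpl.applyAsInt(x) != playerImpl.applyAsInt(x)) {
                return OptionalInt.of(x);
            }
        }
        return OptionalInt.empty();
    }
}
```

Calling `DuelSketch.firstMismatch(DuelSketch::secret, DuelSketch::player, 1, 10)` reports 3, since the two implementations agree on 1 and 2 but the secret returns 6 for 3. A mismatching input like this is the feedback a player needs to refine the next attempt, and an empty result over the scanned range means no disagreement was found there, not that the implementations are equivalent.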
Experience Reports on Successful Tool Transfer 
• Nikolai Tillmann, Jonathan de Halleux, and Tao Xie. Transferring an Automated Test 
Generation Tool to Practice: From Pex to Fakes and Code Digger. In Proceedings of ASE 
2014, Experience Papers. http://web.engr.illinois.edu/~taoxie/publications/ase14-pexexperiences.pdf
• Jian-Guang Lou, Qingwei Lin, Rui Ding, Qiang Fu, Dongmei Zhang, and Tao Xie. Software 
Analytics for Incident Management of Online Services: An Experience Report. In 
Proceedings of ASE 2013, Experience Paper.
http://web.engr.illinois.edu/~taoxie/publications/ase13-sas.pdf 
• Dongmei Zhang, Shi Han, Yingnong Dang, Jian-Guang Lou, Haidong Zhang, and Tao Xie. 
Software Analytics in Practice. IEEE Software, Special Issue on the Many Faces of Software 
Analytics, 2013. http://web.engr.illinois.edu/~taoxie/publications/ieeesoft13-softanalytics.pdf 
• Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: 
Tuning Code Clones at Hands of Engineers in Practice. In Proceedings of ACSAC 2012. 
http://web.engr.illinois.edu/~taoxie/publications/acsac12-xiao.pdf
Ex: Human Consumption of Tool Outputs 
• Developer: Your tool generated “0” 
• Pex team: What did you expect? 
• Developer: Marc 
Invariant candidates: 
this.getPrice() > 0 
this.getPrice() >= 0 
http://www.agitar.com/ http://research.microsoft.com/projects/pex/
Q & A 
Contact: taoxie@illinois.edu 
http://research.microsoft.com/en-us/groups/sa/ 
http://www.cs.illinois.edu/homes/taoxie/ 
Supported in part by a Microsoft Research Award, NSF grants CCF-1349666, CNS-1434582, CCF-1434596,
CCF-1434590, CNS-1439481, and the USA National Security Agency (NSA) Science of Security Lablet.

Software Analytics: Towards Software Mining that Matters (2014)