Does static analysis need
machine learning?
Anti-Talk
Victoria Khanieva
PVS-Studio
Speaker
Victoria Khanieva
• C++ developer at PVS-Studio
• Supported the MISRA standard
• Wrote articles on checks of open-source projects
khanieva@viva64.com
www.viva64.com
Agenda
• Introduction to static analysis
• Existing solutions and the approaches they implement
• Problems and pitfalls when creating an analyzer:
  • When learning "manually"
  • When learning on a real large code base
• Most promising approaches
About the analysis
Types of code analysis
• Code review
• Dynamic analysis
• Static analysis
Static analysis
• A way to reveal errors and flaws in the source code of programs
• Detect errors in programs
• Get tips on code formatting
• Count metrics
• …
Diagnostics
void createCube(float halfExtentsX,
                float halfExtentsY,
                float halfExtentsZ,
                ....) {
  ....
  m_model->addVertex(halfExtentsX,
                     halfExtentsY,
                     halfExtentsY,
                     ....);
  ....
}
V751 Parameter 'halfExtentsZ' is not used inside function body. TinyRenderer.cpp 375
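A diagnostic of this kind can be written as an explicit rule rather than learned from examples. The sketch below is not PVS-Studio's implementation. It assumes a hypothetical, greatly simplified model (`FunctionModel`) in which a function is just a list of parameter names plus the identifiers that occur in its body; a real analyzer would work on a syntax tree.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified model of a function: its parameter names and
// the identifiers that occur in its body.
struct FunctionModel {
    std::vector<std::string> parameters;
    std::vector<std::string> bodyIdentifiers;
};

// Returns the parameters that never occur in the function body.
std::vector<std::string> findUnusedParameters(const FunctionModel& fn) {
    std::vector<std::string> unused;
    for (const std::string& param : fn.parameters) {
        const bool used = std::find(fn.bodyIdentifiers.begin(),
                                    fn.bodyIdentifiers.end(),
                                    param) != fn.bodyIdentifiers.end();
        if (!used) unused.push_back(param);
    }
    return unused;
}
```

For the createCube example, the body mentions halfExtentsX and halfExtentsY (twice) but never halfExtentsZ, so the rule reports exactly that parameter.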
You'd think…
When ML is useful
• Useful: scanning photos and videos
• Not useful: a calculator
Possible result
Existing solutions
DeepCode
• Java, JS, TS, Python, C, C++
• Code review and audit
• You can check out demos on an open-source project
• Related posts
Infer
• Java, C, C++, Objective-C
• By Facebook
• Open-source code
• You can try Infer on your projects
• Based on Hoare logic, separation logic, bi-abduction, and abstract interpretation theory
SapFix
• Handles Infer results
• Suggests possible edits
Embold
• Platform for analyzing code quality
• System for suggesting edits
• Uses NLP to search for dependencies between functions and methods
Source{d}
• Open-source
• Related posts
• Repository with a dataset for learning
• Code-style detection
• Platform for collecting metrics and statistics
Fixing code style in Source{d}
Based on the article “STYLE-ANALYZER: fixing code style inconsistencies with interpretable unsupervised algorithms”
Clever-Commit
• By Mozilla and Ubisoft
• Searches for suspicious commits
• Based on the publication “CLEVER: Combining Code Metrics with Clone Detection for Just-In-Time Fault Prevention and Resolution in Large Industrial Projects”
CodeGuru
• Java
• By Amazon
• Recommendations on best practices based on the documentation and code base
Main directions
• Analyze code to search for errors
• Analyze code to search for deviations from best practices
• Analyze code artifacts
• Collect metrics and data on code
• Suggest code-style fixes
Ways to learn
• A selected base of open-source repositories
• A manually selected dataset
• Your own project base
Problems and pitfalls*
* from the viewpoint of a classic static analyzer developer
"Manual" dataset selection
We need to find:
if (A == A)
What it may look like:
• if (X && A == A)
• if (A + 1 == A + 1)
• if (A[i] == A[i])
• if ((A) == (A))
• …
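A classic analyzer does not need a separate training example for every spelling of this pattern, because it compares the operands structurally. The following is a minimal sketch under stated assumptions: the `Expr` node type, the `node` helper, and the string-based node kinds are hypothetical and stand in for a real syntax tree. It also ignores side effects and the `x == x` idiom sometimes used as a NaN check, both of which a production rule would have to consider.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical miniature AST: kind is one of "id", "num", "+", "[]",
// "()", "==", "&&"; text holds the identifier or literal spelling.
struct Expr {
    std::string kind;
    std::string text;
    std::vector<std::shared_ptr<Expr>> children;
};
using ExprPtr = std::shared_ptr<Expr>;

ExprPtr node(std::string kind, std::string text = "",
             std::vector<ExprPtr> children = {}) {
    return std::make_shared<Expr>(
        Expr{std::move(kind), std::move(text), std::move(children)});
}

// Redundant parentheses do not change the meaning, so skip them.
const Expr& stripParens(const Expr& e) {
    const Expr* cur = &e;
    while (cur->kind == "()" && cur->children.size() == 1)
        cur = cur->children[0].get();
    return *cur;
}

// Structural equality of two expressions, ignoring redundant parentheses.
bool sameExpr(const Expr& a, const Expr& b) {
    const Expr& x = stripParens(a);
    const Expr& y = stripParens(b);
    if (x.kind != y.kind || x.text != y.text ||
        x.children.size() != y.children.size())
        return false;
    for (std::size_t i = 0; i < x.children.size(); ++i)
        if (!sameExpr(*x.children[i], *y.children[i])) return false;
    return true;
}

// True if any "==" inside the expression compares a subexpression
// with itself.
bool hasSelfComparison(const Expr& e) {
    if (e.kind == "==" && e.children.size() == 2 &&
        sameExpr(*e.children[0], *e.children[1]))
        return true;
    for (const ExprPtr& child : e.children)
        if (hasSelfComparison(*child)) return true;
    return false;
}
```

One rule of this shape covers `X && A == A`, `A + 1 == A + 1`, `A[i] == A[i]`, and `(A) == (A)`, whereas a learned model would need examples of each variation in its dataset.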
Example from DeepCode
"Manual" learning
In practice
We need to find:
int y = x / 0;
What it may look like:
template <class T> class numeric_limits {
  ....
};
namespace boost {
  ....
}
namespace boost {
  namespace hash_detail {
    template <class T> void dsizet(size_t x) {
      size_t length = x / (limits<int>::digits - 31);
    }
  }
}
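In this fragment the divisor is not a literal 0; it only becomes zero once `digits` is resolved to its value. A classic analyzer finds this by folding constant expressions. The sketch below illustrates the idea on a hypothetical miniature expression tree (`Expr`, `lit`, `named`, `bin` are illustrative names, not part of any real analyzer), and it assumes that the value of `limits<int>::digits` is known to be 31, as it is for a typical 32-bit `int`.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

struct Expr;
using ExprPtr = std::shared_ptr<Expr>;

// Hypothetical miniature expression tree (not a real analyzer's AST).
struct Expr {
    enum Kind { Literal, Name, Sub, Div };
    Kind kind = Literal;
    long value = 0;    // used by Literal
    std::string name;  // used by Name
    ExprPtr lhs, rhs;  // used by Sub and Div
};

ExprPtr lit(long v) {
    auto e = std::make_shared<Expr>();
    e->value = v;
    return e;
}
ExprPtr named(std::string n) {
    auto e = std::make_shared<Expr>();
    e->kind = Expr::Name;
    e->name = std::move(n);
    return e;
}
ExprPtr bin(Expr::Kind k, ExprPtr l, ExprPtr r) {
    auto e = std::make_shared<Expr>();
    e->kind = k;
    e->lhs = std::move(l);
    e->rhs = std::move(r);
    return e;
}

// Names whose values are known at analysis time.
using Constants = std::map<std::string, long>;

// Folds the expression into `out`; returns false if the value is unknown.
bool fold(const Expr& e, const Constants& known, long& out) {
    long l = 0, r = 0;
    switch (e.kind) {
    case Expr::Literal:
        out = e.value;
        return true;
    case Expr::Name: {
        auto it = known.find(e.name);
        if (it == known.end()) return false;  // runtime value, unknown
        out = it->second;
        return true;
    }
    case Expr::Sub:
        if (!fold(*e.lhs, known, l) || !fold(*e.rhs, known, r)) return false;
        out = l - r;
        return true;
    case Expr::Div:
        if (!fold(*e.lhs, known, l) || !fold(*e.rhs, known, r) || r == 0)
            return false;
        out = l / r;
        return true;
    }
    return false;
}

// True if the expression is a division whose divisor folds to zero.
bool isDivisionByZero(const Expr& e, const Constants& known) {
    long divisor = 0;
    return e.kind == Expr::Div && fold(*e.rhs, known, divisor) && divisor == 0;
}
```

With `limits<int>::digits` mapped to 31, the divisor `limits<int>::digits - 31` folds to 0 and the division is flagged even though the dividend `x` is unknown.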
Data flow analysis
@Override
public String getText(Mode mode) {
  StringBuilder sb = new StringBuilder();
  ....
  if (filter.getMessage()
            .toLowerCase(Locale.ENGLISH)
            .startsWith("Each ")) {
    sb.append(" has base power and toughness ");
  } else {
    sb.append(" have base power and toughness ");
  }
  ....
  return sb.toString();
}
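In the Java fragment above, the string is lower-cased before `startsWith("Each ")` is called, so a prefix containing an upper-case letter can never match and the condition is always false. Data flow analysis derives this from a fact it tracks about the value ("contains no upper-case letters"). The sketch below shows only that final check, in C++ for consistency with the other examples; the function name is hypothetical and the check is limited to ASCII letters.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Fact tracked by the analysis: after lower-casing, the receiver contains
// no upper-case ASCII letters. A prefix that contains one cannot match.
bool prefixCanMatchLowercased(const std::string& prefix) {
    for (unsigned char ch : prefix)
        if (std::isupper(ch)) return false;
    return true;
}
```

Here `prefixCanMatchLowercased("Each ")` is false, which corresponds to an "expression is always false" warning, while `"each "` would be accepted.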
Data flow analysis
uint32_t* BnNew() {
  uint32_t* result = new uint32_t[kBigIntSize];
  memset(result, 0, kBigIntSize * sizeof(uint32_t));
  return result;
}
std::string AndroidRSAPublicKey(crypto::RSAPrivateKey* key) {
  ....
  uint32_t* n = BnNew();
  ....
  RSAPublicKey pkey;
  ....
  if (pkey.n0inv == 0)
    return kDummyRSAPublicKey; // <=
  ....
}
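The marked early return appears to leave the buffer obtained from `BnNew()` unreleased, which is something an analyzer can only see by tracking the pointer across the function call. One common way to make every exit path safe is to return an owning pointer. This is a sketch, not the project's actual fix: `EncodeKey`, its `earlyExit` parameter, and the value of `kBigIntSize` are hypothetical stand-ins.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>

constexpr std::size_t kBigIntSize = 64;  // hypothetical size for the sketch

// Returns an owning pointer, so every exit path releases the buffer.
std::unique_ptr<std::uint32_t[]> BnNew() {
    // The trailing () value-initializes the array to zero, replacing memset.
    return std::unique_ptr<std::uint32_t[]>(new std::uint32_t[kBigIntSize]());
}

// Hypothetical stand-in for the function from the example.
std::string EncodeKey(bool earlyExit) {
    std::unique_ptr<std::uint32_t[]> n = BnNew();
    if (earlyExit)
        return "dummy";  // n is released automatically here
    n[0] = 1;
    return "encoded";    // and here
}
```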
Learning on many projects
• "There are so many projects on GitHub! The analyzer will learn from their repositories and commits" turns into collecting and labeling commits.
• If a manually collected learning base is unreliable, what can we expect from an automatically collected one?
Learning on many projects
• Check out a commit with the word "fix":
Outdated code
• The analyzer has to keep up with the language it checks
• Most projects use outdated standards
• Most projects don’t use new constructs
Example
New construct:
std::vector<int> numbers;
....
for (int num : numbers)
  foo(num);
New error pattern:
for (int num : numbers)
  numbers.push_back(num * 2);
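The error pattern is dangerous because `push_back` may reallocate the vector and invalidate the iterators that the range-based `for` loop holds internally. A minimal sketch of one safe alternative follows; the function name `appendDoubled` is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Appends the doubled value of every original element.
// Iterating by index over the original size avoids holding iterators
// that push_back may invalidate when the vector reallocates.
void appendDoubled(std::vector<int>& numbers) {
    const std::size_t originalSize = numbers.size();
    numbers.reserve(originalSize * 2);
    for (std::size_t i = 0; i < originalSize; ++i)
        numbers.push_back(numbers[i] * 2);
}
```

A rule for this pattern exists only if someone knows the language semantics of the new construct; old code bases that predate range-based loops contain no examples of it to learn from.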
Documentation
Why documentation matters
• Code example:
char check(const uint8 *hash_stage2)
{
  ....
  return memcmp(hash_stage2, hash_stage2_reassured,
                SHA1_HASH_SIZE);
}
• The analyzer hypothetically suggests the following fix:
int check(const uint8 *hash_stage2)
{
  ....
  return memcmp(hash_stage2, hash_stage2_reassured,
                SHA1_HASH_SIZE);
}
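The reason the `char` return type is a problem comes from the documentation of `memcmp`: it returns an `int` whose sign is meaningful, with no promise that the value fits into a `char`. The sketch below illustrates the narrowing with hypothetical helper names; it assumes an 8-bit `char` and the wrap-around conversion that mainstream compilers perform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Mimics the original signature: the int result is narrowed to char.
// Only the low 8 bits survive, so a non-zero result such as 256
// becomes 0 and reads as "buffers are equal".
char narrowToChar(int memcmpResult) {
    return static_cast<char>(memcmpResult);
}

// A narrowing-free alternative: reduce the result to a boolean first.
bool buffersDiffer(const void* a, const void* b, std::size_t size) {
    return std::memcmp(a, b, size) != 0;
}
```

A model trained only on code has no access to this contract, whereas a hand-written rule can encode it directly from the documentation.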
Classic approach: documentation
Why documentation matters
Code example:
ObjectOutputStream out = new ObjectOutputStream(....);
SerializedObject obj = new SerializedObject();
obj.state = 100;
out.writeObject(obj);
obj.state = 200;
out.writeObject(obj);
out.close();
Why documentation matters
The analyzer suggests:
ObjectOutputStream out = new ObjectOutputStream(....);
SerializedObject obj = new SerializedObject();
obj.state = 100;
out.writeObject(obj);
obj = new SerializedObject(); // Add this line
obj.state = 200;
out.writeObject(obj);
out.close();
Why documentation matters
What happens without the edit:
ObjectOutputStream out = new ObjectOutputStream(....);
SerializedObject obj = new SerializedObject();
obj.state = 100;
out.writeObject(obj); // stores the object with the state = 100
obj.state = 200;
out.writeObject(obj); // stores the object with the state = 100
out.close();
Unambiguous behavior
False positives
std::vector<int> numbers;
....
for (int num : numbers)
{
  if (num < 5)
  {
    numbers.push_back(0);
    break; // or, for example, return
  }
}
False positives
• The reason for getting a warning may be unclear. The reason for NOT getting a warning may be unclear as well.
• How to fix this?
  • Additional learning (will it help?)
  • A mechanism to hide warnings (not universal)
In case of successful analyzer learning
Promising directions
• Code style by specific symbols
• Collecting additional metrics and information
• Best practices for a specific framework/code base/platform
Download a PVS-Studio one-month trial version and check your projects using classic static analysis:
https://pvs-studio.com/en/pvs-studio/download/
Q&A
viva64.com
khanieva@viva64.com
