Debugging Intermittent Issues
Lloyd Moore, President
Lloyd@CyberData-Robotics.com
www.CyberData-Robotics.com
Northwest C++ Users Group, January 2017
Agenda
 The problem with intermittent issues
 Getting a baseline
 Basic statistics
 Making a change
 Is it fixed?
 Dealing with multi-causal issues
Problem with Intermittent Issues
 Fundamentally you do not know what is causing
the problem, therefore you cannot control for it
 You never know from one run to the next if the
issue will show up
 From a single run you cannot say anything, so
you will need multiple runs
 Once you have multiple runs you need statistics
to identify and describe any change in behavior
Getting a Baseline
 A baseline is a reference point or configuration
where you control any factor that might be
related to the issue
 Since you really don't know what is causing the
issue – you likely won't actually control for it, but
at least you minimize the number of variables
affecting future measurements
 Ideally the goal is to simply control as much of
the system as possible
Key Point
 Intermittent failures are NOT really
intermittent!!!
 They are caused by some varying condition that
you are not measuring, observing, and
controlling for
 If the varying condition happens in a particular
way – you get a failure – every time
 The debugging problem is to find the condition that is varying, quantify how it varies, then control it
Getting a Baseline
 What you need to control will vary based on
what you are debugging
 “Pure software” based issues will often be
easier to control than “network connected”
issues or “hardware related” issues
 Note that “pure software” issues are also far
less likely to be intermittent as there are far
fewer sources of randomness
 The most notable source of pure software randomness is system loading leading to race conditions
Items to look at controlling
 Version of the software you are running
 Physical location of the target system
 Other processes running on the target system
 Network traffic to the target system – ideally
connect the system to a wired network
connection on an isolated network
 Other processes, and response time, of
remotely connected systems that the target
system depends on
More items to control
 Data being manipulated by the system
 State of the target system – does each run of
the system begin in the same state, including
existence and size of debugging files
 Depending on what is going on ANYTHING
could affect the issue: lighting, time of day,
people in the room, temperature, be creative
 May not get everything on the first pass – for
really hard problems establishing a baseline
becomes iterative
Also Control Yourself
 As you work the problem and repeatedly go
through the test case you will unconsciously
change your behavior
 Humans naturally learn to avoid failure conditions, and this includes the behavior patterns that make bugs show up
 The result can be VERY subtle – even something like changing your keystroke rate slightly can affect your testing
Can't Control It – Measure it
 Many times you will not be able to control a
factor that may be important – time of day for
example
 If you cannot control a factor make every
attempt to measure and record the value during
your test runs
 Oftentimes, after hours of testing and looking at something else, a pattern will emerge in the collected data that strongly hints at the real issue
Baseline Failure Rate
 Part of the baseline configuration is the rate at
which you experience the failure condition
 Once you have the test environment as stable
and quantified as possible you need to run the
system multiple times and quantify how often
you see the failure
 Three basic outcomes here:
 The failure no longer occurs
 The failure now happens all the time
 The failure happens X of Y trials
Baseline Failure Rate
 If you no longer have an intermittent failure (no failure or constant failure), something you changed in setting up the baseline is likely related to the issue – this is good, you can now systematically vary the baseline setup to debug
 Most likely you will still get some failures; for the next set of slides we will assume that you have run the system 10 times and seen 3 failures.....
Basic Statistics
 3 failures in 10 runs is: 3/10 = 0.3 = 30% failure
rate
 Success rate is: 1.0 – 0.3 = 0.7 = 70%
 Note that this number is an APPROXIMATION –
you can really never know the actual rates
 What is the chance of testing 10 times and not
seeing a single failure:
 Chance of success on first try * chance of success
on second try * chance ….
 0.7 * 0.7.... = 0.7^10 = 0.028 = 2.8%
How many times to test?
 So if we CAN run the test 10 times and not see a failure, how many times do we really need to test?
 Yes, there is a formula to estimate this – I have never actually used it in practice, as several of the variables involved are pretty hard to estimate – things like degrees of freedom
 Simply use a rule of thumb here – we want the chance of not seeing a failure, when there really is one, below either 5% or 1%
How many times to test?
 If I want to get below 1%, will 15 trials be enough?
 0.70^15 = 0.0047 = 0.47% – Yep, and this approach is good enough for day-to-day work
 Those who are more mathematically inclined could also solve the equation 0.01 = 0.70^X, then of course make sure to take the next full integer! (Hint: you need log_b(a^r) = r · log_b(a))
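For those who want the logarithm route spelled out, here is a small sketch (my own addition, assuming independent trials and the estimated 70% per-run success rate) that solves 0.01 = 0.70^X and rounds up:

  #include <cmath>
  #include <cstdio>

  int main() {
      const double successRate = 0.70;  // estimated per-run success rate from the baseline
      const double target = 0.01;       // acceptable chance of a false "all clear"
      // 0.01 = 0.70^X  =>  X = log(0.01) / log(0.70)
      const double x = std::log(target) / std::log(successRate);
      const int trials = static_cast<int>(std::ceil(x));  // take the next full integer
      std::printf("need at least %d trials (exact value: %.2f)\n", trials, x);
      return 0;
  }

With these numbers it reports 13 trials, which is consistent with the slide's check that 15 trials is more than enough.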
OK now what?
 We have a controlled test setup
 We have an approximation of the failure rate
 We may also have some clues on what may be
causing the problem
 Next step is to attempt a change and observe
the failure rate
Is it fixed?
 I made a change and now I don't see the failure condition any longer – I'm done, right?
 Hmm – NO!
 From the previous slide:
 What is the chance of testing 10 times and not
seeing a single failure:
 Chance of success on first try * chance of success
on second try * chance ….
 0.7 * 0.7.... = 0.7^10 = 0.028 = 2.8%
Is it fixed?
 One issue with problems that display
intermittently is that you can never really know
for sure if they are fixed
 The best you can do is estimate the probability
that they are fixed and decide if that is good
enough for your application
 You can always run more trials to increase the
certainty that you really fixed the problem
Is it fixed?
 Assume our original 30% failure rate, 70%
success rate:
Trials:   Formula:   Chance of not seeing a failure:
10        0.7^10     2.82%
15        0.7^15     0.474%
20        0.7^20     0.079%
25        0.7^25     0.013%
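The table can be regenerated for any estimated success rate with a short loop (again a sketch added here, not part of the original deck; rounding may differ slightly from the truncated table values):

  #include <cmath>
  #include <cstdio>

  int main() {
      const double successRate = 0.70;  // from the baseline estimate
      for (int trials = 10; trials <= 25; trials += 5) {
          // Chance that every one of 'trials' runs passes even though the bug remains
          const double missChance = std::pow(successRate, trials);
          std::printf("%2d trials: %.3f%%\n", trials, missChance * 100.0);
      }
      return 0;
  }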
Multi-causal Issues
 I made a change and my failure rate decreased,
but not to zero, now what?
 Very good indication that you have either affected the original problem, or have actually solved one problem while another remains
 The key to sorting this out is to look for details on the actual failure
 Oftentimes multiple issues have similar-looking failures, but are really two different things
Multi-causal Issues
 Attempt to get more information on the exact nature of the remaining failure; things to look at:
 Stack traces
 Timing values, how long into the run?
 Variation in results – cluster plots helpful here
 More than one trigger for the failure
 Many times you will stumble upon this information while debugging the original issue
 It is very common for intermittent issues to be multi-causal, as they are often ignored when first seen
Multi-causal Issues
 If you suspect you have more than one issue, set up a separate debugging environment for each suspected issue
 You may also be able to make a list of the separate failure cases and failure rates, decomposing the original numbers
 It is helpful to remember that the original failure rate is approximately the sum of the individual failure rates (see the worked example after this list)
 Work the issue with the highest failure rate first
 Keep working down the list until all are solved
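As a hypothetical worked example (the numbers are invented for illustration, not from the talk): suppose issue A by itself fails about 2 of 10 runs (20%) and issue B by itself about 1 of 10 runs (10%). As long as the two failures rarely land on the same run, the combined baseline shows roughly 0.2 + 0.1 = 0.3, i.e. the original 30% rate, and fixing issue A alone should drop the observed rate to roughly 10% – which is the kind of decomposition described above.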
Summary
 Intermittent issues are not really intermittent
 Need to track down what unknown variable is
changing and handle that variation
 A baseline system configuration with failure rates is key to telling if a change occurred
 Just because you don't see the failure any more
is NOT a guarantee that it is fixed
 Multi-causal issues can be separated and
worked as individual issues once you have
details on the failure
Questions?