Debugging Intermittent Issues
Lloyd Moore, President
Lloyd@CyberData-Robotics.com
www.CyberData-Robotics.com
Northwest C++ Users Group, January 2017
Agenda
 The problem with intermittent issues
 Getting a baseline
 Basic statistics
 Making a change
 Is it fixed?
 Dealing with multi-causal issues
Problem with Intermittent Issues
 Fundamentally you do not know what is causing
the problem, therefore you cannot control for it
 You never know from one run to the next if the
issue will show up
 From a single run you cannot say anything, so
you will need multiple runs
 Once you have multiple runs you need statistics
to identify and describe any change in behavior
Getting a Baseline
 A baseline is a reference point or configuration
where you control any factor that might be
related to the issue
 Since you really don't know what is causing the
issue – you likely won't actually control for it, but
at least you minimize the number of variables
affecting future measurements
 Ideally the goal is to simply control as much of
the system as possible
Key Point
 Intermittent failures are NOT really
intermittent!!!
 They are caused by some varying condition that
you are not measuring, observing, and
controlling for
 If the varying condition happens in a particular
way – you get a failure – every time
 The debugging problem is to find the condition that is varying, quantify how it varies, then control it
Getting a Baseline
 What you need to control will vary based on
what you are debugging
 “Pure software” based issues will often be
easier to control than “network connected”
issues or “hardware related” issues
 Note that “pure software” issues are also far
less likely to be intermittent as there are far
fewer sources of randomness
 The most notable source of pure software randomness is system loading leading to race conditions
Items to look at controlling
 Version of the software you are running
 Physical location of the target system
 Other processes running on the target system
 Network traffic to the target system – ideally
connect the system to a wired network
connection on an isolated network
 Other processes, and response time, of
remotely connected systems that the target
system depends on
More items to control
 Data being manipulated by the system
 State of the target system – does each run of
the system begin in the same state, including
existence and size of debugging files
 Depending on what is going on ANYTHING
could affect the issue: lighting, time of day,
people in the room, temperature, be creative
 May not get everything on the first pass – for
really hard problems establishing a baseline
becomes iterative
Also Control Yourself
 As you work the problem and repeatedly go
through the test case you will unconsciously
change your behavior
 Humans naturally learn to avoid failure conditions, and this includes the behavior patterns that make bugs show up
 The result can be VERY subtle – even something like changing your keystroke rate slightly can affect your testing
Can't Control It – Measure it
 Many times you will not be able to control a
factor that may be important – time of day for
example
 If you cannot control a factor make every
attempt to measure and record the value during
your test runs
 Oftentimes, after hours of testing and looking at something else, a pattern will emerge in the collected data that strongly hints at the real issue
Baseline Failure Rate
 Part of the baseline configuration is the rate at
which you experience the failure condition
 Once you have the test environment as stable
and quantified as possible you need to run the
system multiple times and quantify how often
you see the failure
 Three basic outcomes here:
 The failure no longer occurs
 The failure now happens all the time
 The failure happens X of Y trials
Baseline Failure Rate
 If you no longer have an intermittent failure (no failure or constant failure), something you changed in setting up the baseline is likely related to the issue – this is good, you can now systematically vary the baseline setup to debug
 Most likely you will still get some failures; for the next set of slides we will assume that you have run the system 10 times and seen 3 failures.....
Basic Statistics
 3 failures in 10 runs is: 3/10 = 0.3 = 30% failure
rate
 Success rate is: 1.0 – 0.3 = 0.7 = 70%
 Note that this number is an APPROXIMATION –
you can really never know the actual rates
 What is the chance of testing 10 times and not
seeing a single failure:
 Chance of success on first try * chance of success
on second try * chance ….
 0.7 * 0.7.... = 0.7^10 = 0.028 = 2.8%
How many times to test?
 So if we CAN run the test 10 times and not see a failure, how many times do we really need to test?
 Yes, there is a formula to estimate this – I have never actually used it in practice, as several of the variables involved are pretty hard to estimate – things like degrees of freedom
 Simply use a rule of thumb here – we want the chance of not seeing a failure, when there really is one, below either 5% or 1%
How many times to test?
 If I want to get below 1%, will 15 trials be enough?
 0.70^15 = 0.0047 = 0.47% – Yep, and this approach is good enough for day-to-day work
 Those who are more mathematically inclined could also solve the equation 0.01 = 0.70^X, then of course make sure to take the next full integer! (Hint: you need log_b(a^r) = r · log_b(a))
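For those who want the logarithm route spelled out, here is a small sketch (my own addition, assuming independent trials and the estimated 70% per-run success rate) that solves 0.01 = 0.70^X and rounds up:

  #include <cmath>
  #include <cstdio>

  int main() {
      const double successRate = 0.70;  // estimated per-run success rate from the baseline
      const double target = 0.01;       // acceptable chance of a false "all clear"
      // 0.01 = 0.70^X  =>  X = log(0.01) / log(0.70)
      const double x = std::log(target) / std::log(successRate);
      const int trials = static_cast<int>(std::ceil(x));  // take the next full integer
      std::printf("need at least %d trials (exact value: %.2f)\n", trials, x);
      return 0;
  }

With these numbers it reports 13 trials, which is consistent with the slide's check that 15 trials is more than enough.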
OK now what?
 We have a controlled test setup
 We have an approximation of the failure rate
 We may also have some clues on what may be
causing the problem
 Next step is to attempt a change and observe
the failure rate
Is it fixed?
 I made a change and now I don't see the failure condition any longer – I'm done, right?
 Hmm – NO!
 From the previous slide:
 What is the chance of testing 10 times and not
seeing a single failure:
 Chance of success on first try * chance of success
on second try * chance ….
 0.7 * 0.7.... = 0.7^10 = 0.028 = 2.8%
Is it fixed?
 One issue with problems that display
intermittently is that you can never really know
for sure if they are fixed
 The best you can do is estimate the probability
that they are fixed and decide if that is good
enough for your application
 You can always run more trials to increase the
certainty that you really fixed the problem
Is it fixed?
 Assume our original 30% failure rate, 70%
success rate:
Trials:   Formula:   Chance of not seeing a failure:
10        0.7^10     2.82%
15        0.7^15     0.474%
20        0.7^20     0.079%
25        0.7^25     0.013%
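The table can be regenerated for any estimated success rate with a short loop (again a sketch added here, not part of the original deck; rounding may differ slightly from the truncated table values):

  #include <cmath>
  #include <cstdio>

  int main() {
      const double successRate = 0.70;  // from the baseline estimate
      for (int trials = 10; trials <= 25; trials += 5) {
          // Chance that every one of 'trials' runs passes even though the bug remains
          const double missChance = std::pow(successRate, trials);
          std::printf("%2d trials: %.3f%%\n", trials, missChance * 100.0);
      }
      return 0;
  }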
Multi-causal Issues
 I made a change and my failure rate decreased,
but not to zero, now what?
 Very good indication that you have either affected the original problem, or have actually solved one problem while another remains
 The key to sorting this out is to look for details on the actual failure
 Oftentimes multiple issues have similar-looking failures, but are really two different things
Multi-causal Issues
 Attempt to get more information on the exact nature of the remaining failure; things to look at:
 Stack traces
 Timing values, how long into the run?
 Variation in results – cluster plots helpful here
 More than one trigger for the failure
 Many times you will stumble upon this information while debugging the original issue
 It is very common for intermittent issues to be multi-causal, as they are often ignored when first seen
Multi-causal Issues
 If you suspect you have more than one issue, set up a separate debugging environment for each suspected issue
 You may also be able to make a list of the separate failure cases and failure rates, decomposing the original numbers
 It is helpful to remember that the original failure rate is approximately the sum of the individual failure rates (see the worked example after this list)
 Work the issue with the highest failure rate first
 Keep working down the list until all are solved
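As a hypothetical worked example (the numbers are invented for illustration, not from the talk): suppose issue A by itself fails about 2 of 10 runs (20%) and issue B by itself about 1 of 10 runs (10%). As long as the two failures rarely land on the same run, the combined baseline shows roughly 0.2 + 0.1 = 0.3, i.e. the original 30% rate, and fixing issue A alone should drop the observed rate to roughly 10% – which is the kind of decomposition described above.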
Summary
 Intermittent issues are not really intermittent
 Need to track down what unknown variable is
changing and handle that variation
 A baseline system configuration with failure rates is key to telling if a change occurred
 Just because you don't see the failure any more
is NOT a guarantee that it is fixed
 Multi-causal issues can be separated and
worked as individual issues once you have
details on the failure
Questions?