As UX researchers, we often encounter a common challenge: deciding whether one design truly outperforms another. Maybe one version of an interface feels faster or looks cleaner. But how do we know if those differences are meaningful, or just the result of chance? To answer that, we turn to statistical comparisons.

When comparing numeric metrics like task time or SUS scores, one of the first decisions is whether you’re working with the same users across both designs or two separate groups. If it's the same users, a paired t-test helps isolate the design effect by removing between-subject variability. For independent groups, a two-sample t-test is appropriate, though it requires more participants to detect small effects due to added variability.

Binary outcomes like task success or conversion are another common case. If different users are tested on each version, a two-proportion z-test is suitable. But when the same users attempt tasks under both designs, McNemar’s test allows you to evaluate whether the observed success rates differ in a meaningful way.

Task time data in UX is often skewed, which violates assumptions of normality. A good workaround is to log-transform the data before calculating confidence intervals, and then back-transform the results to interpret them on the original scale. This gives you a more reliable estimate of the typical time range without being overly influenced by outliers.

Statistical significance is only part of the story. Once you establish that a difference is real, the next question is: how big is the difference? For continuous metrics, Cohen’s d is the most common effect size measure, helping you interpret results beyond p-values. For binary data, metrics like risk difference, risk ratio, and odds ratio offer insight into how much more likely users are to succeed or convert with one design over another.

Before interpreting any test results, it’s also important to check a few assumptions: are your groups independent, are the data roughly normal (or corrected for skew), and are variances reasonably equal across groups? Fortunately, most statistical tests are fairly robust, especially when sample sizes are balanced.

If you're working in R, I’ve included code in the carousel. This walkthrough follows the frequentist approach to comparing designs. I’ll also be sharing a follow-up soon on how to tackle the same questions using Bayesian methods.
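The carousel itself isn't reproduced here, but a minimal R sketch of the workflow described above could look like the following. All object names and numbers (time_A, time_B, successes, trials) are simulated placeholders, not data from a real study.

```r
# Minimal sketch (not the carousel code): picking a test based on study design.
# All data below are simulated placeholders.
set.seed(42)

## Continuous metric (e.g., task time in seconds) ----------------------------
time_A <- rlnorm(30, meanlog = 3.4, sdlog = 0.4)   # design A, 30 participants
time_B <- rlnorm(30, meanlog = 3.2, sdlog = 0.4)   # design B, same 30 participants

# Same users tried both designs -> paired t-test removes between-subject noise
t.test(time_A, time_B, paired = TRUE)

# Two separate groups of users -> two-sample (Welch) t-test
t.test(time_A, time_B)

## Binary metric (e.g., task success) -----------------------------------------
# Different users on each version -> two-proportion test
# (prop.test is the chi-square form of the two-proportion z-test)
successes <- c(A = 42, B = 30)
trials    <- c(A = 60, B = 60)
prop.test(successes, trials)

# Same users attempt the task under both designs -> McNemar's test
# on the 2x2 table of paired outcomes (rows = design A, columns = design B)
paired_success <- matrix(c(25,  3,    # A pass / B pass,  A pass / B fail
                           10, 22),   # A fail / B pass,  A fail / B fail
                         nrow = 2, byrow = TRUE)
mcnemar.test(paired_success)

## Skewed task times: log-transform, then back-transform the CI ---------------
# Interval for the geometric mean time, less swayed by a few slow participants
exp(t.test(log(time_A))$conf.int)

## Effect sizes ----------------------------------------------------------------
# Continuous metric: Cohen's d with a pooled standard deviation
pooled_sd <- sqrt(((length(time_A) - 1) * var(time_A) +
                   (length(time_B) - 1) * var(time_B)) /
                  (length(time_A) + length(time_B) - 2))
(mean(time_A) - mean(time_B)) / pooled_sd

# Binary metric: risk difference, risk ratio, and odds ratio
p_A <- successes[["A"]] / trials[["A"]]
p_B <- successes[["B"]] / trials[["B"]]
c(risk_difference = p_A - p_B,
  risk_ratio      = p_A / p_B,
  odds_ratio      = (p_A / (1 - p_A)) / (p_B / (1 - p_B)))
```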
Analyzing Statistical Significance In Experiments
Summary
Analyzing statistical significance in experiments involves determining whether observed differences or results in data are meaningful or simply due to random chance. It helps researchers and analysts make informed decisions backed by data rather than assumptions.
- Understand the tests: Choose the correct statistical test based on the type of data and design of your experiment, such as paired t-tests for the same users or two-proportion z-tests for independent groups.
- Interpret p-values carefully: A p-value below 0.05 indicates statistical significance, but it doesn’t guarantee practical relevance or prove effectiveness; always consider the broader context.
- Account for uncertainty: Focus on confidence intervals and the range of possible effects, rather than relying solely on mean estimates, to avoid misleading conclusions.
Me, watching someone misdescribe p-values at a conference… Do you think you can pass the p-value explanation test?

First: a p-value is not a badge of truth or a certificate of real-world impact.

➊ A p-value is
→ The probability of observing results as extreme (or more extreme) as yours
→ Given that the null hypothesis is true

For example:

➋ A p-value of 0.03 does not mean:
→ “My intervention worked”
→ “There’s a 97% chance the null is false”
→ “We’ve found definitive proof”
Instead, it means there’s a 3% chance that you would observe results this strong (or stronger) if there were truly no effect.

➌ Why does 0.05 matter?
→ In research, we often use 0.05 as a conventional cutoff for statistical significance
→ If your p-value is less than 0.05, we say the result is “statistically significant”
→ This means it’s unlikely the observed results happened by chance under the null
BUT
→ Statistical significance ≠ practical relevance
→ p < 0.05 doesn’t mean “definitely effective”
→ And p > 0.05 doesn’t mean “no effect at all”

➍ This is where most get it wrong
→ They treat p-values as a truth switch: “yes” or “no”
→ But statistics is nuance. Interpretation matters.
→ And misrepresenting the basics undermines public trust in science.

💬 Have you seen p-values misused or misunderstood in public discussions?
♻️ Repost to help raise the bar for statistical literacy in public health.

#StatisticalThinking #PValueMyths
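To make that definition concrete, here is a small simulation sketch in R (an added illustration with made-up data, not part of the original post): generate many experiments in which the null is true by construction, and count how often the test statistic comes out at least as extreme as the one observed.

```r
# Simulation sketch of the p-value definition: under a true null (no effect),
# how often is the test statistic at least as extreme as the observed one?
set.seed(1)

# One "observed" experiment, where a small true effect happens to exist
observed_t <- t.test(rnorm(50, mean = 0.3), rnorm(50, mean = 0))$statistic

# 10,000 experiments where the null is true by construction
null_t <- replicate(10000, t.test(rnorm(50), rnorm(50))$statistic)

# Two-sided: proportion of null experiments as extreme as what we observed.
# This is the quantity a p-value estimates -- not the probability that the
# null hypothesis is false.
mean(abs(null_t) >= abs(observed_t))
```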
“I ran an experiment showing positive lift but didn’t see the results in the bottom line.”

I think we’ve all had this experience: We set up a nice, clean A/B test to check the value of a feature or a creative. We get the results back: 5% lift, statistically significant. Nice! Champagne bottle pops, etc., etc. Since we got the win, we bake the 5% lift into our forecast for next quarter when the feature will roll out to the entire customer base and we sit back to watch the money roll in.

But then, shockingly, we do not actually see that lift. When we look at our overall metrics we may see a very slight lift around when the feature got rolled out, but then it goes back down and it seems like it could just be noise anyway. Since we had baked our 5% lift into our forecast, and we definitely don’t have the 5% lift, we’re in trouble. What happened?

The big issue here is that we didn’t consider uncertainty. When interpreting the results of our A/B test, we said “It’s a 5% lift, statistically significant,” which implies something like “It’s definitely a 5% lift.” Unfortunately, this is not the right interpretation. The right interpretation is: “There was a statistically significant positive (i.e., >0) lift, with a mean estimate of 5%, but the experiment is consistent with a lift ranging from 0.001% to 9.5%.” Because of well-known biases associated with this type of null-hypothesis testing, it’s most likely that the actual result was some very small positive lift, but our test just didn’t have enough statistical power to narrow the uncertainty bounds very much.

So, what does this mean? When you’re doing any type of experimentation, you need to be looking at the uncertainty intervals from the test. You should never just report out the mean estimate from the test and say that’s “statistically significant”. Instead, you should always report out the range of metrics that are compatible with the experiment. When actually interpreting those results in a business context, you generally want to be conservative and assume the actual results will come in on the low end of the estimate from the test, or if it’s mission-critical then design a test with more statistical power to confirm the result.

If you just look at the mean results from your test, you are highly likely to be led astray! You should always be looking first at the range of the uncertainty interval and only checking the mean last.

To learn more about Recast, you can check us out here: https://lnkd.in/e7BKrBf4
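As a rough sketch of that reporting habit (illustrative numbers and base-R functions, not Recast's methodology), here is how the same kind of result can be reported with its full uncertainty interval, plus a quick calculation for sizing a higher-powered follow-up test:

```r
# Illustrative A/B result: a statistically significant ~5% relative lift whose
# uncertainty interval still runs from near zero to roughly double the estimate
control_conv <- 3000;  control_n <- 60000   # control: conversions / users
treat_conv   <- 3165;  treat_n   <- 60000   # treatment: conversions / users

test <- prop.test(c(treat_conv, control_conv), c(treat_n, control_n))

p_ctrl  <- control_conv / control_n
p_treat <- treat_conv / treat_n

relative_lift <- (p_treat - p_ctrl) / p_ctrl    # the mean estimate of lift
lift_ci       <- test$conf.int / p_ctrl         # CI for the difference,
                                                # rescaled to a relative lift

# Report the whole range, not just the point estimate
round(100 * c(estimate = relative_lift, lower = lift_ci[1], upper = lift_ci[2]), 1)

# Plan around the low end of the interval; if the decision is mission-critical,
# size a higher-powered confirmatory test (returns required users per group)
power.prop.test(p1 = p_ctrl, p2 = p_ctrl * 1.05, power = 0.9)
```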