How I would visualize a regression: I wouldn't. IMO, it's better to visualize the *takeaway* than the analysis.

Here's what I mean: If you know how to run a regression, you were probably taught that there are two "right" ways to report the results:
- A regression table
- A regression plot

Perfect for academic publications or audiences, not so great for business leaders and stakeholders. While regressions tell you the direction and strength of a relationship, they rarely make the insights tangible or compelling.

Let's say your organization faces an attrition challenge. Your regression analysis shows employee engagement is the strongest predictor of retention sentiment. You want leaders to prioritize engagement initiatives. You could show them your regression plot... but I doubt you'll change any minds.

Instead, visualize the takeaway. The process (a code sketch follows this post):
➤ Conduct analysis: Run your regression to identify root causes
➤ Extract key insight: Employee engagement drives retention
➤ Make it tangible: Transform your insight into concrete metrics (e.g., retention sentiment by engagement level)
➤ Design your visualization: Create one powerful visual showing what's at stake

The outcome: A simple chart showing disengaged employees are 5x more likely to be flight risks than engaged employees.

Is it less technical than the regression plot? Yes. Is it an oversimplification? To some extent. But it delivers the message leaders need to hear, and it's still grounded in robust statistical analysis.

I'm curious: How would you share regression findings?

——
♻️ Repost to help your network.
👋🏼 I'm Morgan. I share my favorite data viz and data storytelling tips to help other analysts (and academics) better communicate their work.
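Here is a minimal R sketch of that workflow, using simulated employee data. Every column name and number below (engagement, tenure, flight_risk, the cutoffs) is hypothetical and only stands in for the real analysis: run the regression for yourself, then chart the takeaway for the room.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in for the real employee data (all names and values hypothetical)
set.seed(42)
n <- 500
employees <- data.frame(
  engagement = runif(n, 1, 5),        # 1-5 engagement score
  tenure     = rexp(n, rate = 1 / 4)  # years at the company
)
employees$flight_risk <- rbinom(n, 1, plogis(2 - employees$engagement))
employees$engagement_level <- ifelse(employees$engagement >= 3, "Engaged", "Disengaged")

# Step 1: the analysis itself -- a logistic regression on flight risk
fit <- glm(flight_risk ~ engagement + tenure, data = employees, family = binomial)
summary(fit)  # in this simulation, engagement is the dominant predictor

# Steps 2-3: extract the takeaway as one concrete, comparable metric
takeaway <- employees |>
  group_by(engagement_level) |>
  summarise(flight_risk_rate = mean(flight_risk))

# Step 4: one simple visual showing what's at stake
ggplot(takeaway, aes(x = engagement_level, y = flight_risk_rate)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent) +  # scales ships with ggplot2
  labs(title = "Disengaged employees are far more likely to be flight risks",
       x = NULL, y = "Share flagged as flight risk")
```

The regression stays in the appendix; the bar chart carries the message.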
-
🟠 Most data scientists (and test managers) think explaining A/B test results is about throwing p-values and confidence intervals at stakeholders...

I've sat through countless meetings where the room goes silent the moment a technical slide appears. Including mine. You know the moment when "statistical significance" and "confidence intervals" flash on screen, and you can practically hear crickets 🦗

It's not that stakeholders aren't smart. We are just speaking different languages. Impactful data people use the opposite approach.

--- Start with the business question ---
❌ "Our test showed a statistically significant 2.3% lift..."
✅ "You asked if we should roll out the new recommendation model..."
This creates anticipation, and you may see the stakeholder lean forward.

--- Size the real impact ---
❌ "p-value is 0.001 with 95% confidence..."
✅ "This change would bring in ~$2.4M annually, based on current traffic..."
Numbers without context are just math. They can live in an appendix or footnotes. Numbers tied to business outcomes are insights. These should be front and center. (A rough sketch of this translation follows the post.)

--- Every complex idea has a simple analogy ---
❌ "Our sample suffers from selection bias..."
✅ "It's like judging an e-commerce feature by only looking at users who completed a purchase..."

--- Paint the full picture. Every business decision has tradeoffs ---
❌ "The test won," then end the presentation
✅ Show the complete story: what we gained, what we lost, what we're still unsure about, what to watch post-launch, etc.

--- This one is most important ---
✅ Start with the decision they need to make. Then present only the data that helps make **that** decision. Everything else is noise.

The core principle at work? Think like a business leader who happens to know data science. Not a data scientist who happens to work in business. This shift in mindset changes everything.

Are you leading experimentation at your company? Or wrestling with translating complex analyses into clear recommendations?

I've been there. For 16 long years. In the trenches. Now I'm helping fellow data practitioners unlearn the jargon and master the art of influence through data.

Because let's be honest - the hardest part of our job isn't running the analysis. It's getting others to actually use it.
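For the "size the real impact" step, here is a rough R sketch of how a lift estimate becomes an annual dollar figure. Every number (conversion counts, traffic, value per conversion) is made up for illustration; the point is the translation, not these particular values.

```r
# Hypothetical test results: conversions and visitors per arm (illustrative numbers)
conversions <- c(control = 4200, treatment = 4410)
visitors    <- c(control = 100000, treatment = 100000)

test <- prop.test(conversions, visitors)  # two-proportion test
lift <- diff(test$estimate)               # treatment rate minus control rate

# Translate the lift into business terms (assumed annual traffic and value per conversion)
annual_visitors <- 14.4e6
value_per_conv  <- 85
annual_impact   <- lift * annual_visitors * value_per_conv

# prop.test's CI is control minus treatment, so flip the sign for a lift range
impact_range <- sort(-test$conf.int) * annual_visitors * value_per_conv

cat(sprintf("Estimated annual impact: ~$%.1fM (plausible range $%.1fM to $%.1fM)\n",
            annual_impact / 1e6, impact_range[1] / 1e6, impact_range[2] / 1e6))
```

The p-value and interval still exist; they just show up as a dollar range the room can act on.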
-
The pitfalls of class prediction in omics 🧵

1/ You think you've built the perfect omics predictor. The accuracy is high. The p-value is low. But is it real—or just a story your data whispered back?

2/ High-dimensional data is a double-edged sword. Thousands of genes, hundreds of samples. With enough features, even random noise can look predictive. That's the curse of dimensionality.

3/ Statistically, you can always draw a hyperplane to separate two classes. Even if the labels are random. That's overfitting. And omics is a playground for it.

4/ So we add regularization: LASSO, Ridge, Elastic Net. They penalize complexity, reward simplicity. But it's not enough. Because the real danger is how we validate.

5/ Cross-validation (CV) is standard. But do you select your features before the CV folds? That's data leakage. And it gives inflated performance. Always. Every time.

6/ Nested CV is your friend: Inner loop: tune hyperparameters. Outer loop: estimate error. It's slower. But it's honest. (A code sketch follows this thread.)

7/ Still confident? Let's talk confounders. Batch effects. Age. Ethnicity. Study site. If they're correlated with outcome, they fake predictive power.

8/ Confounding doesn't go away with random splits. It hides in the noise. And only shows itself when your model fails in an external dataset. Validate on independent cohorts.

9/ Want to compare your model to an existing one? Do it on a neutral dataset. Using your training set to favor your model is bias by design.

10/ So you beat the baseline by 2%. Is it statistically significant? Not unless you test it across multiple datasets. Ideally 5–6. Meta-analysis helps.

11/ Unsupervised pitfalls: If you cluster samples using features chosen with the labels in mind, you'll rediscover your labels—not biology. Clustering must be unsupervised in every way.

12/ Many retractions in omics come from these mistakes: data leakage, confounding, overfitting, unvalidated results. Because storytelling is easier than science.

13/ To get it right, you need more than code. You need humility. Statistical discipline. Curated metadata. And rock-solid validation.

14/ Key takeaways: Overfitting loves high dimensions. Never pre-select features across folds. Use nested CV. Validate externally. Watch for confounders. Simplicity > complexity.

15/ Omics is powerful. But power needs control. Guard your models from yourself. The truth is out there—but only if you earn it.

I hope you've found this post helpful. Follow me for more. Subscribe to my FREE newsletter chatomics to learn bioinformatics https://lnkd.in/erw83Svn
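To make points 5 and 6 concrete, here is a minimal R sketch of nested CV with the feature filter applied inside each outer fold, using glmnet for the inner tuning loop. The data are simulated pure noise, so an honest pipeline should land near 50% accuracy; selecting the features once, before the folds, is exactly the leakage the thread warns about and would inflate that number.

```r
library(glmnet)

set.seed(1)
# Simulated "omics" data: 100 samples x 2000 features, labels unrelated to the features
n <- 100; p <- 2000
X <- matrix(rnorm(n * p), n, p)
y <- factor(rep(c("A", "B"), each = n / 2))

outer_folds <- sample(rep(1:5, length.out = n))
acc <- numeric(5)

for (k in 1:5) {
  train <- outer_folds != k

  # Feature filtering INSIDE the fold, using training samples only
  pvals <- apply(X[train, ], 2, function(g) t.test(g ~ y[train])$p.value)
  keep  <- order(pvals)[1:100]

  # Inner loop: cv.glmnet tunes lambda on the training portion only
  fit <- cv.glmnet(X[train, keep], y[train], family = "binomial", alpha = 1)

  pred   <- predict(fit, X[!train, keep], s = "lambda.min", type = "class")
  acc[k] <- mean(pred == as.character(y[!train]))
}

mean(acc)  # honest outer-loop estimate, ~0.5 here; filtering before the folds would inflate it
```

This is only a sketch of the validation logic; confounders and external cohorts still need their own checks.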
-
One major issue with Data Science is that, in the real world, if you have two teams competing to build some model and judge them based on some arbitrary metric like Precision or Accuracy or RMSE, it's very likely that the winning team will build a model that fails once it goes into production. This is entirely due to data leakage, which is quite common, even in published PhD papers, but it's really hard to know if you have a data leakage problem in your dataset until you put your model in production. There are, however, a few things you can do to mitigate this problem.

1. Be suspicious. If your model behaves well, assume it's because of data leakage first. That should be your default hypothesis.
2. Know what every single variable you throw into your model means, how it was collected, and how it was calculated.
3. Use SHAP values in every project. If one column (or a collection of columns derived from that one column) shows a very high SHAP value compared to everything else, assume it's a target leakage problem (where information about your target variable entered the system, like future sales) and investigate.
4. Build models consisting only of variables you are absolutely sure do not have data leakage first.
5. Think very carefully about your cross-validation strategy. Doing out-of-the-box cross-validation out of habit often introduces data leakage.
6. Rigorously test the model on data it's never seen before (i.e., data that was never used to train OR score the model).
7. Always do data preprocessing and featurization after you split the data, never before; i.e., don't impute means on the whole dataset first (see the sketch after this list).
8. Only use data that would be available at the time you'd want to predict your target, so don't use data like November GDP to predict something in November, because it isn't released until mid-December.
9. Avoid identical or nigh-identical rows in train and test, as your model will memorize rather than generalize.
10. Correlate your variables with the target variable at the onset of your project and investigate highly correlated variables for target leakage.

#datascience #datascientist #machinelearning #dataleakage #ai
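A small R illustration of points 6 and 7: split first, learn the imputation and scaling parameters from the training rows only, then apply them to the held-out rows. The data frame and column names here are made up for the example.

```r
set.seed(7)
# Toy data with a few missing values (all columns hypothetical)
df <- data.frame(
  income = c(rnorm(95, 50000, 12000), rep(NA, 5)),
  spend  = rnorm(100, 300, 80),
  churn  = rbinom(100, 1, 0.3)
)

# 1. Split FIRST
idx   <- sample(seq_len(nrow(df)), size = 70)
train <- df[idx, ]
test  <- df[-idx, ]

# 2. Learn preprocessing parameters from the training set only
income_mean <- mean(train$income, na.rm = TRUE)
spend_mu    <- mean(train$spend)
spend_sd    <- sd(train$spend)

# 3. Apply those training-derived parameters to BOTH sets
prep <- function(d) {
  d$income[is.na(d$income)] <- income_mean    # impute with the TRAIN mean
  d$spend <- (d$spend - spend_mu) / spend_sd  # scale with TRAIN statistics
  d
}
train <- prep(train)
test  <- prep(test)

# Imputing or scaling the full data frame before splitting would leak test-set
# information into training -- exactly the mistake point 7 warns about.
```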
-
Imagine you've performed an in-depth analysis and uncovered an incredible insight. You're now excited to share your findings with an influential group of stakeholders. You've been meticulous, eliminating biases, double-checking your logic, and ensuring your conclusions are sound.

But even with all this diligence, there's one common pitfall that could diminish the impact of your insights: information overload. In our excitement, we sometimes flood stakeholders with excessive detail: dense reports, cluttered dashboards, and long presentations filled with too much information. The result is confusion, disengagement, and inaction.

Insights are not our children; we don't have to love them equally. To truly drive action, we must isolate and emphasize the insights that matter most—those that directly address the problem statement and have the highest impact.

Here's how to present insights effectively to ensure clarity, engagement, and action:

✅ Start with the Problem – Frame your insights around the problem statement. If stakeholders don't see the relevance, they won't care about the data.
✅ Prioritize Key Insights – Not all insights are created equal. Share only the most impactful findings that directly influence decision-making.
✅ Tell a Story, Not Just Show Data – Structure your presentation as a narrative: What was the challenge? What did the data reveal? What should be done next? A well-crafted story is more memorable than a raw data dump.
✅ Use Clean, Intuitive Visuals – Data-heavy slides and cluttered dashboards overwhelm stakeholders. Use simple, insightful charts that highlight key takeaways at a glance.
✅ Make Your Recommendations Clear – Insights without action are meaningless. End with specific, actionable recommendations to guide decision-making.
✅ Encourage Dialogue, Not Just Presentation – Effective communication is a two-way street. Invite questions and discussions to ensure buy-in from stakeholders.
✅ Less is More – Sometimes, one well-presented insight can be more powerful than ten slides of analysis. Keep it concise, impactful, and decision-focused.

Before presenting, ask yourself: Am I providing clarity or creating confusion? The best insights don't just inform—they inspire action.

What strategies do you use to make your insights more actionable? Let's discuss!

P.S.: I've shared a dashboard I reviewed recently that I thought was overloaded and not designed for action.
-
We learn by processing new information against what we already know. But when something is entirely unfamiliar, with zero overlap? Our brains flag it as "abstract." This is where stories come in. Stories anchor the unfamiliar in something we already understand.

Humans have always been drawn to stories—they tap into our emotions, paint vivid pictures in our minds, and spark the imagination. And yet, in scientific communication, storytelling is considered too "unscientific"—not serious enough, not "technical" enough. We've been conditioned to believe that for science to be science, it must sound complicated. Formal. Dry. But that's a myth.

📣 Effective storytelling with data starts with purpose and with an understanding of a few key principles:

🔑 Core Principles

1️⃣ Who is your audience? The way you frame a story for a journal is very different from how you'd tell it on social media.

2️⃣ Start with your SOCO. Always define your Single Overriding Communication Objective (SOCO). You'll always have too many data points. Don't try to say everything. Pick one key message—and stick with it.

3️⃣ Begin with the familiar. Use the funnel approach: start wide, then narrow down. The audience doesn't need the technicalities upfront.

4️⃣ Distill. Always distill. Distillation means pulling out only what the audience needs to know right now. Even Jesus said to His disciples: "I have much to say to you, but you cannot bear it now." The moral? Less is more. Teach in layers.

5️⃣ Teach generalities first. Save the exceptions for later. This is where so much scientific and medical education goes wrong. We try to teach everything all at once—general rules and exceptions. But we must learn to crawl before we walk, and walk before we fly.

🧠 Take nutrition education, for example: In elementary school, you were taught that beans are a protein source and potatoes are carbs. ❓ Could they have told you beans contain 22% protein and 62% carbs? Of course. But that level of detail was unnecessary for a beginner (plus, two things can both be true: beans can be a protein source and still be predominantly carbs).

6️⃣ Keep it short, simple, and coherent. People are busy. Attention spans are short. Stay focused. Be concise. Make sure there's a clear thread from beginning to end.

7️⃣ Don't take yourself too seriously 😄 If your storytelling is too stiff, it loses its spark. Good stories meander a little—and that's okay.

8️⃣ Make the analogy and its meaning memorable. It's not enough for people to remember the story—they must remember the lesson behind it. If they recall the metaphor but miss the message, you've missed the mark.

A good scientific story should be: Simple ✅ Relatable ✅ Educational ✅ And ideally... a little fun 😄 In short: people can laugh, but they should also learn.

Because when done right, storytelling with data isn't fluff. Any damn fool can make something complicated 🤣—it takes real skill to simplify without dumbing it down.
-
*** Statisticians: Their Most Common Mistake ***

~ One of the most common mistakes statisticians make when modeling is overfitting. This occurs when a model is too complex and captures the noise in the training data as if it were a genuine pattern.

~ Here are some detailed insights into this mistake and other potential pitfalls (a short code sketch of a few of the fixes follows at the end):

~ Overfitting:
Description:
* Too Complex Models: Overfitting happens when a model is too complex, performing exceptionally well on the training data but poorly on new, unseen data.
Consequences:
* Poor Generalization: An overfitted model fails to generalize to other datasets.
* Misleading Insights: It provides misleading insights as it picks up on random noise rather than actual underlying patterns.
Solutions:
* Regularization: Techniques like Lasso and Ridge regression penalize the complexity of the model.
* Cross-Validation: Cross-validation ensures the model performs well on unseen data.

~ Other Common Mistakes:

1. Ignoring Multicollinearity:
Description:
* Correlated Predictors: When predictors are highly correlated, unstable coefficient estimates are produced.
Consequences:
* Unreliable Estimates: The coefficients become very sensitive to changes in the model, leading to unreliable estimates.
Solutions:
* Principal Component Analysis (PCA): Reduce multicollinearity by transforming the predictors into uncorrelated components.

2. Not Checking Assumptions:
Description:
* Model Assumptions: Many statistical models, like linear regression, have underlying assumptions (e.g., linearity, homoscedasticity, independence of errors).
Consequences:
* Invalid Results: If these assumptions are violated, the model's results might not be valid.
Solutions:
* Diagnostic Plots: Use residual plots, Q-Q plots, and other diagnostics to check assumptions.

3. Inadequate Data Preprocessing:
Description:
* Raw Data: Using raw data without proper preprocessing, such as handling missing values, outliers, and scaling.
Consequences:
* Poor Model Performance: Inadequate preprocessing can lead to poor model performance.
Solutions:
* Data Cleaning: Ensure thorough data cleaning and preprocessing.

4. Data Snooping (P-Hacking):
Description:
* Multiple Testing: Testing multiple hypotheses on the same data can lead to spurious findings.
Consequences:
* False Discoveries: Increased risk of Type I errors (false positives).
Solutions:
* Adjust P-Values: Use the Bonferroni correction to adjust for multiple comparisons.

5. Ignoring Outliers:
Description:
* Influential Outliers: Outliers can have a disproportionate influence on the model parameters.
Consequences:
* Biased Estimates: The model parameters become biased.
Solutions:
* Outlier Detection: Use statistical tests and visualization techniques to identify and handle outliers.

~ Conclusion: While overfitting is one of the most common mistakes, statisticians must be aware of several pitfalls when modeling data. By understanding these issues, they can build more reliable models.

--- B. Noted
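As promised, a brief R sketch of a few of the checks above: assumption diagnostics, a multicollinearity check with car::vif, and a Bonferroni adjustment. The data are simulated purely for illustration, and the car package is assumed to be installed.

```r
library(car)  # for vif(); assumed installed

set.seed(123)
# Illustrative data with two strongly correlated predictors
n  <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)  # nearly collinear with x1
y  <- 2 + 1.5 * x1 + rnorm(n)

fit <- lm(y ~ x1 + x2)

# Mistake 2 (assumptions): residual, Q-Q, scale-location, and leverage plots
par(mfrow = c(2, 2)); plot(fit); par(mfrow = c(1, 1))

# Mistake 1 (multicollinearity): variance inflation factors well above ~5-10 are a red flag
vif(fit)

# Mistake 4 (p-hacking): adjust p-values when many hypotheses are tested
pvals <- summary(fit)$coefficients[-1, "Pr(>|t|)"]
p.adjust(pvals, method = "bonferroni")
```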