After building 10+ data warehouses over 10 years, I can teach you how to keep yours clean in 5 minutes. Most companies have messy data warehouses that nobody wants to use. Here's how to fix that:

1. Understand the business first
Know how your company makes money
• Meet with business stakeholders regularly
• Map out business entities and interactions
• Document critical company KPIs and metrics
This creates your foundation for everything else.

2. Design proper data models
Use dimensional modeling with facts and dimensions
• Create dim_noun tables for business entities
• Build fct_verb tables for business interactions
• Store data at the lowest possible granularity
Good modeling makes queries simple and fast.

3. Validate input data quality
Check five data verticals before processing
• Monitor data freshness and consistency
• Validate data types and constraints
• Track size and metric variance
Never process garbage data, no matter the pressure.

4. Define a single source of truth
Create one place for metrics and data
• Define all metrics in the data mart layer
• Ensure stakeholders use SOT data only
• Track data lineage and usage patterns
This eliminates "the numbers don't match" conversations.

5. Keep stakeholders informed
Communication drives warehouse adoption and resources
• Document clear needs and pain points
• Demo benefits with before/after comparisons
• Set realistic expectations with buffer time
• Evangelize wins with leadership regularly
No buy-in means no resources for improvement.

6. Watch for organizational red flags
Some problems you can't solve with better code
• Leadership doesn't value data initiatives
• Constant reorganizations disrupt long-term projects
• Misaligned teams with competing objectives
• No dedicated data team support
Sometimes the solution is finding a better company.

7. Focus on progressive transformation
Use a bronze/silver/gold layer architecture
• Validate data before transformation begins
• Transform data step by step
• Create clean marts for consumption
This approach makes debugging and maintenance easier.

8. Make data accessible
Build one big table for stakeholders
• Join facts and dimensions appropriately
• Aggregate to the required business granularity
• Calculate metrics in one consistent place
Users prefer simple tables over complex joins (a minimal sketch follows this post).

Share this with your network if it helps you build better data warehouses. How do you handle data warehouse maintenance? Share your approach in the comments below.
-----
Follow me for more actionable content.
#DataEngineering #DataWarehouse #DataQuality #DataModeling #DataGovernance #Analytics
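To make step 8 concrete, here is a minimal pandas sketch of the one-big-table idea: join a fct_ table to a dim_ table, aggregate to the grain stakeholders actually query, and compute metrics in one place. The table names, columns, and grain are illustrative assumptions, not the author's actual schema.

```python
import pandas as pd

# Hypothetical fact table: one row per business interaction (fct_verb pattern).
fct_order = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 10, 20],
    "order_date": pd.to_datetime(["2024-01-05", "2024-01-07", "2024-01-07"]),
    "amount": [120.0, 35.5, 99.9],
})

# Hypothetical dimension table: one row per business entity (dim_noun pattern).
dim_customer = pd.DataFrame({
    "customer_id": [10, 20],
    "customer_name": ["Acme", "Globex"],
    "segment": ["SMB", "Enterprise"],
})

# One big table: join facts to dimensions, aggregate to the reporting grain,
# and calculate metrics in one consistent place.
obt_daily_sales = (
    fct_order
    .merge(dim_customer, on="customer_id", how="left")
    .groupby(["order_date", "segment"], as_index=False)
    .agg(orders=("order_id", "count"), revenue=("amount", "sum"))
)

print(obt_daily_sales)
```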
How to Ensure Data Quality and Relevance in Business
Explore top LinkedIn content from expert professionals.
Summary
Maintaining data quality and relevance in business is about ensuring the data you use is accurate, consistent, and meaningful for decision-making. This involves proactive measures to clean, validate, and manage data throughout its lifecycle.
- Create a structured foundation: Establish clear processes for data modeling, validation, and governance to build a reliable data system that aligns with business needs.
- Implement automated checks: Use tools and scripts to monitor data accuracy, consistency, and timeliness, flagging anomalies before they disrupt workflows.
- Foster cross-functional collaboration: Promote a shared responsibility for data quality by engaging stakeholders across teams to align on definitions, priorities, and improvements.
-
I’ve lost count of projects that shipped gorgeous features but relied on messy data assets. The cost always surfaces later as inevitable firefights, expensive backfills, and credibility hits to the data team. This is a major reason why I argue we need to incentivize SWEs to treat data as a first-class citizen before they merge code. Here are five ways you can help SWEs make this happen:

1. Treat data as code, not exhaust
Data is produced by code (regardless of whether you are the 1st-party producer or ingesting from a 3rd party). Many software engineers have minimal visibility into how their logs are used (even the business-critical ones), so you need to make it easy for them to understand their impact.

2. Automate validation at commit time
Data contracts enable checks during the CI/CD process when a data asset changes. A failing test should block the merge just like any unit test (see the sketch after this post). Developers receive instant feedback instead of hearing their data team complain about the hundredth data issue with minimal context.

3. Challenge the "move fast and break things" mantra
Traditional approaches often postpone quality and governance until after deployment, because shipping fast feels safer than debating data schemas at the outset. Instead, early negotiation shrinks rework, speeds onboarding, and keeps your pipeline clean when the feature's scope changes six months in. Having a data perspective when creating product requirement documents can be a huge unlock!

4. Embed quality checks into your pipeline
Track DQ metrics such as null ratios, referential breaks, and out-of-range values on trend dashboards. Observability tools are great for this, but even a set of scheduled SQL queries can provide value.

5. Don't boil the ocean; focus on protecting tier 1 data assets first
Your most critical but volatile data asset is your top candidate for these approaches. Ideally, there should be meaningful change as your product or service evolves, but that change can lead to chaos. Making a case for mitigating risk on critical components is an effective way to make SWEs want to pay attention.

If you want to fix a broken system, you start at the source of the problem and work your way forward. Not doing this is why so many data teams I talk to feel stuck.

What’s one step your team can take to move data quality closer to SWEs?

#data #swe #ai
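As a rough illustration of point 2, here is a minimal Python sketch of a contract check that could run as a CI job and fail the build on a violation. The contract fields, the sample event, and the simple type comparison are assumptions for illustration; real data contract tooling is considerably richer.

```python
import sys

# Hypothetical data contract for an "orders" event: field name -> expected type.
# In a real setup this would live in version control next to the producer code.
CONTRACT = {
    "order_id": "int",
    "customer_id": "int",
    "amount": "float",
    "created_at": "str",
}

def validate_against_contract(sample_event: dict, contract: dict) -> list:
    """Return a list of violations; an empty list means the event satisfies the contract."""
    violations = []
    for field, expected_type in contract.items():
        if field not in sample_event:
            violations.append(f"missing field: {field}")
        elif type(sample_event[field]).__name__ != expected_type:
            violations.append(
                f"{field}: expected {expected_type}, got {type(sample_event[field]).__name__}"
            )
    for field in sample_event:
        if field not in contract:
            violations.append(f"unexpected field not in contract: {field}")
    return violations

if __name__ == "__main__":
    # A sample event the changed code would produce (normally generated by the test itself).
    event = {"order_id": 1, "customer_id": 10, "amount": 42.0, "created_at": "2024-01-05"}
    problems = validate_against_contract(event, CONTRACT)
    if problems:
        print("Data contract violations:\n" + "\n".join(problems))
        sys.exit(1)  # a non-zero exit fails the CI job and blocks the merge
    print("Contract check passed.")
```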
-
Many companies talk about implementing data contracts and shifting left, but Zakariah S. and the team at Glassdoor have actually done it. In an article published earlier today, the Glassdoor Data Platform team goes in-depth about how they have started driving data quality from the source through data contracts, proactive monitoring/observability, and Data DevOps.

Here's a great quote from the article on the value of Shifting Left:

"This approach offers many benefits, but the top four we’ve observed are:
Data Quality by Design: Incorporating data quality checks early in the lifecycle helps prevent bad data from entering production systems.
Fewer Downstream Breakages: By resolving potential issues closer to the source, the entire data pipeline becomes more resilient and less susceptible to cascading failures.
Stronger Collaboration: Equipping product engineers with tools, frameworks, and guidelines to generate high-quality data nurtures a closer partnership between data producers and consumers.
Cost & Time Efficiency: Preventing bad data is significantly cheaper than diagnosing and fixing it after propagating across multiple systems.
These were the foundational principles upon which our motivation for shifting left was achieved."

Glassdoor achieved this through six primary technology investments:

Data Contracts (Gable.ai): Define clear specifications for fields, types, and constraints, ensuring product engineers are accountable for data quality from the start.
Static Code Analysis (Gable.ai): Integrated with GitLab/GitHub and Bitrise to catch and block problematic data changes before they escalate downstream.
LLMs for Anomaly Detection (Gable.ai): Identify subtle issues (e.g., swapped field names) that may not violate contracts but could lead to data inconsistencies.
Schema Registry (Confluent): Screens incoming events, enforcing schema validation and directing invalid data to dead-letter queues to keep pipelines clean.
Real-time Monitoring (DataDog): Provides continuous feedback loops to detect and resolve issues in real time.
Write-Audit-Publish (WAP) / Blue-Green Deployment: Ensures each data batch passes through a staging area before being promoted to production, isolating risks before they impact downstream consumers (a toy sketch of this pattern follows below).

"By addressing the psychological dimension of trust through shared responsibility, transparent validation, and confidence-building checks, we’re scaling to petabytes without compromising our data’s essential sense of faith. Ultimately, this combination of technical rigor and cultural awareness empowers us to build resilient, trustworthy data systems — one contract, one check, and one validation at a time."

It's a fascinating article and an insight into incredibly sophisticated thinking around data quality and governance. You can check out the link below:
https://lnkd.in/d-ADip42

Good luck!
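The Write-Audit-Publish bullet is the easiest of these to sketch in a few lines. Below is a toy, hedged pandas version of the pattern, not Glassdoor's implementation: the "environments" are plain dictionaries and the audit rules (unique review_id, 1-5 ratings, non-null company) are invented for illustration.

```python
import pandas as pd

def audit(batch: pd.DataFrame) -> list:
    """Hard checks that must pass before a batch is promoted."""
    failures = []
    if batch["review_id"].duplicated().any():
        failures.append("duplicate review_id values")
    if not batch["rating"].between(1, 5).all():
        failures.append("rating outside the 1-5 range")
    if batch["company"].isna().any():
        failures.append("null company values")
    return failures

def write_audit_publish(batch: pd.DataFrame, staging: dict, production: dict, table: str) -> bool:
    """Write the batch to a staging area, audit it there, and only then publish."""
    staging[table] = batch                  # 1. write to staging
    failures = audit(staging[table])        # 2. audit in isolation
    if failures:
        print(f"Batch rejected, not published: {failures}")
        return False
    production[table] = staging.pop(table)  # 3. promote to production
    return True

# Toy "environments" standing in for staging and production schemas.
staging_env, prod_env = {}, {}
batch = pd.DataFrame({
    "review_id": [1, 2, 3],
    "company": ["Acme", "Acme", "Globex"],
    "rating": [4, 5, 3],
})
print(write_audit_publish(batch, staging_env, prod_env, "reviews"))
```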
-
This week, I want to talk about something that might not be the most exciting or sexy topic—it might even seem plain boring to some of you. It is very impactful, yet even in many large and complex organizations with tons of data challenges this foundational data process simply doesn’t exist: the Data Issue Management Process.

Why is this so critical? Because #data issues, such as data quality problems, pipeline breakdowns, or process inefficiencies, can have real business consequences. They cause manual rework, compliance risks, and failed analytical initiatives. Without a structured way to identify, analyze, and resolve these issues, organizations waste time duplicating efforts, firefighting, and dealing with costly disruptions.

The image I’ve attached outlines my take on a standard end-to-end data issue management process, broken down below:

📝 Logging the Issue – Make it simple and accessible for anyone in the organization to log an issue. If the process is too complicated, people will bypass it, leaving problems unresolved.
⚖️ Assessing the Impact – Understand the severity and business implications of the issue. This helps prioritize what truly matters and builds a case for fixing the problem.
👤 Assigning Ownership – Ensure clear accountability. Ownership doesn’t mean fixing the issue alone—it means driving it toward resolution with the right support and resources.
🕵️♂️ Analyzing the Root Cause – Trace the problem back to its origin. Most issues aren’t caused by systems, but by process gaps, manual errors, or missing controls.
🛠️ Resolving the Issue – Fix the data AND the root cause. This could mean improving data quality controls, updating business processes, or implementing technical fixes.
👀 Tracking and Monitoring – Keep an eye on open issues to ensure they don’t get stuck in limbo. Transparency is key to driving resolution.
🏁 Closing the Issue and Documenting the Resolution – Ensure the fix is verified, documented, and lessons are captured to prevent recurrence.

Data issue management might not be flashy, but it can be very impactful. Giving business teams a place to flag issues and actually be heard transforms endless complaints (because yes, they do love to complain about “the data”) into real solutions. And when organizations step back to identify and fix thematic patterns instead of just one-off issues, the impact can go from incremental to game-changing.

For the full article ➡️ https://lnkd.in/eWBaWjbX

#DataGovernance #DataManagement #DataQuality #BusinessEfficiency
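For teams starting from nothing, even a lightweight, structured record of each issue beats a thread of complaints. Below is a minimal Python sketch of the lifecycle described above; the statuses, fields, and example issue are assumptions for illustration, not a prescribed standard.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum

class IssueStatus(Enum):
    LOGGED = "logged"
    IMPACT_ASSESSED = "impact assessed"
    ASSIGNED = "assigned"
    ROOT_CAUSE_FOUND = "root cause found"
    RESOLVED = "resolved"
    CLOSED = "closed"

@dataclass
class DataIssue:
    title: str
    reported_by: str
    logged_on: date
    severity: str = "unknown"       # set while assessing impact
    owner: str = "unassigned"       # set when ownership is assigned
    root_cause: str = ""            # filled in after root-cause analysis
    resolution: str = ""            # documented before closing
    status: IssueStatus = IssueStatus.LOGGED
    history: list = field(default_factory=list)

    def advance(self, new_status: IssueStatus, note: str) -> None:
        """Move the issue forward and keep an auditable trail of every step."""
        self.history.append(f"{self.status.value} -> {new_status.value}: {note}")
        self.status = new_status

# Walk one hypothetical issue through the first stages of the process.
issue = DataIssue("Duplicate customer IDs in CRM export", "finance team", date.today())
issue.severity = "high"
issue.advance(IssueStatus.IMPACT_ASSESSED, "blocks the monthly revenue report")
issue.owner = "CRM data owner"
issue.advance(IssueStatus.ASSIGNED, "owner drives resolution with IT support")
print(issue.status, issue.history)
```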
-
It’s no revelation that incentives and KPIs drive good behavior. Sales compensation plans are scrutinized so closely that they often rise to the topic of board meetings. What if we gave the same attention to data quality scorecards?

In the wake of Citigroup’s landmark data quality fine, it’s easy to imagine how a concern for data health benchmarks could have prevented the sting of regulatory intervention. But that’s then and this is now. The only question now is: how do you avoid the same fate?

Even in their heyday, traditional data quality scorecards from the Hadoop era were rarely wildly successful. I know this because prior to starting Monte Carlo, I spent years as an operations VP trying to create data quality standards that drove trust and adoption. Whether it’s a lack of funding, a lack of stakeholder buy-in, or weak cultural adoption, most data quality initiatives fail before they even get off the ground.

As I said last week, a successful data quality program is a mix of three things: cross-functional buy-in, process, and action. And if any one of those elements is missing, you might find yourself next in line for regulatory review.

Here are 4 key lessons for building data quality scorecards that I’ve seen to be the difference between critical data quality success—and your latest initiative pronounced dead on arrival:

1. Know what data matters—the only way to determine what matters is to talk to the business. So get close to the business early and often to understand what matters to your stakeholders first.

2. Measure the machine—this means measuring components in the production and delivery of data that generally result in high quality. This often includes the 6 dimensions of data quality (validity, completeness, consistency, timeliness, uniqueness, accuracy), as well as things like usability, documentation, lineage, usage, system reliability, schema, and average time to fix (a small sketch follows this post).

3. Gather your carrots and sticks—the best approach I’ve seen here is to have a minimum set of requirements for data to be on-boarded onto the platform (stick) and a much more stringent set of requirements to be certified at each level (carrot).

4. Automate evaluation and discovery—almost nothing in data management is successful without some degree of automation and the ability to self-serve. The most common ways I’ve seen this done are with data observability and quality solutions, and data catalogs.

Check out my full breakdown via the link in the comments for more detail and real-world examples.
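As a small illustration of lesson 2, here is a hedged pandas sketch that scores one table on three of the classic dimensions (completeness, uniqueness, timeliness). The column names, thresholds, and equal weighting are assumptions; a production scorecard would cover more dimensions and tie scores to certification levels.

```python
import pandas as pd

def score_table(df: pd.DataFrame, key: str, timestamp_col: str, max_age_days: int = 1) -> dict:
    """Compute a few illustrative scorecard dimensions for one table, on a 0-1 scale."""
    completeness = 1 - df.isna().mean().mean()                    # share of non-null cells
    uniqueness = df[key].nunique() / len(df) if len(df) else 0.0  # duplicate keys drag this down
    newest = pd.to_datetime(df[timestamp_col], utc=True).max()
    age_days = (pd.Timestamp.now(tz="UTC") - newest).days
    timeliness = 1.0 if age_days <= max_age_days else 0.0         # freshness SLA met or not
    return {
        "completeness": round(completeness, 3),
        "uniqueness": round(uniqueness, 3),
        "timeliness": timeliness,
        "overall": round((completeness + uniqueness + timeliness) / 3, 3),
    }

# Toy table with one duplicate key and one missing amount.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "amount": [10.0, None, 15.0, 20.0],
    "loaded_at": pd.to_datetime(["2024-01-05"] * 4, utc=True),
})
print(score_table(orders, key="order_id", timestamp_col="loaded_at"))
```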
-
Data quality is one of the most essential investments you can make when developing your data infrastructure. If your data is "real-time" but it's wrong, guess what, you're gonna have a bad time.

So how do you implement data quality into your pipelines? On a basic level you'll likely want to integrate some form of checks, which could be anything from:

- Anomaly and range checks - These checks ensure that the data received fits an expected range or distribution. So let's say you only ever expect transactions of $5-$100 and you get a $999 transaction. That should set off alarms. In fact I have several cases where the business added new products or someone made a large business purchase that exceeded expectations and was flagged because of these checks.

- Data type checks - As the name suggests, this ensures that a date field is a date. This is important because if you're pulling files from a 3rd party they might send you headerless files, and you have to trust they will keep sending you the same data in the same order.

- Row count checks - A lot of businesses have a pretty steady rate of rows when it comes to fact tables. The number of transactions follows some sort of pattern; many are lower on the weekends and perhaps steadily growing over time. Row checks help ensure you don't see 2x the amount of rows because of a bad process or join.

- Freshness checks - If you've worked in data long enough you've likely had an executive bring up that your data was wrong. And it's less that the data was wrong, and more that the data was late (which is kind of wrong). Thus freshness checks make sure you know the data is late first so you can fix it or at least update those that need to know.

- Category checks - The first category check I implemented was to ensure that every state abbreviation was valid. I assumed this would be true because they must use a drop-down, right? Well, there were bad state abbreviations entered nonetheless.

As well as a few others.

The next question is how you would implement these checks, and the solutions range from setting up automated tasks that run during or after a table lands, to dashboards, to far more developed tools that provide observability into much more than just a few data checks.

If you're looking to dig deeper into the topic of data quality and how to implement it, I have both a video and an article on the topic.

1. Video - How And Why Data Engineers Need To Care About Data Quality Now - And How To Implement It
https://lnkd.in/gjMThSxY

2. Article - How And Why We Need To Implement Data Quality Now!
https://lnkd.in/grWmDmkJ

#dataengineering #datanalytics
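For anyone who wants a starting point, here is a hedged pandas sketch of the five checks above run against a toy transactions table. The column names, the $5-$100 band, the 50% row-count tolerance, the one-day freshness window, and the truncated state list are all assumptions to adjust for your own data.

```python
import pandas as pd

def run_checks(df: pd.DataFrame, expected_rows: int) -> list:
    """Return a list of human-readable failures for a toy transactions table."""
    failures = []

    # Range / anomaly check: amounts outside the expected $5-$100 band.
    out_of_range = df[~df["amount"].between(5, 100)]
    if not out_of_range.empty:
        failures.append(f"{len(out_of_range)} transactions outside $5-$100")

    # Data type check: the date column must actually parse as dates.
    parsed_dates = pd.to_datetime(df["transaction_date"], errors="coerce")
    if parsed_dates.isna().any():
        failures.append("unparseable values in transaction_date")

    # Row count check: flag if volume drifts more than 50% from expectation.
    if abs(len(df) - expected_rows) > 0.5 * expected_rows:
        failures.append(f"row count {len(df)} far from expected {expected_rows}")

    # Freshness check: the newest record should be no more than a day old.
    newest = parsed_dates.max()
    if (pd.Timestamp.now() - newest).days > 1:
        failures.append(f"data is stale, newest record is {newest}")

    # Category check: state abbreviations must come from an allowed set.
    valid_states = {"CA", "NY", "TX", "WA"}  # truncated for the example
    bad_states = set(df["state"]) - valid_states
    if bad_states:
        failures.append(f"invalid state codes: {sorted(bad_states)}")

    return failures

txns = pd.DataFrame({
    "transaction_date": ["2024-01-05", "2024-01-06", "2024-01-06"],
    "amount": [20.0, 999.0, 55.0],
    "state": ["CA", "ZZ", "NY"],
})
print(run_checks(txns, expected_rows=3))
```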
-
Like it or not, working in data you’re going to spend a lot of time trying to make your data into something that is usable in the first place.

The Reality Is Data Is Messy
Your starting data is rarely ever in a state where it is ready to be used for training models or even for making visualizations. It’s usually disorganized, inconsistent, full of missing values, and may have strange characters all over the place.

Tips to streamline the data cleaning process:

𝟭. 𝗔𝘂𝘁𝗼𝗺𝗮𝘁𝗲 𝗥𝗲𝗽𝗲𝘁𝗶𝘁𝗶𝘃𝗲 𝗧𝗮𝘀𝗸𝘀
This may be obvious, but don’t try to clean your data line by line; write functions in scripts that can dynamically and automatically perform the same action on all your data. Common tasks like this would be writing a script using Pandas to remove duplicates, handle missing data the same way across a column, and standardize data types and formats (see the sketch after this post).

𝟮. 𝗖𝗿𝗲𝗮𝘁𝗲 𝗮 𝗗𝗮𝘁𝗮 𝗤𝘂𝗮𝗹𝗶𝘁𝘆 𝗖𝗵𝗲𝗰𝗸𝗹𝗶𝘀𝘁
Have a general list, either written out or in your head, of common data issues that you always check for before beginning any analysis. A few common ones are checking for missing data, outliers, non-standard data types, etc.

𝟯. 𝗩𝗮𝗹𝗶𝗱𝗮𝘁𝗲 𝘆𝗼𝘂𝗿 𝗗𝗮𝘁𝗮
Think of this as testing your data. With a tool like dbt, you can set up tests to ensure that as new data is added to a pipeline, it adheres to the quality standards you set in the previous step. Tests can flag when new data is coming into your pipeline in unexpected ways and can help you build out edge-case handling as new data comes in.

𝟰. 𝗗𝗼𝗰𝘂𝗺𝗲𝗻𝘁 𝘆𝗼𝘂𝗿 𝗱𝗮𝘁𝗮 𝗰𝗹𝗲𝗮𝗻𝗶𝗻𝗴 𝗽𝗿𝗼𝗰𝗲𝘀𝘀
Either in code comments or a document, keep a record of the steps you took to clean your data before working with it in analyses. This shows others what was already done (so they can point out if it needs anything else), ensures the process is reproducible for new but similar datasets, and helps in bug-finding situations.

𝟱. 𝗣𝗿𝗶𝗼𝗿𝗶𝘁𝗶𝘇𝗲
Not all data is equally important. Know which data to spend more time and depth on and where you can cut corners. For example, when creating data cleaning scripts in dbt for a company's sales data, you should spend a lot of time and effort to ensure everything is correct and that edge cases are handled appropriately. Often, though, you don’t have the same amount of time to dedicate to every analysis. This is why it's important to build robust and accurate base data tables for data that is used often.

What are your strategies for data cleaning?
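As a minimal example of tip 1, the sketch below wraps a few common pandas cleaning steps (deduplication, type coercion, format standardization, a missing-value policy) into one reusable function. The column names and the fill-with-zero policy are assumptions; the point is that every extract goes through the same code path instead of being fixed by hand.

```python
import pandas as pd

def clean_sales_data(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the same cleaning steps to every extract instead of fixing rows by hand."""
    out = df.copy()
    out = out.drop_duplicates()                                              # remove exact duplicates
    out["order_date"] = pd.to_datetime(out["order_date"], errors="coerce")   # standardize types
    out["region"] = out["region"].str.strip().str.upper()                    # standardize formats
    out["amount"] = out["amount"].fillna(0.0)                                # consistent missing-value policy
    return out

raw = pd.DataFrame({
    "order_date": ["2024-01-05", "2024-01-05", "not a date"],
    "region": [" west ", " west ", "East"],
    "amount": [100.0, 100.0, None],
})
print(clean_sales_data(raw))
```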
-
My AI was ‘perfect’—until bad data turned it into my worst nightmare.

📉 By the numbers:
85% of AI projects fail due to poor data quality (Gartner).
Data scientists spend 80% of their time fixing bad data instead of building models.

📊 What’s driving the disconnect?
Incomplete or outdated datasets
Duplicate or inconsistent records
Noise from irrelevant or poorly labeled data

The result? Faulty predictions, bad decisions, and a loss of trust in AI. Without addressing the root cause—data quality—your AI ambitions will never reach their full potential.

Building Data Muscle: AI-Ready Data Done Right
Preparing data for AI isn’t just about cleaning up a few errors—it’s about creating a robust, scalable pipeline. Here’s how:
1️⃣ Audit Your Data: Identify gaps, inconsistencies, and irrelevance in your datasets.
2️⃣ Automate Data Cleaning: Use advanced tools to deduplicate, normalize, and enrich your data.
3️⃣ Prioritize Relevance: Not all data is useful. Focus on high-quality, contextually relevant data.
4️⃣ Monitor Continuously: Build systems to detect and fix bad data after deployment (see the sketch after this post).
These steps lay the foundation for successful, reliable AI systems.

Why It Matters
Bad #data doesn’t just hinder #AI—it amplifies its flaws. Even the most sophisticated models can’t overcome the challenges of poor-quality data. To unlock AI’s potential, you need to invest in a data-first approach.

💡 What’s Next?
It’s time to ask yourself: Is your data AI-ready? The key to avoiding AI failure lies in your preparation (#innovation #machinelearning). What strategies are you using to ensure your data is up to the task? Let’s learn from each other.

♻️ Let’s shape the future together:
👍 React
💭 Comment
🔗 Share
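Step 4 is the one most teams skip, so here is a tiny hedged sketch of a post-deployment monitor that alerts when the null ratio of a labeled column drifts past a threshold. The column name and the 5% threshold are assumptions; in practice this would run on a schedule and feed an alerting channel rather than print to stdout.

```python
import pandas as pd

def monitor_null_ratio(df: pd.DataFrame, column: str, threshold: float = 0.05) -> bool:
    """Return True (and raise an alert) when the share of missing values exceeds the threshold."""
    ratio = df[column].isna().mean()
    if ratio > threshold:
        print(f"ALERT: {column} is {ratio:.1%} null (threshold {threshold:.0%})")
        return True
    return False

# A batch arriving after deployment; half the labels are suddenly missing.
batch = pd.DataFrame({"label": ["cat", None, "dog", None, None, "cat"]})
monitor_null_ratio(batch, "label")
```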
-
Ensuring data quality at scale is crucial for developing trustworthy products and making informed decisions. In this tech blog, the Glassdoor engineering team shares how they tackled this challenge by shifting from a reactive to a proactive data quality strategy.

At the core of their approach is a mindset shift: instead of waiting for issues to surface downstream, they built systems to catch them earlier in the lifecycle. This includes introducing data contracts to align producers and consumers, integrating static code analysis into continuous integration and delivery (CI/CD) workflows, and even fine-tuning large language models to flag business logic issues that schema checks might miss.

The blog also highlights how Glassdoor distinguishes between hard and soft checks, deciding which anomalies should block pipelines and which should raise visibility. They adapted the concept of blue-green deployments to their data pipelines by staging data in a controlled environment before promoting it to production. To round it out, their anomaly detection platform uses robust statistical models to identify outliers in both business metrics and infrastructure health.

Glassdoor’s approach is a strong example of what it means to treat data as a product: building reliable, scalable systems and making quality a shared responsibility across the organization.

#DataScience #MachineLearning #Analytics #DataEngineering #DataQuality #BigData #MLOps #SnacksWeeklyonDataScience
– – –
Check out the "Snacks Weekly on Data Science" podcast and subscribe, where I explain in more detail the concepts discussed in this and future posts:
-- Spotify: https://lnkd.in/gKgaMvbh
-- Apple Podcast: https://lnkd.in/gj6aPBBY
-- Youtube: https://lnkd.in/gcwPeBmR
https://lnkd.in/gUwKZJwN
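The robust-statistics angle mentioned at the end is easy to demo. Below is a hedged sketch of outlier flagging with a median/MAD modified z-score, a common robust technique rather than Glassdoor's actual platform; the daily-volume series and the 3.5 threshold are illustrative assumptions.

```python
import pandas as pd

def mad_outliers(series: pd.Series, threshold: float = 3.5) -> pd.Series:
    """Flag points whose modified z-score (based on median and MAD) exceeds the threshold."""
    median = series.median()
    mad = (series - median).abs().median()
    if mad == 0:
        # Constant series: nothing can be flagged robustly.
        return pd.Series(False, index=series.index)
    modified_z = 0.6745 * (series - median) / mad
    return modified_z.abs() > threshold

# Toy daily metric (e.g., events per day) with one suspicious spike.
daily_volume = pd.Series([1050, 980, 1010, 995, 1025, 4300, 1005])
print(daily_volume[mad_outliers(daily_volume)])
```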