How to Ensure Data Accuracy in Organizations
Explore top LinkedIn content from expert professionals.
Summary
Ensuring data accuracy in organizations means maintaining reliable and error-free information to support decision-making, processes, and compliance. By prioritizing clean and actionable data, businesses can avoid costly mistakes and build trust in their systems.
- Define and prioritize needs: Identify critical data points that directly impact business goals by collaborating with teams to understand what matters most to their operations.
- Implement regular checks: Schedule ongoing validation processes such as anomaly detection, schema verification, and distribution monitoring to catch inaccuracies early.
- Automate and audit: Use automated tools to monitor data quality and maintain logs to track issues, ensuring continuous improvement and transparency.
-
It’s no revelation that incentives and KPIs drive behavior. Sales compensation plans are scrutinized so closely that they often become a board meeting topic. What if we gave the same attention to data quality scorecards?

In the wake of Citigroup’s landmark data quality fine, it’s easy to imagine how attention to data health benchmarks could have prevented the sting of regulatory intervention. But that was then and this is now. The only question left is: how do you avoid the same fate?

Even in their heyday, traditional data quality scorecards from the Hadoop era were rarely wildly successful. I know this because prior to starting Monte Carlo, I spent years as an operations VP trying to create data quality standards that drove trust and adoption. Whether it’s a lack of funding, a lack of stakeholder buy-in, or a lack of cultural adoption, most data quality initiatives fail before they even get off the ground.

As I said last week, a successful data quality program is a mix of three things: cross-functional buy-in, process, and action. If any one of those elements is missing, you might find yourself next in line for regulatory review.

Here are 4 key lessons for building data quality scorecards that I’ve seen make the difference between critical data quality success and your latest initiative being pronounced dead on arrival:

1. Know what data matters. The only way to determine what matters is to talk to the business. So get close to the business early and often to understand what matters to your stakeholders first.

2. Measure the machine. This means measuring components in the production and delivery of data that generally result in high quality. It often includes the 6 dimensions of data quality (validity, completeness, consistency, timeliness, uniqueness, accuracy), as well as things like usability, documentation, lineage, usage, system reliability, schema, and average time to fix.

3. Gather your carrots and sticks. The best approach I’ve seen is a minimum set of requirements for data to be onboarded onto the platform (stick) and a much more stringent set of requirements for data to be certified at each level (carrot); see the sketch after this post for how such tiering could work.

4. Automate evaluation and discovery. Almost nothing in data management is successful without some degree of automation and the ability to self-serve. The most common ways I’ve seen this done are with data observability and quality solutions, and data catalogs.

Check out my full breakdown via link in the comments for more detail and real world examples.
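To make the carrot-and-stick tiering concrete, here is a minimal sketch of how a scorecard could turn per-dimension scores into an onboarding decision and a certification tier. The six dimensions come from the post above; the thresholds, tier names, and the `scorecard` function itself are illustrative assumptions, not Monte Carlo's actual product logic.

```python
# Minimal data quality scorecard sketch. The floors and tier cutoffs
# below are illustrative assumptions -- tune them with your stakeholders.

DIMENSIONS = ["validity", "completeness", "consistency",
              "timeliness", "uniqueness", "accuracy"]

# Stick: minimum bar for a dataset to be onboarded onto the platform.
ONBOARDING_FLOOR = 0.70
# Carrot: stricter bars for each certification tier.
TIERS = [("gold", 0.95), ("silver", 0.85), ("bronze", 0.75)]

def scorecard(scores: dict[str, float]) -> dict:
    """Given per-dimension scores in [0, 1], return the overall score,
    whether the dataset may be onboarded, and its certification tier."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    weakest = min(scores[d] for d in DIMENSIONS)
    overall = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    # Certify on the weakest dimension so one bad dimension
    # can't hide behind an otherwise strong average.
    tier = next((name for name, floor in TIERS if weakest >= floor), None)
    return {
        "overall": round(overall, 3),
        "onboardable": weakest >= ONBOARDING_FLOOR,
        "tier": tier,
    }

print(scorecard({
    "validity": 0.99, "completeness": 0.97, "consistency": 0.96,
    "timeliness": 0.95, "uniqueness": 1.0, "accuracy": 0.98,
}))  # -> {'overall': 0.975, 'onboardable': True, 'tier': 'gold'}
```

Gating the tier on the weakest dimension rather than the average is one design choice among several; a weighted average with per-dimension floors works too.
-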
It took me 10 years to learn about the different types of data quality checks; I'll teach it to you in 5 minutes:

1. Check table constraints. The goal is to ensure your table's structure is what you expect:
* Uniqueness
* Not null
* Enum check
* Referential integrity
Ensuring the table's constraints is an excellent way to cover your data quality base.

2. Check business criteria. Work with the subject matter expert to understand what data users check for:
* Min/max permitted value
* Order-of-events check
* Data format check, e.g., check for the presence of the '$' symbol
Business criteria catch data quality issues specific to your data/business.

3. Table schema checks. Schema checks ensure that no inadvertent schema changes happened:
* Using an incorrect transformation function, leading to a different data type
* Upstream schema changes

4. Anomaly detection. Metrics change over time; ensure it's not due to a bug:
* Check the percentage change of metrics over time
* Use simple percentage change across runs
* Use standard deviation checks to ensure values are within the "normal" range
Detecting value deviations over time is critical for business metrics (revenue, etc.).

5. Data distribution checks. Ensure your data size remains similar over time:
* Ensure row counts remain similar across days
* Ensure critical segments of data remain similar in size over time
Distribution checks ensure you haven't dropped data due to faulty joins/filters.

6. Reconciliation checks. Check that your output has the same number of entities as your input:
* Check that your output didn't lose data due to buggy code

7. Audit logs. Log the number of rows input and output for each "transformation step" in your pipeline:
* Having a log of the number of rows going in and coming out is crucial for debugging
* Audit logs can also help you answer business questions
Debugging data questions? Look at the audit log to see where data duplication/dropping happens.

DQ warning levels: Make sure your data quality checks are tagged with appropriate warning levels (e.g., INFO, DEBUG, WARN, ERROR). Based on the criticality of the check, you can block the pipeline.

Get started with the business and constraint checks, adding more only as needed. Before you know it, your data quality will skyrocket! Good Luck!

Like this thread? Read about the types of data quality checks in detail here 👇 https://lnkd.in/eBdmNbKE

Please let me know what you think in the comments below. Also, follow me for more actionable data content. #data #dataengineering #dataquality
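Here is a minimal sketch of checks 1 through 4 as a single pipeline gate, using pandas. The file name, column names, dtypes, and thresholds are hypothetical stand-ins for your own tables.

```python
import pandas as pd

# Hypothetical orders table; assumes order_date is already a datetime column.
df = pd.read_parquet("orders.parquet")
failures = []

# 1. Constraint checks: uniqueness and not-null.
if df["order_id"].duplicated().any():
    failures.append("order_id is not unique")
if df["customer_id"].isna().any():
    failures.append("customer_id contains nulls")

# 2. Business criteria: min/max permitted value.
if not df["amount"].between(0, 10_000).all():
    failures.append("amount outside permitted range")

# 3. Schema check: dtypes match expectations.
expected = {"order_id": "int64", "amount": "float64"}
for col, dtype in expected.items():
    if str(df[col].dtype) != dtype:
        failures.append(f"{col} is {df[col].dtype}, expected {dtype}")

# 4. Anomaly detection: latest daily revenue vs. a 3-sigma band
# around the trailing history (a WARN-level check in practice).
daily = df.groupby(df["order_date"].dt.date)["amount"].sum()
history, latest = daily.iloc[:-1], daily.iloc[-1]
if abs(latest - history.mean()) > 3 * history.std():
    failures.append("daily revenue outside 3-sigma band")

# ERROR-level behavior: block the pipeline on any failure.
if failures:
    raise AssertionError("; ".join(failures))
```

In a real pipeline you would tag each check with its warning level and only raise on ERROR-level failures, as the post suggests.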
-
Data quality is one of the most essential investments you can make when developing your data infrastructure. If your data is "real-time" but it's wrong, guess what, you're gonna have a bad time.

So how do you implement data quality into your pipelines? On a basic level you'll likely want to integrate some form of checks, which could be anything from:

- Anomaly and range checks - These checks ensure that the data received fits an expected range or distribution. Say you only ever expect transactions of $5-$100 and you get a $999 transaction; that should set off alarms. In fact, I have several cases where the business added new products, or someone made a large business purchase that exceeded expectations, that were flagged because of these checks.

- Data type checks - As the name suggests, this ensures that a date field is a date. This is important because if you're pulling files from a 3rd party, they might send you headerless files, and you have to trust they will keep sending the same data in the same order.

- Row count checks - A lot of businesses have a pretty steady rate of rows when it comes to fact tables. The number of transactions follows some sort of pattern: many are lower on the weekends and perhaps steadily growing over time. Row checks help ensure you don't see 2x the number of rows because of a bad process or join.

- Freshness checks - If you've worked in data long enough, you've likely had an executive bring up that your data was wrong. And it's less that the data was wrong and more that the data was late (which is kind of wrong). Freshness checks make sure you know the data is late first, so you can fix it or at least update those who need to know.

- Category checks - The first category check I implemented was to ensure that every state abbreviation was valid. I assumed this would be true because they must use a drop-down, right? Well, there were bad state abbreviations entered nonetheless.

As well as a few others. The next question is how you implement these checks, and the solutions range from automated tasks that run during or after a table lands, to dashboards, to far more developed tools that provide observability into much more than a few data checks; a starting-point sketch follows below.

If you're looking to dig deeper into the topic of data quality and how to implement it, I have both a video and an article on the topic.

1. Video - How And Why Data Engineers Need To Care About Data Quality Now - And How To Implement It https://lnkd.in/gjMThSxY
2. Article - How And Why We Need To Implement Data Quality Now! https://lnkd.in/grWmDmkJ

#dataengineering #datanalytics
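Here is a minimal sketch of freshness, category, and row count checks run as a scheduled task after a table lands. The sqlite3 connection is a stand-in for your warehouse driver; the table, columns, SLA, and tolerances are hypothetical, and it assumes loaded_at is stored as UTC ISO-8601 strings.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect("warehouse.db")  # stand-in for your warehouse driver

# Freshness check: be the first to know the data is late.
FRESHNESS_SLA = timedelta(hours=6)
(last_loaded,) = conn.execute(
    "SELECT MAX(loaded_at) FROM transactions").fetchone()
last_loaded = datetime.fromisoformat(last_loaded).replace(tzinfo=timezone.utc)
age = datetime.now(timezone.utc) - last_loaded
assert age <= FRESHNESS_SLA, f"transactions is stale: last load {age} ago"

# Category check: every state abbreviation must be valid.
VALID_STATES = {"AL", "AK", "AZ", "AR", "CA", "CO", "CT"}  # truncated for the sketch
seen = {s for (s,) in conn.execute("SELECT DISTINCT state FROM transactions")}
bad = seen - VALID_STATES
assert not bad, f"invalid state abbreviations: {sorted(bad)}"

# Row count check: today's volume shouldn't be 2x yesterday's.
today, yesterday = conn.execute("""
    SELECT SUM(CASE WHEN DATE(loaded_at) = DATE('now') THEN 1 ELSE 0 END),
           SUM(CASE WHEN DATE(loaded_at) = DATE('now', '-1 day') THEN 1 ELSE 0 END)
    FROM transactions
""").fetchone()
assert yesterday and today <= 2 * yesterday, "row count anomaly vs. yesterday"
```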
-
HR wants clean data. Finance wants clean data. Accounting wants clean data.

But here’s the truth: Clean data drives great profits. It’s not enough to just have data anymore. You need reliable, actionable data. Why? Because dirty data leads to dirty decisions. Yes.

Here’s how to fix it with the CDE Method:

1. Collect - What data do you have? Audit your data landscape and find the gaps.
HR’s role: Review your HRIS, find critical gaps, and prioritize fixes.
Finance’s role: Assess data sources for budget allocations, forecast accuracy, and spend tracking.

2. Diagnose - Where are the problems? Analyze the impact of messy data on your business. Incomplete compensation and position data affects pay equity analysis and compliance.
HR’s role: Clean the data that matters most to business and regulatory goals.
Finance’s role: Clean revenue and expense data impacting forecasting and decision-making accuracy.

3. Execute - How do you fix it? Take action with a clear plan. Run monthly data validation checks and automate updates.
HR’s role: Build governance processes for accurate employee records and ownership clarity.
Finance’s role: Automate financial reconciliations and integrate tools for real-time expense tracking.

Try this approach:
a) Map your data landscape (Collect)
b) Focus on what matters (Diagnose)
c) Build lasting solutions (Execute)

Remember: Perfect data is a myth, but clean data is non-negotiable. It’s time for HR and finance to stop treating data as an admin task and start seeing it as a strategic advantage.

P.S. I'm Warren Wang, the CEO and founder of Doublefin. I spent 12 years at Google in finance leadership roles, including in Corp FP&A driving company-wide financial planning, headcount planning, and later as a finance director.
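For the Execute step, here is a minimal sketch of an automated monthly reconciliation between an HRIS export and the finance headcount plan, using pandas. The file names and columns (employee_id, department, planned_headcount) are hypothetical.

```python
import pandas as pd

# Hypothetical extracts; swap in your own HRIS and planning exports.
hris = pd.read_csv("hris_export.csv")    # columns: employee_id, department
plan = pd.read_csv("finance_plan.csv")   # columns: department, planned_headcount

# Actual headcount per department from the system of record.
actual = (hris.groupby("department")["employee_id"]
              .nunique()
              .rename("actual_headcount"))

recon = plan.set_index("department").join(actual).fillna(0)
recon["variance"] = recon["actual_headcount"] - recon["planned_headcount"]

# Log every run so issues stay trackable over time (the audit trail).
exceptions = recon[recon["variance"] != 0]
exceptions.to_csv("recon_exceptions.csv")
if not exceptions.empty:
    print(exceptions)  # route to HR/finance owners for follow-up
```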