Data Quality Checks to Improve Data Reliability


Summary

Ensuring data accuracy and completeness through systematic quality checks is essential to improving data reliability, minimizing errors, and avoiding costly mistakes in industries such as healthcare, pharmaceuticals, and technology.

  • Establish clear standards: Define data quality rules upfront, such as ensuring completeness, consistency, and accuracy, to reduce errors at the source.
  • Implement automated monitoring: Use real-time dashboards and AI-powered tools to identify anomalies, missing data, or schema changes as they occur.
  • Incorporate data validation steps: Perform regular audits, reconciliation checks, and detailed schema validations to maintain data integrity throughout the pipeline; a minimal code sketch of such checks follows this list.
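
Below is a minimal, self-contained sketch of the kind of upfront rules this summary describes (completeness, consistency, accuracy), written in plain Python. The field names, record layout, and thresholds are illustrative assumptions, not taken from any of the posts below.

```python
# Hypothetical record-level data quality rules: completeness, consistency,
# and accuracy checks applied before records enter a pipeline.
from datetime import date

REQUIRED_FIELDS = ["account_id", "product", "rx_count", "report_date"]  # assumed schema


def validate_record(record: dict) -> list[str]:
    """Return a list of rule violations for one record (empty list = clean)."""
    errors = []

    # Completeness: every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            errors.append(f"missing required field: {field}")

    # Consistency: counts can never be negative.
    rx = record.get("rx_count")
    if isinstance(rx, (int, float)) and rx < 0:
        errors.append("rx_count must be non-negative")

    # Accuracy: report dates cannot be in the future.
    report_date = record.get("report_date")
    if isinstance(report_date, date) and report_date > date.today():
        errors.append("report_date is in the future")

    return errors


# Example: a record with a missing account ID and a negative count fails two rules.
bad = {"account_id": None, "product": "X", "rx_count": -3, "report_date": date(2024, 5, 1)}
print(validate_record(bad))
```
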
  • Amanjeet Singh, Head of Strategy & Operations and Strategic Business Unit Leader at Axtria Inc.

    Managing data quality is critical in the pharma industry because poor data quality leads to inaccurate insights, missed revenue opportunities, and compliance risks. The industry is estimated to lose between $15 million and $25 million annually per company due to poor data quality, according to various studies. To mitigate these challenges, the industry can adopt AI-driven data cleansing, enforce master data management (MDM) practices, and implement real-time monitoring systems to proactively detect and address data issues. There are several options, which I have listed below:

    • Automated Data Reconciliation: Set up an automated, AI-enabled reconciliation process that compares expected vs. actual data received from syndicated data providers. By cross-referencing historical data or other data sources (such as direct sales reports or CRM systems), discrepancies like missing accounts can be identified quickly.
    • Data Quality Dashboards: Create real-time dashboards that display prescription data from key accounts, highlighting gaps or missing data as soon as they occur. These dashboards can be designed with alerts that notify the relevant teams when an expected data point is missing.
    • Proactive Exception Reporting: Implement exception reports that flag missing or incomplete data. By establishing business rules for prescription data based on historical trends and account importance, any deviation from the norm (such as missing data from key accounts) can trigger alerts for further investigation.
    • Data Quality Checks at the Source: Develop specific data quality checks within the data ingestion pipeline that assess the completeness of account-level prescription data from syndicated data providers. If key account data is missing, a notification goes to the data management team for immediate follow-up with the providers.
    • Redundant Data Sources: Cross-check against additional data providers or internal data sources (such as sales team reports or pharmacy-level data). By comparing datasets, missing data from syndicated providers can be identified and verified quickly.
    • Data Stewardship and Monitoring: Assign data stewards or a dedicated team to monitor data feeds from syndicated data providers, track patterns in missing data, and work closely with providers to resolve systemic issues.
    • Regular Audits and SLAs: Establish service level agreements (SLAs) with data providers that include specific penalties or remedies for missing or delayed data from key accounts. Regularly auditing the data against these SLAs ensures timely identification and correction of missing prescription data.

    By addressing data quality challenges with advanced technologies and robust management practices, the industry can reduce financial losses, improve operational efficiency, and ultimately enhance patient outcomes.
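
Here is a minimal sketch of the automated reconciliation idea described above: compare the accounts expected from historical or CRM data against the accounts actually present in the latest syndicated feed, and alert on the gaps. The account IDs, the 50% volume tolerance, and the notify() stub are illustrative assumptions, not any vendor's actual implementation.

```python
def reconcile_accounts(expected: dict[str, int], received: dict[str, int],
                       tolerance: float = 0.5) -> dict[str, list]:
    """Flag accounts that are missing entirely or whose volume dropped sharply."""
    missing = [acct for acct in expected if acct not in received]
    suspicious = [
        (acct, expected[acct], received[acct])
        for acct in expected
        if acct in received and received[acct] < expected[acct] * tolerance
    ]
    return {"missing": missing, "suspicious": suspicious}


def notify(findings: dict) -> None:
    # Stand-in for a real alerting hook (email, Slack, ticketing, etc.).
    if findings["missing"] or findings["suspicious"]:
        print("DATA QUALITY ALERT:", findings)


# Example: expected volumes from history/CRM vs. the latest syndicated feed.
expected = {"ACCT-001": 120, "ACCT-002": 340, "ACCT-003": 95}
received = {"ACCT-001": 118, "ACCT-003": 12}  # ACCT-002 missing, ACCT-003 collapsed

notify(reconcile_accounts(expected, received))
```
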

  • Chad Sanderson, CEO @ Gable.ai (Shift Left Data Platform)

    Many companies talk about implementing data contracts and shifting left, but Zakariah S. and the team at Glassdoor have actually done it. In an article published earlier today, the Glassdoor Data Platform team goes in-depth about how they have started driving data quality from the source through data contracts, proactive monitoring/observability, and Data DevOps.

    Here's a great quote from the article on the value of Shifting Left: "This approach offers many benefits, but the top four we’ve observed are:

    • Data Quality by Design: Incorporating data quality checks early in the lifecycle helps prevent bad data from entering production systems.
    • Fewer Downstream Breakages: By resolving potential issues closer to the source, the entire data pipeline becomes more resilient and less susceptible to cascading failures.
    • Stronger Collaboration: Equipping product engineers with tools, frameworks, and guidelines to generate high-quality data nurtures a closer partnership between data producers and consumers.
    • Cost & Time Efficiency: Preventing bad data is significantly cheaper than diagnosing and fixing it after propagating across multiple systems.

    These were the foundational principles upon which our motivation for shifting left was achieved."

    Glassdoor achieved this through six primary technology investments:

    • Data Contracts (Gable.ai): Define clear specifications for fields, types, and constraints, ensuring product engineers are accountable for data quality from the start.
    • Static Code Analysis (Gable.ai): Integrated with GitLab/GitHub and Bitrise to catch and block problematic data changes before they escalate downstream.
    • LLMs for Anomaly Detection (Gable.ai): Identify subtle issues (e.g., swapped field names) that may not violate contracts but could lead to data inconsistencies.
    • Schema Registry (Confluent): Screens incoming events, enforcing schema validation and directing invalid data to dead-letter queues to keep pipelines clean.
    • Real-time Monitoring (DataDog): Provides continuous feedback loops to detect and resolve issues in real time.
    • Write-Audit-Publish (WAP) / Blue-Green Deployment: Ensures each data batch passes through a staging area before being promoted to production, isolating risks before they impact downstream consumers.

    "By addressing the psychological dimension of trust through shared responsibility, transparent validation, and confidence-building checks, we’re scaling to petabytes without compromising our data’s essential sense of faith. Ultimately, this combination of technical rigor and cultural awareness empowers us to build resilient, trustworthy data systems — one contract, one check, and one validation at a time."

    It's a fascinating article and an insight into incredibly sophisticated thinking around data quality and governance. You can check out the link below: https://lnkd.in/d-ADip42 Good luck!
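
To make the contract and schema-registry ideas above concrete, here is a generic sketch of contract-style validation at ingestion, with invalid events routed to a dead-letter list instead of flowing downstream. This is not the Gable.ai or Confluent API; the contract definition, field names, and events are invented for illustration.

```python
# A hypothetical data contract: declared field types plus required fields.
CONTRACT = {
    "fields": {"job_id": int, "company": str, "rating": float},
    "required": ["job_id", "company"],
}


def validate_event(event: dict, contract: dict) -> list[str]:
    """Return contract violations for a single event."""
    errors = []
    for field in contract["required"]:
        if field not in event:
            errors.append(f"missing required field: {field}")
    for field, expected_type in contract["fields"].items():
        if field in event and not isinstance(event[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}, "
                          f"got {type(event[field]).__name__}")
    return errors


def ingest(events: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split events into accepted records and a dead-letter queue."""
    accepted, dead_letter = [], []
    for event in events:
        errors = validate_event(event, CONTRACT)
        if errors:
            dead_letter.append({"event": event, "errors": errors})
        else:
            accepted.append(event)
    return accepted, dead_letter


good, bad = ingest([
    {"job_id": 1, "company": "Acme", "rating": 4.2},
    {"job_id": "2", "company": "Globex"},  # wrong type for job_id -> dead letter
])
print(f"accepted={len(good)} dead_letter={len(bad)}")
```
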

  • Joseph M., Data Engineer, startdataengineering.com | Bringing software engineering best practices to data engineering

    It took me 10 years to learn about the different types of data quality checks; I'll teach it to you in 5 minutes:

    1. Check table constraints. The goal is to ensure your table's structure is what you expect:
      • Uniqueness
      • Not null
      • Enum check
      • Referential integrity
    Ensuring the table's constraints is an excellent way to cover your data quality bases.

    2. Check business criteria. Work with the subject matter expert to understand what data users check for:
      • Min/max permitted value
      • Order-of-events check
      • Data format check, e.g., check for the presence of the '$' symbol
    Business criteria catch data quality issues specific to your data/business.

    3. Table schema checks. Schema checks ensure that no inadvertent schema changes have happened:
      • Using an incorrect transformation function, leading to a different data type
      • Upstream schema changes

    4. Anomaly detection. Metrics change over time; ensure it's not due to a bug:
      • Check the percentage change of metrics over time
      • Use simple percentage change across runs
      • Use standard deviation checks to ensure values are within the "normal" range
    Detecting value deviations over time is critical for business metrics (revenue, etc.).

    5. Data distribution checks. Ensure your data size remains similar over time:
      • Ensure row counts remain similar across days
      • Ensure critical segments of data remain similar in size over time
    Distribution checks ensure you aren't silently losing data to faulty joins/filters.

    6. Reconciliation checks. Check that your output has the same number of entities as your input:
      • Check that your output didn't lose data due to buggy code

    7. Audit logs. Log the number of rows input and output for each transformation step in your pipeline:
      • Having a log of the number of rows going in and coming out is crucial for debugging
      • Audit logs can also help you answer business questions
    Debugging data questions? Look at the audit log to see where data duplication/dropping happens.

    DQ warning levels: Make sure your data quality checks are tagged with appropriate warning levels (e.g., INFO, DEBUG, WARN, ERROR). Based on the criticality of the check, you can block the pipeline.

    Get started with the business and constraint checks, adding more only as needed. Before you know it, your data quality will skyrocket! Good luck!

    Like this thread? Read about the types of data quality checks in detail here 👇 https://lnkd.in/eBdmNbKE Please let me know what you think in the comments below. Also, follow me for more actionable data content. #data #dataengineering #dataquality
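
Here is a minimal sketch of the anomaly-detection and distribution checks described in points 4 and 5 above, using only the standard library. The thresholds (25% change, 3 standard deviations) and the sample row counts are illustrative assumptions.

```python
import statistics


def check_metric(history: list[float], today: float,
                 max_pct_change: float = 0.25, max_z: float = 3.0) -> list[str]:
    """Return WARN/ERROR messages if today's value deviates from recent history."""
    findings = []

    # Simple percentage change vs. the previous run.
    last = history[-1]
    pct_change = abs(today - last) / last
    if pct_change > max_pct_change:
        findings.append(f"WARN: {pct_change:.0%} change vs. previous run")

    # Standard-deviation ("normal range") check vs. the trailing window.
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev > 0 and abs(today - mean) / stdev > max_z:
        findings.append(f"ERROR: {today} is more than {max_z} stdevs from mean {mean:.0f}")

    return findings


# Example: daily row counts, with today's load suspiciously low.
row_counts = [98_500, 101_200, 99_800, 100_400, 102_100]
print(check_metric(row_counts, today=61_000))
```
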
