Automating data contract validation for engineers


Summary

Automating data contract validation for engineers means using tools and processes to automatically check that shared data meets agreed standards before it’s used, helping teams avoid messy mistakes and improve data reliability. At its core, a data contract is a clear, standardized agreement describing how data should look and behave, which can be enforced by automation to prevent errors and miscommunication between teams.

  • Define clear contracts: Write simple, standardized rules that describe your data’s structure and quality requirements so both producers and users understand what’s expected (a minimal sketch of such a contract follows this list).
  • Integrate checks early: Set up automatic validation in your code development pipeline so problems are caught before they affect your products or users.
  • Focus on critical data: Prioritize automating checks for your most important and frequently changing data assets to prevent headaches and costly fixes later.
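
To make the first bullet concrete, here is a minimal sketch in Python of the kind of standardized structure a data contract can capture. The field names, check identifiers, and `DataContract` shape are illustrative assumptions, not any particular contract specification.

```
from dataclasses import dataclass, field

# Illustrative only: one possible minimal shape for a data contract.
# Field names and check identifiers are assumptions, not a specific spec.

@dataclass
class Column:
    name: str
    data_type: str
    checks: list[str] = field(default_factory=list)  # e.g. "no_missing_values"

@dataclass
class DataContract:
    dataset: str
    owner: str              # who gets alerted when checks fail
    version: str            # contracts are versioned like any other interface
    columns: list[Column]

orders_contract = DataContract(
    dataset="orders",
    owner="checkout-team",
    version="1.0.0",
    columns=[
        Column("id", "VARCHAR", ["no_missing_values", "no_duplicate_values"]),
        Column("size", "VARCHAR", ["valid_values: S, M, L"]),
    ],
)
```

The value is less in this particular shape than in having one machine-readable place where producers and consumers agree on structure, quality expectations, and ownership.
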
  • Andrew Jones

    📝 Principal Engineer. Builder of data platforms. Created data contracts and wrote the book on it. Father of 2. Brewer of beer. Aphantasic.

    7,582 followers

    The initial idea for data contracts was to create an interface through which reliable and well-structured data could be made available to consumers. Like an API, but for data.

    To create an interface we first need a description of the data — the metadata — that contains enough detail to provision the interface in our system of choice. For example, we need a schema with fields and their types, which allows us to automate the creation and management of a table in the data warehouse.

    Then I realised: if we can automate the creation and management of an interface from this metadata, what else could we automate if we had sufficient metadata? It turns out, 𝒆𝒗𝒆𝒓𝒚𝒕𝒉𝒊𝒏𝒈.

    Take data quality checks as an example. We don’t need every data owner to choose a framework to write the tests in, orchestrate running the tests, set up the alerting, and so on. All we need to do is allow them to define the checks they want to run in their data contract:

    ```
    - name: id
      data_type: VARCHAR
      checks:
        - type: no_missing_values
        - type: no_duplicate_values
    - name: size
      data_type: VARCHAR
      checks:
        - type: invalid_count
          valid_values: ['S', 'M', 'L']
          must_be_less_than: 10
    ```

    And the platform runs these checks for them, on the right schedule, and sends the alerts to them if/when these checks fail.

    This is great for the data owner. They can focus on creating and managing great data products that meet the needs of their users, not spending their cognitive load worrying about how to run their data quality checks.

    It’s also great for the data platform team to build in this way. Any capability they add to the data platform will immediately be adopted by all data owners and applied to all data managed by data contracts. In the ~5 years we’ve been doing data contracts we’ve implemented all our data platform capabilities through data contracts, and I can’t see any reason why we can’t continue to do so well into the future.

    Data contracts are a simple idea. You’re just describing your data in a standardised, human- and machine-readable format. But they’re so powerful. Powerful enough to build an entire data platform around.
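
A rough sketch in Python of one way a platform could execute checks declared like this: load the contract fragment, compile each check into a SQL count query, and report failures. The translation rules and the `execute_scalar` callable are assumptions for illustration, not the platform Andrew describes.

```
import yaml  # pip install pyyaml

# The checks block from the contract above, embedded so the sketch is self-contained.
CONTRACT_CHECKS = """
- name: id
  data_type: VARCHAR
  checks:
    - type: no_missing_values
    - type: no_duplicate_values
- name: size
  data_type: VARCHAR
  checks:
    - type: invalid_count
      valid_values: ['S', 'M', 'L']
      must_be_less_than: 10
"""

def check_to_sql(table: str, column: str, check: dict) -> tuple[str, int]:
    """Translate one declared check into a SQL count query plus the maximum count allowed."""
    if check["type"] == "no_missing_values":
        return f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL", 0
    if check["type"] == "no_duplicate_values":
        return (
            f"SELECT COUNT(*) FROM (SELECT {column} FROM {table} "
            f"GROUP BY {column} HAVING COUNT(*) > 1) dupes",
            0,
        )
    if check["type"] == "invalid_count":
        allowed_values = ", ".join(f"'{v}'" for v in check["valid_values"])
        return (
            f"SELECT COUNT(*) FROM {table} WHERE {column} NOT IN ({allowed_values})",
            check["must_be_less_than"] - 1,
        )
    raise ValueError(f"unknown check type: {check['type']}")

def run_contract_checks(table: str, execute_scalar) -> list[str]:
    """Run every declared check; `execute_scalar` is any callable that runs a SQL
    string against the warehouse and returns a single number."""
    failures = []
    for column in yaml.safe_load(CONTRACT_CHECKS):
        for check in column.get("checks", []):
            sql, allowed = check_to_sql(table, column["name"], check)
            count = execute_scalar(sql)
            if count > allowed:
                failures.append(
                    f"{table}.{column['name']}: {check['type']} failed "
                    f"({count} offending rows)"
                )
    return failures
```

A scheduler could then run `run_contract_checks` for every contracted table and route any failure messages to the owning team's alert channel, which is the point: owners declare the checks, the platform owns orchestration and alerting.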

  • 🎯 Mark Freeman II

    Data Engineer | Tech Lead @ Gable.ai | O’Reilly Author: Data Contracts | LinkedIn [in]structor (28k+ Learners) | Founder @ On the Mark Data

    63,144 followers

    I’ve lost count of projects that shipped gorgeous features but relied on messy data assets. The cost always surfaces later as inevitable firefights, expensive backfills, and credibility hits to the data team. This is a major reason why I argue we need to incentivize SWEs to treat data as a first-class citizen before they merge code.

    Here are five ways you can help SWEs make this happen:

    1. Treat data as code, not exhaust
    Data is produced by code (regardless of whether you are the 1st-party producer or ingesting from a 3rd party). Many software engineers have minimal visibility into how their logs are used (even the business-critical ones), so you need to make it easy for them to understand their impact.

    2. Automate validation at commit time
    Data contracts enable checks during the CI/CD process when a data asset changes. A failing test should block the merge just like any unit test (a minimal sketch follows this post). Developers receive instant feedback instead of hearing their data team complain about the hundredth data issue with minimal context.

    3. Challenge the "move fast and break things" mantra
    Traditional approaches often postpone quality and governance until after deployment, as shipping fast feels safer than debating data schemas at the outset. Instead, early negotiation shrinks rework, speeds onboarding, and keeps your pipeline clean when the feature's scope changes six months in. Having a data perspective when creating product requirement documents can be a huge unlock!

    4. Embed quality checks into your pipeline
    Track DQ metrics such as null ratios, referential breaks, and out-of-range values on trend dashboards. Observability tools are great for this, but even a set of SQL queries triggered on a schedule can provide value.

    5. Don't boil the ocean; focus on protecting tier 1 data assets first
    Your most critical but volatile data asset is your top candidate for these approaches. Ideally it changes meaningfully as your product or service evolves, but that change can lead to chaos. Making a case for mitigating risk for critical components is an effective way to make SWEs want to pay attention.

    If you want to fix a broken system, you start at the source of the problem and work your way forward. Not doing this is why so many data teams I talk to feel stuck.

    What’s one step your team can take to move data quality closer to SWEs?

    #data #swe #ai
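
As a concrete illustration of point 2, here is a minimal sketch of a commit-time check in Python, assuming the agreed contract and the PR's proposed schema are both available as simple YAML column lists. The file paths and YAML layout are hypothetical, not Gable's implementation.

```
#!/usr/bin/env python3
"""CI step sketch: fail the build when a proposed schema change breaks the data contract."""
import sys

import yaml  # pip install pyyaml

def load_columns(path: str) -> dict[str, str]:
    """Read a YAML list of column definitions into {column_name: data_type}."""
    with open(path) as f:
        return {col["name"]: col["data_type"] for col in yaml.safe_load(f)}

def contract_violations(contract: dict[str, str], proposed: dict[str, str]) -> list[str]:
    """Flag backward-incompatible changes: dropped columns and changed types."""
    problems = []
    for name, dtype in contract.items():
        if name not in proposed:
            problems.append(f"column '{name}' was removed but is required by the contract")
        elif proposed[name] != dtype:
            problems.append(f"column '{name}' changed type from {dtype} to {proposed[name]}")
    return problems

if __name__ == "__main__":
    # Hypothetical paths: the agreed contract vs. the schema the PR's code would produce.
    violations = contract_violations(
        load_columns("contracts/orders.yaml"),
        load_columns("build/proposed_orders_schema.yaml"),
    )
    for violation in violations:
        print(f"CONTRACT VIOLATION: {violation}")
    # A non-zero exit blocks the merge, just like a failing unit test.
    sys.exit(1 if violations else 0)
```

Wired into CI, the non-zero exit blocks the merge exactly like a failing unit test, and the printed violations give the engineer the immediate context Mark describes.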

  • Chad Sanderson

    CEO @ Gable.ai (Shift Left Data Platform)

    89,477 followers

    Data Contracts are composed of two parts: the contract spec and the enforcement/validation mechanism.

    The contract spec should be defined in code, stored in a central repository, and version controlled. I prefer to do this using YAML, because YAML is extraordinarily flexible and can be translated into a variety of other schema serialization frameworks like Protobuf, Avro, and JSON Schema.

    Once a contract spec has been defined, contract owners generate enforcement mechanisms at the appropriate place in the data pipeline to ensure the contract is being followed. The best place to manifest checks against schema is in CI/CD, as these can be truly preventative. Preventative enforcement can be blocking or informational.

    - Blocking mechanisms break CI/CD builds until contract violations are resolved
    - Informational mechanisms communicate to producers which consumers of the contract will be impacted by backward-incompatible changes
    - Informational mechanisms can also be used to allow data consumers to better advocate for their downstream use cases on the PRs which will likely impact them

    The combination of these two types of preventative frameworks creates awareness of how data is being used downstream, and allows the business to control how data evolves over time.

    Semantic checks should ideally shift left as far as possible. I recommend firing exceptions in the production codebase when value-based constraints are violated, and also doing semantic validation on data in flight between data sources and destinations. This allows you to do 3 things very effectively:

    1. Prevent simple backward-incompatible semantics upstream
    2. Act on contract violations on data in flight (such as tagging low-quality data, stopping the pipeline entirely, or pushing the data to a staging table before consumers can see it)
    3. Communicate to consumers in advance when low-quality data is detected

    Something I can't stress enough: data contracts are a mechanism of COMMUNICATION. The entire point is to build visibility between data producers and data consumers into how data is being used, when it violates expectations, what 'quality' looks like, and who is responsible for it.

    Data contracts, combined with data lineage, downstream monitoring, and data catalogs, form a metadata management layer that allows data engineers to remove themselves from the producer/consumer feedback loop and focus on creating solid infrastructure.

    Good luck! #dataengineering
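
A rough sketch in Python of what shifting semantic checks left might look like in practice: raise in the producing code when a value-based constraint is violated, and quarantine violating records in flight before consumers see them. The `Order` fields, valid values, and the clean/quarantine split are illustrative assumptions, not any specific platform's implementation.

```
from dataclasses import dataclass

# Illustrative value-based constraints for an 'orders' record; the field names
# and valid values are assumptions, not a specific contract.
VALID_SIZES = {"S", "M", "L"}

@dataclass
class Order:
    id: str
    size: str
    amount_cents: int

def assert_contract(order: Order) -> None:
    """Upstream enforcement: raise inside the producing codebase so a semantic
    violation fails fast, before the data ever leaves the source."""
    if order.size not in VALID_SIZES:
        raise ValueError(f"order {order.id}: size '{order.size}' violates the contract")
    if order.amount_cents < 0:
        raise ValueError(f"order {order.id}: negative amount violates the contract")

def route_in_flight(orders: list[Order]) -> tuple[list[Order], list[Order]]:
    """In-flight enforcement: split records into (clean, quarantined) so violating
    rows land in a staging table instead of in front of consumers."""
    clean, quarantined = [], []
    for order in orders:
        try:
            assert_contract(order)
            clean.append(order)
        except ValueError:
            quarantined.append(order)
    return clean, quarantined
```

Quarantined records can then feed the communication layer Chad describes: alerting consumers that low-quality data was detected before it reached them.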
