Putting pressure on data science teams to deliver analytical value with LLMs is cruel and unusual punishment without a scalable data foundation. Over time, the best LLMs will be able to write queries as effectively as an analyst or better, or at minimum make writing the query easier. However, the most cost-intensive aspect of answering business questions is not producing SQL; it is deciding what the query inputs should be and determining whether those inputs are trustworthy.

Thanks to the rapid evolution of microservices and data lakes, data teams find themselves living in a world of fragmented truth. The same data points might be collected by multiple services, defined in multiple different ways, and may even trend in opposite, contradictory directions. Today, data developers must do the hard work of understanding and resolving those discrepancies, which comes in the form of 1-to-1 conversations with the engineers managing logs and databases. Very few, if any, service teams at a company have documented their data for the purpose of analytics. The result is a giant documentation gap across thousands of datasets across the business. Until that gap is filled, data scientists will have to manually hand-check any prediction an LLM makes to ensure it is accurate and not hallucinating. The model is doing a job with the information it has, but the business is not providing enough information for the model to deliver trustworthy outcomes!

By investing in a scalable data foundation, this paradigm flips on its head. Data is well documented, clearly owned, and structured as an API enforced by contracts that define the use case, constraints, SLAs, and semantic meaning. A quality-driven infrastructure is a subset of all data in the lake, which reduces the surface area LLMs need for decision-making to only the nodes in the lineage graph that have clear governance and change management.

Here's what I suggest:
1. Start by identifying which pipelines are most essential to answering the business's most common questions (you can do this by accessing query history).
2. Identify the core use cases (datasets/views) that are leveraged in these pipelines, and which intermediary tables are of critical importance.
3. Define semantically what the data means at each level in the transformation. A good question to ask is "What does a single row in this table represent?"
4. Validate the semantic meaning with the table owners.
5. Get the table owners to take ownership of the dataset as an API, ideally supported programmatically through a data contract.
6. Define the semantic meaning and constraints within the data contract spec, mapped to a source file (see the sketch below).
7. Limit any usage of an LLM to the source files under contract.

Good luck! #dataengineering
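To make step 6 concrete, here is a minimal sketch of what a programmatic contract check could look like. The contract shape, field names (`owner`, `sla_hours`, `columns`), and the `orders` dataset are illustrative assumptions rather than a formal contract standard; real implementations typically use a dedicated spec file and contract tooling.

```python
# Hypothetical data contract for an "orders" table, expressed as plain Python.
# Field names and structure are illustrative, not a formal contract standard.
orders_contract = {
    "dataset": "analytics.orders",
    "owner": "orders-service-team",           # who is accountable for this data API
    "semantics": "One row per confirmed customer order.",
    "sla_hours": 24,                          # data must land within 24 hours
    "columns": {
        "order_id":    {"type": str,   "required": True, "unique": True},
        "customer_id": {"type": str,   "required": True},
        "amount_usd":  {"type": float, "required": True, "min": 0.0},
    },
}

def validate_rows(rows: list[dict], contract: dict) -> list[str]:
    """Return human-readable violations of the contract (simplified: one unique column)."""
    violations = []
    seen_ids = set()
    for i, row in enumerate(rows):
        for name, rules in contract["columns"].items():
            value = row.get(name)
            if value is None:
                if rules.get("required"):
                    violations.append(f"row {i}: missing required column '{name}'")
                continue
            if not isinstance(value, rules["type"]):
                violations.append(f"row {i}: '{name}' has wrong type {type(value).__name__}")
            if "min" in rules and value < rules["min"]:
                violations.append(f"row {i}: '{name}' below minimum {rules['min']}")
            if rules.get("unique"):
                if value in seen_ids:
                    violations.append(f"row {i}: duplicate '{name}' value {value!r}")
                seen_ids.add(value)
    return violations

sample = [
    {"order_id": "o-1", "customer_id": "c-9", "amount_usd": 42.0},
    {"order_id": "o-1", "customer_id": "c-9", "amount_usd": -5.0},  # violates unique + min
]
print(validate_rows(sample, orders_contract))
```

An LLM restricted to datasets that pass checks like this is working only with data whose meaning and constraints someone has explicitly signed off on.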
How Data Architecture Affects Analytics
Explore top LinkedIn content from expert professionals.
Summary
Data architecture plays a crucial role in determining how effectively businesses can analyze and draw insights from their data. By organizing, structuring, and maintaining data systems, a well-designed architecture ensures reliability, scalability, and accuracy in analytics, which is essential for informed decision-making.
- Build a strong foundation: Establish clear data ownership, enforce data governance, and create well-documented, scalable systems to ensure data is trustworthy and accessible.
- Streamline data processes: Prioritize cleaning, transforming, and standardizing data at the source to reduce redundancy, prevent errors, and support analytics and AI applications effectively.
- Create structured layers: Use frameworks like Medallion Architecture to incrementally improve data quality, moving from raw to refined datasets that are ready for business insights.
-
Many Data Engineers (my past self included) jump into pipelines without understanding how data should flow in a lakehouse. That’s where Medallion Architecture changed everything for me. If you’ve ever wondered how to organize raw, messy data into analytics-ready gold, this one’s for you.

Medallion Architecture is a powerful data design pattern used to logically structure and refine data in a data lakehouse environment. It’s designed to incrementally enhance data quality and usability as data moves through different layers.

Why is Medallion Architecture used?
- Provides a structured approach to progressively improve data quality through multiple stages.
- Scales with business needs and handles large volumes from diverse sources.
- Improves data lineage, governance, and compliance tracking.
- Helps create a single, unified view of enterprise data.

Layers of Medallion Architecture:

Bronze Layer – Raw Data
🔹 Purpose: Initial data ingestion and storage
🔹 Characteristics:
- Unprocessed, schema-less, or semi-structured
- Original format (logs, streaming data, CSVs, JSON, etc.)
- Stored in scalable storage (S3, Azure Blob, HDFS)
- Immutable, with complete history preserved

Silver Layer – Cleaned Data
🔹 Purpose: Data cleansing, normalization, and schema enforcement
🔹 Characteristics:
- Validated and structured data
- Merged from multiple sources
- Stored in managed tables (e.g., Delta Lake)
- Prepares data for downstream analytics

Gold Layer – Refined Data
🔹 Purpose: Aggregation, enrichment, and business-level modeling
🔹 Characteristics:
- High-quality, query-optimized datasets
- Stored in data warehouses or lakehouse tables
- Ready for BI tools, dashboards, and ML models
- Enables streaming analytics and high concurrency

There can be more or fewer layers depending on your architecture and business needs, but the Bronze → Silver → Gold model provides a scalable and modular foundation (a minimal sketch follows below). 🏅 This architecture isn’t just about organizing data; it’s about building trust, traceability, and value in every dataset.

(GIF credit: ilum.cloud)

#DataEngineering #MedallionArchitecture #DataLakehouse #DeltaLake #BigData #ETL #DataGovernance #Datalake #Spark #BI #StreamingAnalytics #MachineLearning
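Here is a minimal bronze → silver → gold sketch with PySpark and Delta Lake. The paths, column names, and the `orders` dataset are illustrative assumptions, and running it requires a Spark session with the Delta Lake package configured; it is meant to show the shape of the layering, not a production pipeline.

```python
# Bronze → silver → gold in miniature with PySpark + Delta Lake.
# Paths and columns are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

# Bronze: land the raw files as-is, preserving complete history.
bronze = spark.read.json("s3://lake/raw/orders/")          # unprocessed source data
bronze.write.format("delta").mode("append").save("s3://lake/bronze/orders")

# Silver: enforce schema, deduplicate, and standardize types.
silver = (
    spark.read.format("delta").load("s3://lake/bronze/orders")
    .dropDuplicates(["order_id"])
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .withColumn("amount_usd", F.col("amount_usd").cast("double"))
    .filter(F.col("order_id").isNotNull())
)
silver.write.format("delta").mode("overwrite").save("s3://lake/silver/orders")

# Gold: aggregate to a business-level, query-optimized table for BI/ML.
gold = (
    silver.groupBy(F.to_date("order_ts").alias("order_date"))
    .agg(
        F.sum("amount_usd").alias("daily_revenue"),
        F.countDistinct("customer_id").alias("unique_customers"),
    )
)
gold.write.format("delta").mode("overwrite").save("s3://lake/gold/daily_revenue")
```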
-
We’ve built a system where every team hacks together their own data pipelines, reinventing the wheel with every use case. Medallion architectures, once a necessary evil, now feel like an expensive relic: layers of redundant ETL jobs, cascading schema mismatches, and duplicated processing logic.

Instead of propagating this mess downstream, shift it left to the operational layer. Do schema enforcement, deduplication, and transformation once, at the source, rather than five times in five different pipelines. Push processing upstream, closer to where the data is generated, instead of relying on a brittle patchwork of batch jobs.

Adam Bellemare’s InfoQ article (link below) lays it out clearly: multi-hop architectures are slow, costly, and error-prone. They depend on reactive data consumers pulling data, cleaning it, and shaping it after the fact.

The alternative? Treat data like an API contract. Push standardization into the producer layer. Emit well-formed, semantically correct event streams that can be consumed directly by both operational and analytical systems, without the usual ETL contortions (see the producer-side sketch below).

The old way, letting every team fend for themselves and write brittle ETL for a dozen variations of the same dataset, creates a maintenance nightmare and is unfair to the data teams that get stuck disentangling the mess. Shift left. Make clean, high-quality data a first-class product, not an afterthought.

No one studied computer science so they could spend their work life cleaning data. So why are we still defending architectures built for the constraints of 20 years ago?

Check out Adam's article for more on this: https://lnkd.in/g27m5ZwV
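A minimal "shift-left" sketch: the producing service validates and standardizes an event before it is emitted, so downstream consumers never have to clean it. The `OrderPlaced` schema, topic name, and `publish()` stub are illustrative assumptions; in practice this role is usually played by a schema registry (Avro/Protobuf) in front of the producer's event stream.

```python
# Producer-side enforcement: validate once, at the source, before emitting.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass(frozen=True)
class OrderPlaced:
    order_id: str
    customer_id: str
    amount_usd: float
    placed_at: str  # ISO-8601 UTC timestamp

    def validate(self) -> None:
        if not self.order_id or not self.customer_id:
            raise ValueError("order_id and customer_id are required")
        if self.amount_usd < 0:
            raise ValueError("amount_usd must be non-negative")
        datetime.fromisoformat(self.placed_at)  # raises if the timestamp is malformed

def publish(topic: str, event: OrderPlaced) -> None:
    """Stand-in for a real producer client (Kafka, Pub/Sub, etc.)."""
    event.validate()  # the contract is enforced here, not in five downstream pipelines
    payload = json.dumps(asdict(event))
    print(f"-> {topic}: {payload}")

publish(
    "orders.order_placed.v1",
    OrderPlaced(
        order_id="o-123",
        customer_id="c-9",
        amount_usd=42.00,
        placed_at=datetime.now(timezone.utc).isoformat(),
    ),
)
```

Because the event is well-formed when it leaves the producer, the same stream can feed operational services and the analytical lakehouse without per-team cleanup jobs.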
-
AI is only as powerful as the data it learns from. But raw data alone isn’t enough; it needs to be collected, processed, structured, and analyzed before it can drive meaningful AI applications. How does data transform into AI-driven insights? Here’s the data journey that powers modern AI and analytics:

1. Generate Data – AI models need diverse inputs: structured data (databases, spreadsheets) and unstructured data (text, images, audio, IoT streams). The challenge is managing high-volume, high-velocity data efficiently.
2. Store Data – AI thrives on accessibility. Whether on AWS, Azure, PostgreSQL, MySQL, or Amazon S3, scalable storage ensures real-time access to training and inference data.
3. ETL (Extract, Transform, Load) – Dirty data leads to bad AI decisions. Data engineers build ETL pipelines that clean, integrate, and optimize datasets before feeding them into AI and machine learning models (a minimal sketch follows below).
4. Aggregate Data – Data lakes and warehouses such as Snowflake, BigQuery, and Redshift prepare and stage data, making it easier for AI to recognize patterns and generate predictions.
5. Data Modeling – AI doesn’t work in silos. Well-structured dimension tables, fact tables, and Elasticube models help establish relationships between data points, enhancing model accuracy.
6. AI-Powered Insights – The final step is turning data into intelligent, real-time business decisions with BI dashboards, NLP, machine learning, and augmented analytics.

AI without the right data strategy is like a high-performance engine without fuel. A well-structured data pipeline enhances model performance, ensures accuracy, and drives automation at scale.

How are you optimizing your data pipeline for AI? What challenges do you face when integrating AI into your business? Let’s discuss.
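Here is a minimal ETL sketch with pandas: extract a raw CSV, clean and transform it, then load it into a local SQLite "warehouse". The file name, columns, and destination table are illustrative assumptions; a production pipeline would target a warehouse such as Snowflake, BigQuery, or Redshift instead.

```python
# Extract → Transform → Load in miniature. File names and columns are assumed.
import sqlite3
import pandas as pd

# Extract: pull raw records from a source file.
raw = pd.read_csv("raw_orders.csv")   # assumed columns: order_id, amount, order_date

# Transform: drop duplicates, coerce types, and remove obviously bad rows.
clean = (
    raw.drop_duplicates(subset=["order_id"])
       .assign(
           amount=lambda df: pd.to_numeric(df["amount"], errors="coerce"),
           order_date=lambda df: pd.to_datetime(df["order_date"], errors="coerce"),
       )
       .dropna(subset=["amount", "order_date"])
       .query("amount >= 0")
)

# Load: write the cleaned dataset into an analytics-ready table.
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("orders_clean", conn, if_exists="replace", index=False)
```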
-
Most software engineers assume that as long as they can query a database, they have all the data they need. Then they try to build a new product feature that relies on real-time user behavior, or maybe an ML model that requires enriched customer profiles, or even a simple reporting dashboard with accurate, up-to-date metrics. And suddenly, they hit a wall.

• Where is the event data from the frontend?
• Why don’t the customer attributes match across databases?
• What even is the source of truth?

This is the reality of data fragmentation. Every modern company has teams generating data in different places, such as application databases, event streams, SaaS tools, and warehouse tables. However, accessing and integrating this data isn’t just a data engineering problem anymore. Software engineers need to think about data architecture, too.

This is where federated data comes in. Federation allows you to query across distributed systems without centralizing everything first (see the sketch below). Instead of waiting for months-long ETL projects, teams can work with live data across multiple sources in real time.

For engineering leaders, this means:
✅ Less waiting for data teams to move data around.
✅ More flexibility to access what you need, when you need it.
✅ A step toward building truly data-driven applications.

If you’re starting to see data needs creep into your engineering work, you’re not alone. The way we build products is changing.

#swe #data #ai
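As a toy illustration of the federated idea, the sketch below joins two separate databases in one statement without copying either into the other. SQLite's ATTACH is only a local stand-in; real federation engines (Trino/Presto and similar) apply the same pattern across Postgres, event streams, object storage, and SaaS connectors. The database file names, tables, and columns are illustrative assumptions.

```python
# Query across two independent data sources in a single statement.
import sqlite3

# Pretend these are two systems owned by different teams.
with sqlite3.connect("app.db") as app:
    app.execute("CREATE TABLE IF NOT EXISTS customers (id TEXT PRIMARY KEY, name TEXT)")
    app.execute("INSERT OR REPLACE INTO customers VALUES ('c-9', 'Ada')")

with sqlite3.connect("events.db") as events:
    events.execute("CREATE TABLE IF NOT EXISTS clicks (customer_id TEXT, page TEXT)")
    events.execute("INSERT INTO clicks VALUES ('c-9', '/pricing')")

# "Federated" query: join live data from both sources, no ETL project required.
conn = sqlite3.connect("app.db")
conn.execute("ATTACH DATABASE 'events.db' AS events")
rows = conn.execute("""
    SELECT c.name, COUNT(*) AS clicks
    FROM customers AS c
    JOIN events.clicks AS e ON e.customer_id = c.id
    GROUP BY c.name
""").fetchall()
print(rows)   # e.g. [('Ada', 1)]
conn.close()
```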
-
I've conducted DE system-design interviews for 10 years. I'll teach you the key concepts to know in 10 minutes:

1. Partitioning
> Process/store data based on column values.
- Partitioning parallelizes work (processing & reads).
- Storage: Partition datasets to enable distributed systems to read in parallel.
- Processing: Partitioned data allows all machines in a cluster to process independently.
- Columns to partition by depend on processing needs or read patterns.

2. Data storage patterns
> Storing data properly ensures efficient consumers.
- Partition: see above.
- Clustering: Keeps similar values in specified columns together. Ideal for high-cardinality or continuous values.
- Encoding: Metadata in table/columnar file formats helps engines read only necessary data.

3. Data modeling
> Table design (grain & schema) determines warehouse success.
- Dimension: Rows represent entities in your business (e.g., customers).
- Fact: Rows represent events (e.g., orders).
- Kimball’s dimensional model is the most widely used approach.

4. Data architecture
> Understand system interactions:
- Queue/logging systems handle constant data streams.
- Distributed storage is cheap for raw/processed data (use partitioning if needed).
- Data processing systems (e.g., Spark) read, process & write to distributed stores.
- Data access layer (e.g., Looker on Snowflake) allows end-user access.

5. Data flow
> Most batch systems clean & transform data in layers:
- Raw: Input data stored as is.
- Bronze: Apply proper column names & types.
- Silver: Model data (e.g., Kimball). Create fact/dimension tables.
- Gold: Create tables for end-users or use a semantic layer to generate queries on demand.

6. Lambda & Kappa architecture
> Faster insights provide competitive advantages.
- Lambda: Combines batch (slow) & stream (fast) pipelines for stable & trending data.
- Kappa: Uses a single stream-processing flow (e.g., Apache Flink), simplifying maintenance.

7. Stream processing
> Key aspects:
- State & time: Store in-memory data for wide transformations (e.g., joins, windows).
- Joins: Use time as a criterion; rows from one stream can’t wait indefinitely for another.
- Watermark: Defines when data is complete, useful for late-arriving events.

8. Transformation types
> Reduce data movement for optimized processes.
- Narrow: Operates on single rows (e.g., substring, lower).
- Wide: Operates on multiple rows (e.g., joins, group by).
- Data shuffle: Wide operations require data movement between nodes, slowing processing.

9. Common patterns of questions
> Companies focus on industry-specific needs:
- Ads: Clickstream processing, modeling & user access.
- Finance: Batch reporting, data modeling & quality.
- Cybersecurity: Real-time intrusion detection from logs.

A small sketch of partitioning and narrow vs. wide transformations follows below.

Check out > https://lnkd.in/eVq5bwUW

----
What else should we cover?

Enjoy this? Repost and follow for actionable data content.

#data #dataengineering #datajobs #dataanalytics
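The sketch below shows two of the concepts above in miniature: partitioned storage (concept 1) and narrow vs. wide transformations (concept 8), using pandas with pyarrow locally. The column names and output path are illustrative assumptions; at cluster scale the same ideas apply to Spark/Hive-style partitioned tables.

```python
# Partitioning and narrow vs. wide transformations, in miniature.
import pandas as pd

orders = pd.DataFrame({
    "order_id":    ["o-1", "o-2", "o-3"],
    "customer_id": ["c-9", "c-9", "c-4"],
    "amount_usd":  [42.0, 13.5, 99.0],
    "order_date":  ["2024-01-01", "2024-01-01", "2024-01-02"],
})

# Partitioning: one directory per order_date, so a query for a single day
# only reads that day's files (requires pyarrow installed).
orders.to_parquet("lake/orders", partition_cols=["order_date"])

# Narrow transformation: operates row by row, no data movement needed.
orders["amount_cents"] = (orders["amount_usd"] * 100).astype(int)

# Wide transformation: a group-by that, on a cluster, would shuffle data between nodes.
daily_revenue = (
    orders.groupby("order_date", as_index=False)["amount_usd"].sum()
          .rename(columns={"amount_usd": "daily_revenue"})
)
print(daily_revenue)
```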
-
Your AI projects aren't failing because of bad data quality or lack of skills. They are failing because you built them on a very shaky foundation.

Technology innovation is moving faster than ever before. Personal and business pressures to learn and adopt new tech innovation are intense. Companies of all sizes struggle to keep up, forcing them to cut corners, ignore engineering best practices, and blindly throw lots of money at the problem. The results are poor: 74% of companies struggle to achieve value from AI projects, and only 26% have the necessary infra and skills to move their AI projects beyond POC.*

I believe a big reason for these failed AI projects is the lack of a solid data foundation and infrastructure to support these and other data-intensive use cases, not necessarily bad-quality data, too little data, or a lack of skills.

Modern AI + Analytics Architecture (MAnAA) is a prescriptive approach to building a modern, scalable data infrastructure to support current and future data-intensive use cases.

At the bottom are object stores, which persist data in a scalable and cost-effective manner. This isn't a new idea, but in recent years object stores have seen tremendous adoption by tools and services, cementing them as the go-to for persisting data of all types and sizes.

Above it is the universal table manager. Based on open table formats like #ApacheIceberg, this standards-based layer allows engines to find, understand, optimize, and access datasets of all kinds: columnar, row-wise, vectors, blobs, etc. This unification simplifies interoperability and enables seamless data sharing.

Data catalogs are experiencing a renaissance, unifying business and technical metadata together with service endpoints (Iceberg REST) and context awareness. Catalogs allow organizations to organize, secure, and route information so users and AI agents can find and use all available data, #DataHub and #UnityCatalog for example.

Compute services have exploded, with many amazing tools and services that can analyze data and build data apps and AI agents. However, adopting them is a major challenge for engineers. MAnAA encourages a compute marketplace in which tools and services can easily plug in, not bolt on, enabling fast adoption. Plugging in tools is accomplished with:
1/ Proper compatibility and interoperability with standard interfaces; tools that support the Iceberg table format and the Iceberg REST catalog already get us there (a minimal sketch follows below).
2/ BYOC, which enables companies to either use a vendor's compute (Snowflake, Databricks, etc.) or slot in their own K8S or another platform. The latter helps control and reduce costs and makes the solution more portable and flexible, especially when dealing with multi-cloud and on-prem deployments.

Questionable data quality is something we can live with; a broken foundation will stop you from succeeding 🤔

More about MAnAA in my post linked in the comments.

p.s. do you see your company heading in this direction?
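To ground the "universal table manager" layer, here is a minimal sketch of an engine (Spark, in this case) talking to an Apache Iceberg catalog and table through open interfaces. The catalog name ("lake"), warehouse path, and table schema are illustrative assumptions; this needs the Iceberg Spark runtime package on the classpath, and a REST, Hive, or Glue catalog would be swapped in via the same catalog configuration.

```python
# One engine writing to an Iceberg table that any Iceberg-compatible engine can read.
# Catalog name, warehouse path, and schema are assumptions for illustration.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("manaa-sketch")
    # Register an Iceberg catalog named "lake" backed by a local warehouse path.
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "file:///tmp/warehouse")
    .getOrCreate()
)

# Create a partitioned Iceberg table through the catalog.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.analytics.orders (
        order_id    STRING,
        customer_id STRING,
        amount_usd  DOUBLE,
        order_date  DATE
    )
    USING iceberg
    PARTITIONED BY (order_date)
""")

spark.sql("""
    INSERT INTO lake.analytics.orders
    VALUES ('o-1', 'c-9', 42.0, DATE '2024-01-01')
""")

spark.sql("""
    SELECT order_date, SUM(amount_usd) AS daily_revenue
    FROM lake.analytics.orders
    GROUP BY order_date
""").show()
```

The point of the sketch is the interface, not the engine: because the table format and catalog endpoint are open standards, another compute service could be plugged in to read the same table without copying data.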