Observability is essential to ensuring the reliability of distributed systems, especially when those systems run in environments you don't control. In his blog post, Surya Nittala, Sr. DevOps Engineer at DoubleVerify, shares a step-by-step guide to designing observability pipelines using Vector by Datadog. The post explains how the team collects and routes Prometheus metrics from Kubernetes clusters to a centralized Prometheus instance, efficiently and securely. Read more here: https://lnkd.in/dwjbH3vQ
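At its core, the pattern the post describes reduces to a scrape source feeding a remote_write sink. A minimal sketch of such a Vector pipeline, built from Vector's documented `prometheus_scrape` and `prometheus_remote_write` components (the endpoint addresses and token variable below are placeholders, not details from the article):

```yaml
# vector.yaml (sketch): scrape Prometheus-format metrics inside the
# cluster and forward them to a central Prometheus via remote_write.
sources:
  k8s_metrics:
    type: prometheus_scrape
    endpoints:
      - http://node-exporter.monitoring.svc:9100/metrics

sinks:
  central_prometheus:
    type: prometheus_remote_write
    inputs:
      - k8s_metrics
    endpoint: https://prometheus.example.com/api/v1/write
    auth:
      strategy: bearer
      token: "${PROM_TOKEN}"
```

Because Vector sits between the scrape targets and the central instance, transforms (filtering, relabeling) can be added in the middle without touching either end.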
How to design observability pipelines with Vector by Datadog
-
Hi! During my recent work on observability and high-availability metrics, I explored Thanos deeply. I've put together a document that explains everything I learned, from architecture to deployment and common pitfalls. I hope it helps other DevOps engineers who are trying to scale Prometheus and centralize metrics storage.
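For readers new to Thanos, the usual entry point is the Sidecar, which runs next to Prometheus and uploads TSDB blocks to object storage. A container-spec sketch under stated assumptions (image tags, paths, and the bucket config file are illustrative placeholders, not from the document):

```yaml
# Fragment of a Prometheus pod spec with a Thanos Sidecar container.
containers:
  - name: prometheus
    image: prom/prometheus:v2.53.0
    args:
      - --storage.tsdb.path=/prometheus
      # Thanos Sidecar uploads require 2h blocks with compaction
      # left to the Thanos Compactor:
      - --storage.tsdb.min-block-duration=2h
      - --storage.tsdb.max-block-duration=2h
  - name: thanos-sidecar
    image: quay.io/thanos/thanos:v0.35.0
    args:
      - sidecar
      - --tsdb.path=/prometheus
      - --prometheus.url=http://localhost:9090
      - --objstore.config-file=/etc/thanos/bucket.yaml
```

A Thanos Querier then fans out over the sidecars' gRPC endpoints, giving the single global query view that makes centralized metrics storage practical.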
-
💥 DevOps Nightmare Series #2: See Everything Clearly 🌐

Most teams think they have monitoring. Few truly have observability. And that difference decides whether you're asleep at 3 AM, or awake firefighting. 😰 Here is how I built end-to-end visibility using Loki, Prometheus, Grafana, and Jaeger across multi-cluster Kubernetes. ☁️

👀 1. Logs & Metrics: The Foundation
Started with Loki for centralized logs and Prometheus for metrics, then unified both inside Grafana, giving me all metrics, logs, and alerts in one view to troubleshoot faster. The one thing you cannot fix is the one you cannot see.
📊 Key metrics tracked: request latency (P95, P99), error rates, CPU/memory utilization, pod restarts, and custom business KPIs.

🔍 2. Distributed Tracing with Jaeger
Every request carries a trace_id across services, revealing slow microservices, DB bottlenecks, or API latency within seconds. ⚡ Grafana dashboards integrate logs, metrics, and traces for full visibility. Traces show complete request flows with span duration breakdowns, making bottleneck identification instant.

🔔 3. Smarter Alerts
Integrated Prometheus Alertmanager and Opsgenie for context-aware notifications. The right person gets alerted at the right time, not the whole team at 3 AM. 🚨
🎯 Alert rules based on SLO violations, anomaly detection, and threshold breaches with smart deduplication.

🤖 4. Infrastructure as Code
Automated everything with Terraform, Helm, and Kubernetes, making the observability stack reproducible, scalable, and version-controlled. No manual dashboards. No config drift. Just clean automation.

🚀 5. Results
✅ MTTR reduced by over 70%
✅ Logs, metrics & traces all in one place
✅ Issues detected before users even notice 😴
✅ Monitoring production clusters with 99.9% uptime

The best time to build observability was before your last outage. The next best time is today. ⏰

💬 What's your team's biggest challenge with observability or alert fatigue?
I've worked with several teams in the region on this, and it's always interesting to see the different approaches. Stay tuned to see how elite SRE teams further eliminate most 3 AM wake-up calls in the next post of this series. 🚀 #DubaiTech #GCCTech #UAETech #KSA #DevOps #SRE #Kubernetes #CloudEngineering #Observability #Terraform #Grafana
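As a concrete illustration of the SLO-style alerting mentioned in step 3, a Prometheus alerting rule on P99 latency might look like the sketch below. This is my own hedged example, not the author's actual rules; the metric and label names are placeholders:

```yaml
groups:
  - name: slo-alerts
    rules:
      - alert: HighP99Latency
        # P99 over a 5m rate window, per service, from a standard
        # request-duration histogram.
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "P99 latency above 500ms for {{ $labels.service }}"
```

The `for: 10m` clause is what keeps a brief spike from paging anyone at 3 AM; only a sustained breach fires.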
-
The Kubernetes Iceberg Nobody Talks About

Most people think Kubernetes is just kubectl run nginx and Deployments. They're wrong. ❄️ Kubernetes has layers, and most teams never go past the surface.

🌤️ Above the water: Pods, Deployments, ReplicaSets, ConfigMaps, Services
Easy to learn. Easy to demo. Easy to believe you understand Kubernetes.

🌊 Below the water: StatefulSets, DaemonSets, NetworkPolicy, PodSecurityPolicy (now replaced by Pod Security Admission), GitOps, Cluster Autoscaler
This is where real reliability, security, and scale are built. This is where teams either level up or break production at 3 AM.

🌑 Deep water: Admission Controllers, Mutating Webhooks, Operators, CRDs, Service Mesh, Node Hardening
This is where Kubernetes stops being just a container platform and becomes infrastructure engineering.

Here's the truth: Kubernetes isn't hard. Partial Kubernetes is hard. The more you understand below the surface, the more control you gain above it.

#DevOps #Kubernetes #CloudNative #SRE #PlatformEngineering #Containers #GitOps #Helm #Infra #Ops
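To make one "below the water" item concrete: a NetworkPolicy pair that default-denies all ingress in a namespace, then explicitly allows a single path. The namespace, labels, and port are illustrative, not from the post:

```yaml
# Deny all ingress to every pod in the namespace by default...
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: prod
spec:
  podSelector: {}
  policyTypes: ["Ingress"]
---
# ...then allow only frontend pods to reach the API on port 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: api
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - port: 8080
```

Note that NetworkPolicy is additive: without the default-deny policy, the allow rule changes nothing, which is exactly the kind of subtlety that separates the surface from the depths.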
-
Optimizing Terraform/OpenTofu State Management for Speed, Reliability, and Scalability https://lnkd.in/eH--qJBX Terraform and OpenTofu state files can make or break your infrastructure automation. Poor Terraform state management leads to slow deployments, team conflicts, and risky infrastructure changes that keep DevOps engineers up at night.
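A typical remedy for the team conflicts described is a remote backend with state locking. A hedged HCL sketch of the classic S3 + DynamoDB pattern (bucket, table, and key names are placeholders):

```hcl
terraform {
  backend "s3" {
    bucket         = "example-tf-state"           # placeholder bucket
    key            = "prod/network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "tf-state-locks"             # enables state locking
    encrypt        = true
  }
}
```

Splitting state per component (the `key` path above) also speeds up plans: each run only refreshes the resources in its own state file instead of the entire estate.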
-
🚀 New Video: My Daily Tasks as a Senior SRE: Deploying an Application from Scratch

Most people think an SRE starts working only after an app is built. The reality is very different. In this video, I walk through how an SRE gets involved right from the moment leadership decides to build a new application. I take you step by step through the real lifecycle of getting a backend service from idea to production across AWS environments.

You'll see how I
• design the AWS architecture that goes to the Architecture Review Board
• support developers while they build the backend service
• prepare dev, QA, and prod EKS clusters across multiple AWS accounts and regions
• build the full CI/CD flow using GitLab CI, ArgoCD, ECR, and Snyk
• run canary deployments across Europe and the US
• enable monitoring with Prometheus and Grafana
• follow the production release process with CRQs and validation checks

To make it practical, I use a simple three-tier app example and focus on how the backend is deployed as pods into EKS. If you've ever wondered what an SRE actually does day to day, or how a real enterprise rollout works from scratch, this video breaks it down in a clear and realistic way.

🎥 Watch the full video here: https://lnkd.in/grUiRM27

If you find it useful, feel free to like, share, or drop your questions in the comments. Happy to help the community grow.

#SRE #SiteReliabilityEngineering #DevOps #AWS #AWSEKS #Kubernetes #CloudEngineering #CloudArchitecture #GitLabCI #ArgoCD #CICD #ECR #Snyk #Observability #Prometheus #Grafana #Microservices #ProductionDeployment #CanaryDeployment #TechApricate #DevOpsEngineering #BackendEngineering #SoftwareDelivery #InfrastructureAutomation
My Daily Tasks as Senior Site Reliability Engineer SRE | How to Deploy an Application from Scratch
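The CI/CD flow described (GitLab CI building to ECR, Snyk scanning, ArgoCD syncing a GitOps repo) might be wired roughly like the sketch below. All variable names, repo paths, and stage names here are my assumptions, not details taken from the video:

```yaml
# .gitlab-ci.yml (sketch)
stages: [build, scan, deploy]

build-image:
  stage: build
  script:
    - docker build -t $ECR_REPO:$CI_COMMIT_SHORT_SHA .
    - docker push $ECR_REPO:$CI_COMMIT_SHORT_SHA

snyk-scan:
  stage: scan
  script:
    - snyk container test $ECR_REPO:$CI_COMMIT_SHORT_SHA

update-manifests:
  stage: deploy
  script:
    # Bump the image tag in the GitOps repo; ArgoCD detects the
    # commit and syncs the new tag into EKS.
    - git clone $GITOPS_REPO gitops && cd gitops
    - yq -i '.image.tag = strenv(CI_COMMIT_SHORT_SHA)' apps/backend/values.yaml
    - git commit -am "deploy $CI_COMMIT_SHORT_SHA" && git push
```

The key design point is that CI never talks to the cluster directly: it only writes to Git, and ArgoCD pulls from Git, which is what makes canary rollouts and multi-region promotion auditable.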
-
Assumptions are your worst enemy when something breaks in production. Early in my DevOps journey, I learned this the hard way - we had an intermittent issue in a deployment pipeline, and everyone had a “theory.” Network latency? DNS? A misconfigured dependency? Turns out… it was a missing environment variable buried in a container. What fixed it wasn’t luck. It was observability - structured logging, metrics, and tracing. Once we could see what was happening, every guess became a fact. Good engineering is more about making things observable than just making them work. #DevOps #Observability
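The lesson generalizes: emit logs as structured JSON and check configuration loudly at startup. A minimal Python sketch of both ideas (the field names and the required-variable list are invented for illustration, not from the incident described):

```python
import json
import logging
import os
import sys
import uuid


class JsonFormatter(logging.Formatter):
    """Render every log record as one JSON line (machine-parseable)."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Carry through selected extra fields passed via `extra=`.
        for key in ("trace_id", "missing_vars"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("deploy")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Check required configuration at startup: the kind of check that would
# have surfaced the buried environment variable immediately.
REQUIRED_ENV = ["DATABASE_URL", "API_KEY"]  # illustrative names
missing = [v for v in REQUIRED_ENV if v not in os.environ]
if missing:
    log.error(
        "missing required environment variables",
        extra={"trace_id": str(uuid.uuid4()), "missing_vars": missing},
    )
```

Run with a variable unset and the process emits a single JSON line naming exactly what is missing, turning the "everyone has a theory" phase into a one-line fact.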
-
This week’s DevOps Bulletin includes one of my favorite sections: the FinOps tip. Kubernetes clusters often burn 30 to 50 percent of compute because requests do not match real usage. The tip explains simple steps to fix that with tools like VPA, Goldilocks, and Kubecost. The rest of the issue covers the biggest stories in DevOps plus new tools such as CronMaster, ctop, Wave, Gerbil, and Kratos.
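The request-right-sizing the tip describes usually starts with a VerticalPodAutoscaler in recommendation-only mode, so it suggests values without evicting pods. A manifest sketch (the workload name is a placeholder):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: backend-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend
  updatePolicy:
    updateMode: "Off"   # recommend only; apply suggestions manually
```

`kubectl describe vpa backend-vpa` then shows lower-bound, target, and upper-bound recommendations that can be compared against the requests currently burning that 30 to 50 percent of idle compute.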
-
Second blog in my DevOps series is out, which covers Docker Networks in depth. Read it here: https://lnkd.in/gAT3FWwR I’ve explained how containers communicate and how different network types work. Please share your suggestions or corrections if you think I got something wrong.
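As a companion to the blog's topic, a small Compose file demonstrates the key property of user-defined bridge networks: containers resolve each other by service name, and network attachment controls reachability. The service names and images here are my own example, not from the post:

```yaml
# docker-compose.yml: `api` can reach both `web` and `db` by name;
# `web` and `db` are isolated from each other.
services:
  web:
    image: nginx:alpine
    networks: [frontend]
  api:
    image: alpine
    command: ["sleep", "1d"]
    networks: [frontend, backend]
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_PASSWORD: example
    networks: [backend]

networks:
  frontend:
  backend:
```

From inside `api`, the hostname `db` resolves via Docker's embedded DNS; from inside `web`, it does not, because name resolution only works between containers sharing a user-defined network.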
-
At Backblaze, their entire storage architecture is built on a simple idea: keep the design boring, predictable, and transparent. That’s why their engineers can trace failures fast, recover even faster, and operate massive storage systems without drowning in complexity. In DevOps, sometimes the smartest thing you can do… is build something simple enough that it never surprises you. How simple is your infrastructure, really? 👇 #DevOps #ServerScribe #Backblaze #CloudStorage #Reliability #EngineeringCulture
-
Day 51 of My DevOps Journey

Today, I focused on the theoretical part of Kubernetes Services and explored their advantages in detail.

What I Learned
>>> Exposing Applications: Understood how to expose an application to the outside world using a LoadBalancer Service, and how a NodePort Service opens a static port on every node's IP. (For traffic that should stay inside the cluster, a ClusterIP Service is the right choice.)
>>> Service Discovery: Learned how Kubernetes automatically discovers services using labels and selectors. Understood what happens if labels and selectors don't match: the Service won't find any matching Pods, and traffic won't be routed.
>>> Traffic Distribution: Explored how traffic is evenly distributed among multiple Pods using Kubernetes' built-in load balancing mechanism.

Today was about grasping the theoretical foundation of Services, from exposure methods to traffic management and service discovery. Tomorrow, I plan to practice these concepts hands-on to strengthen my understanding.

#DevOps #Kubernetes #K8s #Services #LoadBalancer #NodePort #ServiceDiscovery #PodCommunication #LoadBalancing #DevOpsJourney
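The concepts above fit in one small manifest. A minimal NodePort Service sketch tying together the selector and the exposure ports (names, labels, and port numbers are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: NodePort
  selector:
    app: web          # must match the Pod template labels exactly
  ports:
    - port: 80        # the Service's in-cluster (ClusterIP) port
      targetPort: 8080  # the container port traffic is forwarded to
      nodePort: 30080   # static port opened on every node's IP
```

If `app: web` matches no Pods, `kubectl get endpoints web` shows an empty list, which is the fastest way to diagnose the mismatched-selector case described above.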