Observability is essential to ensuring the reliability of distributed systems, especially when those systems run in environments you don't control. In his blog post, Surya Nittala, Sr. DevOps Engineer at DoubleVerify, shares a step-by-step guide to designing observability pipelines using Vector by Datadog. The post explains how the team collects and routes Prometheus metrics from Kubernetes clusters to a centralized Prometheus instance, efficiently and securely. Read more here: https://lnkd.in/dwjbH3vQ
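At its core, the pattern the post describes reduces to a scrape source feeding a remote_write sink. A minimal sketch of such a Vector pipeline, built from Vector's documented `prometheus_scrape` and `prometheus_remote_write` components (the endpoint addresses and token variable below are placeholders, not details from the article):

```yaml
# vector.yaml (sketch): scrape Prometheus-format metrics inside the
# cluster and forward them to a central Prometheus via remote_write.
sources:
  k8s_metrics:
    type: prometheus_scrape
    endpoints:
      - http://node-exporter.monitoring.svc:9100/metrics

sinks:
  central_prometheus:
    type: prometheus_remote_write
    inputs:
      - k8s_metrics
    endpoint: https://prometheus.example.com/api/v1/write
    auth:
      strategy: bearer
      token: "${PROM_TOKEN}"
```

Because Vector sits between the scrape targets and the central instance, transforms (filtering, relabeling) can be added in the middle without touching either end.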
How to design observability pipelines with Vector by Datadog
-
Hi! During my recent work on observability and high-availability metrics, I explored Thanos deeply. I've put together a document that explains everything I learned, from architecture to deployment and common pitfalls. I hope it helps other DevOps engineers who are trying to scale Prometheus and centralize metrics storage.
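For readers new to Thanos, the usual entry point is the Sidecar, which runs next to Prometheus and uploads TSDB blocks to object storage. A container-spec sketch under stated assumptions (image tags, paths, and the bucket config file are illustrative placeholders, not from the document):

```yaml
# Fragment of a Prometheus pod spec with a Thanos Sidecar container.
containers:
  - name: prometheus
    image: prom/prometheus:v2.53.0
    args:
      - --storage.tsdb.path=/prometheus
      # Thanos Sidecar uploads require 2h blocks with compaction
      # left to the Thanos Compactor:
      - --storage.tsdb.min-block-duration=2h
      - --storage.tsdb.max-block-duration=2h
  - name: thanos-sidecar
    image: quay.io/thanos/thanos:v0.35.0
    args:
      - sidecar
      - --tsdb.path=/prometheus
      - --prometheus.url=http://localhost:9090
      - --objstore.config-file=/etc/thanos/bucket.yaml
```

A Thanos Querier then fans out over the sidecars' gRPC endpoints, giving the single global query view that makes centralized metrics storage practical.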
-
💥 DevOps Nightmare Series #2: See Everything Clearly 🌐

Most teams think they have monitoring. Few truly have observability. And that difference decides whether you're asleep at 3 AM, or awake firefighting. 😰 Here is how I built end-to-end visibility using Loki, Prometheus, Grafana, and Jaeger across multi-cluster Kubernetes. ☁️

👀 1. Logs & Metrics: The Foundation
Started with Loki for centralized logs and Prometheus for metrics, then unified both inside Grafana, giving me all metrics, logs, and alerts in one view to troubleshoot faster. The one thing you cannot fix is the one you cannot see.
📊 Key metrics tracked: request latency (P95, P99), error rates, CPU/memory utilization, pod restarts, and custom business KPIs.

🔍 2. Distributed Tracing with Jaeger
Every request carries a trace_id across services, revealing slow microservices, DB bottlenecks, or API latency within seconds. ⚡ Grafana dashboards integrate logs, metrics, and traces for full visibility. Traces show complete request flows with span duration breakdowns, making bottleneck identification instant.

🔔 3. Smarter Alerts
Integrated Prometheus Alertmanager and Opsgenie for context-aware notifications. The right person gets alerted at the right time, not the whole team at 3 AM. 🚨
🎯 Alert rules based on SLO violations, anomaly detection, and threshold breaches with smart deduplication.

🤖 4. Infrastructure as Code
Automated everything with Terraform, Helm, and Kubernetes, making the observability stack reproducible, scalable, and version-controlled. No manual dashboards. No config drift. Just clean automation.

🚀 5. Results
✅ MTTR reduced by over 70%
✅ Logs, metrics & traces all in one place
✅ Issues detected before users even notice 😴
✅ Monitoring production clusters with 99.9% uptime

The best time to build observability was before your last outage. The next best time is today. ⏰

💬 What's your team's biggest challenge with observability or alert fatigue?
I've worked with several teams in the region on this, and it's always interesting to see the different approaches. Stay tuned to see how elite SRE teams further eliminate most 3 AM wake-up calls in the next post of this series. 🚀 #DubaiTech #GCCTech #UAETech #KSA #DevOps #SRE #Kubernetes #CloudEngineering #Observability #Terraform #Grafana
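As a concrete illustration of the SLO-style alerting mentioned in step 3, a Prometheus alerting rule on P99 latency might look like the sketch below. This is my own hedged example, not the author's actual rules; the metric and label names are placeholders:

```yaml
groups:
  - name: slo-alerts
    rules:
      - alert: HighP99Latency
        # P99 over a 5m rate window, per service, from a standard
        # request-duration histogram.
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "P99 latency above 500ms for {{ $labels.service }}"
```

The `for: 10m` clause is what keeps a brief spike from paging anyone at 3 AM; only a sustained breach fires.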
-
The Kubernetes Iceberg Nobody Talks About

Most people think Kubernetes is just kubectl run nginx and Deployments. They're wrong. ❄️ Kubernetes has layers, and most teams never go past the surface.

🌤️ Above the water: Pods, Deployments, ReplicaSets, ConfigMaps, Services
Easy to learn. Easy to demo. Easy to believe you understand Kubernetes.

🌊 Below the water: StatefulSets, DaemonSets, NetworkPolicy, PodSecurityPolicy (now replaced by Pod Security Admission), GitOps, Cluster Autoscaler
This is where real reliability, security, and scale are built. This is where teams either level up or break production at 3 AM.

🌑 Deep water: Admission Controllers, Mutating Webhooks, Operators, CRDs, Service Mesh, Node Hardening
This is where Kubernetes stops being just a container platform and becomes infrastructure engineering.

Here's the truth: Kubernetes isn't hard. Partial Kubernetes is hard. The more you understand below the surface, the more control you gain above it.

#DevOps #Kubernetes #CloudNative #SRE #PlatformEngineering #Containers #GitOps #Helm #Infra #Ops
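To make one "below the water" item concrete: a NetworkPolicy pair that default-denies all ingress in a namespace, then explicitly allows a single path. The namespace, labels, and port are illustrative, not from the post:

```yaml
# Deny all ingress to every pod in the namespace by default...
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: prod
spec:
  podSelector: {}
  policyTypes: ["Ingress"]
---
# ...then allow only frontend pods to reach the API on port 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: api
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - port: 8080
```

Note that NetworkPolicy is additive: without the default-deny policy, the allow rule changes nothing, which is exactly the kind of subtlety that separates the surface from the depths.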
-
Optimizing Terraform/OpenTofu State Management for Speed, Reliability, and Scalability https://lnkd.in/eH--qJBX Terraform and OpenTofu state files can make or break your infrastructure automation. Poor Terraform state management leads to slow deployments, team conflicts, and risky infrastructure changes that keep DevOps engineers up at night.
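A typical remedy for the team conflicts described is a remote backend with state locking. A hedged HCL sketch of the classic S3 + DynamoDB pattern (bucket, table, and key names are placeholders):

```hcl
terraform {
  backend "s3" {
    bucket         = "example-tf-state"           # placeholder bucket
    key            = "prod/network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "tf-state-locks"             # enables state locking
    encrypt        = true
  }
}
```

Splitting state per component (the `key` path above) also speeds up plans: each run only refreshes the resources in its own state file instead of the entire estate.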
-
🚀 New Video: My Daily Tasks as a Senior SRE: Deploying an Application from Scratch

Most people think an SRE starts working only after an app is built. The reality is very different. In this video, I walk through how an SRE gets involved right from the moment leadership decides to build a new application. I take you step by step through the real lifecycle of getting a backend service from idea to production across AWS environments.

You'll see how I
• design the AWS architecture that goes to the Architecture Review Board
• support developers while they build the backend service
• prepare dev, QA, and prod EKS clusters across multiple AWS accounts and regions
• build the full CI/CD flow using GitLab CI, ArgoCD, ECR, and Snyk
• run canary deployments across Europe and the US
• enable monitoring with Prometheus and Grafana
• follow the production release process with CRQs and validation checks

To make it practical, I use a simple three-tier app example and focus on how the backend is deployed as pods into EKS. If you've ever wondered what an SRE actually does day to day, or how a real enterprise rollout works from scratch, this video breaks it down in a clear and realistic way.

🎥 Watch the full video here: https://lnkd.in/grUiRM27

If you find it useful, feel free to like, share, or drop your questions in the comments. Happy to help the community grow.

#SRE #SiteReliabilityEngineering #DevOps #AWS #AWSEKS #Kubernetes #CloudEngineering #CloudArchitecture #GitLabCI #ArgoCD #CICD #ECR #Snyk #Observability #Prometheus #Grafana #Microservices #ProductionDeployment #CanaryDeployment #TechApricate #DevOpsEngineering #BackendEngineering #SoftwareDelivery #InfrastructureAutomation
My Daily Tasks as Senior Site Reliability Engineer SRE | How to Deploy an Application from Scratch
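The CI/CD flow described (GitLab CI building to ECR, Snyk scanning, ArgoCD syncing a GitOps repo) might be wired roughly like the sketch below. All variable names, repo paths, and stage names here are my assumptions, not details taken from the video:

```yaml
# .gitlab-ci.yml (sketch)
stages: [build, scan, deploy]

build-image:
  stage: build
  script:
    - docker build -t $ECR_REPO:$CI_COMMIT_SHORT_SHA .
    - docker push $ECR_REPO:$CI_COMMIT_SHORT_SHA

snyk-scan:
  stage: scan
  script:
    - snyk container test $ECR_REPO:$CI_COMMIT_SHORT_SHA

update-manifests:
  stage: deploy
  script:
    # Bump the image tag in the GitOps repo; ArgoCD detects the
    # commit and syncs the new tag into EKS.
    - git clone $GITOPS_REPO gitops && cd gitops
    - yq -i '.image.tag = strenv(CI_COMMIT_SHORT_SHA)' apps/backend/values.yaml
    - git commit -am "deploy $CI_COMMIT_SHORT_SHA" && git push
```

The key design point is that CI never talks to the cluster directly: it only writes to Git, and ArgoCD pulls from Git, which is what makes canary rollouts and multi-region promotion auditable.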
-
Assumptions are your worst enemy when something breaks in production. Early in my DevOps journey, I learned this the hard way - we had an intermittent issue in a deployment pipeline, and everyone had a “theory.” Network latency? DNS? A misconfigured dependency? Turns out… it was a missing environment variable buried in a container. What fixed it wasn’t luck. It was observability - structured logging, metrics, and tracing. Once we could see what was happening, every guess became a fact. Good engineering is more about making things observable than just making them work. #DevOps #Observability
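The lesson generalizes: emit logs as structured JSON and check configuration loudly at startup. A minimal Python sketch of both ideas (the field names and the required-variable list are invented for illustration, not from the incident described):

```python
import json
import logging
import os
import sys
import uuid


class JsonFormatter(logging.Formatter):
    """Render every log record as one JSON line (machine-parseable)."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Carry through selected extra fields passed via `extra=`.
        for key in ("trace_id", "missing_vars"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("deploy")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Check required configuration at startup: the kind of check that would
# have surfaced the buried environment variable immediately.
REQUIRED_ENV = ["DATABASE_URL", "API_KEY"]  # illustrative names
missing = [v for v in REQUIRED_ENV if v not in os.environ]
if missing:
    log.error(
        "missing required environment variables",
        extra={"trace_id": str(uuid.uuid4()), "missing_vars": missing},
    )
```

Run with a variable unset and the process emits a single JSON line naming exactly what is missing, turning the "everyone has a theory" phase into a one-line fact.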
-
This week’s DevOps Bulletin includes one of my favorite sections: the FinOps tip. Kubernetes clusters often burn 30 to 50 percent of compute because requests do not match real usage. The tip explains simple steps to fix that with tools like VPA, Goldilocks, and Kubecost. The rest of the issue covers the biggest stories in DevOps plus new tools such as CronMaster, ctop, Wave, Gerbil, and Kratos.
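The request-right-sizing the tip describes usually starts with a VerticalPodAutoscaler in recommendation-only mode, so it suggests values without evicting pods. A manifest sketch (the workload name is a placeholder):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: backend-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend
  updatePolicy:
    updateMode: "Off"   # recommend only; apply suggestions manually
```

`kubectl describe vpa backend-vpa` then shows lower-bound, target, and upper-bound recommendations that can be compared against the requests currently burning that 30 to 50 percent of idle compute.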
-
Second blog in my DevOps series is out, which covers Docker Networks in depth. Read it here: https://lnkd.in/gAT3FWwR I’ve explained how containers communicate and how different network types work. Please share your suggestions or corrections if you think I got something wrong.
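As a companion to the blog's topic, a small Compose file demonstrates the key property of user-defined bridge networks: containers resolve each other by service name, and network attachment controls reachability. The service names and images here are my own example, not from the post:

```yaml
# docker-compose.yml: `api` can reach both `web` and `db` by name;
# `web` and `db` are isolated from each other.
services:
  web:
    image: nginx:alpine
    networks: [frontend]
  api:
    image: alpine
    command: ["sleep", "1d"]
    networks: [frontend, backend]
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_PASSWORD: example
    networks: [backend]

networks:
  frontend:
  backend:
```

From inside `api`, the hostname `db` resolves via Docker's embedded DNS; from inside `web`, it does not, because name resolution only works between containers sharing a user-defined network.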
-
At Backblaze, their entire storage architecture is built on a simple idea: keep the design boring, predictable, and transparent. That’s why their engineers can trace failures fast, recover even faster, and operate massive storage systems without drowning in complexity. In DevOps, sometimes the smartest thing you can do… is build something simple enough that it never surprises you. How simple is your infrastructure, really? 👇 #DevOps #ServerScribe #Backblaze #CloudStorage #Reliability #EngineeringCulture
-
Day 51 of My DevOps Journey

Today, I focused on the theoretical part of Kubernetes Services and explored their advantages in detail.

What I Learned
>>> Exposing Applications: Understood how to expose an application to the outside world using a LoadBalancer Service, and how a NodePort Service opens a static port on every node's IP. (For traffic that should stay inside the cluster, a ClusterIP Service is the right choice.)
>>> Service Discovery: Learned how Kubernetes automatically discovers services using labels and selectors. Understood what happens if labels and selectors don't match: the Service won't find any matching Pods, and traffic won't be routed.
>>> Traffic Distribution: Explored how traffic is evenly distributed among multiple Pods using Kubernetes' built-in load balancing mechanism.

Today was about grasping the theoretical foundation of Services, from exposure methods to traffic management and service discovery. Tomorrow, I plan to practice these concepts hands-on to strengthen my understanding.

#DevOps #Kubernetes #K8s #Services #LoadBalancer #NodePort #ServiceDiscovery #PodCommunication #LoadBalancing #DevOpsJourney
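The concepts above fit in one small manifest. A minimal NodePort Service sketch tying together the selector and the exposure ports (names, labels, and port numbers are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: NodePort
  selector:
    app: web          # must match the Pod template labels exactly
  ports:
    - port: 80        # the Service's in-cluster (ClusterIP) port
      targetPort: 8080  # the container port traffic is forwarded to
      nodePort: 30080   # static port opened on every node's IP
```

If `app: web` matches no Pods, `kubectl get endpoints web` shows an empty list, which is the fastest way to diagnose the mismatched-selector case described above.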