Baseten used NVIDIA Dynamo to double inference speed for long-context code generation and increased throughput by 1.6x. Dynamo simplifies multi-node inference on Kubernetes, helping us scale deployments while reducing costs. Read the full blog ⏬ https://lnkd.in/e2_K33Y7
Baseten boosts code generation with NVIDIA Dynamo
More Relevant Posts
-
The ability to autoscale GPU workloads is critical both to meet performance goals and to optimize costs. This walkthrough shows one way to use the NVIDIA GPU Device Plugin add-on for OKE (Oracle Container Engine for Kubernetes) together with common open-source telemetry tools to scale pods on custom metrics relevant to AI/ML workloads. https://lnkd.in/eTiPiHX5
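To make the pattern concrete, here is a minimal Python sketch (not the walkthrough's code) of the control loop behind custom-metric autoscaling: read GPU utilization from the DCGM exporter via Prometheus, then patch the Deployment's replica count. The Prometheus URL, metric query, deployment name, and thresholds are assumptions; in a real OKE setup you would more likely expose the metric through Prometheus Adapter and let a HorizontalPodAutoscaler run this loop for you.

```python
# Hypothetical sketch: scale an inference Deployment on average GPU utilization
# reported by the DCGM exporter through Prometheus. The Prometheus URL, metric
# query, deployment name, and thresholds are assumptions -- adapt to your cluster.
import requests
from kubernetes import client, config

PROMETHEUS_URL = "http://prometheus.monitoring:9090"   # assumed in-cluster Prometheus service
METRIC_QUERY = "avg(DCGM_FI_DEV_GPU_UTIL)"             # DCGM exporter GPU utilization (0-100)
DEPLOYMENT, NAMESPACE = "llm-inference", "default"     # hypothetical workload
TARGET_UTIL, MIN_REPLICAS, MAX_REPLICAS = 70.0, 1, 8

def current_gpu_utilization() -> float:
    """Ask Prometheus for the current average GPU utilization."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": METRIC_QUERY})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

if __name__ == "__main__":
    config.load_incluster_config()  # use config.load_kube_config() when running outside the cluster
    apps = client.AppsV1Api()
    current = apps.read_namespaced_deployment_scale(DEPLOYMENT, NAMESPACE).spec.replicas
    util = current_gpu_utilization()
    # Proportional rule: keep per-replica GPU utilization near the target.
    desired = max(MIN_REPLICAS, min(MAX_REPLICAS, round(current * util / TARGET_UTIL)))
    if desired != current:
        apps.patch_namespaced_deployment_scale(
            DEPLOYMENT, NAMESPACE, {"spec": {"replicas": desired}}
        )
```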
-
Grove is now part of NVIDIA Dynamo! Thrilled to share that Grove, a Kubernetes API for orchestrating modern #AI inference workloads, is now part of Dynamo as a modular, open-source component. As inference systems grow from single models to complex, multicomponent pipelines, scaling and coordination have become harder than ever. Grove makes it simple, defining your entire inference stack as one #Kubernetes resource that automatically handles scheduling, scaling, and topology-aware placement across thousands of GPUs. Now integrated with Dynamo, Grove brings a faster, more declarative way to run next-generation inference systems at scale. Explore the full story and step-by-step guide in our latest blog post. Link in comments below 👇
-
Coordinated scaling is critical to performance when deploying a large-scale inference pipeline. To make that easier, we integrated NVIDIA Grove into NVIDIA Dynamo. Learn more in the Grove announcement above!
-
An article about an addition to Dynamo, NVIDIA's inference load balancer and optimizer (see the Grove announcement above). If you're interested in deploying agents at scale, or even just want to understand how LLM execution is sequenced across multiple GPUs, Dynamo is worth studying.
-
Unlock the power of accelerated computing 🌟 with Azure Container Instances supporting GPU workloads! 🚀 Azure Container Instances (ACI) with GPU support brings the perfect solution for high-performance computing and machine learning tasks. Imagine the ease of deploying containers with the horsepower of a GPU, all without managing complex infrastructure. 🎉 With ACI, you can seamlessly scale your GPU-accelerated tasks in containers, accessing NVIDIA GPUs to handle intensive computational workloads efficiently. Real-world uses are vast – from AI models that need rapid prototyping, to running simulations or visual processing at scale. Have you integrated Azure Container Instances with GPU Workloads in your projects yet? What was your experience and any challenges faced? 🤔 #AzureContainerInstances #GPUWorkloads #CloudComputing #HighPerformanceComputing #AzureTech
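If you want to try this, here is a hedged sketch using the azure-mgmt-containerinstance Python SDK. The subscription, resource group, region, image, and GPU SKU are placeholders rather than a definitive recipe, and GPU SKU availability varies by region.

```python
# Hedged sketch: create a GPU-backed Azure Container Instance with the Python SDK
# (azure-mgmt-containerinstance). Subscription, resource group, region, image, and
# GPU SKU below are placeholders -- check GPU availability in your target region.
from azure.identity import DefaultAzureCredential
from azure.mgmt.containerinstance import ContainerInstanceManagementClient
from azure.mgmt.containerinstance.models import (
    Container, ContainerGroup, GpuResource, ResourceRequests, ResourceRequirements,
)

SUBSCRIPTION_ID = "<subscription-id>"          # placeholder
RESOURCE_GROUP, REGION = "ml-rg", "eastus"     # assumed resource group / region

client = ContainerInstanceManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

container = Container(
    name="gpu-inference",
    image="nvcr.io/nvidia/pytorch:24.08-py3",  # example CUDA-enabled image
    command=["python", "-c", "import torch; print(torch.cuda.is_available())"],
    resources=ResourceRequirements(
        requests=ResourceRequests(
            cpu=4.0,
            memory_in_gb=16.0,
            gpu=GpuResource(count=1, sku="V100"),  # GPU SKUs vary by region
        )
    ),
)

group = ContainerGroup(
    location=REGION,
    containers=[container],
    os_type="Linux",
    restart_policy="Never",
)

# Long-running operation: provision the container group and wait for completion.
poller = client.container_groups.begin_create_or_update(RESOURCE_GROUP, "gpu-demo", group)
print(poller.result().provisioning_state)
```

The same container group can equally be described declaratively and deployed with the Azure CLI if you prefer not to script it.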
-
Great news - Red Hat and NVIDIA formed an agreement to distribute the NVIDIA CUDA Toolkit across the Red Hat portfolio.
-
'Hewlett Packard Enterprise news from #Nvidia #GTC25 includes a new #PrivateCloud #AI developer kit, Nvidia AI blueprints, GPU optimization capabilities, and servers built with Nvidia Blackwell Ultra and Blackwell architecture.'
-
Most teams optimize GPU autoscaling for hyperscalers. But what about neoclouds? Private GPU clouds? Bare metal? The result: idle GPUs, complex ops, and infrastructure that doesn't scale across environments. Join Lukas Gentele at Cloud Native + Kubernetes AI Day for his keynote on autoscaling GPU clusters anywhere: hyperscalers, neoclouds, and bare metal.
📅 Monday, Nov 10 | 10:25am EST
📍 Building B | Level 4 | B401-402
Learn how vCluster integrated Karpenter with Terraform/OpenTofu, ClusterAPI, KubeVirt, and NVIDIA BCM to bring dynamic autoscaling to any environment. Reduce idle GPU time. Simplify operations. Run consistent AI infrastructure everywhere. See you in Atlanta! 🚀 #KubeCon #CloudNativeCon #CNK8sAIDay
-
I recently deployed several LLMs — Llama 3.2 (1B & 3B) and Mistral (7B) — on the NVIDIA Jetson AGX Thor, and I was genuinely amazed by its performance. The response speed was incredibly smooth — almost comparable to what you'd expect from cloud-hosted models like GPT-5 or Sonnet 4.5. If you're planning to run LLMs on the AGX Thor, I highly recommend using Ollama. I initially spent over 10 hours trying to get TensorRT-LLM running on the device and kept hitting one issue after another. With Ollama, everything just worked seamlessly.
My setup:
- Platform: NVIDIA AGX Thor
- JetPack: R38.2 (Aug 2025 release)
- Architecture: ARM64 (aarch64)
- CUDA: 13.0
- Driver: 580.00
- GPU: NVIDIA Thor (SM 11.0)
If you're deploying LLMs via Docker or Kubernetes, you can use the following CUDA-optimized image for Ollama: https://lnkd.in/gdmzCsRF
The AGX Thor is truly redefining edge AI — it's impressive to see LLMs running this fluidly on a local device.
--------
If you are looking for a managed solution to host on-prem GPU servers, Spectro Cloud's Palette has you covered:
- https://lnkd.in/gKAVTUFt
- https://lnkd.in/gvPQPD_D
#nvidia #thor #llm #inference #onprem #gpu #cloud
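If it helps anyone reproducing this setup, here's a small Python sketch of how you might call the Ollama server once it's running on the Thor. It uses Ollama's standard HTTP endpoint on port 11434; the model tag is just one of the models mentioned above (pull it first, e.g. with `ollama pull llama3.2:3b`).

```python
# Minimal sketch: query a locally served model through Ollama's HTTP API
# (Ollama listens on port 11434 by default). The model tag and prompt are
# examples -- pull the model first with `ollama pull llama3.2:3b`.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def ask(model: str, prompt: str) -> str:
    """Send a single non-streaming generation request and return the text."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    resp = requests.post(OLLAMA_URL, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    print(ask("llama3.2:3b", "Summarize what makes edge inference attractive."))
```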
-
Learn how NVIDIA Run:ai's GPU optimization capabilities can be extended across #AWS hybrid & edge environments, from Local Zones to Outposts racks & #EKS Hybrid Nodes 🌐🤖💪 https://go.aws/4oKE6j3 This solution's support for dynamic GPU fractions, node-level scheduling & priority-based sharing is valuable in edge scenarios where resource optimization & latency requirements are paramount. Read the blog to learn how to leverage these features, which have been shown to improve GPU utilization from 25% to 75%.
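To illustrate what a "dynamic GPU fraction" request can look like in practice, here's a hedged Python sketch that creates a pod asking for half a GPU. The `gpu-fraction` annotation and `runai-scheduler` scheduler name follow how Run:ai typically documents fractional GPUs, but treat them as assumptions here rather than details from the linked blog, and verify against the Run:ai version in your cluster.

```python
# Hedged sketch of a fractional-GPU pod request in the Run:ai style. The
# "gpu-fraction" annotation and "runai-scheduler" scheduler name reflect how
# Run:ai commonly documents fractional GPUs -- treat both as assumptions and
# verify against the Run:ai version deployed in your cluster.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="fractional-inference",
        annotations={"gpu-fraction": "0.5"},  # assumed Run:ai fraction request
    ),
    spec=client.V1PodSpec(
        scheduler_name="runai-scheduler",      # assumed Run:ai scheduler name
        containers=[
            client.V1Container(
                name="inference",
                image="nvcr.io/nvidia/tritonserver:24.08-py3",  # example image
                # Note: no nvidia.com/gpu resource request -- the fraction
                # annotation replaces it under fractional scheduling.
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```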