Great update from the vLLM team: the new plugin system is a big step forward for flexible, upstream-safe LLM serving. This aligns perfectly with ZFLOW AI’s simulation-driven approach.

We need a serving engine that can be customized (scheduling, KV-cache behavior, model variants) without forking, and vLLM’s plugin layer gives us exactly that. It lets us integrate custom optimization logic directly into our simulation stack and deploy the same logic in production through ZFLOW Serve, closing the loop between simulation → optimization → real deployment.

Excited to explore deeper synergy here.

#AIInfrastructure #LLMInference #vLLM #ZFLOWAI
Need to customize vLLM? Don't fork it. 🔌

vLLM's plugin system lets you inject surgical modifications without maintaining a fork or monkey-patching entire modules. Blog by Dhruvil Bhatt from AWS SageMaker 👇

Why plugins > forks:
• vLLM releases every 2 weeks with hundreds of PRs merged
• Forks require constant rebasing and conflict resolution
• Monkey patches break on every vLLM upgrade

How it works (see the sketch after this post):
• Use VLLMPatch[TargetClass] for precise, class-level mods
• Register via the vllm.general_plugins entry point
• Control patches with env vars (VLLM_CUSTOM_PATCHES)
• Version-guard with the @min_vllm_version decorator

Example: add priority scheduling to vLLM's scheduler in ~20 lines. One Docker image serves multiple models with different patches enabled via environment variables.

The plugin loads in ALL vLLM processes (main, workers, GPU/CPU) before any inference starts, ensuring consistent behavior across distributed setups.

Read the full implementation guide with code examples: https://lnkd.in/e4U_xeFa
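To make the mechanism concrete, here is a minimal sketch of what such a plugin package can look like. Only the vllm.general_plugins entry-point group and the VLLM_CUSTOM_PATCHES gating come from the post above; the package layout, the register()/patch function names, and the scheduler internals are hypothetical stand-ins for the blog's VLLMPatch and @min_vllm_version helpers, not vLLM or blog APIs.

```python
# my_vllm_plugin/patches.py
#
# Sketch of a vLLM "general plugin" (hypothetical package, not from the blog).
# Register it so vLLM discovers and runs it in every process, e.g. in
# pyproject.toml:
#
#   [project.entry-points."vllm.general_plugins"]
#   my_patches = "my_vllm_plugin.patches:register"
#
import os
from collections import deque


def register() -> None:
    """Entry-point target: vLLM invokes this in each process before inference.

    Patches are gated by the VLLM_CUSTOM_PATCHES env var (the convention from
    the post), so a single Docker image can enable different patch sets per
    deployment.
    """
    enabled = {
        name.strip()
        for name in os.environ.get("VLLM_CUSTOM_PATCHES", "").split(",")
        if name.strip()
    }
    if "priority_scheduling" in enabled:
        _patch_priority_scheduling()


def _patch_priority_scheduling() -> None:
    """Illustrative class-level patch: order waiting requests by priority.

    The import path, method name, and `priority` attribute are assumptions
    about the vLLM 0.x scheduler; verify them against your vLLM version, or
    use the blog's VLLMPatch / @min_vllm_version helpers, which guard this
    kind of drift.
    """
    from vllm.core.scheduler import Scheduler

    original_schedule = Scheduler._schedule

    def _schedule_with_priority(self, *args, **kwargs):
        # Reorder the waiting queue (lowest priority value first), then defer
        # to the original scheduling logic.
        self.waiting = deque(
            sorted(self.waiting, key=lambda group: getattr(group, "priority", 0))
        )
        return original_schedule(self, *args, **kwargs)

    Scheduler._schedule = _schedule_with_priority
```

In this sketch, a deployment that wants the patch would set something like VLLM_CUSTOM_PATCHES=priority_scheduling in its container environment; deployments that leave the variable unset run stock vLLM behavior from the same image.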