Instruct.KR Summer 2025 Meetup
Open Source LLMs, from vLLM to Production
Hyogeun Oh
Hyogeun Oh
(Zerohertz)
q Machine Learning Engineer
q Works with Python and Kubernetes; interested in MLOps
q Neovim user
0. ...
[1] https://github.com/Zerohertz
GitHub[1]
0. ...
1. Introduction
2. OpenAI-Compatible Server
3. Architecture
4. Production Deployment
5. Wrap-up
Introduction
PART 1
Why Self-Host LLMs When Powerful Commercial APIs Exist?
q Data privacy and security
q Cost / latency control
q Customization: fine-tuning, domain adapters, custom serving behavior
qAPI constraints: rate limits, quotas, vendor lock-in
1. Introduction
[2] https://artificialanalysis.ai/
[Chart: commercial LLM API comparison (intelligence, price, speed) from Artificial Analysis [2]]
Why Self-Host LLMs When Powerful Commercial APIs Exist?
qCouldn't we just load the model with transformers' AutoModelForCausalLM and call generate()?
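As a baseline, a minimal sketch of that approach (Qwen/Qwen3-0.6B is used only as an example checkpoint) — note there is no batching, paging, or request scheduling:

  # The "just use transformers" baseline the slide questions.
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model_id = "Qwen/Qwen3-0.6B"  # example checkpoint (assumption)
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

  inputs = tokenizer("What is vLLM?", return_tensors="pt").to(model.device)
  # generate() handles one request at a time; continuous batching, a paged KV
  # cache, and request scheduling are exactly what vLLM adds on top of this.
  outputs = model.generate(**inputs, max_new_tokens=64)
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))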
1. Introduction
Why vLLM?
qvLLM history
§ Feb 9, 2023: initial commit on GitHub under the project's original name, CacheFlow [3]
§ Jun 17, 2023: the project is renamed to vLLM [4]
§ Sep 12, 2023: UC Berkeley's Sky Computing Lab publishes “Efficient Memory Management for Large Language Model Serving with PagedAttention” [6]
§ Jun 19, 2025: the GitHub repository passes 50k stars
§ Jul 24, 2025: v0.10.0 release
qLicense: Apache-2.0 [5]
qThe “v” of vLLM [7]
1. Introduction
[3] https://github.com/vllm-project/vllm/commit/e7d9d9c08c79b386f6d0477e87b77a572390317d
[4] https://github.com/vllm-project/vllm/commit/0b98ba15c744f1dfb0ea4f2135e85ca23d572ae1
[5] https://github.com/vllm-project/vllm/blob/main/LICENSE
[6] https://arxiv.org/abs/2309.06180
[7] https://github.com/vllm-project/vllm/issues/835
Why vLLM?
1. Introduction
[8] https://www.star-history.com/
[Chart: GitHub star history of vLLM [8]]
Why vLLM?
1. Introduction
[8] https://www.star-history.com/
[Chart: GitHub star history of vLLM [8]]
Why vLLM?
qFast LLM inference and serving
§ State-of-the-art serving throughput
§ Continuous batching of incoming requests
qEfficient memory management and optimized kernels
§ PagedAttention [6]: efficient management of attention key/value cache memory
§ Optimized CUDA kernels: FlashAttention [9], FlashInfer [10]
§ Chunked prefill [11], speculative decoding [12]
qFlexible and easy to use
§ Seamless integration with Hugging Face models
§ Various decoding algorithms (e.g., parallel sampling, beam search)
§ Distributed inference: tensor, pipeline (via Ray), data, and expert parallelism
qServing and API
§ OpenAI-Compatible API: works with the existing AI ecosystem (e.g., LangChain, Gemini CLI, …)
§ Streaming output
§ Prefix caching, Multi-LoRA
1. Introduction
[6] https://arxiv.org/abs/2309.06180
[9] https://github.com/vllm-project/flash-attention
[10] https://github.com/flashinfer-ai/flashinfer
[11] https://docs.vllm.ai/en/v0.9.2/configuration/optimization.html#chunked-prefill_1
[12] https://docs.vllm.ai/en/v0.9.2/features/spec_decode.html
How to Serve an LLM with vLLM?
qInstallation
§ Local (CPU)
§ GPU: prebuilt Docker image [13]
qLaunch an OpenAI-compatible server with vllm serve
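A minimal sketch of both paths — the model name is illustrative, and the install/serve commands are shown as comments:

  # Install:        pip install vllm
  # Online serving: vllm serve Qwen/Qwen3-0.6B   (OpenAI-compatible server on :8000 by default)
  # Offline inference with the Python API:
  from vllm import LLM, SamplingParams

  llm = LLM(model="Qwen/Qwen3-0.6B")  # example checkpoint (assumption)
  params = SamplingParams(temperature=0.7, max_tokens=64)
  outputs = llm.generate(["What is vLLM?"], params)
  print(outputs[0].outputs[0].text)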
1. Introduction
[13] https://docs.vllm.ai/en/v0.9.2/deployment/docker.html
OpenAI-Compatible Server
PART 2
OpenAI API Spec[14]
q/v1/models[15]
2. OpenAI-Compatible Server
[14] https://docs.vllm.ai/en/v0.9.2/serving/openai_compatible_server.html
[15] https://platform.openai.com/docs/api-reference/models/list
[16] https://platform.openai.com/docs/api-reference/chat/create
q/v1/chat/completions[16]
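A minimal request against both endpoints using the official openai client (a sketch; the base URL assumes vLLM's default port and the model id must match what the server serves):

  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

  # GET /v1/models -- list the models the server is serving
  for model in client.models.list():
      print(model.id)

  # POST /v1/chat/completions
  response = client.chat.completions.create(
      model="Qwen/Qwen3-0.6B",  # served model id (assumption)
      messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
  )
  print(response.choices[0].message.content)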
Tool calling
qOpenAI tool (function) definition schema, as specified in the openai-python SDK [17]
2. OpenAI-Compatible Server
[17] https://github.com/openai/openai-python/blob/v1.97.1/src/openai/types/shared_params/function_definition.py#L13-L45
[Code excerpt: the FunctionDefinition type from openai-python [17]]
Tool calling
qSpecify a model-appropriate parser with the “--tool-call-parser” option [18, 19]
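A minimal sketch of a tool call against a server started with tool calling enabled (the server flags follow the Qwen guidance in [19]; the get_weather tool is a placeholder):

  # Server (assumption): vllm serve Qwen/Qwen3-0.6B --enable-auto-tool-choice --tool-call-parser hermes
  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

  tools = [{
      "type": "function",
      "function": {  # follows the openai-python FunctionDefinition schema [17]
          "name": "get_weather",
          "description": "Get the current weather for a city.",
          "parameters": {
              "type": "object",
              "properties": {"city": {"type": "string"}},
              "required": ["city"],
          },
      },
  }]

  response = client.chat.completions.create(
      model="Qwen/Qwen3-0.6B",
      messages=[{"role": "user", "content": "What's the weather in Seoul?"}],
      tools=tools,
  )
  # With a tool-call parser configured, the parsed call shows up here instead of raw text.
  print(response.choices[0].message.tool_calls)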
2. OpenAI-Compatible Server
[18] https://docs.vllm.ai/en/v0.9.2/features/tool_calling.html
[19] https://qwen.readthedocs.io/en/latest/deployment/vllm.html#parsing-tool-calls
Reasoning
2. OpenAI-Compatible Server
qToggle reasoning per request by passing “enable_thinking” via “chat_template_kwargs”
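A sketch of the request-side toggle (extra_body carries vLLM-specific fields that the openai client does not model natively):

  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

  response = client.chat.completions.create(
      model="Qwen/Qwen3-0.6B",
      messages=[{"role": "user", "content": "Why is the sky blue?"}],
      # Disable thinking for this request only (Qwen3-style chat template kwarg).
      extra_body={"chat_template_kwargs": {"enable_thinking": False}},
  )
  print(response.choices[0].message.content)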
Reasoning
qSpecify a model-appropriate parser with the “--reasoning-parser” option [20, 21]
qThe older “--enable-reasoning” flag is deprecated [22]
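When a reasoning parser is configured, the chain of thought comes back separately from the answer (a sketch; the parser name follows the Qwen guidance in [21] and is an assumption):

  # Server (assumption): vllm serve Qwen/Qwen3-0.6B --reasoning-parser qwen3
  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
  response = client.chat.completions.create(
      model="Qwen/Qwen3-0.6B",
      messages=[{"role": "user", "content": "What is 17 * 24?"}],
  )
  message = response.choices[0].message
  print(message.reasoning_content)  # extracted <think> ... </think> content
  print(message.content)            # final answer with the thinking stripped out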
2. OpenAI-Compatible Server
[20] https://docs.vllm.ai/en/v0.9.2/features/reasoning_outputs.html
[21] https://qwen.readthedocs.io/en/latest/deployment/vllm.html#parsing-thinking-content
[22] https://github.com/vllm-project/vllm/blob/v0.9.2/vllm/engine/arg_utils.py#L626-L634
Chat Template
qEach model ships its own “chat_template” in tokenizer_config.json
2. OpenAI-Compatible Server
[23] https://huggingface.co/Qwen/Qwen3-0.6B/blob/main/tokenizer_config.json#L230
[Excerpt: the “chat_template” field of Qwen/Qwen3-0.6B's tokenizer_config.json [23]]
Chat Template
qOverride the built-in template with the “--chat-template” option [24]
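To see exactly what the template produces, transformers can render it directly (a sketch using the Qwen3 tokenizer from [23]):

  # Inspect how a chat is rendered by the model's built-in chat template.
  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
  messages = [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"},
  ]
  prompt = tokenizer.apply_chat_template(
      messages,
      tokenize=False,
      add_generation_prompt=True,  # append the assistant turn header
  )
  print(prompt)  # the string the server feeds to the model for /v1/chat/completions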
2. OpenAI-Compatible Server
[24] https://docs.vllm.ai/en/v0.9.2/serving/openai_compatible_server.html#chat-template_1
[25] https://github.com/vllm-project/vllm/tree/main/examples
[Example chat templates in the vLLM repository's examples directory [25]]
Architecture
PART 3
KV Cache, PagedAttention
qKV Cache [26]
§ Autoregressive decoding recomputes attention over the whole token sequence at every step
→ time complexity: O(n²)
§ Caching the per-token key/value tensors (KV cache) removes the recomputation
→ time complexity: O(n)
qPagedAttention [6, 27, 28]
§ Borrows the idea of virtual memory and paging from operating systems (OS)
§ Contiguous per-request KV cache allocations cause memory fragmentation
→ memory is wasted by internal/external fragmentation
§ The KV cache is split into fixed-size blocks; a per-request block table (like a page table) preserves logical continuity while mapping tokens to physical memory
§ Thanks to this page-style mapping, physical blocks can be non-contiguous and are allocated on demand
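A toy illustration of the block-table idea (not vLLM code): logically contiguous token positions land in whatever physical blocks happen to be free.

  # Toy block-table sketch: 4 tokens per block, physical blocks allocated on demand.
  BLOCK_SIZE = 4
  free_blocks = [7, 3, 9, 0, 5]   # physical KV-cache blocks currently free
  block_table = []                # logical block index -> physical block id

  def append_token(logical_pos: int) -> tuple[int, int]:
      """Return (physical_block, offset) where this token's KV entries are stored."""
      logical_block, offset = divmod(logical_pos, BLOCK_SIZE)
      if logical_block == len(block_table):      # first token of a new logical block
          block_table.append(free_blocks.pop())  # grab any free physical block
      return block_table[logical_block], offset

  for pos in range(10):           # a 10-token sequence spans 3 logical blocks
      print(pos, append_token(pos))
  print("block table:", block_table)  # e.g. [5, 0, 9]: non-contiguous physical memory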
3. Architecture
[26] https://huggingface.co/blog/not-lain/kv-caching
[6] https://arxiv.org/abs/2309.06180
[27] https://docs.vllm.ai/en/v0.9.2/design/kernel/paged_attention.html
[28] https://blog.vllm.ai/2023/06/20/vllm.html
[29] https://github.com/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter03/Chapter%203%20-%20Looking%20Inside%20LLMs.ipynb
[Figure: the Q × Kᵀ attention-score matrix for a four-token sequence — each decoding step adds only one new query row, so the key/value vectors of earlier tokens can be cached and reused [29]; KV cache illustration from the PagedAttention paper [6]]
V0 Engine vs. V1 Engine
q Optimized Execution Loop & API Server: the EngineCore and the AsyncLLM/API server run as separate processes, overlapping CPU-heavy work (tokenization, detokenization) with GPU model execution
q Simple & Flexible Scheduler: no hard “prefill” / “decode” distinction; scheduling decisions are a plain “{request_id: num_tokens}” token budget, which naturally covers chunked prefill, prefix caching, and speculative decoding
q Zero-Overhead Prefix Caching: hash-based caching with LRU eviction; even at a 0% cache hit rate the throughput loss is under 1%, so it is enabled by default
q Clean Architecture for Tensor-Parallel Inference: workers cache request state and receive only diffs, cutting scheduler–worker IPC overhead
q Efficient Input Preparation: a persistent batch reuses input tensors and applies diffs, with Numpy operations reducing CPU overhead
q torch.compile & Piecewise CUDA Graphs: models are optimized automatically and captured as piecewise CUDA graphs, reducing the need for hand-written kernel customizing
q Enhanced Support for Multimodal LLMs: image preprocessing moved off the critical path, prefix caching keyed by image hashes, and an encoder cache that enables chunked prefill of multimodal requests
q FlashAttention 3: a single attention kernel flexible enough to handle mixed prefill/decode batches
q Set VLLM_USE_V1=1 to run the V1 Engine
3. Architecture
[30] https://docs.vllm.ai/en/v0.9.2/design/arch_overview.html
[31] https://docs.vllm.ai/en/v0.9.2/usage/v1_guide.html
[32] https://github.com/vllm-project/vllm/issues/18571
[33] https://blog.vllm.ai/2025/01/27/v1-alpha-release.html
Chat Completions
qHow a /v1/chat/completions request flows through vLLM [34]
2. OpenAI-Compatible Server
[34] https://zerohertz.github.io/vllm-openai-2/#Conclusion
Production Deployment
PART 4
LoRA Adapters
qLoRA (Low-Rank Adaptation)[35]
§ Freezes the pretrained model weights and trains two small low-rank matrices (A, B)
§ The update ΔW = BA adapts the model to new data with far fewer trainable parameters
qStatically serving LoRA adapters: registered when the server starts (e.g., via “--lora-modules”)
qDynamically serving LoRA adapters: loaded and unloaded at runtime — see the sketch below
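A sketch of the dynamic path from [35] (assumes the server was started with --enable-lora and VLLM_ALLOW_RUNTIME_LORA_UPDATING=True; the adapter name and path are placeholders):

  import requests

  BASE = "http://localhost:8000"

  # Load an adapter at runtime (lora_name / lora_path are placeholders).
  requests.post(f"{BASE}/v1/load_lora_adapter", json={
      "lora_name": "sql_adapter",
      "lora_path": "/models/loras/sql_adapter",
  }).raise_for_status()

  # The adapter is now addressable as a model id in /v1/chat/completions.
  resp = requests.post(f"{BASE}/v1/chat/completions", json={
      "model": "sql_adapter",
      "messages": [{"role": "user", "content": "Write a query for daily active users."}],
  })
  print(resp.json()["choices"][0]["message"]["content"])

  # Unload when no longer needed.
  requests.post(f"{BASE}/v1/unload_lora_adapter", json={"lora_name": "sql_adapter"})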
4. Production Deployment
[35] https://docs.vllm.ai/en/v0.9.2/features/lora.html
[36] https://arxiv.org/abs/2106.09685
[Figure: LoRA reparameterization (frozen W plus trainable A, B), from the LoRA paper [36]]
LoRA Adapters
4. Production Deployment
Parallelism Strategies[37]
qTensor Parallelism (TP)
§ Each layer's parameters are sharded across multiple GPUs
→ Use when the model is too large for a single GPU but fits in a single node
→ Also spreads the KV cache across GPUs, relieving per-GPU memory pressure
qPipeline Parallelism (PP)
§ Contiguous groups of layers are placed on different GPUs and requests flow through them as a pipeline
→ Use when the model is too large for a single node
→ Complements tensor sharding when individual layers are hard to split further
qExpert Parallelism (EP)
§ Distributes the experts of Mixture of Experts (MoE) models across GPUs
→ “--enable-expert-parallel” switches MoE layers from tensor parallelism to expert parallelism
→ Applies only to MoE models
→ Each GPU holds a subset of the experts
qData Parallelism (DP)
§ Each GPU (or GPU group) holds a full model replica and serves its own slice of the request batch
→ Use when there are enough GPUs to replicate the entire model
→ Model weights are duplicated per replica
→ Throughput scales with the number of replicas as the batch grows
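A sketch of how these map onto vLLM's engine arguments (the model and sizes are illustrative, assuming a 2-node × 8-GPU setup):

  # Combine TP within a node with PP across nodes (illustrative sizes).
  from vllm import LLM

  llm = LLM(
      model="meta-llama/Llama-3.1-70B-Instruct",  # example model (assumption)
      tensor_parallel_size=8,    # shard each layer across the 8 GPUs of a node
      pipeline_parallel_size=2,  # split the layer stack across 2 nodes
      distributed_executor_backend="ray",  # multi-node execution via the Ray cluster
  )

  # Equivalent server form (CLI invocation, shown as a comment):
  #   vllm serve meta-llama/Llama-3.1-70B-Instruct \
  #       --tensor-parallel-size 8 --pipeline-parallel-size 2 \
  #       --distributed-executor-backend ray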
4. Production Deployment
[37] https://docs.vllm.ai/en/v0.9.2/configuration/optimization.html#parallelism-strategies
Multi-node Distributed Inference
qSet up a Ray cluster across the nodes [38]
§ VLLM_HOST_IP [39]: the IP address each vLLM node advertises to the cluster
§ GLOO_SOCKET_IFNAME [40]: network interface used by PyTorch's Gloo backend
§ NCCL_IB_DISABLE [41]: disables NCCL's use of the InfiniBand (IB) network
4. Production Deployment
[38] https://docs.vllm.ai/en/v0.9.2/examples/online_serving/run_cluster.html
[39] https://docs.vllm.ai/en/v0.9.2/usage/security.html
[40] https://docs.pytorch.org/docs/stable/distributed.html#common-environment-variables
[41] https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-ib-disable
[Diagram: Ray cluster with a head node and worker nodes]
Multi-node Distributed Inference
qMulti-Node Multi-GPU[42, 43]
(tensor parallel plus pipeline parallel inference)
§ If your model is too large to fit in a single node, you can use tensor
parallel together with pipeline parallelism.
§ The tensor parallel size is the number of GPUs you want to use in
each node, and the pipeline parallel size is the number of nodes
you want to use.
§ E.g., if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set
the tensor parallel size to 8 and the pipeline parallel size to 2.
qTP 16 vs. TP 8 + PP 2
4. Production Deployment
[42] https://docs.vllm.ai/en/v0.9.2/serving/distributed_serving.html
[43] https://blog.vllm.ai/2025/02/17/distributed-inference.html
Multi-node Distributed Inference
qRDMA (Remote Direct Memory Access) [44]
§ Lets a node read/write another node's memory over the network without involving the remote CPU, caches, or OS data path
§ E.g., InfiniBand, RoCE, iWARP
qInfiniBand [45]
§ Native RDMA fabric: data moves between node memories with minimal CPU involvement
§ Requires dedicated switches and network adapters (Host Channel Adapter, HCA), i.e., a fabric separate from the usual Ethernet network architecture
qRoCEv2 (RDMA over Converged Ethernet v2) [44]
§ Runs the RDMA protocol over a standard Ethernet network
4. Production Deployment
[44] https://zerohertz.github.io/distributed-computing-rdma-roce/
[45] https://developer.nvidia.com/gpudirect
Multi-node Distributed Inference
4. Production Deployment
[46] https://docs.vllm.ai/en/v0.9.2/serving/distributed_serving.html#gpudirect-rdma
[Excerpt: GPUDirect RDMA guidance from the vLLM distributed serving docs [46]]
Production Deployment with GPU Cluster (Kubernetes)
qDeploying vLLM directly on Kubernetes [47]
§ You manage model storage, networking, IPC, and the rest of the plumbing yourself
qAIBrix [48, 49, 50, 51]
§ A cloud-native control plane for vLLM
§ LoRA management, LLM gateway and routing, autoscaler, distributed KV cache, …
qProduction-stack [52, 53, 54, 55]
§ A reference codebase for running vLLM-based LLM services in production
§ KV cache sharing via LMCache, prefix-aware routing, Helm-chart deployment, …
4. Production Deployment
[47] https://docs.vllm.ai/en/v0.9.2/deployment/k8s.html
[48] https://github.com/vllm-project/aibrix
[49] https://arxiv.org/abs/2504.03648
[50] https://blog.vllm.ai/2025/02/21/aibrix-release.html
[51] https://aibrix.readthedocs.io/latest/
[52] https://github.com/vllm-project/production-stack
[53] https://docs.vllm.ai/en/v0.9.2/deployment/integrations/production-stack.html
[54] https://blog.vllm.ai/2025/01/21/stack-release.html
[55] https://blog.vllm.ai/production-stack/
Production Deployment with GPU Cluster (Kubernetes)
qLWS (LeaderWorkerSet) [56, 57]
§ A Kubernetes CRD and controller that manages pod groups in a leader–worker architecture — a natural fit for multi-node vLLM serving
4. Production Deployment
[56] https://docs.vllm.ai/en/v0.9.2/deployment/frameworks/lws.html
[57] https://github.com/kubernetes-sigs/lws
Observability (Prometheus + Grafana)
qThe vLLM server exposes Prometheus-format metrics at “/metrics” [58, 59]
§ vllm:request_success_total: successfully finished requests (by finish reason: EOS vs. max length)
§ vllm:request_queue_time_seconds: time spent waiting in the queue
§ vllm:request_prefill_time_seconds: time spent in the prefill phase
§ vllm:request_decode_time_seconds: time spent in the decode phase
§ vllm:request_max_num_generation_tokens: maximum number of generation tokens per request
§ …
qGrafana dashboards for visualization
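A quick way to peek at these series without Prometheus (a sketch; assumes the server runs on localhost:8000):

  # Print the vLLM metrics exposed at /metrics.
  import requests

  text = requests.get("http://localhost:8000/metrics").text
  for line in text.splitlines():
      if line.startswith("vllm:"):  # actual series; "# HELP" / "# TYPE" lines are skipped
          print(line)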
4. Production Deployment
[58] https://docs.vllm.ai/en/v0.9.2/usage/metrics.html
[59] https://docs.vllm.ai/en/v0.9.2/examples/online_serving/prometheus_grafana.html
Benchmark
qRun benchmarks with the “vllm bench” CLI [60]
4. Production Deployment
[60] https://docs.vllm.ai/en/v0.9.2/cli/index.html#bench
[61] https://docs.vllm.ai/en/v0.9.2/contributing/profiling.html
[62] https://github.com/vllm-project/vllm/tree/v0.9.2/benchmarks
[63] https://github.com/vllm-project/guidellm
[64] https://arxiv.org/pdf/2502.06494
qBenchmark scripts in the vLLM repository's benchmarks/ directory [61, 62]
qGuideLLM, a benchmarking and evaluation tool under the vLLM project [63, 64]
Production Issues
qGloo connectFullMesh failed with…[65]
§ Gloo fails to build the full mesh between processes on different nodes
→ Pin “GLOO_SOCKET_IFNAME” to the network interface that actually connects the nodes
4. Production Deployment
[65] https://github.com/vllm-project/vllm/discussions/11353
[66] https://github.com/vllm-project/vllm/issues/17652#issuecomment-2867891239
[67] https://flashinfer.ai/2024/02/02/cascade-inference.html
qQwen models intermittently emit abnormal tokens [66, 67]
§ Traced to cascade inference in the FlashInfer attention kernel
→ Work around it with “--disable-cascade-attn”
Wrap-up
PART 5
Roadmap Q3 2025[68]
qV1 Engine
§ Remove the remaining V0 Engine code paths
§ Harden the core scheduler
§ Async scheduling, improved multi-modal support
qLarge Scale Serving
§ Scale-out serving for Mixture-of-Experts (MoE) models
§ Autoscaling support
qModels
§ Reuse tokenizer, configuration, and processor definitions from training/authoring frameworks
§ Sparse attention mechanisms
§ Better support for small (~1B-parameter) models
qUse Cases
§ RLHF
→ Fast weight resharding between training and inference
→ Multi-turn-aware scheduling
§ Evaluation
→ Full determinism regardless of batching order (with/without prefix cache)
§ Batch Inference
→ Prefix-cache-aware data-parallel router for scale-out
→ CPU KV cache offloading
5. Wrap-up
[68] https://github.com/vllm-project/vllm/issues/20336
Conclusion
qOpenAI-Compatible Server
§ Drop-in compatible with existing clients such as LangChain and Gemini CLI
§ Tool calling and reasoning output parsing
qArchitecture
§ KV cache and PagedAttention
§ V1 Engine
qProduction Deployment
§ TP / PP / DP / EP parallelism strategies
§ Kubernetes and multi-node deployment
§ LoRA adapter serving
§ Prometheus + Grafana observability
§ vllm bench benchmark scripts
5. Wrap-up
vLLM Meetup in Korea[69, 70]
5. Wrap-up
[69] https://discuss.pytorch.kr/t/8-19-vllm-meetup/7401
[70] https://lu.ma/cgcgprmh
EoD
Coffee Chat
LinkedIn
GitHub
