Optimizing Large Language Models with vLLM and Related Tools
By - Tamanna
Introduction to LLM Optimization
Large Language Models (LLMs) like LLaMA and Mistral power AI applications but face
challenges:
High memory usage (e.g., LLaMA-13B needs 26GB in FP16; see the sizing sketch at the end of this slide).
Slow inference speeds for real-time tasks.
High computational costs.
vLLM: Open-source library for efficient LLM inference and serving.
Developed at UC Berkeley, 40,000+ GitHub stars.
Up to 24x higher throughput than Hugging Face Transformers.
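A quick sanity check on the memory figure above (a rough sketch: weight memory is roughly parameter count times bytes per parameter; activations and the KV cache add more on top):

params = 13e9  # LLaMA-13B parameter count, approximately
for name, bytes_per_param in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{name}: {params * bytes_per_param / 1e9:.1f} GB")
# FP16: 26.0 GB, INT8: 13.0 GB, INT4: 6.5 GB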
What is vLLM?
vLLM (Virtual Large Language Model) optimizes LLM inference:
Reduces memory waste with PagedAttention.
Boosts throughput with continuous batching.
Supports quantization (e.g., FP8, INT8).
Compatible with Hugging Face models and OpenAI-style APIs.
Ideal for chatbots, code assistants, and text generation.
Core Features of vLLM
PagedAttention: Divides the KV cache into blocks, reducing memory waste to <4% (toy sketch after this list).
Example: Shares KV blocks for similar prompts (e.g., "What is the capital of...").
Continuous Batching: Dynamically processes requests, minimizing latency.
Quantization: Supports FP8, INT8, AWQ, GPTQ, bitsandbytes.
Example: LLaMA-13B from 26GB (FP16) to 13GB (INT8).
Optimized CUDA Kernels: Uses FlashAttention for speed.
Tensor Parallelism & Speculative Decoding: Scales across GPUs, predicts tokens faster.
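A conceptual sketch of the paged KV cache behind PagedAttention (a toy illustration, not vLLM's internals): each sequence keeps a block table mapping its logical KV blocks to physical blocks allocated on demand, so memory is reserved per block rather than per maximum sequence length, and at most one partially filled block per sequence is wasted. BLOCK_SIZE, the pool size, and on_new_token are made up for illustration.

BLOCK_SIZE = 16                   # tokens per physical KV block
free_blocks = list(range(64))     # pool of free physical block ids on the GPU
block_tables = {}                 # sequence id -> list of physical block ids

def on_new_token(seq_id, token_count):
    """Allocate one more physical block only when a block boundary is crossed."""
    table = block_tables.setdefault(seq_id, [])
    blocks_needed = -(-token_count // BLOCK_SIZE)   # ceiling division
    while len(table) < blocks_needed:
        table.append(free_blocks.pop())
    return table

for t in range(1, 40):                 # a sequence grows to 39 tokens
    on_new_token("seq-A", t)
print(len(block_tables["seq-A"]))      # 3 blocks (48 slots) hold 39 tokens; only 9 slots sit unused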
vLLM Workflow
Diagram: [Placeholder for vLLM workflow diagram]
Workflow:
User sends prompt to vLLM API server.
AsyncLLM processes requests asynchronously.
EngineCore schedules with PagedAttention and continuous batching.
Quantized model runs on optimized CUDA kernels.
Output returned to user.
Note: Create diagram using tools like Mermaid or TikZ.
Using vLLM: Example (1/2)
Install: pip install vllm
Offline Inference:
from vllm import LLM, SamplingParams
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
llm = LLM(model=model_name, quantization="int8")
prompts = ["The future of AI is", "Write a short poem about the moon"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=50)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Output: {output.outputs[0].text}\n")
Using vLLM: Example (2/2)
Online Serving:
vllm serve meta-llama/Meta-Llama-3-8B-Instruct --quantization int8 --port 8000
Docker Deployment:
docker run --runtime nvidia --gpus all -p 8000:8000 vllm/vllm-openai:latest --model meta-llama/Meta-Llama-3-8B-Instruct
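Once either of the commands above is running, the OpenAI-style API can be queried from Python; a minimal sketch, assuming the server listens on localhost:8000 and exposes the standard /v1/completions route:

import requests

# Send a completion request to the locally running vLLM server
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "prompt": "The future of AI is",
        "max_tokens": 50,
        "temperature": 0.8,
    },
)
print(resp.json()["choices"][0]["text"])

The same request works against the Docker container, since it publishes port 8000.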
Other Tools Like vLLM
TensorRT-LLM (NVIDIA):
Optimized for NVIDIA GPUs, FP8 quantization.
High throughput (120–130 tok/s for LLaMA-7B).
Complex setup (model conversion).
DeepSpeed (Microsoft):
Mixed precision (FP16, BF16), ZeRO for distributed setups.
Ideal for large-scale training/inference.
OpenLLM (BentoML):
Focus on quantization (GPTQ, bitsandbytes) and fine-tuning.
Suited for memory-constrained environments.
TGI (Hugging Face):
Continuous batching, FlashAttention.
2.2x–2.5x lower throughput than vLLM.
Additional Tools for LLM Optimization
ONNX Runtime (Microsoft):
Cross-platform (CPUs, GPUs, TPUs), INT8/FP16 quantization.
Use case: Edge devices, mixed hardware.
Llama.cpp:
CPU-optimized, 4-bit/5-bit quantization (Q4_K_M).
Example: Run Mistral-7B on a laptop (4.5GB).
ExLlamaV2:
4-bit GPTQ on NVIDIA GPUs, ~100 tok/s for LLaMA-7B.
Lightweight, single-GPU focus.
Aphrodite-Engine:
vLLM fork with speculative decoding.
Experimental, marginal speed gains.
Quantization Techniques
PTQ (Post-Training Quantization): Fast, slight accuracy loss.
Example: INT8 in vLLM.
QAT (Quantization-Aware Training): High accuracy, resource-intensive.
AWQ: Activation-aware, fast on NVIDIA GPUs.
GPTQ: 4-bit/8-bit quantization, memory-efficient.
Bitsandbytes: 8-bit with hardware acceleration (loading sketch after this list).
Q4_K_M (Llama.cpp): 4-bit for CPU inference.
Example: ./main -m mistral-7b-q4_k_m.gguf -p "Tell me a story"
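To illustrate the bitsandbytes entry above, a minimal sketch of post-training 8-bit quantization at load time via Hugging Face Transformers (assumes the transformers, bitsandbytes, and accelerate packages plus a GPU; the Mistral-7B checkpoint is the one used elsewhere in these slides):

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # weights quantized to 8-bit at load time
    device_map="auto",
)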
Additional Optimization Techniques
Model Distillation:
Trains smaller "student" model to mimic larger "teacher."
Use case: Edge deployment.
Example: Distill LLaMA-13B to 3B (loss sketch at the end of this slide).
LoRA (Low-Rank Adaptation):
Fine-tunes small parameter subset.
Example in vLLM (the adapter is enabled at load time and passed per request; prompts and sampling_params as in the earlier example):
from vllm import LLM
from vllm.lora.request import LoRARequest
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", enable_lora=True)
outputs = llm.generate(prompts, sampling_params, lora_request=LoRARequest("lora-adapter", 1, "lora-adapter"))
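Returning to the distillation bullet above: a minimal sketch of the usual soft-target objective, where the student is trained to match the teacher's temperature-softened output distribution; student_logits and teacher_logits are hypothetical tensors from a training batch.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    t = temperature
    log_student = F.log_softmax(student_logits / t, dim=-1)
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    # Scale by T^2, the standard soft-target distillation convention
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)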
Comparison of Tools
Feature       | vLLM                     | TensorRT-LLM   | DeepSpeed             | Llama.cpp
Focus         | High-throughput serving  | GPU inference  | Training/Inference    | CPU/Edge inference
Optimization  | PagedAttention, Batching | FP8, Kernels   | ZeRO, Mixed Precision | Q4/Q5 Quantization
Throughput    | 24x vs. HF               | 120–130 tok/s  | High (distributed)    | 10–20 tok/s
Quantization  | FP8, INT8, AWQ, GPTQ     | FP8, INT8      | FP16, BF16, INT8      | Q4, Q5, Q8
Hardware      | NVIDIA, AMD, CPUs        | NVIDIA GPUs    | Multi-GPU, CPUs       | CPUs, Edge
Use Case      | Chatbots, APIs           | NVIDIA setups  | Large-scale setups    | Edge devices
Practical Example: Chatbot Deployment
GPU with vLLM (Mistral-7B, INT8):
vllm serve mistralai/Mistral-7B-v0.1 --quantization int8 --port 8000
Memory: 7GB (vs. 14GB FP16).
Output: "Why did the scarecrow become a motivational speaker? ..."
CPU with Llama.cpp (Mistral-7B, Q4_K_M):
./main -m mistral-7b-q4_k_m.gguf -p "Tell me a joke!" -n 100
Memory: ~4.5GB, runs on 16GB RAM laptop.
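The same CPU setup can be driven from Python through the optional llama-cpp-python binding; a minimal sketch, assuming the GGUF file from the command above and default settings:

from llama_cpp import Llama

# Load the 4-bit GGUF model on the CPU and generate a short completion
llm = Llama(model_path="mistral-7b-q4_k_m.gguf", n_ctx=2048, n_threads=8)
out = llm("Tell me a joke!", max_tokens=100)
print(out["choices"][0]["text"])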
Conclusion
vLLM: Top choice for high-throughput LLM serving with PagedAttention and quantization.
Alternatives: TensorRT-LLM (NVIDIA GPUs), DeepSpeed (distributed), OpenLLM (fine-tuning),
TGI (Hugging Face), ONNX Runtime (cross-platform), Llama.cpp (CPU/edge), ExLlamaV2 (4-bit
GPTQ), Aphrodite-Engine (experimental).
Techniques: Quantization (PTQ, AWQ, Q4_K_M), distillation, LoRA.
Deploy LLMs efficiently on GPUs, CPUs, or edge devices.
Resources: vLLM documentation, Llama.cpp GitHub, Runpod.
Thank you!!