Optimizing Large Language Models with vLLM and Related Tools
By - Tamanna
Introduction to LLM Optimization
Large Language Models (LLMs) like LLaMA and Mistral power AI applications but face
challenges:
High memory usage (e.g., LLaMA-13B needs 26GB in FP16; see the sizing sketch at the end of this slide).
Slow inference speeds for real-time tasks.
High computational costs.
vLLM: Open-source library for efficient LLM inference and serving.
Developed at UC Berkeley, 40,000+ GitHub stars.
Up to 24x higher throughput than Hugging Face Transformers.
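A quick sanity check on the memory figure above (a rough sketch: weight memory is roughly parameter count times bytes per parameter; activations and the KV cache add more on top):

params = 13e9  # LLaMA-13B parameter count, approximately
for name, bytes_per_param in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{name}: {params * bytes_per_param / 1e9:.1f} GB")
# FP16: 26.0 GB, INT8: 13.0 GB, INT4: 6.5 GB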
What is vLLM?
vLLM (Virtual Large Language Model) optimizes LLM inference:
Reduces memory waste with PagedAttention.
Boosts throughput with continuous batching.
Supports quantization (e.g., FP8, INT8).
Compatible with Hugging Face models and OpenAI-style APIs.
Ideal for chatbots, code assistants, and text generation.
Core Features of vLLM
PagedAttention: Divides the KV cache into blocks, reducing memory waste to <4% (toy sketch after this list).
Example: Shares KV blocks for similar prompts (e.g., "What is the capital of...").
Continuous Batching: Dynamically processes requests, minimizing latency.
Quantization: Supports FP8, INT8, AWQ, GPTQ, bitsandbytes.
Example: LLaMA-13B from 26GB (FP16) to 13GB (INT8).
Optimized CUDA Kernels: Uses FlashAttention for speed.
Tensor Parallelism & Speculative Decoding: Scales across GPUs, predicts tokens faster.
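A conceptual sketch of the paged KV cache behind PagedAttention (a toy illustration, not vLLM's internals): each sequence keeps a block table mapping its logical KV blocks to physical blocks allocated on demand, so memory is reserved per block rather than per maximum sequence length, and at most one partially filled block per sequence is wasted. BLOCK_SIZE, the pool size, and on_new_token are made up for illustration.

BLOCK_SIZE = 16                   # tokens per physical KV block
free_blocks = list(range(64))     # pool of free physical block ids on the GPU
block_tables = {}                 # sequence id -> list of physical block ids

def on_new_token(seq_id, token_count):
    """Allocate one more physical block only when a block boundary is crossed."""
    table = block_tables.setdefault(seq_id, [])
    blocks_needed = -(-token_count // BLOCK_SIZE)   # ceiling division
    while len(table) < blocks_needed:
        table.append(free_blocks.pop())
    return table

for t in range(1, 40):                 # a sequence grows to 39 tokens
    on_new_token("seq-A", t)
print(len(block_tables["seq-A"]))      # 3 blocks (48 slots) hold 39 tokens; only 9 slots sit unused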
vLLM Workflow
Diagram: [Placeholder for vLLM workflow diagram]
Workflow:
User sends prompt to vLLM API server.
AsyncLLM processes requests asynchronously.
EngineCore schedules with PagedAttention and continuous batching.
Quantized model runs on optimized CUDA kernels.
Output returned to user.
Note: Create diagram using tools like Mermaid or TikZ.
Using vLLM: Example (1/2)
Install: pip install vllm
Offline Inference:
from vllm import LLM, SamplingParams
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
llm = LLM(model=model_name, quantization="int8")
prompts = ["The future of AI is", "Write a short poem about the moon"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=50)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Output: {output.outputs[0].text}\n")
Using vLLM: Example (2/2)
Online Serving:
vllm serve meta-llama/Meta-Llama-3-8B-Instruct --quantization int8 --port 8000
Docker Deployment:
docker run --runtime nvidia --gpus all -p 8000:8000 vllm/vllm-openai:latest --model meta-llama/Meta-Llama-3-8B-Instruct
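Once either of the commands above is running, the OpenAI-style API can be queried from Python; a minimal sketch, assuming the server listens on localhost:8000 and exposes the standard /v1/completions route:

import requests

# Send a completion request to the locally running vLLM server
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "prompt": "The future of AI is",
        "max_tokens": 50,
        "temperature": 0.8,
    },
)
print(resp.json()["choices"][0]["text"])

The same request works against the Docker container, since it publishes port 8000.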
Other Tools Like vLLM
TensorRT-LLM (NVIDIA):
Optimized for NVIDIA GPUs, FP8 quantization.
High throughput (120–130 tok/s for LLaMA-7B).
Complex setup (model conversion).
DeepSpeed (Microsoft):
Mixed precision (FP16, BF16), ZeRO for distributed setups.
Ideal for large-scale training/inference.
OpenLLM (BentoML):
Focus on quantization (GPTQ, bitsandbytes) and fine-tuning.
Suited for memory-constrained environments.
TGI (Hugging Face):
Continuous batching, FlashAttention.
2.2x–2.5x lower throughput than vLLM.
Additional Tools for LLM Optimization
ONNX Runtime (Microsoft):
Cross-platform (CPUs, GPUs, TPUs), INT8/FP16 quantization.
Use case: Edge devices, mixed hardware.
Llama.cpp:
CPU-optimized, 4-bit/5-bit quantization (Q4_K_M).
Example: Run Mistral-7B on a laptop (4.5GB).
ExLlamaV2:
4-bit GPTQ on NVIDIA GPUs, ~100 tok/s for LLaMA-7B.
Lightweight, single-GPU focus.
Aphrodite-Engine:
vLLM fork with speculative decoding.
Experimental, marginal speed gains.
Quantization Techniques
PTQ (Post-Training Quantization): Fast, slight accuracy loss.
Example: INT8 in vLLM.
QAT (Quantization-Aware Training): High accuracy, resource-intensive.
AWQ: Activation-aware, fast on NVIDIA GPUs.
GPTQ: 4-bit/8-bit quantization, memory-efficient.
Bitsandbytes: 8-bit with hardware acceleration (loading sketch after this list).
Q4_K_M (Llama.cpp): 4-bit for CPU inference.
Example: ./main -m mistral-7b-q4_k_m.gguf -p "Tell me a story"
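To illustrate the bitsandbytes entry above, a minimal sketch of post-training 8-bit quantization at load time via Hugging Face Transformers (assumes the transformers, bitsandbytes, and accelerate packages plus a GPU; the Mistral-7B checkpoint is the one used elsewhere in these slides):

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # weights quantized to 8-bit at load time
    device_map="auto",
)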
Additional Optimization Techniques
Model Distillation:
Trains smaller "student" model to mimic larger "teacher."
Use case: Edge deployment.
Example: Distill LLaMA-13B to 3B (loss sketch at the end of this slide).
LoRA (Low-Rank Adaptation):
Fine-tunes small parameter subset.
Example in vLLM (the adapter is enabled at load time and passed per request; prompts and sampling_params as in the earlier example):
from vllm import LLM
from vllm.lora.request import LoRARequest
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", enable_lora=True)
outputs = llm.generate(prompts, sampling_params, lora_request=LoRARequest("lora-adapter", 1, "lora-adapter"))
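Returning to the distillation bullet above: a minimal sketch of the usual soft-target objective, where the student is trained to match the teacher's temperature-softened output distribution; student_logits and teacher_logits are hypothetical tensors from a training batch.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    t = temperature
    log_student = F.log_softmax(student_logits / t, dim=-1)
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    # Scale by T^2, the standard soft-target distillation convention
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)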
Comparison of Tools
Feature       | vLLM                     | TensorRT-LLM   | DeepSpeed             | Llama.cpp
Focus         | High-throughput serving  | GPU inference  | Training/Inference    | CPU/Edge inference
Optimization  | PagedAttention, Batching | FP8, Kernels   | ZeRO, Mixed Precision | Q4/Q5 Quantization
Throughput    | 24x vs. HF               | 120–130 tok/s  | High (distributed)    | 10–20 tok/s
Quantization  | FP8, INT8, AWQ, GPTQ     | FP8, INT8      | FP16, BF16, INT8      | Q4, Q5, Q8
Hardware      | NVIDIA, AMD, CPUs        | NVIDIA GPUs    | Multi-GPU, CPUs       | CPUs, Edge
Use Case      | Chatbots, APIs           | NVIDIA setups  | Large-scale setups    | Edge devices
Practical Example: Chatbot Deployment
GPU with vLLM (Mistral-7B, INT8):
vllm serve mistralai/Mistral-7B-v0.1 --quantization int8 --port 8000
Memory: 7GB (vs. 14GB FP16).
Output: "Why did the scarecrow become a motivational speaker? ..."
CPU with Llama.cpp (Mistral-7B, Q4_K_M):
./main -m mistral-7b-q4_k_m.gguf -p "Tell me a joke!" -n 100
Memory: ~4.5GB, runs on 16GB RAM laptop.
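The same CPU setup can be driven from Python through the optional llama-cpp-python binding; a minimal sketch, assuming the GGUF file from the command above and default settings:

from llama_cpp import Llama

# Load the 4-bit GGUF model on the CPU and generate a short completion
llm = Llama(model_path="mistral-7b-q4_k_m.gguf", n_ctx=2048, n_threads=8)
out = llm("Tell me a joke!", max_tokens=100)
print(out["choices"][0]["text"])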
Conclusion
vLLM: Top choice for high-throughput LLM serving with PagedAttention and quantization.
Alternatives: TensorRT-LLM (NVIDIA GPUs), DeepSpeed (distributed), OpenLLM (fine-tuning),
TGI (Hugging Face), ONNX Runtime (cross-platform), Llama.cpp (CPU/edge), ExLlamaV2 (4-bit
GPTQ), Aphrodite-Engine (experimental).
Techniques: Quantization (PTQ, AWQ, Q4_K_M), distillation, LoRA.
Deploy LLMs efficiently on GPUs, CPUs, or edge devices.
Resources: vLLM documentation, Llama.cpp GitHub, Runpod.
Thank you!!