Tips for Optimizing LLM Performance

Summary

Improving the performance of large language models (LLMs) involves refining their speed, efficiency, and cost-effectiveness, focusing on key areas like prompt design, model structure, and system-level infrastructure. By optimizing these elements, organizations can achieve better results in AI-driven applications while reducing latency and resource consumption.

  • Revise input and output design: Streamline inputs by summarizing or compressing prompts and organizing outputs to reduce computation time and resource use.
  • Apply model compression: Utilize techniques like quantization and knowledge distillation to reduce the size and memory needs of LLMs without compromising performance (see the loading sketch after this list).
  • Utilize efficient infrastructure: Invest in system-level optimizations, such as memory management and batching, or implement speculative decoding to enhance processing speed and manage costs.
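
As a hedged illustration of the model-compression bullet above, the snippet below loads a causal language model with 4-bit weights using Hugging Face transformers and bitsandbytes. The checkpoint name is a placeholder, and the actual memory savings and accuracy depend on the model, hardware, and quantization settings.

```python
# Load a causal LM with 4-bit quantized weights (illustrative; model name is a placeholder).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit (NF4 by default)
    bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                      # place layers on available GPUs
)
```
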
  • View profile for Aishwarya Srinivasan
    Aishwarya Srinivasan is an Influencer
    595,262 followers

    If you're an AI engineer trying to optimize your LLMs for inference, here's a quick guide for you 👇

    Efficient inference isn't just about faster hardware; it's a multi-layered design problem. From how you compress prompts to how your memory is managed across GPUs, everything impacts latency, throughput, and cost. Here's a structured taxonomy of inference-time optimizations for LLMs:

    1. Data-Level Optimization
    Reduce redundant tokens and unnecessary output computation.
    → Input Compression:
      - Prompt Pruning: remove irrelevant history or system tokens
      - Prompt Summarization: use model-generated summaries as input
      - Soft Prompt Compression: encode static context using embeddings
      - RAG: replace long prompts with retrieved documents plus compact queries
    → Output Organization:
      - Pre-structure output to reduce decoding time and minimize sampling steps

    2. Model-Level Optimization
    (a) Efficient Structure Design
    → Efficient FFN Design: use gated or sparsely activated FFNs (e.g., SwiGLU)
    → Efficient Attention: FlashAttention, linear attention, or sliding window for long context
    → Transformer Alternates: e.g., Mamba, Reformer for memory-efficient decoding
    → Multi/Group-Query Attention: share keys/values across heads to reduce KV cache size
    → Low-Complexity Attention: replace full softmax with approximations (e.g., Linformer)
    (b) Model Compression
    → Quantization: Post-Training (no retraining needed) or Quantization-Aware Training (better accuracy, especially <8-bit)
    → Sparsification: Weight Pruning, Sparse Attention
    → Structure Optimization: Neural Architecture Search, Structure Factorization
    → Knowledge Distillation: White-box (student learns internal states) or Black-box (student mimics output logits)
    → Dynamic Inference: adaptive early exits or skipping blocks based on input complexity

    3. System-Level Optimization
    (a) Inference Engine
    → Graph & Operator Optimization: use ONNX, TensorRT, BetterTransformer for op fusion
    → Speculative Decoding: use a smaller model to draft tokens, validate with the full model
    → Memory Management: KV cache reuse, paging strategies (e.g., PagedAttention in vLLM)
    (b) Serving System
    → Batching: group requests with similar lengths for throughput gains
    → Scheduling: token-level preemption (e.g., TGI, vLLM schedulers)
    → Distributed Systems: use tensor, pipeline, or model parallelism to scale across GPUs

    My Two Cents 🫰
    → Always benchmark end-to-end latency, not just token decode speed
    → For production, 8-bit or 4-bit quantized models with MQA and PagedAttention give the best price/performance
    → If using long context (>64k), consider sliding attention plus RAG, not full dense memory
    → Use speculative decoding and batching for chat applications with high concurrency
    → LLM inference is a systems problem. Optimizing it requires thinking holistically, from tokens to tensors to threads.

    Image inspo: A Survey on Efficient Inference for Large Language Models

    Follow me (Aishwarya Srinivasan) for more AI insights!
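
To make the Speculative Decoding item above concrete, here is a minimal greedy speculative-decoding sketch: a small drafter proposes a few tokens, the larger model verifies them in a single forward pass, and the longest agreeing prefix is accepted. The model names (distilgpt2/gpt2) are stand-ins chosen only because they share a tokenizer; production systems use more sophisticated acceptance rules.

```python
# Minimal greedy speculative decoding sketch (illustrative, not production code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")                 # drafter and target share this tokenizer
draft = AutoModelForCausalLM.from_pretrained("distilgpt2")  # small, fast drafter
target = AutoModelForCausalLM.from_pretrained("gpt2")       # larger verifier

@torch.no_grad()
def speculative_generate(prompt: str, max_new_tokens: int = 48, k: int = 4) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    stop_len = ids.shape[1] + max_new_tokens
    while ids.shape[1] < stop_len:
        # 1) draft k tokens greedily with the small model
        draft_ids = ids
        for _ in range(k):
            next_id = draft(draft_ids).logits[:, -1, :].argmax(-1, keepdim=True)
            draft_ids = torch.cat([draft_ids, next_id], dim=-1)
        proposed = draft_ids[:, ids.shape[1]:]                   # (1, k) drafted tokens
        # 2) score all k proposals with ONE forward pass of the target model
        logits = target(draft_ids).logits
        verified = logits[:, ids.shape[1] - 1:-1, :].argmax(-1)  # target's pick at each drafted position
        # 3) accept the longest prefix where drafter and target agree
        n_accept = 0
        while n_accept < k and proposed[0, n_accept] == verified[0, n_accept]:
            n_accept += 1
        if n_accept == k:                                  # every draft accepted: take a bonus token
            extra = logits[:, -1, :].argmax(-1, keepdim=True)
        else:                                              # first mismatch: use the target's token instead
            extra = verified[:, n_accept:n_accept + 1]
        ids = torch.cat([ids, proposed[:, :n_accept], extra], dim=-1)
    return tok.decode(ids[0])

print(speculative_generate("It is good to see"))
```

With greedy decoding the output matches what the target model alone would produce; the win is that several tokens can be committed per expensive forward pass.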

  • View profile for Sriram Natarajan

    Sr. Director @ GEICO | Ex-Google | TEDx Speaker | AI & Tech Advisor

    3,436 followers

    When working with 𝗟𝗟𝗠𝘀, most discussions revolve around improving 𝗺𝗼𝗱𝗲𝗹 𝗮𝗰𝗰𝘂𝗿𝗮𝗰𝘆, but there's another equally critical challenge: 𝗹𝗮𝘁𝗲𝗻𝗰𝘆. Unlike traditional systems, these models require careful orchestration of multiple stages, from processing prompts to delivering output, each with its own unique bottlenecks. Here's a 5-step process to minimize latency effectively:

    1️⃣ 𝗣𝗿𝗼𝗺𝗽𝘁 𝗣𝗿𝗼𝗰𝗲𝘀𝘀𝗶𝗻𝗴: Optimize by caching repetitive prompts and running auxiliary tasks (e.g., safety checks) in parallel.
    2️⃣ 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗣𝗿𝗼𝗰𝗲𝘀𝘀𝗶𝗻𝗴: Summarize and cache context, especially in multimodal systems. 𝘌𝘹𝘢𝘮𝘱𝘭𝘦: 𝘐𝘯 𝘥𝘰𝘤𝘶𝘮𝘦𝘯𝘵 𝘴𝘶𝘮𝘮𝘢𝘳𝘪𝘻𝘦𝘳𝘴, 𝘤𝘢𝘤𝘩𝘪𝘯𝘨 𝘦𝘹𝘵𝘳𝘢𝘤𝘵𝘦𝘥 𝘵𝘦𝘹𝘵 𝘦𝘮𝘣𝘦𝘥𝘥𝘪𝘯𝘨𝘴 𝘴𝘪𝘨𝘯𝘪𝘧𝘪𝘤𝘢𝘯𝘵𝘭𝘺 𝘳𝘦𝘥𝘶𝘤𝘦𝘴 𝘭𝘢𝘵𝘦𝘯𝘤𝘺 𝘥𝘶𝘳𝘪𝘯𝘨 𝘪𝘯𝘧𝘦𝘳𝘦𝘯𝘤𝘦.
    3️⃣ 𝗠𝗼𝗱𝗲𝗹 𝗥𝗲𝗮𝗱𝗶𝗻𝗲𝘀𝘀: Avoid cold-boot delays by preloading models or periodically waking them up in resource-constrained environments.
    4️⃣ 𝗠𝗼𝗱𝗲𝗹 𝗣𝗿𝗼𝗰𝗲𝘀𝘀𝗶𝗻𝗴: Focus on metrics like 𝗧𝗶𝗺𝗲 𝘁𝗼 𝗙𝗶𝗿𝘀𝘁 𝗧𝗼𝗸𝗲𝗻 (𝗧𝗧𝗙𝗧) and 𝗜𝗻𝘁𝗲𝗿-𝗧𝗼𝗸𝗲𝗻 𝗟𝗮𝘁𝗲𝗻𝗰𝘆 (𝗜𝗧𝗟). Techniques like 𝘁𝗼𝗸𝗲𝗻 𝘀𝘁𝗿𝗲𝗮𝗺𝗶𝗻𝗴 and 𝗾𝘂𝗮𝗻𝘁𝗶𝘇𝗮𝘁𝗶𝗼𝗻 can make a big difference.
    5️⃣ 𝗢𝘂𝘁𝗽𝘂𝘁 𝗔𝗻𝗮𝗹𝘆𝘀𝗶𝘀: Stream responses in real-time and optimize guardrails to improve speed without sacrificing quality.

    It's ideal to think about latency optimization upfront, avoiding the burden of tech debt or scrambling through 'code yellow' fire drills closer to launch. Addressing it systematically can significantly elevate the performance and usability of LLM-powered applications.

    #AI #LLM #MachineLearning #Latency #GenerativeAI
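
Since TTFT and inter-token latency come up in step 4, here is a rough way to measure both from the client side of a streaming endpoint. The snippet uses the OpenAI Python SDK purely as an example; the model name and prompt are placeholders, and the same timing pattern applies to any streaming API.

```python
# Client-side TTFT and inter-token latency measurement (illustrative sketch).
import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Summarize the benefits of token streaming."}],
    stream=True,
)

ttft = None
arrival_times = []
for chunk in stream:
    if not chunk.choices or not chunk.choices[0].delta.content:
        continue                                  # skip role-only/empty deltas
    now = time.perf_counter()
    if ttft is None:
        ttft = now - start                        # Time to First Token
    arrival_times.append(now)

itl = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
print(f"TTFT: {ttft:.3f} s")
if itl:
    print(f"mean inter-token latency: {1000 * sum(itl) / len(itl):.1f} ms")
```

Note that each streamed chunk may carry more than one token, so this measures chunk-level gaps; it is still a useful client-side proxy for ITL.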

  • View profile for Yangqing Jia

    Co-founder & CEO of Lepton AI (now part of NVidia). Hiring top talents.

    9,493 followers

    People often ask how prices like $2.8 per million tokens for Llama 405B can be both super fast and still profitable at Lepton AI. We've even been asked by a leading GPU provider! So, I figured we should share some technical analysis. This information could benefit the community. We've taken these statistics and analyses for granted, but they might not be obvious to everyone.

    1. Big batches: Each request receives an output of ~30 tokens/second. Batching (grouping multiple requests simultaneously) significantly improves total throughput, often 10x or higher than a single request. GPUs are more efficient with larger batches.
    2. Dynamic batching: This technique immediately adds a new request to an existing batch instead of making it wait, ensuring the GPU always works at high capacity.
    3. Input tokens: The ~30 tokens/second refers to output tokens. Input tokens are processed much faster (known as "prefilling"). Typically, the input is many times longer than the output (3x to 10x). This increases the total number of tokens processed, explaining why there is often separate billing for input and output.
    4. Quantization: Using 8-bit integers or 8-bit floats instead of 16-bit floats reduces memory usage and speeds up processing because the GPU accesses less memory. Newer GPUs also have hardware instructions for lower-bit numbers, increasing speed further. For example, the new NVIDIA Blackwell GPU supports 4-bit floats (fp4). Quantization also saves memory, allowing even bigger batches (see point 1), making it more economical.
    5. Speculative decoding: This method uses a smaller model to predict the next token. For example, predicting "you" after "it is good to see" doesn't require a large model. Smaller models make such predictions faster. The Medusa algorithm by Tianle Cai is a specific example of this approach.
    6. Prompt caching: LLMs often encounter repeated prefixes, such as "you are a smart AI agent" in system prompts. Caching these prefilled prompts avoids recalculating them, speeding up repeated requests.
    7. Optimizing GPU setups: This involves using large GPUs for big models, small GPUs for small models, and matching GPUs to specific tasks—some are better for prefilling, others for decoding. There are many optimization opportunities here.

    This is not a complete list. We integrate these methods (and a growing number more) in our runtime to ensure profitability with reasonable traffic. Lepton was created by experts who have developed key AI software over the past decade - Caffe, onnx, pytorch - alongside cloud experts like the creator of etcd and core contributors to Kubernetes. We provide not only LLM APIs, but also a full cloud-native experience to help you find, use, and optimize GPUs on our cloud platform. We love the open-source and open-access community.

    What AI technical explanation would you like to hear next?
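
A quick back-of-the-envelope check of the batching argument above. Every number here is an illustrative assumption (per-request speed, batch size, scaling efficiency), not a measured figure from any provider.

```python
# Rough batching economics sketch; all inputs are assumptions for illustration.
per_request_tps = 30        # output tokens/sec seen by each user
batch_size = 64             # concurrent requests sharing the same GPUs
batching_efficiency = 0.5   # throughput rarely scales perfectly with batch size
price_per_m_output = 2.8    # $ per million output tokens

aggregate_tps = per_request_tps * batch_size * batching_efficiency
tokens_per_hour = aggregate_tps * 3600
revenue_per_hour = tokens_per_hour / 1e6 * price_per_m_output

print(f"aggregate throughput: {aggregate_tps:,.0f} output tok/s")
print(f"output-token revenue: ${revenue_per_hour:.2f}/hour (input-token billing not included)")
```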

  • View profile for Sivasankar Natarajan

    Technical Director | GenAI Practitioner | Azure Cloud Architect | Data & Analytics | Solutioning What’s Next

    8,820 followers

    𝐂𝐨𝐧𝐭𝐞𝐱𝐭 𝐞𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐢𝐧𝐠 is the process of deliberately designing, structuring, and manipulating the inputs, metadata, memory, and environment surrounding an LLM to produce better, more reliable, and more useful outputs.

    𝐇𝐞𝐫𝐞’𝐬 𝐡𝐨𝐰 𝐭𝐨 𝐭𝐡𝐢𝐧𝐤 𝐚𝐛𝐨𝐮𝐭 𝐢𝐭:
    - LLM is the CPU
    - Context Window is the RAM
    - Context Engineering is your OS

    Just like RAM, the context window has strict limits. What you load into it, and when, defines everything from performance to reliability. Think of it as "𝐏𝐫𝐨𝐦𝐩𝐭 𝐄𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐢𝐧𝐠" on steroids, with a focus on providing a rich and structured environment for the LLM to work within.

    𝐇𝐞𝐫𝐞’𝐬 𝐭𝐡𝐞 𝐟𝐫𝐚𝐦𝐞𝐰𝐨𝐫𝐤 𝐈 𝐤𝐞𝐞𝐩 𝐜𝐨𝐦𝐢𝐧𝐠 𝐛𝐚𝐜𝐤 𝐭𝐨: 𝐓𝐡𝐞 𝟒 𝐂𝐬 𝐨𝐟 𝐂𝐨𝐧𝐭𝐞𝐱𝐭 𝐄𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐢𝐧𝐠:

    1. Save Context
    Store important information outside the context window so it can be reused later.
    - Log task results
    - Store conversation states and chat history
    - Persist metadata
    This is about memory. Offload what the model doesn't need right now but might need soon.

    2. Select Context
    Pull relevant information into the context window for the task at hand.
    - Use search (RAG)
    - Look up memory
    - Query prior interactions
    Selection quality = output quality. Garbage in, garbage out.

    3. Compress Context
    When you exceed token limits, you compress.
    - Summarize
    - Cluster with embeddings
    - Trim token-by-token
    Think like a systems engineer. Signal > noise. Token budgets are real.

    4. Isolate Context
    Sometimes the best boost in performance comes from narrowing scope.
    - Scope to one subtask
    - Modularize agents
    - Run isolated threads
    Less clutter = fewer hallucinations = more deterministic behavior.

    𝐖𝐡𝐲 𝐭𝐡𝐢𝐬 𝐦𝐚𝐭𝐭𝐞𝐫𝐬? Most LLM failures aren't because of weak prompts. They fail because the context window is overloaded, underutilized, or just ignored.

    𝐋𝐞𝐭 𝐦𝐞 𝐤𝐧𝐨𝐰 𝐢𝐟 𝐲𝐨𝐮 𝐰𝐚𝐧𝐭 𝐚 𝐫𝐮𝐧𝐝𝐨𝐰𝐧 𝐨𝐟 𝐏𝐫𝐨𝐦𝐩𝐭 𝐄𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐢𝐧𝐠 𝐯𝐬 𝐂𝐨𝐧𝐭𝐞𝐱𝐭 𝐄𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐢𝐧𝐠.
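
Below is a deliberately tiny sketch of the 4 Cs as a single class, purely to make the framework tangible. Every name, the keyword-overlap "retrieval", and the 4-characters-per-token trimming heuristic are hypothetical stand-ins for real memory stores, retrievers, and summarizers.

```python
# Toy "4 Cs" context manager; all components are simplistic stand-ins.
from dataclasses import dataclass, field


@dataclass
class ContextEngineer:
    token_budget: int = 8000
    memory: list[str] = field(default_factory=list)        # context saved outside the window

    def save(self, item: str) -> None:                      # 1. Save: offload to memory
        self.memory.append(item)

    def select(self, query: str, k: int = 3) -> list[str]:  # 2. Select: naive keyword "RAG"
        words = query.lower().split()
        ranked = sorted(self.memory, key=lambda m: -sum(w in m.lower() for w in words))
        return ranked[:k]

    def compress(self, chunks: list[str]) -> str:           # 3. Compress: trim to the budget
        text = "\n".join(chunks)
        return text[: self.token_budget * 4]                 # ~4 characters per token heuristic

    def build_prompt(self, task: str, query: str) -> str:   # 4. Isolate: one scoped subtask per prompt
        context = self.compress(self.select(query))
        return f"Task: {task}\nRelevant context:\n{context}\n\nQuestion: {query}"


ce = ContextEngineer()
ce.save("User prefers concise, bulleted answers.")
ce.save("Previous ticket: billing issue resolved on 2024-05-01.")
print(ce.build_prompt("Customer support reply", "Any open billing issues for this user?"))
```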

  • View profile for Sebastian Raschka, PhD
    Sebastian Raschka, PhD is an Influencer

    ML/AI research engineer. Author of Build a Large Language Model From Scratch (amzn.to/4fqvn0D) and Ahead of AI (magazine.sebastianraschka.com), on how LLMs work and the latest developments in the field.

    205,718 followers

    I just published a new tutorial article explaining how KV caching works in LLMs, both conceptually and in code, with a clean, from-scratch implementation. It's one of the key techniques for efficient LLM inference.

    While recovering from an injury and taking a break from more research-heavy writing in the last few weeks, I wanted to share this practical guide on a topic many readers asked about (and one I deliberately left out of the Build a Large Language Model From Scratch book due to its added complexity).

    In this tutorial, I walk through:
    1. Why LLMs recompute attention weights inefficiently during generation
    2. How a KV cache avoids that by storing key/value vectors for reuse
    3. A side-by-side walkthrough of inference with and without caching
    4. Step-by-step code changes to implement caching in a readable way
    5. Performance comparison and key optimizations (like preallocation and sliding windows)

    Even with a tiny 124M-parameter model, enabling KV caching led to a substantial speed-up in generation.

    🔗 Full tutorial: https://lnkd.in/g-vYFVTa

    Happy reading, and as always, feel free to share feedback or questions!
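
For readers who want the core idea without the full tutorial, here is a minimal single-head sketch (not Sebastian's code) of what a KV cache stores: at each decoding step only the new token is projected, while previously computed keys and values are reused from the cache.

```python
# Minimal single-head attention with a growing KV cache (illustrative only).
import torch

torch.manual_seed(0)
d = 16
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))   # fixed random projections

def decode_step(x_new, kv_cache):
    """x_new: (1, d) embedding of the newest token; kv_cache: (K, V) tensors or None."""
    q = x_new @ Wq
    k_new, v_new = x_new @ Wk, x_new @ Wv              # only the NEW token is projected
    if kv_cache is None:
        K, V = k_new, v_new
    else:
        K = torch.cat([kv_cache[0], k_new], dim=0)     # reuse cached keys
        V = torch.cat([kv_cache[1], v_new], dim=0)     # reuse cached values
    attn = torch.softmax(q @ K.T / d**0.5, dim=-1)     # attend over all tokens so far
    return attn @ V, (K, V)

cache = None
for step in range(5):                                  # five decoding steps
    x = torch.randn(1, d)                              # stand-in for the newest token's embedding
    out, cache = decode_step(x, cache)
print("cached keys shape:", cache[0].shape)            # (5, d): one cached row per generated token
```

Without the cache, every step would re-project and re-attend over the entire sequence, which is exactly the recomputation the tutorial measures.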

  • View profile for Zain Hasan

    AI builder & teacher | AI/ML @ Together AI | ℕΨ EngSci @ UofT | Lecturer | ex-Vector DBs, Data Scientist, Health Tech Founder

    16,265 followers

    You don't need a 2-trillion-parameter model to tell you the capital of France is Paris. Be smart and route between a panel of models according to query difficulty and model specialty!

    A new paper proposes a framework to train a router that sends each query to the appropriate LLM to optimize the trade-off between cost and performance.

    Overview: Model inference cost varies significantly. Per one million output tokens: Llama-3-70b ($1) vs. GPT-4-0613 ($60), Haiku ($1.25) vs. Opus ($75).

    The RouteLLM paper proposes a router training framework based on human preference data and augmentation techniques, demonstrating over 2x cost savings on widely used benchmarks. They define the problem as having to choose between two classes of models:
    (1) strong models - produce high-quality responses but at a high cost (GPT-4o, Claude 3.5)
    (2) weak models - relatively lower quality and lower cost (Mixtral8x7B, Llama3-8b)

    A good router requires a deep understanding of the question's complexity as well as the strengths and weaknesses of the available LLMs. The paper explores different routing approaches:
    - Similarity-weighted (SW) ranking
    - Matrix factorization
    - BERT query classifier
    - Causal LLM query classifier

    Neat ideas to build from:
    - Users can collect a small amount of in-domain data to improve performance for their specific use cases via dataset augmentation.
    - The problem can be expanded from routing between a strong and a weak LLM to multiclass model routing across specialist models (language-vision model, function-calling model, etc.).
    - Larger framework controlled by a router: imagine a system of 15-20 tuned small models, with the router as the (n+1)th model responsible for picking the LLM that will handle a particular query at inference time.
    - MoA architectures: Routing to different architectures of a Mixture of Agents would be a cool idea as well. Depending on the query, you decide how many proposers there should be, how many layers in the mixture, what the aggregate models should be, etc.
    - Route-based caching: If you get redundant queries that are slightly different, route the query plus the previous answer to a small model for light rewriting instead of regenerating the answer.
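
To make the routing idea concrete, here is a toy cost/quality router. The difficulty heuristic and the threshold are naive placeholders for the learned routers the paper studies (SW ranking, matrix factorization, BERT or causal-LM classifiers), and the model names are just labels.

```python
# Toy strong/weak router; the difficulty score is a crude stand-in for a learned router.
STRONG = "gpt-4o"        # high quality, expensive (placeholder label)
WEAK = "llama-3-8b"      # cheaper and faster (placeholder label)

def difficulty(query: str) -> float:
    """Crude proxy: longer, multi-step, math/code-flavored queries score higher."""
    signals = ["prove", "derive", "step by step", "refactor", "compare", "trade-off"]
    return min(1.0, len(query) / 500 + 0.2 * sum(s in query.lower() for s in signals))

def route(query: str, threshold: float = 0.4) -> str:
    return STRONG if difficulty(query) >= threshold else WEAK

print(route("What is the capital of France?"))                               # -> llama-3-8b
print(route("Derive the gradient of softmax cross-entropy step by step."))   # -> gpt-4o
```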

  • Are your LLM apps still hallucinating? Zep used to as well—a lot. Here's how we worked to solve Zep's hallucinations. We've spent a lot of cycles diving into why LLMs hallucinate and experimenting with the most effective techniques to prevent it. Some might sound familiar, but it's the combined approach that really moves the needle.

    First, why do hallucinations happen? A few core reasons:
    🔍 LLMs rely on statistical patterns, not true understanding.
    🎲 Responses are based on probabilities, not verified facts.
    🤔 No innate ability to differentiate truth from plausible fiction.
    📚 Training datasets often include biases, outdated info, or errors.

    Put simply: LLMs predict the next likely word—they don't actually "understand" or verify what's accurate. When prompted beyond their knowledge, they creatively fill gaps with plausible (but incorrect) info. ⚠️ Funny if you're casually chatting—problematic if you're building enterprise apps.

    So, how do you reduce hallucinations effectively? The #1 technique: grounding the LLM in data.
    - Use Retrieval-Augmented Generation (RAG) to anchor responses in verified data.
    - Use long-term memory systems like Zep to ensure the model is always grounded in personalization data: user context, preferences, traits, etc.
    - Fine-tune models on domain-specific datasets to improve response consistency and style, although fine-tuning alone typically doesn't add substantial new factual knowledge.
    - Use explicit, clear prompting—avoid ambiguity or unnecessary complexity.
    - Encourage models to self-verify conclusions when accuracy is essential.
    - Structure complex tasks with chain-of-thought (CoT) prompting to improve outputs, or force "none"/unknown responses when necessary.
    - Strategically tweak model parameters (e.g., temperature, top-p) to limit overly creative outputs.
    - Add post-processing verification for mission-critical outputs, for example, matching to known business states.

    One technique alone rarely solves hallucinations. For maximum ROI, we've found combining RAG with a robust long-term memory solution (like ours at Zep) is the sweet spot. Systems that ground responses in factual, evolving knowledge significantly outperform. Did I miss any good techniques? What are you doing in your apps?
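
As one concrete instance of the grounding advice above, here is a small helper that builds a retrieval-grounded prompt with an explicit "answer only from context, otherwise say I don't know" boundary. The function name, the example chunks, and the citation format are all hypothetical; the retriever and the model call are left out.

```python
# Build a grounded prompt from retrieved chunks (illustrative; retrieval and the LLM call are omitted).
def build_grounded_prompt(question: str, retrieved_chunks: list[str]) -> str:
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks))
    return (
        "Answer ONLY using the context below. "
        "If the context does not contain the answer, reply exactly: I don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer (cite chunk numbers):"
    )

prompt = build_grounded_prompt(
    "What is the refund window for enterprise plans?",
    ["Enterprise plans can be refunded within 30 days of purchase.",
     "Support is available 24/7 for enterprise customers."],
)
print(prompt)
```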

  • View profile for Meri Nova

    ML/AI Engineer | Community Builder | Founder @Break Into Data | ADHD + C-PTSD advocate

    145,152 followers

    You only need 10 prompt engineering techniques to build a production-grade AI application. Save these 👇

    After analyzing 100s of prompting techniques, I found the most common principles that every #AIengineer follows. Keep them in mind when building apps with LLMs:

    1. Stop relying on vague instructions; be explicit instead.
    ❌ Don't say: "Analyze this customer review."
    ✅ Say: "Analyze this customer review and extract 3 actionable insights to improve the product."
    Why? Ambiguity confuses models.

    2. Stop overloading prompts.
    ❌ Asking the model to do everything at once.
    ✅ Break it down: Step 1: Identify the main issues. Step 2: Suggest specific improvements for each issue.
    Why? Smaller steps reduce errors and improve reliability.

    3. Always provide examples.
    ❌ Skipping examples for context-dependent tasks.
    ✅ Follow this example: 'The battery life is terrible.' → Insight: Improve battery performance to meet customer expectations.
    Why? Few-shot examples improve performance.

    4. Stop ignoring instruction placement.
    ❌ Putting the task description in the middle.
    ✅ Place instructions at the start or end of the system prompt.
    Why? Models process beginning and end information more effectively.

    5. Encourage step-by-step thinking.
    ❌ What are the insights from this review?
    ✅ Analyze this review step by step: First, identify the main issues. Then, suggest actionable insights for each issue.
    Why? Chain-of-thought (CoT) prompting reduces errors.

    6. Stop ignoring output formats.
    ❌ Expecting structured outputs without clear instructions.
    ✅ Provide the output as JSON: {'Name': [value], 'Age': [value]}. Use Pydantic to validate the LLM outputs.
    Why? Explicit formats prevent unnecessary or malformed text.

    7. Restrict to the provided context.
    ❌ Answer the question about a customer.
    ✅ Answer only using the customer's context below. If unsure, respond with 'I don't know.'
    Why? Clear boundaries prevent reliance on inaccurate internal knowledge.

    8. Stop assuming that the first version of a prompt is the best version.
    ❌ Never iterating on prompts.
    ✅ Use the model to critique and refine your prompt.

    9. Don't forget about the edge cases.
    ❌ Designing for the "ideal" or most common inputs.
    ✅ Test different edge cases and specify fallback instructions.
    Why? Real-world use often involves imperfect inputs. Cover for most of them.

    10. Stop overlooking prompt security; design prompts defensively.
    ❌ Ignoring risks like prompt injection or extraction.
    ✅ Explicitly define boundaries: "Do not return sensitive information."
    Why? Defensive prompts reduce vulnerabilities and prevent harmful outputs.

    #promptengineering
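
Tip 6 pairs naturally with a validation step, so here is a small hedged example of validating a raw LLM string against a Pydantic (v2) schema. The schema fields and the sample output are made up; on failure you would typically re-prompt with the validation errors.

```python
# Validate structured LLM output with Pydantic v2 (schema and sample output are illustrative).
from pydantic import BaseModel, ValidationError

class ReviewInsights(BaseModel):
    name: str
    age: int
    insights: list[str]

raw_llm_output = '{"name": "Alice", "age": 34, "insights": ["Improve battery life"]}'

try:
    parsed = ReviewInsights.model_validate_json(raw_llm_output)
    print(parsed.insights)
except ValidationError as err:
    # Common fallback: send the errors back to the model and ask it to fix the JSON.
    print("Malformed output, re-prompt with:", err.errors())
```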

  • View profile for Armand Ruiz
    Armand Ruiz is an Influencer

    building AI systems

    202,072 followers

    Guide to Building an AI Agent

    1️⃣ 𝗖𝗵𝗼𝗼𝘀𝗲 𝘁𝗵𝗲 𝗥𝗶𝗴𝗵𝘁 𝗟𝗟𝗠
    Not all LLMs are equal. Pick one that:
    - Excels in reasoning benchmarks
    - Supports chain-of-thought (CoT) prompting
    - Delivers consistent responses
    📌 Tip: Experiment with models & fine-tune prompts to enhance reasoning.

    2️⃣ 𝗗𝗲𝗳𝗶𝗻𝗲 𝘁𝗵𝗲 𝗔𝗴𝗲𝗻𝘁’𝘀 𝗖𝗼𝗻𝘁𝗿𝗼𝗹 𝗟𝗼𝗴𝗶𝗰
    Your agent needs a strategy:
    - Tool Use: Call tools when needed; otherwise, respond directly.
    - Basic Reflection: Generate, critique, and refine responses.
    - ReAct: Plan, execute, observe, and iterate.
    - Plan-then-Execute: Outline all steps first, then execute.
    📌 Choosing the right approach improves reasoning & reliability.

    3️⃣ 𝗗𝗲𝗳𝗶𝗻𝗲 𝗖𝗼𝗿𝗲 𝗜𝗻𝘀𝘁𝗿𝘂𝗰𝘁𝗶𝗼𝗻𝘀 & 𝗙𝗲𝗮𝘁𝘂𝗿𝗲𝘀
    Set operational rules:
    - How to handle unclear queries? (Ask clarifying questions)
    - When to use external tools?
    - Formatting rules? (Markdown, JSON, etc.)
    - Interaction style?
    📌 Clear system prompts shape agent behavior.

    4️⃣ 𝗜𝗺𝗽𝗹𝗲𝗺𝗲𝗻𝘁 𝗮 𝗠𝗲𝗺𝗼𝗿𝘆 𝗦𝘁𝗿𝗮𝘁𝗲𝗴𝘆
    LLMs forget past interactions. Memory strategies:
    - Sliding Window: Retain recent turns, discard old ones.
    - Summarized Memory: Condense key points for recall.
    - Long-Term Memory: Store user preferences for personalization.
    📌 Example: A financial AI recalls risk tolerance from past chats.

    5️⃣ 𝗘𝗾𝘂𝗶𝗽 𝘁𝗵𝗲 𝗔𝗴𝗲𝗻𝘁 𝘄𝗶𝘁𝗵 𝗧𝗼𝗼𝗹𝘀 & 𝗔𝗣𝗜𝘀
    Extend capabilities with external tools:
    - Name: Clear, intuitive (e.g., "StockPriceRetriever")
    - Description: What does it do?
    - Schemas: Define input/output formats
    - Error Handling: How to manage failures?
    📌 Example: A support AI retrieves order details via a CRM API.

    6️⃣ 𝗗𝗲𝗳𝗶𝗻𝗲 𝘁𝗵𝗲 𝗔𝗴𝗲𝗻𝘁’𝘀 𝗥𝗼𝗹𝗲 & 𝗞𝗲𝘆 𝗧𝗮𝘀𝗸𝘀
    Narrowly defined agents perform better. Clarify:
    - Mission: (e.g., "I analyze datasets for insights.")
    - Key Tasks: (Summarizing, visualizing, analyzing)
    - Limitations: ("I don't offer legal advice.")
    📌 Example: A financial AI focuses on finance, not general knowledge.

    7️⃣ 𝗛𝗮𝗻𝗱𝗹𝗶𝗻𝗴 𝗥𝗮𝘄 𝗟𝗟𝗠 𝗢𝘂𝘁𝗽𝘂𝘁𝘀
    Post-process responses for structure & accuracy:
    - Convert AI output to structured formats (JSON, tables)
    - Validate correctness before user delivery
    - Ensure correct tool execution
    📌 Example: A financial AI converts extracted data into JSON.

    8️⃣ 𝗦𝗰𝗮𝗹𝗶𝗻𝗴 𝘁𝗼 𝗠𝘂𝗹𝘁𝗶-𝗔𝗴𝗲𝗻𝘁 𝗦𝘆𝘀𝘁𝗲𝗺𝘀 (𝗔𝗱𝘃𝗮𝗻𝗰𝗲𝗱)
    For complex workflows:
    - Info Sharing: What context is passed between agents?
    - Error Handling: What if one agent fails?
    - State Management: How to pause/resume tasks?
    📌 Example: 1️⃣ One agent fetches data 2️⃣ Another summarizes 3️⃣ A third generates a report

    Master the fundamentals, experiment, and refine... now go build something amazing! Happy agenting! 🤖
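
A bare-bones sketch tying together steps 2, 5, and 7: a tool-use control loop with one registered tool and JSON-based tool calls. Everything here (the tool, the system-prompt convention, the scripted stand-in for the model) is hypothetical; a real agent would plug an actual LLM client into run_agent.

```python
# Minimal tool-use agent loop (illustrative; the "LLM" below is a scripted stand-in).
import json

def stock_price_retriever(ticker: str) -> dict:
    """Stub for a real market-data API."""
    return {"ticker": ticker, "price": 123.45}

TOOLS = {"StockPriceRetriever": stock_price_retriever}

SYSTEM = (
    "You are a finance assistant. If you need data, reply ONLY with JSON like "
    '{"tool": "StockPriceRetriever", "args": {"ticker": "ACME"}}; '
    "otherwise reply with plain text."
)

def run_agent(llm, user_query: str, max_steps: int = 5) -> str:
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": user_query}]
    for _ in range(max_steps):
        reply = llm(messages)
        try:
            call = json.loads(reply)                                  # the model asked for a tool
            result = TOOLS[call["tool"]](**call["args"])
            messages.append({"role": "user",
                             "content": f"Tool result: {json.dumps(result)}"})
        except (json.JSONDecodeError, KeyError, TypeError):
            return reply                                              # plain-text final answer
    return "Stopped: too many tool calls."

# Scripted responses stand in for a real model so the loop runs end to end.
script = ['{"tool": "StockPriceRetriever", "args": {"ticker": "ACME"}}',
          "ACME is trading at $123.45."]
print(run_agent(lambda messages: script.pop(0), "What is ACME trading at?"))
```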

  • View profile for Goku Mohandas

    ML Lead at Anyscale

    25,958 followers

    Excited to share our end-to-end LLM workflows guide that we've used to help our industry customers fine-tune and serve OSS LLMs that outperform closed-source models in quality, performance, and cost.

    Key LLM workloads with docs.ray.io and Anyscale:
    - 🔢 Preprocess our dataset (filter, schema, etc.) with batch data processing.
    - 🛠️ Fine-tune our LLMs (ex. Meta Llama 3) with full control (LoRA/full param, compute, loss, etc.) and optimizations (parallelism, mixed precision, flash attn, etc.) with distributed training.
    - ⚖️ Evaluate our fine-tuned LLMs with batch inference using Ray + vLLM.
    - 🚀 Serve our LLMs as a production application that can autoscale, swap between LoRA adapters, optimize for latency/throughput, etc.

    Key Anyscale infra capabilities that keep these workloads efficient and cost-effective:
    - ✨ Automatically provision worker nodes (ex. GPUs) based on our workload's needs. They'll spin up, run the workload, and then scale back to zero (only pay for compute when needed).
    - 🔋 Execute workloads (ex. fine-tuning) with commodity hardware (A10s) instead of waiting for inaccessible resources (H100s) by using data/model parallelism.
    - 🔙 Configure spot-instance to on-demand fallback (or vice versa) for cost savings.
    - 🔄 Swap between multiple LoRA adapters using one base model (optimized with multiplexing).
    - ⚡️ Autoscale to meet demand and scale back to zero.

    🆓 You can run this guide entirely for free on Anyscale (no credit card needed). Instructions in the links below 👇

    🔗 Links:
    - Blog post: https://lnkd.in/gvPQGzjh
    - GitHub repo: https://lnkd.in/gxzzuFAE
    - Notebook: https://lnkd.in/gmMxb36y
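
For the batch-evaluation step, here is a hedged sketch of offline batch inference with vLLM's Python API (it needs a GPU and access to the weights; the checkpoint name, prompts, and sampling settings are placeholders, and you would point it at your fine-tuned model instead).

```python
# Offline batched generation with vLLM (illustrative; requires a GPU and model access).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")   # placeholder; use your fine-tuned checkpoint
params = SamplingParams(temperature=0.0, max_tokens=256)

prompts = [f"Summarize: {doc}" for doc in ["first evaluation document ...",
                                           "second evaluation document ..."]]
outputs = llm.generate(prompts, params)                  # requests are batched under the hood

for out in outputs:
    print(out.outputs[0].text[:120])
```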
