From the course: Hands-On AI: Build a RAG Model from Scratch with Open Source
Running your LLM from open source
- [Instructor] Running an LLM from open source is one of our first tasks. There are many ways to take this on, but we'll assume that you have limited computational resources and will need to rely on optimizations that help get things up and running on a CPU, though everything we cover will work well on a GPU too. We'll also assume that you don't intend to train these models, and that you only intend to perform inference. Training is a very resource-intensive process in which you use data to update the weights of your model. Inference is a less resource-intensive process in which you call your model and have it generate tokens to complete a sentence or respond to a question, which is what over 99% of people using LLMs are doing, like when you visit your favorite website that has a chatbot.

We'll be focusing on an implementation of transformer code put together by the open source community called llama.cpp. Its goal is to let anyone who wants to run these models on a laptop do so. We'll do all of our work on a Linux machine by spinning up Ubuntu on GitHub Codespaces, and we'll utilize Ollama, which automates the process of installing different models and running llama.cpp without having to get into the nitty-gritty of the code.

The final thing I'll mention is that we'll also be using quantized versions of the original models to help our inference run faster. That just means that instead of using the original weights of the model, we'll use a truncated version of those weights. So, for example, instead of storing a weight value as a very long number, we store it as a much shorter number, cutting the amount of RAM required to store the weights in half. This makes running everything much more efficient with only a very small loss in performance.

Finally, to wrap everything up, we'll look at a few model parameters that can be tuned, and we'll discuss how they impact inference.
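To make the halving concrete: a 7-billion-parameter model stored as 32-bit floats needs roughly 28 GB of RAM just for its weights, while the same weights stored at 16-bit precision need roughly 14 GB, and more aggressive 4-bit quantization shrinks that further. Below is a minimal sketch (not the course's exact code) of what calling a quantized model through Ollama looks like once it is installed: it posts a prompt to Ollama's local REST API, which listens on port 11434 by default, and passes a few tunable inference parameters. The model tag `llama3:8b-instruct-q4_0` is only an example of a 4-bit quantized variant; substitute whichever model you pull in your own Codespace.

```python
# A minimal sketch: query a locally running Ollama server (which wraps
# llama.cpp) through its REST API. Assumes the model below has already been
# pulled with `ollama pull` and that the Ollama server is listening on the
# default port 11434.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3:8b-instruct-q4_0"  # example tag for a 4-bit quantized model

payload = {
    "model": MODEL,
    "prompt": "In one sentence, what does quantization do to an LLM's weights?",
    "stream": False,          # return the whole completion in one response
    "options": {
        # A few inference parameters that can be tuned:
        "temperature": 0.7,   # higher -> more random token choices
        "top_p": 0.9,         # nucleus-sampling cutoff
        "num_ctx": 2048,      # context window size in tokens
    },
}

response = requests.post(OLLAMA_URL, json=payload, timeout=120)
response.raise_for_status()
print(response.json()["response"])
```

If you prefer the command line, `ollama run <model>` opens an interactive prompt against the same model, and a quantized variant is selected simply by choosing the corresponding model tag when you pull it.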
Contents
- Running your LLM from open source
- Collecting data to generate our corpus
- What are vector embeddings, and how are they generated?
- Setting up a database and retrieving vectors and files
- Vectorizing a query and finding relevant text
- Prompt engineering and packaging pieces together