From the course: Enterprise AI Development with GitHub Models and Azure

Architectural overview

- [Instructor] In this chapter, we are going to build a solution step by step using GitHub Models. For the development environment, we will use a GitHub Codespace. That means you do not have to install any dependencies locally on your own machine, and it means we can get started rather easily. We will also deploy the solution to Azure OpenAI, where I will show you the moving parts of that solution. You can use the sample repository shown here on the screen to follow along with the steps.

Before we get started, let's take a look at what we're going to build and what the requirements are. We are going to build a solution that enables our users to chat with an internal knowledge base. This could be any data source you have access to, as long as you have a way to convert it into text. For the knowledge base in this example, I will use the posts from my own blog. These are already stored nicely inside of a repository, and they are stored as Markdown, which means we can index them as plain text. The goal is to enable a user to ask questions about the topics in the blog posts using natural language. Bonus points if they can use their own native language to ask the questions. The last requirement is that I want to display references to the sources so that the user has a way to validate the information.

Let's take a look at the data flow of this solution. First, we start with a data source and ingest that data source into a vector store. We then receive a user prompt with a question about data that might be available in our data source. We send the user prompt to the vector store to find the relevant documents. From the vector store, we receive relevant fragments from the documents that contain information relevant for answering the prompt. We send both the fragments and the prompt to a large language model so it can use the information in the fragments to answer the user's question. We then send the response from the model back to the user. For the vector store, I will use text-embedding-3-small from OpenAI, as that model can translate the text documents into a vector representation of the text. For the response in natural language, I will use the GPT-4o mini model, as that is a model with broad capabilities that can be used in multiple languages.

Let's take a look at setting up the vector store. We'll start with the GitHub repository. It already contains a folder with all the blog posts in Markdown. We can clone that repository to a local folder and then ingest the files with a directory reader. We can then send the files into the vector store index. The vector store index will use the embeddings model to convert the text documents into embeddings. The directory reader and vector store index are available in the LlamaIndex SDK. This SDK does the heavy lifting for us; we only have to feed it the data and tell it to convert it using the OpenAI endpoint.

Let's see the solution so far in action. In the example repository, I have configured this solution using Python scripts. You can follow the instructions in the README to learn all about the moving pieces. In the repository, there is a local script that is the starting point of the application. In the utils file, all the functions are stored for an easy overview. It starts with importing all the packages at the top of the file, and then we set up everything we need to let LlamaIndex communicate with the GitHub Models API.
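To make that setup step concrete, here is a minimal configuration sketch, assuming the GitHub Models OpenAI-compatible endpoint and a GITHUB_TOKEN environment variable; the exact endpoint URL and variable names are my own assumptions and may differ from the course repository.

```python
# Minimal LlamaIndex configuration sketch for GitHub Models (assumptions noted above).
import os

from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

# Assumed OpenAI-compatible GitHub Models endpoint and token variable.
ENDPOINT = "https://models.inference.ai.azure.com"
TOKEN = os.environ["GITHUB_TOKEN"]

# text-embedding-3-small converts each document into an embedding vector.
Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small", api_key=TOKEN, api_base=ENDPOINT
)

# GPT-4o mini generates the natural language answers.
Settings.llm = OpenAI(model="gpt-4o-mini", api_key=TOKEN, api_base=ENDPOINT)
```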
As the next step, we download the blog repository if it doesn't exist yet. This is executed using a Git clone operation to download the public repository. If the directory with the vector information does not exist yet, we start by calling the SimpleDirectoryReader and loading the vector store index from those documents. This will call the configured embeddings model to retrieve the embeddings for the documents. The vector store will then keep those embeddings in memory. To skip this step the next time we run this script, we store the index to disk, which saves the requests to the model. The next time this script runs, we load the data again, but now from the directory where the vector store was persisted.

Let's go back to the main script. From the index object, we can now get a retriever. This retriever can look in the index and find the parts of the input documents that match a given user prompt. It returns those parts as what are called fragments. Using those fragments, we can check the index on line 56 to find the document each fragment was a part of. We want to show these references to the user so that they can validate the response. Since the documents are local files in the repository on disk, we need to convert those documents back to the URLs of my blog so that the user can click on those links and validate the information in the response. As the system prompt, you can see that we have configured that the model should behave like a helpful assistant that will retrieve information from a given context. Next to that, we insert the context data to base the results on; those are the fragments we have found. For the user input, we use the same prompt that was used to find the relevant documents, and at the end of the script, we print the response of the large language model to show back to the user.

Let's run the script to see the moving parts in action. In the logs, we can see the output of the rate limits first; I'm logging those so that we can keep track of how heavily we are using the API. Since the repository is not stored yet, a Git clone is executed to download the data source. Note that the size of the directory is only one megabyte. As the next step, we can now read all the blog posts with the directory reader and send each document to the embeddings model. You can see that the LlamaIndex SDK does all the heavy lifting for us and sends the different files to the model. Note that the number of calls is much lower than the number of files we have in the directory; that also comes with the help of the LlamaIndex SDK. As the next step, we persist the new vector index to disk so that we can use it the next time we need it. Now note the size of the persisted objects on disk. At 34 megabytes for the vector store, we now have a lot of extra data available. This is a good example of how much extra data is needed to represent the documents as embeddings. Here we can see how long the retrieval took to find the fragments we need in the documents that are stored in the vector store. Also note the relevance score of the fragment at the end of the line. This is always a value between zero and one: the higher the score, the better the fragment fits the prompt. We then map the files on disk to the URLs of the blog posts so that we can give the user links to click on and validate the information. We now send the fragment text together with the user prompt to the GPT-4o mini model and let the model answer the question with the information we found in the vector store.
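Putting those pieces together, the sketch below shows roughly what such a local script could look like, assuming the LlamaIndex settings from the previous sketch are already in place; the folder paths, example prompt, and URL mapping are illustrative assumptions rather than the exact code from the course repository.

```python
# Sketch of the local script flow: build or load the index, retrieve fragments,
# map them to blog URLs, and let GPT-4o mini answer. Paths, prompt, and the URL
# mapping below are illustrative assumptions.
import os
from pathlib import Path

from llama_index.core import (
    Settings,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_index.core.llms import ChatMessage

PERSIST_DIR = "./storage"   # where the vector index is persisted
BLOG_DIR = "./blog"         # cloned blog repository with markdown posts

if os.path.isdir(PERSIST_DIR):
    # Reuse the persisted embeddings; no new calls to the embeddings model.
    index = load_index_from_storage(
        StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    )
else:
    # First run: embed every blog post and store the index on disk.
    documents = SimpleDirectoryReader(BLOG_DIR, recursive=True).load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)

user_prompt = "How do I configure a self-hosted GitHub Actions runner?"

# Retrieve the fragments that best match the prompt; each carries a 0-1 score.
fragments = index.as_retriever(similarity_top_k=3).retrieve(user_prompt)

references = []
for fragment in fragments:
    print(f"score={fragment.score:.2f}  {fragment.node.metadata.get('file_path')}")
    # Map the local markdown file back to a public blog URL (hypothetical mapping).
    slug = Path(fragment.node.metadata["file_path"]).stem
    references.append(f"https://example-blog.dev/{slug}/")

# Send the fragments as context together with the original question to the model.
context = "\n\n".join(f.node.get_content() for f in fragments)
response = Settings.llm.chat([
    ChatMessage(
        role="system",
        content="You are a helpful assistant. Answer the question using only "
                "the following context:\n" + context,
    ),
    ChatMessage(role="user", content=user_prompt),
])
print(response.message.content)
print("Sources:", ", ".join(references))
```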
And with that information, we get a natural language answer to our prompt. With this script, we have captured all the requirements of our solution: we load the data source into a vector store, send the user prompt first to the vector store, and then send the resulting fragments together with the prompt into a large language model. For the sake of simplicity, I have left out creating an entire user interface to let the user have a chat conversation with the data source; calling the Python script is, in our case, enough of a user interface for now. We have also seen that the creation of the vector index is pretty fast. Since we do not want to use a costly compute option with a model every time we run the script, we can store that index on disk. We have seen that the stored index is pretty bulky compared to the original files. What also stands out is that reading the vector store from disk is quite a costly operation, especially with this kind of small data source. Handling this at scale needs a different solution to be able to handle incoming prompts in a timely manner. We could store the vector index in a database and speed things up that way, or we can use cloud resources that are tailored for these kinds of use cases. In the next video, I'll show how to leverage Azure resources to host this data source in a more robust solution.
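As a side note, the "store the vector index in a database" option could look like the hedged sketch below, which swaps the on-disk persistence for a local Chroma vector database through LlamaIndex. This is purely an illustration (it assumes the chromadb and llama-index-vector-stores-chroma packages) and is not the approach used in the course, which moves to Azure instead.

```python
# Illustrative alternative only: keep the embeddings in a Chroma vector database
# instead of the default on-disk persistence. Not part of the course repository.
import chromadb
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore

# A persistent local Chroma collection for the blog embeddings (path is an assumption).
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("blog-posts")
vector_store = ChromaVectorStore(chroma_collection=collection)

# First run: embed the documents straight into the Chroma collection.
documents = SimpleDirectoryReader("./blog", recursive=True).load_data()
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=StorageContext.from_defaults(vector_store=vector_store),
)

# Later runs: reconnect to the existing collection without re-embedding.
index = VectorStoreIndex.from_vector_store(vector_store)
retriever = index.as_retriever(similarity_top_k=3)
```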
