From the course: Hands-On AI: Build a RAG Model from Scratch with Open Source
Collecting data to generate our corpus
- [Instructor] Once we have our LLM up and running, we need to generate data for our corpus. The corpus is a collection of texts that our RAG model will source knowledge from in order to generate a response to a user's query. Later in the course, we'll learn how our RAG model finds the parts of the corpus most relevant to a given query and instructs the LLM to use knowledge from those texts to generate the response. But for now, we'll simply be focusing on generating this corpus. One of the easiest ways to generate a corpus for educational purposes is by pulling Wikipedia articles. There's a Python package called wikipedia, which gives us the ability to perform a search over all Wikipedia articles and extract the article titles and text. Using it, we can generate a corpus relevant to any topic of interest available on Wikipedia. In practice, you'll be working with a corpus containing documents relevant to the…
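The search-then-extract idea described above can be sketched roughly as follows. This is a minimal illustration, not the course's own code: the names build_corpus and wikipedia_backend, the max_articles parameter, and the example topic are all assumptions made for this sketch. It relies on the third-party wikipedia package's search() and page() functions, and it passes the search and fetch steps in as plain callables so the corpus logic can be exercised without network access.

```python
from typing import Callable, Dict, List


def build_corpus(
    topic: str,
    search: Callable[[str, int], List[str]],
    fetch_text: Callable[[str], str],
    max_articles: int = 5,
) -> Dict[str, str]:
    """Return a {title: text} mapping for articles matching the topic.

    `search` and `fetch_text` are injected (hypothetical design choice)
    so any article source can be plugged in.
    """
    corpus: Dict[str, str] = {}
    for title in search(topic, max_articles):
        try:
            text = fetch_text(title)
        except Exception:
            # Some titles cannot be resolved to a single article
            # (for example disambiguation or missing pages); skip them.
            continue
        if text.strip():  # ignore empty articles
            corpus[title] = text
    return corpus


def wikipedia_backend():
    """Build the two callables from the third-party `wikipedia` package."""
    import wikipedia  # assumed installed: pip install wikipedia

    def search(topic: str, n: int) -> List[str]:
        # Returns a list of article titles matching the query.
        return wikipedia.search(topic, results=n)

    def fetch_text(title: str) -> str:
        # auto_suggest=False asks for the page with this exact title.
        return wikipedia.page(title, auto_suggest=False).content

    return search, fetch_text


# Example usage (requires network access and the `wikipedia` package):
#   search, fetch_text = wikipedia_backend()
#   corpus = build_corpus("machine learning", search, fetch_text)
#   for title, text in corpus.items():
#       print(title, len(text))
```

Catching every exception is a simplification that keeps the sketch short; a more careful version would catch only the package's own lookup errors and let unexpected failures surface.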
Contents
- Running your LLM from open source (2m 16s)
- Collecting data to generate our corpus (1m 54s)
- What are vector embeddings, and how are they generated? (3m 12s)
- Setting up a database and retrieving vectors and files (2m 53s)
- Vectorizing a query and finding relevant text (2m 48s)
- Prompt engineering and packaging pieces together (3m 17s)