From the course: Hands-On AI: Build a RAG Model from Scratch with Open Source

Collecting data to generate our corpus

- [Instructor] Once we have our LLM up and running, we need to collect data for our corpus. The corpus is the collection of texts that our RAG model will draw knowledge from in order to generate a response to a user's query. Later in the course, we'll learn how our RAG model finds the parts of the corpus most relevant to a given query and instructs the LLM to use knowledge from those texts to generate the response. But for now, we'll focus simply on building this corpus. One of the easiest ways to build a corpus for educational purposes is by pulling Wikipedia articles. There's a Python package called wikipedia, which lets us search across all Wikipedia articles and extract the article titles and text. Using it, we can generate a corpus relevant to any topic of interest available on Wikipedia. In practice, you'll be working with a corpus containing documents relevant to the…
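
The search-and-extract step described above could look roughly like the following. This is a minimal sketch, not the course's actual code: it assumes the wikipedia package is installed (pip install wikipedia), and the function name build_corpus, its parameters, and the list-of-dictionaries corpus layout are hypothetical choices made for illustration. The search_fn and page_fn arguments exist only so the function can be exercised without network access.

```python
from typing import Callable, Dict, List, Optional


def build_corpus(
    topic: str,
    max_articles: int = 5,
    search_fn: Optional[Callable[..., List[str]]] = None,
    page_fn: Optional[Callable[..., object]] = None,
) -> List[Dict[str, str]]:
    """Search Wikipedia for `topic` and return [{"title": ..., "text": ...}, ...].

    `build_corpus`, `search_fn`, and `page_fn` are illustrative names,
    not part of the wikipedia package or of the course material.
    """
    if search_fn is None or page_fn is None:
        # Imported lazily so the function can be tested with stand-ins.
        import wikipedia  # third-party package: pip install wikipedia

        search_fn = search_fn or wikipedia.search
        page_fn = page_fn or wikipedia.page

    corpus: List[Dict[str, str]] = []
    # wikipedia.search returns a list of article titles matching the query.
    for title in search_fn(topic, results=max_articles):
        try:
            # auto_suggest=False asks for the exact title returned by search.
            page = page_fn(title, auto_suggest=False)
        except Exception:
            # Some titles fail to load (for example, disambiguation pages
            # or missing pages raise errors); skip them and keep going.
            continue
        corpus.append({"title": page.title, "text": page.content})
    return corpus
```

Calling build_corpus("retrieval-augmented generation") would then return up to five title-and-text entries fetched from Wikipedia. Skipping articles that fail to load keeps the sketch short; a real pipeline might log or retry them instead.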
