From the course: Hands-On AI: Build a RAG Model from Scratch with Open Source

Extract text from different local file formats with Docling

- [Instructor] When building your own RAG model, you'll be working with locally stored data that may be encoded in a variety of formats. In this chapter, we'll go over Docling, a convenient package made by IBM for converting documents in different file formats into dictionaries. This is very useful for many NLP tasks, since dictionaries are easy to inspect and process. The first step is to install the package with a simple pip install docling. Once we have that, we'll create a new file called extract_text.py and import what we need from the package, which is the document converter. Now that we've imported the document converter, let's define a generic function that can parse any supported file format. We'll call it convert_doc. It simply runs the Docling converter on the file and then exports the output in a JSON-like format by exporting to dict. The format…
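
The steps described so far can be sketched roughly as follows. This is a minimal reconstruction of what extract_text.py might look like, not the instructor's exact code. It assumes Docling's `DocumentConverter` class and the `export_to_dict()` method on the converted document. The optional `converter` parameter and the lazy import inside the function are additions made here for illustration, so the function can be exercised with a stand-in converter when Docling is not installed.

```python
# extract_text.py (illustrative sketch; install first with: pip install docling)
from typing import Any, Optional


def convert_doc(source: str, converter: Optional[Any] = None) -> dict:
    """Convert a local file into a Python dictionary using Docling.

    `source` is the path to the file to convert.
    `converter` is a hypothetical hook (not from the course): pass any object
    with a Docling-style `convert()` method; if omitted, a default Docling
    DocumentConverter is created.
    """
    if converter is None:
        # Imported here (an assumption of this sketch) so the module can be
        # loaded even when Docling is not installed.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()

    # Run the conversion, then export the parsed document as a plain dict.
    result = converter.convert(source)
    return result.document.export_to_dict()
```

With Docling installed, calling `convert_doc("report.pdf")` (the file name is a placeholder) would return the parsed document as a dictionary, which can then be written out with `json.dumps` or inspected key by key.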
