From the course: Hands-On AI: Build a RAG Model from Scratch with Open Source
Extract text from different local file formats with Docling
- [Instructor] When building your own RAG model, you'll be working with locally stored data that may be encoded in various formats. In this chapter, we'll go over Docling, a convenient package made by IBM for converting documents in many different formats into dictionaries. This is very useful for many NLP tasks, since dictionaries are easy to inspect and process. Now, the first step is to install the package with a simple pip install docling. Once we have that, let's create a new file, which we'll call extract_text.py, and import what we need from the package, which is the DocumentConverter class. So now we've imported our document converter. Next, let's define a generic function that can parse any supported file format. We'll call it convert_doc. convert_doc simply uses the Docling converter and then exports the output in a JSON-style format by exporting to dict here. The format…
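The transcript cuts off before the code appears on screen, so here is a minimal sketch of what extract_text.py might look like, not the instructor's exact file. It assumes Docling was installed with pip install docling and uses Docling's DocumentConverter and the converted document's export_to_dict method. The optional converter parameter is an addition for illustration (it is not mentioned in the video): it lets you reuse one converter across many files instead of building a new one on every call.

```python
# extract_text.py -- a sketch of the convert_doc function described above.
from typing import Any, Optional


def convert_doc(source: str, converter: Optional[Any] = None) -> dict:
    """Convert a local document (PDF, DOCX, HTML, ...) into a Python dict.

    `source` is the path to the file. `converter` is an optional,
    already-built Docling DocumentConverter; this parameter is an
    assumption added for this sketch so the converter can be reused.
    """
    if converter is None:
        # Imported lazily so the module loads even before Docling is needed.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()

    # Docling parses the file into its own document representation...
    result = converter.convert(source)
    # ...which can then be exported as a JSON-serializable dictionary.
    return result.document.export_to_dict()
```

With a hypothetical local file named report.pdf, calling convert_doc("report.pdf") would return the dictionary, and json.dumps from the standard library can then write it out as JSON.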