From the course: Data Quality: Transactions, Ingestions, and Storage

Initial ingestion into PostgreSQL

- [Instructor] Welcome to our codespace. For this course, most of our work is going to happen in this notebook right here. I already have it scrolled down to the section I want to show you, but remember, there's all this background information that I assume you've already read through.

Now, a quick high-level overview of our data platform architecture. We have our CSVs, which you'll find by going into Data Platform, then Data, where you'll see our clean data, our dirty data, and our schemas. Those get ingested into Postgres, which is our transactional database. From there, the data is replicated into MinIO, which is our object storage; if you're more familiar with AWS, that's the equivalent of S3. Then DuckDB, our analytical database, reads from the object storage, and together this serves as our data lakehouse. (A sketch of that DuckDB-to-MinIO read path appears after this transcript.)

In addition, we have our clients, which you can see on this side right here. These are essentially instructions on how to connect to the various components of the data platform. Then we have our Jupyter Notebook, which we're in right now. It connects through the clients and lets us get the logs, take actions, and do various other things. All of this runs in Docker within the codespace. As a little bonus, if you want to dive deep into the Docker Compose file, you can read it here, but that's outside the scope of this course.

Now, first step, as always with any notebook: we have to import our packages, so we're going to run that real quick. And a quick note: if this is your first time running the Jupyter Notebook in this codespace, you'll want to choose the kernel, and we're going to use Python 3.10 on this system.

Next, let's run our ingestion Python script and do the codespace setup. This already ran in the background, but we're going to run it again. As you can see, we get a whole bunch of logs. I won't bore you by going through them line by line, but I highly encourage you to pause the video and read through them. They provide a lot of information about the steps that are happening, as well as which Python scripts, classes, and functions are being called. (A second sketch after this transcript shows one way to run such a script from a notebook and capture its logs.)

For our next step, we're assuming our data is now in Postgres, so let's write a quick SQL query to create a report. I created a little wrapper function to help you execute queries: you can write SQL within this section right here and pass it into this Postgres execute-query helper. In addition, I've set the return-as-DataFrame flag to true, just to make the result easy to see within the notebook. (A third sketch after this transcript shows what such a wrapper could look like.) So let's run this, and you can quickly see that we have our data here. Again, it's a simple report, because our goal isn't necessarily to do analytics; it's to see how the system works, test its limits, and try to break it.

In the next video, we'll take our dirty data, ingest it into the system, and see how it impacts things.
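To make the lakehouse read path concrete, here is a minimal sketch of DuckDB querying files in MinIO through its S3-compatible API. The endpoint, credentials, bucket name, and object paths below are assumptions for illustration, not the course's actual values; DuckDB's `httpfs` extension and its `s3_*` settings are real.

```python
# Sketch: DuckDB (analytical database) reading replicated data out of
# MinIO (object storage) -- the lakehouse read path from the architecture.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

# Point DuckDB's S3 layer at the local MinIO container instead of AWS.
# Endpoint and credentials here are placeholder defaults, not course values.
con.execute("SET s3_endpoint = 'localhost:9000';")
con.execute("SET s3_access_key_id = 'minioadmin';")
con.execute("SET s3_secret_access_key = 'minioadmin';")
con.execute("SET s3_use_ssl = false;")
con.execute("SET s3_url_style = 'path';")  # MinIO serves path-style URLs

# Query replicated data directly from object storage; bucket/path assumed.
df = con.execute(
    "SELECT * FROM read_parquet('s3://lakehouse/transactions/*.parquet') LIMIT 5"
).df()
print(df)
```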
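For the ingestion step, one way to run a script from inside a notebook cell and keep its logs visible is with `subprocess`. The script name below is a placeholder for whatever the course repo actually calls it.

```python
# Sketch: run the ingestion script from the notebook and print its logs,
# mirroring the "run ingestion" cell. The script path is hypothetical.
import subprocess

result = subprocess.run(
    ["python", "run_ingestion.py"],  # placeholder script name
    capture_output=True,
    text=True,
)
print(result.stdout)  # the step-by-step ingestion logs to read through
print(result.stderr)  # any warnings or errors from the run
```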
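Finally, a sketch of what a wrapper like the course's Postgres execute-query helper could look like. The function name, connection URL, credentials, and the `transactions` table in the usage example are all assumptions; the pattern of "run a query, optionally return a pandas DataFrame" is what the transcript describes.

```python
# Sketch: a Postgres query helper that can return results as a DataFrame,
# which renders nicely in the notebook. Connection details are assumed.
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql+psycopg2://postgres:postgres@localhost:5432/postgres"  # assumed
)

def postgres_execute_query(query: str, return_df: bool = False):
    """Run a SQL query against Postgres; optionally return a DataFrame."""
    with engine.connect() as conn:
        if return_df:
            return pd.read_sql_query(query, conn)
        conn.execute(text(query))
        conn.commit()

# Example report over a hypothetical transactions table; as the last
# expression in a cell, the DataFrame displays inline in the notebook.
report = postgres_execute_query(
    "SELECT status, COUNT(*) AS n FROM transactions GROUP BY status",
    return_df=True,
)
report
```

Returning a DataFrame rather than raw cursor rows is what makes the quick-report workflow convenient here: the notebook renders it as a table with no extra formatting code.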
