
I am a relatively new user of Python. What is the best way of parsing and processing a CSV and loading it into a local Postgres Database (in Python)?

It was recommended to me to use the CSV library to parse and process the CSV. In particular, the task at hand says:

The data might have errors (some rows may not be parseable), the data might be duplicated, the data might be really large.

Is there a reason why I wouldn't be able to just use pandas.read_csv here? Does using the CSV library make parsing and loading into a local Postgres database easier? In particular, if I just use pandas, will I run into problems if rows are unparseable, if the data is really big, or if the data is duplicated? (For the last bit, I know that pandas offers some relatively clean solutions for de-duping; see the sketch below.)
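For reference, here is a minimal sketch of what I mean by pandas de-duplication (the "id" column is just a made-up example):

    import pandas as pd

    df = pd.read_csv("data.csv")
    df = df.drop_duplicates()                # drop exact duplicate rows
    df = df.drop_duplicates(subset=["id"])   # or dedupe on an assumed key column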

I feel like pandas.read_csv and pandas.to_sql can do a lot of work for me here, but I'm not sure if using the CSV library offers other advantages.

Just in terms of speed, this post: https://softwarerecs.stackexchange.com/questions/7463/fastest-python-library-to-read-a-csv-file seems to suggest that pandas.read_csv performs best.

  • "Best" is undefined so this is gonna get closed as opinion-based as it's now. Commented Mar 14, 2016 at 2:20
  • Sorry, I guess it's not the best title, but the question really lies more in the comments. For example, does using the csv library give me more ability to handle very large datasets than pandas? Or does using the csv library allow for easier handling of duplicates? Commented Mar 14, 2016 at 2:23
  • A cornerstone issue in your dilemma may be finding out what your advisor meant when they recommended to "use a CSV library". They may very well have just urged you not to try parsing it by hand. Commented Mar 14, 2016 at 2:44

2 Answers


A quick Google search didn't reveal any serious drawbacks in pandas.read_csv regarding its functionality (parsing correctness, supported types, etc.). Moreover, since you appear to be using pandas to load the data into the DB too, reading directly into a DataFrame is a huge boost in both performance and memory (no redundant copies).

There are memory issues only for very large datasets - but those are not the library's fault. The question "How to read a 6 GB csv file with pandas" has instructions on how to process a large .csv in chunks with pandas.
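For illustration, a rough sketch of that chunked approach, assuming a SQLAlchemy engine and made-up file and table names:

    import pandas as pd
    from sqlalchemy import create_engine

    # Assumed connection string and table name -- adjust for your setup
    engine = create_engine("postgresql://user:password@localhost:5432/mydb")

    # Read the .csv in fixed-size chunks instead of loading it all at once
    for chunk in pd.read_csv("big_file.csv", chunksize=100000):
        chunk = chunk.drop_duplicates()  # optional per-chunk de-duplication
        chunk.to_sql("my_table", engine, if_exists="append", index=False)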

Regarding "The data might have errors", read_csv has a few facilities like converters, error_bad_lines and skip_blank_lines (specific course of action depends on if and how much corruption you're supposed to be able to recover).


I had a school project just last week that required me to load data from a csv and insert it into a Postgres database. So believe me when I tell you this: it's way harder than it has to be unless you use pandas. The issue is sniffing out the data types. Okay, if your database columns are all string types, forget what I said, you're golden. But if you have a csv with an assortment of datatypes, either you get to sniff them yourself or you can use pandas, which does it efficiently and automatically. Plus pandas has a nifty to_sql method which can easily be adapted to work with Postgres via a SQLAlchemy connection, too.
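Something along these lines (connection details and table name are placeholders):

    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholder credentials -- requires a Postgres driver such as psycopg2
    engine = create_engine("postgresql://user:password@localhost:5432/mydb")

    df = pd.read_csv("data.csv")   # pandas sniffs the column dtypes for you
    df.to_sql("my_table", engine, if_exists="append", index=False)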

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html

