
I am a relatively new user of Python. What is the best way of parsing and processing a CSV and loading it into a local Postgres Database (in Python)?

It was recommended to me to use the CSV library to parse and process the CSV. In particular, the task at hand says:

The data might have errors (some rows may not be parseable), the data might be duplicated, the data might be really large.

Is there a reason why I wouldn't be able to just use pandas.read_csv here? Does using the CSV library make parsing and loading into a local Postgres database easier? In particular, if I just use pandas, will I run into problems if rows are unparseable, if the data is really big, or if the data is duplicated? (For the last bit, I know that pandas offers some relatively clean solutions for de-duping; see the sketch below.)
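For reference, here is a minimal sketch of what I mean by pandas de-duplication (the "id" column is just a made-up example):

    import pandas as pd

    df = pd.read_csv("data.csv")
    df = df.drop_duplicates()                # drop exact duplicate rows
    df = df.drop_duplicates(subset=["id"])   # or dedupe on an assumed key column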

I feel like pandas.read_csv and pandas.to_sql can do a lot of work for me here, but I'm not sure if using the CSV library offers other advantages.

Just in terms of speed, this post: https://softwarerecs.stackexchange.com/questions/7463/fastest-python-library-to-read-a-csv-file seems to suggest that pandas.read_csv performs best.

  • "Best" is undefined so this is gonna get closed as opinion-based as it's now. Commented Mar 14, 2016 at 2:20
  • Sorry, I guess it's not the best title, but the question really lies more in the comments. For example, does using the csv library give me more ability to handle very large datasets than pandas? Or does using the csv library allow for easier handling of duplicates? Commented Mar 14, 2016 at 2:23
  • A cornerstone issue in your dilemma may be finding out what your advisor meant when they recommended to "use a CSV library". They may very well have just urged you not to try parsing it by hand. Commented Mar 14, 2016 at 2:44

2 Answers


A quick Google search didn't reveal any serious drawbacks in pandas.read_csv regarding its functionality (parsing correctness, supported types, etc.). Moreover, since you appear to be using pandas to load the data into the DB too, reading directly into a DataFrame is a huge boost in both performance and memory (no redundant copies).

There are memory issues only for very large datasets - but those are not the library's fault. The question "How to read a 6 GB csv file with pandas" has instructions on how to process a large .csv in chunks with pandas.
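For illustration, a rough sketch of that chunked approach, assuming a SQLAlchemy engine and made-up file and table names:

    import pandas as pd
    from sqlalchemy import create_engine

    # Assumed connection string and table name -- adjust for your setup
    engine = create_engine("postgresql://user:password@localhost:5432/mydb")

    # Read the .csv in fixed-size chunks instead of loading it all at once
    for chunk in pd.read_csv("big_file.csv", chunksize=100000):
        chunk = chunk.drop_duplicates()  # optional per-chunk de-duplication
        chunk.to_sql("my_table", engine, if_exists="append", index=False)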

Regarding "The data might have errors", read_csv has a few facilities like converters, error_bad_lines and skip_blank_lines (specific course of action depends on if and how much corruption you're supposed to be able to recover).


I had a school project just last week that required me to load data from a csv and insert it into a Postgres database. So believe me when I tell you this: it's way harder than it has to be unless you use pandas. The issue is sniffing out the data types. Okay, if your database columns are all string types, forget what I said, you're golden. But if you have a csv with an assortment of datatypes, either you get to sniff them yourself or you can use pandas, which does it efficiently and automatically. Plus pandas has a nifty to_sql method which can easily be adapted to work with Postgres via a SQLAlchemy connection, too.
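Something along these lines (connection details and table name are placeholders):

    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholder credentials -- requires a Postgres driver such as psycopg2
    engine = create_engine("postgresql://user:password@localhost:5432/mydb")

    df = pd.read_csv("data.csv")   # pandas sniffs the column dtypes for you
    df.to_sql("my_table", engine, if_exists="append", index=False)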

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html

