1

I'm following the Python for Data Analysis book. It tells me to get the ALL file from http://www.fec.gov/disclosurep/PDownload.do and load it with pandas:

import pandas as pd

fec = pd.read_csv('P00000001-ALL.csv')

But the file has changed since the book was written. The old file (available at https://github.com/pydata/pydata-book/blob/master/ch09/P00000001-ALL.csv) loads just fine:

fec = pd.read_csv('../pydata-book/ch09/P00000001-ALL.csv')

But the new one loads incorrectly: the columns appear to have shifted (the first column's value is dropped):

cmte_id                           P60008059
cand_id                           Bush, Jeb
cand_nm              EASTON, AMY KELLY MRS.
contbr_nm                      KEY BISCAYNE
contbr_city                              FL
contbr_st                         331491716
contbr_zip                        HOMEMAKER
contbr_employer                   HOMEMAKER
contbr_occupation                      2700
contb_receipt_amt                 26-JUN-15
contb_receipt_dt                        NaN
receipt_desc                            NaN
memo_cd                                 NaN
memo_text                             SA17A
form_tp                             1024106
file_num                        SA17.114991
tran_id                               P2016
election_tp                             NaN

The actual row is

C00579458,"P60008059","Bush, Jeb","EASTON, AMY KELLY MRS.","KEY BISCAYNE","FL","331491716","HOMEMAKER","HOMEMAKER",2700,26-JUN-15,"","","","SA17A","1024106","SA17.114991","P2016",

So C00579458 is lost somewhere.

The header looks like this:

cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp

  • Can you add a few rows, including the header, of the CSV causing the issue, and the exact output you are getting for those rows? Commented Oct 1, 2015 at 10:03
  • Hi Anand, you have the header and one row above. Do you need me to add a few more rows? Commented Oct 1, 2015 at 10:11
  • When you check your DataFrame, is the first element being treated as the index? Commented Oct 1, 2015 at 10:22

2 Answers

1

As the other answer already suggests, you have a malformed CSV with a comma at the end of each row. This causes pandas to treat the first column as the index column.

To work around this, pass the index_col=False argument to pandas.read_csv(). Example:

In [23]: import io; import pandas as pd

In [24]: s = io.StringIO("""cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp
   ....: C00579458,"P60008059","Bush, Jeb","EASTON, AMY KELLY MRS.","KEY BISCAYNE","FL","331491716","HOMEMAKER","HOMEMAKER",2700,26-JUN-15,"","","","SA17A","1024106","SA17.114991","P2016",""")

In [25]: df = pd.read_csv(s)  #Issue

In [26]: df
Out[26]:
             cmte_id    cand_id                 cand_nm     contbr_nm  \
C00579458  P60008059  Bush, Jeb  EASTON, AMY KELLY MRS.  KEY BISCAYNE

          contbr_city  contbr_st contbr_zip contbr_employer  \
C00579458          FL  331491716  HOMEMAKER       HOMEMAKER

           contbr_occupation contb_receipt_amt  contb_receipt_dt  \
C00579458               2700         26-JUN-15               NaN

           receipt_desc  memo_cd memo_text  form_tp     file_num tran_id  \
C00579458           NaN      NaN     SA17A  1024106  SA17.114991   P2016

           election_tp
C00579458          NaN

In [28]: _ = s.seek(0)  # rewind the buffer; the first read_csv consumed it

In [29]: df = pd.read_csv(s, index_col=False)  # No issue

In [30]: df
Out[30]:
     cmte_id    cand_id    cand_nm               contbr_nm   contbr_city  \
0  C00579458  P60008059  Bush, Jeb  EASTON, AMY KELLY MRS.  KEY BISCAYNE

  contbr_st  contbr_zip contbr_employer contbr_occupation  contb_receipt_amt  \
0        FL   331491716       HOMEMAKER         HOMEMAKER               2700

  contb_receipt_dt  receipt_desc  memo_cd  memo_text form_tp  file_num  \
0        26-JUN-15           NaN      NaN        NaN   SA17A   1024106

       tran_id election_tp
0  SA17.114991       P2016

This is explained in the documentation:

index_col : int or sequence or False, default None

Column to use as the row labels of the DataFrame. If a sequence is given, a MultiIndex is used. If you have a malformed file with delimiters at the end of each line, you might consider index_col=False to force pandas to not use the first column as the index (row names)

(Emphasis mine)
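
The session above can also be condensed into a small script. This is only a sketch: SAMPLE is a shortened subset of the columns and values from the question (not the full FEC file), and each read gets a fresh StringIO so that neither call sees an exhausted buffer.

```python
import io

import pandas as pd

# Assumption: a shortened subset of the question's columns, with one row
# that ends in a stray comma (4 header fields, 5 fields in the data row).
SAMPLE = (
    "cmte_id,cand_id,cand_nm,election_tp\n"
    'C00579458,"P60008059","Bush, Jeb","P2016",\n'
)

# Default read: the extra field makes pandas use the first column as the index,
# so every value appears one column to the left of where it belongs.
shifted = pd.read_csv(io.StringIO(SAMPLE))

# index_col=False: do not use the first column as the index; the trailing
# empty field is dropped and the values line up with the header.
fixed = pd.read_csv(io.StringIO(SAMPLE), index_col=False)

print(shifted.index[0], shifted["cmte_id"].iloc[0])
print(fixed.loc[0, "cmte_id"], fixed.loc[0, "election_tp"])
```

For the real file, the same keyword would be passed along with the path, for example pd.read_csv('P00000001-ALL.csv', index_col=False).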


1

There is an extra comma at the end of each row in the raw data.

C00458844,"P60006723","Rubio, Marco","HEFFERNAN, MICHAEL","APO","AE","090960009","INFORMATION REQUESTED PER BEST EFFORTS","INFORMATION REQUESTED PER BEST EFFORTS",210,27-JUN-15,"","","","SA17A","1015697","SA17.796904","P2016",

If there were two trailing commas, each row would shift by two columns.
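
One way to check this without pandas is to count fields with the standard csv module. The sketch below uses the header from the question and the row above; the helper names count_fields and extra_fields are made up for illustration.

```python
import csv

# Header from the question and the data row quoted above.
HEADER = (
    "cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,"
    "contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,"
    "receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp"
)
ROW = (
    'C00458844,"P60006723","Rubio, Marco","HEFFERNAN, MICHAEL","APO","AE",'
    '"090960009","INFORMATION REQUESTED PER BEST EFFORTS",'
    '"INFORMATION REQUESTED PER BEST EFFORTS",210,27-JUN-15,"","","",'
    '"SA17A","1015697","SA17.796904","P2016",'
)


def count_fields(line):
    """Parse a single CSV line and return the number of fields in it."""
    return len(next(csv.reader([line])))


def extra_fields(header_line, row_line):
    """Return how many more fields the data row has than the header."""
    return count_fields(row_line) - count_fields(header_line)


print(count_fields(HEADER), count_fields(ROW))
print(extra_fields(HEADER, ROW))
```

A positive result from extra_fields means the rows carry more fields than the header names, which is the situation that leads pandas to shift the columns.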

1 Comment

Aha! So the source file is corrupt! Gayatri, is there a way to fix this with Pandas (tell it about the columns or something)? Thanks.
