I am currently converting the PDFs in a large folder to text and then writing certain keywords to an Excel file. Everything works except that, even though I have multiple PDFs in my folder, each one overwrites the previous one, starting at cell A1.

How do I iterate so that each subsequent dictionary is written to the next rows?

custData = {}

def data_grabbing(pdf):
    row = 0
    col = 0
    string = convert_pdf_to_txt(pdf)
    lines = list(filter(bool,string.split('\n')))
    for i in range(len(lines)):
        if 'Lead:' in lines[i]:
            custData['Name'] = lines[i+2]
        elif 'Date:Date:Date:Date:' in lines[i]:
            custData['Fund Manager'] = lines[i+2]
        elif 'Priority:' in lines[i]:
            custData['Industry'] = lines[i+2]
            custData['Date'] = lines[i+1]
            custData['Deal Size']= lines [i+3]
        elif 'DEAL QUALIFYING MEMORANDUM' in lines[i]:
            custData['Owner'] = lines[i+2]
        elif 'Fund Manager' in lines[i]:
            custData['Investment Type'] = lines [i+2]
    print custData
    for item, descrip in custData.iteritems():
        worksheet.write(row, col,     item)
        worksheet.write(row+1, col, descrip)
        col += 1
    row +=2


for myFile in os.listdir(directory):
    if myFile.endswith(".pdf"):
        data_grabbing(os.path.join(directory, myFile))
workbook.close()
  • move the row +=2 to inside your for item... loop Commented Jun 23, 2017 at 20:26
  • I tried that but now my output on my excel file looks like a zig zag ladder Commented Jun 23, 2017 at 20:29
  • Take the col += 1 out if you want the results to stay in column A Commented Jun 23, 2017 at 20:32
  • Quick and dirty: put row = 0 outside your function in global scope. Inside your function, replace row = 0 with global row. Not really the best way to handle persistent state, but you can get away with it. Commented Jun 23, 2017 at 20:33
  • @Gerrat: OP is writing columnwise for an individual PDF. I believe you missed that with your advice to move row += 2 inside the loop. The OP's real issue is that the next call to data_grabbing doesn't start two rows down (and then two more rows down on another call and so on). Commented Jun 23, 2017 at 20:39

1 Answer

Some of your options are:

  1. Make row a global and initialize it outside the function (@StevenRumbalski's suggestion).
  2. Make data_grabbing a method of a class, and make row an instance variable.
  3. Pass the current row into your function.

I'll show option #3 (but #2 might be preferable):

custData = {}

def data_grabbing(pdf, row):
    col = 0
    string = convert_pdf_to_txt(pdf)
    lines = list(filter(bool,string.split('\n')))
    for i in range(len(lines)):
        if 'Lead:' in lines[i]:
            custData['Name'] = lines[i+2]
        elif 'Date:Date:Date:Date:' in lines[i]:
            custData['Fund Manager'] = lines[i+2]
        elif 'Priority:' in lines[i]:
            custData['Industry'] = lines[i+2]
            custData['Date'] = lines[i+1]
            custData['Deal Size']= lines [i+3]
        elif 'DEAL QUALIFYING MEMORANDUM' in lines[i]:
            custData['Owner'] = lines[i+2]
        elif 'Fund Manager' in lines[i]:
            custData['Investment Type'] = lines [i+2]
    print custData
    for item, descrip in custData.iteritems():
        worksheet.write(row, col,     item)
        worksheet.write(row+1, col, descrip)
        col += 1


cur_row = 0
for myFile in os.listdir(directory):
    if myFile.endswith(".pdf"):
        data_grabbing(os.path.join(directory, myFile), cur_row)
        # Each PDF takes two rows (keys, then values), so move down by two.
        cur_row += 2
workbook.close()
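
For option #2, a rough sketch might look like the one below. It is not taken from your code: RecordWriter and FakeWorksheet are hypothetical names. FakeWorksheet only stands in for your real worksheet so the example runs on its own, and it assumes nothing beyond a write(row, col, value) method like the one you already call. The PDF parsing is left out; each record is a dictionary like custData.

```python
class FakeWorksheet(object):
    """Hypothetical stand-in for the real worksheet: records each write."""

    def __init__(self):
        self.cells = {}

    def write(self, row, col, value):
        self.cells[(row, col)] = value


class RecordWriter(object):
    """Remembers the next free row between calls (hypothetical helper)."""

    def __init__(self, worksheet, rows_per_record=2):
        self.worksheet = worksheet
        self.rows_per_record = rows_per_record
        self.row = 0  # first row of the next record

    def write_record(self, record):
        # Keys go on self.row, values on the row below, one column per field.
        # Sorting keeps the column order the same for every record.
        for col, (item, descrip) in enumerate(sorted(record.items())):
            self.worksheet.write(self.row, col, item)
            self.worksheet.write(self.row + 1, col, descrip)
        # Move down so the next record starts below this one.
        self.row += self.rows_per_record


sheet = FakeWorksheet()
writer = RecordWriter(sheet)
writer.write_record({'Name': 'Alice', 'Owner': 'Bob'})
writer.write_record({'Name': 'Carol', 'Owner': 'Dave'})
```

In your script you would pass the real worksheet to RecordWriter and call write_record once per PDF, so the row counter lives on the object instead of in a global or in the calling loop.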