I extract HTML data from an email and parse it with BeautifulSoup. Next, I want to store the parsed data under the right headers in a CSV file. However, the text of the input data does not show up correctly in the output CSV file.

Parsed input data (fruits_html) for csv file:

Apples                        43        0       0                   0<br/>
Bananas                     2282        0     500                   0<br/>
Grapes                      2534        0     500                   0<br/>
Oranges                      274        0       0                   0<br/>
--------------------------------------------------------------------------------------------------<br/>

Script:

# Parse raw messages to something readable
soup = BeautifulSoup(raw_email, 'html.parser')
fruits_html = soup.find_all('span')
headers = ["Names", "Quantity", "SpareQty", "MinQty", "MaxQty"]

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output, delimiter=',')
    csv_output.writerow(headers)
    for br in soup.find_all('span'):
        csv_output.writerow([fruits_html for br in br.find_all('br')])

Desired output:

I want to store all the quantities under the right header in the csv file. Unfortunately, my current output shows the headers in the first row, and in the second row a large number of <br/> in different cells.

  • That's because you only write the <br/> tags that were found. A sample of the source HTML behind fruits_html would help. Commented Mar 23, 2020 at 15:19
  • Thanks for your response. How can I do that? It is supposed to write fruits_html with csv_output.writerow([fruits_html for br in br.find_all('br')]) right? Commented Mar 23, 2020 at 15:28
  • stackoverflow.com/questions/30694558/… Without a sample input HTML I can't help you further. Commented Mar 23, 2020 at 15:39
  • I am sorry, is the input html not the right sample example? Could you tell me what you need instead? Commented Mar 23, 2020 at 16:19
  • See answer below. I thought the code you provided had already been processed in some way (tags removed). Commented Mar 23, 2020 at 16:43
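
To illustrate the first comment: the rows live in the text nodes between the <br/> tags, not in the tags themselves. The snippet below is only a sketch on a shortened, made-up sample (the `sample` string and variable names are assumptions, not the real mail); it shows what `find_all('br')` returns compared with the text nodes of the span.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical sample; the real mail has more columns and rows.
sample = "<span>Apples   43   0<br/>Bananas   2282   0<br/></span>"
soup = BeautifulSoup(sample, "html.parser")
span = soup.find("span")

# find_all('br') returns only the empty <br/> tags, which carry no data.
br_cells = [str(tag) for tag in span.find_all("br")]

# The data sits in the text nodes (NavigableString is a subclass of str).
text_nodes = [str(node) for node in span.contents if isinstance(node, str)]

print(br_cells)
print(text_nodes)
```

So any loop over `br.find_all('br')` can only ever produce one cell per line break, never the fruit data itself.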

1 Answer

import csv
from bs4 import BeautifulSoup
from bs4.element import NavigableString
data = '''
<html>
<span>

Apples                        43        0       0                   0<br/>
Bananas                     2282        0     500                   0<br/>
Grapes                      2534        0     500                   0<br/>
Oranges                      274        0       0                   0<br/>

</span>
</html>'''

soup = BeautifulSoup(data, 'html.parser')
#print(soup.find_all("span"))
headers = ["Names", "Quantity", "SpareQty", "MinQty", "MaxQty"]

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output, delimiter=',')
    csv_output.writerow(headers)
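    for span in soup.find_all("span"):
        for item in span.contents:
            # Keep only text nodes; the <br/> elements are Tag objects
            if not isinstance(item, NavigableString):
                continue
            fields = item.strip().split()
            if fields:  # skip whitespace-only text nodes (no empty rows)
                csv_output.writerow(fields)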

This produces output.csv:

Names,Quantity,SpareQty,MinQty,MaxQty
Apples,43,0,0,0
Bananas,2282,0,500,0
Grapes,2534,0,500,0
Oranges,274,0,0,0
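
As a possible variation on the loop above, BeautifulSoup's `stripped_strings` generator yields each text node already stripped and skips whitespace-only nodes, so no type check is needed. This is only a sketch: the helper names `rows_from_html` and `to_csv_text` are made up for illustration, and it writes to an in-memory buffer instead of output.csv so the result is easy to inspect.

```python
import csv
import io

from bs4 import BeautifulSoup

html = """<span>
Apples                        43        0       0                   0<br/>
Bananas                     2282        0     500                   0<br/>
</span>"""


def rows_from_html(html_text):
    """Return one list of fields per non-empty text node inside each <span>."""
    soup = BeautifulSoup(html_text, "html.parser")
    rows = []
    for span in soup.find_all("span"):
        # stripped_strings skips nodes that are empty after stripping
        for line in span.stripped_strings:
            rows.append(line.split())
    return rows


def to_csv_text(rows, headers):
    """Render headers and rows as CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(headers)
    writer.writerows(rows)
    return buffer.getvalue()


headers = ["Names", "Quantity", "SpareQty", "MinQty", "MaxQty"]
rows = rows_from_html(html)
print(to_csv_text(rows, headers))
```

Like the original loop, this splits on any whitespace, so names that contain a space would still be broken apart.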
3 Comments

Thank you! This is exactly what I needed, so I accepted your answer. However, I would like to make an exception in how the line is split. The first column can contain names with a space in them. In some cases there are entries such as "green apples" that get split apart when they should be kept together in the first column. How can I make the split happen only where at least 3 spaces occur?
A regex might work, but I'm not that good at those. There are two possible solutions besides regex: 1. Whoever sends you the mail could add a separator such as "|", or wrap the items in another tag. 2. a = item.split("  ")  # two spaces at least, followed by csv_output.writerow([x.strip() for x in a]). But you must be sure that there are always at least two spaces between elements.
Appreciate your help. I am sure that I want to split on at least two spaces. But the HTML file shows spaces, while the output CSV shows &nbsp; where a space should be. So when I try to implement your second solution, it doesn't match the spaces, nor does it find the &nbsp; when I search for it.
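
One way the regex idea from the comments could look, as a sketch only. It assumes the columns in the mail are padded with non-breaking spaces (&nbsp;), which BeautifulSoup decodes to the character "\xa0"; that would explain why splitting on ordinary spaces finds nothing. The function name `split_columns` is made up for illustration.

```python
import re


def split_columns(line):
    """Split a row on runs of two or more spaces, keeping single spaces
    inside a name (e.g. 'Green apples') intact."""
    # Assumption: padding may be non-breaking spaces; turn them into
    # ordinary spaces before splitting.
    normalized = line.replace("\xa0", " ").strip()
    if not normalized:
        return []
    return re.split(r" {2,}", normalized)


plain = "Green apples      43        0       0        0"
padded = "Green apples\xa0\xa0\xa0\xa043\xa0\xa0\xa00\xa0\xa00\xa0\xa0\xa00"
print(split_columns(plain))
print(split_columns(padded))
```

In the answer's loop this would replace `item.strip().split()`. Change `{2,}` to `{3,}` to require at least three spaces, as long as the columns are always separated by that many.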
