Python - Parsing input html data with beautifulsoup and store output data columnwise in csv file

Question

I extract HTML data from an email and parse it with BeautifulSoup. Next, I want to store the parsed data under the right headers in a CSV file. However, the text from the input does not appear in the output CSV file as expected.

Parsed input data (fruits_html) for csv file:

Apples                        43        0       0                   0<br/>
Bananas                     2282        0     500                   0<br/>
Grapes                      2534        0     500                   0<br/>
Oranges                      274        0       0                   0<br/>
--------------------------------------------------------------------------------------------------<br/>

Script:

# Parse raw messages to something readable
soup = BeautifulSoup(raw_email, 'html.parser')
fruits_html = soup.find_all('span')
headers = ["Names", "Quantity", "SpareQty", "MinQty", "MaxQty"]

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output, delimiter=',')
    csv_output.writerow(headers)
    for br in soup.find_all('span'):
        csv_output.writerow([fruits_html for br in br.find_all('br')])

Desired output:

I want to store all the quantities under the right header in the csv file. Unfortunately, my current output shows the headers in the first row, and in the second row a large number of <br/> in different cells.

That's because you write only the found <br/> tags. A source for the fruits.html could help.
– StefanMZ, Commented Mar 23, 2020 at 15:19
Thanks for your response. How can I do that? It is supposed to write fruits_html with csv_output.writerow([fruits_html for br in br.find_all('br')]), right?
– Wizard, Commented Mar 23, 2020 at 15:28
stackoverflow.com/questions/30694558/… Without a sample input HTML I can't help you further.
– StefanMZ, Commented Mar 23, 2020 at 15:39
I am sorry, is the input HTML above not the right kind of sample? Could you tell me what you need instead?
– Wizard, Commented Mar 23, 2020 at 16:19
See the answer below. I thought that the code you provided had been processed in some way (tags removed).
– StefanMZ, Commented Mar 23, 2020 at 16:43
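
To make the first comment concrete, here is a minimal sketch of what the list comprehension in the question evaluates to. It uses plain strings as stand-ins, which is an assumption for illustration: `fruits_html` stands for the result of `soup.find_all('span')` and `br_tags` for the result of `br.find_all('br')`.

```python
# Stand-ins for the BeautifulSoup objects in the question (assumed values,
# used only to show how the comprehension behaves).
fruits_html = "<span>Apples 43 0 0 0<br/>Bananas 2282 0 500 0<br/></span>"
br_tags = ["<br/>", "<br/>"]

# Same shape as: [fruits_html for br in br.find_all('br')]
# The loop variable is never used, so the whole span is repeated per <br/>.
row = [fruits_html for br in br_tags]

print(len(row))          # one cell per <br/> tag
print(row[0] == row[1])  # every cell holds the same, unsplit span
```

Nothing in that expression splits the text between the tags into columns, so the row contains one repeated cell per <br/> rather than one cell per value.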

StefanMZ · Accepted Answer · 2020-03-23 16:41:59Z

import csv

from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Sample input: text rows separated by <br/> tags inside a single <span>
data = '''
<html>
<span>

Apples                        43        0       0                   0<br/>
Bananas                     2282        0     500                   0<br/>
Grapes                      2534        0     500                   0<br/>
Oranges                      274        0       0                   0<br/>

</span>
</html>'''

soup = BeautifulSoup(data, 'html.parser')
headers = ["Names", "Quantity", "SpareQty", "MinQty", "MaxQty"]

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output, delimiter=',')
    csv_output.writerow(headers)
    for span in soup.find_all("span"):
        for item in span.contents:
            # Skip the <br/> tags; only the text nodes between them hold data
            if not isinstance(item, NavigableString):
                continue
            row = item.strip().split()
            # Skip whitespace-only text nodes so no empty rows are written
            if row:
                csv_output.writerow(row)

With output.csv

Names,Quantity,SpareQty,MinQty,MaxQty
Apples,43,0,0,0
Bananas,2282,0,500,0
Grapes,2534,0,500,0
Oranges,274,0,0,0

3 Comments

Wizard Over a year ago

Thank you! This is exactly what I needed, so I accepted your answer. However, I would like to make an exception in the splitting. The first column can contain names with a space in them. In some cases there are entries such as "green apples" that get split apart, while they should be kept together in the first column. How can I make the split happen only where at least 3 spaces occur?

StefanMZ Over a year ago

A regexp might work, but I'm not that good at those. There could be two solutions besides a regex:
1. Whoever sends you the mail could add a separator such as "|", or wrap the items in another tag.
2. Split on two spaces:
    a = item.split("  ")  # two spaces at least
    csv_output.writerow([x.strip() for x in a])
But you must be sure that there are always at least two spaces between elements.
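
The regular-expression route mentioned above could look like the following sketch. It assumes that columns are always separated by at least two whitespace characters and that a name never contains two consecutive spaces. `split_columns` is a hypothetical helper, not part of the original answer.

```python
import re

# Assumption: two or more whitespace characters mark a column boundary.
COLUMN_GAP = re.compile(r"\s{2,}")

def split_columns(line):
    """Split a fixed-width text line into cells, keeping single spaces."""
    line = line.strip()
    # Return no cells at all for a blank line instead of one empty cell.
    return COLUMN_GAP.split(line) if line else []

print(split_columns("Green apples                  43        0       0                   0"))
# ['Green apples', '43', '0', '0', '0']
```

In the answer's loop, `split_columns(item)` would take the place of `item.strip().split()`, with rows written only when the returned list is non-empty.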

Wizard Over a year ago

I appreciate your help. I am sure that I want to split on at least two spaces. However, the HTML file shows spaces, while the output CSV shows ¬† in place of each space. So when I try to implement your second solution, it does not find the spaces, nor does it find the ¬† when I search for that instead.
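
The `¬†` sequence is consistent with a non-breaking space (U+00A0, `&nbsp;` in HTML) whose UTF-8 bytes are being displayed in a legacy encoding such as Mac Roman. That is a likely explanation, not something confirmed in this thread. Under that assumption, `split("  ")` fails because the gaps are not made of ordinary spaces, and searching for `¬†` fails because those two characters never exist in the Python string. A minimal sketch that normalizes the text first (`normalize_spaces` is a hypothetical helper):

```python
NBSP = "\u00a0"  # non-breaking space, what &nbsp; decodes to

def normalize_spaces(text):
    """Replace non-breaking spaces with ordinary spaces."""
    return text.replace(NBSP, " ")

# How the two UTF-8 bytes of U+00A0 look when read as Mac Roman.
print(NBSP.encode("utf-8").decode("mac_roman"))

# Assumed sample line: a name with one ordinary space, NBSP-padded columns.
line = "Green apples" + NBSP * 4 + "43" + NBSP * 3 + "0"
print(line.split("  "))  # no ordinary double space, so a single cell

# After normalizing, the two-space split from the comment above works.
cells = [c.strip() for c in normalize_spaces(line).split("  ") if c.strip()]
print(cells)
```

Empty fragments are filtered out because a gap wider than two spaces produces empty strings when split on exactly two.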
