
I'm trying to scrape this HTML table with BeautifulSoup on Python 3.6 in order to export it to CSV, as in the script below. I adapted an earlier example to my case.

url = 'http://finanzalocale.interno.it/apps/floc.php/certificati/index/codice_ente/2050540010/cod/4/anno/2015/md/0/cod_modello/CCOU/tipo_modello/U/cod_quadro/03' 
html = urlopen(url).read 
soup = BeautifulSoup(html(), "lxml") 
table = soup.select_one("table.tabfin") 
headers = [th.text("iso-8859-1") for th in table.select("tr th")]

but I receive an AttributeError.

AttributeError: 'NoneType' object has no attribute 'select'

Then I would try to export to csv with

with open("abano_spese.csv", "w") as f:
    wr = csv.writer(f)
    wr.writerow(headers)
    wr.writerows([[td.text.encode("iso-8859-1") for td in row.find_all("td")] for row in table.select("tr + tr")])

What's wrong with this? I'm sorry if there's some stupid error; I'm an absolute beginner with Python.

Thank you all

1 Answer
There is a problem with scraping the web site of the Ministero dell'Interno. Let's try this code:

url = 'http://finanzalocale.interno.it/apps/floc.php/certificati/index/codice_ente/2050540010/cod/4/anno/2015/md/0/cod_modello/CCOU/tipo_modello/U/cod_quadro/03'

html = urlopen(url).read()
soup = BeautifulSoup(html)
print(soup.prettify())

You get:

La sua richiesta è stata bloccata dai sistemi posti a protezione del sito web.
Si prega di assicurarsi dell'integrità della postazione utilizzata e riprovare.

(That is: "Your request was blocked by the systems protecting the web site. Please make sure your machine is not compromised and try again.")

Scraping does not seem welcome, or they think there is something nasty in your request. That's why table is None in your code and you get an AttributeError.
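You can reproduce this failure mode offline: select_one returns None whenever nothing matches the selector, so any later attribute access blows up the same way. A minimal sketch (the HTML string here is a made-up stand-in for the block page):

```python
from bs4 import BeautifulSoup

# Stand-in for the block page: no <table class="tabfin"> anywhere in it
soup = BeautifulSoup("<html><body><p>blocked</p></body></html>", "html.parser")
table = soup.select_one("table.tabfin")
print(table)  # None

# Guarding before use turns the AttributeError into a readable message
if table is None:
    print("table.tabfin not found -- the response is probably the block page")
else:
    headers = [th.text for th in table.select("tr th")]
```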

Possible solution:

Step 1: before starting anything else, please check whether the Ministero dell'Interno's data policy allows a script to consume their data; otherwise this is not the way to get what you need.

Step 2: you can try to pass custom headers with your request so that it acts like a browser, e.g.:

import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 GTB7.1 (.NET CLR 3.5.30729)"}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

Now you have your soup. Note that there are 3 different <table class="tabfin"> elements in the page. I guess you need the second one:

table = soup.select("table.tabfin")[1]
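Putting the extraction step together, it might look like the sketch below. This runs offline: the inline HTML is an invented stand-in for the real page (which has three tables of that class), since the actual response depends on the site accepting your request.

```python
import csv
from bs4 import BeautifulSoup

# Invented stand-in for r.text; mimics the "three tables, take the second" layout
html = """
<table class="tabfin"><tr><th>skip</th></tr></table>
<table class="tabfin">
  <tr><th>Voce</th><th>Importo</th></tr>
  <tr><td>Spese correnti</td><td>100</td></tr>
  <tr><td>Spese in conto capitale</td><td>200</td></tr>
</table>
<table class="tabfin"><tr><th>skip</th></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.select("table.tabfin")[1]      # second table, as above

# Header row, then every row that follows another row (i.e. the data rows)
headers = [th.text.strip() for th in table.select("tr th")]
rows = [[td.text.strip() for td in tr.find_all("td")]
        for tr in table.select("tr + tr")]

print(headers)  # ['Voce', 'Importo']
print(rows)     # [['Spese correnti', '100'], ['Spese in conto capitale', '200']]
```

The "tr + tr" selector picks each row that immediately follows another row, which skips the header row without needing an index.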

That way, it works. Excuse me if I sound a bit pedantic, but I'm afraid such an approach might not be compliant with their data license. Please check it before scraping.
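One last detail, about the CSV step in your question: in Python 3, csv.writer expects strings, so calling .encode("iso-8859-1") on each cell writes byte representations like b'...' into the file. Pass the encoding to open() instead and keep the cells as text. A sketch with placeholder data (the column names and values below are invented, not the real table's):

```python
import csv

# Placeholder data for illustration; the real values come from the parsed table
columns = ["Voce", "Importo"]
rows = [["Spese correnti", "100"]]

# In Python 3, put the encoding on open() and hand csv plain strings;
# newline="" is what the csv module documentation recommends for writers
with open("abano_spese.csv", "w", newline="", encoding="iso-8859-1") as f:
    wr = csv.writer(f)
    wr.writerow(columns)
    wr.writerows(rows)
```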


3 Comments

I tried to pass a custom header as you suggested, but I get the same response: r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405'}). Unfortunately, it seems the data are not available elsewhere.
Thank you @floatingpurr! Don't worry, you're right in being scrupulous. It seems there are no constraints on data manipulation, according to legal notes.
You are welcome @Alejo. Feel free to accept the answer if it helped
