
I'm trying to scrape this HTML table with BeautifulSoup on Python 3.6 in order to export it to CSV, as in the script below. I adapted an earlier example to my case.

url = 'http://finanzalocale.interno.it/apps/floc.php/certificati/index/codice_ente/2050540010/cod/4/anno/2015/md/0/cod_modello/CCOU/tipo_modello/U/cod_quadro/03' 
html = urlopen(url).read 
soup = BeautifulSoup(html(), "lxml") 
table = soup.select_one("table.tabfin") 
headers = [th.text("iso-8859-1") for th in table.select("tr th")]

but I receive an AttributeError.

AttributeError: 'NoneType' object has no attribute 'select'

Then I would try to export to csv with

with open("abano_spese.csv", "w") as f:
    wr = csv.writer(f)
    wr.writerow(headers)
    wr.writerows([[td.text.encode("iso-8859-1") for td in row.find_all("td")] for row in table.select("tr + tr")])

What's wrong with this? I'm sorry if there's some stupid error; I'm an absolute beginner with Python.

Thank you all

1 Answer
There is a problem with scraping the web site of the Ministero dell'Interno. Let's try this code:

url = 'http://finanzalocale.interno.it/apps/floc.php/certificati/index/codice_ente/2050540010/cod/4/anno/2015/md/0/cod_modello/CCOU/tipo_modello/U/cod_quadro/03'

html = urlopen(url).read()
soup = BeautifulSoup(html)
print(soup.prettify())

You get:

La sua richiesta è stata bloccata dai sistemi posti a protezione del sito web.
Si prega di assicurarsi dell'integrità della postazione utilizzata e riprovare.

(That is: "Your request was blocked by the systems protecting the web site. Please make sure your machine is not compromised and try again.")

Scraping does not seem welcome, or they think there is something nasty in your request. That's why table is None in your code and you get an AttributeError.
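You can reproduce this failure mode offline: select_one returns None whenever nothing matches the selector, so any later attribute access blows up the same way. A minimal sketch (the HTML string here is a made-up stand-in for the block page):

```python
from bs4 import BeautifulSoup

# Stand-in for the block page: no <table class="tabfin"> anywhere in it
soup = BeautifulSoup("<html><body><p>blocked</p></body></html>", "html.parser")
table = soup.select_one("table.tabfin")
print(table)  # None

# Guarding before use turns the AttributeError into a readable message
if table is None:
    print("table.tabfin not found -- the response is probably the block page")
else:
    headers = [th.text for th in table.select("tr th")]
```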

Possible solution:

Step 1: before starting anything else, please check whether the Ministero dell'Interno's data policy allows a script to consume their data; otherwise this is not the way to get what you need.

Step 2: you can try to pass custom headers with your request so that it acts like a browser, e.g.:

import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 GTB7.1 (.NET CLR 3.5.30729)"}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

Now you have your soup. Note that there are 3 different <table class="tabfin"> elements in the page. I guess you need the second one:

table = soup.select("table.tabfin")[1]
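Putting the extraction step together, it might look like the sketch below. This runs offline: the inline HTML is an invented stand-in for the real page (which has three tables of that class), since the actual response depends on the site accepting your request.

```python
import csv
from bs4 import BeautifulSoup

# Invented stand-in for r.text; mimics the "three tables, take the second" layout
html = """
<table class="tabfin"><tr><th>skip</th></tr></table>
<table class="tabfin">
  <tr><th>Voce</th><th>Importo</th></tr>
  <tr><td>Spese correnti</td><td>100</td></tr>
  <tr><td>Spese in conto capitale</td><td>200</td></tr>
</table>
<table class="tabfin"><tr><th>skip</th></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.select("table.tabfin")[1]      # second table, as above

# Header row, then every row that follows another row (i.e. the data rows)
headers = [th.text.strip() for th in table.select("tr th")]
rows = [[td.text.strip() for td in tr.find_all("td")]
        for tr in table.select("tr + tr")]

print(headers)  # ['Voce', 'Importo']
print(rows)     # [['Spese correnti', '100'], ['Spese in conto capitale', '200']]
```

The "tr + tr" selector picks each row that immediately follows another row, which skips the header row without needing an index.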

That way, it works. Excuse me if I sound a bit pedantic, but I'm afraid such an approach might not be compliant with their data license. Please check it before scraping.
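One last detail, about the CSV step in your question: in Python 3, csv.writer expects strings, so calling .encode("iso-8859-1") on each cell writes byte representations like b'...' into the file. Pass the encoding to open() instead and keep the cells as text. A sketch with placeholder data (the column names and values below are invented, not the real table's):

```python
import csv

# Placeholder data for illustration; the real values come from the parsed table
columns = ["Voce", "Importo"]
rows = [["Spese correnti", "100"]]

# In Python 3, put the encoding on open() and hand csv plain strings;
# newline="" is what the csv module documentation recommends for writers
with open("abano_spese.csv", "w", newline="", encoding="iso-8859-1") as f:
    wr = csv.writer(f)
    wr.writerow(columns)
    wr.writerows(rows)
```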


3 Comments

I tried to pass a custom header as you suggested, but I get the same response: r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405'}). Unfortunately, it seems the data are not available elsewhere.
Thank you @floatingpurr! Don't worry, you're right in being scrupulous. It seems there are no constraints on data manipulation, according to legal notes.
You are welcome @Alejo. Feel free to accept the answer if it helped
