How to extract data from HTML using Python?

Question

How can I extract the data from this HTML page?

from urllib.request import urlopen
url = 'http://book.ponniyinselvan.in/part-1/chapter-1.html'
page = urlopen(url)

This raises HTTPError: HTTP Error 403: Forbidden

I am trying to extract the data into a CSV file.

Many websites, to avoid rampant copyright violations, are only willing to work with genuine browsers. You should look into the User-Agent header to get around that. But this is a book of prose. How are you going to shove that into a CSV? It doesn't seem like a smart approach.
– Tim Roberts, Commented Jun 7, 2021 at 17:02
Hi Tim, I found issues while writing to CSV, so I am currently using a TXT file instead.
– Kum_R, Commented Jun 10, 2021 at 10:27
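
The User-Agent suggestion in the first comment can be sketched with the same urllib.request module the question uses. By default urllib identifies itself as "Python-urllib/3.x", which some servers reject. The helper names and the User-Agent string below are illustrative assumptions, and whether this is enough to avoid the 403 on this particular site has not been verified here.

```python
from urllib.request import Request, urlopen

# Hypothetical browser-like value; any realistic User-Agent string could be used.
BROWSER_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)"


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Build a GET request that sends a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url):
    """Download the page and return its raw bytes."""
    with urlopen(build_request(url), timeout=30) as page:
        return page.read()
```

Calling fetch_html('http://book.ponniyinselvan.in/part-1/chapter-1.html') would then replace the bare urlopen(url) call from the question.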

Andrej Kesely · Accepted Answer · 2021-06-07 17:04:45Z


You can use this example to save the text into a CSV file:

import csv
import requests
from bs4 import BeautifulSoup

url = "http://book.ponniyinselvan.in/part-1/chapter-1.html"

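# newline="" is what the csv module recommends; utf-8 keeps non-ASCII text intact.
with open("data.csv", "w", newline="", encoding="utf-8") as f_out:
    writer = csv.writer(f_out)

    # Download the page and parse it with Python's built-in HTML parser.
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    # Take the text of the first <section> element, joining fragments with newlines.
    text = soup.section.get_text(strip=True, separator="\n")

    # Header row, then a single row holding the whole chapter.
    writer.writerow(["Chapter", "Text"])
    writer.writerow([1, text])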

This saves data.csv, which can be opened in a spreadsheet program such as LibreOffice.


1 Comment

Kum_R Over a year ago

Thank you, Andrej. Is it possible to loop the extraction over all the pages automatically?
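
One possible way to loop over several chapters is sketched below. It assumes the other chapters follow the same URL pattern as the one in the question (part-N/chapter-M.html) and that each page keeps its text in a section element; neither assumption is confirmed by the thread. The function names chapter_url, fetch_chapter_text and write_chapters are made up for this sketch.

```python
import csv

BASE_URL = "http://book.ponniyinselvan.in"  # site from the question


def chapter_url(part, chapter):
    # Assumed URL pattern, generalised from the single URL in the question.
    return f"{BASE_URL}/part-{part}/chapter-{chapter}.html"


def fetch_chapter_text(url):
    # Third-party imports are kept local so the other helpers work without them.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop on 403/404 instead of parsing an error page
    soup = BeautifulSoup(response.content, "html.parser")
    # Same extraction as the answer: text of the first <section> element.
    return soup.section.get_text(strip=True, separator="\n")


def write_chapters(f_out, part, chapters, fetch=fetch_chapter_text):
    """Write a header row, then one CSV row per chapter number in `chapters`."""
    writer = csv.writer(f_out)
    writer.writerow(["Chapter", "Text"])
    for chapter in chapters:
        writer.writerow([chapter, fetch(chapter_url(part, chapter))])


# Example call (the chapter range is a placeholder, not the real chapter count):
# with open("data.csv", "w", newline="", encoding="utf-8") as f_out:
#     write_chapters(f_out, part=1, chapters=range(1, 4))
```

Passing the fetch function as a parameter keeps the loop separate from the download step, so a different downloader (for example one that sets a User-Agent header) can be swapped in without changing the CSV logic.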
