
How do I extract the data from the HTML?

from urllib.request import urlopen
url = 'http://book.ponniyinselvan.in/part-1/chapter-1.html'
page = urlopen(url)

I am getting HTTPError: HTTP Error 403: Forbidden.

I am trying to extract the data into a CSV file.

  • Many websites, to avoid rampant copyright violations, are only willing to work with genuine browsers. You should look into the User-Agent header to get around that. But this is a book of prose. How are you going to shove that into a CSV? It doesn't seem like a smart approach. Commented Jun 7, 2021 at 17:02
  • Hi Tim, I ran into issues while writing to CSV, so I am currently using a TXT file instead. Commented Jun 10, 2021 at 10:27

1 Answer


You can use this example to save the text into a CSV file:

import csv
import requests
from bs4 import BeautifulSoup

url = "http://book.ponniyinselvan.in/part-1/chapter-1.html"

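# newline="" is what the csv module expects; utf-8 keeps non-ASCII text intact
with open("data.csv", "w", newline="", encoding="utf-8") as f_out: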
    writer = csv.writer(f_out)

    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    text = soup.section.get_text(strip=True, separator="\n")

    writer.writerow(["Chapter", "Text"])
    writer.writerow([1, text])

Saves data.csv (screenshot from LibreOffice):

[Screenshot: data.csv opened in LibreOffice]
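If you would rather stay with urllib as in the question, the 403 may come from the default User-Agent that urllib sends, as the comment under the question suggests. Below is a minimal sketch of attaching a browser-like header to the request. The header value and the helper names build_request and fetch_html are illustrative assumptions, and whether this particular site accepts the request is not guaranteed.

```python
from urllib.request import Request, urlopen

# Assumption: the server rejects urllib's default "Python-urllib/x.y" User-Agent.
# The value below is only an illustrative browser-like string.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def build_request(url: str) -> Request:
    """Return a Request object that carries the browser-like User-Agent."""
    return Request(url, headers=HEADERS)


def fetch_html(url: str) -> str:
    """Download the page and decode it as UTF-8 (the encoding is assumed)."""
    with urlopen(build_request(url)) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling fetch_html(url) with the URL from the question would then take the place of the bare urlopen(url) call.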


1 Comment

Thank you, Andrej. Is it possible to loop the extraction over all the pages automatically?
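One way to approach the looping asked about in this comment is to generate the chapter URLs and write one CSV row per chapter. The sketch below assumes the other chapters follow the same URL pattern as the single page shown in the question, which has not been verified. The names chapter_url, write_chapters and fetch_text are hypothetical. The fetching step is passed in as a function, so the requests/BeautifulSoup lines from the answer can be reused unchanged.

```python
import csv
from typing import Callable, Iterable, TextIO

# Assumption: every chapter follows the pattern of the one URL in the question.
BASE_URL = "http://book.ponniyinselvan.in/part-{part}/chapter-{chapter}.html"


def chapter_url(part: int, chapter: int) -> str:
    """Build the URL for one chapter of one part."""
    return BASE_URL.format(part=part, chapter=chapter)


def write_chapters(
    out: TextIO,
    part: int,
    chapters: Iterable[int],
    fetch_text: Callable[[str], str],
) -> int:
    """Write a header row plus one row per chapter; return the chapter count."""
    writer = csv.writer(out)
    writer.writerow(["Chapter", "Text"])
    count = 0
    for chapter in chapters:
        # fetch_text is supplied by the caller and returns the chapter's text
        writer.writerow([chapter, fetch_text(chapter_url(part, chapter))])
        count += 1
    return count
```

To use it, open data.csv as in the answer and pass a fetch_text function that wraps the soup.section.get_text(...) call. The range of chapter numbers has to be supplied by you, since the number of chapters per part is not stated here.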
