I am learning Python 3.8 and trying to extract a specific portion of an HTML source document.
The HTML contains two lines that start with a keyword, followed by a value in double quotes:
keyword: "http://www.somesite.com/sample.txt"
I need to extract only the quoted value that follows the first occurrence of the keyword, so the output should be http://www.somesite.com/sample.txt.
My code so far tries to do this with a regex, but the regex does not match anything:
import re

import bs4
import pyperclip
import requests


def get_value(url):
    res = requests.get(url)
    res.raise_for_status()
    regex = re.compile(r'file: "(http[^\s"]+\.txt)"')
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    return regex.search(soup.text).group().replace('file: "', '').replace('"', '')


# Print the URL from the clipboard
print(pyperclip.paste())
# Call get_value to return the required value between double quotes after file:
my_value = get_value(pyperclip.paste())
# Copy the final value to the clipboard
pyperclip.copy(my_value)
Running this raises the following Python error: AttributeError: 'NoneType' object has no attribute 'group'.
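The error can be reproduced in isolation: `re.search` returns `None` when the pattern finds nothing, and calling `.group()` on `None` raises exactly this AttributeError. Below is a minimal sketch of a guarded lookup, assuming a made-up sample string (the real page source is not shown here) and a hypothetical helper name `first_quoted_value`. It also uses `group(1)` to return only the captured part between the quotes, which makes the `.replace()` calls unnecessary.

```python
import re

# Hypothetical sample text; the real page content is not shown in the question.
SAMPLE = '''
file: "http://www.somesite.com/sample.txt"
file: "http://www.somesite.com/other.txt"
'''

# Raw string so that \s and \. reach the regex engine unchanged.
PATTERN = re.compile(r'file: "(http[^\s"]+\.txt)"')


def first_quoted_value(text):
    """Return the quoted value after the first 'file:' keyword, or None."""
    match = PATTERN.search(text)  # search() returns None when nothing matches
    if match is None:
        return None
    return match.group(1)  # group(1) is only the part captured by the parentheses


print(first_quoted_value(SAMPLE))             # http://www.somesite.com/sample.txt
print(first_quoted_value("no keyword here"))  # None
```

Since the pattern does match this sample, the failure in `get_value` suggests that the text being searched (`soup.text`) may not contain the `file:` line at all.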
I am not very familiar with regex, and I suspect there is a better way to extract this data, since Stack Overflow's own RegEx Wiki advises against using regex on HTML.
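One possibility worth checking is that the `file:` line sits inside a `<script>` element, whose contents may be left out of `soup.text` depending on the BeautifulSoup version. This is an assumption, since the page source is not shown here. The sketch below illustrates the idea with the standard library's `html.parser` only: it lets the parser locate the script blocks and applies a small pattern to their text rather than to the whole HTML. The names `ScriptCollector`, `value_from_scripts`, `SAMPLE_HTML`, and the `setup(...)` call in the sample are all hypothetical.

```python
import re
from html.parser import HTMLParser

# Hypothetical page; the structure of the real document is an assumption.
SAMPLE_HTML = """
<html><body>
<p>Player page</p>
<script>
  setup({
    file: "http://www.somesite.com/sample.txt",
    backup: { file: "http://www.somesite.com/other.txt" }
  });
</script>
</body></html>
"""


class ScriptCollector(HTMLParser):
    """Collect the text content of each <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")  # start a new buffer for this script

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts[-1] += data  # data may arrive in several chunks


def value_from_scripts(html, keyword="file"):
    """Return the first quoted value after `keyword:` in any script, or None."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    pattern = re.compile(re.escape(keyword) + r':\s*"([^"]+)"')
    for script in collector.scripts:
        match = pattern.search(script)
        if match:
            return match.group(1)
    return None


print(value_from_scripts(SAMPLE_HTML))  # http://www.somesite.com/sample.txt
```

If the keyword turns out to be in ordinary page text instead, the same pattern could be applied to `res.text` directly. Either way the regex only has to match one short `keyword: "value"` fragment, not the HTML structure itself.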