How to extract specific value from HTML source with Python?

Question

I am learning Python 3.8 and trying to extract a specific portion of an HTML source document.

The HTML contains two lines that start with a keyword, followed by a value in double quotes:

keyword: "http://www.somesite.com/sample.txt"

I only need the value between the quotes that follows the first instance of the keyword, so the output should be http://www.somesite.com/sample.txt.

In my code so far, I am trying to do so with a regex match, but it is not matching anything:

import re

import bs4
import pyperclip
import requests


def get_value(url):

    res = requests.get(url)
    res.raise_for_status()

    regex = re.compile("file: \"(http[^\s\"]+\.txt)\"")
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    return regex.search(soup.text).group().replace('file: "', '').replace('"', '')

# Print the URL from the clipboard
print(pyperclip.paste())

# Call get_value to return the required value between double quotes after file:
my_value = get_value(pyperclip.paste())

# Copy the final value to the clipboard
pyperclip.copy(my_value)

When I run this, I get the following Python error: AttributeError: 'NoneType' object has no attribute 'group'.

I am not very familiar with regex but also believe there is a better way to extract this data as Stack Overflow's own RegEx Wiki suggests not using regex on HTML.

perl · Accepted Answer · 2021-03-07 16:47:12Z

The error most likely means that regex.search found no match. In that case it returns None, and calling the .group method on None raises AttributeError: 'NoneType' object has no attribute 'group'.
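As a minimal sketch of that failure mode, the snippet below runs the pattern from the question against a made-up string that uses the label keyword instead of file, then checks for None before calling .group. The sample text and variable names are assumptions for illustration only.

```python
import re

# Hypothetical sample text: it contains "keyword:" but not "file:".
text = 'keyword: "http://www.somesite.com/sample.txt"'

pattern = re.compile(r'file: "(http[^\s"]+\.txt)"')
match = pattern.search(text)

# search() returns None when nothing matches, so check before calling .group().
if match is None:
    result = None
    print("no match found")
else:
    result = match.group(1)
    print(result)
```

Checking for None first turns the AttributeError into a condition your code can handle or report.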

Without the specific HTML you're working with, it's hard to say why it doesn't match. Based on the example in the post, one possible cause is that your regex pattern looks for strings starting with file, while the HTML contains lines starting with keyword.

If the HTML does contain a matching line, your approach works. Here's an example:

import bs4
import re

html = """
<html>
    <body>
        <p>file: "http://www.somesite.com/sample1.txt"</p>
        <p>file: "http://www.somesite.com/sample2.txt"</p>
        <p>file: "http://www.somesite.com/non-matching.jpg"</p>
    </body>
</html>
"""

# Raw string so that \s and \. reach the regex engine unchanged
regex = re.compile(r'file: "(http[^\s"]+\.txt)"')
soup = bs4.BeautifulSoup(html, 'html.parser')
regex.search(soup.text).group(1)

Output:

'http://www.somesite.com/sample1.txt'
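To rule out the file/keyword mismatch mentioned above, one option is to pass the label in as a parameter instead of hard-coding it. The sketch below assumes the page text has already been extracted (for example from soup.text). The function name extract_first_value, its parameters, and the sample string are hypothetical and not from the original post.

```python
import re
from typing import Optional


def extract_first_value(text: str, keyword: str) -> Optional[str]:
    """Return the first double-quoted value that follows `keyword:`, or None.

    The function name and parameters are illustrative, not from the post.
    """
    # re.escape keeps any special characters in the keyword literal.
    pattern = re.compile(re.escape(keyword) + r':\s*"([^"]+)"')
    match = pattern.search(text)
    return match.group(1) if match else None


# Hypothetical page text using the label from the question ("keyword").
sample = (
    'keyword: "http://www.somesite.com/sample.txt"\n'
    'keyword: "http://www.somesite.com/other.txt"'
)

print(extract_first_value(sample, "keyword"))  # http://www.somesite.com/sample.txt
print(extract_first_value(sample, "file"))     # None
```

Returning None instead of raising leaves it to the caller to decide what a missing value means.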

edited Mar 7, 2021 at 16:47

answered Mar 7, 2021 at 10:15

4 Comments

Zephyr Over a year ago

Thank you for this answer; it is close, I think. However, I am only interested in getting the first match found (instead of findall()) and the file extension should not matter.

perl Over a year ago

Right, so if you only need the first match, then you can use .search like you did in the original post, and then .group(1) to return the value in parentheses (I've updated my answer to do that)

perl Over a year ago

I didn't quite get your point that the file extension should not matter. If you want to match any extension, then your regex pattern should be different, i.e. without the .txt part: file: \"(http[^\s\"]+)\"

Zephyr Over a year ago

Looks great and does exactly what I need. Thank you.
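The two points from this comment thread (first match only, any file extension) can be sketched together. The snippet uses the extension-free pattern from the comment above on a made-up string. The sample text and variable names are assumptions.

```python
import re

# Hypothetical page text; the first entry is not a .txt file.
text = '''
file: "http://www.somesite.com/picture.jpg"
file: "http://www.somesite.com/sample1.txt"
'''

# Same pattern as in the answer, minus the \.txt suffix.
any_ext = re.compile(r'file: "(http[^\s"]+)"')

# search() stops at the first match in the text.
first = any_ext.search(text)
first_url = first.group(1) if first else None
print(first_url)

# findall() collects the captured group from every match instead.
every_url = any_ext.findall(text)
print(every_url)
```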

Arpit · Accepted Answer · 2021-03-09 06:58:02Z

Based on the part of the string you shared, try the regex below:

regex = re.compile(r'file:.*\"(.*)\"')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
# search() scans the whole text; match() would only succeed if the text began with "file:"
return regex.search(soup.text).group(1)
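One detail worth checking in a snippet like this is whether the pattern is applied with match or search. match only tries the very start of the string, so it returns None whenever anything precedes the file: line, which is likely for the text of a full page. A small sketch with a made-up string (the sample text and variable names are assumptions):

```python
import re

# Hypothetical page text: the "file:" line is not at the very start.
text = 'Some heading\nfile: "http://www.somesite.com/sample.txt"\n'

pattern = re.compile(r'file:.*"(.*)"')

anchored = pattern.match(text)   # only tries position 0
scanned = pattern.search(text)   # tries every position in the text

print(anchored)          # None
print(scanned.group(1))  # http://www.somesite.com/sample.txt
```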

answered Mar 9, 2021 at 6:58
