I am learning Python 3.8 and trying to extract a specific portion of an HTML source document.

The HTML contains two lines that start with a keyword, followed by a value in double quotes:

keyword: "http://www.somesite.com/sample.txt"

I need to extract just the quoted value that follows the first instance of the keyword, so the output should be http://www.somesite.com/sample.txt.

In my code so far, I am trying to do so with a regex match, but it is not matching anything:

import re

import bs4
import pyperclip
import requests


def get_value(url):

    res = requests.get(url)
    res.raise_for_status()

    regex = re.compile("file: \"(http[^\s\"]+\.txt)\"")
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    return regex.search(soup.text).group().replace('file: "', '').replace('"', '')

# Print the URL from the clipboard
print(pyperclip.paste())

# Call get_value to return the required value between double quotes after file:
my_value = get_value(pyperclip.paste())

# Copy the final value to the clipboard
pyperclip.copy(my_value)

When I run this, I get the following Python error: AttributeError: 'NoneType' object has no attribute 'group'.

I am not very familiar with regex, and I suspect there is a better way to extract this data, since Stack Overflow's own RegEx Wiki advises against using regex on HTML.

2 Answers

The error you are getting most likely means that regex.search found no match. In that case it returns None, and calling the .group method on None raises AttributeError: 'NoneType' object has no attribute 'group'.
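One way to avoid the AttributeError is to check the result of search before calling .group. This is a minimal sketch, not a drop-in fix: the helper name first_value is hypothetical, and the pattern is the one from the question, written as a raw string.

```python
import re

# Pattern from the question, written as a raw string so \s and \. reach
# the regex engine unchanged.
PATTERN = re.compile(r'file: "(http[^\s"]+\.txt)"')


def first_value(text):
    """Return the first quoted URL after 'file: ', or None if nothing matches.

    first_value is a hypothetical helper name used only for this sketch.
    """
    match = PATTERN.search(text)
    if match is None:
        # No match: return None instead of calling .group on None.
        return None
    # group(1) is the part captured by the parentheses in the pattern.
    return match.group(1)


print(first_value('file: "http://www.somesite.com/sample.txt"'))
print(first_value('keyword: "http://www.somesite.com/sample.txt"'))
```

The second call prints None because the text starts with keyword rather than file, which is the same situation that triggers the error in the question.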

Without the specific HTML sample you're working with, it's hard to say why it doesn't match. Based on the example in the post, it could be because your regex pattern looks for strings starting with file, while the HTML contains lines starting with keyword.

If there is a match in the HTML, it should work. Here's an example:

import bs4
import re

html = """
<html>
    <body>
        <p>file: "http://www.somesite.com/sample1.txt"</p>
        <p>file: "http://www.somesite.com/sample2.txt"</p>
        <p>file: "http://www.somesite.com/non-matching.jpg"</p>
    </body>
</html>
"""

# Raw string: the backslashes in \s and \. are passed to the regex engine as-is
regex = re.compile(r'file: "(http[^\s"]+\.txt)"')
soup = bs4.BeautifulSoup(html, 'html.parser')
regex.search(soup.text).group(1)

Output:

'http://www.somesite.com/sample1.txt'

4 Comments

Thank you for this answer; it is close, I think. However, I am only interested in getting the first match found (instead of findall()) and the file extension should not matter.
Right, so if you only need the first match, then you can use .search like you did in the original post, and then .group(1) to return the value in parentheses (I've updated my answer to do that)
I didn't quite get your point that the file extension should not matter. If you want to match any extension, then your regex pattern should be different, i.e. without the .txt part: file: \"(http[^\s\"]+)\"
Looks great and does exactly what I need. Thank you.
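The extension-agnostic pattern discussed in the comments can be sketched as follows. The sample text is made up for illustration and stands in for soup.text. Only the first match is returned because search stops at the first hit.

```python
import re

# Same pattern as in the answer, minus the \.txt suffix, so any extension matches.
ANY_EXT = re.compile(r'file: "(http[^\s"]+)"')

# Made-up sample text standing in for soup.text (an assumption, not real page content).
text = '''
file: "http://www.somesite.com/video.mp4"
file: "http://www.somesite.com/sample.txt"
'''

match = ANY_EXT.search(text)
# Guard against the no-match case before calling .group.
first = match.group(1) if match else None
print(first)  # http://www.somesite.com/video.mp4
```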

Based on the part of the string you shared, try using the regex below:

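# These lines replace the body of get_value in the question
regex = re.compile(r'file:.*\"(.*)\"')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
# search scans the whole text; match would only succeed at the very start
return regex.search(soup.text).group(1)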
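With this pattern, the choice between match and search matters: match only succeeds if the text begins with file:, while search scans the whole text. Here is a minimal sketch of the difference, using a made-up two-line text as a stand-in for soup.text:

```python
import re

pattern = re.compile(r'file:.*"(.*)"')

# Made-up text: the keyword is not on the first line (an assumption for illustration).
text = 'intro line\nfile: "http://www.somesite.com/sample.txt"\n'

at_start = pattern.match(text)   # anchored at position 0 of the string
anywhere = pattern.search(text)  # scans forward for the first match

print(at_start)           # None
print(anywhere.group(1))  # http://www.somesite.com/sample.txt

# The greedy .* also skips ahead to the last quoted value on a line:
greedy = pattern.search('file: "a", other: "b"').group(1)
print(greedy)             # b
```

If a line can contain more than one quoted value, a tighter pattern such as file: "([^"]*)" may be a safer choice, since it stops at the first closing quote.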
