0

I need to find a specific word in the HTML of a list of pages. I'm using regex instead of BeautifulSoup, because I find it often easier.

The code is:

links= ['http://www-01.sil.org/iso639-3/documentation.asp?id=alr','http://www-01.sil.org/iso639-3/documentation.asp?id=ami', ...]
for link in links:
    d = requests.get(link)
    p = re.compile(r'<td valign=\"top\">Name:<\/td>\n\t+<td>\n\t+(\w+)\n\t+<\/td>')
    lang = re.search(p, d.text)

This is a snippet of d.text:

<div id="main">
<h1>Documentation for ISO 639 identifier: bnn</h1>
<hr style="margin-bottom: 6pt">

        <table>
            <tr>
                <td valign="top">Identifier:</td>
                <td>bnn</td>
            </tr>

                <tr>
                    <td valign="top">Name:</td>
                    <td>
                    Bunun
                    </td>
                </tr>

            <tr>
                <td valign="top">Status:</td>
                <td>Active</td>
            </tr>

I don't know why, but lang is None. I checked my regex pattern on regex101, and also on Sublime. I printed d.text, and the HTML is normal: if I put d.text in Sublime and search the same pattern, it works.
I don't understand why but the pattern doesn't work in the script, but everywhere else... I'm using Python3. I must be doing something silly, but I don't understand what...

8
  • 1
    Can you give a small example of the data d.text so we can try and reproduce the problem. Make it as small as possible. See How to create a Minimal, Complete, and Verifiable example. Commented Nov 17, 2015 at 12:19
  • 1
    Done, hope it's enough. Commented Nov 17, 2015 at 12:25
  • 1
    Are you certain there are no spaces in the HTML, in addition to the tabs? Is there a reason you're explicitly looking for a specific number of tabs rather than any length of whitespace? Commented Nov 17, 2015 at 12:32
  • 1
    @Aubrey that's great Commented Nov 17, 2015 at 13:44
  • 1
    @MarounMaroun Ah, I understand. The <code> tags stopped markdown interpretting the html tag in the regex, so it became visible. Commented Nov 17, 2015 at 13:46

2 Answers 2

4

One should be very careful with '\n'. File lines may finish with '\n' (Linux style), with '\r' (MacOS style) or both (Windows style). In your case it's easy to correct your expression accepting [\n\r]+ in place of \n and it works fine with your example links:

p = re.compile(r'<td valign="top">Name:</td>[\n\r]+\t+<td>[\n\r]+\t+(\w+)[\n\r]+\t+</td>')

However, I strongly advise against relying on any spacing structure in a document. What if they change it? It wouldn't ever be visible on site! I believe it's better to let spacing be free. Like the following:

p = re.compile(r'<td valign="top">Name:</td>\s*<td>\s*(\w+)\s*</td>')

It's also need to be noted that valign attribute is deprecated in HTML5 (CSS is to be used instead) and thus it may completely disappear from documents in near future.

Sign up to request clarification or add additional context in comments.

2 Comments

You do not have to escape / in Python regex. And " inside a single-quoted string literal either.
Backslashes removed (sorry my C++ :) )
1
p = re.compile(r'<td valign="top">Name:</td>\s+<td>\s+(\w+)\s+</td>')

as @Bryan Oakley mentioned, there are whitespaces between <td></td>, try \s+ to match one or more whitespaces. \s=[ \f\n\r\t\v]

besides, by using raw string notation, there is no need to use backslash to indicate special forms

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.