Regex pattern not working in Python script

Question

I need to find a specific word in the HTML of a list of pages. I'm using regex instead of BeautifulSoup, because I find it often easier.

The code is:

import re
import requests

links = ['http://www-01.sil.org/iso639-3/documentation.asp?id=alr', 'http://www-01.sil.org/iso639-3/documentation.asp?id=ami', ...]
for link in links:
    d = requests.get(link)
    p = re.compile(r'<td valign=\"top\">Name:<\/td>\n\t+<td>\n\t+(\w+)\n\t+<\/td>')
    lang = re.search(p, d.text)

This is a snippet of d.text:

<div id="main">
<h1>Documentation for ISO 639 identifier: bnn</h1>
<hr style="margin-bottom: 6pt">

        <table>
            <tr>
                <td valign="top">Identifier:</td>
                <td>bnn</td>
            </tr>

                <tr>
                    <td valign="top">Name:</td>
                    <td>
                    Bunun
                    </td>
                </tr>

            <tr>
                <td valign="top">Status:</td>
                <td>Active</td>
            </tr>

I don't know why, but lang is None. I checked my regex pattern on regex101 and in Sublime. I printed d.text, and the HTML looks normal: if I paste d.text into Sublime and search with the same pattern, it matches.
I don't understand why the pattern fails in the script but works everywhere else. I'm using Python 3. I must be doing something silly, but I can't see what.

Can you give a small example of the data in d.text so we can try to reproduce the problem? Make it as small as possible. See How to create a Minimal, Complete, and Verifiable example.
– Open AI - Opting Out, Commented Nov 17, 2015 at 12:19
Are you certain there are no spaces in the HTML, in addition to the tabs? Is there a reason you're explicitly looking for a specific number of tabs rather than any length of whitespace?
– Bryan Oakley, Commented Nov 17, 2015 at 12:32
@MarounMaroun Ah, I understand. The <code> tags stopped markdown interpreting the html tag in the regex, so it became visible.
– Open AI - Opting Out, Commented Nov 17, 2015 at 13:46

AndreyS Scherbakov · Accepted Answer · 2015-11-17 12:53:44Z

4

One should be very careful with '\n'. Lines may end with '\n' (Linux and modern macOS style), with '\r' (classic Mac OS style) or with both, '\r\n' (Windows style). In your case it is easy to correct the expression by accepting [\n\r]+ in place of \n, and it then works fine with your example links:

p = re.compile(r'<td valign="top">Name:</td>[\n\r]+\t+<td>[\n\r]+\t+(\w+)[\n\r]+\t+</td>')
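To see the line-ending problem directly, here is a minimal sketch. The sample string is an assumption: a hand-built stand-in for d.text with Windows-style '\r\n' line endings and tab indentation, not the bytes the server actually sent. repr() makes the otherwise invisible '\r' visible, which print() does not.

```python
import re

# Hypothetical stand-in for d.text: CRLF line endings, tab indentation.
sample = '<td valign="top">Name:</td>\r\n\t\t<td>\r\n\t\tBunun\r\n\t\t</td>'

# repr() shows control characters that print() hides.
print(repr(sample))

strict = re.compile(r'<td valign="top">Name:</td>\n\t+<td>\n\t+(\w+)\n\t+</td>')
tolerant = re.compile(
    r'<td valign="top">Name:</td>[\n\r]+\t+<td>[\n\r]+\t+(\w+)[\n\r]+\t+</td>'
)

# The strict pattern fails because '\r' sits between '</td>' and '\n'.
strict_match = strict.search(sample)
# The tolerant pattern succeeds because [\n\r]+ absorbs the whole '\r\n'.
tolerant_match = tolerant.search(sample)

print(strict_match)
print(tolerant_match.group(1))
```

Running repr(d.text) on the real response is a quick way to check which line endings and indentation characters the page really uses.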

However, I strongly advise against relying on any particular spacing structure in a document. What if they change it? Such a change would not be visible on the rendered site at all! I believe it's better to let the spacing be free, like the following:

p = re.compile(r'<td valign="top">Name:</td>\s*<td>\s*(\w+)\s*</td>')
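As a rough illustration of the free-spacing pattern, the sketch below wraps it in a helper and compiles it once rather than on every loop iteration. The name extract_name and the sample strings are hypothetical, made up for this example; the samples only imitate the markup from the question with different whitespace.

```python
import re

# Compiled once, outside any loop; \s* accepts spaces, tabs, '\r' and '\n'.
NAME_PATTERN = re.compile(r'<td valign="top">Name:</td>\s*<td>\s*(\w+)\s*</td>')


def extract_name(html):
    """Return the single word in the Name cell, or None if nothing matches."""
    match = NAME_PATTERN.search(html)
    return match.group(1) if match else None


# Hypothetical samples: same markup, different whitespace.
spaces_crlf = (
    '<td valign="top">Name:</td>\r\n        <td>\r\n        Bunun\r\n        </td>'
)
compact = '<td valign="top">Name:</td><td>Bunun</td>'

print(extract_name(spaces_crlf))
print(extract_name(compact))
print(extract_name('<td>no name here</td>'))
```

Note that (\w+) captures a single word only, so a name made of several words would not match; a wider capture group would be needed for those pages. Returning None instead of calling .group(1) unconditionally also avoids an AttributeError when a page lacks the cell.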

It should also be noted that the valign attribute is obsolete in HTML5 (CSS is to be used instead), so it may disappear from these documents in the future.
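One possible way to stop depending on valign, sketched under the assumption that the "Name:" label cell is always followed directly by the value cell: allow any attributes (or none) on both td tags with [^>]*. The three sample strings are invented for illustration.

```python
import re

# [^>]* tolerates any attributes, or none, on the label cell and the value cell.
pattern = re.compile(r'<td[^>]*>\s*Name:\s*</td>\s*<td[^>]*>\s*(\w+)\s*</td>')

# Hypothetical variants of the same table row.
with_valign = '<td valign="top">Name:</td>\n<td>\nBunun\n</td>'
with_css = '<td style="vertical-align: top">Name:</td>\n<td>\nBunun\n</td>'
no_attrs = '<td>Name:</td><td>Bunun</td>'

names = [pattern.search(html).group(1) for html in (with_valign, with_css, no_attrs)]
print(names)
```

This is still a regex over HTML, so it remains fragile if the table layout itself changes; it only removes the dependency on one attribute.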

edited Nov 17, 2015 at 12:53

answered Nov 17, 2015 at 12:38


2 Comments

Wiktor Stribiżew Over a year ago

You do not have to escape / in a Python regex, nor " inside a single-quoted string literal.

AndreyS Scherbakov Over a year ago

Backslashes removed (sorry my C++ :) )

Hooting · Accepted Answer · 2015-11-17 12:42:08Z

1

p = re.compile(r'<td valign="top">Name:</td>\s+<td>\s+(\w+)\s+</td>')

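As @Bryan Oakley mentioned, there is whitespace between the <td> and </td> tags. Try \s+ to match one or more whitespace characters; for ASCII text, \s is equivalent to [ \f\n\r\t\v].

Besides, with raw string notation Python leaves backslashes alone, so regex escapes such as \s can be written with a single backslash, and characters like / and " need no escaping at all.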
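A small sketch of what raw string notation does and does not change, using only the standard re module. The test strings are made up for illustration.

```python
import re

# In a normal literal Python turns '\n' into one newline character; in a raw
# literal the backslash and the 'n' both reach the regex engine, which then
# interprets them as a newline. Either way the pattern matches the same text.
assert len('\n') == 1 and len(r'\n') == 2
text = 'a\nb'
normal = re.search('a\nb', text)
raw = re.search(r'a\nb', text)

# '/' is not special in Python regexes, so escaping it is optional.
escaped = re.search(r'<\/td>', '</td>')
plain = re.search(r'</td>', '</td>')

# Where raw strings matter: '\b' is a backspace character in a normal literal,
# but a word boundary when it reaches the regex engine through a raw literal.
word_raw = re.search(r'\bName\b', 'Name:')
word_normal = re.search('\bName\b', 'Name:')

print(bool(normal), bool(raw), bool(escaped), bool(plain))
print(bool(word_raw), bool(word_normal))
```

So the extra backslashes in the original pattern are harmless, and the raw prefix is mainly a safeguard for escapes such as \b that mean something different to Python itself.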

edited Nov 17, 2015 at 12:42

answered Nov 17, 2015 at 12:36
