I need to find a specific word in the HTML of a list of pages. I'm using regex instead of BeautifulSoup because I often find it easier.
The code is:
import re
import requests

links = ['http://www-01.sil.org/iso639-3/documentation.asp?id=alr', 'http://www-01.sil.org/iso639-3/documentation.asp?id=ami', ...]
for link in links:
    d = requests.get(link)
    p = re.compile(r'<td valign=\"top\">Name:<\/td>\n\t+<td>\n\t+(\w+)\n\t+<\/td>')
    lang = re.search(p, d.text)
This is a snippet of d.text:
<div id="main">
<h1>Documentation for ISO 639 identifier: bnn</h1>
<hr style="margin-bottom: 6pt">
<table>
<tr>
<td valign="top">Identifier:</td>
<td>bnn</td>
</tr>
<tr>
<td valign="top">Name:</td>
<td>
Bunun
</td>
</tr>
<tr>
<td valign="top">Status:</td>
<td>Active</td>
</tr>
I don't know why, but lang is None. I checked the regex pattern on regex101 and in Sublime Text. I printed d.text, and the HTML looks normal: if I paste d.text into Sublime Text and search for the same pattern, it matches.
I don't understand why the pattern fails in the script but works everywhere else. I'm using Python 3. I must be doing something silly, but I can't see what.
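One way to compare what the script sees with what an editor shows is to print the text with repr(), which makes invisible characters visible. The sketch below uses a hypothetical sample string, not the real response: it assumes the page uses "\r\n" line endings, which is only a guess about what the server sends. Under that assumption the strict pattern fails, while a variant that accepts any whitespace between the tags (the names strict and tolerant are made up for this illustration) still finds the name.

```python
import re

# Hypothetical sample: the same fragment as in the question, but with
# Windows-style "\r\n" line endings. This is an assumption, not the
# confirmed content of d.text.
sample = (
    '<td valign="top">Name:</td>\r\n'
    '\t\t<td>\r\n'
    '\t\t\tBunun\r\n'
    '\t\t</td>\r\n'
)

# The pattern from the question: only "\n" followed by tabs is allowed
# between the tags.
strict = re.compile(r'<td valign="top">Name:</td>\n\t+<td>\n\t+(\w+)\n\t+</td>')

# A whitespace-tolerant variant: "\s*" matches any mix of "\r", "\n",
# tabs, and spaces.
tolerant = re.compile(r'<td valign="top">Name:</td>\s*<td>\s*(\w+)\s*</td>')

# repr() shows characters such as "\r" that print() hides.
print(repr(sample[:40]))

strict_match = strict.search(sample)
tolerant_match = tolerant.search(sample)

print(strict_match)             # None for this sample
print(tolerant_match.group(1))  # Bunun
```

Running print(repr(d.text[:500])) in the real script would show whether the actual response contains "\r" or other unexpected whitespace. Note that (\w+) captures a single word only, so a name containing spaces or hyphens would need a different capture group.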
... d.text so we can try and reproduce the problem. Make it as small as possible. See How to create a Minimal, Complete, and Verifiable example.
<code> tags stopped Markdown interpreting the HTML tag in the regex, so it became visible.