I have the HTML from an NPR page, and I want to use a regex to extract only certain URLs (the links to specific stories nested within the page). The actual links appear in the text (retrieved manually) as:
<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363309020/asked-to-stop-praying-alaska-school-won-t-host-state-tournament">
<a href="http://www.npr.org/2014/11/11/362817642/a-marines-parents-story-their-memories-that-you-should-hear">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363288744/comets-rugged-landscape-makes-landing-a-roll-of-the-dice">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363293514/for-dyslexics-a-font-and-a-dictionary-that-are-meant-to-help">
Obviously, I cannot continue to retrieve these manually if I want to do this on a consistent basis. So far, I have this code:
import nltk
import re
f = open("/Users/shannonmcgregor/Desktop/npr.txt")
npr_lines = f.readlines()
f.close()
I have this code to grab everything between (
for line in npr_lines:
    # findall returns the captured href values; print them so they are visible
    print(re.findall('<a href="?\'?([^"\'>]*)', line))
But that grabs all URLs. I tried adding something like:
(parallels|thetwo-way|a-marines)
but that returns nothing. So what am I doing wrong? How do I combine the larger URL stripper with these specific words that target the given URLs?
Please and thank you :)
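
One possible way to combine the two patterns, as a minimal sketch: put the keyword alternation inside the captured part of the href, in a non-capturing group `(?:...)`, with `[^"'>]*` on both sides so the keyword can appear anywhere in the URL. With a second capturing group, `re.findall` returns tuples rather than plain URLs, which is one plausible source of confusion. The names `KEYWORDS`, `STORY_LINK`, and `extract_story_urls` are made up for illustration, and the `sections/news` link in the sample is a hypothetical non-matching URL, not one taken from the page.

```python
import re

# Keywords from the question that identify the stories of interest.
KEYWORDS = ("parallels", "thetwo-way", "a-marines")

# Capture the href value only when it contains one of the keywords.
# (?:...) is non-capturing, so findall returns the whole URL as a string.
STORY_LINK = re.compile(
    r'<a href="([^"\'>]*(?:'
    + "|".join(re.escape(k) for k in KEYWORDS)
    + r')[^"\'>]*)"'
)

def extract_story_urls(html):
    """Return the href values that contain any of KEYWORDS."""
    return STORY_LINK.findall(html)

# Two links from the question plus one hypothetical link that should be skipped.
sample = """
<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war">
<a href="http://www.npr.org/sections/news/">
<a href="http://www.npr.org/2014/11/11/362817642/a-marines-parents-story-their-memories-that-you-should-hear">
"""

for url in extract_story_urls(sample):
    print(url)
```

Running the pattern over the whole file contents (for example `f.read()`) instead of line by line would give one flat list of URLs. For anything beyond a quick script, an HTML parser is generally more robust than a regex, since attribute order and quoting can vary.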

/Users/shannonmcgregor/Desktop/npr.txt file along with the expected output?