I am trying to parse a particular set of links from an HTML file, but since I am using HTMLParser I cannot access the HTML as a hierarchy tree, and hence I cannot extract the information.
My HTML is as follows:
<p class="mediatitle">
<a class="bullet medialink" href="link/to/a/file">Some Content
</a>
</p>
So what I need is to extract every 'href' value whose tag also has the attribute class="bullet medialink". In other words, I want only those hrefs that appear in a tag with the class 'bullet medialink'.
What I have tried so far is:
from HTMLParser import HTMLParser
import urllib

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (key, value) in attrs:
                if value == 'bullet medialink':
                    print "attr:", key

p = MyHTMLParser()
f = urllib.urlopen("sample.html")
html = f.read()
p.feed(html)
p.close()
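
For reference, here is a minimal sketch of the behaviour I am after. It is not a definitive solution. It assumes Python 3 (where the module is html.parser rather than HTMLParser), and the class name MediaLinkParser, the links list and the inline sample string are only illustrative. Since handle_starttag receives all attributes of a tag at once as (name, value) pairs, the idea is to turn them into a dict and check 'class' and 'href' together, without needing a tree:

```python
from html.parser import HTMLParser  # Python 3 module name (assumption)


class MediaLinkParser(HTMLParser):
    """Collect href values from <a> tags whose class is 'bullet medialink'."""

    def __init__(self):
        super().__init__()
        self.links = []  # hypothetical attribute holding the matches

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs for this one tag
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if attr_map.get("class") == "bullet medialink" and href:
            self.links.append(href)


# Inline sample instead of reading sample.html, so the sketch is self-contained
sample = """<p class="mediatitle">
<a class="bullet medialink" href="link/to/a/file">Some Content
</a>
</p>
<a class="other" href="ignored">x</a>"""

parser = MediaLinkParser()
parser.feed(sample)
parser.close()
print(parser.links)
```

The comparison is an exact string match on the class attribute, so a tag written as class="medialink bullet" would not be picked up by this sketch.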