I have a string that can contain links:

<a href="http://site1.com/">Hello</a> <a href="http://site2.com/">Hello2</a>
<a href="http://site3.com">Hello3</a> ...

How can I extract the text (not the URL) of every <a> tag, i.e. "Hello", "Hello2", "Hello3", ...? I'd like to end up with a list containing all of the texts.

  • You may want to look into the BeautifulSoup library. (Commented Nov 16, 2012 at 22:48)
  • Never use regex for parsing HTML! Never! (Commented Nov 17, 2012 at 9:01)
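As a minimal sketch of the BeautifulSoup route suggested in the comment above: this assumes the `bs4` package is installed and uses Python's built-in `html.parser` backend. The variable names are illustrative and not from the original post.

```python
from bs4 import BeautifulSoup

content = '''
<a href="http://site1.com/">Hello</a> <a href="http://site2.com/">Hello2</a>
<a href="http://site3.com">Hello3</a>
<a href="/">go <b>home</b>, dude!</a>
'''

# Parse with the standard-library backend, so no extra parser is needed.
soup = BeautifulSoup(content, "html.parser")

# get_text() concatenates all text inside each <a>, including nested tags.
texts = [a.get_text() for a in soup.find_all("a")]
print(texts)
```

With this input, `texts` should hold the four link texts, with the nested `<b>` markup stripped from the last one.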

1 Answer

Using lxml:

import lxml.html as LH

content = '''
<a href="http://site1.com/">Hello</a> <a href="http://site2.com/">Hello2</a>
<a href="http://site3.com">Hello3</a>
<a href="/">go <b>home</b>, dude!</a>
'''

doc = LH.fromstring(content)
texts = [elt.text_content() for elt in doc.xpath('//a')]
print(texts)

yields

['Hello', 'Hello2', 'Hello3', 'go home, dude!']

5 Comments

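Please don't use /text(); it's a code smell. It selects only the text nodes that are direct children of the element, so a link like <a href="/">go <b>home</b>, dude!</a> comes back as the separate pieces "go " and ", dude!", and "home" is lost.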
I'd do //a/string(). Is your version equivalent?
I just tried that; for some reason lxml raises lxml.etree.XPathEvalError: Invalid expression.
@larsmans: But to answer your question, yes, text_content() will return all the text between <a> and </a> with no markup.
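Using string() as a path step, as in //a/string(), is XPath 2.0 syntax, and lxml only supports XPath 1.0. In XPath 1.0, string() has to wrap the expression, and string(//a) returns the text of the first match only. +1 for a clean solution.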
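A small sketch of the difference discussed in these comments, reusing the lxml setup from the answer. It compares `//a/text()`, `text_content()`, and the XPath 1.0 form `string(.)` evaluated once per element. The variable names `direct`, `full`, and `via_string` are illustrative only.

```python
import lxml.html as LH

content = '''
<a href="http://site1.com/">Hello</a>
<a href="/">go <b>home</b>, dude!</a>
'''

doc = LH.fromstring(content)
links = doc.xpath('//a')

# //a/text() selects only text nodes that are direct children of <a>,
# so the nested <b>home</b> is skipped and the second link is split.
direct = doc.xpath('//a/text()')

# text_content() joins all descendant text of each element.
full = [a.text_content() for a in links]

# XPath 1.0 alternative: evaluate string(.) relative to each element.
via_string = [a.xpath('string(.)') for a in links]

print(direct)      # ['Hello', 'go ', ', dude!']
print(full)        # ['Hello', 'go home, dude!']
print(via_string)  # ['Hello', 'go home, dude!']
```

For whole link texts, `text_content()` and the per-element `string(.)` agree here; `//a/text()` does not.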
