I have created a script that gets the href link from inside an anchor tag, along with the text.

Here is my Python code:

import re
import cssselect
from lxml import html

mainTree = html.fromstring('<a href="https://www.example.com/laptops/" title="Laptops"><div class="subCategoryItem">Laptops <span class="cnv-items">(229)</span></div></a>')

for links in mainTree.cssselect('a'):
    urls = [links.get('href')]
    texts = re.findall(re.compile(u'[A-z- &]+'), links.text_content())

    for text in texts:
        print (text)

    for url in urls:
        print (url)

Output:

Laptops 
https://www.example.com/laptops/

Instead of using two for loops can I do this?

for text, url in texts, urls:
    print (text)
    print (url)
  • What happened when you tried it out? Commented Oct 14, 2015 at 17:16
  • @NathanielFord I get this: "ValueError: need more than 1 value to unpack". Commented Oct 14, 2015 at 17:18
  • I think this is a bit of an XY problem. The question you're asking about combined loops is indeed answered by zip as described by @kmad1729. However, I don't know why you're looping at all. You'll only ever have one URL per <a> tag, and so I suspect zip won't do what you want if you get multiple hits on your re.findall search (all but the first result will be ignored). Perhaps you just want to filter out inappropriate characters from the string returned by the text_content call? Commented Oct 14, 2015 at 17:23

2 Answers


You can use the zip function:

for text, url in zip(texts, urls):
    print (text)
    print (url)

zip pairs up the elements of two or more iterables, position by position. The iterables need not be the same length: zip stops as soon as the shortest one is exhausted.

>>> l1 = range(5)
>>> l2 = range(6)
>>> list(zip(l1,l2)) #produces
[(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
>>>
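
To make the length behaviour concrete, here is a small sketch for Python 3. The sample lists are made up for illustration and are not taken from the question's page. It shows zip dropping the unmatched items, and itertools.zip_longest from the standard library padding them instead.

```python
from itertools import zip_longest

texts = ["Laptops", "Tablets", "Phones"]      # hypothetical sample data
urls = ["https://www.example.com/laptops/"]   # deliberately shorter

# zip stops as soon as the shortest iterable is exhausted,
# so "Tablets" and "Phones" are silently dropped.
pairs = list(zip(texts, urls))

# zip_longest keeps going and fills the missing slots with fillvalue.
padded = list(zip_longest(texts, urls, fillvalue=None))

print(pairs)
print(padded)
```

If silently losing items would be a problem, comparing len(texts) and len(urls) before zipping is a simple guard.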

1 Comment

zip is a great function! Note that here it is overkill and doesn't actually reduce the computational complexity.

Let's examine what you're trying to do here:

for text, url in texts, urls:
    print (text)
    print (url)

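The text, url part right after the for means 'unpack each item produced by the expression after in into two names'. Here that expression is texts, urls, which is a tuple of the two lists, so the first item the loop sees is the whole texts list. That list has only one element, so unpacking it into two names raises a ValueError.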
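
A minimal sketch of where the error comes from, using single-element lists like the ones in the question. The exact wording of the error message differs between Python versions, so the code only records the exception rather than relying on its text.

```python
texts = ["Laptops "]
urls = ["https://www.example.com/laptops/"]

error = None
try:
    # Equivalent to: for text, url in (texts, urls)
    # The first iteration attempts: text, url = ["Laptops "]
    for text, url in texts, urls:
        print(text, url)
except ValueError as exc:
    error = exc
    print("ValueError:", exc)
```

If each list happened to contain exactly two elements, the loop would run without error but would still not pair texts with urls: it would unpack the two texts first, then the two urls.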

Both of the lists you're iterating through hold a single value, and simply putting a , between them won't pair their elements up. As suggested in another answer, you can zip them into a single sequence of pairs:

for text, url in zip(texts, urls):
    print (text)
    print (url)

zip produces tuples, each made up of one element from each of the provided iterables (as a list in Python 2, as a lazy iterator in Python 3). This works, but it doesn't save any iteration: the elements are still walked once by zip to build the pairs and once by your loop to unpack them. Your deeper problem is how you're getting your values.

You seem to be stepping through each link you have, and then for each link you are getting the url and the text and putting it into a list. You're then printing everything in those lists. Do those lists ever have a length greater than one?

The get function will only return a single value:

urls = [links.get('href')]  # Gets one value and puts it in a list of length one

Putting it into a list there is not meaningful. As for your regex search, re.findall() could in theory return multiple values, but if you use re.search() you only get the first match (as a match object, or None if nothing matches) and don't need to worry about additional values. This is what you're currently doing:

for each link in the document
  put the url into a list
  put all the matching text into a list
  for each url in the list print it
  for each text in the list print it

When really you can simplify to:

for each link in the document
  print the url
  find the first text and print it

Then you don't have to worry about the additional for loops and zipping. This refactors to:

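for links in mainTree.cssselect('a'):
    print(links.get('href'))
    match = re.search(u'[A-z- &]+', links.text_content())
    if match:  # re.search returns None when nothing matches
        print(match.group())  # .group() extracts the matched text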
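
To compare re.findall and re.search in isolation, here is a standard-library-only sketch. The string content is a stand-in for what links.text_content() returns for the question's anchor tag, and the pattern [A-Za-z &-]+ is a tightened variant I am assuming is what was intended: the range A-z in the original also matches the characters [, \, ], ^, _ and the backtick, which sit between Z and a in ASCII.

```python
import re

# Stand-in for links.text_content() on the question's <a> tag
content = "Laptops (229)"

# Assumed intent: letters, spaces, ampersands and hyphens only
pattern = re.compile(r"[A-Za-z &-]+")

all_matches = pattern.findall(content)  # list of every non-overlapping match
first = pattern.search(content)         # match object, or None if no match

# Guard against None before calling .group(); strip the trailing space
first_text = first.group().strip() if first else ""

print(all_matches)
print(first_text)
```

Calling .strip() also removes the trailing space that appears after "Laptops" in the question's output.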
