Webscraping using selenium, beautifulsoup and python

Question

Currently scraping a real estate website that is using javascript. My process starts by scraping a list containing many different href links for single listings, appending these links to another list and then pressing the next button. I do this til the the next button is no longer clickable.

my problem is that after collecting all the listings (~13000 links) the scraper doesn't move onto the second part where it opens the links and gets the info I need. Selenium doesn't even open to move onto the first element of the list of links.

heres my code:

wait = WebDriverWait(driver, 10)
while True:
    try:
        element = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, 'next')))
        html = driver.page_source
        soup = bs.BeautifulSoup(html,'html.parser')
        table = soup.find(id = 'search_main_div')
        classtitle =  table.find_all('p', class_= 'title')
        for aaa in classtitle:
            hrefsyo =  aaa.find('a', href = True)
            linkstoclick = hrefsyo.get('href')
            houselinklist.append(linkstoclick)
        element.click()
    except:
        pass

After this I have another simple scraper that goes through the list of listings, opens them in selenium and collects data on that listing.

for links in houselinklist:
    print(links)
    newwebpage = links
    driver.get(newwebpage)
    html = driver.page_source
    soup = bs.BeautifulSoup(html,'html.parser')
    .
    .
    .
    . more code here

Where is the link you are scraping ?

ksai
– ksai

2017-07-31 05:01:07 +00:00
Commented Jul 31, 2017 at 5:01 — ksai
– ksai, Commented Jul 31, 2017 at 5:01
28hse.com/en/rent/house-type-g1

bathtubandatoaster
– bathtubandatoaster

2017-07-31 05:07:28 +00:00
Commented Jul 31, 2017 at 5:07 — bathtubandatoaster
– bathtubandatoaster, Commented Jul 31, 2017 at 5:07
What error did you get ?

ksai
– ksai

2017-07-31 05:26:27 +00:00
Commented Jul 31, 2017 at 5:26 — ksai
– ksai, Commented Jul 31, 2017 at 5:26

DJK · Accepted Answer · 2017-07-31 05:40:40Z

The problem is while True: creates a loop that runs infinity. Your except clause has a pass statement, which means once an error occurs, the loop just continues to run. Instead it can be written as

wait = WebDriverWait(driver, 10)
while True:
    try:
        element = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, 'next')))
        html = driver.page_source
        soup = bs.BeautifulSoup(html,'html.parser')
        table = soup.find(id = 'search_main_div')
        classtitle =  table.find_all('p', class_= 'title')
        for aaa in classtitle:
            hrefsyo =  aaa.find('a', href = True)
            linkstoclick = hrefsyo.get('href')
            houselinklist.append(linkstoclick)
        element.click()
    except:
        break # change this to exit loop

once an error occurs, the loop will break and move on to the next line of code

or you can just you can eliminate the while loop and just loop over your list of href links with a for loop

wait = WebDriverWait(driver, 10)
hrefLinks = ['link1','link2','link3'.....]
for link in hrefLinks:
    try:
        driver.get(link)
        element = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, 'next')))
        html = driver.page_source
        soup = bs.BeautifulSoup(html,'html.parser')
        table = soup.find(id = 'search_main_div')
        classtitle =  table.find_all('p', class_= 'title')
        for aaa in classtitle:
            hrefsyo =  aaa.find('a', href = True)
            linkstoclick = hrefsyo.get('href')
            houselinklist.append(linkstoclick)
        element.click()
    except:
        pass #pass on error and move on to next hreflink

Collectives™ on Stack Overflow

Webscraping using selenium, beautifulsoup and python

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related