
I would like to scrape the data of this web site ( http://www.oddsportal.com/matches/soccer ) in order to get a plain text file with the match info and the odds info in this way:

00:30   Criciuma - Atletico-PR                    1:2   2.70    3.24    2.41    
10:45   Vier-und Marschlande - Concordia Hamburg  0:0   4.00    3.53    1.68    
10:45   Germania Schnelsen - ASV Bergedorf 85     2:3   1.95    3.37    3.23    
10:45   Barmbecker SG - Altona                    0:2   3.67    3.37    1.82

I used to do this with w3m, but it seems they have switched from plain HTML to Javascript, and w3m no longer works. The data are contained in a single div; this is one entry:

<tr xeid="862487"><td class="table-time datet t1333724400-1-1-0-0 ">17:00</td><td class="name table-participant" colspan="2"><a href="/soccer/italy/serie-b-2011-2012/brescia-marmi-lanza-verona-862487/">Brescia - Verona</a></td><td class="odds-nowrp" xoid="40456791" xodd="xzc0fxzxa">-</td><td class="odds-nowrp" xoid="40456793" xodd="cz0ofxz9c">-</td><td class="odds-nowrp" xoid="40456792" xodd="cz9xfcztx">-</td><td class="center info-value">17</td></tr>

What can I do?
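For reference, once the rendered HTML is in hand (by whatever means), a row in the shape above can be pulled apart with nothing but stdlib regexes. A minimal sketch, assuming exactly the markup shown, where the odds cells still read `-` until Javascript fills them in:

```ruby
# Sketch: extract time, teams and odds cells from one row of the table.
# The row below is the sample markup from the question, verbatim.
row = '<tr xeid="862487"><td class="table-time datet t1333724400-1-1-0-0 ">17:00</td>' \
      '<td class="name table-participant" colspan="2">' \
      '<a href="/soccer/italy/serie-b-2011-2012/brescia-marmi-lanza-verona-862487/">Brescia - Verona</a></td>' \
      '<td class="odds-nowrp" xoid="40456791" xodd="xzc0fxzxa">-</td>' \
      '<td class="odds-nowrp" xoid="40456793" xodd="cz0ofxz9c">-</td>' \
      '<td class="odds-nowrp" xoid="40456792" xodd="cz9xfcztx">-</td>' \
      '<td class="center info-value">17</td></tr>'

time  = row[/class="table-time[^"]*">([^<]+)</, 1]            # "17:00"
match = row[/table-participant.*?<a [^>]*>([^<]+)</, 1]       # "Brescia - Verona"
odds  = row.scan(/class="odds-nowrp"[^>]*>([^<]+)</).flatten  # ["-", "-", "-"]

puts format('%-7s %-41s %s', time, match, odds.join('    '))
```

This only covers the parsing step; it does nothing about getting the Javascript-rendered page in the first place, which is the actual problem here.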

  • Can you provide more information about how they are using Javascript? That will dictate potential solutions. Commented Apr 6, 2012 at 13:40
  • I still see the values in the HTML source. Commented Apr 6, 2012 at 13:55
  • @Fenisko I can't. How is that possible? Commented Apr 6, 2012 at 13:56
  • No idea. In Firefox I can see the table in recognizable HTML. So I guess 20 minutes work with BeautifulSoup ;-). Commented Apr 6, 2012 at 14:02
  • @Fenisko - just because you can see it in Firefox does not mean it is in the response. Commented Apr 6, 2012 at 18:53

2 Answers


The easiest way (though maybe not the best) is to use Selenium or Watir, which drive a real browser. In Ruby I would do:

require 'watir-webdriver'
require 'csv'

# Drive a real browser so the site's Javascript runs and fills in the odds.
@browser = Watir::Browser.new
@browser.goto 'http://www.oddsportal.com/matches/soccer/'

CSV.open('out.csv', 'w') do |out|
  # Match rows carry a class containing "deactivate"; dump each row's cells.
  @browser.trs(:class => /deactivate/).each do |tr|
    out << tr.tds.map(&:text)
  end
end

1 Comment

Yes, there's also JRuby and HtmlUnit. I think you'll find that /odd/ will only give odd-numbered rows.

If they are using Javascript to fetch data from a service and render it within the div, w3m will not show the div updated with that data, because it does not support Javascript.

You have two choices:

  • Reverse-engineer their Javascript to find out where the data is coming from, and see if you can query that data source directly to get the XML or JSON they're using to update the DIV. Then you can skip the scraping entirely. They might not want you doing that, however, and may have secured the data source to prevent it. Or they might not have.

  • Use a browser which executes Javascript before you start your scraping; that way the div will already be populated with the data. w3m-js might do this for you, or you might want to try something else (lynx or links). This question seems to be related.
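If the first route works out, the only remaining step is reshaping whatever JSON the service returns into the plain-text lines the question asks for. A sketch with an entirely hypothetical payload shape: the field names (`time`, `match`, `score`, `odds`) are guesses, since the real feed's structure is unknown.

```ruby
require 'json'

# Hypothetical payload -- the real service's field names are unknown,
# so this shape is only a placeholder for illustration.
payload = '{"matches":[{"time":"00:30","match":"Criciuma - Atletico-PR",' \
          '"score":"1:2","odds":[2.70,3.24,2.41]}]}'

lines = JSON.parse(payload)['matches'].map do |m|
  format('%-7s %-41s %-5s %s',
         m['time'], m['match'], m['score'],
         m['odds'].map { |o| format('%.2f', o) }.join('    '))
end

puts lines
```

The formatting mirrors the columns in the desired output above; adjust the field widths once the real feed's contents are known.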

ETA: Maybe PhantomJS would help here?

6 Comments

I don't know how to get data from their service. What do you mean by "use a browser which executes Javascript before you start your scraping"? I need to do this automatically, to collect data at different times.
If you look at the source JS which is building the content in their div, it might indicate where it's getting the data. You could get the same data (in XML or JSON) and skip the scraping if they haven't secured it. As far as the browser goes: because they're using JS to render the data, they're counting on their viewers having JS enabled. W3M does not support JS, so it's not rendering the data. I'll update my answer accordingly.
w3m-js seems to have disappeared from the web :(
I agree with what you say except for the part about securing the data. If you can see the data in a browser, then you can scrape it.
Maybe. I can imagine the service being set up to require certain criteria (e.g. a cookie or similar session token) in the request; such criteria could certainly be imitated or spoofed somehow, but it would make regularly sipping data from the service somewhat less simple.
