Web Scraping with Python
Virendra Rajput,
Hacker @Markitty
Agenda
● What is scraping
● Why we scrape
● My experiments with web scraping
● How do we do it
● Tools to use
● Online demo
● Some more tools
● Ethics for scraping
converting unstructured documents
into structured information
scraping:
What is Web Scraping?
● Web scraping (web harvesting) is a software
technique of extracting information from
websites
● It focuses on transformation of unstructured
data on the web (typically HTML), into
structured data that can be stored and
analyzed
RSS is meta data and not
HTML replacement
Why we scrape?
● Web pages contain wealth of information (in
text form), designed mostly for human
consumption
● Static websites (legacy systems)
● Interfacing with 3rd party with no API access
● Websites are more important than API’s
● The data is already available (in the form of
web pages)
● No rate limiting
● Anonymous access
How search engines use it
My Experiments with Scraping
and more..!
IMDb API
Did you mean!
Facebook Bot for Brahma
Kumaris
Getting started!
Fetching the data
● Involves finding the endpoint - URL or URL’s
● Sending HTTP requests to the server
● Using requests library:
import requests
data = requests.get(‘http://google.com/’)
html = data.content
Processing (say no to Reg-ex)
● use reg-ex
● Avoid using reg-ex
● Reasons why not to use it:
1. Its fragile
2. Really hard to maintain
3. Improper HTML & Encoding handling
Use BeautifulSoup for parsing
● Provides simple methods to-
○ search
○ navigate
○ select
● Deals with broken web-pages really well
● Auto-detects encoding
Philosophy-
“You didn't write that awful page. You're just trying to get
some data out of it. Beautiful Soup is here to help.”
Export the data
● Database (relational or non-relational)
● CSV
● JSON
● File (XML, YAML, etc.)
● API
Live example demo
Challenges
● External sites can change without warning
○ Figuring out the frequency is difficult (TEST, and
test)
○ Changes can break scrapers easily
● Bad HTTP status codes
○ example: using 200 OK to signal an error
○ cannot always trust your HTTP libraries default
behaviour
● Messy HTML markup
Mechanize
● Stateful web-browsing with
mechanize
○ Fill up forms
○ Follow links
○ Handle cookies
○ Browse history
● After Andy Lester’s WWW:
Mechanize
Filling forms with Mechanize
Scrapy - a framework for web
scraping
● Uses XPath to select elements
● Interactive shell scripting
● Using Scrapy:
○ define a model to store items
○ create your spider to extract items
○ write a Pipeline to store them
Conclusion
● Scrape wisely
● Do not steal
● Use cloud
● Share your scrapers scraperwiki.com
The End!
Virendra Rajput
http://virendra.me/
http://twitter.com/bkvirendra

Web scraping in python

  • 1.
    Web Scraping withPython Virendra Rajput, Hacker @Markitty
  • 2.
    Agenda ● What isscraping ● Why we scrape ● My experiments with web scraping ● How do we do it ● Tools to use ● Online demo ● Some more tools ● Ethics for scraping
  • 3.
    converting unstructured documents intostructured information scraping:
  • 4.
    What is WebScraping? ● Web scraping (web harvesting) is a software technique of extracting information from websites ● It focuses on transformation of unstructured data on the web (typically HTML), into structured data that can be stored and analyzed
  • 5.
    RSS is metadata and not HTML replacement
  • 6.
    Why we scrape? ●Web pages contain wealth of information (in text form), designed mostly for human consumption ● Static websites (legacy systems) ● Interfacing with 3rd party with no API access ● Websites are more important than API’s ● The data is already available (in the form of web pages) ● No rate limiting ● Anonymous access
  • 7.
  • 8.
  • 9.
    and more..! IMDb API Didyou mean! Facebook Bot for Brahma Kumaris
  • 10.
  • 11.
    Fetching the data ●Involves finding the endpoint - URL or URL’s ● Sending HTTP requests to the server ● Using requests library: import requests data = requests.get(‘http://google.com/’) html = data.content
  • 12.
    Processing (say noto Reg-ex) ● use reg-ex ● Avoid using reg-ex ● Reasons why not to use it: 1. Its fragile 2. Really hard to maintain 3. Improper HTML & Encoding handling
  • 13.
    Use BeautifulSoup forparsing ● Provides simple methods to- ○ search ○ navigate ○ select ● Deals with broken web-pages really well ● Auto-detects encoding Philosophy- “You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help.”
  • 14.
    Export the data ●Database (relational or non-relational) ● CSV ● JSON ● File (XML, YAML, etc.) ● API
  • 15.
  • 16.
    Challenges ● External sitescan change without warning ○ Figuring out the frequency is difficult (TEST, and test) ○ Changes can break scrapers easily ● Bad HTTP status codes ○ example: using 200 OK to signal an error ○ cannot always trust your HTTP libraries default behaviour ● Messy HTML markup
  • 17.
    Mechanize ● Stateful web-browsingwith mechanize ○ Fill up forms ○ Follow links ○ Handle cookies ○ Browse history ● After Andy Lester’s WWW: Mechanize
  • 18.
  • 19.
    Scrapy - aframework for web scraping ● Uses XPath to select elements ● Interactive shell scripting ● Using Scrapy: ○ define a model to store items ○ create your spider to extract items ○ write a Pipeline to store them
  • 20.
    Conclusion ● Scrape wisely ●Do not steal ● Use cloud ● Share your scrapers scraperwiki.com
  • 21.