Almost Scraping: Web Scraping for Non-Programmers
Michelle Minkoff, PBSNews.org
Matt Wynn, Omaha World-Herald
What is Web scraping?
The *all-knowing* Wikipedia says: “Web scraping (also called Web harvesting or Web data extraction) is a computer software technique of extracting information from websites. …Web scraping focuses more on the transformation of unstructured Web content, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet. Web scraping is also related to Web automation, which simulates human Web browsing using computer software. Uses of Web scraping include online price comparison, weather data monitoring, website change detection, Web research, Web content mashup and Web data integration.”
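That "unstructured HTML into structured data" idea is easier to see in code. Here is a minimal sketch using Python's built-in html.parser, run on a toy page (the URLs and state names are invented for illustration):

```python
from html.parser import HTMLParser

# A toy page: unstructured HTML hiding a repeatable structure.
PAGE = """
<ul>
  <li><a href="/law/ca.pdf">California</a></li>
  <li><a href="/law/ia.pdf">Iowa</a></li>
</ul>
"""

class LinkScraper(HTMLParser):
    """Collect (link text, href) pairs: the structured data in the page."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href and data.strip():
            self.rows.append((data.strip(), self._href))

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

scraper = LinkScraper()
scraper.feed(PAGE)
print(scraper.rows)  # structured rows, ready for a spreadsheet
```

The same pattern scales to any page where the thing you want repeats in a predictable tag structure.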
Why do I want to Web scrape?
- Journalists like to find stories; editors like stories that are exclusive.
- Downloading a dataset is like going to a press conference: anyone can grab and use it.
- Web scraping is like an enterprise story: less likely to be picked up by all.
- Puts more control back into your hands.
What kind of data can I get?
- Laws (summary of same-sex marriage laws for each state, PDFs)
- Photos (pictures of all players on a team you’re highlighting, or all mayoral candidates)
- Recipe ingredients (NYT story about peanut butter)
- Health care (see ProPublica’s Dollars for Docs project)
- Links, images, dates, names, categories, tags: anything with some sort of repeatable structure
DownThemAll http://www.downthemall.net
Yahoo Pipes http://pipes.yahoo.com/pipes
Yahoo Pipes
- Access and manipulate RSS feeds, which are often a flurry of information.
- Sort, filter and combine your information.
- Format that info to fit your needs (date formatter).
Yahoo Pipes
- Pair with Versionista, which can create an RSS feed of changes to a Web site, to keep tabs on what’s changing.
- This was done to great effect by ProPublica’s team in late 2009, especially by Scott Klein and then-intern Brian Boyer, now at the Chicago Tribune.
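What Pipes does with boxes and wires (sort, filter, reformat an RSS feed) can be sketched in a few lines of standard-library Python. The feed below is a made-up example:

```python
import xml.etree.ElementTree as ET

# A toy RSS feed standing in for a real one.
RSS = """<rss><channel>
  <item><title>Budget hearing moved</title><pubDate>2011-02-20</pubDate></item>
  <item><title>Zoning change filed</title><pubDate>2011-02-22</pubDate></item>
  <item><title>Weather update</title><pubDate>2011-02-21</pubDate></item>
</channel></rss>"""

root = ET.fromstring(RSS)
items = [(i.findtext("pubDate"), i.findtext("title"))
         for i in root.iter("item")]

# Filter out the noise, then sort newest-first: the Pipes workflow.
kept = sorted((d, t) for d, t in items if "Weather" not in t)
kept.reverse()
print(kept)
```

A real workflow would pull the feed over HTTP and write the kept items back out as a new feed, but the sort/filter core is just this.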
ScraperWiki http://scraperwiki.com
Needlebase http://needlebase.com
Needlebase
- For sites that follow a repetitive formula spanning multiple pages, like an index page and detail pages, maybe with a search results page in the middle.
- Like a good employee: train it once, then let it churn.
Needlebase
- Query, select and filter your data in the Web app, then export in the format of your choice.
- Can check your data and stay up to date on your data set.
- Will go more in depth on Needle in Saturday’s hands-on lab at 10 a.m.
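Under the hood, Needlebase's index-page-to-detail-page pattern looks roughly like the sketch below. It uses canned pages in place of a live site (the paths and fields are invented); a real crawl would fetch each URL with urllib.request:

```python
import re

# Canned pages standing in for a live site (all paths and fields invented).
# A real crawl would fetch each path with urllib.request.urlopen().
SITE = {
    "/doctors": '<a href="/doctors/1">Dr. A</a> <a href="/doctors/2">Dr. B</a>',
    "/doctors/1": "<h1>Dr. A</h1><p>Payments: $500</p>",
    "/doctors/2": "<h1>Dr. B</h1><p>Payments: $1,200</p>",
}

def fetch(path):
    return SITE[path]

# Step 1: scrape the index page for links to the detail pages.
detail_links = re.findall(r'href="(/doctors/\d+)"', fetch("/doctors"))

# Step 2: visit each detail page and pull out the fields we care about.
rows = []
for link in detail_links:
    page = fetch(link)
    name = re.search(r"<h1>(.*?)</h1>", page).group(1)
    amount = re.search(r"Payments: (\$[\d,]+)", page).group(1)
    rows.append((name, amount))

print(rows)  # one structured row per detail page
```

Point-and-click tools automate exactly this two-step loop: learn the link pattern once, then churn through every detail page.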
InfoExtractor http://www.infoextractor.org
iRobotSoft http://irobotsoft.com
iMacros https://addons.mozilla.org/en-US/firefox/addon/imacros-for-firefox/
iMacros
- Record repetitive tasks that you do every day, and keep the results as a data set.
- Think of it like a bookmark, but one that can include logging in or entering a search term.
- Useful for stats you check every day: scores for your local sports team, stocks if you’re a biz reporter, etc.
- A more complex function allows you to extract multiple data points on a page, like from an HTML table.
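Extracting an HTML table, the "more complex function" above, can also be sketched with Python's built-in html.parser. The table below is a toy box score:

```python
from html.parser import HTMLParser

# A toy HTML table (invented scores) standing in for a live stats page.
TABLE = """<table>
<tr><td>Creighton</td><td>78</td></tr>
<tr><td>Nebraska</td><td>65</td></tr>
</table>"""

class TableScraper(HTMLParser):
    """Turn <tr>/<td> markup into a list of rows of cell text."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True

    def handle_data(self, data):
        if self._in_td:
            self._row.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr":
            self.rows.append(self._row)

s = TableScraper()
s.feed(TABLE)
print(s.rows)  # one list per table row
```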
OutWit Hub http://www.outwit.com/products/hub
OutWit Hub
- Versatile Firefox extension.
- Can use it for certain defaults (links, images).
OutWit Hub
- Dig through the HTML hierarchy tree: structural elements (<h3>), stylistic elements (<strong>).
- Download a list of attached files, or the files themselves.
- More options if you buy the Pro version.
- Will discuss in depth and use in the hands-on lab on Saturday at 10 a.m.
Python
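For readers curious what the full programming route looks like: a minimal sketch that pulls every <h3> headline out of a page, the same structural elements OutWit Hub targets. The sample HTML and URL are made up; a live page would be fetched as shown in the comment:

```python
import re
import urllib.request

def scrape_headlines(html):
    """Pull the text of every <h3> element out of a page."""
    return re.findall(r"<h3>(.*?)</h3>", html)

# In practice you would fetch a live page first, e.g.:
#   html = urllib.request.urlopen("http://example.com").read().decode()
html = "<h3>City budget passes</h3><h3>Mayor to retire</h3>"
print(scrape_headlines(html))
```

Regular expressions are fine for quick one-offs like this; for messier real-world pages, a proper parser (like html.parser above) is sturdier.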
Wrap-Up
- Non-programming scrapers can’t do everything, but they have the power to get you started. Some say “Program or be programmed,” but this is a compromise.
- Legal permissions still apply, so don’t use scraped info you don’t have the right to use.
- Something to consider: how does this apply to what you do every day, and how could scraping contribute to your job?
- “The businesses that win will be those that understand how to build value from data from wherever it comes. Information isn’t power. The right information is.” – media consultant Neil Perkin in Marketing Week
