Scrapey
A simple framework for solving common scraping problems
Install latest version
Add to Gemfile
gem "scrapey", :git => 'https://github.com/monkeysuffrage/scrapey.git'
Then run:
$ bundle install
Create a new scrapey project
$ scrapey my_scraper
Examples
CSV
By default scrapey will save as 'output.csv' You can change this with:
@output = 'mycsv.csv'
require 'scrapey'
page = get 'http://www.alexa.com/topsites'
page.search('li.site-listing').each do |li|
save [li.at('a').text, li.at('.description').text, li.at('.stars')[:title]]
end
Database
if you created a scrapey project you can fill out the database connection information in config/config.yml
require 'scrapey'
tables 'Movie', 'Actor' # create ActiveRecord models
page = get 'http://www.imdb.com/movies-in-theaters/'
page.search('div.list_item').each do |div|
movie = Movie.find_or_create_by_title div.at('h4 a').text
div.search('span[@itemprop="actors"] a').each do |a|
actor = Actor.find_or_create_by_name a.text
end
end
Caching
Scrapey can cache responses so that next time they don't hit the network
use_cache
You can use redis for caching if you have lots of memory
require 'redis'
use_cache :redis => Redis.new
Retries
Retry downloads on error a max of 3 times and sleep 30 seconds between retries.
get 'some_url', :retries => 3, :sleep => 30
Or just handle errors in an on_error method (Scrapey will call it automatically if it's defined)
def on_error e, method, url, , *args
puts "retrying #{url} again in 30 seconds..."
sleep 30
send method, url, , *args
end
Proxy switching
def on_error e, method, url, , *args
host, port = @config['proxies'].sample.split(':')
set_proxy host, port.to_i
send method, url, , *args
end
get 'some_throttled_website_url'
Concurrent downloads
Scrapey will ensure that the callbacks are threadsafe
require 'scrapey'
fields 'url', 'title'
def scrape url, response
doc = Nokogiri::HTML response
save({'url' => url, 'title' => doc.at('title').text})
end
multi_get ['http://www.yahoo.com/', 'http://www.google.com.', 'http://www.bing.com/'], :threads => 3, :on_success => :scrape