Class: CraigScrape::Scraper

Inherits: Object
Defined in:
lib/scraper.rb

Overview

Scraper is a general-purpose base class for all libcraigscrape objects. It facilitates all http-related functionality, adds some useful helpers for eager-loading http objects and for general html handling, and contains the http-related cattr_accessors:

logger - a Logger object to send http debug notices to. Defaults to nil.

retries_on_fetch_fail - The number of times to retry a failed uri download. Defaults to 8.

sleep_between_fetch_retries - The number of seconds to sleep between successive attempts after a failed download. Defaults to 30.

retries_on_404_fail - The number of times to retry a Resource Not Found error (http response code 404). Defaults to 3.

sleep_between_404_retries - The number of seconds to sleep between successive attempts after a Resource Not Found error. Defaults to 3.
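These accessors are set at the class level and shared by all subclasses. A hedged sketch of how they might be tuned, emulating the gem's cattr_accessor-style attributes with a stand-in class so the snippet runs without the gem (ScraperConfig is illustrative; in the gem these live on CraigScrape::Scraper):

```ruby
require 'logger'

# Stand-in emulating the gem's class-level accessors; defaults match the docs above.
class ScraperConfig
  class << self
    attr_accessor :logger, :retries_on_fetch_fail, :sleep_between_fetch_retries,
                  :retries_on_404_fail, :sleep_between_404_retries
  end

  self.retries_on_fetch_fail = 8
  self.sleep_between_fetch_retries = 30
  self.retries_on_404_fail = 3
  self.sleep_between_404_retries = 3
end

ScraperConfig.logger = Logger.new($stderr)  # send http debug notices to stderr
ScraperConfig.retries_on_fetch_fail = 4     # be less patient than the default 8
```

Because these are class-level attributes, a change here affects every scrape performed by any subclass for the remainder of the process.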

Direct Known Subclasses

GeoListings, Listings, Posting

Defined Under Namespace

Classes: BadConstructionError, BadUrlError, FetchError, ParseError

Constant Summary

URL_PARTS =
/^(?:([^\:]+)\:\/\/([^\/]*))?(.*)$/
HTML_TAG =
/<\/?[^>]*>/
HTML_ENCODING =

We have to specify this for Nokogiri. Sometimes it tries to figure out the encoding on its own, and craigslist users sometimes post crazy bytes.

"UTF-8"
HTTP_HEADERS =
{ "Cache-Control" => "no-cache", "Pragma" => "no-cache", 
"Accept" => "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 
"User-Agent" => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168 Safari/535.19"}

Instance Attribute Summary

Instance Method Summary

Constructor Details

#initialize(init_via = nil) ⇒ Scraper

Scraper objects can be created from either a full URL (string) or a Hash. Currently, this initializer isn’t intended to be called by libcraigscrape API users, though if you know what you’re doing, feel free to try it out.

A url string can use either the ‘http://’ scheme or the ‘file://’ scheme.

When constructing from a hash, the keys in the hash are used to set the object’s corresponding instance variables. This is useful for creating an object without actually making an html request; it sets up an object before it eager-loads any values not already passed in via the constructor hash. Though optional, if you’re going to set this object up for eager-loading, be sure to pass a :url key in your hash; otherwise the eager load will fail.



# File 'lib/scraper.rb', line 65

def initialize(init_via = nil)
  if init_via.nil?
    # Do nothing - possibly not a great idea, but we'll allow it
  elsif init_via.kind_of? String
    @url = init_via
  elsif init_via.kind_of? Hash
    init_via.each_pair{|k,v| instance_variable_set "@#{k}", v}
  else
    raise BadConstructionError, ("Unrecognized parameter passed to %s.new %s" % [self.class.to_s, init_via.class.inspect])
  end
end
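A standalone sketch of the three construction paths; MiniScraper below is a hypothetical stand-in that mirrors the initializer above, so the example runs without the gem:

```ruby
# Hypothetical stand-in mirroring Scraper#initialize; not the gem's class.
class MiniScraper
  class BadConstructionError < StandardError; end

  attr_reader :url

  def initialize(init_via = nil)
    if init_via.nil?
      # allowed: an empty, not-yet-configured scraper
    elsif init_via.kind_of? String
      @url = init_via
    elsif init_via.kind_of? Hash
      # each hash key becomes an instance variable of the same name
      init_via.each_pair { |k, v| instance_variable_set "@#{k}", v }
    else
      raise BadConstructionError,
        "Unrecognized parameter passed to %s.new %s" % [self.class, init_via.class]
    end
  end
end

MiniScraper.new("http://example.craigslist.org/").url  # construction from a url string
MiniScraper.new(url: "file:///tmp/listing.html").url   # hash keys become ivars
```

Anything other than nil, a String, or a Hash (an Integer, say) raises BadConstructionError.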

Instance Attribute Details

#url ⇒ Object (readonly)

Returns the full url that corresponds to this resource



# File 'lib/scraper.rb', line 40

def url
  @url
end

Instance Method Details

#downloaded? ⇒ Boolean

Indicates whether the resource has yet been retrieved from its associated url. This is useful for distinguishing an instance that was set up for eager-loading but hasn’t yet been fetched.

Returns:

  • (Boolean)


# File 'lib/scraper.rb', line 80

def downloaded?; !@html_source.nil?; end
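To make the eager-load distinction concrete, a minimal stand-in (LazyResource is hypothetical; its predicate mirrors the method above):

```ruby
# Illustrative: an object set up from a hash is configured but not yet
# fetched, so the predicate is false until @html_source is populated.
class LazyResource
  def initialize(attrs = {})
    attrs.each_pair { |k, v| instance_variable_set "@#{k}", v }
  end

  def downloaded?; !@html_source.nil?; end
end

r = LazyResource.new(url: "http://example.org/post.html")
r.downloaded?  # false until something assigns @html_source
```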

#uri ⇒ Object

Returns a URI object corresponding to this scraper’s url.



# File 'lib/scraper.rb', line 83

def uri
  @uri ||= URI.parse @url if @url
  @uri
end
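The ||= memoizes the parse, so repeated calls reuse one URI object (and the method returns nil when no url was set). What that parse yields for a typical listing url, using Ruby's stdlib (the url itself is illustrative):

```ruby
require 'uri'

# What URI.parse yields for a typical craigslist-style url:
uri = URI.parse("http://example.craigslist.org/search/sss")
uri.scheme  # "http"
uri.host    # "example.craigslist.org"
uri.path    # "/search/sss"
```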