Class: CraigScrape::Posting

Inherits: Scraper (full chain: Object, Scraper, CraigScrape::Posting)

Defined in: lib/posting.rb

Overview

Posting represents a fully downloaded and parsed Craigslist post. Instances are generally returned by the listing scrape methods, and carry the post summary data for a specific search url or a general listing category.

Constant Summary

POST_DATE =

/Date:[^\d]*((?:[\d]{2}|[\d]{4})\-[\d]{1,2}\-[\d]{1,2}[^\d]+[\d]{1,2}\:[\d]{1,2}[ ]*[AP]M[^a-z]+[a-z]+)/i

LOCATION =

/Location\:[ ]+(.+)/

HEADER_LOCATION =

/^.+[ ]*\-[ ]*[\$]?[\d]+[ ]*\((.+)\)$/

POSTING_ID =

/PostingID\:[ ]*([\d]+)/

REPLY_TO =

/(.+)/

PRICE =

/((?:^\$[\d]+(?:\.[\d]{2})?)|(?:\$[\d]+(?:\.[\d]{2})?$))/

USERBODY_PARTS = NOTE: we implement the (?:) to first check the ‘old’ style format, and then the ‘new style’ (As of 12/03’s parse changes)

/^(.+)\<div id\=\"userbody\">(.+)\<br[ ]*[\/]?\>\<br[ ]*[\/]?\>(.+)\<\/div\>(.+)$/m

HTML_HEADER =

/^(.+)\<div id\=\"userbody\">/m

IMAGE_SRC =

/\<im[a]?g[e]?[^\>]*src=(?:\'([^\']+)\'|\"([^\"]+)\"|([^ ]+))[^\>]*\>/

REQUIRED_FIELDS = This is used to determine if there’s a parse error

%w(contents posting_id post_time header title full_section)

XPATH_USERBODY =

"//*[@id='userbody']"

XPATH_BLURBS =

"//ul[@class='blurbs']"

XPATH_PICS =

"//*[@class='tn']/a/@href"

XPATH_REPLY_TO =

"//*[@class='dateReplyBar']/small/a"

Constants inherited from Scraper

Scraper::HTML_ENCODING, Scraper::HTML_TAG, Scraper::HTTP_HEADERS, Scraper::URL_PARTS

Instance Attribute Summary

#href ⇒ Object readonly

This is really just for testing; in production use, uri.path is a better solution.

Attributes inherited from Scraper

#url

Instance Method Summary

#contents ⇒ Object

String, The full-html contents of the post.
#contents_as_plain ⇒ Object

Returns the post contents with all html tags removed.
#deleted_by_author? ⇒ Boolean

Returns true if this Post was parsed, and represents a ‘This posting has been deleted by its author.’ notice.
#flagged_for_removal? ⇒ Boolean

Returns true if this Post was parsed, and merely a ‘Flagged for Removal’ page.
#full_section ⇒ Object

Array, hierarchical representation of the post’s section.
#has_img? ⇒ Boolean

true if post summary has ‘img(s)’.
#has_pic? ⇒ Boolean

true if post summary has ‘pic(s)’.
#has_pic_or_img? ⇒ Boolean

true if post summary has either the img or pic label. This can always be pulled from the listing post-summary, and should never cause an additional page load.
#header ⇒ Object

String, The contents of the item’s html body heading.
#header_as_plain ⇒ Object

Returns the header with all html tags removed.
#images ⇒ Object

Array, urls of the post’s images that are not hosted on craigslist.
#img_types ⇒ Object

Array, which image types are listed for the post.
#initialize(*args) ⇒ Posting constructor

Create a new Post via a url (String), or supplied parameters (Hash).
#is_active_post? ⇒ Boolean

This is mostly used to determine if the post should be checked for parse errors.
#label ⇒ Object

Returns the post label.
#location ⇒ Object

String, the location of the item, as best could be parsed.
#pics ⇒ Object

Array, urls of the post’s craigslist-hosted images.
#post_date ⇒ Object

Reflects only the date portion of the posting.
#post_time ⇒ Object

Time, reflects the full timestamp of the posting.
#posting_has_expired? ⇒ Boolean

Returns true if this Post was parsed, and represents a ‘This posting has expired.’ notice.
#posting_id ⇒ Object

Integer, Craigslist’s unique posting id.
#price ⇒ Object

Returns the best-guess of a price, judging by the label’s contents.
#reply_to ⇒ Object

String, represents the post’s reply-to address, if listed.
#section ⇒ Object

Retrieves the most-relevant craigslist ‘section’ of the post.
#system_post? ⇒ Boolean

Some posts (deleted_by_author, flagged_for_removal) are common template posts that craigslist puts up in lieu of an original. This returns true if that case applies, and false otherwise.
#title ⇒ Object

String, the item’s title.

Methods inherited from Scraper

#downloaded?, #uri

Constructor Details

#initialize(*args) ⇒ `Posting`

Create a new Post via a url (String), or supplied parameters (Hash)

# File 'lib/posting.rb', line 39

def initialize(*args)
  super(*args)

  # Validate that required fields are present, at least - if we've downloaded it from a url
  if args.first.kind_of? String and is_active_post?
    unparsed_fields = REQUIRED_FIELDS.find_all{|f| 
      val = send(f)
      val.nil? or (val.respond_to? :length and val.length == 0)
    } 
    parse_error! unparsed_fields unless unparsed_fields.empty?
  end  

end
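
The required-field check in the constructor can be tried in isolation. The following is a minimal sketch, assuming a hypothetical StubPost struct and a hypothetical unparsed_fields helper in place of the real Posting class; only REQUIRED_FIELDS and the nil-or-empty test are taken from the listing above.

```ruby
# Copy of REQUIRED_FIELDS from the Constant Summary.
REQUIRED_FIELDS = %w(contents posting_id post_time header title full_section)

# Hypothetical stand-in with one reader per required field; NOT the real Posting.
StubPost = Struct.new(*REQUIRED_FIELDS.map(&:to_sym))

# Hypothetical helper: names of the fields that are nil or empty,
# using the same test as the constructor.
def unparsed_fields(post)
  REQUIRED_FIELDS.find_all do |f|
    val = post.send(f)
    val.nil? or (val.respond_to? :length and val.length == 0)
  end
end

complete = StubPost.new('<p>desk</p>', 123, Time.now, 'Desk - $75', 'Desk', ['for sale'])
partial  = StubPost.new('<p>desk</p>', 123, Time.now, 'Desk - $75', '', [])

p unparsed_fields(complete)
p unparsed_fields(partial)
```

In the real constructor a non-empty result is handed to parse_error!, and the check only runs when the post was built from a url String and is_active_post? is true.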

Instance Attribute Details

#href ⇒ `Object` (readonly)

This is really just for testing; in production use, uri.path is a better solution




# File 'lib/posting.rb', line 36

def href
  @href
end

Instance Method Details

#contents ⇒ `Object`

String, The full-html contents of the post

# File 'lib/posting.rb', line 133

def contents
  unless @contents
    @contents = user_body if html_source
    @contents = he_decode(@contents).strip if @contents
  end
  
  @contents
end

#contents_as_plain ⇒ `Object`

Returns the post contents with all html tags removed




# File 'lib/posting.rb', line 304

def contents_as_plain
  strip_html contents
end

#deleted_by_author? ⇒ `Boolean`

Returns true if this Post was parsed, and represents a ‘This posting has been deleted by its author.’ notice

Returns:

(Boolean)

# File 'lib/posting.rb', line 222

def deleted_by_author?
  @deleted_by_author = (
    system_post? and header_as_plain == "This posting has been deleted by its author."
  ) if @deleted_by_author.nil?
  
  @deleted_by_author
end

#flagged_for_removal? ⇒ `Boolean`

Returns true if this Post was parsed, and merely a ‘Flagged for Removal’ page

Returns:

(Boolean)

# File 'lib/posting.rb', line 213

def flagged_for_removal?
  @flagged_for_removal = (
    system_post? and header_as_plain == "This posting has been flagged for removal"
  ) if @flagged_for_removal.nil?
  
  @flagged_for_removal
end

#full_section ⇒ `Object`

Array, hierarchical representation of the post’s section

# File 'lib/posting.rb', line 76

def full_section
  unless @full_section
    @full_section = []
    
    (html_head / "*[@class='bchead']//a").each do |a|
      @full_section << he_decode(a.inner_html) unless a['id'] and a['id'] == 'ef'
    end if html_head
  end

  @full_section
end

#has_img? ⇒ `Boolean`

true if post summary has ‘img(s)’. ‘imgs’ are different than pics, in that the resource is not hosted on craigslist’s servers. This can always be pulled from the listing post-summary, and should never cause an additional page load

Returns:

(Boolean)




# File 'lib/posting.rb', line 281

def has_img?
  img_types.include? :img
end

#has_pic? ⇒ `Boolean`

true if post summary has ‘pic(s)’. ‘pics’ are different than imgs, in that craigslist hosts the resource on its own servers. This can always be pulled from the listing post-summary, and should never cause an additional page load

Returns:

(Boolean)




# File 'lib/posting.rb', line 287

def has_pic?
  img_types.include? :pic
end

#has_pic_or_img? ⇒ `Boolean`

true if post summary has either the img or pic label. This can always be pulled from the listing post-summary, and should never cause an additional page load

Returns:

(Boolean)




# File 'lib/posting.rb', line 293

def has_pic_or_img?
  img_types.length > 0
end

#header ⇒ `Object`

String, The contents of the item’s html body heading

# File 'lib/posting.rb', line 55

def header
  unless @header
    h2 = html_head.at 'h2' if html_head
    @header = he_decode h2.inner_html if h2
  end
  
  @header
end

#header_as_plain ⇒ `Object`

Returns the header with all html tags removed. Granted, the header should usually be plain, but in the case of a ‘system_post’ we may get tags in here




# File 'lib/posting.rb', line 310

def header_as_plain
  strip_html header
end

#images ⇒ `Object`

Array, urls of the post’s images that are not hosted on craigslist

# File 'lib/posting.rb', line 180

def images
  # Keep in mind that when users post html to craigslist, they're often not posting wonderful html...
  @images = ( 
    contents ? 
      contents.scan(IMAGE_SRC).collect{ |a| a.find{|b| !b.nil? } } :
      [] 
  ) unless @images
  
  @images
end
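
To see what the IMAGE_SRC scan yields, here is a minimal sketch run against a made-up HTML fragment. The external_images helper and the example.com urls are assumptions for illustration; the pattern and the scan/collect chain are copied from the listing above.

```ruby
# Copy of IMAGE_SRC from the Constant Summary.
IMAGE_SRC = /\<im[a]?g[e]?[^\>]*src=(?:\'([^\']+)\'|\"([^\"]+)\"|([^ ]+))[^\>]*\>/

# Hypothetical helper mirroring the body of Posting#images for a plain string.
def external_images(contents)
  return [] unless contents
  # Each scan result holds three capture groups (single-quoted, double-quoted,
  # unquoted src); exactly one of them is non-nil for a given tag.
  contents.scan(IMAGE_SRC).collect { |a| a.find { |b| !b.nil? } }
end

html = %q{<p>Bike</p><img src="http://example.com/a.jpg">} +
       %q{<img alt=x src='http://example.com/b.png' />} +
       %q{<img src=http://example.com/c.gif >}

puts external_images(html)
```

The pattern has no case-insensitive flag, so an upper-case IMG tag would not be picked up by this scan.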

#img_types ⇒ `Object`

Array, which image types are listed for the post. This is always able to be pulled from the listing post-summary, and should never cause an additional page load

# File 'lib/posting.rb', line 264

def img_types
  @img_types || [ (images.length > 0) ? :img : nil, 
    (pics.length > 0) ? :pic : nil ].compact
end

#is_active_post? ⇒ `Boolean`

This is mostly used to determine if the post should be checked for parse errors. Might be useful for someone else though

Returns:

(Boolean)




# File 'lib/posting.rb', line 322

def is_active_post?
  [flagged_for_removal?, posting_has_expired?, deleted_by_author?].none?
end

#label ⇒ `Object`

Returns the post label. At first glance the label appears identical to the header, but it is not. The label is cited on the listings pages, and generally includes everything in the header with the exception of the location. Sometimes the header includes additional information, e.g. ‘(map)’ on rea listings, that is not listed in the label. This is also used as a bandwidth shortcut for the craigwatch program, and is a guaranteed identifier for the post that won’t result in a full page load from the post’s url.

# File 'lib/posting.rb', line 252

def label
  unless @label or system_post?
    @label = header
    
    @label = $1 if location and /(.+?)[ ]*\(#{location}\).*?$/.match @label
  end
  
  @label
end
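
The way the label is derived from the header can be illustrated on plain strings. This is a minimal sketch, assuming a hypothetical label_from helper and made-up header text; the pattern is the one used in the listing above.

```ruby
# Hypothetical helper mirroring Posting#label for plain strings:
# drop the "(location)" suffix, and anything after it, from the header.
def label_from(header, location)
  label = header
  label = $1 if location and /(.+?)[ ]*\(#{location}\).*?$/.match label
  label
end

puts label_from('$1200 / 2br - Sunny flat (Mission District) (map)', 'Mission District')
puts label_from('$1200 / 2br - Sunny flat', nil)
```

Note that the pattern interpolates the location verbatim, so any regexp metacharacters in a location string are interpreted as part of the pattern.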

#location ⇒ `Object`

String, the location of the item, as best could be parsed

# File 'lib/posting.rb', line 143

def location
  if @location.nil? and html
   
    if html.at_xpath(XPATH_BLURBS)
      # This is the post-12/3/12 style:
      @location = $1 if html.xpath(XPATH_BLURBS).first.children.any?{|c| 
        LOCATION.match c.content}
    elsif craigslist_body
      # Location (when explicitly defined):
      cursor = craigslist_body.at 'ul' unless @location

      # This is the legacy style:
      # Note: Apa section includes other things in the li's (cats/dogs ok fields)
      cursor.children.each do |li|
        if LOCATION.match li.inner_html
          @location = he_decode($1) and break
          break
        end
      end if cursor

      # Real estate listings can work a little different for location:
      unless @location
        cursor = craigslist_body.at 'small'
        cursor = cursor.previous until cursor.nil? or cursor.text?
        
        @location = he_decode(cursor.to_s.strip) if cursor
      end
      
      # So, *sometimes* the location just ends up being in the header, I don't know why:
      @location = $1 if @location.nil? and HEADER_LOCATION.match header
    end
  end
  
  @location
end
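
The last fallback above, pulling the location out of the header, depends only on HEADER_LOCATION. A minimal sketch, assuming a hypothetical location_from_header helper and made-up headers:

```ruby
# Copy of HEADER_LOCATION from the Constant Summary.
HEADER_LOCATION = /^.+[ ]*\-[ ]*[\$]?[\d]+[ ]*\((.+)\)$/

# Hypothetical helper: the parenthesised text that follows a "- <number>"
# segment at the end of the header, or nil when the header has no such shape.
def location_from_header(header)
  $1 if HEADER_LOCATION.match header
end

p location_from_header('Vintage desk - $75 (Oakland)')
p location_from_header('Roommate wanted - 2 (downtown / civic center)')
p location_from_header('Vintage desk (Oakland)')
```

The third header has no dash-and-number segment before the parentheses, so the pattern does not match and no location is extracted from it.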

#pics ⇒ `Object`

Array, urls of the post’s craigslist-hosted images

# File 'lib/posting.rb', line 192

def pics
  unless @pics
    @pics = []
    
    if html 
      if html.at_xpath(XPATH_PICS)
        @pics = html.xpath(XPATH_PICS).collect(&:value)
      elsif craigslist_body
        # This is the pre-12/3/12 style:
        # Now let's find the craigslist hosted images:
        img_table = (craigslist_body / 'table').find{|e| e.name == 'table' and e[:summary] == 'craigslist hosted images'}
      
        @pics = (img_table / 'img').collect{|i| i[:src]} if img_table
      end
    end
  end
  
  @pics
end

#post_date ⇒ `Object`

Reflects only the date portion of the posting. Does not include hours/minutes. This is useful when reflecting the listing scrapes, and can be safely used if you wish to conserve bandwidth by not pulling an entire post from a listing scrape.

# File 'lib/posting.rb', line 241

def post_date
  @post_date = Time.local(*[0]*3+post_time.to_a[3...10]) unless @post_date or post_time.nil?
  
  @post_date
end
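
The Time.local expression above is compact, so a worked example may help. This is a minimal sketch, assuming a hypothetical date_only helper; it relies on Time#to_a returning [sec, min, hour, day, month, year, wday, yday, isdst, zone] and on the ten-argument form of Time.local taking its values in that same order.

```ruby
# Hypothetical helper using the same expression as Posting#post_date:
# zero out sec, min and hour, and keep the remaining seven to_a values.
def date_only(post_time)
  Time.local(*[0]*3 + post_time.to_a[3...10])
end

stamp = Time.local(2012, 12, 3, 16, 5, 30)
day   = date_only(stamp)

puts stamp.strftime('%Y-%m-%d %H:%M:%S')
puts day.strftime('%Y-%m-%d %H:%M:%S')
```

The result is still a Time, set to local midnight of the posting day, rather than a Date.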

#post_time ⇒ `Object`

Time, reflects the full timestamp of the posting

# File 'lib/posting.rb', line 104

def post_time
  unless @post_time
    cursor = html_head.at 'hr' if html_head
    cursor = cursor.next until cursor.nil? or POST_DATE.match cursor.to_s
    @post_time = Time.parse $1 if $1
  end
  
  @post_time
end
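
The POST_DATE match and the Time.parse step can be exercised without any HTML. A minimal sketch, assuming a hypothetical parse_post_time helper and a made-up date line in the shape the pattern expects:

```ruby
require 'time'  # provides Time.parse

# Copy of POST_DATE from the Constant Summary.
POST_DATE = /Date:[^\d]*((?:[\d]{2}|[\d]{4})\-[\d]{1,2}\-[\d]{1,2}[^\d]+[\d]{1,2}\:[\d]{1,2}[ ]*[AP]M[^a-z]+[a-z]+)/i

# Hypothetical helper: capture the timestamp text and hand it to Time.parse,
# as Posting#post_time does once its cursor reaches the matching node.
def parse_post_time(text)
  Time.parse $1 if POST_DATE.match text
end

stamp = parse_post_time('Date: 2012-12-03, 4:05PM PST')
puts stamp.year
puts stamp.min
p parse_post_time('no date here')
```

Text that does not contain a line of this shape yields nil, which is also what post_time returns when no node after the hr element matches.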

#posting_has_expired? ⇒ `Boolean`

Returns true if this Post was parsed, and represents a ‘This posting has expired.’ notice

Returns:

(Boolean)

# File 'lib/posting.rb', line 231

def posting_has_expired?
  @posting_has_expired = (
    system_post? and header_as_plain == "This posting has expired."
  ) if @posting_has_expired.nil?
  
  @posting_has_expired
end

#posting_id ⇒ `Object`

Integer, Craigslist’s unique posting id

# File 'lib/posting.rb', line 115

def posting_id
  if @posting_id 

  elsif USERBODY_PARTS.match html_source
    # Old style:
    html_footer = $4
    cursor = Nokogiri::HTML html_footer, nil, HTML_ENCODING 
    cursor = cursor.next until cursor.nil? or 
    @posting_id = $1.to_i if POSTING_ID.match html_footer.to_s
  else
    # Post 12/3
    @posting_id = $1.to_i if POSTING_ID.match html.xpath("//*[@class='postingidtext']").to_s
  end

  @posting_id
end

#price ⇒ `Object`

Returns the best guess of a price, judging by the label’s contents. Price is available when pulled from the listing summary, and can be safely used if you wish to conserve bandwidth by not pulling an entire post from a listing scrape.




# File 'lib/posting.rb', line 299

def price
  $1.tr('$','').to_f if label and PRICE.match label
end
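
The PRICE pattern only accepts a dollar amount at the very start or the very end of the label. A minimal sketch, assuming a hypothetical price_from_label helper and made-up labels:

```ruby
# Copy of PRICE from the Constant Summary.
PRICE = /((?:^\$[\d]+(?:\.[\d]{2})?)|(?:\$[\d]+(?:\.[\d]{2})?$))/

# Hypothetical helper mirroring the body of Posting#price for a plain string.
def price_from_label(label)
  $1.tr('$', '').to_f if label and PRICE.match label
end

p price_from_label('$1200 sunny 2br near park')         # amount at the start
p price_from_label('Road bike, barely used - $350.50')  # amount at the end
p price_from_label('Selling for $40 or best offer')     # amount in the middle
p price_from_label('Free couch, you haul')              # no amount at all
```

The last two labels return nil: an amount in the middle of the label is deliberately ignored by the anchors in the pattern, which is one reason the method is described as a best guess.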

#reply_to ⇒ `Object`

String, represents the post’s reply-to address, if listed

# File 'lib/posting.rb', line 89

def reply_to
  unless @reply_to
    if html.at_xpath(XPATH_REPLY_TO)
      @reply_to = html.at_xpath(XPATH_REPLY_TO).content
    else
      cursor = html_head.at 'hr' if html_head
      cursor = cursor.next until cursor.nil? or cursor.name == 'a'
      @reply_to = $1 if cursor and REPLY_TO.match he_decode(cursor.inner_html)
    end
  end
  
  @reply_to
end

#section ⇒ `Object`

Retrieves the most-relevant craigslist ‘section’ of the post. This is generally the same as full_section.last. However, this (sometimes/rarely) conserves bandwidth by pulling this field from the listing post-summary

# File 'lib/posting.rb', line 271

def section
  unless @section
    @section = full_section.last if full_section  
  end
  
  @section
end

#system_post? ⇒ `Boolean`

Some posts (deleted_by_author, flagged_for_removal) are common template posts that craigslist puts up in lieu of an original. This returns true if that case applies, and false otherwise

Returns:

(Boolean)




# File 'lib/posting.rb', line 316

def system_post?
  [contents,posting_id,post_time,title].all?{|f| f.nil?}
end
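
The status predicates build on each other: system_post? checks that the core fields are all nil, the three notice predicates add a header comparison, and is_active_post? is true only when none of them apply. A minimal sketch of that relationship, assuming a hypothetical TemplatePost struct in place of the real class and omitting the memoization used in the listings:

```ruby
# Hypothetical stand-in; NOT the real CraigScrape::Posting.
TemplatePost = Struct.new(:contents, :posting_id, :post_time, :title, :header_as_plain) do
  def system_post?
    [contents, posting_id, post_time, title].all? { |f| f.nil? }
  end

  def flagged_for_removal?
    system_post? and header_as_plain == "This posting has been flagged for removal"
  end

  def posting_has_expired?
    system_post? and header_as_plain == "This posting has expired."
  end

  def deleted_by_author?
    system_post? and header_as_plain == "This posting has been deleted by its author."
  end

  def is_active_post?
    [flagged_for_removal?, posting_has_expired?, deleted_by_author?].none?
  end
end

expired = TemplatePost.new(nil, nil, nil, nil, "This posting has expired.")
normal  = TemplatePost.new('<p>desk</p>', 123, Time.now, 'Desk', 'Desk - $75 (Oakland)')

p expired.posting_has_expired?
p expired.is_active_post?
p normal.is_active_post?
```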

#title ⇒ `Object`

String, the item’s title

# File 'lib/posting.rb', line 65

def title
  unless @title
    title_tag = html_head.at 'title' if html_head
    @title = he_decode title_tag.inner_html if title_tag
    @title = nil if @title and @title.length == 0
  end

  @title
end
