Module: Scrapifier::Support

Includes: XPath

Included in: Methods

Defined in: lib/scrapifier/support.rb

Overview

Support methods to get, check and organize data.

Constant Summary

Constants included from XPath

XPath::AUTHOR, XPath::DESC, XPath::ENCODE, XPath::IMG, XPath::KEYWORDS, XPath::LANG, XPath::REPLY_TO, XPath::TITLE

Class Method Summary

.sf_check_img_ext(images, allowed = []) ⇒ Object

Filter images, returning only those with an allowed extension.
.sf_domain(uri) ⇒ Object

Return the URI domain.
.sf_eval_uri(uri, exts = []) ⇒ Object

Evaluate the URI’s HTML document and get its metadata.
.sf_fix_imgs(imgs, uri, exts = []) ⇒ Object

Check and return only the valid image URIs.
.sf_fix_protocol(path, domain) ⇒ Object

Fix image URIs that don’t have a protocol/domain set.
.sf_img_regex(exts = []) ⇒ Object

Build image regexes according to the required extensions.
.sf_regex(type, *args) ⇒ Object

Select regexes for URIs, protocols and image extensions.
.sf_uri_regex ⇒ Object

Build a hash with the URI regexes.
.sf_xpaths ⇒ Object

Organize XPaths.

Class Method Details

.sf_check_img_ext(images, allowed = []) ⇒ `Object`

Filter images, returning only those with an allowed extension.

Example:

>> sf_check_img_ext('http://source.com/image.gif', :jpg)
=> []
>> sf_check_img_ext(
     ['http://source.com/image.gif','http://source.com/image.jpg'],
     [:jpg, :png]
   )
=> ['http://source.com/image.jpg']

Arguments:

images: (String or Array)
  - Images which will be checked.
allowed: (String, Symbol or Array)
  - Allowed types of image extension.

# File 'lib/scrapifier/support.rb', line 57

def sf_check_img_ext(images, allowed = [])
  allowed ||= []
  if images.is_a?(String)
    images = images.split
  elsif !images.is_a?(Array)
    images = []
  end
  images.select { |i| i =~ sf_regex(:image, allowed) }
end
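
The listing above first normalizes `images` before filtering. A minimal stand-alone sketch of that normalization step follows; `normalize_images` is a hypothetical name used only for illustration and is not part of Scrapifier.

```ruby
# Hypothetical helper mirroring the input handling in the listing above.
def normalize_images(images)
  case images
  when String then images.split # whitespace-separated URIs become an array
  when Array  then images       # arrays pass through unchanged
  else []                       # nil, numbers, etc. are treated as "no images"
  end
end

p normalize_images('http://source.com/a.gif http://source.com/b.jpg')
# => ["http://source.com/a.gif", "http://source.com/b.jpg"]
p normalize_images(['http://source.com/a.gif'])
# => ["http://source.com/a.gif"]
p normalize_images(nil)
# => []
```

Because a String is split on whitespace, a single URI and a space-separated list of URIs are handled the same way.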

.sf_domain(uri) ⇒ `Object`

Return the URI domain.

Example:

>> sf_domain('http://adtangerine.com')
=> 'adtangerine.com'

Arguments:

uri: (String)
  - URI.

# File 'lib/scrapifier/support.rb', line 186

def sf_domain(uri)
  uri = uri.to_s.split('/')
  uri.empty? ? '' : uri[2]
end
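
The listing relies on the fact that splitting an absolute URI on `/` puts the host at index 2. The sketch below reproduces that logic in a stand-alone method (`domain_of` is a hypothetical name) to show which inputs it handles, assuming the URI carries a protocol.

```ruby
# Hypothetical stand-alone copy of the split-based lookup shown above.
def domain_of(uri)
  parts = uri.to_s.split('/')
  parts.empty? ? '' : parts[2]
end

p 'http://adtangerine.com/assets/logo.png'.split('/')
# => ["http:", "", "adtangerine.com", "assets", "logo.png"]
p domain_of('http://adtangerine.com/assets/logo.png') # => "adtangerine.com"
p domain_of(nil)                                      # => ""
p domain_of('adtangerine.com')                        # => nil (no protocol, so there is no index 2)
```

A URI without a protocol yields `nil` rather than the host, so callers may want to guard against that case.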

.sf_eval_uri(uri, exts = []) ⇒ `Object`

Evaluate the URI’s HTML document and get its metadata.

Example:

>> sf_eval_uri('http://adtangerine.com', [:png])
=> {
     :title => "AdTangerine | Advertising Platform for Social Media",
     :description => "AdTangerine is an advertising platform that...",
     :images => [
       "http://adtangerine.com/assets/logo_adt_og.png",
       "http://adtangerine.com/assets/logo_adt_og.png"
     ],
     :uri => "http://adtangerine.com"
   }

Arguments:

uri: (String)
  - URI.
exts: (Array)
  - Allowed image extensions.

# File 'lib/scrapifier/support.rb', line 27

def sf_eval_uri(uri, exts = [])
  doc = Nokogiri::HTML(open(uri).read)
  doc.encoding, meta = 'utf-8', { uri: uri }

  [:title, :description, :keywords, :lang, :encode, :reply_to, :author].each do |k|
    node = doc.xpath(sf_xpaths[k])[0]
    meta[k] = node.nil? ? '-' : node.text
  end
  meta[:images] = sf_fix_imgs(doc.xpath(sf_xpaths[:image]), uri, exts)

  meta
rescue SocketError
  {}
end

.sf_fix_imgs(imgs, uri, exts = []) ⇒ `Object`

Check and return only the valid image URIs.

Example:

>>  sf_fix_imgs(
      ['http://adtangerine.com/image.png', '/assets/image.jpg'],
      'http://adtangerine.com',
      :jpg
    )
=> ['http://adtangerine.com/assets/image.jpg']

Arguments:

imgs: (Array)
  - Image URIs extracted from the HTML document.
uri: (String)
  - Base URI used for image paths that have no protocol/domain set.
exts: (Symbol or Array)
  - Allowed image extensions.

# File 'lib/scrapifier/support.rb', line 145

def sf_fix_imgs(imgs, uri, exts = [])
  sf_check_img_ext(imgs.map do |img|
    img = img.to_s
    unless img =~ sf_regex(:protocol)
      img = sf_fix_protocol(img, sf_domain(uri))
    end
    img if img =~ sf_regex(:image)
  end.compact, exts)
end

.sf_fix_protocol(path, domain) ⇒ `Object`

Fix image URIs that don’t have a protocol/domain set.

Example:

>> sf_fix_protocol('/assets/image.jpg', 'adtangerine.com')
=> 'http://adtangerine.com/assets/image.jpg'
>> sf_fix_protocol(
     '//s.ytimg.com/yts/img/youtub_img.png',
     'youtube.com'
   )
=> 'http://s.ytimg.com/yts/img/youtub_img.png'

Arguments:

path: (String)
  - URI path having no protocol/domain set.
domain: (String)
  - Domain that will be prepended into the path.

# File 'lib/scrapifier/support.rb', line 170

def sf_fix_protocol(path, domain)
  if path =~ %r{^//[^/]+}
    'http:' << path
  else
    "http://#{domain}#{'/' unless path =~ %r{^/[^/]+}}#{path}"
  end
end
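
The listing distinguishes three path shapes: protocol-relative (`//host/...`), root-relative (`/path`), and bare relative (`path`). The sketch below is a stand-alone copy under the hypothetical name `fix_protocol`; it uses string interpolation instead of `<<` so it also runs when string literals are frozen, which is an adaptation and not the library's code.

```ruby
# Hypothetical stand-alone copy of the branching shown above.
def fix_protocol(path, domain)
  if path =~ %r{^//[^/]+}
    "http:#{path}" # protocol-relative: the domain argument is not used
  else
    # Insert a slash only when the path does not already start with one.
    "http://#{domain}#{'/' unless path =~ %r{^/[^/]+}}#{path}"
  end
end

puts fix_protocol('//s.ytimg.com/yts/img/youtub_img.png', 'youtube.com')
# http://s.ytimg.com/yts/img/youtub_img.png
puts fix_protocol('/assets/image.jpg', 'adtangerine.com')
# http://adtangerine.com/assets/image.jpg
puts fix_protocol('assets/image.jpg', 'adtangerine.com')
# http://adtangerine.com/assets/image.jpg
```

Note that every branch emits `http://`, so an image hosted on an HTTPS-only site would come back with a plain HTTP URI.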

.sf_img_regex(exts = []) ⇒ `Object`

Build image regexes according to the required extensions.

Example:

>> sf_img_regex
=> /(^http{1}[s]?:\/\/([w]{3}\.)?.+\.(jpg|jpeg|png|gif)(\?.+)?$)/i
>> sf_img_regex([:jpg, :png])
=> /(^http{1}[s]?:\/\/([w]{3}\.)?.+\.(jpg|png|jpeg)(\?.+)?$)/i

Arguments:

exts: (Array)
  - Image extensions which will be included in the regex.

# File 'lib/scrapifier/support.rb', line 107

def sf_img_regex(exts = [])
  exts = [exts].flatten unless exts.is_a?(Array)
  if exts.nil? || exts.empty?
    exts = %w(jpg jpeg png gif)
  elsif exts.include?(:jpg) && !exts.include?(:jpeg)
    exts.push :jpeg
  end
  %r{(^http{1}[s]?://([w]{3}\.)?.+\.(#{exts.join('|')})(\?.+)?$)}i
end
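
To make the matching rules concrete, the sketch below builds the same pattern in a stand-alone method and tries it against a few URIs. `img_regex` is a hypothetical name, and the `compact`/`to_sym` normalization and the non-mutating `+=` are assumptions added for the sketch, not behavior of the library.

```ruby
# Hypothetical variation on the builder shown above.
def img_regex(exts = [])
  exts = [exts].flatten.compact.map(&:to_sym) # accept a symbol, string, array, or nil
  if exts.empty?
    exts = %i[jpg jpeg png gif]
  elsif exts.include?(:jpg) && !exts.include?(:jpeg)
    exts += [:jpeg] # builds a new array, so the caller's argument is untouched
  end
  %r{(^http{1}[s]?://([w]{3}\.)?.+\.(#{exts.join('|')})(\?.+)?$)}i
end

jpg_only = img_regex(:jpg)
p jpg_only.match?('http://source.com/image.jpg')           # => true
p jpg_only.match?('https://www.source.com/IMAGE.JPEG?v=2') # => true (case-insensitive, query string allowed)
p jpg_only.match?('http://source.com/image.gif')           # => false
p jpg_only.match?('/assets/image.jpg')                     # => false (no protocol)
```

The pattern is anchored at both ends, so relative paths have to be expanded to absolute URIs before they can match.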

.sf_regex(type, *args) ⇒ `Object`

Select regexes for URIs, protocols and image extensions.

Example:

>> sf_regex(:uri)
=> /\b((((ht|f)tp[s]?:\/\/).../i
>> sf_regex(:image, :jpg)
=> /(^http{1}[s]?:\/\/([w]{3}\.)?.+\.(jpg|jpeg)(\?.+)?$)/i

Arguments:

type: (Symbol or String)
  - Regex type: :uri, :protocol, :image
args: (*)
  - Image extensions, used only when type is :image; ignored otherwise.

# File 'lib/scrapifier/support.rb', line 79

def sf_regex(type, *args)
  type = type.to_sym unless type.is_a? Symbol
  type == :image && sf_img_regex(args.flatten) || sf_uri_regex[type]
end
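
The dispatch in the listing uses the `a && b || c` idiom together with a splat argument. The sketch below isolates that pattern with simplified, hypothetical stand-ins (`image_regex`, `URI_REGEXES`, `pick_regex`); the regexes are deliberately shorter than the library's.

```ruby
# Hypothetical, simplified stand-ins for the library's regex builders.
def image_regex(exts)
  exts = %w[jpg jpeg png gif] if exts.empty?
  /\.(#{exts.join('|')})$/i
end

URI_REGEXES = { protocol: /((ht|f)tp[s]?)/i }.freeze

def pick_regex(type, *args)
  type = type.to_sym unless type.is_a?(Symbol)
  # `a && b || c` acts like `a ? b : c` as long as b is truthy,
  # which holds here because a Regexp is never nil or false.
  type == :image && image_regex(args.flatten) || URI_REGEXES[type]
end

p pick_regex(:image, :jpg)          # args is [:jpg]
p pick_regex('image', [:jpg, :png]) # args is [[:jpg, :png]], flattened to [:jpg, :png]
p pick_regex(:protocol)             # looked up in the hash
p pick_regex(:unknown)              # => nil
```

An unknown type falls through to the hash lookup and returns `nil`, so a later `=~` against the result would raise no error but never match.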

.sf_uri_regex ⇒ `Object`

Build a hash with the URI regexes.

# File 'lib/scrapifier/support.rb', line 85

def sf_uri_regex
  { uri: %r{\b(
           (((ht|f)tp[s]?://)|([a-z0-9]+\.))+
           (?<!@)
           ([a-z0-9\_\-]+)
           (\.[a-z]+)+
           ([\?/\:][a-z0-9_=%&@\?\./\-\:\#\(\)]+)?
           /?
         )}ix,
    protocol: /((ht|f)tp[s]?)/i }
end
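
As a rough illustration of how the two patterns behave, the sketch below copies them into constants (`URI_RE` and `PROTOCOL_RE` are hypothetical names) and applies them to sample strings. The sample inputs are assumptions chosen for the example.

```ruby
# Patterns copied from the listing above into hypothetical constants.
URI_RE = %r{\b(
  (((ht|f)tp[s]?://)|([a-z0-9]+\.))+
  (?<!@)
  ([a-z0-9\_\-]+)
  (\.[a-z]+)+
  ([\?/\:][a-z0-9_=%&@\?\./\-\:\#\(\)]+)?
  /?
)}ix
PROTOCOL_RE = /((ht|f)tp[s]?)/i

# String#[] with a regex returns the first matching substring.
puts 'Visit http://adtangerine.com today'[URI_RE]
# http://adtangerine.com

p PROTOCOL_RE.match?('ftp://files.example.org') # => true
p PROTOCOL_RE.match?('/assets/image.jpg')       # => false
p PROTOCOL_RE.match?('/assets/http-logo.png')   # => true
```

The last call shows that the protocol pattern is unanchored: a relative path that merely contains `http` or `ftp` is treated as if it already had a protocol.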

.sf_xpaths ⇒ `Object`

Organize XPaths.

# File 'lib/scrapifier/support.rb', line 118

def sf_xpaths
  { title: XPath::TITLE,
    description: XPath::DESC,
    keywords: XPath::KEYWORDS,
    lang: XPath::LANG,
    encode: XPath::ENCODE,
    reply_to: XPath::REPLY_TO,
    author: XPath::AUTHOR,
    image: XPath::IMG }
end
