Module: Scrapifier::Support
Overview
Support methods to get, check and organize data.
Constant Summary
Constants included from XPath
XPath::AUTHOR, XPath::DESC, XPath::ENCODE, XPath::IMG, XPath::KEYWORDS, XPath::LANG, XPath::REPLY_TO, XPath::TITLE
Class Method Summary
-
.sf_check_img_ext(images, allowed = []) ⇒ Object
Filter images, returning those with the allowed extensions.
-
.sf_domain(uri) ⇒ Object
Return the URI domain.
-
.sf_eval_uri(uri, exts = []) ⇒ Object
Evaluate the URI’s HTML document and get its metadata.
-
.sf_fix_imgs(imgs, uri, exts = []) ⇒ Object
Check and return only the valid image URIs.
-
.sf_fix_protocol(path, domain) ⇒ Object
Fix image URIs that don’t have a protocol/domain set.
-
.sf_img_regex(exts = []) ⇒ Object
Build image regexes according to the required extensions.
-
.sf_regex(type, *args) ⇒ Object
Select regexes for URIs, protocols and image extensions.
-
.sf_uri_regex ⇒ Object
Build a hash with the URI regexes.
-
.sf_xpaths ⇒ Object
Organize XPaths.
Class Method Details
.sf_check_img_ext(images, allowed = []) ⇒ Object
Filter images, returning those with the allowed extensions.
Example:
>> sf_check_img_ext('http://source.com/image.gif', :jpg)
=> []
>> sf_check_img_ext(
['http://source.com/image.gif','http://source.com/image.jpg'],
[:jpg, :png]
)
=> ['http://source.com/image.jpg']
Arguments:
images: (String or Array)
- Images which will be checked.
allowed: (String, Symbol or Array)
- Allowed image extensions.
# File 'lib/scrapifier/support.rb', line 57
def sf_check_img_ext(images, allowed = [])
  allowed ||= []
  if images.is_a?(String)
    images = images.split
  elsif !images.is_a?(Array)
    images = []
  end
  images.select { |i| i =~ sf_regex(:image, allowed) }
end
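The filtering idea can also be sketched without the module, using only the standard library. This is an illustration under stated assumptions, not Scrapifier's code: `url_ext` and `filter_by_ext` are hypothetical helpers, images are assumed to be plain URL strings, and the extension is read from the URL path rather than matched with a regex. Unlike the listing above, `:jpg` here does not implicitly allow `jpeg`.

```ruby
require 'uri'

# Hypothetical helper: lowercase extension of a URL's path,
# or '' when the string cannot be parsed as a URI.
def url_ext(url)
  File.extname(URI.parse(url.to_s).path.to_s).delete('.').downcase
rescue URI::InvalidURIError
  ''
end

# Hypothetical helper: keep only the URLs whose extension is allowed.
# Accepts a single String or an Array, like sf_check_img_ext.
def filter_by_ext(images, allowed = %w[jpg jpeg png gif])
  allowed = Array(allowed).map { |e| e.to_s.downcase }
  Array(images).select { |url| allowed.include?(url_ext(url)) }
end

p filter_by_ext('http://source.com/image.gif', :jpg)
# => []
p filter_by_ext(
  ['http://source.com/image.gif', 'http://source.com/image.jpg'],
  [:jpg, :png]
)
# => ["http://source.com/image.jpg"]
```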
.sf_domain(uri) ⇒ Object
Return the URI domain.
Example:
>> sf_domain('http://adtangerine.com')
=> 'adtangerine.com'
Arguments:
uri: (String)
- URI.
# File 'lib/scrapifier/support.rb', line 186
def sf_domain(uri)
  uri = uri.to_s.split('/')
  uri.empty? ? '' : uri[2]
end
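For comparison, a minimal sketch of host extraction with Ruby's standard `URI` library. `host_of` is a hypothetical name, not part of Scrapifier. The difference from the split-based listing is that a port or path never ends up in the result, and an unparseable string yields an empty string.

```ruby
require 'uri'

# Hypothetical helper: return the host part of a URI,
# or '' when there is none or the string is not a valid URI.
def host_of(uri)
  URI.parse(uri.to_s).host.to_s
rescue URI::InvalidURIError
  ''
end

p host_of('http://adtangerine.com')           # => "adtangerine.com"
p host_of('http://adtangerine.com:8080/a/b')  # => "adtangerine.com"
p host_of('')                                 # => ""
```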
.sf_eval_uri(uri, exts = []) ⇒ Object
Evaluate the URI’s HTML document and get its metadata.
Example:
>> sf_eval_uri('http://adtangerine.com', [:png])
=> {
:title => "AdTangerine | Advertising Platform for Social Media",
:description => "AdTangerine is an advertising platform that...",
:images => [
"http://adtangerine.com/assets/logo_adt_og.png",
"http://adtangerine.com/assets/logo_adt_og.png"
],
:uri => "http://adtangerine.com"
}
Arguments:
uri: (String)
- URI.
exts: (Array)
- Allowed image extensions.
# File 'lib/scrapifier/support.rb', line 27
def sf_eval_uri(uri, exts = [])
  doc = Nokogiri::HTML(open(uri).read)
  doc.encoding, meta = 'utf-8', { uri: uri }

  [:title, :description, :keywords, :lang, :encode, :reply_to, :author].each do |k|
    node = doc.xpath(sf_xpaths[k])[0]
    meta[k] = node.nil? ? '-' : node.text
  end

  meta[:images] = sf_fix_imgs(doc.xpath(sf_xpaths[:image]), uri, exts)
  meta
rescue SocketError
  {}
end
.sf_fix_imgs(imgs, uri, exts = []) ⇒ Object
Check and return only the valid image URIs.
Example:
>> sf_fix_imgs(
['http://adtangerine.com/image.png', '/assets/image.jpg'],
'http://adtangerine.com',
:jpg
)
=> ['http://adtangerine.com/assets/image.jpg']
Arguments:
imgs: (Array)
- Image URIs extracted from the HTML document.
uri: (String)
- Used as the base for URIs that don't have a protocol/domain set.
exts: (Symbol or Array)
- Allowed image extensions.
# File 'lib/scrapifier/support.rb', line 145
def sf_fix_imgs(imgs, uri, exts = [])
  sf_check_img_ext(imgs.map do |img|
    img = img.to_s
    unless img =~ sf_regex(:protocol)
      img = sf_fix_protocol(img, sf_domain(uri))
    end
    img if img =~ sf_regex(:image)
  end.compact, exts)
end
.sf_fix_protocol(path, domain) ⇒ Object
Fix image URIs that don’t have a protocol/domain set.
Example:
>> sf_fix_protocol('/assets/image.jpg', 'adtangerine.com')
=> 'http://adtangerine.com/assets/image.jpg'
>> sf_fix_protocol(
'//s.ytimg.com/yts/img/youtub_img.png',
'youtube.com'
)
=> 'http://s.ytimg.com/yts/img/youtub_img.png'
Arguments:
path: (String)
- URI path having no protocol/domain set.
domain: (String)
- Domain, without protocol, that will be prepended to the path.
# File 'lib/scrapifier/support.rb', line 170
def sf_fix_protocol(path, domain)
  if path =~ %r{^//[^/]+}
    'http:' << path
  else
    "http://#{domain}#{'/' unless path =~ %r{^/[^/]+}}#{path}"
  end
end
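The listing always emits `http:`. Where the scheme of the page should carry over to its images, one possible alternative is the standard library's `URI.join`, sketched below. `absolutize` is a hypothetical helper, not a Scrapifier method, and it takes the full page URI rather than a bare domain.

```ruby
require 'uri'

# Hypothetical helper: resolve an image path against the URI of the page
# it was found on. Protocol-relative paths ("//host/...") inherit the
# page's scheme; other relative paths inherit scheme and host.
def absolutize(path, page_uri)
  URI.join(page_uri, path).to_s
end

p absolutize('/assets/image.jpg', 'http://adtangerine.com')
# => "http://adtangerine.com/assets/image.jpg"
p absolutize('//s.ytimg.com/yts/img/youtub_img.png', 'https://youtube.com')
# => "https://s.ytimg.com/yts/img/youtub_img.png"
p absolutize('logo.png', 'http://adtangerine.com/assets/')
# => "http://adtangerine.com/assets/logo.png"
```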
.sf_img_regex(exts = []) ⇒ Object
Build image regexes according to the required extensions.
Example:
>> sf_img_regex
=> /(^http{1}[s]?:\/\/([w]{3}\.)?.+\.(jpg|jpeg|png|gif)(\?.+)?$)/i
>> sf_img_regex([:jpg, :png])
=> /(^http{1}[s]?:\/\/([w]{3}\.)?.+\.(jpg|png|jpeg)(\?.+)?$)/i
Arguments:
exts: (Array)
- Image extensions which will be included in the regex.
# File 'lib/scrapifier/support.rb', line 107
def sf_img_regex(exts = [])
  exts = [exts].flatten unless exts.is_a?(Array)
  if exts.nil? || exts.empty?
    exts = %w(jpg jpeg png gif)
  elsif exts.include?(:jpg) && !exts.include?(:jpeg)
    exts.push :jpeg
  end
  %r{(^http{1}[s]?://([w]{3}\.)?.+\.(#{exts.join('|')})(\?.+)?$)}i
end
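Note that `exts.push :jpeg` in the listing modifies the array the caller passed in. A minimal sketch of the same regex-building idea that leaves its argument untouched follows. `DEFAULT_EXTS` and `image_regex` are hypothetical names, and the pattern is a simplified variant, not the library's exact regex.

```ruby
# Hypothetical default list, mirroring the extensions in the listing.
DEFAULT_EXTS = %w[jpg jpeg png gif].freeze

# Hypothetical helper: build an image-URL regex from a list of extensions
# without mutating the caller's array.
def image_regex(exts = [])
  list = Array(exts).map(&:to_s)            # new array; the input stays as-is
  list = DEFAULT_EXTS.dup if list.empty?
  list |= ['jpeg'] if list.include?('jpg')  # treat jpg and jpeg as one family
  alternatives = list.map { |e| Regexp.escape(e) }.join('|')
  %r{\Ahttps?://.+\.(?:#{alternatives})(?:\?.+)?\z}i
end

requested = [:jpg, :png]
re = image_regex(requested)

p re.match?('http://source.com/photo.JPEG?size=2')  # => true
p re.match?('http://source.com/anim.gif')           # => false
p requested                                         # => [:jpg, :png]
```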
.sf_regex(type, *args) ⇒ Object
Select regexes for URIs, protocols and image extensions.
Example:
>> sf_regex(:uri)
=> /\b((((ht|f)tp[s]?:\/\/).../ix
>> sf_regex(:image, :jpg)
=> /(^http{1}[s]?:\/\/([w]{3}\.)?.+\.(jpg|jpeg)(\?.+)?$)/i
Arguments:
type: (Symbol or String)
- Regex type: :uri, :protocol, :image
args: (*)
- Extra arguments. For :image, the allowed image extensions; ignored for the other types.
# File 'lib/scrapifier/support.rb', line 79
def sf_regex(type, *args)
  type = type.to_sym unless type.is_a? Symbol
  type == :image && sf_img_regex(args.flatten) || sf_uri_regex[type]
end
.sf_uri_regex ⇒ Object
Build a hash with the URI regexes.
# File 'lib/scrapifier/support.rb', line 85
def sf_uri_regex
  { uri: %r{\b(
    (((ht|f)tp[s]?://)|([a-z0-9]+\.))+
    (?<!@)
    ([a-z0-9\_\-]+)
    (\.[a-z]+)+
    ([\?/\:][a-z0-9_=%&@\?\./\-\:\#\(\)]+)?
    /?
  )}ix,
    protocol: /((ht|f)tp[s]?)/i }
end
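As an illustration of how the `:uri` pattern behaves, the sketch below copies it into a standalone constant and uses it to pull URIs out of free text. `URI_PATTERN` and `extract_uris` are hypothetical names introduced for this example only; how Scrapifier itself applies the regex is not shown in this section.

```ruby
# Pattern copied from the sf_uri_regex listing. URI_PATTERN is a
# hypothetical constant name used only in this sketch.
URI_PATTERN = %r{\b(
  (((ht|f)tp[s]?://)|([a-z0-9]+\.))+
  (?<!@)
  ([a-z0-9\_\-]+)
  (\.[a-z]+)+
  ([\?/\:][a-z0-9_=%&@\?\./\-\:\#\(\)]+)?
  /?
)}ix

# Hypothetical helper: return every URI-like substring found in the text.
# String#scan yields one array of captures per match; the first capture is
# the outermost group, which spans the whole URI.
def extract_uris(text)
  text.scan(URI_PATTERN).map(&:first)
end

p extract_uris('Docs at http://adtangerine.com/about and www.example.com today')
# => ["http://adtangerine.com/about", "www.example.com"]
```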
.sf_xpaths ⇒ Object
Organize XPaths.
# File 'lib/scrapifier/support.rb', line 118
def sf_xpaths
  { title:       XPath::TITLE,
    description: XPath::DESC,
    keywords:    XPath::KEYWORDS,
    lang:        XPath::LANG,
    encode:      XPath::ENCODE,
    reply_to:    XPath::REPLY_TO,
    author:      XPath::AUTHOR,
    image:       XPath::IMG }
end