Module: Wovnrb::Helpers::NokogumboHelper
- Defined in:
- lib/wovnrb/helpers/nokogumbo_helper.rb
Class Method Summary
- .parse_fragment(html_string, encoding = 'UTF-8') ⇒ Object
- .parse_html(html_string, encoding = 'UTF-8') ⇒ Object
Class Method Details
.parse_fragment(html_string, encoding = 'UTF-8') ⇒ Object
www.rubydoc.info/gems/nokogumbo/Nokogiri/HTML5#fragment-class_method
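Nokogumbo does not properly support parsing fragments, and the current implementation of Nokogiri::HTML5.fragment does not handle encoding. This method therefore parses the input as a full document and sets the encoding explicitly (the second line of the method body below).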
# File 'lib/wovnrb/helpers/nokogumbo_helper.rb', line 19

def parse_fragment(html_string, encoding = 'UTF-8')
  doc = Nokogiri::HTML5.parse(html_string)
  doc.encoding = encoding
  fragment = Nokogiri::HTML::DocumentFragment.new(doc)

  if doc.children.length != 1 || doc.children.first.name != 'html'
    # no HTML? Return document as is
    fragment = doc
  else
    # examine children of HTML element
    children = doc.children.first.children

    # head is always first. If present, take children but otherwise
    # ignore the head element
    if children.length > 0 && children.first.name == 'head'
      fragment << children.shift.children
    end

    # body may be next, or last. If found, take children but otherwise
    # ignore the body element. Also take any remaining elements, taking
    # care to preserve order.
    if children.length > 0 && children.first.name == 'body'
      fragment << children.shift.children
      fragment << children
    elsif children.length > 0 && children.last.name == 'body'
      body = children.pop
      fragment << children
      fragment << body.children
    else
      fragment << children
    end
  end

  # return result
  fragment
end
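The head/body unwrapping that parse_fragment is meant to perform can be modelled without Nokogiri. The following is only a sketch: `Node` and `unwrap_html_children` are hypothetical names introduced for illustration, and plain arrays stand in for Nokogiri node sets. It shows the intended ordering rule, assuming the checks compare the name of the first or last child of the html element.

```ruby
# Hypothetical stand-in for a parsed node; this is not a Nokogiri class.
Node = Struct.new(:name, :children)

# Given the children of the <html> element, return the nodes that would
# end up in the fragment, in document order.
def unwrap_html_children(html_children)
  children = html_children.dup
  result = []

  # head, when present, is always first: keep its children, drop the wrapper.
  result.concat(children.shift.children) if children.first&.name == 'head'

  if children.first&.name == 'body'
    # body comes next: take its children, then whatever follows it.
    result.concat(children.shift.children)
    result.concat(children)
  elsif children.last&.name == 'body'
    # body comes last: take the preceding nodes first, then body's children.
    body = children.pop
    result.concat(children)
    result.concat(body.children)
  else
    # no body wrapper at all: keep the remaining nodes unchanged.
    result.concat(children)
  end

  result
end

title = Node.new('title', [])
para  = Node.new('p', [])
span  = Node.new('span', [])
html_children = [Node.new('head', [title]), Node.new('body', [para, span])]

names = unwrap_html_children(html_children).map(&:name)
```

Here `names` lists the unwrapped nodes with the head and body wrappers dropped. The real method appends Nokogiri node sets to a DocumentFragment instead of building an array.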
.parse_html(html_string, encoding = 'UTF-8') ⇒ Object
# File 'lib/wovnrb/helpers/nokogumbo_helper.rb', line 4

def parse_html(html_string, encoding = 'UTF-8')
  if /<html/i.match?(html_string.strip[0..999])
    d = Nokogiri::HTML5(html_string)
    d.encoding = encoding
    d
  else
    parse_fragment(html_string, encoding)
  end
end
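The dispatch in parse_html depends on a simple textual check, which can be tried in isolation. The sketch below copies that predicate into a hypothetical helper, `full_document?` (the name is an assumption and not part of the module), to show which inputs would be treated as full documents and which would go to parse_fragment.

```ruby
# Hypothetical helper; the regular expression and the 1000-character window
# are the ones used by parse_html.
def full_document?(html_string)
  # Surrounding whitespace is stripped first, then only characters 0..999
  # are searched, case-insensitively, for an opening "<html".
  /<html/i.match?(html_string.strip[0..999])
end

document  = "<!DOCTYPE html>\n<HTML><body>hi</body></html>"
fragment  = '<p>hello</p>'
padded    = (' ' * 2000) + '<html><body></body></html>'
late_html = '<p>' + ('x' * 1000) + '</p><html>'

results = [document, fragment, padded, late_html].map { |s| full_document?(s) }
```

Because the check is purely textual, a fragment that merely mentions `<html` within its first 1000 characters (inside a comment, for instance) would also be routed to the full-document parser, while an `<html` tag that first appears later than that is missed.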