Module: Wovnrb::Helpers::NokogumboHelper

Defined in:
lib/wovnrb/helpers/nokogumbo_helper.rb

Class Method Summary

Class Method Details

.parse_fragment(html_string, encoding = 'UTF-8') ⇒ Object

www.rubydoc.info/gems/nokogumbo/Nokogiri/HTML5#fragment-class_method

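Nokogumbo does not properly support parsing fragments, and the current implementation of Nokogiri::HTML5.fragment does not handle encoding. This helper therefore parses the string as a full document, sets the document encoding explicitly (doc.encoding = encoding, the second line of the method body below), and then unwraps the contents of the head and body elements into a fragment.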



# File 'lib/wovnrb/helpers/nokogumbo_helper.rb', line 19

def parse_fragment(html_string, encoding = 'UTF-8')
  doc = Nokogiri::HTML5.parse(html_string)
  doc.encoding = encoding
  fragment = Nokogiri::HTML::DocumentFragment.new(doc)

  if doc.children.length != 1 || doc.children.first.name != 'html'
    # no HTML?  Return document as is
    fragment = doc
  else
    # examine children of HTML element
    children = doc.children.first.children

    # head is always first.  If present, take children but otherwise
    # ignore the head element
    if children.length > 0 && children.first.name == 'head'
      fragment << children.shift.children
    end

    # body may be next, or last.  If found, take children but otherwise
    # ignore the body element.  Also take any remaining elements, taking
    # care to preserve order.
    if children.length > 0 && children.first.name == 'body'
      fragment << children.shift.children
      fragment << children
    elsif children.length > 0 && children.last.name == 'body'
      body = children.pop
      fragment << children
      fragment << body.children
    else
      fragment << children
    end
  end

  # return result
  fragment
end
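
The unwrapping order used by parse_fragment can be hard to follow from the conditionals alone. The sketch below replays the same steps on plain Ruby hashes instead of Nokogiri nodes, so it runs without any gems. The method name unwrap_html_children and the hash layout (:name, :children) are assumptions made for illustration; they are not part of wovnrb or Nokogiri.

```ruby
# Hypothetical stand-in for the head/body unwrapping in parse_fragment.
# Nodes are plain hashes here ({ name:, children: }), not Nokogiri nodes.
def unwrap_html_children(html)
  children = html[:children].dup
  result = []

  # A leading <head> contributes its children; the element itself is dropped.
  if children.any? && children.first[:name] == 'head'
    result.concat(children.shift[:children])
  end

  if children.any? && children.first[:name] == 'body'
    # <body> comes next: take its children, then whatever follows it.
    result.concat(children.shift[:children])
    result.concat(children)
  elsif children.any? && children.last[:name] == 'body'
    # <body> comes last: keep the preceding nodes first to preserve order.
    body = children.pop
    result.concat(children)
    result.concat(body[:children])
  else
    result.concat(children)
  end

  result
end

html = {
  name: 'html',
  children: [
    { name: 'head', children: [{ name: 'title', children: [] }] },
    { name: 'body', children: [{ name: 'p', children: [] }, { name: 'div', children: [] }] }
  ]
}
names = unwrap_html_children(html).map { |node| node[:name] }
# names is ["title", "p", "div"]
```

The comparisons use == on the first or last remaining child. Writing = there would assign a new name instead of testing it, and the condition would always be truthy.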

.parse_html(html_string, encoding = 'UTF-8') ⇒ Object



# File 'lib/wovnrb/helpers/nokogumbo_helper.rb', line 4

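def parse_html(html_string, encoding = 'UTF-8')
  # Treat the input as a full document if "<html" appears (case-insensitive)
  # within the first 1000 characters; otherwise parse it as a fragment.
  if /<html/i.match?(html_string.strip[0..999])
    d = Nokogiri::HTML5(html_string)
    d.encoding = encoding
    d
  else
    parse_fragment(html_string, encoding)
  end
end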
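
To make the dispatch rule in parse_html concrete, here is a minimal sketch of the same check in isolation. It needs no gems. The method name full_document? is hypothetical and exists only for this example; the regular expression and the 1000-character window are taken from the source above.

```ruby
# Hypothetical helper mirroring the condition in parse_html; not part of wovnrb.
def full_document?(html_string)
  # Only the first 1000 characters (after stripping surrounding whitespace)
  # are searched for an opening "<html", case-insensitively.
  /<html/i.match?(html_string.strip[0..999])
end

doc_result      = full_document?("<!DOCTYPE html>\n<html><body>Hi</body></html>")
fragment_result = full_document?('<p>Hello</p>')
# 1200 characters of comments push the <html tag outside the inspected window.
late_result     = full_document?(('<!-- pad -->' * 100) + '<html></html>')
# doc_result is true, fragment_result is false, late_result is false
```

The last case shows one consequence of this heuristic: a document whose html tag starts after the first 1000 characters would be routed to parse_fragment rather than parsed as a full document.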