Class: Bcat::HeadParser

Inherits:
Object
Defined in:
lib/bcat/html.rb

Overview

Parses HTML until the first displayable body character and provides methods for accessing head and body contents.

Constant Summary

HEAD_TOKS =
[
  /\A(<!DOCTYPE.*?>)/m,
  /\A(<title.*?>.*?<\/title>)/mi,
  /\A(<script.*?>.*?<\/script>)/mi,
  /\A(<style.*?>.*?<\/style>)/mi,
  /\A(<(?:html|head|meta|link|base).*?>)/mi,
  /\A(<\/(?:html|head|meta|link|base|script|style|title)>)/mi,
  /\A(<!--(.*?)-->)/m
]
BODY_TOKS =
[
  /\A[^<]/,
  /\A<(?!html|head|meta|link|base|script|style|title).*?>/
]
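
To see how the two tables divide the input, here is a minimal sketch that copies the constants from the listing above and checks which table matches the start of a string. The `classify` helper is hypothetical and not part of Bcat; it tries HEAD_TOKS first, mirroring the order used by #parse.

```ruby
# Copied from the constant listing so the sketch runs on its own.
HEAD_TOKS = [
  /\A(<!DOCTYPE.*?>)/m,
  /\A(<title.*?>.*?<\/title>)/mi,
  /\A(<script.*?>.*?<\/script>)/mi,
  /\A(<style.*?>.*?<\/style>)/mi,
  /\A(<(?:html|head|meta|link|base).*?>)/mi,
  /\A(<\/(?:html|head|meta|link|base|script|style|title)>)/mi,
  /\A(<!--(.*?)-->)/m
]
BODY_TOKS = [
  /\A[^<]/,
  /\A<(?!html|head|meta|link|base|script|style|title).*?>/
]

# Hypothetical helper (not part of Bcat): reports which table, if any,
# matches the start of +str+.
def classify(str)
  return :head if HEAD_TOKS.any? { |tok| str =~ tok }
  return :body if BODY_TOKS.any? { |tok| str =~ tok }
  :undecided
end

p classify('<meta charset="utf-8">') # :head
p classify('<p>Hello')               # :body
p classify('Hello')                  # :body
p classify('<title>Hi')              # :undecided, the closing tag has not arrived
```

The `:undecided` case is what lets a streaming parser wait for more data instead of misclassifying a tag that was split across chunks.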

Constructor Details

#initialize ⇒ HeadParser

Returns a new instance of HeadParser.

# File 'lib/bcat/html.rb', line 8

def initialize
  @buf = ''
  @head = []
  @body = nil
  @html = nil
end

Instance Attribute Details

#buf ⇒ Object

Returns the value of attribute buf.


# File 'lib/bcat/html.rb', line 6

def buf
  @buf
end

Instance Method Details

#body(inject = nil) ⇒ Object

The current body contents. The <body> tag is guaranteed to be present. If a <body> was included in the input, it’s preserved with original attributes; otherwise, a <body> tag is inserted. The inject argument can be used to insert a string as the immediate descendant of the <body> tag.

# File 'lib/bcat/html.rb', line 49

def body(inject=nil)
  if @body =~ /\A\s*(<body.*?>)(.*)/mi
    [$1, inject, $2].compact.join("\n")
  else
    ["<body>", inject, @body].compact.join("\n")
  end
end
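
The two branches can be exercised in isolation. The sketch below restates the logic as a stand-alone method; `body_with_inject` is a hypothetical name, and its `raw` parameter stands in for the `@body` instance variable.

```ruby
# Hypothetical stand-alone version of the #body logic shown above.
def body_with_inject(raw, inject = nil)
  if raw =~ /\A\s*(<body.*?>)(.*)/mi
    # An existing <body> tag is kept, attributes included.
    [$1, inject, $2].compact.join("\n")
  else
    # No <body> tag in the input, so one is supplied.
    ["<body>", inject, raw].compact.join("\n")
  end
end

puts body_with_inject(%q{<body class="x"><p>hi</p>}, "<script>go()</script>")
puts body_with_inject("<p>hi</p>")
```

In both cases the injected string lands on the line directly after the opening <body> tag, and `compact` drops it cleanly when no inject argument is given.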

#complete? ⇒ Boolean

True once the first displayable character of the body has arrived.

Returns:

  • (Boolean)

# File 'lib/bcat/html.rb', line 27

def complete?
  !@body.nil?
end

#feed(data) ⇒ Object

Called to parse new data as it arrives. Returns true once parsing is complete (see #complete?).

# File 'lib/bcat/html.rb', line 16

def feed(data)
  if complete?
    @body << data
  else
    @buf << data
    parse(@buf)
  end
  complete?
end

#head ⇒ Object

The head contents without any DOCTYPE, <html>, or <head> tags. This should consist of only <style>, <script>, <link>, <meta>, <base>, and <title> tags.

# File 'lib/bcat/html.rb', line 41

def head
  @head.join.gsub(/<\/?(?:html|head|!DOCTYPE).*?>/mi, '')
end
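
A minimal sketch of the stripping step, assuming the collected head tokens look like the array below. `strip_wrappers` is a hypothetical helper that applies the same `gsub` to a plain array instead of `@head`.

```ruby
# Hypothetical helper mirroring the gsub in #head.
def strip_wrappers(tokens)
  tokens.join.gsub(/<\/?(?:html|head|!DOCTYPE).*?>/mi, '')
end

tokens = [
  "<!DOCTYPE html>", "\n",
  %q{<html lang="en">}, "<head>",
  "<title>T</title>", %q{<meta charset="utf-8">},
  "</head>"
]

# The DOCTYPE, <html>, <head>, and </head> tokens are removed; whitespace
# and the remaining head tags pass through untouched.
p strip_wrappers(tokens)
```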

#html? ⇒ Boolean

Determine if the input is HTML. This is nil before the first non-whitespace character is received, true if the first non-whitespace character is a ‘<’, and false if the first non-whitespace character is something other than ‘<’.

Returns:

  • (Boolean)

# File 'lib/bcat/html.rb', line 35

def html?
  @html
end
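
A minimal sketch of the three-state contract described above. `detect_html` is a hypothetical stand-alone helper; its second pattern excludes whitespace explicitly (`[^<\s]`), an assumption of this sketch that keeps whitespace-only input undecided.

```ruby
# Hypothetical helper: returns true, false, or nil (still undecided).
def detect_html(buf)
  if buf =~ /\A\s*</
    true
  elsif buf =~ /\A\s*[^<\s]/
    false
  end
end

p detect_html("")        # nil
p detect_html("  \n")    # nil
p detect_html("  <p>")   # true
p detect_html("plain")   # false
```

Callers that need a strict boolean should treat nil as "not known yet" rather than as false.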

#parse(buf = @buf) ⇒ Object

Parses buf into head and body parts. The basic approach is to consume anything that may belong to the head until text or a body element is reached.

# File 'lib/bcat/html.rb', line 74

def parse(buf=@buf)
  if @html.nil?
    if buf =~ /\A\s*[<]/m
      @html = true
    # [^<\s] rather than [^<], so whitespace-only input leaves @html nil
    elsif buf =~ /\A\s*[^<\s]/m
      @html = false
    end
  end

  while !buf.empty?
    buf.sub!(/\A(\s+)/m) { @head << $1 ; '' }
    matched =
      HEAD_TOKS.any? do |tok|
        buf.sub!(tok) do
          @head << $1
          ''
        end
      end
    break unless matched
  end


  if buf.empty?
    buf
  elsif BODY_TOKS.any? { |tok| buf =~ tok }
    @body = buf
    nil
  else
    buf
  end
end
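
The streaming behavior is easiest to see with input split at awkward points. The sketch below is a condensed, hypothetical stand-in for the class (`SketchParser`, `head_tokens`, and `body_buffer` are names invented here); it keeps only the feed, parse, and complete? flow from the listings on this page and copies the two token tables.

```ruby
HEAD_TOKS = [
  /\A(<!DOCTYPE.*?>)/m,
  /\A(<title.*?>.*?<\/title>)/mi,
  /\A(<script.*?>.*?<\/script>)/mi,
  /\A(<style.*?>.*?<\/style>)/mi,
  /\A(<(?:html|head|meta|link|base).*?>)/mi,
  /\A(<\/(?:html|head|meta|link|base|script|style|title)>)/mi,
  /\A(<!--(.*?)-->)/m
]
BODY_TOKS = [
  /\A[^<]/,
  /\A<(?!html|head|meta|link|base|script|style|title).*?>/
]

# Condensed, hypothetical stand-in for Bcat::HeadParser.
class SketchParser
  attr_reader :head_tokens, :body_buffer

  def initialize
    @buf = String.new
    @head_tokens = []
    @body_buffer = nil
  end

  def complete?
    !@body_buffer.nil?
  end

  # Buffers data until the body starts; afterwards appends straight to it.
  def feed(data)
    if complete?
      @body_buffer << data
    else
      @buf << data
      parse(@buf)
    end
    complete?
  end

  private

  def parse(buf)
    until buf.empty?
      # Leading whitespace is kept as head content.
      buf.sub!(/\A(\s+)/m) { @head_tokens << $1; '' }
      # Move one complete head token off the front, or stop.
      break unless HEAD_TOKS.any? { |tok| buf.sub!(tok) { @head_tokens << $1; '' } }
    end
    # An incomplete tag matches neither table, so the parser keeps waiting.
    @body_buffer = buf if BODY_TOKS.any? { |tok| buf =~ tok }
  end
end

parser = SketchParser.new
chunks = [
  "<!DOCTYPE html>\n<html><head><tit",
  "le>Demo</title></head>\n<bo",
  "dy class='x'><p>Hi",
  "</p></body></html>"
]
# complete? stays false for the first two chunks and flips on the third,
# when the full <body ...> tag is finally in the buffer.
chunks.each { |chunk| puts "#{chunk.inspect} -> #{parser.feed(chunk)}" }
```

Because unmatched input simply stays in the buffer, a tag split across two chunks is retried in full on the next call to feed.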