Class: Bcat::HeadParser

Inherits:
Object
Defined in:
lib/bcat/html.rb

Overview

Parses HTML until the first displayable body character and provides methods for accessing head and body contents.

Constant Summary

HEAD_TOKS =
[
  /\A(<!DOCTYPE.*?>)/m,
  /\A(<title.*?>.*?<\/title>)/mi,
  /\A(<script.*?>.*?<\/script>)/mi,
  /\A(<style.*?>.*?<\/style>)/mi,
  /\A(<(?:html|head|meta|link|base).*?>)/mi,
  /\A(<\/(?:html|head|meta|link|base|script|style|title)>)/mi,
  /\A(<!--(.*?)-->)/m
]
BODY_TOKS =
[
  /\A[^<]/,
  /\A<(?!html|head|meta|link|base|script|style|title).*?>/
]
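
As a rough illustration of how the two lists divide the start of a buffer into "still head", "body has started", and "not enough data yet", the sketch below uses local copies of the patterns. The names SKETCH_HEAD_TOKS, SKETCH_BODY_TOKS, and classify_start are hypothetical and are not part of Bcat.

```ruby
# Local copies of the patterns listed above, redefined under hypothetical
# names so the snippet runs without the bcat library loaded.
SKETCH_HEAD_TOKS = [
  /\A(<!DOCTYPE.*?>)/m,
  /\A(<title.*?>.*?<\/title>)/mi,
  /\A(<script.*?>.*?<\/script>)/mi,
  /\A(<style.*?>.*?<\/style>)/mi,
  /\A(<(?:html|head|meta|link|base).*?>)/mi,
  /\A(<\/(?:html|head|meta|link|base|script|style|title)>)/mi,
  /\A(<!--(.*?)-->)/m
].freeze

SKETCH_BODY_TOKS = [
  /\A[^<]/,
  /\A<(?!html|head|meta|link|base|script|style|title).*?>/
].freeze

# Hypothetical helper: report what the start of str looks like.
def classify_start(str)
  return :head if SKETCH_HEAD_TOKS.any? { |tok| str =~ tok }
  return :body if SKETCH_BODY_TOKS.any? { |tok| str =~ tok }
  :undecided
end

classify_start('<meta charset="utf-8">')  # => :head
classify_start('<p>hello</p>')            # => :body
classify_start('hello')                   # => :body
classify_start('<title>unfinished')       # => :undecided
```

The last call matches neither list: the title pattern needs the closing tag, and the negative lookahead in the second body pattern rejects anything that begins like a head element.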

Constructor Details

#initialize ⇒ HeadParser

Returns a new instance of HeadParser.



# File 'lib/bcat/html.rb', line 8

def initialize
  @buf = ''
  @head = []
  @body = nil
  @html = nil
end

Instance Attribute Details

#buf ⇒ Object

Returns the value of attribute buf.



# File 'lib/bcat/html.rb', line 6

def buf
  @buf
end

Instance Method Details

#body(inject = nil) ⇒ Object

The current body contents. The <body> tag is guaranteed to be present. If a <body> was included in the input, it’s preserved with original attributes; otherwise, a <body> tag is inserted. The inject argument can be used to insert a string as the immediate descendant of the <body> tag.



# File 'lib/bcat/html.rb', line 49

def body(inject=nil)
  if @body =~ /\A\s*(<body.*?>)(.*)/mi
    [$1, inject, $2].compact.join("\n")
  else
    ["<body>", inject, @body].compact.join("\n")
  end
end
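
To make the two branches concrete, here is a minimal sketch that takes the buffered body text as an argument instead of reading @body. The helper name body_with_inject is hypothetical; the regular expression and the join are copied from the listing above.

```ruby
# Hypothetical stand-alone version of the logic in #body.
def body_with_inject(body, inject = nil)
  if body =~ /\A\s*(<body.*?>)(.*)/mi
    # An existing <body ...> tag is kept, attributes included.
    [$1, inject, $2].compact.join("\n")
  else
    # No <body> tag in the input, so one is supplied.
    ['<body>', inject, body].compact.join("\n")
  end
end

body_with_inject('<body class="x"><p>hi</p>', '<div id="bar"></div>')
# => "<body class=\"x\">\n<div id=\"bar\"></div>\n<p>hi</p>"
body_with_inject('<p>hi</p>', '<div id="bar"></div>')
# => "<body>\n<div id=\"bar\"></div>\n<p>hi</p>"
body_with_inject('<p>hi</p>')
# => "<body>\n<p>hi</p>"
```

Because of compact, a nil inject leaves no empty line behind.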

#complete? ⇒ Boolean

Truthy once the first displayed character of the body has arrived.

Returns:

  • (Boolean)


# File 'lib/bcat/html.rb', line 27

def complete?
  !@body.nil?
end

#feed(data) ⇒ Object

Called to parse new data as it arrives.



# File 'lib/bcat/html.rb', line 16

def feed(data)
  if complete?
    @body << data
  else
    @buf << data
    parse(@buf)
  end
  complete?
end
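
The following sketch shows how chunked input might be fed and when the return value flips to true. SketchHeadParser is a hypothetical stand-in condensed from the listings on this page (it omits html? and the inject argument) so that the example runs without the bcat library installed; it is not the library class itself.

```ruby
# Hypothetical condensed copy of the parser, for illustration only.
class SketchHeadParser
  HEAD_TOKS = [
    /\A(<!DOCTYPE.*?>)/m,
    /\A(<title.*?>.*?<\/title>)/mi,
    /\A(<script.*?>.*?<\/script>)/mi,
    /\A(<style.*?>.*?<\/style>)/mi,
    /\A(<(?:html|head|meta|link|base).*?>)/mi,
    /\A(<\/(?:html|head|meta|link|base|script|style|title)>)/mi,
    /\A(<!--(.*?)-->)/m
  ].freeze
  BODY_TOKS = [
    /\A[^<]/,
    /\A<(?!html|head|meta|link|base|script|style|title).*?>/
  ].freeze

  def initialize
    @buf = +''   # mutable buffer for input not yet classified
    @head = []
    @body = nil
  end

  def complete?
    !@body.nil?
  end

  # Returns true once the body has started.
  def feed(data)
    if complete?
      @body << data
    else
      @buf << data
      parse(@buf)
    end
    complete?
  end

  def head
    @head.join.gsub(/<\/?(?:html|head|!DOCTYPE).*?>/mi, '')
  end

  # Simplified: returns the raw body text without adding a <body> tag.
  def body
    @body
  end

  private

  # Move leading whitespace and head tokens from buf into @head, then
  # check whether what remains starts the body.
  def parse(buf)
    until buf.empty?
      buf.sub!(/\A(\s+)/m) { @head << $1; '' }
      break unless HEAD_TOKS.any? { |tok| buf.sub!(tok) { @head << $1; '' } }
    end
    @body = buf if !buf.empty? && BODY_TOKS.any? { |tok| buf =~ tok }
  end
end

parser = SketchHeadParser.new
# A tag split across chunks stays buffered until the rest of it arrives.
parser.feed("<!DOCTYPE html>\n<html><head><title>Hi")  # => false
parser.feed("</title></head>\n")                      # => false
parser.feed('<body><p>Hello')                         # => true
parser.feed(' world</p>')                             # => true

parser.head  # => "\n<title>Hi</title>\n"
parser.body  # => "<body><p>Hello world</p>"
```

Once the body has started, later chunks are appended to it directly and are no longer scanned for head tokens.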

#head ⇒ Object

The head contents without any DOCTYPE, <html>, or <head> tags. This should consist of only <style>, <script>, <link>, <meta>, <base>, and <title> tags.



# File 'lib/bcat/html.rb', line 41

def head
  @head.join.gsub(/<\/?(?:html|head|!DOCTYPE).*?>/mi, '')
end
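
A small sketch of the clean-up step, assuming the collected head tokens are available as an array of strings. The helper name strip_wrappers is hypothetical; the substitution pattern is the one from the listing above.

```ruby
# Hypothetical stand-alone version of the clean-up performed by #head.
def strip_wrappers(head_tokens)
  head_tokens.join.gsub(/<\/?(?:html|head|!DOCTYPE).*?>/mi, '')
end

tokens = ['<!DOCTYPE html>', '<html lang="en">', '<head>',
          '<meta charset="utf-8">', '<title>Demo</title>', '</head>']
strip_wrappers(tokens)
# => "<meta charset=\"utf-8\"><title>Demo</title>"
```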

#html? ⇒ Boolean

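Determines whether the input is HTML. This is nil before any input is received, true if the first non-whitespace character is a ‘<’, and false if it is anything else. As #parse is written, a first chunk consisting only of whitespace also yields false, because /\A\s*[^<]/ can match a whitespace character, and the value is not re-evaluated afterwards.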

Returns:

  • (Boolean)


# File 'lib/bcat/html.rb', line 35

def html?
  @html
end
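
The value returned here is set by the detection step at the top of #parse. The sketch below reproduces that step in isolation; the helper name sniff_html is hypothetical, and the two patterns are copied from the #parse listing.

```ruby
# Hypothetical helper reproducing the detection step from #parse.
def sniff_html(buf)
  if buf =~ /\A\s*[<]/m
    true
  elsif buf =~ /\A\s*[^<]/m
    false
  end
  # Falls through to nil when neither pattern matches.
end

sniff_html('')            # => nil
sniff_html("\n  <html>")  # => true
sniff_html('plain text')  # => false
sniff_html('   ')         # => false (whitespace alone satisfies [^<])
```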

#parse(buf = @buf) ⇒ Object

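Parses buf into head and body parts. The basic approach is to consume leading whitespace and anything head-related (the DOCTYPE, comments, and the tags matched by HEAD_TOKS) until text or a body element is reached. Returns nil once the body has started; otherwise returns the unconsumed buffer.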



# File 'lib/bcat/html.rb', line 74

def parse(buf=@buf)
  if @html.nil?
    if buf =~ /\A\s*[<]/m
      @html = true
    elsif buf =~ /\A\s*[^<]/m
      @html = false
    end
  end

  while !buf.empty?
    buf.sub!(/\A(\s+)/m) { @head << $1 ; '' }
    matched =
      HEAD_TOKS.any? do |tok|
        buf.sub!(tok) do
          @head << $1
          ''
        end
      end
    break unless matched
  end


  if buf.empty?
    buf
  elsif BODY_TOKS.any? { |tok| buf =~ tok }
    @body = buf
    nil
  else
    buf
  end
end