Module: HtmlHelper

Included in:
MyTest, SuperTest
Defined in:
lib/common/http/html_helper.rb

Overview

Parses plain text, HTML, and JSON.

  • Prints readable logs

Instance Method Details

#find_attr(xpath, attr) ⇒ Object

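Purpose:

Extracts the value of an attribute from an HTML element located by an XPath expression or a CSS selector. If several elements match, the first one is used.

Parameters:

  • xpath: the locator; either an XPath expression or a CSS selector

  • attr: the attribute name

  • return: a String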

Example:

Example #1:

get "http://www.baidu.com/"
# The HTML fragment looks like this:
# <p id=km><a href=http://hi.baidu.com>空间</a>&nbsp;&nbsp;<a href=http://www.hao123.com>hao123</a>&nbsp;|&nbsp;<a href=/more/>更多<span style="font-family:宋体">>></span></a></p> 

puts find_attr "p#km a:first", "href"
# => http://hi.baidu.com

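Raises:

  • (RuntimeError) if no HTTP request has been made yet

  • (HtmlError) if no element matches xpath, or the first match has no attribute attr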

# File 'lib/common/http/html_helper.rb', line 199

def find_attr(xpath, attr)
    raise "no any http request before!" if @response == nil
    @hpricot ||= Hpricot(@response.body)
    elems = @hpricot.search(xpath)
    if block_given? 
        index = -1
        elems = elems.select do |elem|
            index = index + 1
            yield elem,index
        end
    end
    raise HtmlError, "find_attr xpath[#{xpath}] is not exist!" if elems.empty?
    elem = elems.first
    raise HtmlError, "find_attr xpath[#{xpath}] ok, but elem[#{elem}] has no attr[#{attr}]!" unless elem.has_attribute? attr
    return elem.get_attribute(attr)
end
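
The listing above also accepts an optional block, which receives each matched element and its zero-based index; only elements for which the block returns a truthy value are kept, and the first survivor is used. The sketch below isolates that filtering step so it runs without Hpricot. The `LINKS` data and the `first_attr` helper are hypothetical stand-ins (plain Hashes instead of Hpricot elements, `ArgumentError` instead of `HtmlError`), not part of HtmlHelper.

```ruby
# Stand-in data: plain Hashes replace the elements matched by "p#km a".
LINKS = [
  { "href" => "http://hi.baidu.com" },
  { "href" => "http://www.hao123.com" },
  { "href" => "/more/" }
].freeze

# Hypothetical helper that mirrors the block-filter step of find_attr.
def first_attr(elems, attr)
  if block_given?
    index = -1
    elems = elems.select do |elem|
      index += 1
      yield elem, index          # keep the element when the block is truthy
    end
  end
  raise ArgumentError, "no element matched" if elems.empty?
  elems.first.fetch(attr)        # only the first remaining element is used
end

# Without a block the first match wins; with a block we can pick the second.
SECOND_HREF = first_attr(LINKS, "href") { |_elem, index| index == 1 }
puts SECOND_HREF
# => http://www.hao123.com
```

With the real helper the equivalent call would presumably be `find_attr("p#km a", "href") { |elem, index| index == 1 }`.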

#find_children(xpath, excludes = []) ⇒ Object

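Returns the names of the child elements of the first element matched by an XPath expression or a CSS selector, leaving out any excluded names.

Parameters:

  • xpath: the parent node; either an XPath expression or a CSS selector

  • excludes: an Array of child node names to exclude (defaults to an empty Array)

  • return: the names of the child nodes that are not listed in excludes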

Example:

get "http://hi.baidu.com/sys/search?word=我&from=wise"
p find_children("entry > list > blog:first")
    #=> ["title", "content", "spsspaceurlorilist", "url", "time", "spsspacenamelist"]
# Names are compared exactly against the lowercase names shown above
p find_children("entry > list > blog:first", %w{title content url time spsspaceurlorilist})
    #=> ["spsspacenamelist"]

Raises:

  • (RuntimeError) if no HTTP request has been made yet

  • (HtmlError) if no element matches xpath

# File 'lib/common/http/html_helper.rb', line 280

def find_children(xpath,excludes=[])
    raise "no any http request before!" if @response == nil
    @hpricot ||= Hpricot(@response.body)
    elems = @hpricot.search(xpath)
    if block_given? 
        index = -1
        elems = elems.select do |elem|
            index = index + 1
            yield elem,index
        end
    end
    raise HtmlError, "find_children xpath[#{xpath}] is not exist!" if elems.empty?
    elem = elems.first

    rs = []
    elem.children.each do |el|
        if el.class == Hpricot::Elem and !excludes.include?(el.name)
            rs << el.name
        end
    end
    return rs
end
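
The exclusion step in the listing above is a plain `Array#include?` test, so it is case-sensitive. The sketch below illustrates the consequence, assuming the parser reports element names in lower case, as the first example's output suggests. `CHILD_NAMES` and `remaining` are hypothetical stand-ins for the parsed children and the filtering loop.

```ruby
# Stand-in for the child element names, already lowercased by the parser.
CHILD_NAMES = %w[title content spsspaceurlorilist url time spsspacenamelist].freeze

# Hypothetical helper: same exact-match exclusion as find_children.
def remaining(names, excludes = [])
  names.reject { |name| excludes.include?(name) }
end

MIXED_CASE = %w[title content url time spsSpaceURLOriList].freeze

# The mixed-case entry does not match the lowercase name, so it is not excluded.
p remaining(CHILD_NAMES, MIXED_CASE)
# => ["spsspaceurlorilist", "spsspacenamelist"]

# Lowercasing the exclude list first gives the intended result.
p remaining(CHILD_NAMES, MIXED_CASE.map(&:downcase))
# => ["spsspacenamelist"]
```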

#find_exist?(xpath) ⇒ Boolean

Purpose:

Checks whether an element exists.

Parameters:

  • xpath: the locator; either an XPath expression or a CSS selector

  • return: a Boolean

Example:

get "http://www.baidu.com/"
p find_exist? "title"    # => true

Returns:

  • (Boolean)


# File 'lib/common/http/html_helper.rb', line 96

def find_exist?(xpath)
    raise "no any http request before!" if @response == nil
    @hpricot ||= Hpricot(@response.body)
    elems = @hpricot.search(xpath)
    if block_given? 
        index = -1
        elems = elems.select do |elem|
            index = index + 1
            yield elem,index
        end
    end
    return !elems.empty?
end

#find_grep(regexp, group = 1) ⇒ Object

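Purpose:

Extracts text from the response body with a regular expression.

Parameters:

  • regexp: the regular expression to match

  • group: the number of the capture group to return; defaults to 1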

Example:

Example #1:

get "http://www.baidu.com/"
puts find_grep %r"<title>(.*?)</title>"
# => 百度一下,你就知道

Example #2:

get "http://www.baidu.com/"
puts find_grep(/<title>(.*?),(.*?)<\/title>/, 2)
# => 你就知道

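Raises:

  • (RuntimeError) if no HTTP request has been made yet

  • (HtmlError) if regexp does not match the response body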

# File 'lib/common/http/html_helper.rb', line 35

def find_grep(regexp, group=1)
    raise "no any http request before!" if @response == nil
    m = regexp.match @response.body
    raise HtmlError,"find_grep regexp[#{regexp}] not exist!" if m.nil?
    return m[group]
end
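
To try the group selection and the failure path without issuing an HTTP request, the method can be exercised against a canned body. This is only a sketch: `Response`, `GrepDemo`, and the `HtmlError` definition are hypothetical stand-ins for whatever the real test classes provide.

```ruby
# Hypothetical stand-ins so the sketch runs on its own.
Response = Struct.new(:body)
class HtmlError < StandardError; end

class GrepDemo
  def initialize(body)
    @response = Response.new(body)
  end

  # Same logic as the find_grep listing above.
  def find_grep(regexp, group = 1)
    raise "no any http request before!" if @response.nil?
    m = regexp.match(@response.body)
    raise HtmlError, "find_grep regexp[#{regexp}] not exist!" if m.nil?
    m[group]
  end
end

DEMO = GrepDemo.new("<html><title>Hello, World</title></html>")

puts DEMO.find_grep(%r{<title>(.*?)</title>})            # => Hello, World
puts DEMO.find_grep(/<title>(.*?), (.*?)<\/title>/, 2)   # => World
puts DEMO.find_grep(/<title>.*?<\/title>/, 0)            # group 0 is the whole match

begin
  DEMO.find_grep(/<h1>(.*?)<\/h1>/)                      # no <h1> in the body
rescue HtmlError => e
  puts "rescued: #{e.message}"
end
```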

#find_has?(str) ⇒ Boolean

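Purpose:

Checks whether the response body contains the given string. The string is matched literally, not as a regular expression.

Parameters:

  • str: the string to look for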

Example:

Example #1:

get "http://hi.baidu.com/"
p find_has? "百度空间"    #=> true

Returns:

  • (Boolean)


# File 'lib/common/http/html_helper.rb', line 52

def find_has?(str)
    raise "no any http request before!" if @response == nil
    @response.body.include? str
end
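
Because the listing above relies on `String#include?`, characters that are special in a regular expression have no special meaning here. A small self-contained sketch of the difference, using a made-up `BODY` string in place of a real response body:

```ruby
# Made-up body text standing in for @response.body.
BODY = "<p>price: 3.50 (approx.)</p>"

# Literal substring test, as find_has? performs it.
LITERAL_HIT  = BODY.include?("3.50")
LITERAL_MISS = BODY.include?("3.5.")   # "." is just a dot here, not a wildcard

# The same characters as a regular expression: "." matches any character.
REGEX_HIT = BODY.match?(/3.5./)

p LITERAL_HIT   # => true
p LITERAL_MISS  # => false
p REGEX_HIT     # => true
```

When pattern matching is needed, find_grep or find_scan is the better fit.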

#find_html(xpath) ⇒ Object

Purpose:

Extracts an HTML fragment located by an XPath expression or a CSS selector.

Parameters:

  • xpath: the locator; either an XPath expression or a CSS selector

  • return: a String

Example:

Example #1:

get "http://www.baidu.com/"
# The HTML fragment looks like this:
# <p id=km><a href=http://hi.baidu.com>空间</a>&nbsp;&nbsp;<a href=http://www.hao123.com>hao123</a>
# &nbsp;|&nbsp;<a href=/more/>更多<span style="font-family:宋体">>></span></a></p> 

puts find_html "p#km"
# <a href=http://hi.baidu.com>空间</a>&nbsp;&nbsp;<a href=http://www.hao123.com>hao123</a>
# &nbsp;|&nbsp;<a href=/more/>更多<span style="font-family:宋体">>></span></a>

puts find_html "//p[@id='km']"
# Same output as above

Raises:

  • (RuntimeError) if no HTTP request has been made yet

  • (HtmlError) if no element matches xpath

# File 'lib/common/http/html_helper.rb', line 131

def find_html(xpath)
    raise "no any http request before!" if @response == nil
    @hpricot ||= Hpricot(@response.body)
    elems = @hpricot.search(xpath)
    if block_given? 
        index = -1
        elems = elems.select do |elem|
            index = index + 1
            yield elem,index
        end
    end
    raise HtmlError, "find_html xpath[#{xpath}] is not exist!" if elems.empty?
    return elems.first.html.force_encoding "gbk"
end

#find_raw_attr(xpath, attr) ⇒ Object

Purpose:

Deprecated; use find_attr instead.

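Raises:

  • (RuntimeError) if no HTTP request has been made yet

  • (HtmlError) if no element matches xpath, or the first match has no attribute attr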

# File 'lib/common/http/html_helper.rb', line 218

def find_raw_attr(xpath, attr)
    raise "no any http request before!" if @response == nil
    @hpricot ||= Hpricot(@response.body)
    elems = @hpricot.search(xpath.downcase)
    if block_given? 
        index = -1
        elems = elems.select do |elem|
            index = index + 1
            yield elem,index
        end
    end
    raise HtmlError, "find_raw_attr xpath[#{xpath}] is not exist!" if elems.empty?
    elem = elems.first
    raise HtmlError, "find_raw_attr xpath[#{xpath}] ok, but elem[#{elem}] has no attr[#{attr}]!" unless elem.has_attribute? attr
    return elem.raw_attributes[attr]
end

#find_scan(regexp, &blk) ⇒ Object

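Purpose:

Scans the response body with a regular expression and collects every match.

Parameters:

  • regexp: the regular expression to scan with

  • return: a two-dimensional Array; the outer dimension holds one entry per match, and each entry holds that match's capture groups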

Example:

Example #1: without a block

get "http://hi.baidu.com/腚腚熊/album"
p find_scan /^imgarr\[len\]=\{purl:"(.*)",psrc:"(.*)",pname:"(.*)",pnum:"(.*)"/
# => [["/%EB%EB%EB%EB%D0%DC/album/%C4%AC%C8%CF%CF%E0%B2%E1", "http://hiphotos.baidu.com/%EB%EB%EB%EB%D0%DC/abpic/item/8ddde41f55e355daa78669b4.jpg", "默认相册", "32"], ["/%EB%EB%EB%EB%D0%DC/album/%CE%D2%B5%C4%D1%D0%BE%BF%C9%FA%CA%B1%B4%FA", "http://hiphotos.baidu.com/%EB%EB%EB%EB%D0%DC/abpic/item/b7c672197f61435542a9ad2f.jpg", "我的研究生时代", "14"], ["/%EB%EB%EB%EB%D0%DC/album/3", "http://hiphotos.baidu.com/%EB%EB%EB%EB%D0%DC/abpic/item/0ded0b80781ddbcb9123d9b2.jpg", "3", "1"], ["/%EB%EB%EB%EB%D0%DC/album/4", "http://hiphotos.baidu.com/%EB%EB%EB%EB%D0%DC/abpic/item/e62608ea188ae1cfd539c991.jpg", "4", "1"], ["/%EB%EB%EB%EB%D0%DC/album/Kongde", "http://hiphotos.baidu.com/%EB%EB%EB%EB%D0%DC/abpic/item/8c3e0aef76094b36acafd5f0.jpg", "Kongde", "0"]]

Example #2: with a block

get "http://hi.baidu.com/腚腚熊/album"
find_scan /^imgarr\[len\]=\{purl:"(.*)",psrc:"(.*)",pname:"(.*)",pnum:"(.*)"/ do |url,src,name,num|
    puts "url:#{url} src:#{src} name:#{name} num:#{num}"
end
# => url:/%EB%EB%EB%EB%D0%DC/album/%C4%AC%C8%CF%CF%E0%B2%E1 src:http://hiphotos.baidu.com/%EB%EB%EB%EB%D0%DC/abpic/item/8ddde41f55e355daa78669b4.jpg name:默认相册 num:32
# => ...


# File 'lib/common/http/html_helper.rb', line 81

def find_scan(regexp,&blk)
    raise "no any http request before!" if @response == nil
    return @response.body.scan regexp,&blk
end
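
Since the listing above simply forwards to `String#scan`, the shape of the result can be checked on a small canned string. The sketch below uses the pattern from the examples with made-up data (`BODY_JS` and its example.com URLs are placeholders), and also shows that `scan` returns the scanned string itself when a block is given.

```ruby
# Made-up body text in the same shape as the album page in the examples.
BODY_JS = <<~JS
  imgarr[len]={purl:"/album/a",psrc:"http://example.com/a.jpg",pname:"first",pnum:"32"}
  imgarr[len]={purl:"/album/b",psrc:"http://example.com/b.jpg",pname:"second",pnum:"14"}
JS

PATTERN = /^imgarr\[len\]=\{purl:"(.*)",psrc:"(.*)",pname:"(.*)",pnum:"(.*)"/

# Without a block: one inner Array of capture groups per match.
ROWS = BODY_JS.scan(PATTERN)
p ROWS.size   # => 2
p ROWS.first  # => ["/album/a", "http://example.com/a.jpg", "first", "32"]

# With a block: each match's groups are yielded, and scan returns the receiver.
NAMES = []
RESULT = BODY_JS.scan(PATTERN) { |_url, _src, name, num| NAMES << "#{name}:#{num}" }
p NAMES                   # => ["first:32", "second:14"]
p RESULT.equal?(BODY_JS)  # => true
```

So the block form of find_scan is useful for its side effects, while the return value is only meaningful in the non-block form.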

#find_size(xpath) ⇒ Object

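Purpose:

Counts the HTML elements matched by an XPath expression or a CSS selector.

Parameters:

  • xpath: the locator; either an XPath expression or a CSS selector

  • return: the number of matching elements, as a Fixnum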

Example:

Example #1:

get "http://www.baidu.com/"
# The HTML fragment looks like this:
# <p id=km><a href=http://hi.baidu.com>空间</a>&nbsp;&nbsp;<a href=http://www.hao123.com>hao123</a>&nbsp;|&nbsp;<a href=/more/>更多<span style="font-family:宋体">>></span></a></p> 

puts find_size "p#km a"
# => 3
# At a higher level, this might mean "the row below the search box shows 3 hyperlinks"


# File 'lib/common/http/html_helper.rb', line 253

def find_size(xpath)
    raise "no any http request before!" if @response == nil
    @hpricot ||= Hpricot(@response.body)
    elems = @hpricot.search(xpath)
    if block_given? 
        index = -1
        elems = elems.select do |elem|
            index = index + 1
            yield elem,index
        end
    end
    return elems.size
end

#find_text(xpath) ⇒ Object

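Purpose:

Extracts the text content of an element located by an XPath expression or a CSS selector. HTML tags are not included in the result.

Parameters:

  • xpath: the locator; either an XPath expression or a CSS selector

  • return: a String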

Example:

Example #1:

get "http://www.baidu.com/"
# The HTML fragment looks like this:
# <p id=km><a href=http://hi.baidu.com>空间</a>&nbsp;&nbsp;
# <a href=http://www.hao123.com>hao123</a>&nbsp;|&nbsp;<a href=/more/>更多<span style="font-family:宋体">
# >></span></a></p> 

puts find_text "p#km"
# => 空间&nbsp;&nbsp;hao123&nbsp;|&nbsp;更多>>

puts find_text "//p[@id='km']"
# Same output as above

Raises:

  • (RuntimeError) if no HTTP request has been made yet

  • (HtmlError) if no element matches xpath

# File 'lib/common/http/html_helper.rb', line 167

def find_text(xpath)
    raise "no any http request before!" if @response == nil
    @hpricot ||= Hpricot(@response.body)
    elems = @hpricot.search(xpath)
    if block_given? 
        index = -1
        elems = elems.select do |elem|
            index = index + 1
            yield elem,index
        end
    end
    raise HtmlError, "find_text xpath[#{xpath}] is not exist!" if elems.empty?
    return elems.first.inner_text
end

#hpricot_search(xpath, &blk) ⇒ Object

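Purpose:

Advanced interface that exposes Hpricot directly and returns an array-like collection of elements. See the Hpricot manual for details.

Parameters:

  • xpath: the locator; either an XPath expression or a CSS selector

  • return: an Hpricot::Elems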

Example:

get "http://www.baidu.com"
hpricot_search("div").each do |elem|
    p elem
end


# File 'lib/common/http/html_helper.rb', line 317

def hpricot_search(xpath, &blk)
    raise "no any http request before!" if @response == nil
    @hpricot ||= Hpricot(@response.body)
    @hpricot.search(xpath, &blk)
end