Module: HtmlHelper

Included in:: MyTest, SuperTest

Defined in:: lib/common/http/html_helper.rb

Overview

Parses plain text, HTML, and JSON.

Prints readable logs.

Instance Method Summary

#find_attr(xpath, attr) ⇒ Object

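Purpose: Uses an XPath expression or CSS selector to extract an attribute value from an HTML element.
#find_children(xpath, excludes = []) ⇒ Object

Uses an XPath expression or CSS selector to list the names of an element's child elements.
#find_exist?(xpath) ⇒ Boolean

Purpose: Checks whether an element exists.
#find_grep(regexp, group = 1) ⇒ Object

Purpose: Extracts text with a regular expression.
#find_has?(str) ⇒ Boolean

Purpose: Checks whether the response contains a given string; the string is not treated as a regular expression.
#find_html(xpath) ⇒ Object

Purpose: Uses an XPath expression or CSS selector to extract an HTML fragment.
#find_raw_attr(xpath, attr) ⇒ Object

Purpose: Not recommended; use find_attr instead.
#find_scan(regexp, &blk) ⇒ Object

Purpose: Scans with a regular expression and returns the collection of matches.
#find_size(xpath) ⇒ Object

Purpose: Uses an XPath expression or CSS selector to count the matching HTML elements.
#find_text(xpath) ⇒ Object

Purpose: Uses an XPath expression or CSS selector to extract text content, without HTML tags.
#hpricot_search(xpath, &blk) ⇒ Object

Purpose: Advanced interface that exposes Hpricot directly; see the Hpricot manual. Returns an array.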

Instance Method Details

#find_attr(xpath, attr) ⇒ `Object`

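Purpose:

Uses an XPath expression or CSS selector to extract an attribute value from the first matching HTML element.

Parameters:

xpath the locator; either an XPath expression or a CSS selector
attr the attribute name
return a String

Example:

Example #1:

get "http://www.baidu.com/"
# The relevant HTML fragment: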
# <p id=km><a href=http://hi.baidu.com>空间</a>&nbsp;&nbsp;<a href=http://www.hao123.com>hao123</a>&nbsp;|&nbsp;<a href=/more/>更多<span style="font-family:宋体">>></span></a></p> 

puts find_attr "p#km a:first", "href"
# => http://hi.baidu.com

Raises:

(HtmlError)

# File 'lib/common/http/html_helper.rb', line 199

def find_attr(xpath, attr)
    raise "no any http request before!" if @response == nil
    @hpricot ||= Hpricot(@response.body)
    elems = @hpricot.search(xpath)
    if block_given? 
        index = -1
        elems = elems.select do |elem|
            index = index + 1
            yield elem,index
        end
    end
    raise HtmlError, "find_attr xpath[#{xpath}] is not exist!" if elems.empty?
    elem = elems.first
    raise HtmlError, "find_attr xpath[#{xpath}] ok, but elem[#{elem}] has no attr[#{attr}]!" unless elem.has_attribute? attr
    return elem.get_attribute(attr)
end
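
The listing above also accepts an optional block that the prose does not describe: when one is given, each matched element is yielded together with its zero-based index, and only the elements for which the block returns a truthy value are kept. The following is a minimal sketch of that filtering step on a plain array of strings. `filter_with_index` is a hypothetical helper name, and the strings merely stand in for Hpricot elements.

```ruby
# Hypothetical helper: reproduces only the block-handling lines of the
# listing above, using a plain Array instead of Hpricot elements.
def filter_with_index(elems)
  return elems unless block_given?

  index = -1
  elems.select do |elem|
    index += 1
    yield elem, index # keep elem when the block returns a truthy value
  end
end

# Stand-in data: the three href values from the example fragment.
hrefs = ["http://hi.baidu.com", "http://www.hao123.com", "/more/"]

# Filter by position: skip the first match.
p filter_with_index(hrefs) { |_href, index| index > 0 }

# Filter by content: keep only relative links.
p filter_with_index(hrefs) { |href, _index| href.start_with?("/") }

# Without a block the collection is returned unchanged.
p filter_with_index(hrefs)
```

Because find_attr then reads from `elems.first`, a block effectively chooses which match the attribute is taken from.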

#find_children(xpath, excludes = []) ⇒ `Object`

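Uses an XPath expression or CSS selector to list the names of the child elements of the first matching element.

Parameters:

xpath the parent node; either an XPath expression or a CSS selector
excludes names of child nodes to exclude (an array; defaults to empty)
return the names of the child elements that are not listed in excludes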

Example:

get "http://hi.baidu.com/sys/search?word=我&from=wise"
p find_children("entry > list > blog:first")
    #=> ["title", "content", "spsspaceurlorilist", "url", "time", "spsspacenamelist"]
p find_children("entry > list > blog:first",%w{title content url time spsspaceurlorilist})
    #=> ["spsspacenamelist"]

Raises:

(HtmlError)

# File 'lib/common/http/html_helper.rb', line 280

def find_children(xpath,excludes=[])
    raise "no any http request before!" if @response == nil
    @hpricot ||= Hpricot(@response.body)
    elems = @hpricot.search(xpath)
    if block_given? 
        index = -1
        elems = elems.select do |elem|
            index = index + 1
            yield elem,index
        end
    end
    raise HtmlError, "find_attr xpath[#{xpath}] is not exist!" if elems.empty?
    elem = elems.first

    rs = []
    elem.children.each do |el|
        if el.class == Hpricot::Elem and !excludes.include?(el.name)
            rs << el.name
        end
    end
    return rs
end
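
The exclusion step in the listing above can be sketched without Hpricot. In this sketch `Elem`, `Text`, and `child_names` are stand-ins invented for illustration; the real method checks `el.class == Hpricot::Elem`, which is approximated here with `is_a?`. It is meant to show two things visible in the source: text nodes are skipped, and names are excluded by exact, case-sensitive comparison.

```ruby
# Stand-ins for Hpricot node classes (assumptions for illustration only).
Elem = Struct.new(:name)
Text = Struct.new(:content)

# Mirrors the loop in the listing above: keep element children whose
# name is not in excludes; text nodes are skipped.
def child_names(children, excludes = [])
  children.select { |el| el.is_a?(Elem) && !excludes.include?(el.name) }
          .map(&:name)
end

children = [Elem.new("title"), Text.new("\n"), Elem.new("content"), Elem.new("url")]

p child_names(children)                # all element names, text node skipped
p child_names(children, %w[title url]) # listed names are removed
p child_names(children, %w[Title])     # exact match only: "Title" does not exclude "title"
```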

#find_exist?(xpath) ⇒ `Boolean`

Purpose:

Checks whether an element exists.

Parameters:

xpath the locator; either an XPath expression or a CSS selector
return a Boolean

Example:

get "http://www.baidu.com/"
p find_exist? "title"    # => true

Returns:

(Boolean)

# File 'lib/common/http/html_helper.rb', line 96

def find_exist?(xpath)
    raise "no any http request before!" if @response == nil
    @hpricot ||= Hpricot(@response.body)
    elems = @hpricot.search(xpath)
    if block_given? 
        index = -1
        elems = elems.select do |elem|
            index = index + 1
            yield elem,index
        end
    end
    return !elems.empty?
end

#find_grep(regexp, group = 1) ⇒ `Object`

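Purpose:

Extracts text with a regular expression.

Parameters:

regexp the regular expression to match against the response body
group the number of the capture group to return; defaults to 1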

Example:

Example #1:

get "http://www.baidu.com/"
puts find_grep %r"<title>(.*?)</title>"
# => 百度一下，你就知道

Example #2:

get "http://www.baidu.com/"
puts find_grep /<title>(.*?)，(.*?)<\/title>/, 2    
# => 你就知道

Raises:

(HtmlError)

# File 'lib/common/http/html_helper.rb', line 35

def find_grep(regexp, group=1)
    raise "no any http request before!" if @response == nil
    m = regexp.match @response.body
    raise HtmlError,"find_grep regexp[#{regexp}] not exist!" if m.nil?
    return m[group]
end
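
To make the role of `group` concrete, here is a small sketch that runs the same matching logic against a canned body instead of a live HTTP response. `FakeResponse` and `grep_body` are hypothetical names introduced for this sketch, and `HtmlError` is redefined locally only so the snippet is self-contained.

```ruby
# Minimal stand-ins (assumptions for illustration, not the real framework).
class HtmlError < StandardError; end
FakeResponse = Struct.new(:body)

# Same logic as the find_grep listing above, but taking the response
# as an argument instead of reading @response.
def grep_body(response, regexp, group = 1)
  raise "no any http request before!" if response.nil?

  m = regexp.match(response.body)
  raise HtmlError, "find_grep regexp[#{regexp}] not exist!" if m.nil?

  m[group]
end

response = FakeResponse.new("<html><title>Hello, World</title></html>")
title_re = /<title>(.*?), (.*?)<\/title>/

puts grep_body(response, title_re)     # group 1 is the default
puts grep_body(response, title_re, 2)  # an explicit group number
puts grep_body(response, title_re, 0)  # group 0 is the whole match

# A pattern that does not match raises HtmlError rather than returning nil.
begin
  grep_body(response, /<h1>(.*?)<\/h1>/)
rescue HtmlError => e
  puts "HtmlError: #{e.message}"
end
```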

#find_has?(str) ⇒ `Boolean`

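Purpose:

Checks whether the response body contains the given string. The string is matched literally, not as a regular expression.

Parameters:

str the string to look for

Example:

Example #1: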

get "http://hi.baidu.com/"
p find_has? "百度空间"    #=> true

Returns:

(Boolean)

# File 'lib/common/http/html_helper.rb', line 52

def find_has?(str)
    raise "no any http request before!" if @response == nil
    @response.body.include? str
end
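
Since the listing above delegates to `String#include?`, regular-expression metacharacters in `str` are ordinary characters. A short sketch of the difference follows; `body_has?` is a hypothetical stand-in that takes the body as an argument, and the body text is made up.

```ruby
# Hypothetical stand-in for find_has?, taking the body as an argument.
def body_has?(body, str)
  body.include?(str)
end

body = "Total: 3.50 (USD)"

p body_has?(body, "3.50")    # => true, literal substring
p body_has?(body, "(USD)")   # => true, parentheses are ordinary characters
p body_has?(body, "3.\\d+")  # => false, regex syntax is not interpreted
p body.match?(/3\.\d+/)      # => true, a pattern needs a Regexp instead
```

When a pattern is needed, find_grep or find_scan is the better fit.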

#find_html(xpath) ⇒ `Object`

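Purpose:

Uses an XPath expression or CSS selector to extract an HTML fragment (the inner HTML of the first matching element).

Parameters:

xpath the locator; either an XPath expression or a CSS selector
return a String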

Example:

Example #1:

get "http://www.baidu.com/"
# The relevant HTML fragment:
# <p id=km><a href=http://hi.baidu.com>空间</a>&nbsp;&nbsp;<a href=http://www.hao123.com>hao123</a>
# &nbsp;|&nbsp;<a href=/more/>更多<span style="font-family:宋体">>></span></a></p> 

puts find_html "p#km"
# <a href=http://hi.baidu.com>空间</a>&nbsp;&nbsp;<a href=http://www.hao123.com>hao123</a>
# &nbsp;|&nbsp;<a href=/more/>更多<span style="font-family:宋体">>></span></a>

puts find_html "//p[@id='km']"
# Same as above

Raises:

(HtmlError)

# File 'lib/common/http/html_helper.rb', line 131

def find_html(xpath)
    raise "no any http request before!" if @response == nil
    @hpricot ||= Hpricot(@response.body)
    elems = @hpricot.search(xpath)
    if block_given? 
        index = -1
        elems = elems.select do |elem|
            index = index + 1
            yield elem,index
        end
    end
    raise HtmlError, "find_html xpath[#{xpath}] is not exist!" if elems.empty?
    return elems.first.html.force_encoding "gbk"
end

#find_raw_attr(xpath, attr) ⇒ `Object`

Purpose:

This method is not recommended; use find_attr instead.

Raises:

(HtmlError)

# File 'lib/common/http/html_helper.rb', line 218

def find_raw_attr(xpath, attr)
    raise "no any http request before!" if @response == nil
    @hpricot ||= Hpricot(@response.body)
    elems = @hpricot.search(xpath.downcase)
    if block_given? 
        index = -1
        elems = elems.select do |elem|
            index = index + 1
            yield elem,index
        end
    end
    raise HtmlError, "find_attr xpath[#{xpath}] is not exist!" if elems.empty?
    elem = elems.first
    raise HtmlError, "find_attr xpath[#{xpath}] ok, but elem[#{elem}] has no attr[#{attr}]!" unless elem.has_attribute? attr
    return elem.raw_attributes[attr]
end

#find_scan(regexp, &blk) ⇒ `Object`

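Purpose:

Scans the response body with a regular expression and returns the collection of matches.

Parameters:

regexp the regular expression to scan with
return a two-dimensional array: the outer dimension holds one entry per match, the inner dimension holds that match's capture groups

Example:

Example #1: without a block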

get "http://hi.baidu.com/腚腚熊/album"
p find_scan /^imgarr\[len\]=\{purl:"(.*)",psrc:"(.*)",pname:"(.*)",pnum:"(.*)"/
# => [["/%EB%EB%EB%EB%D0%DC/album/%C4%AC%C8%CF%CF%E0%B2%E1", "http://hiphotos.baidu.com/%EB%EB%EB%EB%D0%DC/abpic/item/8ddde41f55e355daa78669b4.jpg", "默认相册", "32"], ["/%EB%EB%EB%EB%D0%DC/album/%CE%D2%B5%C4%D1%D0%BE%BF%C9%FA%CA%B1%B4%FA", "http://hiphotos.baidu.com/%EB%EB%EB%EB%D0%DC/abpic/item/b7c672197f61435542a9ad2f.jpg", "我的研究生时代", "14"], ["/%EB%EB%EB%EB%D0%DC/album/3", "http://hiphotos.baidu.com/%EB%EB%EB%EB%D0%DC/abpic/item/0ded0b80781ddbcb9123d9b2.jpg", "3", "1"], ["/%EB%EB%EB%EB%D0%DC/album/4", "http://hiphotos.baidu.com/%EB%EB%EB%EB%D0%DC/abpic/item/e62608ea188ae1cfd539c991.jpg", "4", "1"], ["/%EB%EB%EB%EB%D0%DC/album/Kongde", "http://hiphotos.baidu.com/%EB%EB%EB%EB%D0%DC/abpic/item/8c3e0aef76094b36acafd5f0.jpg", "Kongde", "0"]]

Example #2: with a block

get "http://hi.baidu.com/腚腚熊/album"
find_scan /^imgarr\[len\]=\{purl:"(.*)",psrc:"(.*)",pname:"(.*)",pnum:"(.*)"/ do |url,src,name,num|
    p "url:#{url} src:#{src} name:#{name} num:#{num}"
end
# => url:/%EB%EB%EB%EB%D0%DC/album/%C4%AC%C8%CF%CF%E0%B2%E1 src:http://hiphotos.baidu.com/%EB%EB%EB%EB%D0%DC/abpic/item/8ddde41f55e355daa78669b4.jpg name:默认相册 num:32
# => ...

# File 'lib/common/http/html_helper.rb', line 81

def find_scan(regexp,&blk)
    raise "no any http request before!" if @response == nil
    return @response.body.scan regexp,&blk
end
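
The listing above is a thin wrapper around `String#scan`, so the shape of the result follows Ruby's own rules. The sketch below uses a made-up body shaped like the example data (the paths and `example.com` URLs are invented) to show the three cases: capture groups without a block, capture groups with a block, and a pattern with no groups.

```ruby
# Stand-in response body (made-up data shaped like the example above).
BODY = <<~JS
  imgarr[len]={purl:"/album/1",psrc:"http://example.com/a.jpg",pname:"first",pnum:"32"}
  imgarr[len]={purl:"/album/2",psrc:"http://example.com/b.jpg",pname:"second",pnum:"14"}
JS

PATTERN = /^imgarr\[len\]=\{purl:"(.*)",psrc:"(.*)",pname:"(.*)",pnum:"(.*)"/

# Without a block: one inner array of capture groups per match.
rows = BODY.scan(PATTERN)
p rows.length
p rows.first

# With a block: each match's groups are yielded; the return value is the
# scanned string itself, not the array of matches.
names = []
result = BODY.scan(PATTERN) { |_purl, _psrc, pname, _pnum| names << pname }
p names
p result.equal?(BODY)

# A pattern without capture groups yields a flat array of matched strings.
p BODY.scan(/pnum:"\d+"/)
```

In other words, the two-dimensional result described above applies when the pattern contains capture groups and no block is passed.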

#find_size(xpath) ⇒ `Object`

Purpose:

Uses an XPath expression or CSS selector to count the matching HTML elements.

Parameters:

xpath the locator; either an XPath expression or a CSS selector
return a Fixnum

Example:

Example #1:

get "http://www.baidu.com/"
# The relevant HTML fragment:
# <p id=km><a href=http://hi.baidu.com>空间</a>&nbsp;&nbsp;<a href=http://www.hao123.com>hao123</a>&nbsp;|&nbsp;<a href=/more/>更多<span style="font-family:宋体">>></span></a></p> 

puts find_size "p#km a"
# => 3
# At a higher level this might mean "the row below the search box shows 3 hyperlinks"

# File 'lib/common/http/html_helper.rb', line 253

def find_size(xpath)
    raise "no any http request before!" if @response == nil
    @hpricot ||= Hpricot(@response.body)
    elems = @hpricot.search(xpath)
    if block_given? 
        index = -1
        elems = elems.select do |elem|
            index = index + 1
            yield elem,index
        end
    end
    return elems.size
end

#find_text(xpath) ⇒ `Object`

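Purpose:

Uses an XPath expression or CSS selector to extract the text content of the first matching element, without HTML tags.

Parameters:

xpath the locator; either an XPath expression or a CSS selector
return a String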

Example:

Example #1:

get "http://www.baidu.com/"
# The relevant HTML fragment:
# <p id=km><a href=http://hi.baidu.com>空间</a>&nbsp;&nbsp;
# <a href=http://www.hao123.com>hao123</a>&nbsp;|&nbsp;<a href=/more/>更多<span style="font-family:宋体">
# >></span></a></p> 

puts find_text "p#km"
# => 空间&nbsp;&nbsp;hao123&nbsp;|&nbsp;更多>>

puts find_text "//p[@id='km']"
# Same as above

Raises:

(HtmlError)

# File 'lib/common/http/html_helper.rb', line 167

def find_text(xpath)
    raise "no any http request before!" if @response == nil
    @hpricot ||= Hpricot(@response.body)
    elems = @hpricot.search(xpath)
    if block_given? 
        index = -1
        elems = elems.select do |elem|
            index = index + 1
            yield elem,index
        end
    end
    raise HtmlError, "find_html xpath[#{xpath}] is not exist!" if elems.empty?
    return elems.first.inner_text
end

#hpricot_search(xpath, &blk) ⇒ `Object`

Purpose:

Advanced interface that exposes Hpricot directly; see the Hpricot manual. Returns an array.

Parameters:

xpath the locator; either an XPath expression or a CSS selector
return an Hpricot::Elems

Example:

get "http://www.baidu.com"
hpricot_search("div").each do |elem|
    p elem
end

# File 'lib/common/http/html_helper.rb', line 317

def hpricot_search(xpath, &blk)
    raise "no any http request before!" if @response == nil
    @hpricot ||= Hpricot(@response.body)
    @hpricot.search(xpath, &blk)
end
