Module: HtmlHelper
Overview
Parses plain text, HTML, and JSON
-
Prints readable logs
Instance Method Summary
-
#find_attr(xpath, attr) ⇒ Object
Purpose: extracts the value of an HTML element's attribute using XPath or a CSS selector.
-
#find_children(xpath, excludes = []) ⇒ Object
Collects the names of an element's child elements using XPath or a CSS selector.
-
#find_exist?(xpath) ⇒ Boolean
Purpose: checks whether an element exists.
-
#find_grep(regexp, group = 1) ⇒ Object
Purpose: extracts text with a regular expression.
-
#find_has?(str) ⇒ Boolean
Purpose: checks whether the response contains a given string; the string is matched literally, not as a regular expression.
-
#find_html(xpath) ⇒ Object
Purpose: extracts an HTML fragment using XPath or a CSS selector.
-
#find_raw_attr(xpath, attr) ⇒ Object
Purpose: this method is not recommended; use find_attr instead.
-
#find_scan(regexp, &blk) ⇒ Object
Purpose: scans the response with a regular expression and returns every match.
-
#find_size(xpath) ⇒ Object
Purpose: counts the HTML elements matched by an XPath expression or a CSS selector.
-
#find_text(xpath) ⇒ Object
Purpose: extracts text content using XPath or a CSS selector; HTML tags are not included.
-
#hpricot_search(xpath, &blk) ⇒ Object
Purpose: advanced interface that exposes Hpricot directly; see the Hpricot manual. Returns an array.
Instance Method Details
#find_attr(xpath, attr) ⇒ Object
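Purpose:
Extracts the value of an HTML element's attribute using XPath or a CSS selector.
Parameters:
-
xpath Locator; may be an XPath expression or a CSS selector
-
attr Attribute name
-
return Returns a String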
Example:
Example #1:
get "http://www.baidu.com/"
# The HTML fragment looks like this:
# <p id=km><a href=http://hi.baidu.com>空间</a> <a href=http://www.hao123.com>hao123</a> | <a href=/more/>更多<span style="font-family:宋体">>></span></a></p>
puts find_attr "p#km a:first", "href"
# => http://hi.baidu.com
# File 'lib/common/http/html_helper.rb', line 199
def find_attr(xpath, attr)
  raise "no any http request before!" if @response == nil
  @hpricot ||= Hpricot(@response.body)
  elems = @hpricot.search(xpath)
  if block_given?
    index = -1
    elems = elems.select do |elem|
      index = index + 1
      yield elem,index
    end
  end
  raise HtmlError, "find_attr xpath[#{xpath}] is not exist!" if elems.empty?
  elem = elems.first
  raise HtmlError, "find_attr xpath[#{xpath}] ok, but elem[#{elem}] has no attr[#{attr}]!" unless elem.has_attribute? attr
  return elem.get_attribute(attr)
end
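The source of find_attr shows that it accepts an optional block, which none of the examples use: the block receives each matched element together with its zero-based index, and only elements for which it returns a truthy value are kept. The other XPath-based finders contain the same step. Below is a minimal sketch of that filtering logic in plain Ruby. The method name filter_with_index and the hrefs array are hypothetical stand-ins for the helper's internals and for the elements Hpricot would return; they are not part of HtmlHelper.

```ruby
# Hypothetical stand-in for the block-filtering step inside find_attr.
# `elems` plays the role of the result of @hpricot.search(xpath).
def filter_with_index(elems)
  return elems unless block_given?

  index = -1
  elems.select do |elem|
    index += 1
    yield elem, index # keep the element when the block returns a truthy value
  end
end

# Stand-in data (assumption): the href values of the three links in p#km.
hrefs = ["http://hi.baidu.com", "http://www.hao123.com", "/more/"]

# Without a block, the first match is used, as in find_attr.
first = filter_with_index(hrefs).first

# With a block, the first match that passes the filter is used.
second = filter_with_index(hrefs) { |_elem, index| index == 1 }.first

puts first
puts second
```

Under these assumptions, a call such as find_attr("p#km a", "href") { |elem, index| index == 1 } should pick the second link rather than the first.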
#find_children(xpath, excludes = []) ⇒ Object
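Collects the names of the child elements of the first matched element, using XPath or a CSS selector.
Parameters:
-
xpath Parent node; may be an XPath expression or a CSS selector
-
excludes Names of child nodes to exclude; an array of any length
-
return Returns the names of the child elements that are not listed in excludes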
Example:
get "http://hi.baidu.com/sys/search?word=我&from=wise"
p find_children("entry > list > blog:first")
#=> ["title", "content", "spsspaceurlorilist", "url", "time", "spsspacenamelist"]
p find_children("entry > list > blog:first",%w{title content url time spsSpaceURLOriList})
#=> ["spsspacenamelist"]
# File 'lib/common/http/html_helper.rb', line 280
def find_children(xpath,excludes=[])
  raise "no any http request before!" if @response == nil
  @hpricot ||= Hpricot(@response.body)
  elems = @hpricot.search(xpath)
  if block_given?
    index = -1
    elems = elems.select do |elem|
      index = index + 1
      yield elem,index
    end
  end
  raise HtmlError, "find_attr xpath[#{xpath}] is not exist!" if elems.empty?
  elem = elems.first
  rs = []
  elem.children.each do |el|
    if el.class == Hpricot::Elem and !excludes.include?(el.name)
      rs << el.name
    end
  end
  return rs
end
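The exclusion step in the listing above is an Array#include? test on each child's name. The sketch below reproduces only that step on plain strings, so it runs without Hpricot. The method child_names_except and the names array are hypothetical and stand in for the loop over elem.children.

```ruby
# Hypothetical stand-in for the name-collecting loop inside find_children.
# `child_names` plays the role of el.name for each child element.
def child_names_except(child_names, excludes = [])
  child_names.reject { |name| excludes.include?(name) }
end

# Stand-in data (assumption): names of four child elements.
names = %w[title content url time]

all_names = child_names_except(names)
filtered  = child_names_except(names, %w[title content])

# Array#include? compares strings exactly, so a differently cased
# name does not exclude anything.
case_miss = child_names_except(names, %w[Title])

p all_names
p filtered
p case_miss
```

Because the comparison is exact, names passed in excludes would presumably need to match the element names as the parser reports them, including letter case.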
#find_exist?(xpath) ⇒ Boolean
Purpose:
Checks whether an element exists.
Parameters:
-
xpath Locator; may be an XPath expression or a CSS selector
-
return Returns a boolean
Example:
get "http://www.baidu.com/"
p find_exist? "title" # => true
# File 'lib/common/http/html_helper.rb', line 96
def find_exist?(xpath)
  raise "no any http request before!" if @response == nil
  @hpricot ||= Hpricot(@response.body)
  elems = @hpricot.search(xpath)
  if block_given?
    index = -1
    elems = elems.select do |elem|
      index = index + 1
      yield elem,index
    end
  end
  return !elems.empty?
end
#find_grep(regexp, group = 1) ⇒ Object
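Purpose:
Extracts text with a regular expression.
Parameters:
-
regexp The regular expression used for extraction
-
group Number of the capture group to return; defaults to 1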
Example:
Example #1:
get "http://www.baidu.com/"
puts find_grep %r"<title>(.*?)</title>"
# => 百度一下,你就知道
Example #2:
get "http://www.baidu.com/"
puts find_grep /<title>(.*?),(.*?)<\/title>/, 2
# => 你就知道
# File 'lib/common/http/html_helper.rb', line 35
def find_grep(regexp, group=1)
  raise "no any http request before!" if @response == nil
  m = regexp.match @response.body
  raise HtmlError,"find_grep regexp[#{regexp}] not exist!" if m.nil?
  return m[group]
end
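The listing above relies on Regexp#match and MatchData indexing, so its behaviour can be tried on a plain string without issuing an HTTP request. The following sketch assumes a free-standing method grep_body that takes the body as an argument, and a stand-in HtmlError class; neither is part of HtmlHelper, and the library's real HtmlError definition is not shown on this page.

```ruby
# Stand-in for the library's HtmlError (assumption: its definition is not shown here).
class HtmlError < StandardError; end

# Hypothetical free-standing version of find_grep that takes the body explicitly.
def grep_body(body, regexp, group = 1)
  m = regexp.match(body)
  raise HtmlError, "find_grep regexp[#{regexp}] not exist!" if m.nil?

  m[group]
end

# Stand-in response body (assumption).
body = "<html><title>百度一下,你就知道</title></html>"

whole  = grep_body(body, %r{<title>(.*?)</title>})          # default group 1
second = grep_body(body, %r{<title>(.*?),(.*?)</title>}, 2) # second capture group
full   = grep_body(body, %r{<title>(.*?)</title>}, 0)       # group 0 is the entire match

# A pattern that does not match raises HtmlError, which callers can rescue.
missing =
  begin
    grep_body(body, %r{<h1>(.*?)</h1>})
  rescue HtmlError => e
    e.message
  end

puts whole, second, full, missing
```

Passing 0 as the group returns the whole matched text, which follows from how MatchData#[] works rather than from anything specific to this helper.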
#find_has?(str) ⇒ Boolean
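Purpose:
Checks whether the response contains a given string; the string is matched literally, not as a regular expression.
Parameters:
-
str The string to look for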
Example:
Example :
get "http://hi.baidu.com/"
p find_has? "百度空间" #=> true
# File 'lib/common/http/html_helper.rb', line 52
def find_has?(str)
  raise "no any http request before!" if @response == nil
  @response.body.include? str
end
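Since find_has? delegates to String#include?, characters that would be metacharacters in a regular expression are compared literally. The short sketch below contrasts the two behaviours on a made-up body string, which is an assumption for illustration only.

```ruby
# Stand-in response body (assumption).
body = "<span>version 1x5</span>"

# String#include? treats "." as a literal dot, so "1.5" is not found in "1x5".
plain = body.include?("1.5")

# In a regular expression, "." matches any character, so /1.5/ matches "1x5".
regex = body.match?(/1.5/)

puts plain
puts regex
```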
#find_html(xpath) ⇒ Object
Purpose:
Extracts an HTML fragment using XPath or a CSS selector.
Parameters:
-
xpath Locator; may be an XPath expression or a CSS selector
-
return Returns a String
Example:
Example #1:
get "http://www.baidu.com/"
# The HTML fragment looks like this:
# <p id=km><a href=http://hi.baidu.com>空间</a> <a href=http://www.hao123.com>hao123</a>
# | <a href=/more/>更多<span style="font-family:宋体">>></span></a></p>
puts find_html "p#km"
# <a href=http://hi.baidu.com>空间</a> <a href=http://www.hao123.com>hao123</a>
# | <a href=/more/>更多<span style="font-family:宋体">>></span></a>
puts find_html "//p[@id='km']"
# Same output as above
# File 'lib/common/http/html_helper.rb', line 131
def find_html(xpath)
  raise "no any http request before!" if @response == nil
  @hpricot ||= Hpricot(@response.body)
  elems = @hpricot.search(xpath)
  if block_given?
    index = -1
    elems = elems.select do |elem|
      index = index + 1
      yield elem,index
    end
  end
  raise HtmlError, "find_html xpath[#{xpath}] is not exist!" if elems.empty?
  return elems.first.html.force_encoding "gbk"
end
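The listing ends with force_encoding "gbk", which suggests (as an inference from the code, not a documented guarantee) that the helper expects GBK-encoded pages. String#force_encoding only changes the encoding label attached to a string and leaves its bytes untouched, unlike String#encode, which transcodes. The sketch below demonstrates the difference on a small made-up string.

```ruby
# Sample text written with \u escapes so the snippet does not depend on the
# source file's encoding: U+7A7A U+95F4.
original = "\u7A7A\u95F4"

# Bytes as a GBK-encoded page would deliver them, labelled as raw binary.
raw = original.encode("GBK").force_encoding(Encoding::BINARY)

# force_encoding only relabels the string; the bytes stay the same.
tagged = raw.dup.force_encoding("gbk")
same_bytes = raw.bytes == tagged.bytes

# encode, by contrast, transcodes the correctly labelled string to new bytes.
utf8 = tagged.encode("UTF-8")

puts tagged.encoding
puts same_bytes
puts utf8 == original
```

If a page is not actually GBK, the relabelled string may report valid_encoding? as false, so callers working with other encodings may need to handle that case themselves.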
#find_raw_attr(xpath, attr) ⇒ Object
Purpose:
This method is not recommended; use find_attr instead.
# File 'lib/common/http/html_helper.rb', line 218
def find_raw_attr(xpath, attr)
  raise "no any http request before!" if @response == nil
  @hpricot ||= Hpricot(@response.body)
  elems = @hpricot.search(xpath.downcase)
  if block_given?
    index = -1
    elems = elems.select do |elem|
      index = index + 1
      yield elem,index
    end
  end
  raise HtmlError, "find_attr xpath[#{xpath}] is not exist!" if elems.empty?
  elem = elems.first
  raise HtmlError, "find_attr xpath[#{xpath}] ok, but elem[#{elem}] has no attr[#{attr}]!" unless elem.has_attribute? attr
  return elem.raw_attributes[attr]
end
#find_scan(regexp, &blk) ⇒ Object
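Purpose:
Scans the response with a regular expression and returns every match.
Parameters:
-
regexp The regular expression to scan with
-
return A two-dimensional array: the first dimension holds one entry per match, the second holds that match's capture groups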
Example:
Example #1: without a block
get "http://hi.baidu.com/腚腚熊/album"
p find_scan /^imgarr\[len\]=\{purl:"(.*)",psrc:"(.*)",pname:"(.*)",pnum:"(.*)"/
# => [["/%EB%EB%EB%EB%D0%DC/album/%C4%AC%C8%CF%CF%E0%B2%E1", "http://hiphotos.baidu.com/%EB%EB%EB%EB%D0%DC/abpic/item/8ddde41f55e355daa78669b4.jpg", "默认相册", "32"], ["/%EB%EB%EB%EB%D0%DC/album/%CE%D2%B5%C4%D1%D0%BE%BF%C9%FA%CA%B1%B4%FA", "http://hiphotos.baidu.com/%EB%EB%EB%EB%D0%DC/abpic/item/b7c672197f61435542a9ad2f.jpg", "我的研究生时代", "14"], ["/%EB%EB%EB%EB%D0%DC/album/3", "http://hiphotos.baidu.com/%EB%EB%EB%EB%D0%DC/abpic/item/0ded0b80781ddbcb9123d9b2.jpg", "3", "1"], ["/%EB%EB%EB%EB%D0%DC/album/4", "http://hiphotos.baidu.com/%EB%EB%EB%EB%D0%DC/abpic/item/e62608ea188ae1cfd539c991.jpg", "4", "1"], ["/%EB%EB%EB%EB%D0%DC/album/Kongde", "http://hiphotos.baidu.com/%EB%EB%EB%EB%D0%DC/abpic/item/8c3e0aef76094b36acafd5f0.jpg", "Kongde", "0"]]
Example #2: with a block
get "http://hi.baidu.com/腚腚熊/album"
find_scan /^imgarr\[len\]=\{purl:"(.*)",psrc:"(.*)",pname:"(.*)",pnum:"(.*)"/ do |url,src,name,num|
p "url:#{url} src:#{src} name:#{name} num:#{num}"
end
# => url:/%EB%EB%EB%EB%D0%DC/album/%C4%AC%C8%CF%CF%E0%B2%E1 src:http://hiphotos.baidu.com/%EB%EB%EB%EB%D0%DC/abpic/item/8ddde41f55e355daa78669b4.jpg name:默认相册 num:32
# => ...
# File 'lib/common/http/html_helper.rb', line 81
def find_scan(regexp,&blk)
  raise "no any http request before!" if @response == nil
  return @response.body.scan regexp,&blk
end
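find_scan passes its arguments straight to String#scan, so both calling styles can be tried on a plain string. The sketch below uses the same pattern as the examples above against a small made-up body; the album paths and example.com URLs are placeholder data, not real responses.

```ruby
# Stand-in response body (assumption): two lines in the shape the pattern expects.
body = <<~JS
  imgarr[len]={purl:"/album/1",psrc:"http://example.com/a.jpg",pname:"first",pnum:"32"};
  imgarr[len]={purl:"/album/2",psrc:"http://example.com/b.jpg",pname:"second",pnum:"14"};
JS

pattern = /^imgarr\[len\]=\{purl:"(.*)",psrc:"(.*)",pname:"(.*)",pnum:"(.*)"/

# Without a block: one inner array of capture groups per match.
rows = body.scan(pattern)

# With a block: the capture groups of each match are yielded to the block.
lines = []
body.scan(pattern) { |_url, _src, name, num| lines << "#{name}: #{num}" }

p rows
p lines
```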
#find_size(xpath) ⇒ Object
Purpose:
Counts the HTML elements matched by an XPath expression or a CSS selector.
Parameters:
-
xpath Locator; may be an XPath expression or a CSS selector
-
return Returns a Fixnum
Example:
Example #1:
get "http://www.baidu.com/"
# The HTML fragment looks like this:
# <p id=km><a href=http://hi.baidu.com>空间</a> <a href=http://www.hao123.com>hao123</a> | <a href=/more/>更多<span style="font-family:宋体">>></span></a></p>
puts find_size "p#km a"
# => 3
# At a higher level, this might mean "the row below the search box shows 3 links"
# File 'lib/common/http/html_helper.rb', line 253
def find_size(xpath)
  raise "no any http request before!" if @response == nil
  @hpricot ||= Hpricot(@response.body)
  elems = @hpricot.search(xpath)
  if block_given?
    index = -1
    elems = elems.select do |elem|
      index = index + 1
      yield elem,index
    end
  end
  return elems.size
end
#find_text(xpath) ⇒ Object
Purpose:
Extracts text content using XPath or a CSS selector; HTML tags are not included.
Parameters:
-
xpath Locator; may be an XPath expression or a CSS selector
-
return Returns a String
Example:
Example #1:
get "http://www.baidu.com/"
# The HTML fragment looks like this:
# <p id=km><a href=http://hi.baidu.com>空间</a>
# <a href=http://www.hao123.com>hao123</a> | <a href=/more/>更多<span style="font-family:宋体">
# >></span></a></p>
puts find_text "p#km"
# => 空间 hao123 | 更多>>
puts find_text "//p[@id='km']"
# Same output as above
# File 'lib/common/http/html_helper.rb', line 167
def find_text(xpath)
  raise "no any http request before!" if @response == nil
  @hpricot ||= Hpricot(@response.body)
  elems = @hpricot.search(xpath)
  if block_given?
    index = -1
    elems = elems.select do |elem|
      index = index + 1
      yield elem,index
    end
  end
  raise HtmlError, "find_html xpath[#{xpath}] is not exist!" if elems.empty?
  return elems.first.inner_text
end
#hpricot_search(xpath, &blk) ⇒ Object
Purpose:
Advanced interface that exposes Hpricot directly; see the Hpricot manual. Returns an array.
Parameters:
-
xpath Locator; may be an XPath expression or a CSS selector
-
return Returns Hpricot::Elems
Example:
get "http://www.baidu.com"
hpricot_search("div").each do |elem|
p elem
end
# File 'lib/common/http/html_helper.rb', line 317
def hpricot_search(xpath, &blk)
  raise "no any http request before!" if @response == nil
  @hpricot ||= Hpricot(@response.body)
  @hpricot.search(xpath, &blk)
end