Module: ScraperUtils::MechanizeUtils
Defined in:
- lib/scraper_utils/mechanize_utils.rb
- lib/scraper_utils/mechanize_utils/agent_config.rb
- lib/scraper_utils/mechanize_utils/adaptive_delay.rb
- lib/scraper_utils/mechanize_utils/robots_checker.rb
Overview
Utilities for configuring and using Mechanize for web scraping
Defined Under Namespace
Classes: AdaptiveDelay, AgentConfig, RobotsChecker
Constant Summary
- PUBLIC_IP_URL = "https://whatismyip.akamai.com/"
- HEADERS_ECHO_URL = "https://httpbin.org/headers"
Class Method Summary
- .find_maintenance_message(page) ⇒ String?
  Checks whether a page indicates maintenance mode.
- .mechanize_agent(**options) ⇒ Mechanize
  Creates and configures a Mechanize agent.
- .public_headers(agent = nil, force: false) ⇒ String?
  Retrieves the headers that make it through the proxy.
- .public_ip(agent = nil, force: false) ⇒ String?
  Retrieves the public IP address.
- .using_proxy?(agent) ⇒ Boolean
  Returns whether the Mechanize agent is using a proxy.
Class Method Details
.find_maintenance_message(page) ⇒ String?
Checks whether a page indicates maintenance mode. Returns "Maintenance: " followed by the text of the first h1 or title element that mentions maintenance (case-insensitive), or nil if no such element is found.
# File 'lib/scraper_utils/mechanize_utils.rb', line 36
def self.find_maintenance_message(page)
  # Use Nokogiri for parsing because earlier versions of Mechanize
  # do not support the .search method on page objects
  doc = Nokogiri::HTML(page.body)
  doc.css("h1, title").each do |element|
    text = element.inner_text
    return "Maintenance: #{text}" if text&.match?(/maintenance/i)
  end
  nil
end
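As a rough illustration of the check above, the following dependency-free sketch approximates it with a regular expression instead of Nokogiri, so it is not the library's implementation. `FakePage` and `maintenance_message_sketch` are hypothetical names; only the `body` accessor and the "Maintenance: ..." return format come from the source shown.

```ruby
# Hypothetical stand-in for a Mechanize page: only #body is needed here.
FakePage = Struct.new(:body)

# Approximates find_maintenance_message with a regex instead of Nokogiri.
# Assumption: simple <h1>/<title> elements without nested markup.
def maintenance_message_sketch(page)
  page.body.scan(%r{<(h1|title)\b[^>]*>(.*?)</\1>}mi) do |_tag, text|
    return "Maintenance: #{text}" if text.match?(/maintenance/i)
  end
  nil
end

down = FakePage.new("<html><head><title>Site under maintenance</title></head></html>")
up = FakePage.new("<html><head><title>Planning applications</title></head></html>")

puts maintenance_message_sketch(down).inspect
puts maintenance_message_sketch(up).inspect
```

A caller would typically treat a non-nil result as a reason to stop scraping that site for now rather than as a parsing failure.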
.mechanize_agent(**options) ⇒ Mechanize
Creates a Mechanize agent, configures it through an AgentConfig built from the given options, and stores that config on the agent as @scraper_utils_config.
# File 'lib/scraper_utils/mechanize_utils.rb', line 19
def self.mechanize_agent(**)
  agent = Mechanize.new
  config = AgentConfig.new(**)
  config.configure_agent(agent)
  agent.instance_variable_set(:@scraper_utils_config, config)
  agent
end
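The build, configure, attach sequence above can be sketched without Mechanize. `FakeAgent`, `FakeConfig`, `build_agent_sketch`, and the `example_option` keyword are hypothetical stand-ins rather than part of the library; the sketch only shows how the config object travels with the agent and how it could be read back.

```ruby
# Hypothetical stand-ins; the real classes are Mechanize and AgentConfig.
class FakeAgent; end

class FakeConfig
  attr_reader :options

  def initialize(**options)
    @options = options
  end

  def configure_agent(agent)
    # The real AgentConfig would adjust the agent's settings here.
    agent
  end
end

# Same shape as mechanize_agent: build, configure, attach the config, return.
def build_agent_sketch(**options)
  agent = FakeAgent.new
  config = FakeConfig.new(**options)
  config.configure_agent(agent)
  agent.instance_variable_set(:@scraper_utils_config, config)
  agent
end

# The option name is illustrative only, not a documented AgentConfig option.
agent = build_agent_sketch(example_option: true)
config = agent.instance_variable_get(:@scraper_utils_config)
puts config.options.inspect
```

The sketch names its keyword arguments (`**options`) so that it also runs on Ruby versions older than 3.2; the anonymous `**` forwarding in the source listing requires Ruby 3.2 or later.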
.public_headers(agent = nil, force: false) ⇒ String?
Retrieves the headers that make it through the proxy by requesting HEADERS_ECHO_URL. The result is cached at module level; pass force: true to discard the cached value and fetch again. Returns nil when no agent is given and nothing is cached.
# File 'lib/scraper_utils/mechanize_utils.rb', line 66
def self.public_headers(agent = nil, force: false)
  @public_headers = nil if force
  @public_headers ||= begin
    response = agent&.get(HEADERS_ECHO_URL)
    response&.body&.strip
  end
  @public_headers
end
.public_ip(agent = nil, force: false) ⇒ String?
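Retrieves the public IP address by requesting PUBLIC_IP_URL. The result is cached at module level; pass force: true to discard the cached value and fetch again. Returns nil when no agent is given and nothing is cached.
# File 'lib/scraper_utils/mechanize_utils.rb', line 52
def self.public_ip(agent = nil, force: false)
  @public_ip = nil if force
  @public_ip ||= begin
    response = agent&.get(PUBLIC_IP_URL)
    response&.body&.strip
  end
  @public_ip
end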
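Both public_headers and public_ip follow the same memoize-with-force pattern. The sketch below reproduces that pattern against a counting stub to show when a request is actually made. `LookupSketch`, `CountingAgent`, and `FakeResponse` are hypothetical names, and the address returned by the stub is a documentation-range placeholder, not real output.

```ruby
# Hypothetical stand-ins for a Mechanize agent and its response.
FakeResponse = Struct.new(:body)

class CountingAgent
  attr_reader :requests

  def initialize
    @requests = 0
  end

  def get(_url)
    @requests += 1
    FakeResponse.new(" 203.0.113.7\n") # placeholder address, not real output
  end
end

module LookupSketch
  PUBLIC_IP_URL = "https://whatismyip.akamai.com/"

  # Mirrors the structure of public_ip: memoize, and clear the memo on force.
  def self.public_ip(agent = nil, force: false)
    @public_ip = nil if force
    @public_ip ||= begin
      response = agent&.get(PUBLIC_IP_URL)
      response&.body&.strip
    end
    @public_ip
  end
end

agent = CountingAgent.new
LookupSketch.public_ip                      # nil agent: nothing fetched, nothing cached
LookupSketch.public_ip(agent)               # first real lookup
LookupSketch.public_ip(agent)               # served from the cache
LookupSketch.public_ip(agent, force: true)  # cache cleared, fetched again
puts agent.requests
```

Because the cache lives on the module rather than on the agent, a value fetched through one agent is returned for any later agent until force: true is passed. A nil result is never cached, so a later call with a real agent still triggers a lookup.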
.using_proxy?(agent) ⇒ Boolean
Returns whether the Mechanize agent is using a proxy, that is, whether its proxy URI is set.
# File 'lib/scraper_utils/mechanize_utils.rb', line 28
def self.using_proxy?(agent)
  !agent.agent.proxy_uri.nil?
end
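The check above reads the proxy URI from the agent's underlying HTTP layer. A minimal sketch of the same predicate, using hypothetical structs (`FakeHttpAgent`, `FakeMechanize`) that imitate only that shape, and a made-up proxy address:

```ruby
require "uri"

# Hypothetical stand-ins: the outer object exposes #agent, which carries
# #proxy_uri, matching the call chain used by using_proxy?.
FakeHttpAgent = Struct.new(:proxy_uri)
FakeMechanize = Struct.new(:agent)

def using_proxy_sketch?(agent)
  !agent.agent.proxy_uri.nil?
end

direct = FakeMechanize.new(FakeHttpAgent.new(nil))
proxied = FakeMechanize.new(FakeHttpAgent.new(URI("http://proxy.example:8080")))

puts using_proxy_sketch?(direct)
puts using_proxy_sketch?(proxied)
```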