Module: ScraperUtils::MechanizeUtils

Defined in:
lib/scraper_utils/mechanize_utils.rb,
lib/scraper_utils/mechanize_utils/agent_config.rb,
lib/scraper_utils/mechanize_utils/adaptive_delay.rb,
lib/scraper_utils/mechanize_utils/robots_checker.rb

Overview

Utilities for configuring and using Mechanize for web scraping

Defined Under Namespace

Classes: AdaptiveDelay, AgentConfig, RobotsChecker

Constant Summary

PUBLIC_IP_URL =
"https://whatismyip.akamai.com/"
HEADERS_ECHO_URL =
"https://httpbin.org/headers"

Class Method Summary

Class Method Details

.find_maintenance_message(page) ⇒ String?

Checks whether a page indicates maintenance mode

Parameters:

  • page (Mechanize::Page)

    The web page to check

Returns:

  • (String, nil)

    Maintenance message if found, otherwise nil



# File 'lib/scraper_utils/mechanize_utils.rb', line 36

def self.find_maintenance_message(page)
  # Use Nokogiri for parsing because earlier versions of Mechanize
  # do not support the .search method on page objects
  doc = Nokogiri::HTML(page.body)
  doc.css("h1, title").each do |element|
    text = element.inner_text
    return "Maintenance: #{text}" if text&.match?(/maintenance/i)
  end
  nil
end
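A minimal sketch of calling this check, using plain regexes in place of Nokogiri so it runs standalone. The Page struct below is a hypothetical stand-in exposing #body, mirroring the Mechanize::Page interface the real method relies on; the scanning logic follows the documented implementation.

```ruby
# Stand-in for Mechanize::Page: only #body is needed here.
Page = Struct.new(:body)

# Simplified version of find_maintenance_message: scan h1 and title
# elements for the word "maintenance" (case-insensitive).
def find_maintenance_message(page)
  page.body.scan(%r{<(?:h1|title)[^>]*>(.*?)</(?:h1|title)>}mi) do |(text)|
    return "Maintenance: #{text}" if text&.match?(/maintenance/i)
  end
  nil
end

down_page = Page.new("<html><title>Site under maintenance</title></html>")
ok_page   = Page.new("<html><title>Planning Applications</title></html>")

puts find_maintenance_message(down_page) # => "Maintenance: Site under maintenance"
puts find_maintenance_message(ok_page).inspect # => nil
```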

.mechanize_agent(**options) ⇒ Mechanize

Creates and configures a Mechanize agent

Parameters:

  • timeout (Integer, nil)

    Timeout for agent connections (default: 60)

  • compliant_mode (Boolean, nil)

    Comply with headers and robots.txt (default: true)

  • random_delay (Integer, nil)

    Average random delay in seconds (default: 3)

  • max_load (Float, nil)

Maximum server load percentage (nil = no delay; default: 20%). When compliant_mode is true, max_load is capped at 33%

  • disable_ssl_certificate_check (Boolean, nil)

    Skip SSL verification (default: false)

  • australian_proxy (Boolean, nil)

    Use proxy if available (default: false)

  • user_agent (String, nil)

User agent string to set on the Mechanize agent

Returns:

  • (Mechanize)

    Configured Mechanize agent



# File 'lib/scraper_utils/mechanize_utils.rb', line 19

def self.mechanize_agent(**options)
  agent = Mechanize.new
  config = AgentConfig.new(**options)
  config.configure_agent(agent)
  agent.instance_variable_set(:@scraper_utils_config, config)
  agent
end
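The configure-and-attach pattern above can be sketched without the gems: build the agent, apply a config object, then stash the config on the agent so later helpers can read it back via instance_variable_get. FakeAgent and FakeAgentConfig below are hypothetical stand-ins; the real classes are Mechanize and ScraperUtils::MechanizeUtils::AgentConfig, and the real option set is the one documented above.

```ruby
# Minimal stand-in for a Mechanize agent.
class FakeAgent
  attr_accessor :open_timeout, :read_timeout
end

# Minimal stand-in for AgentConfig: holds options and applies them.
class FakeAgentConfig
  def initialize(timeout: 60, compliant_mode: true)
    @timeout = timeout
    @compliant_mode = compliant_mode
  end

  def configure_agent(agent)
    agent.open_timeout = @timeout
    agent.read_timeout = @timeout
  end
end

def build_agent(**options)
  agent = FakeAgent.new
  config = FakeAgentConfig.new(**options)
  config.configure_agent(agent)
  # Keep the config retrievable from the agent, as mechanize_agent does.
  agent.instance_variable_set(:@scraper_utils_config, config)
  agent
end

agent = build_agent(timeout: 30)
puts agent.open_timeout # => 30
puts agent.instance_variable_get(:@scraper_utils_config).class # => FakeAgentConfig
```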

.public_headers(agent = nil, force: false) ⇒ String?

Retrieves and logs the HTTP headers that make it through the proxy

Parameters:

  • agent (Mechanize, nil) (defaults to: nil)

Mechanize agent to use for the headers lookup, or nil when only clearing the cache

  • force (Boolean) (defaults to: false)

Force a new headers lookup by clearing the cache first

Returns:

  • (String, nil)

The headers in JSON format, or nil if no agent was supplied



# File 'lib/scraper_utils/mechanize_utils.rb', line 66

def self.public_headers(agent = nil, force: false)
  @public_headers = nil if force
  @public_headers ||= begin
    response = agent&.get(HEADERS_ECHO_URL)
    response&.body&.strip
  end
  @public_headers
end
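Since the return value is the raw JSON body echoed by httpbin's /headers endpoint, a caller can parse it to see which headers survived the proxy. A hedged sketch, with the agent stubbed so it runs offline; the JSON shape mirrors what https://httpbin.org/headers returns, and StubAgent is a hypothetical stand-in for a configured Mechanize agent.

```ruby
require "json"

Response  = Struct.new(:body)
# Stand-in agent: #get returns a canned response instead of hitting the network.
StubAgent = Struct.new(:payload) do
  def get(_url)
    Response.new(payload)
  end
end

echo = { "headers" => { "User-Agent" => "Mozilla/5.0 (compatible)", "Accept" => "*/*" } }
agent = StubAgent.new(JSON.generate(echo))

# Mirrors public_headers: fetch the echo URL and return the stripped body.
raw = agent.get("https://httpbin.org/headers").body&.strip
headers = JSON.parse(raw)["headers"]
puts headers["User-Agent"] # => "Mozilla/5.0 (compatible)"
```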

.public_ip(agent = nil, force: false) ⇒ String?

Retrieves and logs the public IP address

Parameters:

  • agent (Mechanize, nil) (defaults to: nil)

Mechanize agent to use for the IP lookup, or nil when only clearing the cache

  • force (Boolean) (defaults to: false)

Force a new IP lookup by clearing the cache first

Returns:

  • (String, nil)

The public IP address, or nil if no agent was supplied



# File 'lib/scraper_utils/mechanize_utils.rb', line 52

def self.public_ip(agent = nil, force: false)
  @public_ip = nil if force
  @public_ip ||= begin
    response = agent&.get(PUBLIC_IP_URL)
    response&.body&.strip
  end
  @public_ip
end
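The memoise-and-force behaviour above can be demonstrated with a stub agent that counts fetches: the first call fetches and caches, later calls reuse the cache, force: true clears it first, and passing agent = nil with force: true just clears the cache without fetching. CountingAgent is a hypothetical stand-in; the caching logic below reimplements the documented method.

```ruby
# Stand-in agent that records how many times #get was called.
class CountingAgent
  Response = Struct.new(:body)
  attr_reader :calls

  def initialize(body)
    @body = body
    @calls = 0
  end

  def get(_url)
    @calls += 1
    Response.new(@body)
  end
end

# Reimplementation of public_ip's caching logic, for demonstration.
def public_ip(agent = nil, force: false)
  @public_ip = nil if force
  @public_ip ||= begin
    response = agent&.get("https://whatismyip.akamai.com/")
    response&.body&.strip
  end
  @public_ip
end

agent = CountingAgent.new(" 203.0.113.7\n")
puts public_ip(agent)       # => "203.0.113.7"
puts public_ip(agent)       # cached: still "203.0.113.7"
puts agent.calls            # => 1
public_ip(nil, force: true) # clears the cache without fetching
puts public_ip(agent)       # => "203.0.113.7" (fetched again)
puts agent.calls            # => 2
```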

.using_proxy?(agent) ⇒ Boolean

Returns whether the Mechanize agent is using the proxy

Returns:

  • (Boolean)


# File 'lib/scraper_utils/mechanize_utils.rb', line 28

def self.using_proxy?(agent)
  !agent.agent.proxy_uri.nil?
end
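The check simply asks whether the underlying HTTP agent has a proxy URI set (agent.agent.proxy_uri). A sketch with nested Structs standing in for the real Mechanize object graph:

```ruby
require "uri"

# Inner mimics Mechanize::HTTP::Agent (holds proxy_uri);
# Outer mimics Mechanize itself (exposes #agent).
Inner = Struct.new(:proxy_uri)
Outer = Struct.new(:agent)

def using_proxy?(agent)
  !agent.agent.proxy_uri.nil?
end

direct  = Outer.new(Inner.new(nil))
proxied = Outer.new(Inner.new(URI("http://proxy.example.com:8888")))

puts using_proxy?(direct)  # => false
puts using_proxy?(proxied) # => true
```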