Mechanize Utilities
This document provides detailed information about the Mechanize utilities provided by ScraperUtils.
MechanizeUtils
The ScraperUtils::MechanizeUtils module provides utilities for configuring and using Mechanize for web scraping.
Creating a Mechanize Agent
agent = ScraperUtils::MechanizeUtils.mechanize_agent(**client_options)
Configuration Options
Add client_options to your AUTHORITIES configuration and move any of the following settings into it:
- timeout: Integer - Timeout for agent connections in case the server is slower than normal
- australian_proxy: true - Use the proxy URL in the MORPH_AUSTRALIAN_PROXY env variable if the site is geo-locked
- disable_ssl_certificate_check: true - Disables SSL verification for old / incorrect certificates
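For example, an authority entry might look like this (the authority name, URL and other keys are illustrative; only the client_options key is significant here):

AUTHORITIES = {
  example_city: {
    url: "https://example.com/planning-applications",  # illustrative URL
    client_options: {
      timeout: 90,             # the server is slower than normal
      australian_proxy: true   # the site is geo-locked to Australia
    }
  }
}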
Then adjust your code to accept client_options and pass them through to:
ScraperUtils::MechanizeUtils.mechanize_agent(**(client_options || {}))
to receive a Mechanize::Agent configured accordingly.
The agent returned is configured using Mechanize hooks to implement the desired delays automatically.
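A minimal sketch of that wiring, assuming a scrape_authority helper that receives the url and client_options from the authority's configuration (the method name and parsing step are illustrative):

def scrape_authority(url, client_options)
  agent = ScraperUtils::MechanizeUtils.mechanize_agent(**(client_options || {}))
  page = agent.get(url)  # delays and other configured behaviour are applied by the agent's hooks
  # ... parse planning application records from page ...
end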
Default Configuration
By default, the Mechanize agent is configured with the settings below; each default can be overridden via the corresponding env variable or in code.
Note: compliant mode caps max_load at 50.
ScraperUtils::MechanizeUtils::AgentConfig.configure do |config|
config.default_timeout = ENV.fetch('MORPH_TIMEOUT', DEFAULT_TIMEOUT).to_i # 60
config.default_compliant_mode = ENV.fetch('MORPH_NOT_COMPLIANT', nil).to_s.empty? # true
config.default_random_delay = ENV.fetch('MORPH_RANDOM_DELAY', DEFAULT_RANDOM_DELAY).to_i # 0
config.default_max_load = ENV.fetch('MORPH_MAX_LOAD', DEFAULT_MAX_LOAD).to_f # 50.0
config.default_disable_ssl_certificate_check = !ENV.fetch('MORPH_DISABLE_SSL_CHECK', nil).to_s.empty? # false
config.default_australian_proxy = !ENV.fetch('MORPH_USE_PROXY', nil).to_s.empty? # false
config.default_user_agent = ENV.fetch('MORPH_USER_AGENT', nil) # Uses Mechanize user agent
end
For full details, see the MechanizeUtils class documentation.
MechanizeActions
The ScraperUtils::MechanizeActions class provides a convenient way to execute a series of actions (like clicking links, filling forms) on a Mechanize page.
Action Format
actions = [
  [:click, "Find an application"],
  [:click, ["Submitted Last 28 Days", "Submitted Last 7 Days"]],
  [:block, ->(page, args, agent, results) { [new_page, result_data] }]
]

processor = ScraperUtils::MechanizeActions.new(agent)
result_page = processor.process(page, actions)
Supported Actions
- :click - Clicks on a link or element matching the provided selector
- :block - Executes a custom block of code for complex scenarios
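For example, a :block action can drive Mechanize directly when a plain click is not enough. The lambda receives (page, args, agent, results) and must return [new_page, result_data]; the form and field names below are purely illustrative:

actions = [
  [:block, ->(page, _args, _agent, _results) {
    form = page.form_with(name: "searchForm")  # hypothetical form name
    form["dateFrom"] = "2022-01-01"            # hypothetical field name
    next_page = form.submit
    [next_page, { submitted: true }]           # the new page plus any result data
  }]
]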
Selector Types
- Text selector (default): "Find an application"
- CSS selector: "css:.button"
- XPath selector: "xpath://a[@class='button']"
Replacements
You can use replacements in your action parameters:
replacements = { FROM_DATE: "2022-01-01", TO_DATE: "2022-03-01" }
processor = ScraperUtils::MechanizeActions.new(agent, replacements)
# Use replacements in actions
actions = [
  [:click, "Search between {FROM_DATE} and {TO_DATE}"]
]
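Pass the page and actions to process as before; the {FROM_DATE} and {TO_DATE} placeholders are replaced with the configured values when each action is executed:

result_page = processor.process(page, actions)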
For full details, see the MechanizeActions class documentation.