Class: ScraperUtils::MechanizeUtils::RobotsChecker

Inherits:
Object
Defined in:
lib/scraper_utils/mechanize_utils/robots_checker.rb

Overview

robots.txt checker with deliberately simplistic rules

Instance Attribute Summary

Instance Method Summary

Constructor Details

#initialize(user_agent) ⇒ RobotsChecker

Initialize with a full user agent string like: “Mozilla/5.0 (compatible; ScraperUtils/0.1.0 2025-02-22; +github.com/ianheggie-oaf/scraper_utils)”. Extracts the bot name (e.g. “ScraperUtils”) to check against robots.txt. Checks for:

  • Disallow rules for User-agent: bot_name, and

  • Crawl-delay from either User-agent: bot_name or * (default)



# File 'lib/scraper_utils/mechanize_utils/robots_checker.rb', line 16

def initialize(user_agent)
  @user_agent = extract_user_agent(user_agent).downcase
  if DebugUtils.basic?
    ScraperUtils::LogUtils.log(
      "Checking robots.txt for user agent prefix: #{@user_agent} (case insensitive)"
    )
  end
  @rules = {} # domain -> {rules: [], delay: int}
  @delay = nil # Delay from last robots.txt check
end
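
For illustration, a minimal sketch of constructing the checker with the user agent string from the description above. The full class name comes from this page's title; the top-level require path is an assumption based on the gem's lib/ layout.

require "scraper_utils"

# Full user agent string; the checker extracts the bot name ("ScraperUtils")
# and matches it case-insensitively against robots.txt User-agent lines.
user_agent = "Mozilla/5.0 (compatible; ScraperUtils/0.1.0 2025-02-22; " \
             "+github.com/ianheggie-oaf/scraper_utils)"

checker = ScraperUtils::MechanizeUtils::RobotsChecker.new(user_agent)
checker.user_agent # => lowercased bot-name prefix (e.g. "scraperutils")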

Instance Attribute Details

#user_agent ⇒ String (readonly)

Returns the lowercased user_agent used for matching.

Returns:

  • (String)

    Lowercased user_agent for matching



# File 'lib/scraper_utils/mechanize_utils/robots_checker.rb', line 8

def user_agent
  @user_agent
end

Instance Method Details

#crawl_delay ⇒ Integer?

Returns the crawl delay (if any) that applied to the last URL checked. Should be called after disallowed? to get the relevant delay.

Returns:

  • (Integer, nil)

    The delay in seconds, or nil if no delay specified



# File 'lib/scraper_utils/mechanize_utils/robots_checker.rb', line 51

def crawl_delay
  @delay
end
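
A hedged sketch of the intended call order: the delay is captured while disallowed? fetches the domain's robots.txt, so read it back immediately afterwards (the URL here is hypothetical).

checker.disallowed?("https://example.com/planning/applications")
checker.crawl_delay # => delay in seconds from robots.txt, or nil if none given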

#disallowed?(url) ⇒ Boolean

Check if a URL is disallowed based on robots.txt rules specific to our user agent

Parameters:

  • url (String)

    The full URL to check

Returns:

  • (Boolean)

    true if specifically blocked for our user agent, otherwise false



# File 'lib/scraper_utils/mechanize_utils/robots_checker.rb', line 30

def disallowed?(url)
  return false unless url

  uri = URI(url)
  domain = "#{uri.scheme}://#{uri.host}"
  path = uri.path || "/"

  # Get or fetch robots.txt rules
  rules = get_rules(domain)
  return false unless rules # If we can't get robots.txt, assume allowed

  # Store any delay found for this domain
  @delay = rules[:our_delay]

  # Check rules specific to our user agent
  matches_any_rule?(path, rules[:our_rules])
end
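
Putting both methods together, a sketch of guarding a scrape loop. It assumes the checker from the constructor example above, a hypothetical urls array of pages to visit, and a standard Mechanize agent for fetching (Mechanize itself is implied by the module name but not documented on this page).

require "mechanize"

agent = Mechanize.new
urls.each do |url|                   # urls: hypothetical list of pages to fetch
  next if checker.disallowed?(url)   # skip URLs blocked for our user agent

  if (delay = checker.crawl_delay)   # delay recorded for this URL's domain
    sleep(delay)                     # respect Crawl-delay when one was given
  end
  page = agent.get(url)
  # ... process page ...
end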