Class: ScraperUtils::MechanizeUtils::RobotsChecker

Inherits:
Object
Defined in:
lib/scraper_utils/mechanize_utils/robots_checker.rb

Overview

robots.txt checker with deliberately simplistic rules

Instance Attribute Summary

Instance Method Summary

Constructor Details

#initialize(user_agent) ⇒ RobotsChecker

Initialize with a full user agent string like: “Mozilla/5.0 (compatible; ScraperUtils/0.1.0 2025-02-22; +github.com/ianheggie-oaf/scraper_utils)”. Extracts the bot name (e.g. “ScraperUtils”) to check against robots.txt. Checks for:

  • Disallow rules for User-agent: bot_name, and

  • Crawl-delay from either User-agent: bot_name or * (default)



# File 'lib/scraper_utils/mechanize_utils/robots_checker.rb', line 16

def initialize(user_agent)
  @user_agent = extract_user_agent(user_agent).downcase
  if DebugUtils.basic?
    ScraperUtils::LogUtils.log(
      "Checking robots.txt for user agent prefix: #{@user_agent} (case insensitive)"
    )
  end
  @rules = {} # domain -> {rules: [], delay: int}
  @delay = nil # Delay from last robots.txt check
end
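
For illustration, a minimal sketch of constructing the checker with the user agent string from the description above. The full class name comes from this page's title; the top-level require path is an assumption based on the gem's lib/ layout.

require "scraper_utils"

# Full user agent string; the checker extracts the bot name ("ScraperUtils")
# and matches it case-insensitively against robots.txt User-agent lines.
user_agent = "Mozilla/5.0 (compatible; ScraperUtils/0.1.0 2025-02-22; " \
             "+github.com/ianheggie-oaf/scraper_utils)"

checker = ScraperUtils::MechanizeUtils::RobotsChecker.new(user_agent)
checker.user_agent # => lowercased bot-name prefix (e.g. "scraperutils")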

Instance Attribute Details

#user_agent ⇒ String (readonly)

Returns the lowercased user_agent used for matching.

Returns:

  • (String)

    Lowercased user_agent for matching



# File 'lib/scraper_utils/mechanize_utils/robots_checker.rb', line 8

def user_agent
  @user_agent
end

Instance Method Details

#crawl_delay ⇒ Integer?

Returns the crawl delay (if any) that applied to the last URL checked. Should be called after disallowed? to get the relevant delay.

Returns:

  • (Integer, nil)

    The delay in seconds, or nil if no delay specified



# File 'lib/scraper_utils/mechanize_utils/robots_checker.rb', line 51

def crawl_delay
  @delay
end
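
A hedged sketch of the intended call order: the delay is captured while disallowed? fetches the domain's robots.txt, so read it back immediately afterwards (the URL here is hypothetical).

checker.disallowed?("https://example.com/planning/applications")
checker.crawl_delay # => delay in seconds from robots.txt, or nil if none given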

#disallowed?(url) ⇒ Boolean

Check if a URL is disallowed based on robots.txt rules specific to our user agent

Parameters:

  • url (String)

    The full URL to check

Returns:

  • (Boolean)

    true if specifically blocked for our user agent, otherwise false



# File 'lib/scraper_utils/mechanize_utils/robots_checker.rb', line 30

def disallowed?(url)
  return false unless url

  uri = URI(url)
  domain = "#{uri.scheme}://#{uri.host}"
  path = uri.path || "/"

  # Get or fetch robots.txt rules
  rules = get_rules(domain)
  return false unless rules # If we can't get robots.txt, assume allowed

  # Store any delay found for this domain
  @delay = rules[:our_delay]

  # Check rules specific to our user agent
  matches_any_rule?(path, rules[:our_rules])
end
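
Putting both methods together, a sketch of guarding a scrape loop. It assumes the checker from the constructor example above, a hypothetical urls array of pages to visit, and a standard Mechanize agent for fetching (Mechanize itself is implied by the module name but not documented on this page).

require "mechanize"

agent = Mechanize.new
urls.each do |url|                   # urls: hypothetical list of pages to fetch
  next if checker.disallowed?(url)   # skip URLs blocked for our user agent

  if (delay = checker.crawl_delay)   # delay recorded for this URL's domain
    sleep(delay)                     # respect Crawl-delay when one was given
  end
  page = agent.get(url)
  # ... process page ...
end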