Class: ScraperUtils::MechanizeUtils::RobotsChecker
- Inherits: Object
- Defined in: lib/scraper_utils/mechanize_utils/robots_checker.rb
Overview
robots.txt checker with deliberately simplistic rules
Instance Attribute Summary
- #user_agent ⇒ String (readonly)
  Lowercased user_agent for matching.
Instance Method Summary
- #crawl_delay ⇒ Integer?
  Returns the crawl delay (if any) that applied to the last URL checked. Should be called after #disallowed? to get the relevant delay.
- #disallowed?(url) ⇒ Boolean
  Check if a URL is disallowed based on robots.txt rules specific to our user agent.
- #initialize(user_agent) ⇒ RobotsChecker (constructor)
  Initialize with a full user agent string like “Mozilla/5.0 (compatible; ScraperUtils/0.1.0 2025-02-22; +github.com/ianheggie-oaf/scraper_utils)”. Extracts the bot name (e.g. “ScraperUtils”) to check against robots.txt. Checks for Disallow rules under User-agent: bot_name, and Crawl-delay from either User-agent: bot_name or * (default).
Constructor Details
#initialize(user_agent) ⇒ RobotsChecker
Initialize with a full user agent string like “Mozilla/5.0 (compatible; ScraperUtils/0.1.0 2025-02-22; +github.com/ianheggie-oaf/scraper_utils)”. Extracts the bot name (e.g. “ScraperUtils”) to check against robots.txt. Checks for:
- Disallow for User-agent: bot_name, and
- Crawl-delay from either User-agent: bot_name or * (default)
# File 'lib/scraper_utils/mechanize_utils/robots_checker.rb', line 16

def initialize(user_agent)
  @user_agent = extract_user_agent(user_agent).downcase
  if DebugUtils.basic?
    ScraperUtils::LogUtils.log(
      "Checking robots.txt for user agent prefix: #{@user_agent} (case insensitive)"
    )
  end
  @rules = {} # domain -> {rules: [], delay: int}
  @delay = nil # Delay from last robots.txt check
end
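As a minimal usage sketch (the require line and the exact value of the reader are assumptions; the agent string is the example from the docstring above), the checker is built from the full user agent string and derives the lowercased bot-name prefix used for matching:

require "scraper_utils" # assumed entry point for the gem

agent_string = "Mozilla/5.0 (compatible; ScraperUtils/0.1.0 2025-02-22; " \
               "+github.com/ianheggie-oaf/scraper_utils)"
checker = ScraperUtils::MechanizeUtils::RobotsChecker.new(agent_string)
checker.user_agent # lowercased bot-name prefix extracted from the string above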
Instance Attribute Details
#user_agent ⇒ String (readonly)
Returns the lowercased user_agent used for matching.
# File 'lib/scraper_utils/mechanize_utils/robots_checker.rb', line 8

def user_agent
  @user_agent
end
Instance Method Details
#crawl_delay ⇒ Integer?
Returns the crawl delay (if any) that applied to the last URL checked. Should be called after #disallowed? to get the relevant delay.
# File 'lib/scraper_utils/mechanize_utils/robots_checker.rb', line 51

def crawl_delay
  @delay
end
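A sketch of the call order this documentation describes (check a URL first, then read any delay that applied); the URL and the sleep call are illustrative assumptions, not part of this class:

url = "https://example.com/planning/applications" # illustrative URL

unless checker.disallowed?(url)
  delay = checker.crawl_delay # seconds from the matching robots.txt entry, or nil
  sleep(delay) if delay
  # ... fetch and process the page here
end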
#disallowed?(url) ⇒ Boolean
Check if a URL is disallowed based on robots.txt rules specific to our user agent.
# File 'lib/scraper_utils/mechanize_utils/robots_checker.rb', line 30

def disallowed?(url)
  return false unless url

  uri = URI(url)
  domain = "#{uri.scheme}://#{uri.host}"
  path = uri.path || "/"

  # Get or fetch robots.txt rules
  rules = get_rules(domain)
  return false unless rules # If we can't get robots.txt, assume allowed

  # Store any delay found for this domain
  @delay = rules[:our_delay]

  # Check rules specific to our user agent
  matches_any_rule?(path, rules[:our_rules])
end
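A sketch of using this check to guard a fetch loop; the Mechanize agent, URL list, and log message are assumptions for illustration (note the default-allow behaviour above when robots.txt cannot be fetched):

require "mechanize"

agent = Mechanize.new
urls = ["https://example.com/", "https://example.com/private/report"] # illustrative

urls.each do |url|
  if checker.disallowed?(url)
    ScraperUtils::LogUtils.log("Skipping #{url} (disallowed by robots.txt)")
    next
  end
  page = agent.get(url)
  # ... process page ...
end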