Class: ScraperUtils::MechanizeUtils::AdaptiveDelay

Inherits:
Object
  • Object
show all
Defined in:
lib/scraper_utils/mechanize_utils/adaptive_delay.rb

Overview

Adapts delays between requests based on server response times. Target delay is proportional to response time based on max_load setting. Uses an exponential moving average to smooth variations in response times.

Constant Summary collapse

DEFAULT_MIN_DELAY =
0.0
DEFAULT_MAX_DELAY =

Presumed default timeout for Mechanize

30.0

Instance Attribute Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(min_delay: DEFAULT_MIN_DELAY, max_delay: DEFAULT_MAX_DELAY, max_load: AgentConfig::DEFAULT_MAX_LOAD) ⇒ AdaptiveDelay

Creates a new adaptive delay calculator

Parameters:

  • min_delay (Float) (defaults to: DEFAULT_MIN_DELAY)

    Minimum delay between requests in seconds

  • max_delay (Float) (defaults to: DEFAULT_MAX_DELAY)

    Maximum delay between requests in seconds

  • max_load (Float) (defaults to: AgentConfig::DEFAULT_MAX_LOAD)

    Maximum load percentage (1..Constants::MAX_LOAD_CAP) we aim to place on the server Lower values are more conservative (e.g., 20% = 4x response time delay)



23
24
25
26
27
28
29
30
31
32
33
34
35
36
# File 'lib/scraper_utils/mechanize_utils/adaptive_delay.rb', line 23

def initialize(min_delay: DEFAULT_MIN_DELAY, max_delay: DEFAULT_MAX_DELAY, max_load: AgentConfig::DEFAULT_MAX_LOAD)
  @delays = {} # domain -> last delay used
  @min_delay = min_delay.to_f
  @max_delay = max_delay.to_f
  @max_load = max_load.to_f.clamp(1.0, AgentConfig::MAX_LOAD_CAP)
  @response_multiplier = (100.0 - @max_load) / @max_load

  return unless DebugUtils.basic?

  ScraperUtils::LogUtils.log(
    "AdaptiveDelay initialized with delays between #{@min_delay} and #{@max_delay} seconds, " \
      "Max_load #{@max_load}% thus response multiplier: #{@response_multiplier.round(2)}x"
  )
end

Instance Attribute Details

#max_delayObject (readonly)

Returns the value of attribute max_delay.



15
16
17
# File 'lib/scraper_utils/mechanize_utils/adaptive_delay.rb', line 15

def max_delay
  @max_delay
end

#max_loadObject (readonly)

Returns the value of attribute max_load.



15
16
17
# File 'lib/scraper_utils/mechanize_utils/adaptive_delay.rb', line 15

def max_load
  @max_load
end

#min_delayObject (readonly)

Returns the value of attribute min_delay.



15
16
17
# File 'lib/scraper_utils/mechanize_utils/adaptive_delay.rb', line 15

def min_delay
  @min_delay
end

Instance Method Details

#delay(uri) ⇒ Float

Returns Current delay for the domain, or min_delay if no delay set.

Parameters:

  • uri (URI::Generic, String)

    URL to get delay for

Returns:

  • (Float)

    Current delay for the domain, or min_delay if no delay set



40
41
42
# File 'lib/scraper_utils/mechanize_utils/adaptive_delay.rb', line 40

def delay(uri)
  @delays[domain(uri)] || @min_delay
end

#next_delay(uri, response_time) ⇒ Float

Returns the next_delay calculated from a smoothed average of response_time to use less than max_load% of server

Parameters:

  • uri (URI::Generic, String)

    URL the response came from

  • response_time (Float)

    Time in seconds the server took to respond

Returns:

  • (Float)

    The calculated delay to use with the next request



49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
# File 'lib/scraper_utils/mechanize_utils/adaptive_delay.rb', line 49

def next_delay(uri, response_time)
  uris_domain = domain(uri)
  # calculate target_delay to achieve desired max_load% using pre-calculated multiplier
  target_delay = (response_time * @response_multiplier).clamp(0.0, @max_delay)
  # Initialise average from initial_response_time rather than zero to start with reasonable approximation
  current_delay = @delays[uris_domain] || target_delay
  # exponential smooth the delay to smooth out wild swings (Equivalent to an RC low pass filter)
  delay = ((3.0 * current_delay) + target_delay) / 4.0
  delay = delay.clamp(@min_delay, @max_delay)

  if DebugUtils.basic?
    ScraperUtils::LogUtils.log(
      "Adaptive delay for #{uris_domain} updated to #{delay.round(2)}s (target: " \
        "#{@response_multiplier.round(1)}x response_time of #{response_time.round(2)}s)"
    )
  end

  @delays[uris_domain] = delay
  delay
end