scraper-central-ruby

Ruby library to scrape pages and cache the resulting data.

Usage

Add the scraper-central-ruby gem to your Gemfile, then install it:

gem 'scraper-central-ruby', git: '[email protected]:patterninc/scraper-central-ruby.git', tag: 'v2.0.1'
bundle install

Add the following environment variables to the service where this gem is used:

SERVER_URL_GET_CACHE=<base URL of the get-cache server>
SERVER_URL_PUT_CACHE=<base URL of the put-cache server>
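
For example, a development configuration might look like the following (the hostnames below are hypothetical placeholders, not real endpoints; use the URLs for your own Scraper Central deployment):

```
SERVER_URL_GET_CACHE=https://cache.example.internal/get
SERVER_URL_PUT_CACHE=https://cache.example.internal/put
```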

New ScraperCentral Object

Use ScraperCentral.new to create a new instance:

scraper_central = ScraperCentral.new

Options

Customize the ScraperCentral instance using the provided options:

scraper_central = ScraperCentral.new
scraper_central.timeout = 45
scraper_central.tls_verify = false
scraper_central.enable_js = true
scraper_central.proxy_name = "ProxyName"
scraper_central.retry_attr = {
  count: 5,
  wait_time: 5,
  max_wait_time: 5
}

Customizing Headers and Query Parameters

To add custom headers and query parameters to your requests:

scraper_central = ScraperCentral.new
scraper_central.query_params = {
  "query": "value",
  "page_wait": "500"
}

scraper_central.headers = {
  "X-Custom-Header": "value",
  "mime-type": "text/html",
  "Encoding": "gzip",
  "ContentType": "application/json"
}

Cookies

To attach cookies to requests:

scraper_central = ScraperCentral.new
scraper_central.cookies = '[
  {
    "Name": "amazon-cookie",
    "Value": "Some session value",
    "Path": "/",
    "Domain": "amazon.com",
    "MaxAge": 36000,
    "HttpOnly": true,
    "Secure": false
  },
  {
    "Name": "walmart-cookie",
    "Value": "Some session value",
    "Path": "/",
    "Domain": "walmart.com",
    "MaxAge": 72000,
    "HttpOnly": false,
    "Secure": true
  }
]'

S3 Key Structure

To generate the S3 cache key, set the following attributes:

scraper_central = ScraperCentral.new

scraper_central.s3_key = {
  country: "US",
  marketplace: "Amazon",
  page_type: "detail-page",
  identifier: "B0BQZBPS4G",
  page_number: 1
}
scraper_central.cache_duration = 360

Fetch Content and Headers

Fetch the URL through the configured proxy:

scraper_central = ScraperCentral.new
scraper_central.proxy_name = "BrightData"
scraper_central.cache_duration = 360
scraper_central.retry_attr = {
  count: 5
}
scraper_central.s3_key = {
  country: "US",
  marketplace: "Amazon",
  page_type: "product-question",
  identifier: "B0BQZBPS4G",
  page_number: 1
}
scraper_central.auth_config = {
  client_id: 'client_id token',
  client_secret: 'client_secret token'
}

response = scraper_central.fetch("https://example.com")

...

puts "Response: ", response.body
puts "Status Code: ", response.code
puts "Headers: ", response.headers

Documentation

Configuration Functions

  • scraper_central.proxy_name=: Sets the proxy service name, e.g. CrawlBase, BrightData, ScraperApi.
  • scraper_central.enable_js=: Enables or disables JavaScript execution for proxies.
  • scraper_central.retry_attr=: Configures retry logic: the number of attempts, the wait time between attempts, and the maximum wait time.
  • scraper_central.timeout=: Sets the request timeout in seconds.
  • scraper_central.tls_verify=: Enables or disables TLS certificate verification.
  • scraper_central.enable_image_cache=: Enables or disables image caching (true/false).
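
For instance, image caching is not shown in the earlier examples; it can be toggled alongside the other options. This is a configuration sketch only and assumes the gem is installed as described above:

```ruby
scraper_central = ScraperCentral.new
scraper_central.enable_image_cache = true  # cache images fetched through the proxy
scraper_central.timeout = 45               # request timeout in seconds
scraper_central.tls_verify = true          # verify TLS certificates
```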

Proxy Methods

  • scraper_central.query_params=: Sets query parameters to be appended to each request URL.

  • scraper_central.headers=: Adds custom headers to requests, e.g. Accept, Accept-Encoding, or Content-Type.

  • scraper_central.cookies=: Parses a JSON string of cookies and sets them for subsequent requests.

Auth Configuration Method

  • scraper_central.auth_config=: Sets the client_id and client_secret used to generate an auth token, which the gem then sends in the headers of each request to the Scraper Central APIs.

S3 Key Methods

  • scraper_central.s3_key=: Sets the S3 key structure: country (also the proxy's target country), marketplace, page_type (a predefined name for the page view, e.g. detail-page), identifier (the page identifier, e.g. ASIN or product ID), and page_number (if the page is paginated).
  • scraper_central.cache_duration=: Sets the maximum age of the cached object stored in the S3 bucket.

Get Content

  • response = fetch(url): Makes a configured HTTP request to the specified URL and returns the response object.