# scraper-central-ruby

Ruby library to scrape and cache data.
## Usage

Add the `scraper-central-ruby` gem to your `Gemfile`:

```ruby
gem 'scraper-central-ruby', git: '[email protected]:patterninc/scraper-central-ruby.git', tag: 'v2.0.1'
```

Then install it:

```shell
bundle install
```
Add the following environment variables to the service where this gem is used:

```shell
SERVER_URL_GET_CACHE=<server base URL for GET cache>
SERVER_URL_PUT_CACHE=<server base URL for PUT cache>
```
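Since the gem reads these at runtime, it can help to verify them when the service boots. A minimal sketch in plain Ruby (the variable names come from this README; the check itself is illustrative, not part of the gem):

```ruby
# Environment variables this gem expects, per the README.
REQUIRED_CACHE_ENV = %w[SERVER_URL_GET_CACHE SERVER_URL_PUT_CACHE].freeze

# Returns the required keys that are absent or blank in +env+.
def missing_cache_env(env = ENV)
  REQUIRED_CACHE_ENV.select { |key| env[key].to_s.strip.empty? }
end

# At boot: raise early instead of failing on the first scrape.
# missing = missing_cache_env
# raise "Missing env vars: #{missing.join(', ')}" unless missing.empty?
```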
## New ScraperCentral Object

Use `ScraperCentral.new` to create a new instance:

```ruby
scraper_central = ScraperCentral.new
```
## Options

Customize the `ScraperCentral` instance using the provided options:

```ruby
scraper_central = ScraperCentral.new

scraper_central.timeout = 45
scraper_central.tls_verify = false
scraper_central.enable_js = true
scraper_central.proxy_name = "ProxyName"
scraper_central.retry_attr = {
  count: 5,
  wait_time: 5,
  max_wait_time: 5
}
```
## Customizing Headers and Query Parameters

To add custom headers and query parameters to your requests:

```ruby
scraper_central = ScraperCentral.new

scraper_central.query_params = {
  "query": "value",
  "page_wait": "500"
}

scraper_central.headers = {
  "X-Custom-Header": "value",
  "mime-type": "text/html",
  "Encoding": "gzip",
  "ContentType": "application/json"
}
```
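Query parameters are appended to each request URL. How the gem does this internally is not shown here, but the effect can be sketched with the Ruby standard library's `URI` module:

```ruby
require "uri"

# Illustrative only: appends params to a URL's query string,
# preserving any parameters already present.
def with_query_params(url, params)
  uri = URI(url)
  existing = uri.query ? URI.decode_www_form(uri.query) : []
  uri.query = URI.encode_www_form(existing + params.to_a)
  uri.to_s
end
```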
## Cookies

Customizing cookies (`cookies=` takes a JSON string):

```ruby
scraper_central = ScraperCentral.new

scraper_central.cookies = <<~JSON
  [
    {
      "Name": "amazon-cookie",
      "Value": "Some session value",
      "Path": "/",
      "Domain": "amazon.com",
      "MaxAge": 36000,
      "HttpOnly": true,
      "Secure": false
    },
    {
      "Name": "walmart-cookie",
      "Value": "Some session value",
      "Path": "/",
      "Domain": "walmart.com",
      "MaxAge": 72000,
      "HttpOnly": false,
      "Secure": true
    }
  ]
JSON
```
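Rather than hand-writing the JSON string, you can build it from Ruby hashes and serialize with the standard library. Plain Ruby; only the cookie field names come from this README:

```ruby
require "json"

# Build the cookie payload as Ruby data, then serialize to the
# JSON string that cookies= expects.
cookies = [
  {
    "Name"     => "amazon-cookie",
    "Value"    => "Some session value",
    "Path"     => "/",
    "Domain"   => "amazon.com",
    "MaxAge"   => 36_000,
    "HttpOnly" => true,
    "Secure"   => false
  }
]

cookie_json = JSON.generate(cookies)
# scraper_central.cookies = cookie_json
```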
## S3 Key Structure

To generate the S3 key, set the following attributes:

```ruby
scraper_central = ScraperCentral.new

scraper_central.s3_key = {
  country: "US",
  marketplace: "Amazon",
  page_type: "detail-page",
  identifier: "B0BQZBPS4G",
  page_number: 1
}

scraper_central.cache_duration = 360
```
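The attributes above identify one cached page. As an illustration only (the gem's real key format is not documented here and may differ), they could compose into a path-style key:

```ruby
# Hypothetical key layout: country/marketplace/page_type/identifier/page_number.
# An assumption for illustration, not the gem's actual format.
def s3_key_path(country:, marketplace:, page_type:, identifier:, page_number:)
  [country, marketplace, page_type, identifier, page_number].join("/")
end
```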
## Fetch Content and Headers

Fetches the URL through the proxy:

```ruby
scraper_central = ScraperCentral.new

scraper_central.proxy_name = "BrightData"
scraper_central.cache_duration = 360
scraper_central.retry_attr = {
  count: 5
}
scraper_central.s3_key = {
  country: "US",
  marketplace: "Amazon",
  page_type: "product-question",
  identifier: "B0BQZBPS4G",
  page_number: 1
}
scraper_central.auth_config = {
  client_id: 'client_id token',
  client_secret: 'client_secret token'
}

response = scraper_central.fetch("https://example.com")
# ...
puts "Response: ", response.body
puts "Status Code: ", response.code
puts "Headers: ", response.headers
```
## Documentation

### Configuration Functions

- `scraper_central.proxy_name=`: Sets the proxy service name, e.g. `CrawlBase`, `BrightData`, `ScraperApi`.
- `scraper_central.enable_js=`: Enables or disables JavaScript execution for proxies.
- `scraper_central.retry_attr=`: Configures retry logic, including the number of attempts and the wait time between attempts.
- `scraper_central.timeout=`: Sets the request timeout in seconds.
- `scraper_central.tls_verify=`: Enables or disables TLS verification.
- `scraper_central.enable_image_cache=`: Enables or disables image caching (`true`/`false`).

### Proxy Methods

- `scraper_central.query_params=`: Sets query parameters to be appended to each request URL.
- `scraper_central.headers=`: Adds custom headers to requests (e.g. Accept, Accept-Encoding, Content-Type).
- `scraper_central.cookies=`: Parses a JSON string of cookies and sets them for subsequent requests.
### Auth Configuration Method

- `scraper_central.auth_config=`: Sets the `client_id` and `client_secret` used to generate the auth token the gem sends in request headers when calling the Scraper Central APIs.
### Proxy Methods for S3 Key

- `scraper_central.s3_key=`: Sets the S3 key structure: country, marketplace, a predefined page-type name (e.g. `detail-page`), a page identifier (e.g. ASIN or product ID), and a page number if the page is paginated.
- `scraper_central.cache_duration=`: Maximum age of the object stored in the S3 bucket.
### Get Content

- `response = scraper_central.fetch(url)`: Makes a configured HTTP request to the specified URL and returns the response object.