Class: Arwen

Inherits:
Object
Defined in:
lib/arwen.rb,
lib/arwen/url.rb,
lib/arwen/version.rb

Overview

Parses a sitemap URL and returns all links listed in the sitemap or sitemap_index. It uses Typhoeus for network requests, issuing them concurrently when parsing a sitemap_index, and Ox as the XML parser. Sitemaps are assumed to follow the sitemaps.org protocol.
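The difference between a sitemap and a sitemap_index can be sketched without any network access. The example below is only an illustration under stated assumptions: it swaps in REXML (bundled with Ruby) for Ox, the helper names `child_text` and `sitemap_locs` are hypothetical, and the sample URLs are made up. It is not how Arwen itself is implemented.

```ruby
require "rexml/document"

# Sample documents following the sitemaps.org protocol (made-up URLs).
URLSET_XML = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>https://example.com/</loc><lastmod>2024-01-01</lastmod></url>
    <url><loc>https://example.com/about</loc></url>
  </urlset>
XML

INDEX_XML = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap><loc>https://example.com/sitemap-1.xml</loc></sitemap>
    <sitemap><loc>https://example.com/sitemap-2.xml</loc></sitemap>
  </sitemapindex>
XML

# Hypothetical helper: text of the first direct child element called `name`.
def child_text(element, name)
  child = element.elements.to_a.find { |el| el.name == name }
  child&.text&.strip
end

# Hypothetical helper: returns the root element name and every <loc> value.
# A <urlset> lists pages in <url> entries; a <sitemapindex> lists further
# sitemap files in <sitemap> entries.
def sitemap_locs(xml)
  root = REXML::Document.new(xml).root
  entry_name = root.name == "sitemapindex" ? "sitemap" : "url"
  locs = root.elements.to_a
             .select { |el| el.name == entry_name }
             .map { |el| child_text(el, "loc") }
             .compact
  [root.name, locs]
end

kind, locs = sitemap_locs(INDEX_XML)
puts "#{kind}: #{locs.join(', ')}"
```

For a sitemap_index, each returned location is itself a sitemap that still has to be fetched, which is where the concurrent requests mentioned above come in. With the gem itself, the documented entry point would be along the lines of `Arwen.new(url).to_a`.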

Defined Under Namespace

Classes: Url

Constant Summary

VERSION = "0.1.1"

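Instance Method Summary

  • #initialize(url, opts = {}) ⇒ Arwen
  • #sitemap ⇒ Ox::Document
  • #to_a ⇒ Array<String>
  • #urls ⇒ Array<SitemapParser::Url>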

Constructor Details

#initialize(url, opts = {}) ⇒ Arwen

Create a new Arwen instance

Parameters:

  • url (String)

    the full URL of the sitemap or sitemap_index XML file

  • opts (Hash) (defaults to: {})

    options passed to Typhoeus::Request instances; followlocation: true is set by default and can be overridden here.

Options Hash (opts):

  • :max_concurrency (Integer)

    maximum number of concurrent requests, passed to Typhoeus::Hydra (defaults to 200)


# File 'lib/arwen.rb', line 22

def initialize(url, opts = {})
  @url = url
  max_concurrency = opts.delete(:max_concurrency) { 200 }
  @opts = { followlocation: true }.merge(opts)
  @hydra = Typhoeus::Hydra.new(max_concurrency: max_concurrency)
end
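The option handling in the constructor can be tried in isolation. The following is a minimal sketch, assuming a hypothetical helper `split_options` that mirrors the two relevant lines of the listing. The `timeout` key is only an illustrative request option, and no Typhoeus objects are created.

```ruby
# Hypothetical helper that mirrors the option handling in #initialize.
def split_options(opts = {})
  # Hash#delete removes the key and returns its value; the block supplies
  # the fallback (200) only when :max_concurrency is absent.
  max_concurrency = opts.delete(:max_concurrency) { 200 }
  # Keys in opts win over the default, so followlocation can be overridden.
  request_opts = { followlocation: true }.merge(opts)
  [max_concurrency, request_opts]
end

caller_opts = { max_concurrency: 10, timeout: 5 }
max_concurrency, request_opts = split_options(caller_opts)

puts max_concurrency      # the explicit value, not the default
puts request_opts.inspect # followlocation plus the caller's timeout
puts caller_opts.inspect  # :max_concurrency is gone: delete mutates the hash
```

Because `delete` mutates its receiver, a hash passed in by the caller loses its `:max_concurrency` key. Passing a copy (`opts.dup`) avoids that if the hash is reused elsewhere.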

Instance Method Details

#sitemap ⇒ Ox::Document

Parses the sitemap URL into an Ox::Document instance. The result is memoized, so repeated calls reuse the same document.

Returns:

  • (Ox::Document)


# File 'lib/arwen.rb', line 47

def sitemap
  @sitemap ||= raw_sitemap
end
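The `||=` idiom in the listing above caches the first result in an instance variable. A small sketch of that behavior, using a hypothetical `MemoDemo` class whose `raw_sitemap` only counts calls and returns a placeholder string instead of fetching anything:

```ruby
# Hypothetical stand-in for the class; raw_sitemap here only counts calls.
class MemoDemo
  attr_reader :fetches

  def initialize
    @fetches = 0
  end

  # Same shape as Arwen#sitemap: compute once, then reuse @sitemap.
  def sitemap
    @sitemap ||= raw_sitemap
  end

  private

  def raw_sitemap
    @fetches += 1
    "<urlset/>" # placeholder for the parsed document
  end
end

demo = MemoDemo.new
first = demo.sitemap
second = demo.sitemap
puts demo.fetches         # the expensive step ran a single time
puts first.equal?(second) # both calls return the very same object
```

One property of `||=` worth keeping in mind: a `nil` or `false` result is not cached, so the right-hand side would run again on the next call.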

#to_a ⇒ Array<String>

Returns an array of URL strings for all URLs in the sitemap.

Returns:

  • (Array<String>)


# File 'lib/arwen.rb', line 39

def to_a
  urls.map(&:url)
end
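The relationship between the URL objects and the plain strings returned here can be illustrated with a stand-in type. In this sketch `UrlEntry` is a hypothetical Struct, not the gem's Url class. Only the `url` reader is implied by the listing; the `lastmod` field is an assumption borrowed from the sitemaps.org `<url>` schema.

```ruby
# Hypothetical stand-in for the gem's Url class.
UrlEntry = Struct.new(:url, :lastmod)

entries = [
  UrlEntry.new("https://example.com/", "2024-01-01"),
  UrlEntry.new("https://example.com/about", nil)
]

# Same transformation as #to_a: keep only the location of each entry.
links = entries.map(&:url)
puts links.inspect
```

This is why #to_a is the lighter interface when only the links matter, while #urls keeps the per-entry metadata.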

#urls ⇒ Array<SitemapParser::Url>

Fetches and returns all URLs in the sitemap, each with its corresponding <url> metadata from the sitemap schema. The result is memoized.

Returns:

  • (Array<SitemapParser::Url>)


# File 'lib/arwen.rb', line 32

def urls
  @urls ||= all_urls(sitemap)
end