Module: Ariel

Defined in:
lib/ariel.rb,
lib/ariel/rule.rb,
lib/ariel/token.rb,
lib/ariel/learner.rb,
lib/ariel/rule_set.rb,
lib/ariel/node_like.rb,
lib/ariel/wildcards.rb,
lib/ariel/label_utils.rb,
lib/ariel/token_stream.rb,
lib/ariel/extracted_node.rb,
lib/ariel/structure_node.rb,
lib/ariel/candidate_selector.rb,
lib/ariel/example_document_loader.rb

Overview

Ariel - A Ruby Information Extraction Library

Ariel is intended to help extract information from semi-structured documents, including (but not limited to) web pages. Although you could use libraries such as Hpricot or Rubyful Soup, or even plain regular expressions, to achieve the same goal, Ariel approaches the problem very differently. It relies on the user labeling examples of the data they want to extract, then finds patterns across several such labeled examples to produce a set of general rules for extracting this information from any similar document.

When working with Ariel, your workflow might look something like this:

  1. Define a structure for the data you wish to extract. For example:

    @structure = Ariel::StructureNode.new do |r|
      r.article do |a|
        a.title
        a.author
        a.date
        a.body
      end
      r.comment_list do |c|
        c.author
        c.date
        c.body
      end
    end
    
  2. Label these fields in a few example documents (normally at least 3). Labels take the form <l:label_name>...</l:label_name>.

  3. Ariel reads these examples and tries to generate suitable rules that can be used to extract this data from other similarly structured documents.

  4. A wrapper has now been generated. You can load documents with the same structure (normally documents generated by the same rules, such as different pages from a single site) and query the extracted data.
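
To make the label format from step 2 concrete, here is a minimal sketch of what a labeled fragment might look like and how the labeled spans could be read back. It does not use Ariel's API: the labeled_fields helper, the LABEL_PATTERN constant, and the sample HTML are all hypothetical and exist only for illustration. It relies on nothing beyond Ruby's built-in String#scan.

```ruby
# Hypothetical helper, NOT part of Ariel: it only illustrates the
# <l:label_name>...</l:label_name> format described in step 2.
# The backreference \1 requires the closing label to match the opening one.
LABEL_PATTERN = %r{<l:(\w+)>(.*?)</l:\1>}m

# Returns an array of [label_name, labeled_text] pairs, in document order.
def labeled_fields(text)
  text.scan(LABEL_PATTERN).map { |name, content| [name, content.strip] }
end

# Made-up example document with three labeled fields.
example = <<~DOC
  <h1><l:title>An example title</l:title></h1>
  <p>By <l:author>Example Author</l:author> on <l:date>1 January</l:date></p>
DOC

labeled_fields(example).each do |name, content|
  puts "#{name}: #{content}"
end
```

This sketch handles only flat labels. A nested structure, such as an article label wrapping title and author labels, would be returned as a single outer match and would need the helper to be applied again to the matched content.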

Defined Under Namespace

Modules: LabelUtils, NodeLike

Classes: CandidateSelector, ExampleDocumentLoader, ExtractedNode, Learner, Rule, RuleSet, StructureNode, Token, TokenStream, Wildcards