Ariel release 0.0.1

Install

gem install ariel

Announcement

This is the first public release of Ariel - A Ruby Information Extraction Library. See my previous post, ruby-talk:200140, for more background information. This release supports defining a tree document structure and learning rules to extract each node of this tree. Handling of list extraction and learning is not yet implemented, and is the next immediate priority. See the examples directory included in this release, and the discussion of the included examples below. Rule learning is functional and appears to work well, but many refinements are possible. Look out for more updates and new releases shortly.

About Ariel

Ariel intends to assist in extracting information from semi-structured documents including (but not in any way limited to) web pages. Although you may use libraries such as Hpricot or Rubyful Soup, or even plain Regular Expressions to achieve the same goal, Ariel approaches the problem very differently. Ariel relies on the user labeling examples of the data they want to extract, and then finds patterns across several such labeled examples in order to produce a set of general rules for extracting this information from any similar document. It uses the MIT license.

Examples

This release includes two examples in the examples directory (which should now be in the directory to which rubygems installed ariel). The first is the google_calculator directory (inspired by Justin Bailey’s post to my Ariel progress report). The structure is very simple: a calculation is extracted from the page, and then the actual result is extracted from that calculation. 3 labeled examples are included. Ariel reads each of these, tokenizes them, and extracts each label. 4 sets of rules are learnt:

  1. Rules to locate the start of the calculation in the original document.

  2. Rules to locate the end of the calculation in the original document (applied from the end of the document).

  3. Rules to locate the start of the result of the calculation from the extracted calculation.

  4. Rules to locate the end of the result of the calculation from the extracted calculation (applied from the end of the calculation).
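As a rough illustration of how the four rule sets above fit together, here is a minimal sketch. It is not Ariel's code: the helper names (apply_start, apply_end, extract), the landmark strings, and the sample page markup are all made up for this example, and real rules work on tokens rather than raw substrings.

```ruby
# Hypothetical sketch; names, landmarks and page markup are illustrative only.

# A start rule scans forward and returns the position just past its landmark.
def apply_start(text, landmark)
  idx = text.index(landmark)
  idx && idx + landmark.length
end

# An end rule scans backward from the end of the text. A nil landmark is this
# sketch's way of saying "the node runs to the end of its parent".
def apply_end(text, landmark)
  landmark.nil? ? text.length : text.rindex(landmark)
end

# Extract the span between where the start rule and the end rule stop.
def extract(text, start_landmark, end_landmark)
  s = apply_start(text, start_landmark)
  e = apply_end(text, end_landmark)
  return nil if s.nil? || e.nil? || e < s
  text[s...e]
end

page = "<html><body><h2><b>3.5 U.S. dollars = 1.8486241 British pounds</b></h2></body></html>"

# Rule sets 1 and 2 are applied to the whole page.
calculation = extract(page, "<b>", "</b>")

# Rule sets 3 and 4 only ever see the extracted calculation.
result = extract(calculation, "= ", nil)

puts "calculation: #{calculation}"
puts "result: #{result}"
```

The second call never looks at the surrounding page, which is why the rules for the nested node can stay as simple as the rules for its parent.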

Take note of 3 and 4 - this is the advantage of treating a document as a tree in this way. Deeply nested elements can be located by generating a series of simple rules, rather than generating a rule with complexity that increases at each level. Sets of rules are generated because it may not be possible to generate a single rule that will catch all cases. A rule is found that matches as many of the examples as possible (and fails on the rest). These examples are then removed, a rule is found that matches as many of the remaining examples as possible, and so on until every example is covered. When it comes to applying these learnt rules, the rules are tried in the order they were learnt until one of them matches.
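The covering loop and the ordered application described above can be sketched as follows. This is a simplified illustration, not Ariel's implementation: rules are modelled as plain lambdas that return an extracted string or nil, the candidate rules are supplied by hand rather than generated, and learn_rule_set and apply_rule_set are hypothetical names.

```ruby
# Hypothetical sketch of the covering loop; not Ariel's real API.
# A "rule" is any callable returning the extracted value, or nil on failure.

# Repeatedly pick the candidate that is correct on the most remaining
# examples, remove the examples it covers, and continue with the rest.
def learn_rule_set(candidates, examples)
  remaining = examples.dup
  rule_set = []
  until remaining.empty?
    best = candidates.max_by do |rule|
      remaining.count { |ex| rule.call(ex[:doc]) == ex[:label] }
    end
    covered = remaining.select { |ex| best.call(ex[:doc]) == ex[:label] }
    break if covered.empty? # no candidate helps with what is left
    rule_set << best
    remaining -= covered
  end
  rule_set
end

# Try the rules in the order they were learnt; the first match wins.
def apply_rule_set(rule_set, doc)
  rule_set.each do |rule|
    value = rule.call(doc)
    return value unless value.nil?
  end
  nil
end

# Two hand-written candidates standing in for generated ones.
candidates = [
  ->(doc) { doc[/<b>(.*?)<\/b>/, 1] },
  ->(doc) { doc[/<i>(.*?)<\/i>/, 1] }
]

# Made-up labeled examples: two use <b>, one uses <i>.
examples = [
  { doc: "x <b>1 + 1 = 2</b> y", label: "1 + 1 = 2" },
  { doc: "x <b>2 * 3 = 6</b> y", label: "2 * 3 = 6" },
  { doc: "x <i>5 - 4 = 1</i> y", label: "5 - 4 = 1" }
]

rules = learn_rule_set(candidates, examples)
puts rules.size
puts apply_rule_set(rules, "z <i>9 / 3 = 3</i> z")
```

In this toy data no single candidate covers all three examples, so the loop ends up with two rules: the first handles the common case and the second picks up the example the first one fails on.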

To see this example for yourself just execute structure.rb in the examples/google_calculator directory to create a locally writable structure.yaml. Then do:

ariel -D -m learn -s structure.yaml -d /path/to/examples/google_calculator/labeled

You’ll have to wait a while (see my note about performance below). At the end, the learnt rules will be printed in YAML format, and structure.yaml will be updated to include these rules. Apply these learnt rules to some unlabeled documents by doing:

ariel -D -m extract -s structure.yaml -d /path/to/examples/google_calculator/unlabeled

You should see the results of a successful extraction printed to your terminal, such as this one:

Results for unlabeled/2:
calculation: 3.5 U.S. dollars = 1.8486241 British pounds
result: 1.8486241 British pounds

The second example (raa) learns rules using just 2 labeled examples. This is probably fewer than I’d recommend in most cases, but as it works… This example consists of project entries in the Ruby Application Archive. The structure of the page is very flat, so all rules are applied to the full page. Rules are learnt and applied as shown above. The structure.yaml files included in the examples directories already include rules generated by Ariel; use these if you just want to see extraction working.

Note: The interface demonstrated by ariel above is not very flexible or friendly; it’s just there to serve as a demonstration for the moment.

Performance

Generating rules takes quite a long time. It is always going to be an intensive operation, but there are some very simple and obvious improvements in efficiency that can be made. For a start, the rule candidate refining process currently re-applies the same rules over and over every time the remaining rule candidates are ranked. This is where most time is spent, and caching the results of these applications should make a big difference. This will definitely be implemented. Other performance improvements are bound to be possible, but my focus at this time is to get something that works.
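To give an idea of the kind of caching meant here, the sketch below memoises the outcome of applying a rule to an example so that repeated ranking passes do not redo the work. It is only an assumption about how such a cache could look: CachingMatcher, its methods, and the example ids are hypothetical and do not exist in Ariel.

```ruby
# Hypothetical sketch of memoising rule applications; not part of Ariel.
class CachingMatcher
  attr_reader :applications

  def initialize
    @cache = {}
    @applications = 0 # number of real (uncached) rule applications
  end

  # rule is any callable; results are cached per (rule, example id) pair.
  def match(rule, example_id, doc)
    key = [rule, example_id]
    # key? is used so that a cached nil (a failed match) is also reused.
    return @cache[key] if @cache.key?(key)
    @applications += 1
    @cache[key] = rule.call(doc)
  end
end

matcher = CachingMatcher.new
rule = ->(doc) { doc[/= (.*)\z/, 1] }
docs = { 1 => "1 + 1 = 2", 2 => "2 * 3 = 6" }

# Rank the same candidate three times, as a refinement loop might.
3.times { docs.each { |id, doc| matcher.match(rule, id, doc) } }
puts matcher.applications
```

Only the first pass actually runs the rule; the later passes are answered from the cache, at the cost of keeping one entry per rule and example in memory.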

Credits

Ariel is developed by Alex Bradbury as a Google Summer of Code project under the mentoring of Austin Ziegler.

Watch my development through the Subversion repository at rubyforge.org/projects/ariel. I’ve also just started using the tracker at code.google.com/p/ariel/