Class: Cass::Parser

Inherits:
Object
Defined in:
lib/cass/parser.rb

Overview

Parses a string (e.g., text read from a file) into sentences. It can use either the Stanford Natural Language Parser (if installed) or a barebones parser that splits text at line breaks and periods. Generally speaking, you shouldn’t rely on the Parser class to parse and sanitize your input texts for you. This class implements only barebones functionality, and there’s no guarantee that the resulting text will look the way you want it to. You are strongly encouraged to process all texts yourself beforehand and to use this functionality only as a last resort.

Class Method Summary

Class Method Details

.parse(text, opts = {}) ⇒ Object

Parses a string into sentences. If the Stanford Parser and its associated Ruby gem (stanfordparser.rubyforge.org/) are installed, they are called to do the job. Otherwise, or when opts['parser_basic'] is true, only basic parsing is performed: the text is split into sentences at newlines and periods, or at a custom pattern supplied in opts['parser_regex']. Note that basic parsing is suboptimal and may cause problems for some documents, for example those containing abbreviations or decimal numbers.
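
To illustrate why splitting at periods is suboptimal, the following minimal sketch applies the default pattern from the method source (runs of carriage returns, newlines, or periods) directly to a string, without loading Cass. The sample text and the constant name DEFAULT_SPLIT are made up for illustration.

```ruby
# Default pattern used by the basic mode: runs of CR, LF, or periods.
# DEFAULT_SPLIT is an illustrative name, not a constant defined by Cass.
DEFAULT_SPLIT = /[\r\n.]+/

# Made-up sample containing an abbreviation and a time written with periods.
sample = "Dr. Smith arrived at 3.30 p.m.\nHe was late."

# Every period is a split point, so the abbreviation and the time
# break the first sentence into several fragments.
fragments = sample.split(DEFAULT_SPLIT)

fragments.each { |fragment| puts fragment.inspect }
```

The two sentences in the sample come back as five fragments, and leading spaces are kept. This is the kind of result that pre-processing the text yourself avoids.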



# File 'lib/cass/parser.rb', line 20

def self.parse(text, opts={})
  # Try to load Stanford Parser wrapper; fall back to basic parsing if it is unavailable
  spfail = false
  begin
    require 'stanfordparser'
  rescue LoadError
    puts "Error: stanfordparser gem couldn't load. Using barebones parsing mode instead. If you'd like to use" +
        " the Stanford Parser, make sure all required components are installed (see http://stanfordparser.rubyforge.org/). You'll need to make sure the java library is installed, as well as the treebank and jrb gems."
    spfail = true
  end

  if spfail || opts['parser_basic'] == true
    puts "Using a basic parser to split text into sentences. Note that this is intended as a last resort only; you are strongly encouraged to process all input texts yourself and make sure that lines are broken up the way you want them to be (with each line on a new line of text in the file). If you use this parser, we make no guarantees about the quality of the output."
    # Default pattern: split at runs of carriage returns, newlines, or periods
    rx = opts.fetch('parser_regex', "[\r\n.]+")
    text.split(/#{rx}/)
  else
    puts "Using the Stanford Parser to parse the text. Note that this could take a long time for large files!" if defined?(VERBOSE) && VERBOSE
    parser = StanfordParser::DocumentPreprocessor.new
    parser.getSentencesFromString(text)
  end
end
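
Two details of the source above are easy to miss: option keys are looked up as strings ('parser_basic', 'parser_regex'), and a custom 'parser_regex' replaces the default pattern entirely. The sketch below reproduces only the fallback branch in a hypothetical stand-in method, basic_split, so that it runs without Cass or the Stanford Parser. The method name and the sample text are assumptions for illustration and are not part of the library.

```ruby
# Hypothetical stand-in for the fallback branch of Cass::Parser.parse.
# It is not part of Cass; it only mirrors the split logic shown above.
def basic_split(text, opts = {})
  rx = opts.key?('parser_regex') ? opts['parser_regex'] : "[\r\n.]+"
  text.split(/#{rx}/)
end

# Made-up sample: two sentences, one per line, with an abbreviation.
text = "First sentence of Dr. Lee.\nSecond sentence."

# Default pattern: splits at line breaks and at every period.
default_result = basic_split(text)

# Custom pattern: split at line breaks only, which suits text that is
# already laid out with one sentence per line.
by_line = basic_split(text, 'parser_regex' => "[\r\n]+")

# Symbol keys are not recognized, so this call silently falls back
# to the default pattern.
symbol_key = basic_split(text, parser_regex: "[\r\n]+")

p default_result
p by_line
p symbol_key == default_result
```

If the input has already been pre-processed into one sentence per line, a line-break-only pattern like the one above keeps abbreviations intact. Whether that suits a given corpus is a judgment call for the caller.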