Class: Cass::Document

Inherits: Object

Defined in: lib/cass/document.rb

Overview

A Document object represents a single document, which can be either an entire file or a subset of one.

Instance Attribute Summary

#clines ⇒ Object

Returns the value of attribute clines.
#context ⇒ Object

Returns the value of attribute context.
#lines ⇒ Object

Returns the value of attribute lines.
#name ⇒ Object

Returns the value of attribute name.
#targets ⇒ Object

Returns the value of attribute targets.
#text ⇒ Object

Returns the value of attribute text.
#tindex ⇒ Object

Returns the value of attribute tindex.
#unique ⇒ Object

Returns the value of attribute unique.

Instance Method Summary

#compact ⇒ Object

Drop all words that aren’t in target list or context.
#cooccurrence(normalize_weights = false) ⇒ Object

Computes co-occurrence matrix between target words and the context.
#initialize(name, targets, text, opts = {}) ⇒ Document constructor

Create a new Document.
#pairwise_similarity(filename) ⇒ Object

Computes the pairwise similarity between all possible target pairs and saves the results to the specified file.
#parse(opts = {}) ⇒ Object

Parse raw text into sentences.
#permute(clines = false) ⇒ Object

Randomly reorder lines of text.
#permute!(clines = false) ⇒ Object

Same as permute, but replaces contents of current document.
#resample(clines = false, n = nil) ⇒ Object

Resample n lines WITH replacement from text (for bootstrapping).
#resample!(clines = false, n = nil) ⇒ Object

Same as resample, but replaces contents of current document.
#similarity(words) ⇒ Object

Return the requested subset of the similarity matrix.
#split(n = 10, recalc = true) ⇒ Object

Split Document into n smaller random subsets, recalculating the context each time.
#summary(filename = nil, list_context = false, word_count = false) ⇒ Object

Print out summary information about the document.
#trim! ⇒ Object

Trim internal list of lines, keeping only those that contain at least one target word.

Constructor Details

#initialize(name, targets, text, opts = {}) ⇒ `Document`

Create a new Document. Three arguments are required:

name: The name of the document (defaults to the filename)
targets: Either an array of target words, or an array of Contrasts from which targets will be extracted
text: A string of text (the contents of the document)

The following (optional) settings can be passed in an options hash as the fourth argument:

context: A Context object to use (by default, a new one will be constructed)
skip_preproc: Skip most text preprocessing steps. This should only ever be used when creating a document derived from another document, where the text has already been processed.
max_lines: Maximum number of lines to use from the provided text. Note that this limit applies to the number of lines in the input text, NOT the number retained for analysis. By default, will use all lines.
recode: a hash containing words to recode in the text prior to analysis. For instance, if the key=>value pair ‘liberal’=>‘democrat’ is passed, all occurrences of ‘liberal’ will be replaced with ‘democrat’. This is useful when you want to analyze several words together as a single category, or for combining singular and plural forms of a word.
keep_case: By default, all words will be converted to lowercase. Passing this key in the options hash will preserve case in the text. Note that this will cause different cases of the same word to be handled as different words.
keep_special: By default, all non-alphabetical characters will be removed. Use this flag if you want special characters to be retained, with the same caveat as for keep_case.
parse_text: By default, it’s assumed that the text is already broken up into sentences at desired boundaries (one sentence per line). If the parse_text key is passed, a parser will be called. Note that USING THIS OPTION IS NOT RECOMMENDED. You should generally preprocess the input text yourself to ensure it looks right before submitting it to analysis.
parser_basic: If parse_text is on, the Parser will try to call the Stanford Parser by default. If the Stanford Parser isn’t installed properly, a backup basic parser will be used. Including the parser_basic flag will automatically revert to the basic parser instead of attempting to use the Stanford parser.
parser_regex: A custom regular expression that will be handed to the basic parser. Lines will be split at matches to the regex instead of the default (splitting only at newlines and periods). Note that parser_basic and parse_text must both be passed for this to work.
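
The default preprocessing described by these options can be approximated in plain Ruby. The sketch below is not part of Cass: `preprocess` is a hypothetical helper that mirrors the lowercase, special-character, recode, and max_lines steps, assuming string keys in the options hash as in the listing for parse.

```ruby
# Hypothetical helper (not part of Cass): approximates the default
# preprocessing steps controlled by the options hash.
def preprocess(text, opts = {})
  out = text.dup
  out = out.downcase unless opts.key?('keep_case')
  # Strip everything except lowercase letters, spaces, and newlines
  out = out.gsub(/[^a-z \n]+/, '') unless opts.key?('keep_special')
  # Replace whole words only, keeping the surrounding whitespace
  (opts['recode'] || {}).each do |from, to|
    out = out.gsub(/(^|\s+)(#{Regexp.escape(from)})($|\s+)/, "\\1#{to}\\3")
  end
  lines = out.split(/[\r\n]+/)
  # max_lines limits the input lines, before any target-based trimming
  lines = lines[0, opts['max_lines']] if opts.key?('max_lines')
  lines
end

raw = "Liberal voters turned out early.\nThe senator met a liberal donor!\nTurnout was high."
puts preprocess(raw, 'recode' => { 'liberal' => 'democrat' }, 'max_lines' => 2)
```

Because each match consumes its trailing whitespace, two directly adjacent occurrences of a recoded word would not both be replaced by this pattern; the sample text avoids that case.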

# File 'lib/cass/document.rb', line 24

def initialize(name, targets, text, opts={})

  # Error checking...
  if name.nil?
    abort("Error: document has no name!")
  elsif targets.nil? or targets.class != Array or targets.empty?
    abort("Error: invalid target specification; targets must be an array of words or Contrasts.")
  elsif text.nil?
    abort("Error: no text provided!")
  end

  # Set/initialize instance variables
  @name, @text, @tindex = name, text, {}

  # Get list of words from contrasts if necessary
  @targets =
    if targets[0].class == Contrast
      # Collect every word from every Contrast, dropping duplicates
      targets.inject([]) { |t, c| t + c.words.flatten }.uniq
    else
      targets
    end

  # Index targets, parse text, and create Context
  @targets.each_index { |i| @tindex[@targets[i]] = i }
  parse(opts)
  # Use the Context passed in the options hash, if any; otherwise build a new one
  @context = opts['context'].nil? ? Context.new(self, opts) : opts['context']
end

Instance Attribute Details

#clines ⇒ `Object`

Returns the value of attribute clines.



# File 'lib/cass/document.rb', line 7

def clines
  @clines
end

#context ⇒ `Object`

Returns the value of attribute context.



# File 'lib/cass/document.rb', line 7

def context
  @context
end

#lines ⇒ `Object`

Returns the value of attribute lines.



# File 'lib/cass/document.rb', line 7

def lines
  @lines
end

#name ⇒ `Object`

Returns the value of attribute name.



# File 'lib/cass/document.rb', line 7

def name
  @name
end

#targets ⇒ `Object`

Returns the value of attribute targets.



# File 'lib/cass/document.rb', line 7

def targets
  @targets
end

#text ⇒ `Object`

Returns the value of attribute text.



# File 'lib/cass/document.rb', line 7

def text
  @text
end

#tindex ⇒ `Object`

Returns the value of attribute tindex.



# File 'lib/cass/document.rb', line 7

def tindex
  @tindex
end

#unique ⇒ `Object`

Returns the value of attribute unique.



# File 'lib/cass/document.rb', line 7

def unique
  @unique
end

Instance Method Details

#compact ⇒ `Object`

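Drop all words that aren’t in the target list or the context. The result is stored in clines as an array with one entry per line; each entry is a pair whose first element is the array of targets found on that line and whose second element is the array of context words found on that line.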

# File 'lib/cass/document.rb', line 147

def compact
	puts "Compacting all lines..." if VERBOSE
	@clines = []
	@lines.each { |l|
		w = l.split(/\s+/).uniq
		targs = w.select { |s| @tindex.key?(s) }
		conts = w.delete_if { |s| !@context.key?(s) }
		@clines << [targs, conts]
	}
end
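
As a rough illustration of the structure compact builds, the sketch below runs the same per-line filtering on plain arrays and hashes. `compact_lines` is a hypothetical stand-in, not a Cass method, and the two hashes stand in for tindex and the Context lookup.

```ruby
# Hypothetical stand-in for Document#compact, using plain hashes
# in place of the target index and the Context object.
def compact_lines(lines, target_index, context_index)
  lines.map do |line|
    words = line.split(/\s+/).uniq            # each word counted once per line
    targs = words.select { |w| target_index.key?(w) }
    conts = words.select { |w| context_index.key?(w) }
    [targs, conts]                            # one [targets, context] pair per line
  end
end

tindex  = { 'apple' => 0, 'orange' => 1 }
context = { 'apple' => 0, 'juice' => 1, 'orange' => 2, 'sweet' => 3 }
p compact_lines(['apple juice is sweet', 'orange juice and apple juice'], tindex, context)
```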

#cooccurrence(normalize_weights = false) ⇒ `Object`

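Computes co-occurrence matrix between target words and the context. Stores a target-by-context matrix of floats internally, holding raw counts by default or fractional weights when normalize_weights is true, along with the correlation matrix derived from it.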

# File 'lib/cass/document.rb', line 160

def cooccurrence(normalize_weights=false)
  # puts "Generating co-occurrence matrix..." if VERBOSE
  coocc = NMatrix.float(@targets.size, @context.size)
  compact if @clines.nil?
  @clines.each { |l|
    targs, conts = l
    targs.each { |t|
      conts.each { |c|
        next if t == c  # a word does not co-occur with itself
        # When normalizing, divide each increment by the number of context words on the line
        incr = normalize_weights ? 1.0/conts.size : 1
        coocc[@tindex[t], @context[c]] = coocc[@tindex[t], @context[c]].to_f + incr
      }
    }
  }
  @cooc = coocc
  @corr = @cooc.corr  # Append .collect { |i| i*i } to square the coefficients.
  self
end
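
The counting loop can be tried without NMatrix. The following sketch is an assumption-laden simplification: `cooccurrence_counts` is a hypothetical function that uses nested Ruby arrays (rows are targets, columns are context words) and omits the correlation step.

```ruby
# Hypothetical sketch of the counting step only, with nested arrays
# standing in for the NMatrix used by Cass.
def cooccurrence_counts(clines, tindex, cindex, normalize_weights = false)
  matrix = Array.new(tindex.size) { Array.new(cindex.size, 0.0) }
  clines.each do |targs, conts|
    next if conts.empty?
    # Optionally down-weight lines that contain many context words
    incr = normalize_weights ? 1.0 / conts.size : 1.0
    targs.each do |t|
      conts.each do |c|
        next if t == c                 # skip a word paired with itself
        matrix[tindex[t]][cindex[c]] += incr
      end
    end
  end
  matrix
end

tindex = { 'apple' => 0, 'orange' => 1 }
cindex = { 'apple' => 0, 'juice' => 1, 'orange' => 2, 'sweet' => 3 }
clines = [[['apple'], %w[apple juice sweet]],
          [%w[orange apple], %w[orange juice apple]]]
p cooccurrence_counts(clines, tindex, cindex)
p cooccurrence_counts(clines, tindex, cindex, true)
```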

#pairwise_similarity(filename) ⇒ `Object`

Computes the pairwise similarity between all possible target pairs and saves the results to the specified file. Note that this will produce an unmanageably large file if the number of targets is very large! The output file contains tab-delimited columns for:

Document name
First word in pair
Second word in pair
Similarity value (correlation)

# File 'lib/cass/document.rb', line 198

def pairwise_similarity(filename)
  abort("Error: you must compute the similarity matrix first!") if @corr.nil?
  ind = @tindex.invert # For looking up words
  # The block form closes the file once all rows are written
  File.open(filename, 'w') { |outf|
    outf.puts %w[Document Word1 Word2 Correlation].join("\t")
    (dim = @corr.shape[0]).times { |i|
      # One tab-delimited row per pair, including each word paired with itself
      i.upto(dim-1) { |j| outf.puts [@name, ind[i], ind[j], @corr[i,j]].join("\t") }
    }
  }
end
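
To make the output format concrete, here is a minimal sketch that computes a Pearson correlation between co-occurrence rows and formats the tab-delimited lines. Both `pearson` and `pairwise_rows` are hypothetical helpers; that the similarity value is a plain Pearson correlation between target rows is an assumption based on the column description above, not something the listing confirms.

```ruby
# Hypothetical helpers (not part of Cass). Pearson correlation of two
# equal-length numeric arrays; returns NaN if either has zero variance.
def pearson(x, y)
  n  = x.size.to_f
  mx = x.sum / n
  my = y.sum / n
  cov = x.zip(y).sum { |a, b| (a - mx) * (b - my) }
  sx  = Math.sqrt(x.sum { |a| (a - mx)**2 })
  sy  = Math.sqrt(y.sum { |b| (b - my)**2 })
  cov / (sx * sy)
end

# Builds a header plus one row per pair (i <= j), mirroring the loop
# bounds in pairwise_similarity.
def pairwise_rows(doc_name, words, rows)
  out = [%w[Document Word1 Word2 Correlation].join("\t")]
  words.each_index do |i|
    i.upto(words.size - 1) do |j|
      r = pearson(rows[i], rows[j]).round(4)
      out << [doc_name, words[i], words[j], r].join("\t")
    end
  end
  out
end

puts pairwise_rows('doc', %w[apple orange], [[0, 2, 1, 1], [1, 1, 0, 0]])
```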

#parse(opts = {}) ⇒ `Object`

Parse raw text into sentences. When the skip_preproc option is passed (e.g., when bootstrapping or splitting a document), the preprocessing steps are skipped, on the assumption that they have already been performed.

# File 'lib/cass/document.rb', line 56

def parse(opts={})
  if opts.key?('skip_preproc')
    @lines = (text.class == Array) ? @text : text.split(/[\r\n]+/)
  else
    puts "Converting to lowercase..." if VERBOSE
    @text.downcase! unless opts.key?('keep_case')
    # Use key? here: Hash#key looks up a key by value and would always return nil
    @text.gsub!(/[^a-z \n]+/, '') unless opts.key?('keep_special')
    if opts.key?('recode')
      puts "Recoding words..." if VERBOSE
      opts['recode'].each { |k,v| @text.gsub!(/(^|\s+)(#{k})($|\s+)/, "\\1#{v}\\3") }
    end
    puts "Parsing text..." if VERBOSE
    @lines = opts.key?('parse_text') ? Parser.parse(@text, opts) : @text.split(/[\r\n]+/)
    @lines = @lines[0, opts['max_lines']] if opts.key?('max_lines')
    trim!
  end
end

#permute(clines = false) ⇒ `Object`

Randomly reorder lines of text. If clines is true, use the compacted lines variable, otherwise use all lines.



# File 'lib/cass/document.rb', line 88

def permute(clines=false)
  clines ? @clines.sort_by {rand} : @lines.sort_by {rand}
end

#permute!(clines = false) ⇒ `Object`

Same as permute, but replaces contents of current document.

# File 'lib/cass/document.rb', line 93

def permute!(clines=false)
  # instance_variable_set requires the leading "@" in the variable name
  self.instance_variable_set("@#{clines ? 'clines' : 'lines'}", permute(clines))
  self
end

#resample(clines = false, n = nil) ⇒ `Object`

Resample n lines WITH replacement from text (for bootstrapping). n = number of lines to resample; defaults to full size of corpus. If clines is true, use the compacted lines variable, otherwise use all lines.

# File 'lib/cass/document.rb', line 101

def resample(clines=false, n=nil)
  src = clines ? @clines : @lines  # sample from the requested set of lines
  n = src.size if n.nil? or n > src.size
  Array.new(n) { src[rand(src.size)] }
end
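
Sampling with replacement means the same line can be drawn more than once, which is what makes the result usable as a bootstrap sample. A minimal standalone sketch follows; `resample_lines` and its `rng` parameter are hypothetical and only added so the draw can be seeded.

```ruby
# Hypothetical helper (not part of Cass): draw n lines with replacement.
# n defaults to, and is capped at, the number of available lines.
def resample_lines(lines, n = nil, rng = Random.new)
  n = lines.size if n.nil? || n > lines.size
  Array.new(n) { lines[rng.rand(lines.size)] }
end

lines = ['apple pie', 'orange juice', 'apple juice', 'orange peel']
sample = resample_lines(lines, nil, Random.new(42))
puts sample
puts "distinct lines drawn: #{sample.uniq.size} of #{lines.size}"
```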

#resample!(clines = false, n = nil) ⇒ `Object`

Same as resample, but replaces contents of current document.

# File 'lib/cass/document.rb', line 108

def resample!(clines=false, n=nil)
  # instance_variable_set requires the leading "@" in the variable name
  self.instance_variable_set("@#{clines ? 'clines' : 'lines'}", resample(clines,n))
  self
end

#similarity(words) ⇒ `Object`

Return the requested subset of the similarity matrix. E.g., given the input [‘apple’, ‘orange’, ‘elephant’], a 3 x 3 pairwise similarity matrix will be returned.

# File 'lib/cass/document.rb', line 184

def similarity(words)
  ind = @tindex.values_at(*words)
  @corr[ind, ind]
end
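
The lookup amounts to translating words into row and column indices and slicing the matrix with them. The sketch below shows the idea on a nested array; `similarity_subset` is hypothetical, and the check for unknown words is an addition that the listing above does not perform.

```ruby
# Hypothetical helper (not part of Cass): pick the rows and columns of a
# square similarity matrix that correspond to the requested words.
def similarity_subset(corr, tindex, words)
  idx = tindex.values_at(*words)
  missing = words.zip(idx).select { |_, i| i.nil? }.map(&:first)
  raise ArgumentError, "unknown target(s): #{missing.join(', ')}" unless missing.empty?
  idx.map { |i| idx.map { |j| corr[i][j] } }
end

tindex = { 'apple' => 0, 'orange' => 1, 'elephant' => 2 }
corr = [[1.0, 0.2, 0.3],
        [0.2, 1.0, 0.5],
        [0.3, 0.5, 1.0]]
p similarity_subset(corr, tindex, %w[apple elephant])
```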

#split(n = 10, recalc = true) ⇒ `Object`

Split Document into n smaller random subsets, recalculating the context each time. Currently not used.

# File 'lib/cass/document.rb', line 115

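def split(n=10, recalc=true)
  permute!
  docs = []
  n.times { |i|
    text = @lines.slice!(0,(@lines.size.to_f/(n-i)).round)
    name = "#{@name}_split_#{(i+1)}"
    # Lines are already preprocessed, so skip preprocessing in the new Document;
    # reuse the current Context unless it should be recalculated
    opts = {'skip_preproc' => true}
    opts['context'] = @context unless recalc
    docs[i] = Document.new(name, @targets, text, opts)
  }
  docs
end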
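
The slicing arithmetic divides whatever lines remain by the number of subsets still to be filled, which keeps the subset sizes within one line of each other. A standalone sketch of just that step, with `split_lines` as a hypothetical helper:

```ruby
# Hypothetical helper (not part of Cass): shuffle the lines, then cut
# them into n chunks of nearly equal size.
def split_lines(lines, n = 10, rng = Random.new)
  pool = lines.shuffle(random: rng)
  Array.new(n) do |i|
    # Take an even share of what is left for the remaining n - i chunks
    pool.slice!(0, (pool.size.to_f / (n - i)).round)
  end
end

chunks = split_lines((1..10).map { |k| "line #{k}" }, 3, Random.new(7))
chunks.each_with_index { |c, i| puts "subset #{i + 1}: #{c.size} lines" }
```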

#summary(filename = nil, list_context = false, word_count = false) ⇒ `Object`

Print out summary information about the document. Optional arguments:

filename: if not nil, the results are saved to the given location instead of being printed.
list_context: if true, will print out the entire list (typically several thousand words) of words in the context.
word_count: print list of number of tokens in the document for each word, by descending frequency rank. Targets will be printed first. Note that the token counts reflect only the lines used in analysis (i.e., those that contain at least one target), and NOT the entire document. If you need word counts for an entire document, call Stats.word_count directly.

# File 'lib/cass/document.rb', line 214

def summary(filename=nil, list_context=false, word_count=false)
  
  buffer = []
  
  # Basic info that always gets shown
  buffer << "Summary for document '#{@name}':"
  buffer << "#{@targets.size} target words (#{@targets.join(", ")})"
  buffer << "#{@context.words.size} words in context."
  buffer << "Using #{@clines.size} lines (containing at least one target word) for analysis."
  
  # Other options
  buffer << "\n\nList of all words in context:#{@context.words}" if list_context
  buffer << "\n\nNumber of tokens for all words included in analysis:\n#{Stats.word_count(@clines)}" if word_count
  
  # The block form of File.open closes the file after writing
  filename.nil? ? puts(buffer) : File.open(filename, 'w') { |f| f.puts(buffer) }
   
end

#trim! ⇒ `Object`

Trim internal list of lines, keeping only those that contain at least one target word.

# File 'lib/cass/document.rb', line 76

def trim!
  puts "Deleting target-less lines..." if VERBOSE
  ts = @targets.join("|")
  #@lines.delete_if { |s| (s.split(/\s+/) & @targets).empty? }  # another way to do it
  nl = @lines.size
  @lines = @lines.grep(/(^|\s+)(#{ts})($|\s+)/)
  puts "Keeping #{@lines.size} / #{nl} lines." if VERBOSE
  self   
end
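
The whole-word matching used for trimming can be checked on a few sample lines. In this sketch `trim_lines` is a hypothetical helper; unlike the listing above, it escapes the target words via Regexp.union, which is an added precaution rather than documented Cass behavior.

```ruby
# Hypothetical helper (not part of Cass): keep only lines that contain
# at least one target as a whole, whitespace-delimited word.
def trim_lines(lines, targets)
  alternatives = Regexp.union(targets).source   # escaped "apple|orange"
  lines.grep(/(^|\s)(#{alternatives})($|\s)/)
end

lines = ['apple pie', 'pineapple cake', 'fresh orange', 'banana bread']
puts trim_lines(lines, %w[apple orange])
```

Here 'pineapple cake' is dropped because 'apple' only appears inside a longer word.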
