Class: Cass::Document
Inherits: Object
Defined in: lib/cass/document.rb
Overview
A Document object represents a single document; it can be either an entire file or a subset of one.
Instance Attribute Summary
- #clines ⇒ Object: Returns the value of attribute clines.
- #context ⇒ Object: Returns the value of attribute context.
- #lines ⇒ Object: Returns the value of attribute lines.
- #name ⇒ Object: Returns the value of attribute name.
- #targets ⇒ Object: Returns the value of attribute targets.
- #text ⇒ Object: Returns the value of attribute text.
- #tindex ⇒ Object: Returns the value of attribute tindex.
- #unique ⇒ Object: Returns the value of attribute unique.
Instance Method Summary
- #compact ⇒ Object: Drop all words that aren't in the target list or context.
- #cooccurrence(normalize_weights = false) ⇒ Object: Computes the co-occurrence matrix between target words and the context.
- #initialize(name, targets, text, opts = {}) ⇒ Document (constructor): Create a new Document.
- #pairwise_similarity(filename) ⇒ Object: Computes the pairwise similarity between all possible target pairs and saves the results to the specified file.
- #parse(opts = {}) ⇒ Object: Parse raw text into sentences.
- #permute(clines = false) ⇒ Object: Randomly reorder lines of text.
- #permute!(clines = false) ⇒ Object: Same as permute, but replaces the contents of the current document.
- #resample(clines = false, n = nil) ⇒ Object: Resample n lines WITH replacement from the text (for bootstrapping).
- #resample!(clines = false, n = nil) ⇒ Object: Same as resample, but replaces the contents of the current document.
- #similarity(words) ⇒ Object: Return the requested subset of the similarity matrix.
- #split(n = 10, recalc = true) ⇒ Object: Split the Document into n smaller random subsets, recalculating the context each time.
- #summary(filename = nil, list_context = false, word_count = false) ⇒ Object: Print out summary information about the document.
- #trim! ⇒ Object: Trim the internal list of lines, keeping only those that contain at least one target word.
Constructor Details
#initialize(name, targets, text, opts = {}) ⇒ Document
Create a new Document. Three arguments are required:
- name: The name of the document (defaults to the filename)
- targets: Either an array of target words, or an array of Contrasts from which targets will be extracted
- text: A string of text (the contents of the document)

The following (optional) settings can be passed in an options hash as the fourth argument:

- context: A Context object to use (by default, a new one will be constructed).
- skip_preproc: Skip most text preprocessing steps. This should only ever be used when creating a document derived from another document, where the text has already been processed.
- max_lines: Maximum number of lines to use from the provided text. Note that this limit applies to the number of lines in the input text, NOT the number retained for analysis. By default, all lines will be used.
- recode: A hash of words to recode in the text prior to analysis. For instance, if the key=>value pair 'liberal'=>'democrat' is passed, all occurrences of 'liberal' will be replaced with 'democrat'. This is useful when you want to analyze several words together as a single category, or for combining singular and plural forms of a word.
- keep_case: By default, all words will be converted to lowercase. Passing this key in the options hash will preserve case in the text. Note that this will cause different cases of the same word to be treated as different words.
- keep_special: By default, all non-alphabetical characters will be removed. Use this flag if you want special characters to be retained, with the same caveat as for keep_case.
- parse_text: By default, the text is assumed to already be broken into sentences at the desired boundaries (one sentence per line). If the parse_text key is passed, a parser will be called. Note that USING THIS OPTION IS NOT RECOMMENDED. You should generally preprocess the input text yourself to ensure it looks right before submitting it for analysis.
- parser_basic: If parse_text is on, the Parser will try to call the Stanford Parser by default. If the Stanford Parser isn't installed properly, a basic backup parser will be used. Including the parser_basic flag will skip the Stanford Parser entirely and use the basic parser directly.
- parser_regex: A custom regular expression that will be handed to the basic parser. Lines will be split at matches of the regex instead of at the default boundaries (newlines and periods). Note that parser_basic and parse_text must both be passed for this to work.
# File 'lib/cass/document.rb', line 24

def initialize(name, targets, text, opts={})
  # Error checking...
  if name.nil?
    abort("Error: document has no name!")
  elsif targets.nil? or targets.class != Array or targets.empty?
    abort("Error: invalid target specification; targets must be an array of words or Contrasts.")
  elsif text.nil?
    abort("Error: no text provided!")
  end
  # Set/initialize instance variables
  @name, @text, @tindex = name, text, {}
  # Get list of words from contrasts if necessary
  @targets = if targets[0].class == Contrast
    targets.inject([]) { |t, c| t += c.words.flatten }.uniq
  else
    targets
  end
  # Index targets, parse text, and create Context
  @targets.each_index { |i| @tindex[@targets[i]] = i }
  parse(opts)
  @context = opts['context'].nil? ? Context.new(self, opts) : opts['context']
end
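For illustration of the arguments above, a minimal usage sketch (the file name, document name, and target words here are hypothetical, not taken from the library):

  # Read a corpus that has already been preprocessed (one sentence per line)
  # and build a Document, folding plural target forms into the singular via
  # the 'recode' option.
  text = File.read('speeches.txt')
  doc = Cass::Document.new('speeches', ['liberal', 'conservative'], text,
                           { 'recode' => { 'liberals' => 'liberal', 'conservatives' => 'conservative' } })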
Instance Attribute Details
#clines ⇒ Object
Returns the value of attribute clines.
# File 'lib/cass/document.rb', line 7

def clines
  @clines
end
#context ⇒ Object
Returns the value of attribute context.
# File 'lib/cass/document.rb', line 7

def context
  @context
end
#lines ⇒ Object
Returns the value of attribute lines.
# File 'lib/cass/document.rb', line 7

def lines
  @lines
end
#name ⇒ Object
Returns the value of attribute name.
# File 'lib/cass/document.rb', line 7

def name
  @name
end
#targets ⇒ Object
Returns the value of attribute targets.
# File 'lib/cass/document.rb', line 7

def targets
  @targets
end
#text ⇒ Object
Returns the value of attribute text.
# File 'lib/cass/document.rb', line 7

def text
  @text
end
#tindex ⇒ Object
Returns the value of attribute tindex.
# File 'lib/cass/document.rb', line 7

def tindex
  @tindex
end
#unique ⇒ Object
Returns the value of attribute unique.
# File 'lib/cass/document.rb', line 7

def unique
  @unique
end
Instance Method Details
#compact ⇒ Object
Drop all words that aren't in the target list or context. Store the result as an array of arrays, with the first element = array of targets and the second = array of context words.
# File 'lib/cass/document.rb', line 147

def compact
  puts "Compacting all lines..." if VERBOSE
  @clines = []
  @lines.each { |l|
    w = l.split(/\s+/).uniq
    targs = w.select { |s| @tindex.key?(s) }
    conts = w.delete_if { |s| !@context.key?(s) }
    @clines << [targs, conts]
  }
end
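As an illustration (the words shown are hypothetical), each compacted line pairs the targets found on that line with that line's context words:

  doc.compact
  doc.clines.first   # => [["liberal"], ["senate", "tax", "vote"]]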
#cooccurrence(normalize_weights = false) ⇒ Object
Computes co-occurrence matrix between target words and the context. Stores a target-by-context integer matrix internally.
# File 'lib/cass/document.rb', line 160

def cooccurrence(normalize_weights=false)
  # puts "Generating co-occurrence matrix..." if VERBOSE
  coocc = NMatrix.float(@targets.size, @context.size)
  compact if @clines.nil?
  lc = 0 # line counter
  @clines.each { |l|
    targs, conts = l
    targs.each { |t|
      conts.each { |c|
        next if t == c
        incr = normalize_weights ? 1.0 / conts.size : 1
        coocc[@tindex[t], @context[c]] = coocc[@tindex[t], @context[c]].to_f + incr
      }
    }
  }
  @cooc = coocc #.to_f
  @corr = @cooc.corr #.collect { |i| i*i }  # Uncomment second half of line to square coefficients.
  # p @corr.to_a
  self
end
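A usage sketch, assuming a Document built as in the constructor example above:

  # Build the target-by-context matrix and the derived correlation matrix;
  # the method returns the document itself.
  doc.cooccurrence         # raw co-occurrence counts
  doc.cooccurrence(true)   # weight each co-occurrence by 1 / (number of context words on the line)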
#pairwise_similarity(filename) ⇒ Object
Computes the pairwise similarity between all possible target pairs and saves the results to the specified file. Note that this will produce an unmanageably large file if the number of targets is very large! The output file contains tab-delimited columns for:

- Document name
- First word in pair
- Second word in pair
- Similarity value (correlation)
# File 'lib/cass/document.rb', line 198

def pairwise_similarity(filename)
  abort("Error: you must compute the similarity matrix first!") if @corr.nil?
  outf = File.new(filename, 'w')
  outf.sync = true
  ind = @tindex.invert # For looking up words
  outf.puts %w[Document Word1 Word2 Correlation].join("\t")
  (dim = @corr.shape[0]).times { |i|
    i.upto(dim - 1) { |j|
      outf.puts [@name, ind[i], ind[j], @corr[i, j]].join("\t")
    }
  }
end
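For example (the output path is arbitrary):

  # cooccurrence must be called first so the correlation matrix exists.
  doc.cooccurrence
  doc.pairwise_similarity('pairs.txt')   # writes Document / Word1 / Word2 / Correlation rows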
#parse(opts = {}) ⇒ Object
Parse raw text into sentences. When the skip_preproc option is passed (e.g., when bootstrapping or splitting a document), certain preprocessing steps are skipped, on the assumption that they have already been performed.
# File 'lib/cass/document.rb', line 56

def parse(opts={})
  if opts.key?('skip_preproc')
    @lines = (text.class == Array) ? @text : text.split(/[\r\n]+/)
  else
    puts "Converting to lowercase..." if VERBOSE
    @text.downcase! unless opts.key?('keep_case')
    @text.gsub!(/[^a-z \n]+/, '') unless opts.key?('keep_special')
    if opts.key?('recode')
      puts "Recoding words..." if VERBOSE
      opts['recode'].each { |k, v| @text.gsub!(/(^|\s+)(#{k})($|\s+)/, "\\1#{v}\\3") }
    end
    puts "Parsing text..." if VERBOSE
    @lines = opts.key?('parse_text') ? Parser.parse(@text, opts) : @text.split(/[\r\n]+/)
    @lines = @lines[0, opts['max_lines']] if opts.key?('max_lines')
    trim!
  end
end
#permute(clines = false) ⇒ Object
Randomly reorder lines of text. If clines is true, use the compacted lines variable, otherwise use all lines.
# File 'lib/cass/document.rb', line 88

def permute(clines=false)
  clines ? @clines.sort_by { rand } : @lines.sort_by { rand }
end
#permute!(clines = false) ⇒ Object
Same as permute, but replaces contents of current document.
# File 'lib/cass/document.rb', line 93

def permute!(clines=false)
  self.instance_variable_set("@#{clines ? 'clines' : 'lines'}", permute(clines))
  self
end
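A brief sketch of both forms:

  shuffled = doc.permute   # returns the lines in random order; the document itself is unchanged
  doc.permute!             # reorders the document's own lines in place and returns self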
#resample(clines = false, n = nil) ⇒ Object
Resample n lines WITH replacement from text (for bootstrapping). n = number of lines to resample; defaults to full size of corpus. If clines is true, use the compacted lines variable, otherwise use all lines.
# File 'lib/cass/document.rb', line 101

def resample(clines=false, n=nil)
  n = @lines.size if n.nil? or n > @lines.size
  max = @lines.size
  Array.new(n).map { |i| clines ? @clines[rand(max)] : @lines[rand(max)] }
end
#resample!(clines = false, n = nil) ⇒ Object
Same as resample, but replaces contents of current document.
# File 'lib/cass/document.rb', line 108

def resample!(clines=false, n=nil)
  self.instance_variable_set("@#{clines ? 'clines' : 'lines'}", resample(clines, n))
  self
end
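A bootstrapping sketch (illustrative only; the statistic recorded on each iteration is up to the caller, and the target words are hypothetical):

  # Each iteration builds a new Document from a resample of the original
  # lines; skip_preproc avoids re-running the text preprocessing.
  100.times do
    boot = Cass::Document.new(doc.name, doc.targets, doc.resample, { 'skip_preproc' => true })
    boot.cooccurrence
    # ... record boot.similarity(['liberal', 'conservative']) or another statistic
  end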
#similarity(words) ⇒ Object
Return the requested subset of the similarity matrix. E.g., given the input [‘apple’, ‘orange’, ‘elephant’], a 3 x 3 pairwise similarity matrix will be returned.
# File 'lib/cass/document.rb', line 184

def similarity(words)
  ind = @tindex.values_at(*words)
  @corr[ind, ind]
end
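For instance, assuming the three words are among the document's targets and the co-occurrence matrix has been computed:

  doc.cooccurrence
  doc.similarity(['apple', 'orange', 'elephant'])   # => 3 x 3 correlation submatrix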
#split(n = 10, recalc = true) ⇒ Object
Split Document into n smaller random subsets, recalculating the context each time. Currently not used.
# File 'lib/cass/document.rb', line 115

def split(n=10, recalc=true)
  permute!
  docs = []
  n.times { |i|
    text = @lines.slice!(0, (@lines.size.to_f / (n - i)).round)
    context = recalc ? nil : @context
    name = "#{@name}_split_#{(i + 1)}"
    docs[i] = Document.new(name, @targets, text, context, false)
  }
  docs
end
#summary(filename = nil, list_context = false, word_count = false) ⇒ Object
Print out summary information about the document. Optional arguments:
- filename: If not nil, the results will be saved to the location provided instead of being printed.
- list_context: If true, the entire list of words in the context (typically several thousand words) will be printed.
- word_count: Print the number of tokens in the document for each word, in descending order of frequency. Targets will be printed first. Note that the token counts reflect only the lines used in analysis (i.e., those that contain at least one target), and NOT the entire document. If you need word counts for an entire document, call Stats.word_count directly.
# File 'lib/cass/document.rb', line 214

def summary(filename=nil, list_context=false, word_count=false)
  buffer = []
  # Basic info that always gets shown
  buffer << "Summary for document '#{@name}':"
  buffer << "#{@targets.size} target words (#{@targets.join(", ")})"
  buffer << "#{@context.words.size} words in context."
  buffer << "Using #{@clines.size} lines (containing at least one target word) for analysis."
  # Other options
  buffer << "\n\nList of all words in context:#{@context.words}" if list_context
  buffer << "\n\nNumber of tokens for all words included in analysis:\n#{Stats.word_count(@clines)}" if word_count
  filename.nil? ? puts(buffer) : File.new(filename, 'w').puts(buffer)
end
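For example (the output file name is arbitrary):

  doc.summary                        # print the summary to standard output
  doc.summary('summary.txt', true)   # save to a file, including the full context word list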
#trim! ⇒ Object
Trim internal list of lines, keeping only those that contain at least one target word.
# File 'lib/cass/document.rb', line 76

def trim!
  puts "Deleting target-less lines..." if VERBOSE
  ts = @targets.join("|")
  # @lines.delete_if { |s| (s.split(/\s+/) & @targets).empty? } # another way to do it
  nl = @lines.size
  @lines = @lines.grep(/(^|\s+)(#{ts})($|\s+)/)
  puts "Keeping #{@lines.size} / #{nl} lines." if VERBOSE
  self
end