Class: Spacy::Doc

Inherits:
Object
Includes:
Enumerable
Defined in:
lib/ruby-spacy.rb

Overview

See also the spaCy Python API documentation for [Doc](https://spacy.io/api/doc).

Instance Attribute Summary

Class Method Summary

Instance Method Summary

Constructor Details

#initialize(nlp, py_doc: nil, text: nil, max_retrial: MAX_RETRIAL) ⇒ Doc

It is recommended to use the Language#read method to create a doc. If you need to create one with #initialize, there are two supported call patterns: `Spacy::Doc.new(nlp, py_doc: py_doc)` and `Spacy::Doc.new(nlp, text: text)`.

Parameters:

  • nlp (Language)

    an instance of Language class

  • py_doc (Object) (defaults to: nil)

    an instance of Python Doc class

  • text (String) (defaults to: nil)

    the text string to be analyzed

  • max_retrial (Integer) (defaults to: MAX_RETRIAL)

    the maximum number of retries when document construction fails



# File 'lib/ruby-spacy.rb', line 91

def initialize(nlp, py_doc: nil, text: nil, max_retrial: MAX_RETRIAL)
  @py_nlp = nlp
  retrial = 0
  begin
    @py_doc = py_doc || nlp.call(text)
    @text = @py_doc.text
  rescue StandardError
    retrial += 1
    raise "Error: Failed to construct a Doc object" unless retrial <= max_retrial

    sleep 0.5
    retry
  end
end
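The constructor's retry loop is a reusable pattern: attempt the operation, and on failure sleep briefly and retry up to a bounded number of times. A minimal standalone sketch (the `with_retries` helper and the simulated failure are illustrative, not part of ruby-spacy):

```ruby
# Generic retry-with-delay pattern, mirroring the constructor above.
def with_retries(max_retrial, delay: 0)
  retrial = 0
  begin
    yield
  rescue StandardError
    retrial += 1
    raise "Error: operation failed after #{max_retrial} retries" unless retrial <= max_retrial

    sleep delay
    retry
  end
end

# Simulate a call that fails twice and then succeeds.
attempts = 0
result = with_retries(3) do
  attempts += 1
  raise "transient failure" if attempts < 3

  "ok"
end
# result  => "ok"
# attempts => 3
```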

Dynamic Method Handling

This class handles dynamic methods through the method_missing method

#method_missing(name, *args) ⇒ Object

Methods defined in Python but not wrapped in ruby-spacy can be called by this dynamic method handling mechanism.



# File 'lib/ruby-spacy.rb', line 485

def method_missing(name, *args)
  @py_doc.send(name, *args)
end
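The same delegation idiom can be seen in isolation. Here a hypothetical wrapper class forwards any method it does not define itself to an inner object, as Doc does with its Python counterpart (pairing `method_missing` with `respond_to_missing?`, as Doc also does further below):

```ruby
# A minimal wrapper that forwards unknown methods to the wrapped object.
class Wrapper
  def initialize(inner)
    @inner = inner
  end

  def method_missing(name, *args)
    @inner.send(name, *args)
  end

  def respond_to_missing?(sym, include_private = false)
    @inner.respond_to?(sym, include_private) || super
  end
end

w = Wrapper.new("hello world")
w.upcase      # forwarded to String#upcase => "HELLO WORLD"
w.split.size  # forwarded to String#split  => 2
```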

Instance Attribute Details

#py_doc ⇒ Object (readonly)

Returns a Python Doc instance accessible via PyCall.

Returns:

  • (Object)

    a Python Doc instance accessible via PyCall



# File 'lib/ruby-spacy.rb', line 74

def py_doc
  @py_doc
end

#py_nlp ⇒ Object (readonly)

Returns a Python Language instance accessible via PyCall.

Returns:

  • (Object)

    a Python Language instance accessible via PyCall



# File 'lib/ruby-spacy.rb', line 71

def py_nlp
  @py_nlp
end

#text ⇒ String (readonly)

Returns a text string of the document.

Returns:

  • (String)

    a text string of the document



# File 'lib/ruby-spacy.rb', line 77

def text
  @text
end

Class Method Details

.from_bytes(nlp, byte_string) ⇒ Doc

Restores a doc from binary data created by #to_bytes. This is useful for caching processed documents to avoid re-processing.

Examples:

Load doc from file

bytes = File.binread("doc.bin")
doc = Spacy::Doc.from_bytes(nlp, bytes)

Parameters:

  • nlp (Language)

    the Language instance used to restore the doc

  • byte_string (String)

    binary data from #to_bytes

Returns:

  • (Doc)

    the restored doc



# File 'lib/ruby-spacy.rb', line 235

def self.from_bytes(nlp, byte_string)
  b64 = Base64.strict_encode64(byte_string)
  py_bytes = PyCall.eval("__import__('base64').b64decode('#{b64}')")
  py_doc = nlp.py_nlp.call("").from_bytes(py_bytes)
  new(nlp.py_nlp, py_doc: py_doc)
end
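The Base64 round trip above exists to pass raw bytes into Python safely through an `eval` string: strict Base64 output contains no quotes or newlines, so it can be embedded verbatim. The Ruby half of that encoding can be checked on its own:

```ruby
require "base64"

# Binary data round-trips losslessly through strict Base64, and the
# encoded form contains only characters safe to embed in a Python
# source string.
bytes = (0..255).map(&:chr).join.force_encoding(Encoding::BINARY)
b64 = Base64.strict_encode64(bytes)

restored = Base64.strict_decode64(b64)
# restored == bytes  => true
```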

Instance Method Details

#[](range) ⇒ Object

Returns a span if given a range object; or returns a token if given an integer representing a position in the doc.

Parameters:

  • range (Range)

an ordinary Ruby range object such as `0..3`, `1...4`, or `3..-1`



# File 'lib/ruby-spacy.rb', line 201

def [](range)
  if range.is_a?(Range)
    py_span = @py_doc[range]
    Span.new(self, start_index: py_span.start, end_index: py_span.end - 1)
  else
    Token.new(@py_doc[range])
  end
end

#displacy(style: "dep", compact: false) ⇒ String

Visualize the document in one of two styles: “dep” (dependencies) or “ent” (named entities).

Parameters:

  • style (String) (defaults to: "dep")

    either dep or ent

  • compact (Boolean) (defaults to: false)

only relevant to the "dep" style

Returns:

  • (String)

in the case of "dep", the output is an SVG string, whereas in the "ent" style it is an HTML string



# File 'lib/ruby-spacy.rb', line 246

def displacy(style: "dep", compact: false)
  PyDisplacy.render(py_doc, style: style, options: { compact: compact }, jupyter: false)
end

#each ⇒ Object

Iterates over the elements in the doc yielding a token instance each time.



# File 'lib/ruby-spacy.rb', line 141

def each
  PyCall::List.call(@py_doc).each do |py_token|
    yield Token.new(py_token)
  end
end

#ents ⇒ Array<Span>

Returns an array of spans each representing a named entity.

Returns:

  • (Array<Span>)

    an array of spans each representing a named entity

# File 'lib/ruby-spacy.rb', line 193

def ents
  PyCall::List.call(@py_doc.ents).map do |py_span|
    Span.new(self, py_span: py_span)
  end
end

#instance_variables_to_inspect ⇒ Object



# File 'lib/ruby-spacy.rb', line 493

def instance_variables_to_inspect
  [:@text]
end

#linguistic_summary(sections: [:text, :tokens, :entities, :noun_chunks], token_attributes: [:text, :lemma, :pos, :dep, :head]) ⇒ String

Generates a JSON string summarizing the linguistic analysis of the document. Designed to be passed as context to an LLM (e.g., via OpenAIHelper#chat).

Parameters:

  • sections (Array<Symbol>) (defaults to: [:text, :tokens, :entities, :noun_chunks])

    which sections to include (:text, :tokens, :entities, :noun_chunks, :sentences)

  • token_attributes (Array<Symbol>) (defaults to: [:text, :lemma, :pos, :dep, :head])

    which token attributes to include (:text, :lemma, :pos, :tag, :dep, :head, :ent_type, :morphology)

Returns:

  • (String)

    a JSON string of the linguistic summary



# File 'lib/ruby-spacy.rb', line 258

def linguistic_summary(sections: [:text, :tokens, :entities, :noun_chunks],
                       token_attributes: [:text, :lemma, :pos, :dep, :head])
  result = {}

  sections.each do |section|
    case section
    when :text
      result[:text] = @text
    when :tokens
      result[:tokens] = tokens.map do |token|
        token_hash = {}
        token_attributes.each do |attr|
          case attr
          when :head
            token_hash[:head] = token.head.text
          when :morphology
            # Use string form and parse to ensure a plain Ruby Hash for JSON serialization
            morph_str = token.morphology(hash: false)
            token_hash[:morphology] = if morph_str.empty?
                                        {}
                                      else
                                        morph_str.split("|").each_with_object({}) do |pair, h|
                                          k, v = pair.split("=", 2)
                                          h[k] = v
                                        end
                                      end
          else
            token_hash[attr] = token.send(attr)
          end
        end
        token_hash
      end
    when :entities
      ent_list = ents
      result[:entities] = ent_list.map do |ent|
        { text: ent.text, label: ent.label }
      end
    when :noun_chunks
      result[:noun_chunks] = noun_chunks.map do |chunk|
        { text: chunk.text, root: chunk.root.text }
      end
    when :sentences
      result[:sentences] = sents.map(&:text)
    end
  end

  result.to_json
end
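The `Case=Nom|Number=Sing`-style morphology string parsed inside `linguistic_summary` can be exercised standalone; this extracts the parsing step into a small helper (the method name is illustrative):

```ruby
require "json"

# Parse spaCy's "Key=Value|Key=Value" morphology string into a plain
# Ruby Hash, as linguistic_summary does before JSON serialization.
def parse_morphology(morph_str)
  return {} if morph_str.empty?

  morph_str.split("|").each_with_object({}) do |pair, h|
    k, v = pair.split("=", 2)
    h[k] = v
  end
end

parse_morphology("Case=Nom|Number=Sing")
# => { "Case" => "Nom", "Number" => "Sing" }
parse_morphology("")
# => {}
```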

#noun_chunks ⇒ Array<Span>

Returns an array of spans representing noun chunks.

Returns:

  • (Array<Span>)

    an array of spans representing noun chunks

# File 'lib/ruby-spacy.rb', line 177

def noun_chunks
  PyCall::List.call(@py_doc.noun_chunks).map do |py_chunk|
    Span.new(self, start_index: py_chunk.start, end_index: py_chunk.end - 1)
  end
end

#openai_completion(access_token: nil, max_completion_tokens: nil, max_tokens: nil, temperature: 0.7, model: "gpt-5-mini") ⇒ String?

Sends a text completion request to OpenAI’s chat API.

Parameters:

  • access_token (String, nil) (defaults to: nil)

    OpenAI API key (defaults to OPENAI_API_KEY env var)

  • max_completion_tokens (Integer) (defaults to: nil)

    Maximum tokens in the response

  • max_tokens (Integer) (defaults to: nil)

    Alias for max_completion_tokens (deprecated, for backward compatibility)

  • temperature (Float) (defaults to: 0.7)

    Sampling temperature (ignored for GPT-5 models)

  • model (String) (defaults to: "gpt-5-mini")

    The model to use (default: gpt-5-mini)

Returns:

  • (String, nil)

    The completed text



# File 'lib/ruby-spacy.rb', line 435

def openai_completion(access_token: nil, max_completion_tokens: nil, max_tokens: nil, temperature: 0.7, model: "gpt-5-mini")
  # Support both max_completion_tokens and max_tokens for backward compatibility
  max_completion_tokens ||= max_tokens || 1000

  messages = [
    { role: "system", content: "Complete the text input by the user." },
    { role: "user", content: @text }
  ]

  client = openai_client(access_token)
  response = client.chat(
    model: model,
    messages: messages,
    max_completion_tokens: max_completion_tokens,
    temperature: temperature
  )
  response.dig("choices", 0, "message", "content")
rescue OpenAIClient::APIError => e
  puts "Error: OpenAI API call failed - #{e.message}"
  nil
end

#openai_embeddings(access_token: nil, model: "text-embedding-3-small", dimensions: nil) ⇒ Array<Float>?

Generates text embeddings using OpenAI’s embeddings API.

Parameters:

  • access_token (String, nil) (defaults to: nil)

    OpenAI API key (defaults to OPENAI_API_KEY env var)

  • model (String) (defaults to: "text-embedding-3-small")

    The embeddings model (default: text-embedding-3-small)

  • dimensions (Integer, nil) (defaults to: nil)

    The number of dimensions for the output embeddings (nil uses model default)

Returns:

  • (Array<Float>, nil)

    The embedding vector



# File 'lib/ruby-spacy.rb', line 463

def openai_embeddings(access_token: nil, model: "text-embedding-3-small", dimensions: nil)
  client = openai_client(access_token)
  response = client.embeddings(model: model, input: @text, dimensions: dimensions)
  response.dig("data", 0, "embedding")
rescue OpenAIClient::APIError => e
  puts "Error: OpenAI API call failed - #{e.message}"
  nil
end

#openai_query(access_token: nil, max_completion_tokens: nil, max_tokens: nil, temperature: 0.7, model: "gpt-5-mini", messages: [], prompt: nil, response_format: nil, max_tool_call_depth: 5, _tool_call_depth: 0) ⇒ String?

Sends a query to OpenAI’s chat completion API with optional tool support. The get_tokens tool allows the model to request token-level linguistic analysis.

Parameters:

  • access_token (String, nil) (defaults to: nil)

    OpenAI API key (defaults to OPENAI_API_KEY env var)

  • max_completion_tokens (Integer) (defaults to: nil)

    Maximum tokens in the response

  • max_tokens (Integer) (defaults to: nil)

    Alias for max_completion_tokens (deprecated, for backward compatibility)

  • temperature (Float) (defaults to: 0.7)

    Sampling temperature (ignored for GPT-5 models)

  • model (String) (defaults to: "gpt-5-mini")

    The model to use (default: gpt-5-mini)

  • messages (Array<Hash>) (defaults to: [])

    Conversation history (for recursive tool calls). Note: this array is modified in place when tool calls occur.

  • prompt (String, nil) (defaults to: nil)

    System prompt for the query

  • response_format (Hash, nil) (defaults to: nil)

    Response format specification passed through to the chat API (nil for plain text)

  • max_tool_call_depth (Integer) (defaults to: 5)

    Maximum recursion depth for tool-call handling

Returns:

  • (String, nil)

    The model’s response content



# File 'lib/ruby-spacy.rb', line 318

def openai_query(access_token: nil,
                 max_completion_tokens: nil,
                 max_tokens: nil,
                 temperature: 0.7,
                 model: "gpt-5-mini",
                 messages: [],
                 prompt: nil,
                 response_format: nil,
                 max_tool_call_depth: 5,
                 _tool_call_depth: 0)
  # Support both max_completion_tokens and max_tokens for backward compatibility
  max_completion_tokens ||= max_tokens || 1000
  if messages.empty?
    messages = [
      { role: "system", content: prompt },
      { role: "user", content: @text }
    ]
  end

  client = openai_client(access_token)

  # Tool definition for token analysis (GPT-5 tools API format)
  tools = nil
  tool_choice = nil
  if _tool_call_depth < max_tool_call_depth
    tools = [
      {
        type: "function",
        function: {
          name: "get_tokens",
          description: "Tokenize given text and return a list of tokens with their attributes: surface, lemma, tag, pos (part-of-speech), dep (dependency), ent_type (entity type), and morphology",
          parameters: {
            type: "object",
            properties: {
              text: {
                type: "string",
                description: "text to be tokenized"
              }
            },
            required: ["text"]
          }
        }
      }
    ]
    tool_choice = "auto"
  end

  response = client.chat(
    model: model,
    messages: messages,
    max_completion_tokens: max_completion_tokens,
    temperature: temperature,
    tools: tools,
    tool_choice: tool_choice,
    response_format: response_format
  )

  message = response.dig("choices", 0, "message")

  # Handle tool calls (GPT-5 format)
  if message["tool_calls"] && !message["tool_calls"].empty?
    messages << message

    message["tool_calls"].each do |tool_call|
      function_name = tool_call.dig("function", "name")
      tool_call_id = tool_call["id"]

      case function_name
      when "get_tokens"
        result = tokens.map do |t|
          {
            surface: t.text,
            lemma: t.lemma,
            pos: t.pos,
            tag: t.tag,
            dep: t.dep,
            ent_type: t.ent_type,
            morphology: t.morphology
          }
        end.to_json

        messages << {
          role: "tool",
          tool_call_id: tool_call_id,
          content: result
        }
      end
    end

    # Recursive call to get final response after tool execution
    openai_query(
      access_token: access_token,
      max_completion_tokens: max_completion_tokens,
      temperature: temperature,
      model: model,
      messages: messages,
      prompt: prompt,
      response_format: response_format,
      max_tool_call_depth: max_tool_call_depth,
      _tool_call_depth: _tool_call_depth + 1
    )
  else
    message["content"]
  end
rescue OpenAIClient::APIError => e
  puts "Error: OpenAI API call failed - #{e.message}"
  nil
end
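The tool-call loop above can be exercised without the OpenAI API by stubbing the client: the first chat call returns a tool call, the second returns the final answer. `FakeClient` and `run_query` are hypothetical and only illustrate the request/response shape the loop expects, not ruby-spacy's actual API:

```ruby
require "json"

# Stub client: first call requests the get_tokens tool, second call answers.
class FakeClient
  def initialize
    @calls = 0
  end

  def chat(**_kwargs)
    @calls += 1
    if @calls == 1
      { "choices" => [{ "message" => {
        "tool_calls" => [{ "id" => "call_1",
                           "function" => { "name" => "get_tokens",
                                           "arguments" => "{}" } }]
      } }] }
    else
      { "choices" => [{ "message" => { "content" => "done" } }] }
    end
  end
end

# Simplified version of the recursive loop: append the assistant message
# and one tool result per tool call, then recurse for the final answer.
def run_query(client, messages, depth: 0, max_depth: 5)
  message = client.chat(messages: messages).dig("choices", 0, "message")
  if message["tool_calls"] && !message["tool_calls"].empty? && depth < max_depth
    messages << message
    message["tool_calls"].each do |tc|
      messages << { role: "tool", tool_call_id: tc["id"],
                    content: [{ surface: "Hi" }].to_json }
    end
    run_query(client, messages, depth: depth + 1, max_depth: max_depth)
  else
    message["content"]
  end
end

messages = [{ role: "user", content: "Hi" }]
answer = run_query(FakeClient.new, messages)
# answer => "done"; messages now holds user, assistant, and tool entries
```

Note that, as documented for the `messages` parameter, the conversation array is mutated in place as tool calls are appended.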

#respond_to_missing?(sym, include_private = false) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/ruby-spacy.rb', line 489

def respond_to_missing?(sym, include_private = false)
  Spacy.py_hasattr?(@py_doc, sym) || super
end

#retokenize(start_index, end_index, attributes = {}) ⇒ Object

Retokenizes the text, merging a span into a single token.

Parameters:

  • start_index (Integer)

    the start position of the span to be retokenized in the document

  • end_index (Integer)

    the end position of the span to be retokenized in the document

  • attributes (Hash) (defaults to: {})

    attributes to set on the merged token



# File 'lib/ruby-spacy.rb', line 110

def retokenize(start_index, end_index, attributes = {})
  PyCall.with(@py_doc.retokenize) do |retokenizer|
    retokenizer.merge(@py_doc[start_index..end_index], attrs: attributes)
  end
end

#retokenize_split(pos_in_doc, split_array, head_pos_in_split, ancestor_pos, attributes = {}) ⇒ Object

Retokenizes the text, splitting the specified token.

Parameters:

  • pos_in_doc (Integer)

    the position of the span to be retokenized in the document

  • split_array (Array<String>)

    text strings of the split results

  • head_pos_in_split (Integer)

    the position of the head element within the split results

  • ancestor_pos (Integer)

    the position of the immediate ancestor element of the split elements in the document

  • attributes (Hash) (defaults to: {})

    the attributes of the split elements



# File 'lib/ruby-spacy.rb', line 121

def retokenize_split(pos_in_doc, split_array, head_pos_in_split, ancestor_pos, attributes = {})
  PyCall.with(@py_doc.retokenize) do |retokenizer|
    heads = [[@py_doc[pos_in_doc], head_pos_in_split], @py_doc[ancestor_pos]]
    retokenizer.split(@py_doc[pos_in_doc], split_array, heads: heads, attrs: attributes)
  end
end

#sents ⇒ Array<Span>

Returns an array of spans each representing a sentence.

Returns:

  • (Array<Span>)

    an array of spans each representing a sentence

# File 'lib/ruby-spacy.rb', line 185

def sents
  PyCall::List.call(@py_doc.sents).map do |py_sent|
    Span.new(self, start_index: py_sent.start, end_index: py_sent.end - 1)
  end
end

#similarity(other) ⇒ Float

Returns a semantic similarity estimate.

Parameters:

  • other (Doc)

the other doc against which similarity is estimated

Returns:

  • (Float)


# File 'lib/ruby-spacy.rb', line 213

def similarity(other)
  py_doc.similarity(other.py_doc)
end

#span(range_or_start, optional_size = nil) ⇒ Span

Returns a span of the specified range within the doc. The method can be called in either of two ways: `Doc#span(range)` or `Doc#span(start, size_of_span)`.

Parameters:

  • range_or_start (Range, Integer)

    a range object, or, alternatively, an integer that represents the start position of the span

  • optional_size (Integer) (defaults to: nil)

    an integer representing the size of the span

Returns:

  • (Span)

    the span covering the specified range

# File 'lib/ruby-spacy.rb', line 152

def span(range_or_start, optional_size = nil)
  doc_len = PyCall.len(@py_doc)

  if optional_size
    start_index = range_or_start
    start_index += doc_len if start_index < 0
    end_index = start_index + optional_size - 1
  else
    range = range_or_start
    start_index = range.first
    start_index += doc_len if start_index < 0
    end_val = range.end
    if end_val.nil?
      end_index = doc_len - 1
    else
      end_val += doc_len if end_val < 0
      end_index = range.exclude_end? ? end_val - 1 : end_val
    end
  end

  Span.new(self, start_index: start_index, end_index: end_index)
end
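The index normalization used above can be extracted and checked on its own: negative indices wrap around the document length, an endless range runs to the last token, and exclusive ranges drop their end point. The `normalize_span` helper is a standalone replica of that logic, not part of ruby-spacy:

```ruby
# Replicates Doc#span's index arithmetic, returning [start, end] indices.
def normalize_span(range_or_start, optional_size, doc_len)
  if optional_size
    start_index = range_or_start
    start_index += doc_len if start_index < 0
    [start_index, start_index + optional_size - 1]
  else
    range = range_or_start
    start_index = range.first
    start_index += doc_len if start_index < 0
    end_val = range.end
    if end_val.nil?
      [start_index, doc_len - 1]                          # endless range: 3..
    else
      end_val += doc_len if end_val < 0                   # negative end wraps
      [start_index, range.exclude_end? ? end_val - 1 : end_val]
    end
  end
end

normalize_span(1..3, nil, 10)   # => [1, 3]
normalize_span(1...4, nil, 10)  # => [1, 3]
normalize_span(3..-1, nil, 10)  # => [3, 9]
normalize_span(2, 3, 10)        # => [2, 4]
```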

#to_bytes ⇒ String

Serializes the doc to a binary string. The binary data includes all annotations (tokens, entities, etc.) and can be used to restore the doc later without re-processing.

Examples:

Save doc to file

doc = nlp.read("Hello world")
File.binwrite("doc.bin", doc.to_bytes)

Returns:

  • (String)

    binary representation of the doc



# File 'lib/ruby-spacy.rb', line 224

def to_bytes
  @py_doc.to_bytes.force_encoding(Encoding::BINARY)
end
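The `force_encoding` call matters because the returned string is raw data, not UTF-8 text: tagging it as BINARY (ASCII-8BIT) keeps byte-oriented operations and `File.binwrite` from treating the bytes as characters. A small illustration of the distinction (the sample bytes are arbitrary):

```ruby
# Multibyte content tagged as BINARY is handled byte-by-byte.
data = "\xE3\x81\x82raw".force_encoding(Encoding::BINARY)
data.encoding  # => #<Encoding:BINARY (ASCII-8BIT)>
data.bytesize  # => 6 (three bytes of the multibyte sequence + "raw")
```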

#to_s ⇒ String

String representation of the document.

Returns:

  • (String)


# File 'lib/ruby-spacy.rb', line 130

def to_s
  @text
end

#tokens ⇒ Array<Token>

Returns an array of tokens contained in the doc.

Returns:

  • (Array<Token>)

    an array of tokens contained in the doc

# File 'lib/ruby-spacy.rb', line 136

def tokens
  PyCall::List.call(@py_doc).map { |py_token| Token.new(py_token) }
end