Class: Spacy::Doc
Overview
See also the spaCy Python API documentation for [Doc](https://spacy.io/api/doc).
Instance Attribute Summary
-
#py_doc ⇒ Object
readonly
A Python Doc instance accessible via PyCall.
-
#py_nlp ⇒ Object
readonly
A Python Language instance accessible via PyCall.
-
#text ⇒ String
readonly
A text string of the document.
Class Method Summary
-
.from_bytes(nlp, byte_string) ⇒ Doc
Restores a doc from binary data created by #to_bytes.
Instance Method Summary
-
#[](range) ⇒ Object
Returns a span if given a range object, or a token if given an integer position in the doc.
-
#displacy(style: "dep", compact: false) ⇒ String
Visualize the document in one of two styles: “dep” (dependencies) or “ent” (named entities).
-
#each ⇒ Object
Iterates over the elements in the doc yielding a token instance each time.
-
#ents ⇒ Array<Span>
Returns an array of spans each representing a named entity.
-
#initialize(nlp, py_doc: nil, text: nil, max_retrial: MAX_RETRIAL) ⇒ Doc
constructor
It is recommended to use the Language#read method to create a doc.
- #instance_variables_to_inspect ⇒ Object
-
#linguistic_summary(sections: [:text, :tokens, :entities, :noun_chunks], token_attributes: [:text, :lemma, :pos, :dep, :head]) ⇒ String
Generates a JSON string summarizing the linguistic analysis of the document.
-
#method_missing(name, *args) ⇒ Object
Methods defined in Python but not wrapped in ruby-spacy can be called by this dynamic method handling mechanism.
-
#noun_chunks ⇒ Array<Span>
Returns an array of spans representing noun chunks.
-
#openai_completion(access_token: nil, max_completion_tokens: nil, max_tokens: nil, temperature: 0.7, model: "gpt-5-mini") ⇒ String?
Sends a text completion request to OpenAI’s chat API.
-
#openai_embeddings(access_token: nil, model: "text-embedding-3-small", dimensions: nil) ⇒ Array<Float>?
Generates text embeddings using OpenAI’s embeddings API.
-
#openai_query(access_token: nil, max_completion_tokens: nil, max_tokens: nil, temperature: 0.7, model: "gpt-5-mini", messages: [], prompt: nil, response_format: nil, max_tool_call_depth: 5, _tool_call_depth: 0) ⇒ String?
Sends a query to OpenAI’s chat completion API with optional tool support.
- #respond_to_missing?(sym, include_private = false) ⇒ Boolean
-
#retokenize(start_index, end_index, attributes = {}) ⇒ Object
Retokenizes the text by merging a span into a single token.
-
#retokenize_split(pos_in_doc, split_array, head_pos_in_split, ancestor_pos, attributes = {}) ⇒ Object
Retokenizes the text by splitting the specified token.
-
#sents ⇒ Array<Span>
Returns an array of spans each representing a sentence.
-
#similarity(other) ⇒ Float
Returns a semantic similarity estimate.
-
#span(range_or_start, optional_size = nil) ⇒ Span
Returns a span of the specified range within the doc.
-
#to_bytes ⇒ String
Serializes the doc to a binary string.
-
#to_s ⇒ String
String representation of the document.
-
#tokens ⇒ Array<Token>
Returns an array of tokens contained in the doc.
Constructor Details
#initialize(nlp, py_doc: nil, text: nil, max_retrial: MAX_RETRIAL) ⇒ Doc
It is recommended to use the Language#read method to create a doc. If you need to create one with #initialize, there are two method signatures: `Spacy::Doc.new(nlp_id, py_doc: Object)` and `Spacy::Doc.new(nlp_id, text: String)`.
# File 'lib/ruby-spacy.rb', line 91
def initialize(nlp, py_doc: nil, text: nil, max_retrial: MAX_RETRIAL)
  @py_nlp = nlp
  retrial = 0
  begin
    @py_doc = py_doc || nlp.call(text)
    @text = @py_doc.text
  rescue StandardError
    retrial += 1
    raise "Error: Failed to construct a Doc object" unless retrial <= max_retrial

    sleep 0.5
    retry
  end
end
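The constructor retries the spaCy call a bounded number of times before giving up. The retry pattern can be sketched in dependency-free Ruby (the `sleep 0.5` back-off is omitted here for brevity, and `with_retrial` is a hypothetical helper, not part of ruby-spacy):

```ruby
# Bounded-retry sketch: run a block, retrying up to `max_retrial` extra
# attempts on StandardError before raising the same error message the
# Doc constructor uses.
def with_retrial(max_retrial)
  retrial = 0
  begin
    yield
  rescue StandardError
    retrial += 1
    raise "Error: Failed to construct a Doc object" unless retrial <= max_retrial

    retry
  end
end

attempts = 0
result = with_retrial(3) do
  attempts += 1
  raise "transient failure" if attempts < 3 # fail twice, then succeed

  "parsed"
end
```

With the block above, the third attempt succeeds, so `result` is `"parsed"` and `attempts` is 3.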
Dynamic Method Handling
This class handles dynamic methods through the method_missing method.
#method_missing(name, *args) ⇒ Object
Methods defined in Python but not wrapped in ruby-spacy can be called by this dynamic method handling mechanism.
# File 'lib/ruby-spacy.rb', line 485
def method_missing(name, *args)
  @py_doc.send(name, *args)
end
Instance Attribute Details
#py_doc ⇒ Object (readonly)
Returns a Python Doc instance accessible via PyCall.
# File 'lib/ruby-spacy.rb', line 74
def py_doc
  @py_doc
end
#py_nlp ⇒ Object (readonly)
Returns a Python Language instance accessible via PyCall.
# File 'lib/ruby-spacy.rb', line 71
def py_nlp
  @py_nlp
end
#text ⇒ String (readonly)
Returns a text string of the document.
# File 'lib/ruby-spacy.rb', line 77
def text
  @text
end
Class Method Details
.from_bytes(nlp, byte_string) ⇒ Doc
Restores a doc from binary data created by #to_bytes. This is useful for caching processed documents to avoid re-processing.
# File 'lib/ruby-spacy.rb', line 235
def self.from_bytes(nlp, byte_string)
  b64 = Base64.strict_encode64(byte_string)
  py_bytes = PyCall.eval("__import__('base64').b64decode('#{b64}')")
  py_doc = nlp.py_nlp.call("").from_bytes(py_bytes)
  new(nlp.py_nlp, py_doc: py_doc)
end
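The method moves the binary string into Python by Base64-encoding it first: `strict_encode64` produces a plain ASCII string with no newlines, which is safe to interpolate into a `PyCall.eval` expression. The Ruby half of that round trip can be shown on its own:

```ruby
require "base64"

# Arbitrary binary data, forced to BINARY (ASCII-8BIT) encoding, standing in
# for the output of Doc#to_bytes.
raw = "\x00\xFFspaCy".b

# strict_encode64 yields an ASCII-safe string with no embedded newlines,
# suitable for embedding in an eval'd expression.
b64 = Base64.strict_encode64(raw)

# Decoding recovers the original bytes exactly.
restored = Base64.decode64(b64)
```

Here `restored == raw` holds, which is why the doc survives the Ruby-to-Python hand-off byte for byte.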
Instance Method Details
#[](range) ⇒ Object
Returns a span if given a range object, or a token if given an integer position in the doc.
# File 'lib/ruby-spacy.rb', line 201
def [](range)
  if range.is_a?(Range)
    py_span = @py_doc[range]
    Span.new(self, start_index: py_span.start, end_index: py_span.end - 1)
  else
    Token.new(@py_doc[range])
  end
end
#displacy(style: "dep", compact: false) ⇒ String
Visualize the document in one of two styles: “dep” (dependencies) or “ent” (named entities).
# File 'lib/ruby-spacy.rb', line 246
def displacy(style: "dep", compact: false)
  PyDisplacy.render(py_doc, style: style, options: { compact: compact }, jupyter: false)
end
#each ⇒ Object
Iterates over the elements in the doc yielding a token instance each time.
# File 'lib/ruby-spacy.rb', line 141
def each
  PyCall::List.call(@py_doc).each do |py_token|
    yield Token.new(py_token)
  end
end
#ents ⇒ Array<Span>
Returns an array of spans each representing a named entity.
# File 'lib/ruby-spacy.rb', line 193
def ents
  PyCall::List.call(@py_doc.ents).map do |py_span|
    Span.new(self, py_span: py_span)
  end
end
#instance_variables_to_inspect ⇒ Object
# File 'lib/ruby-spacy.rb', line 493
def instance_variables_to_inspect
  [:@text]
end
#linguistic_summary(sections: [:text, :tokens, :entities, :noun_chunks], token_attributes: [:text, :lemma, :pos, :dep, :head]) ⇒ String
Generates a JSON string summarizing the linguistic analysis of the document. Designed to be passed as context to an LLM (e.g., via OpenAIHelper#chat).
# File 'lib/ruby-spacy.rb', line 258
def linguistic_summary(sections: [:text, :tokens, :entities, :noun_chunks],
                       token_attributes: [:text, :lemma, :pos, :dep, :head])
  result = {}
  sections.each do |section|
    case section
    when :text
      result[:text] = @text
    when :tokens
      result[:tokens] = tokens.map do |token|
        token_hash = {}
        token_attributes.each do |attr|
          case attr
          when :head
            token_hash[:head] = token.head.text
          when :morphology
            # Use string form and parse to ensure a plain Ruby Hash for JSON serialization
            morph_str = token.morphology(hash: false)
            token_hash[:morphology] = if morph_str.empty?
                                        {}
                                      else
                                        morph_str.split("|").each_with_object({}) do |pair, h|
                                          k, v = pair.split("=", 2)
                                          h[k] = v
                                        end
                                      end
          else
            token_hash[attr] = token.send(attr)
          end
        end
        token_hash
      end
    when :entities
      ent_list = ents
      result[:entities] = ent_list.map do |ent|
        { text: ent.text, label: ent.label }
      end
    when :noun_chunks
      result[:noun_chunks] = noun_chunks.map do |chunk|
        { text: chunk.text, root: chunk.root.text }
      end
    when :sentences
      result[:sentences] = sents.map(&:text)
    end
  end
  result.to_json
end
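The morphology branch above parses spaCy's `"Key=Val|Key=Val"` string form into a plain Ruby Hash so it serializes cleanly to JSON. That parsing step works standalone (the sample string below is illustrative, not from a real parse):

```ruby
# spaCy renders token morphology as pipe-separated Key=Val pairs.
morph_str = "Case=Nom|Number=Sing|Person=3"

# Split on "|", then on the first "=" only, accumulating into a Hash.
morph = if morph_str.empty?
          {}
        else
          morph_str.split("|").each_with_object({}) do |pair, h|
            k, v = pair.split("=", 2)
            h[k] = v
          end
        end
```

This yields `{"Case" => "Nom", "Number" => "Sing", "Person" => "3"}`; the `split("=", 2)` limit keeps any `=` inside a value intact.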
#noun_chunks ⇒ Array<Span>
Returns an array of spans representing noun chunks.
# File 'lib/ruby-spacy.rb', line 177
def noun_chunks
  PyCall::List.call(@py_doc.noun_chunks).map do |py_chunk|
    Span.new(self, start_index: py_chunk.start, end_index: py_chunk.end - 1)
  end
end
#openai_completion(access_token: nil, max_completion_tokens: nil, max_tokens: nil, temperature: 0.7, model: "gpt-5-mini") ⇒ String?
Sends a text completion request to OpenAI’s chat API.
# File 'lib/ruby-spacy.rb', line 435
def openai_completion(access_token: nil, max_completion_tokens: nil, max_tokens: nil,
                      temperature: 0.7, model: "gpt-5-mini")
  # Support both max_completion_tokens and max_tokens for backward compatibility
  max_completion_tokens ||= max_tokens || 1000
  messages = [
    { role: "system", content: "Complete the text input by the user." },
    { role: "user", content: @text }
  ]
  client = openai_client(access_token)
  response = client.chat(
    model: model,
    messages: messages,
    max_completion_tokens: max_completion_tokens,
    temperature: temperature
  )
  response.dig("choices", 0, "message", "content")
rescue OpenAIClient::APIError => e
  puts "Error: OpenAI API call failed - #{e.message}"
  nil
end
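The first line of the method body resolves the token limit with a simple precedence chain: the current `max_completion_tokens` wins, the deprecated `max_tokens` is honored as a fallback, and 1000 is the default. Extracted as a standalone sketch (`resolve_token_limit` is a hypothetical helper for illustration):

```ruby
# Precedence: max_completion_tokens > max_tokens (deprecated) > 1000.
def resolve_token_limit(max_completion_tokens: nil, max_tokens: nil)
  max_completion_tokens || max_tokens || 1000
end
```

So `resolve_token_limit(max_tokens: 200)` returns 200, while calling it with no arguments returns 1000.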
#openai_embeddings(access_token: nil, model: "text-embedding-3-small", dimensions: nil) ⇒ Array<Float>?
Generates text embeddings using OpenAI’s embeddings API.
# File 'lib/ruby-spacy.rb', line 463
def openai_embeddings(access_token: nil, model: "text-embedding-3-small", dimensions: nil)
  client = openai_client(access_token)
  response = client.embeddings(model: model, input: @text, dimensions: dimensions)
  response.dig("data", 0, "embedding")
rescue OpenAIClient::APIError => e
  puts "Error: OpenAI API call failed - #{e.message}"
  nil
end
#openai_query(access_token: nil, max_completion_tokens: nil, max_tokens: nil, temperature: 0.7, model: "gpt-5-mini", messages: [], prompt: nil, response_format: nil, max_tool_call_depth: 5, _tool_call_depth: 0) ⇒ String?
Sends a query to OpenAI’s chat completion API with optional tool support. The get_tokens tool allows the model to request token-level linguistic analysis.
# File 'lib/ruby-spacy.rb', line 318
def openai_query(access_token: nil, max_completion_tokens: nil, max_tokens: nil,
                 temperature: 0.7, model: "gpt-5-mini", messages: [], prompt: nil,
                 response_format: nil, max_tool_call_depth: 5, _tool_call_depth: 0)
  # Support both max_completion_tokens and max_tokens for backward compatibility
  max_completion_tokens ||= max_tokens || 1000
  if messages.empty?
    messages = [
      { role: "system", content: prompt },
      { role: "user", content: @text }
    ]
  end
  client = openai_client(access_token)

  # Tool definition for token analysis (GPT-5 tools API format)
  tools = nil
  tool_choice = nil
  if _tool_call_depth < max_tool_call_depth
    tools = [
      {
        type: "function",
        function: {
          name: "get_tokens",
          description: "Tokenize given text and return a list of tokens with their attributes: surface, lemma, tag, pos (part-of-speech), dep (dependency), ent_type (entity type), and morphology",
          parameters: {
            type: "object",
            properties: {
              text: { type: "string", description: "text to be tokenized" }
            },
            required: ["text"]
          }
        }
      }
    ]
    tool_choice = "auto"
  end

  response = client.chat(
    model: model,
    messages: messages,
    max_completion_tokens: max_completion_tokens,
    temperature: temperature,
    tools: tools,
    tool_choice: tool_choice,
    response_format: response_format
  )

  message = response.dig("choices", 0, "message")

  # Handle tool calls (GPT-5 format)
  if message["tool_calls"] && !message["tool_calls"].empty?
    messages << message
    message["tool_calls"].each do |tool_call|
      function_name = tool_call.dig("function", "name")
      tool_call_id = tool_call["id"]
      case function_name
      when "get_tokens"
        result = tokens.map do |t|
          {
            surface: t.text,
            lemma: t.lemma,
            pos: t.pos,
            tag: t.tag,
            dep: t.dep,
            ent_type: t.ent_type,
            morphology: t.morphology
          }
        end.to_json
        messages << { role: "tool", tool_call_id: tool_call_id, content: result }
      end
    end
    # Recursive call to get final response after tool execution
    openai_query(
      access_token: access_token,
      max_completion_tokens: max_completion_tokens,
      temperature: temperature,
      model: model,
      messages: messages,
      prompt: prompt,
      response_format: response_format,
      max_tool_call_depth: max_tool_call_depth,
      _tool_call_depth: _tool_call_depth + 1
    )
  else
    message["content"]
  end
rescue OpenAIClient::APIError => e
  puts "Error: OpenAI API call failed - #{e.message}"
  nil
end
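The control flow here is a depth-limited loop: each tool call appends the assistant message and a tool result to the transcript, then the method recurses until the model answers in plain content or the depth budget is spent. A dependency-free sketch of that loop, with a hypothetical `fake_chat` standing in for the OpenAI client:

```ruby
# Stand-in for client.chat: requests the get_tokens tool until the transcript
# already contains a tool result, then returns a final content message.
def fake_chat(messages, tools_offered)
  if tools_offered && messages.none? { |m| m[:role] == "tool" }
    { "tool_calls" => [{ "id" => "call_1", "function" => { "name" => "get_tokens" } }] }
  else
    { "content" => "final answer" }
  end
end

# Depth-limited query loop mirroring the shape of #openai_query.
def query(messages, depth: 0, max_depth: 5)
  tools_offered = depth < max_depth # stop offering tools past the budget
  message = fake_chat(messages, tools_offered)
  if message["tool_calls"]&.any?
    messages << message # keep the assistant's tool request in the transcript
    message["tool_calls"].each do |tc|
      messages << { role: "tool", tool_call_id: tc["id"], content: "[tokens...]" }
    end
    query(messages, depth: depth + 1, max_depth: max_depth) # recurse for the final answer
  else
    message["content"]
  end
end
```

One round of tool use is enough here: the first call requests `get_tokens`, the tool result is appended, and the recursive call returns `"final answer"`.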
#respond_to_missing?(sym, include_private = false) ⇒ Boolean
# File 'lib/ruby-spacy.rb', line 489
def respond_to_missing?(sym, include_private = false)
  Spacy.py_hasattr?(@py_doc, sym) || super
end
#retokenize(start_index, end_index, attributes = {}) ⇒ Object
Retokenizes the text by merging a span into a single token.
# File 'lib/ruby-spacy.rb', line 110
def retokenize(start_index, end_index, attributes = {})
  PyCall.with(@py_doc.retokenize) do |retokenizer|
    retokenizer.merge(@py_doc[start_index..end_index], attrs: attributes)
  end
end
#retokenize_split(pos_in_doc, split_array, head_pos_in_split, ancestor_pos, attributes = {}) ⇒ Object
Retokenizes the text by splitting the specified token.
# File 'lib/ruby-spacy.rb', line 121
def retokenize_split(pos_in_doc, split_array, head_pos_in_split, ancestor_pos, attributes = {})
  PyCall.with(@py_doc.retokenize) do |retokenizer|
    heads = [[@py_doc[pos_in_doc], head_pos_in_split], @py_doc[ancestor_pos]]
    retokenizer.split(@py_doc[pos_in_doc], split_array, heads: heads, attrs: attributes)
  end
end
#sents ⇒ Array<Span>
Returns an array of spans each representing a sentence.
# File 'lib/ruby-spacy.rb', line 185
def sents
  PyCall::List.call(@py_doc.sents).map do |py_sent|
    Span.new(self, start_index: py_sent.start, end_index: py_sent.end - 1)
  end
end
#similarity(other) ⇒ Float
Returns a semantic similarity estimate.
# File 'lib/ruby-spacy.rb', line 213
def similarity(other)
  py_doc.similarity(other.py_doc)
end
#span(range_or_start, optional_size = nil) ⇒ Span
Returns a span of the specified range within the doc. The method can be used in either of two ways: `Doc#span(range)` or `Doc#span(start, size_of_span)`.
# File 'lib/ruby-spacy.rb', line 152
def span(range_or_start, optional_size = nil)
  doc_len = PyCall.len(@py_doc)
  if optional_size
    start_index = range_or_start
    start_index += doc_len if start_index < 0
    end_index = start_index + optional_size - 1
  else
    range = range_or_start
    start_index = range.first
    start_index += doc_len if start_index < 0
    end_val = range.end
    if end_val.nil?
      end_index = doc_len - 1
    else
      end_val += doc_len if end_val < 0
      end_index = range.exclude_end? ? end_val - 1 : end_val
    end
  end
  Span.new(self, start_index: start_index, end_index: end_index)
end
#to_bytes ⇒ String
Serializes the doc to a binary string. The binary data includes all annotations (tokens, entities, etc.) and can be used to restore the doc later without re-processing.
# File 'lib/ruby-spacy.rb', line 224
def to_bytes
  @py_doc.to_bytes.force_encoding(Encoding::BINARY)
end
#to_s ⇒ String
String representation of the document.
# File 'lib/ruby-spacy.rb', line 130
def to_s
  @text
end
#tokens ⇒ Array<Token>
Returns an array of tokens contained in the doc.
# File 'lib/ruby-spacy.rb', line 136
def tokens
  PyCall::List.call(@py_doc).map { |py_token| Token.new(py_token) }
end