Class: Spacy::Doc

Inherits:
Object
Includes:
Enumerable
Defined in:
lib/ruby-spacy.rb

Overview

See also the spaCy Python API documentation for [Doc](https://spacy.io/api/doc).

Instance Attribute Summary

Class Method Summary

Instance Method Summary

Constructor Details

#initialize(nlp, py_doc: nil, text: nil, max_retrial: MAX_RETRIAL) ⇒ Doc

It is recommended to use the Language#read method to create a doc. If you need to create one with #initialize, there are two supported call patterns: `Spacy::Doc.new(nlp, py_doc: py_doc)` and `Spacy::Doc.new(nlp, text: text)`.

Parameters:

  • nlp (Language)

    an instance of Language class

  • py_doc (Object) (defaults to: nil)

    an instance of Python Doc class

  • text (String) (defaults to: nil)

    the text string to be analyzed

  • max_retrial (Integer) (defaults to: MAX_RETRIAL)

    the maximum number of retries when document construction fails



# File 'lib/ruby-spacy.rb', line 91

def initialize(nlp, py_doc: nil, text: nil, max_retrial: MAX_RETRIAL)
  @py_nlp = nlp
  retrial = 0
  begin
    @py_doc = py_doc || nlp.call(text)
    @text = @py_doc.text
  rescue StandardError
    retrial += 1
    raise "Error: Failed to construct a Doc object" unless retrial <= max_retrial

    sleep 0.5
    retry
  end
end
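The constructor's retry loop is a reusable pattern: attempt the operation, and on failure sleep briefly and retry up to a bounded number of times. A minimal standalone sketch (the `with_retries` helper and the simulated failure are illustrative, not part of ruby-spacy):

```ruby
# Generic retry-with-delay pattern, mirroring the constructor above.
def with_retries(max_retrial, delay: 0)
  retrial = 0
  begin
    yield
  rescue StandardError
    retrial += 1
    raise "Error: operation failed after #{max_retrial} retries" unless retrial <= max_retrial

    sleep delay
    retry
  end
end

# Simulate a call that fails twice and then succeeds.
attempts = 0
result = with_retries(3) do
  attempts += 1
  raise "transient failure" if attempts < 3

  "ok"
end
# result  => "ok"
# attempts => 3
```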

Dynamic Method Handling

This class handles dynamic methods through the method_missing method

#method_missing(name, *args) ⇒ Object

Methods defined in Python but not wrapped in ruby-spacy can be called by this dynamic method handling mechanism.



# File 'lib/ruby-spacy.rb', line 485

def method_missing(name, *args)
  @py_doc.send(name, *args)
end
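The same delegation idiom can be seen in isolation. Here a hypothetical wrapper class forwards any method it does not define itself to an inner object, as Doc does with its Python counterpart (pairing `method_missing` with `respond_to_missing?`, as Doc also does further below):

```ruby
# A minimal wrapper that forwards unknown methods to the wrapped object.
class Wrapper
  def initialize(inner)
    @inner = inner
  end

  def method_missing(name, *args)
    @inner.send(name, *args)
  end

  def respond_to_missing?(sym, include_private = false)
    @inner.respond_to?(sym, include_private) || super
  end
end

w = Wrapper.new("hello world")
w.upcase      # forwarded to String#upcase => "HELLO WORLD"
w.split.size  # forwarded to String#split  => 2
```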

Instance Attribute Details

#py_doc ⇒ Object (readonly)

Returns a Python Doc instance accessible via PyCall.

Returns:

  • (Object)

    a Python Doc instance accessible via PyCall



# File 'lib/ruby-spacy.rb', line 74

def py_doc
  @py_doc
end

#py_nlp ⇒ Object (readonly)

Returns a Python Language instance accessible via PyCall.

Returns:

  • (Object)

    a Python Language instance accessible via PyCall



# File 'lib/ruby-spacy.rb', line 71

def py_nlp
  @py_nlp
end

#text ⇒ String (readonly)

Returns a text string of the document.

Returns:

  • (String)

    a text string of the document



# File 'lib/ruby-spacy.rb', line 77

def text
  @text
end

Class Method Details

.from_bytes(nlp, byte_string) ⇒ Doc

Restores a doc from binary data created by #to_bytes. This is useful for caching processed documents to avoid re-processing.

Examples:

Load doc from file

bytes = File.binread("doc.bin")
doc = Spacy::Doc.from_bytes(nlp, bytes)

Parameters:

  • nlp (Language)

    the Language instance used to restore the doc

  • byte_string (String)

    binary data from #to_bytes

Returns:

  • (Doc)

    the restored doc



# File 'lib/ruby-spacy.rb', line 235

def self.from_bytes(nlp, byte_string)
  b64 = Base64.strict_encode64(byte_string)
  py_bytes = PyCall.eval("__import__('base64').b64decode('#{b64}')")
  py_doc = nlp.py_nlp.call("").from_bytes(py_bytes)
  new(nlp.py_nlp, py_doc: py_doc)
end
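The Base64 round trip above exists to pass raw bytes into Python safely through an `eval` string: strict Base64 output contains no quotes or newlines, so it can be embedded verbatim. The Ruby half of that encoding can be checked on its own:

```ruby
require "base64"

# Binary data round-trips losslessly through strict Base64, and the
# encoded form contains only characters safe to embed in a Python
# source string.
bytes = (0..255).map(&:chr).join.force_encoding(Encoding::BINARY)
b64 = Base64.strict_encode64(bytes)

restored = Base64.strict_decode64(b64)
# restored == bytes  => true
```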

Instance Method Details

#[](range) ⇒ Object

Returns a span if given a range object; or returns a token if given an integer representing a position in the doc.

Parameters:

  • range (Range)

an ordinary Ruby range object such as `0..3`, `1...4`, or `3..-1`



# File 'lib/ruby-spacy.rb', line 201

def [](range)
  if range.is_a?(Range)
    py_span = @py_doc[range]
    Span.new(self, start_index: py_span.start, end_index: py_span.end - 1)
  else
    Token.new(@py_doc[range])
  end
end

#displacy(style: "dep", compact: false) ⇒ String

Visualize the document in one of two styles: “dep” (dependencies) or “ent” (named entities).

Parameters:

  • style (String) (defaults to: "dep")

    either dep or ent

  • compact (Boolean) (defaults to: false)

only relevant to the "dep" style

Returns:

  • (String)

in the case of "dep", the output is an SVG string, whereas in the "ent" style it is an HTML string



# File 'lib/ruby-spacy.rb', line 246

def displacy(style: "dep", compact: false)
  PyDisplacy.render(py_doc, style: style, options: { compact: compact }, jupyter: false)
end

#each ⇒ Object

Iterates over the elements in the doc yielding a token instance each time.



# File 'lib/ruby-spacy.rb', line 141

def each
  PyCall::List.call(@py_doc).each do |py_token|
    yield Token.new(py_token)
  end
end

#ents ⇒ Array<Span>

Returns an array of spans each representing a named entity.

Returns:

  • (Array<Span>)

    an array of spans each representing a named entity

# File 'lib/ruby-spacy.rb', line 193

def ents
  PyCall::List.call(@py_doc.ents).map do |py_span|
    Span.new(self, py_span: py_span)
  end
end

#instance_variables_to_inspect ⇒ Object



# File 'lib/ruby-spacy.rb', line 493

def instance_variables_to_inspect
  [:@text]
end

#linguistic_summary(sections: [:text, :tokens, :entities, :noun_chunks], token_attributes: [:text, :lemma, :pos, :dep, :head]) ⇒ String

Generates a JSON string summarizing the linguistic analysis of the document. Designed to be passed as context to an LLM (e.g., via OpenAIHelper#chat).

Parameters:

  • sections (Array<Symbol>) (defaults to: [:text, :tokens, :entities, :noun_chunks])

    which sections to include (:text, :tokens, :entities, :noun_chunks, :sentences)

  • token_attributes (Array<Symbol>) (defaults to: [:text, :lemma, :pos, :dep, :head])

    which token attributes to include (:text, :lemma, :pos, :tag, :dep, :head, :ent_type, :morphology)

Returns:

  • (String)

    a JSON string of the linguistic summary



# File 'lib/ruby-spacy.rb', line 258

def linguistic_summary(sections: [:text, :tokens, :entities, :noun_chunks],
                       token_attributes: [:text, :lemma, :pos, :dep, :head])
  result = {}

  sections.each do |section|
    case section
    when :text
      result[:text] = @text
    when :tokens
      result[:tokens] = tokens.map do |token|
        token_hash = {}
        token_attributes.each do |attr|
          case attr
          when :head
            token_hash[:head] = token.head.text
          when :morphology
            # Use string form and parse to ensure a plain Ruby Hash for JSON serialization
            morph_str = token.morphology(hash: false)
            token_hash[:morphology] = if morph_str.empty?
                                        {}
                                      else
                                        morph_str.split("|").each_with_object({}) do |pair, h|
                                          k, v = pair.split("=", 2)
                                          h[k] = v
                                        end
                                      end
          else
            token_hash[attr] = token.send(attr)
          end
        end
        token_hash
      end
    when :entities
      ent_list = ents
      result[:entities] = ent_list.map do |ent|
        { text: ent.text, label: ent.label }
      end
    when :noun_chunks
      result[:noun_chunks] = noun_chunks.map do |chunk|
        { text: chunk.text, root: chunk.root.text }
      end
    when :sentences
      result[:sentences] = sents.map(&:text)
    end
  end

  result.to_json
end
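The `Case=Nom|Number=Sing`-style morphology string parsed inside `linguistic_summary` can be exercised standalone; this extracts the parsing step into a small helper (the method name is illustrative):

```ruby
require "json"

# Parse spaCy's "Key=Value|Key=Value" morphology string into a plain
# Ruby Hash, as linguistic_summary does before JSON serialization.
def parse_morphology(morph_str)
  return {} if morph_str.empty?

  morph_str.split("|").each_with_object({}) do |pair, h|
    k, v = pair.split("=", 2)
    h[k] = v
  end
end

parse_morphology("Case=Nom|Number=Sing")
# => { "Case" => "Nom", "Number" => "Sing" }
parse_morphology("")
# => {}
```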

#noun_chunks ⇒ Array<Span>

Returns an array of spans representing noun chunks.

Returns:

  • (Array<Span>)

    an array of spans representing noun chunks

# File 'lib/ruby-spacy.rb', line 177

def noun_chunks
  PyCall::List.call(@py_doc.noun_chunks).map do |py_chunk|
    Span.new(self, start_index: py_chunk.start, end_index: py_chunk.end - 1)
  end
end

#openai_completion(access_token: nil, max_completion_tokens: nil, max_tokens: nil, temperature: 0.7, model: "gpt-5-mini") ⇒ String?

Sends a text completion request to OpenAI’s chat API.

Parameters:

  • access_token (String, nil) (defaults to: nil)

    OpenAI API key (defaults to OPENAI_API_KEY env var)

  • max_completion_tokens (Integer) (defaults to: nil)

    Maximum tokens in the response

  • max_tokens (Integer) (defaults to: nil)

    Alias for max_completion_tokens (deprecated, for backward compatibility)

  • temperature (Float) (defaults to: 0.7)

    Sampling temperature (ignored for GPT-5 models)

  • model (String) (defaults to: "gpt-5-mini")

    The model to use (default: gpt-5-mini)

Returns:

  • (String, nil)

    The completed text



# File 'lib/ruby-spacy.rb', line 435

def openai_completion(access_token: nil, max_completion_tokens: nil, max_tokens: nil, temperature: 0.7, model: "gpt-5-mini")
  # Support both max_completion_tokens and max_tokens for backward compatibility
  max_completion_tokens ||= max_tokens || 1000

  messages = [
    { role: "system", content: "Complete the text input by the user." },
    { role: "user", content: @text }
  ]

  client = openai_client(access_token)
  response = client.chat(
    model: model,
    messages: messages,
    max_completion_tokens: max_completion_tokens,
    temperature: temperature
  )
  response.dig("choices", 0, "message", "content")
rescue OpenAIClient::APIError => e
  puts "Error: OpenAI API call failed - #{e.message}"
  nil
end

#openai_embeddings(access_token: nil, model: "text-embedding-3-small", dimensions: nil) ⇒ Array<Float>?

Generates text embeddings using OpenAI’s embeddings API.

Parameters:

  • access_token (String, nil) (defaults to: nil)

    OpenAI API key (defaults to OPENAI_API_KEY env var)

  • model (String) (defaults to: "text-embedding-3-small")

    The embeddings model (default: text-embedding-3-small)

  • dimensions (Integer, nil) (defaults to: nil)

    The number of dimensions for the output embeddings (nil uses model default)

Returns:

  • (Array<Float>, nil)

    The embedding vector



# File 'lib/ruby-spacy.rb', line 463

def openai_embeddings(access_token: nil, model: "text-embedding-3-small", dimensions: nil)
  client = openai_client(access_token)
  response = client.embeddings(model: model, input: @text, dimensions: dimensions)
  response.dig("data", 0, "embedding")
rescue OpenAIClient::APIError => e
  puts "Error: OpenAI API call failed - #{e.message}"
  nil
end

#openai_query(access_token: nil, max_completion_tokens: nil, max_tokens: nil, temperature: 0.7, model: "gpt-5-mini", messages: [], prompt: nil, response_format: nil, max_tool_call_depth: 5, _tool_call_depth: 0) ⇒ String?

Sends a query to OpenAI’s chat completion API with optional tool support. The get_tokens tool allows the model to request token-level linguistic analysis.

Parameters:

  • access_token (String, nil) (defaults to: nil)

    OpenAI API key (defaults to OPENAI_API_KEY env var)

  • max_completion_tokens (Integer) (defaults to: nil)

    Maximum tokens in the response

  • max_tokens (Integer) (defaults to: nil)

    Alias for max_completion_tokens (deprecated, for backward compatibility)

  • temperature (Float) (defaults to: 0.7)

    Sampling temperature (ignored for GPT-5 models)

  • model (String) (defaults to: "gpt-5-mini")

    The model to use (default: gpt-5-mini)

  • messages (Array<Hash>) (defaults to: [])

    Conversation history (for recursive tool calls). Note: this array is modified in place when tool calls occur.

  • prompt (String, nil) (defaults to: nil)

    System prompt for the query

  • response_format (Hash, nil) (defaults to: nil)

    Response format specification passed through to the chat API (nil for plain text)

  • max_tool_call_depth (Integer) (defaults to: 5)

    Maximum recursion depth for tool-call handling

Returns:

  • (String, nil)

    The model’s response content



# File 'lib/ruby-spacy.rb', line 318

def openai_query(access_token: nil,
                 max_completion_tokens: nil,
                 max_tokens: nil,
                 temperature: 0.7,
                 model: "gpt-5-mini",
                 messages: [],
                 prompt: nil,
                 response_format: nil,
                 max_tool_call_depth: 5,
                 _tool_call_depth: 0)
  # Support both max_completion_tokens and max_tokens for backward compatibility
  max_completion_tokens ||= max_tokens || 1000
  if messages.empty?
    messages = [
      { role: "system", content: prompt },
      { role: "user", content: @text }
    ]
  end

  client = openai_client(access_token)

  # Tool definition for token analysis (GPT-5 tools API format)
  tools = nil
  tool_choice = nil
  if _tool_call_depth < max_tool_call_depth
    tools = [
      {
        type: "function",
        function: {
          name: "get_tokens",
          description: "Tokenize given text and return a list of tokens with their attributes: surface, lemma, tag, pos (part-of-speech), dep (dependency), ent_type (entity type), and morphology",
          parameters: {
            type: "object",
            properties: {
              text: {
                type: "string",
                description: "text to be tokenized"
              }
            },
            required: ["text"]
          }
        }
      }
    ]
    tool_choice = "auto"
  end

  response = client.chat(
    model: model,
    messages: messages,
    max_completion_tokens: max_completion_tokens,
    temperature: temperature,
    tools: tools,
    tool_choice: tool_choice,
    response_format: response_format
  )

  message = response.dig("choices", 0, "message")

  # Handle tool calls (GPT-5 format)
  if message["tool_calls"] && !message["tool_calls"].empty?
    messages << message

    message["tool_calls"].each do |tool_call|
      function_name = tool_call.dig("function", "name")
      tool_call_id = tool_call["id"]

      case function_name
      when "get_tokens"
        result = tokens.map do |t|
          {
            surface: t.text,
            lemma: t.lemma,
            pos: t.pos,
            tag: t.tag,
            dep: t.dep,
            ent_type: t.ent_type,
            morphology: t.morphology
          }
        end.to_json

        messages << {
          role: "tool",
          tool_call_id: tool_call_id,
          content: result
        }
      end
    end

    # Recursive call to get final response after tool execution
    openai_query(
      access_token: access_token,
      max_completion_tokens: max_completion_tokens,
      temperature: temperature,
      model: model,
      messages: messages,
      prompt: prompt,
      response_format: response_format,
      max_tool_call_depth: max_tool_call_depth,
      _tool_call_depth: _tool_call_depth + 1
    )
  else
    message["content"]
  end
rescue OpenAIClient::APIError => e
  puts "Error: OpenAI API call failed - #{e.message}"
  nil
end
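The tool-call loop above can be exercised without the OpenAI API by stubbing the client: the first chat call returns a tool call, the second returns the final answer. `FakeClient` and `run_query` are hypothetical and only illustrate the request/response shape the loop expects, not ruby-spacy's actual API:

```ruby
require "json"

# Stub client: first call requests the get_tokens tool, second call answers.
class FakeClient
  def initialize
    @calls = 0
  end

  def chat(**_kwargs)
    @calls += 1
    if @calls == 1
      { "choices" => [{ "message" => {
        "tool_calls" => [{ "id" => "call_1",
                           "function" => { "name" => "get_tokens",
                                           "arguments" => "{}" } }]
      } }] }
    else
      { "choices" => [{ "message" => { "content" => "done" } }] }
    end
  end
end

# Simplified version of the recursive loop: append the assistant message
# and one tool result per tool call, then recurse for the final answer.
def run_query(client, messages, depth: 0, max_depth: 5)
  message = client.chat(messages: messages).dig("choices", 0, "message")
  if message["tool_calls"] && !message["tool_calls"].empty? && depth < max_depth
    messages << message
    message["tool_calls"].each do |tc|
      messages << { role: "tool", tool_call_id: tc["id"],
                    content: [{ surface: "Hi" }].to_json }
    end
    run_query(client, messages, depth: depth + 1, max_depth: max_depth)
  else
    message["content"]
  end
end

messages = [{ role: "user", content: "Hi" }]
answer = run_query(FakeClient.new, messages)
# answer => "done"; messages now holds user, assistant, and tool entries
```

Note that, as documented for the `messages` parameter, the conversation array is mutated in place as tool calls are appended.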

#respond_to_missing?(sym, include_private = false) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/ruby-spacy.rb', line 489

def respond_to_missing?(sym, include_private = false)
  Spacy.py_hasattr?(@py_doc, sym) || super
end

#retokenize(start_index, end_index, attributes = {}) ⇒ Object

Retokenizes the text, merging a span into a single token.

Parameters:

  • start_index (Integer)

    the start position of the span to be retokenized in the document

  • end_index (Integer)

    the end position of the span to be retokenized in the document

  • attributes (Hash) (defaults to: {})

    attributes to set on the merged token



# File 'lib/ruby-spacy.rb', line 110

def retokenize(start_index, end_index, attributes = {})
  PyCall.with(@py_doc.retokenize) do |retokenizer|
    retokenizer.merge(@py_doc[start_index..end_index], attrs: attributes)
  end
end

#retokenize_split(pos_in_doc, split_array, head_pos_in_split, ancestor_pos, attributes = {}) ⇒ Object

Retokenizes the text, splitting the specified token.

Parameters:

  • pos_in_doc (Integer)

    the position of the span to be retokenized in the document

  • split_array (Array<String>)

    text strings of the split results

  • head_pos_in_split (Integer)

    the position of the head element within the split results

  • ancestor_pos (Integer)

    the position of the immediate ancestor element of the split elements in the document

  • attributes (Hash) (defaults to: {})

    the attributes of the split elements



# File 'lib/ruby-spacy.rb', line 121

def retokenize_split(pos_in_doc, split_array, head_pos_in_split, ancestor_pos, attributes = {})
  PyCall.with(@py_doc.retokenize) do |retokenizer|
    heads = [[@py_doc[pos_in_doc], head_pos_in_split], @py_doc[ancestor_pos]]
    retokenizer.split(@py_doc[pos_in_doc], split_array, heads: heads, attrs: attributes)
  end
end

#sents ⇒ Array<Span>

Returns an array of spans each representing a sentence.

Returns:

  • (Array<Span>)

    an array of spans each representing a sentence

# File 'lib/ruby-spacy.rb', line 185

def sents
  PyCall::List.call(@py_doc.sents).map do |py_sent|
    Span.new(self, start_index: py_sent.start, end_index: py_sent.end - 1)
  end
end

#similarity(other) ⇒ Float

Returns a semantic similarity estimate.

Parameters:

  • other (Doc)

the other doc against which similarity is estimated

Returns:

  • (Float)


# File 'lib/ruby-spacy.rb', line 213

def similarity(other)
  py_doc.similarity(other.py_doc)
end

#span(range_or_start, optional_size = nil) ⇒ Span

Returns a span of the specified range within the doc. The method can be called in either of two ways: `Doc#span(range)` or `Doc#span(start, size_of_span)`.

Parameters:

  • range_or_start (Range, Integer)

    a range object, or, alternatively, an integer that represents the start position of the span

  • optional_size (Integer) (defaults to: nil)

    an integer representing the size of the span

Returns:

  • (Span)

    the span covering the specified range

# File 'lib/ruby-spacy.rb', line 152

def span(range_or_start, optional_size = nil)
  doc_len = PyCall.len(@py_doc)

  if optional_size
    start_index = range_or_start
    start_index += doc_len if start_index < 0
    end_index = start_index + optional_size - 1
  else
    range = range_or_start
    start_index = range.first
    start_index += doc_len if start_index < 0
    end_val = range.end
    if end_val.nil?
      end_index = doc_len - 1
    else
      end_val += doc_len if end_val < 0
      end_index = range.exclude_end? ? end_val - 1 : end_val
    end
  end

  Span.new(self, start_index: start_index, end_index: end_index)
end
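The index normalization used above can be extracted and checked on its own: negative indices wrap around the document length, an endless range runs to the last token, and exclusive ranges drop their end point. The `normalize_span` helper is a standalone replica of that logic, not part of ruby-spacy:

```ruby
# Replicates Doc#span's index arithmetic, returning [start, end] indices.
def normalize_span(range_or_start, optional_size, doc_len)
  if optional_size
    start_index = range_or_start
    start_index += doc_len if start_index < 0
    [start_index, start_index + optional_size - 1]
  else
    range = range_or_start
    start_index = range.first
    start_index += doc_len if start_index < 0
    end_val = range.end
    if end_val.nil?
      [start_index, doc_len - 1]                          # endless range: 3..
    else
      end_val += doc_len if end_val < 0                   # negative end wraps
      [start_index, range.exclude_end? ? end_val - 1 : end_val]
    end
  end
end

normalize_span(1..3, nil, 10)   # => [1, 3]
normalize_span(1...4, nil, 10)  # => [1, 3]
normalize_span(3..-1, nil, 10)  # => [3, 9]
normalize_span(2, 3, 10)        # => [2, 4]
```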

#to_bytes ⇒ String

Serializes the doc to a binary string. The binary data includes all annotations (tokens, entities, etc.) and can be used to restore the doc later without re-processing.

Examples:

Save doc to file

doc = nlp.read("Hello world")
File.binwrite("doc.bin", doc.to_bytes)

Returns:

  • (String)

    binary representation of the doc



# File 'lib/ruby-spacy.rb', line 224

def to_bytes
  @py_doc.to_bytes.force_encoding(Encoding::BINARY)
end
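The `force_encoding` call matters because the returned string is raw data, not UTF-8 text: tagging it as BINARY (ASCII-8BIT) keeps byte-oriented operations and `File.binwrite` from treating the bytes as characters. A small illustration of the distinction (the sample bytes are arbitrary):

```ruby
# Multibyte content tagged as BINARY is handled byte-by-byte.
data = "\xE3\x81\x82raw".force_encoding(Encoding::BINARY)
data.encoding  # => #<Encoding:BINARY (ASCII-8BIT)>
data.bytesize  # => 6 (three bytes of the multibyte sequence + "raw")
```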

#to_s ⇒ String

String representation of the document.

Returns:

  • (String)


# File 'lib/ruby-spacy.rb', line 130

def to_s
  @text
end

#tokens ⇒ Array<Token>

Returns an array of tokens contained in the doc.

Returns:

  • (Array<Token>)

    an array of tokens contained in the doc

# File 'lib/ruby-spacy.rb', line 136

def tokens
  PyCall::List.call(@py_doc).map { |py_token| Token.new(py_token) }
end