Class: Unisec::Blocks

Inherits:
Object
  • Object
Defined in:
lib/unisec/blocks.rb

Overview

Operations about Unicode blocks

Constant Summary

UCD_BLOCKS =

UCD Blocks file location

File.join(__dir__, '../../data/Blocks.txt')
INVALID_RANGES =

List of invalid, private, and reserved ranges. Unassigned (unallocated) ranges are computed dynamically by list_unassigned.

[
  { range: 0xd800..0xdfff, name: 'Surrogates (invalid outside UTF-16)' },
  { range: 0xe000..0xf8ff, name: 'Private Use Area (located in BMP)' },
  { range: 0xf0000..0xfffff, name: 'Supplementary Private Use Area-A' },
  { range: 0x100000..0x10ffff, name: 'Supplementary Private Use Area-B' }
].freeze
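
As a minimal sketch of how such a table could be consulted, the snippet below checks whether a code point falls inside one of the ranges listed above. The constant is a local copy so the example runs without the gem, and `invalid_range_name` is a hypothetical helper, not part of Unisec.

```ruby
# Local copy of the ranges documented above, so the sketch is self-contained.
INVALID_RANGES_SKETCH = [
  { range: 0xd800..0xdfff, name: 'Surrogates (invalid outside UTF-16)' },
  { range: 0xe000..0xf8ff, name: 'Private Use Area (located in BMP)' },
  { range: 0xf0000..0xfffff, name: 'Supplementary Private Use Area-A' },
  { range: 0x100000..0x10ffff, name: 'Supplementary Private Use Area-B' }
].freeze

# Hypothetical helper (not a Unisec method): returns the name of the special
# range containing the code point, or nil when it belongs to none of them.
def invalid_range_name(code_point)
  entry = INVALID_RANGES_SKETCH.find { |r| r[:range].cover?(code_point) }
  entry && entry[:name]
end

puts invalid_range_name(0xe000).inspect
puts invalid_range_name(0x41).inspect
```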

Class Method Details

.block(block_arg, with_count: false) ⇒ Hash|nil

Find the block including the target character or code point, or matching the provided name.

Examples:

Unisec::Blocks.block(65, with_count:true) # => {range: 0..127, name: "Basic Latin", range_size: 128, char_count: 95}
Unisec::Blocks.block("U+1f4a9") # => {range: 127744..128511, name: "Miscellaneous Symbols and Pictographs", range_size: nil, char_count: nil}
Unisec::Blocks.block("", with_count:true) # => {range: 8192..8303, name: "General Punctuation", range_size: 112, char_count: 111}
Unisec::Blocks.block("javanese") # => {range: 43392..43487, name: "Javanese", range_size: nil, char_count: nil}

Parameters:

  • block_arg (Integer|String)

    Decimal code point, standardized hexadecimal code point (e.g. "U+1f4a9"), single-character string (exactly one code point, so be careful with emojis and with composed or joined characters that span several code points), or block name (case insensitive).

  • with_count (TrueClass|FalseClass) (defaults to: false)

    calculate block's range size & char count?

Returns:

  • (Hash|nil)

    Matching block (block name, range and count) or nil if not found
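
The single-character caveat can be illustrated with plain Ruby, independently of Unisec: `String#size` counts code points, while `String#grapheme_clusters` groups them into user-perceived characters. The sample strings below are illustrative choices, not taken from the library.

```ruby
# Each sample looks like "one character" on screen, but only the first
# is a single code point.
samples = {
  'A' => 'single code point',
  "e\u0301" => 'e followed by a combining acute accent',
  "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}" => 'family emoji joined with ZWJ'
}

samples.each do |str, label|
  puts "#{label}: code points=#{str.size}, graphemes=#{str.grapheme_clusters.size}"
end
```

Only strings whose `size` is 1 take the single-character branch in the method source, so the last two samples would instead be compared against block names.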



# File 'lib/unisec/blocks.rb', line 99

def self.block(block_arg, with_count: false) # rubocop:disable Metrics/AbcSize,Metrics/CyclomaticComplexity,Metrics/MethodLength,Metrics/PerceivedComplexity
  file = File.new(UCD_BLOCKS)
  found = false
  file.each_line(chomp: true) do |line|
    # Skip if the line is empty or a comment
    next if line.empty? || line[0] == '#'

    # parse the line to extract code point range and the name
    blk_range, blk_name = line.split(';')
    blk_range = Unisec::Utils::String.to_range(blk_range)
    blk_name.lstrip!
    case block_arg
    when Integer # block_arg is an integer code point
      found = true if blk_range.include?(block_arg)
    when String # can be a char or block name or a string code point
      if block_arg.size == 1 # is a char (1 code point, not one grapheme)
        found = true if blk_range.include?(Utils::String.convert_to_integer(block_arg))
      elsif block_arg.start_with?('U+') # string code point
        found = true if blk_range.include?(Utils::String.convert(block_arg, :integer))
      elsif blk_name.downcase == block_arg.downcase # block name
        found = true
      end
    end
    if found
      return {
        range: blk_range,
        name: blk_name,
        range_size: with_count ? blk_range.size : nil,
        char_count: with_count ? count_char_in_block(blk_range) : nil
      }
    end
  end
  nil # not found
end
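
The argument dispatch above can be sketched in isolation. `to_code_point` is a hypothetical helper, and it assumes that the Unisec conversion utilities behave like `String#ord` and a hexadecimal parse of the text after `U+`. That assumption is not verified against the library.

```ruby
# Hypothetical helper (not a Unisec method): normalizes a block_arg to an
# integer code point, or returns nil when the argument should be treated
# as a block name.
def to_code_point(block_arg)
  case block_arg
  when Integer
    block_arg # already a code point
  when String
    if block_arg.size == 1
      block_arg.ord # single code point character
    elsif block_arg.start_with?('U+')
      block_arg.delete_prefix('U+').to_i(16) # "U+1f4a9" style notation
    end # anything else falls through to nil (block name)
  end
end

puts to_code_point(65).inspect
puts to_code_point('A').inspect
puts to_code_point('U+1f4a9').inspect
puts to_code_point('javanese').inspect
```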

.block_display(block_arg, with_count: false) ⇒ Object

Display a CLI-friendly output detailing the searched block

Parameters:

  • block_arg (Integer|String)

    Decimal code point, standardized hexadecimal code point (e.g. "U+1f4a9"), single-character string (exactly one code point, so be careful with emojis and with composed or joined characters that span several code points), or block name (case insensitive).

  • with_count (TrueClass|FalseClass) (defaults to: false)

    calculate block's range size & char count?



# File 'lib/unisec/blocks.rb', line 176

def self.block_display(block_arg, with_count: false)
  blk = block(block_arg, with_count: with_count)
  if blk.nil?
    puts "no block found with #{block_arg}"
  else
    display = ->(key, value) { puts Paint[key, :red, :bold] + " #{value}" }
    display.call('Range:', Utils::Range.range2codepoint_range(blk[:range]))
    display.call('Name:', blk[:name])
    if with_count
      display.call('Range size:', blk[:range_size])
      display.call('Char count:', blk[:char_count])
    end
  end
  nil
end

.count_char_in_block(range) ⇒ Integer

Count the number of characters allocated in a block. ⚠️ Char count value may be wrong for CJK UNIFIED IDEOGRAPH because they are poorly described in DerivedName.txt.

Examples:

Unisec::Blocks::count_char_in_block(0xAC00..0xD7AF) # => 11172

Parameters:

  • range (Range)

    Block code point range

Returns:

  • (Integer)

    number of code points in the block



# File 'lib/unisec/blocks.rb', line 65

def self.count_char_in_block(range) # rubocop:disable Metrics/AbcSize
  counter = 0
  file = File.new(Rugrep::UCD_DERIVEDNAME)
  file.each_line(chomp: true) do |line|
    # Skip if the line is empty or a comment
    next if line.empty? || line[0] == '#'

    # parse the line to extract code point as integer and the name
    cp_int, _name = line.split(';')
    if cp_int.include?('..') # handle ranges in DerivedName.txt
      ucd_range = Utils::String.to_range(cp_int)
      next unless range.include_range?(ucd_range)

      counter += ucd_range.size
      next
    end
    cp_int = cp_int.chomp.to_i(16)
    next unless range.include?(cp_int)

    counter += 1
    break if cp_int == range.end
  end
  counter
end
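
The counting logic can be tried on a small inline sample instead of the real data file. This is a sketch under assumptions: the sample lines only imitate the `code point ; name` layout parsed above, `count_in_sample` is a hypothetical name, and core `Range#cover?` with a range argument (Ruby 2.6+) stands in for the `include_range?` call used in the source.

```ruby
# Illustrative excerpt in the "code point(s) ; name" layout (not the real file).
DERIVED_NAME_SAMPLE = <<~UCD
  # DerivedName sample

  0041          ; LATIN CAPITAL LETTER A
  0042          ; LATIN CAPITAL LETTER B
  4E00..9FFF    ; CJK UNIFIED IDEOGRAPH-*
UCD

# Hypothetical helper: counts the sample entries that fall inside `range`.
def count_in_sample(range, data = DERIVED_NAME_SAMPLE)
  data.each_line(chomp: true).sum do |line|
    # Skip empty lines and comments
    next 0 if line.empty? || line.start_with?('#')

    field = line.split(';').first.strip
    if field.include?('..') # range entry: counted only if fully inside
      first, last = field.split('..').map { |hex| hex.to_i(16) }
      range.cover?(first..last) ? (last - first + 1) : 0
    else # single code point entry
      range.cover?(field.to_i(16)) ? 1 : 0
    end
  end
end

puts count_in_sample(0x0000..0x007f)
puts count_in_sample(0x4e00..0x9fff)
puts count_in_sample(0x4e00..0x4fff)
```

In this sketch, a range entry that is only partly inside the requested range contributes nothing, which is one way a count can differ from the number of code points actually present.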

.list(with_count: false) ⇒ Array<Hash>

List Unicode blocks. ⚠️ Char count value may be wrong for CJK UNIFIED IDEOGRAPH because they are poorly described in DerivedName.txt. ⚠️ Populating char_count is slow and can take a few seconds.

Examples:

Unisec::Blocks.list # => [{range: 0..127, name: "Basic Latin", range_size: nil, char_count: nil}, … ]
Unisec::Blocks.list(with_count: true) # => [{range: 0..127, name: "Basic Latin", range_size: 128, char_count: 95}, … ]

Parameters:

  • with_count (TrueClass|FalseClass) (defaults to: false)

    calculate block's range size & char count?

Returns:

  • (Array<Hash>)

    List of blocks (block name, range and count)



# File 'lib/unisec/blocks.rb', line 38

def self.list(with_count: false)
  out = []
  file = File.new(UCD_BLOCKS)
  file.each_line(chomp: true) do |line|
    # Skip if the line is empty or a comment
    next if line.empty? || line[0] == '#'

    # parse the line to extract code point range and the name
    blk_range, blk_name = line.split(';')
    blk_range = Unisec::Utils::String.to_range(blk_range)
    blk_name.lstrip!
    out << {
      range: blk_range,
      name: blk_name,
      range_size: with_count ? blk_range.size : nil,
      char_count: with_count ? count_char_in_block(blk_range) : nil
    }
  end
  out
end
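
The parsing step can be exercised without the data file. The sketch below feeds a few lines in the `start..end; name` layout to a hypothetical `parse_blocks` helper. The inline sample is an illustrative excerpt, and the hexadecimal parsing stands in for `Unisec::Utils::String.to_range`, whose exact behavior is assumed here.

```ruby
# Illustrative excerpt in the "start..end; name" layout.
BLOCKS_SAMPLE = <<~UCD
  # Blocks sample

  0000..007F; Basic Latin
  0080..00FF; Latin-1 Supplement
  A980..A9DF; Javanese
UCD

# Hypothetical helper: turns the sample text into an array of hashes with
# the same :range and :name keys as the documented return value.
def parse_blocks(data = BLOCKS_SAMPLE)
  lines = data.each_line(chomp: true).reject { |l| l.empty? || l.start_with?('#') }
  lines.map do |line|
    range_field, name = line.split(';', 2)
    first, last = range_field.split('..').map { |hex| hex.to_i(16) }
    { range: first..last, name: name.strip }
  end
end

parse_blocks.each { |blk| puts "#{blk[:range]} #{blk[:name]}" }
```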

.list_display(with_count: false) ⇒ Object

Display a CLI-friendly output listing all blocks

Parameters:

  • with_count (TrueClass|FalseClass) (defaults to: false)

    calculate block's range size & char count?



# File 'lib/unisec/blocks.rb', line 158

def self.list_display(with_count: false) # rubocop:disable Metrics/AbcSize
  blocks = list(with_count: with_count)
  display = ->(key, value, just) { print Paint[key, :red, :bold] + " #{value}".ljust(just) }
  blocks.each do |blk|
    display.call('Range:', Utils::Range.range2codepoint_range(blk[:range]), 22)
    display.call('Name:', blk[:name], 50)
    if with_count
      display.call('Range size:', blk[:range_size], 8)
      display.call('Char count:', blk[:char_count], 0)
    end
    puts
  end
  nil
end

.list_invalid_display ⇒ Object

Display a CLI-friendly output listing all invalid and unassigned ranges.



# File 'lib/unisec/blocks.rb', line 193

def self.list_invalid_display # rubocop:disable Metrics/AbcSize
  display = ->(key, value, just) { print Paint[key, :red, :bold] + " #{value}".ljust(just) }
  puts '(Assigned) invalid, private, reserved ranges:'
  INVALID_RANGES.each do |blk|
    display.call('Range:', Utils::Range.range2codepoint_range(blk[:range]), 22)
    display.call('Name:', blk[:name], 50)
    puts
  end
  puts "\nUnassigned, unallocated ranges:"
  list_unassigned.each do |blk|
    display.call('Range:', Utils::Range.range2codepoint_range(blk), 22)
    puts
  end
  nil
end

.list_unassigned ⇒ Array<Range>

List unassigned, unallocated ranges.

Examples:

Unisec::Blocks.list_unassigned # => [12256..12271, 66048..66175, …]

Returns:

  • (Array<Range>)

    List of unassigned (code-point) ranges



# File 'lib/unisec/blocks.rb', line 138

def self.list_unassigned # rubocop:disable Metrics/AbcSize
  base = (0x0000..0x10ffff)
  assigned = Unisec::Blocks.list.map { |b| b[:range] }

  unassigned = []
  cursor = base.begin

  assigned.each do |r|
    unassigned << (cursor..(r.begin - 1)) if cursor < r.begin
    cursor = r.end + 1
    break if cursor > base.end
  end

  unassigned << (cursor..base.end) if cursor <= base.end

  unassigned
end
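
The gap computation is easier to follow on a small example. The sketch below applies the same cursor walk to a toy base range. `gaps` is a hypothetical name, and the walk assumes the assigned ranges are sorted in ascending order and do not overlap.

```ruby
# Hypothetical helper: returns the sub-ranges of `base` not covered by
# `assigned` (assumed sorted ascending and non-overlapping).
def gaps(base, assigned)
  result = []
  cursor = base.begin # first code point not yet accounted for

  assigned.each do |r|
    # Anything between the cursor and the start of this range is a gap
    result << (cursor..(r.begin - 1)) if cursor < r.begin
    cursor = r.end + 1
    break if cursor > base.end
  end

  # Trailing gap after the last assigned range
  result << (cursor..base.end) if cursor <= base.end
  result
end

p gaps(0..20, [0..4, 5..9, 12..15]) # => [10..11, 16..20]
```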

.ucd_blocks_version ⇒ String

Returns the version of Unicode used in the local UCD file (data/Blocks.txt)

Examples:

Unisec::Blocks.ucd_blocks_version # => "17.0.0"

Returns:

  • (String)

    Unicode version



# File 'lib/unisec/blocks.rb', line 25

def self.ucd_blocks_version
  first_line = File.open(UCD_BLOCKS, &:readline)
  first_line.match(/-(\d+\.\d+\.\d+)\.txt/).captures.first
end
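
The version extraction can be tried on a literal header line. The sketch assumes the first line of the file looks like `# Blocks-17.0.0.txt`, which is consistent with the regular expression above, and `version_from_header` is a hypothetical helper. It uses `String#[]` with a capture index, so a non-matching line yields nil instead of raising.

```ruby
# Hypothetical helper: extracts "X.Y.Z" from a header line such as
# "# Blocks-17.0.0.txt", or returns nil when the pattern is absent.
def version_from_header(line)
  line[/-(\d+\.\d+\.\d+)\.txt/, 1]
end

puts version_from_header('# Blocks-17.0.0.txt').inspect # => "17.0.0"
puts version_from_header('# no version here').inspect   # => nil
```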