Class: Datasets::WikipediaKyotoJapaneseEnglish

Inherits:
Dataset
  • Object
Includes:
TarGzReadable
Defined in:
lib/datasets/wikipedia-kyoto-japanese-english.rb

Defined Under Namespace

Classes: Article, ArticleListener, Entry, Paragraph, Section, Sentence, Title

Instance Attribute Summary

Attributes inherited from Dataset

#metadata

Instance Method Summary

Methods included from TarGzReadable

#open_tar_gz

Methods inherited from Dataset

#clear_cache!, #to_table

Constructor Details

#initialize(type: :article) ⇒ WikipediaKyotoJapaneseEnglish

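Returns a new instance of WikipediaKyotoJapaneseEnglish. The type keyword selects which records #each yields and must be :article (the default) or :lexicon; any other value raises ArgumentError.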



# File 'lib/datasets/wikipedia-kyoto-japanese-english.rb', line 55

def initialize(type: :article)
  unless [:article, :lexicon].include?(type)
    raise ArgumentError, "Please set type :article or :lexicon: #{type.inspect}"
  end

  super()
  @type = type
  @metadata.id = "wikipedia-kyoto-japanese-english"
  @metadata.name =
    "The Japanese-English Bilingual Corpus of Wikipedia's Kyoto Articles"
  @metadata.url = "https://alaginrc.nict.go.jp/WikiCorpus/index_E.html"
  @metadata.licenses = ["CC-BY-SA-3.0"]
  @metadata.description = <<-DESCRIPTION
"The Japanese-English Bilingual Corpus of Wikipedia's Kyoto Articles"
aims mainly at supporting research and development relevant to
high-performance multilingual machine translation, information
extraction, and other language processing technologies. The National
Institute of Information and Communications Technology (NICT) has
created this corpus by manually translating Japanese Wikipedia
articles (related to Kyoto) into English.
  DESCRIPTION
end
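
The keyword validation in the constructor can be tried in isolation. The sketch below is a minimal stand-in, not the library class: CorpusTypeSketch and VALID_TYPES are hypothetical names introduced for illustration, and only the guard clause and the error message are copied from the listing above.

```ruby
# Hypothetical stand-in class; it only mirrors the guard clause of
# WikipediaKyotoJapaneseEnglish#initialize and downloads nothing.
class CorpusTypeSketch
  VALID_TYPES = [:article, :lexicon].freeze

  attr_reader :type

  def initialize(type: :article)
    # Reject anything other than the two supported record types.
    unless VALID_TYPES.include?(type)
      raise ArgumentError, "Please set type :article or :lexicon: #{type.inspect}"
    end
    @type = type
  end
end

# Omitting the keyword falls back to :article.
default_type = CorpusTypeSketch.new.type
lexicon_type = CorpusTypeSketch.new(type: :lexicon).type

# An unsupported type raises ArgumentError; capture the message.
error_message =
  begin
    CorpusTypeSketch.new(type: :sentence)
    nil
  rescue ArgumentError => e
    e.message
  end

puts default_type.inspect
puts lexicon_type.inspect
puts error_message
```

With the real class, the equivalent calls would presumably be Datasets::WikipediaKyotoJapaneseEnglish.new and Datasets::WikipediaKyotoJapaneseEnglish.new(type: :lexicon), assuming the library has been loaded.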

Instance Method Details

#each(&block) ⇒ Object

Iterates over the records of the selected type and returns an Enumerator when no block is given. With type :article, every .xml file in the archive is streamed through an ArticleListener that receives the block. With type :lexicon, one Entry is yielded per row of kyoto_lexicon.csv, skipping the header row.



# File 'lib/datasets/wikipedia-kyoto-japanese-english.rb', line 78

def each(&block)
  return to_enum(__method__) unless block_given?

  data_path = download_tar_gz

  open_tar_gz(data_path) do |tar|
    tar.each do |entry|
      next unless entry.file?
      base_name = File.basename(entry.full_name)
      case @type
      when :article
        next unless base_name.end_with?(".xml")
        listener = ArticleListener.new(block)
        parser = REXML::Parsers::StreamParser.new(entry.read, listener)
        parser.parse
      when :lexicon
        next unless base_name == "kyoto_lexicon.csv"
        is_header = true
        CSV.parse(entry.read.force_encoding("UTF-8")) do |row|
          if is_header
            is_header = false
            next
          end
          yield(Entry.new(row[0], row[1]))
        end
      end
    end
  end
end
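
The :lexicon branch and the Enumerator fallback of #each can be exercised without downloading the archive. The following is a sketch under stated assumptions: LexiconSketch, LexiconEntry, and SAMPLE_CSV are hypothetical stand-ins, the field names japanese and english are assumed rather than taken from the library's Entry class, and the sample rows are made up for illustration.

```ruby
require "csv"

# Hypothetical stand-in for the library's Entry class; field names are assumed.
LexiconEntry = Struct.new(:japanese, :english)

# Hypothetical reader that mimics the :lexicon branch of #each on an
# in-memory string instead of a tar.gz entry.
class LexiconSketch
  include Enumerable

  def initialize(csv_text)
    @csv_text = csv_text
  end

  def each
    # Same idiom as the listing: without a block, hand back an Enumerator.
    return to_enum(__method__) unless block_given?

    is_header = true
    CSV.parse(@csv_text) do |row|
      # The first row holds column names, so it is skipped.
      if is_header
        is_header = false
        next
      end
      yield(LexiconEntry.new(row[0], row[1]))
    end
  end
end

# Made-up sample data: a header row followed by two entries.
SAMPLE_CSV = <<~CSV
  日本語,英語
  京都,Kyoto
  金閣寺,Kinkaku-ji Temple
CSV

sketch = LexiconSketch.new(SAMPLE_CSV)
entries = sketch.each.to_a
entries.each { |entry| puts "#{entry.japanese} => #{entry.english}" }
```

Because the sketch returns an Enumerator when called without a block, callers can chain lazily (for example sketch.each.first(1)) instead of materializing every row; the real #each offers the same option through to_enum(__method__).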