Word embedding dataset from 'NINJAL Web Japanese Corpus'

Asahara Masayuki


Abstract

In this paper, we present NWJC2Vec, a word embedding dataset constructed from the 'NINJAL Web Japanese Corpus (NWJC)'. NWJC is a Web-crawled text corpus that contains 25.8 billion tokens. We construct two versions of the dataset: one based on surface forms, and the other based on the complete morpheme information provided by UniDic, a lexicon for the Japanese morphological analyser MeCab. We evaluate the dataset by comparing it with the 'Word List by Semantic Principles (Bunrui Goihyo)'.
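The abstract does not spell out how the comparison with the thesaurus is carried out. As a rough illustration only, one plausible way to compare embeddings against thesaurus categories is to check whether words sharing a category have higher cosine similarity than words that do not. In the sketch below, the vectors, the words, and the category labels are all made-up stand-ins (they are not taken from NWJC2Vec or from Bunrui Goihyo), and the function names are hypothetical.

```python
import math
from itertools import combinations

# Hypothetical toy vectors standing in for entries of an embedding dataset.
embeddings = {
    "犬": [0.9, 0.1, 0.0],
    "猫": [0.8, 0.2, 0.1],
    "走る": [0.1, 0.9, 0.2],
    "歩く": [0.2, 0.8, 0.1],
}

# Hypothetical category labels standing in for thesaurus classes.
categories = {"犬": "animal", "猫": "animal", "走る": "motion", "歩く": "motion"}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean(values):
    """Arithmetic mean, or NaN for an empty list."""
    return sum(values) / len(values) if values else float("nan")


def mean_similarity_by_category(embeddings, categories):
    """Return (mean similarity of same-category pairs, mean of other pairs)."""
    same, different = [], []
    for w1, w2 in combinations(sorted(embeddings), 2):
        sim = cosine(embeddings[w1], embeddings[w2])
        if categories[w1] == categories[w2]:
            same.append(sim)
        else:
            different.append(sim)
    return mean(same), mean(different)


same_mean, diff_mean = mean_similarity_by_category(embeddings, categories)
print(f"same-category mean: {same_mean:.3f}")
print(f"cross-category mean: {diff_mean:.3f}")
```

With real data, the toy dictionary would be replaced by vectors loaded from the distributed files, and the labels by category codes from the thesaurus; a larger gap between the two means would suggest that the embedding space agrees more closely with the thesaurus grouping.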


Bibliographic details

Source: Terminology, 2018, No. 1, pp. 7-22 (16 pages)
Author: Asahara Masayuki
Affiliation: Natl Inst Japanese Language & Linguist, 10-2 Midori Cho, Tachikawa, Tokyo 1908561, Japan
Original format: PDF
Language: English
Keywords: word embedding; web corpus; thesaurus; Japanese language

