Journal: Language Resources and Evaluation

Constructing two Vietnamese corpora and building a lexical database


Abstract

Corpus-based research has formed the backbone of linguistic research in recent decades. Large text corpora are used to address a wide range of problems in quantitative linguistics, cognitive linguistics, and psycholinguistics. This paper reports the construction of two equally sized corpora of contemporary Vietnamese: subtlex-viet, compiled from Vietnamese film subtitles, and genlex-viet, a general corpus drawn from a variety of online newspapers and stories. We document the general steps for constructing the corpora and extracting linguistic information from them, and provide a road map for others who would like to create similar corpora. The resulting corpora are available in three versions: plain text, tokenized, and POS-tagged. The second half of the paper describes the construction of a lexical database derived from the corpora. The database includes measures such as frequency of occurrence, dispersion, Mutual Information, and Inverse Document Frequency, as well as vector space measures based on Latent Semantic Analysis and Hyperspace Analogue to Language. We conclude by reporting a comparison of the lexical predictors and a validation using psycholinguistic data from visual lexical decision experiments.
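
The abstract names several lexical measures without giving their formulas. As a rough illustration only, the sketch below computes three of them (frequency per million, Inverse Document Frequency, and pointwise Mutual Information for adjacent word pairs) using common textbook definitions. The toy documents, the function names `lexical_measures` and `pmi`, and the exact formulas are assumptions made for this example; the paper may define these measures differently.

```python
import math
from collections import Counter

# Toy tokenized documents (hypothetical data, not drawn from either corpus).
docs = [
    ["tôi", "xem", "phim"],
    ["tôi", "đọc", "báo"],
    ["phim", "hay"],
    ["tôi", "xem", "báo"],
]

def lexical_measures(docs):
    """Per-word raw count, frequency per million tokens, and IDF.

    IDF here is the plain log(N / df) variant (an assumption).
    """
    n_docs = len(docs)
    n_tokens = sum(len(d) for d in docs)
    freq = Counter(tok for d in docs for tok in d)
    # Document frequency: number of documents containing the word at least once.
    doc_freq = Counter(tok for d in docs for tok in set(d))
    return {
        w: {
            "count": c,
            "per_million": c / n_tokens * 1_000_000,
            "idf": math.log(n_docs / doc_freq[w]),
        }
        for w, c in freq.items()
    }

def pmi(docs, w1, w2):
    """Pointwise mutual information (base 2) of the adjacent pair (w1, w2)."""
    unigrams = Counter(tok for d in docs for tok in d)
    # Bigrams are counted within documents only, never across boundaries.
    bigrams = Counter(pair for d in docs for pair in zip(d, d[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    if bigrams[(w1, w2)] == 0:
        return float("-inf")  # pair never observed
    p_xy = bigrams[(w1, w2)] / n_bi
    p_x = unigrams[w1] / n_uni
    p_y = unigrams[w2] / n_uni
    return math.log2(p_xy / (p_x * p_y))

measures = lexical_measures(docs)
```

Calling `measures["phim"]["idf"]` or `pmi(docs, "tôi", "xem")` returns the corresponding value for the toy data. On a real corpus the same counts would be taken over the tokenized version of the text, and dispersion and the vector space measures would need separate treatment.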
