Journal: Language Resources and Evaluation

Constructing two Vietnamese corpora and building a lexical database

Abstract

Corpus-based research has formed the backbone of linguistic research in recent decades. Large text corpora are used to address many kinds of linguistic problems, including those of quantitative linguistics, cognitive linguistics, and psycholinguistics. This paper reports the construction of two equally sized corpora of contemporary Vietnamese: subtlex-viet, a corpus of Vietnamese film subtitles, and genlex-viet, a general corpus drawn from a variety of online newspapers and stories. We document the general steps of constructing the corpora and extracting linguistic information from them, and provide a road map for others who would like to create similar corpora. The resulting corpora are available in three versions: plain text, tokenized, and POS tagged. The second half of the paper describes the construction of a lexical database derived from the corpora. The database includes measures such as frequency of occurrence, dispersion, Mutual Information, and Inverse Document Frequency, as well as vector space measures based on Latent Semantic Analysis and Hyperspace Analogue to Language. We conclude by reporting a comparison of the lexical predictors and a validation using psycholinguistic data from visual lexical decision experiments.
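
As a rough illustration of three of the measures named in the abstract, the sketch below computes frequency per million tokens, Inverse Document Frequency, and a pointwise form of Mutual Information on a toy tokenized corpus. The abstract does not give the paper's formulas, so everything here is an assumption: the toy documents, the function names, the natural-log IDF, and the choice of adjacent-token pairs for Mutual Information are all hypothetical and are not taken from subtlex-viet or genlex-viet.

```python
import math
from collections import Counter

# Toy, already-tokenized documents (hypothetical data, not from the corpora).
docs = [
    ["tôi", "xem", "phim"],
    ["tôi", "đọc", "báo"],
    ["phim", "hay"],
]


def frequency_per_million(docs):
    """Occurrences of each token, scaled to a corpus of one million tokens."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    return {w: c * 1_000_000 / total for w, c in counts.items()}


def inverse_document_frequency(docs):
    """IDF as log(N / df), where df is the number of documents containing the token."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return {w: math.log(n_docs / df) for w, df in doc_freq.items()}


def pointwise_mutual_information(docs):
    """PMI (base 2) for adjacent token pairs within each document.

    Assumption: using adjacent pairs is an illustrative choice only.
    """
    unigrams = Counter(tok for doc in docs for tok in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    pmi = {}
    for (a, b), c in bigrams.items():
        p_ab = c / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


freq = frequency_per_million(docs)
idf = inverse_document_frequency(docs)
pmi = pointwise_mutual_information(docs)
```

A real lexical database would likely apply smoothing and a minimum-count threshold before reporting Mutual Information, since estimates from rare pairs are unstable. Those choices are left out here to keep the sketch short.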