Conference on Intelligent Text Processing and Computational Linguistics; CICLing 2014

Using Word Association Norms to Measure Corpus Representativeness


Abstract

An obvious way to measure how representative a corpus is of a person's language environment would be to observe this person over an extended period of time, record all written or spoken input, and compare this data to the corpus in question. As this is not very practical, we suggest a more indirect approach. Previous work suggests that people's word associations can be derived from corpus statistics. These word associations are known to some degree, since psychologists have collected them from test subjects in large-scale experiments. The output of these experiments is a set of tables of word associations, the so-called word association norms. In this paper we assume that the more representative a corpus is of the test subjects' language environment, the better the associations generated from it should match people's associations. That is, we compare the corpus-generated associations to the association norms collected from humans, and take the similarity between the two as a measure of corpus representativeness. To our knowledge, this is the first attempt to do so.
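The comparison described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say which association measure or similarity score the authors use, so everything here is an assumption made for illustration: raw windowed co-occurrence counts stand in for the corpus-derived association strength, and the score is the fraction of stimulus words whose top corpus associate equals the primary human response. The function names (`cooccurrence_counts`, `top_associate`, `representativeness`), the toy corpora, and the toy norms are all hypothetical and not taken from the paper.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(sentences, window=2):
    """Count how often each word co-occurs with others within `window` tokens.

    Assumption: raw co-occurrence counts stand in for association strength.
    """
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


def top_associate(counts, stimulus):
    """Return the most frequent co-occurring word, or None if the stimulus is unseen."""
    neighbours = counts.get(stimulus)
    if not neighbours:
        return None
    return neighbours.most_common(1)[0][0]


def representativeness(sentences, norms, window=2):
    """Fraction of stimuli whose corpus-derived top associate matches the norm.

    `norms` maps each stimulus word to its primary human response (toy format).
    """
    if not norms:
        return 0.0
    counts = cooccurrence_counts(sentences, window)
    hits = sum(1 for stimulus, response in norms.items()
               if top_associate(counts, stimulus) == response)
    return hits / len(norms)


# Toy data, invented purely for illustration.
norms = {"bread": "butter", "black": "white"}

corpus_a = [
    ["bread", "and", "butter"],
    ["butter", "on", "bread"],
    ["black", "and", "white"],
    ["black", "white", "photo"],
]

corpus_b = [
    ["bread", "butter", "jam"],
    ["bread", "butter", "toast"],
    ["black", "hole", "physics"],
    ["black", "hole", "mass"],
]

print(representativeness(corpus_a, norms))  # 1.0
print(representativeness(corpus_b, norms))  # 0.5
```

Under these assumptions, `corpus_a` reproduces both human associations while `corpus_b` reproduces only one, so `corpus_a` would be judged the more representative of the two for the people who supplied the norms. A realistic setup would likely replace raw counts with a statistical association measure and compare full ranked response lists rather than only the top response.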
