Proceedings of the Fifth Workshop on Very Large Corpora

Using word frequency lists to measure corpus homogeneity and similarity between corpora

Abstract

How similar are two corpora? A measure of corpus similarity would be very useful for lexicography and language engineering. Word frequency lists are cheap and easy to generate, so a measure based on them would be of use as a quick guide in many circumstances; for example, to judge how a newly available corpus relates to existing resources, or how easy it might be to port an NLP system designed to work with one text type to work with another. We show that corpus similarity can only be interpreted in the light of corpus homogeneity. The paper presents a measure, based on the χ² statistic, for measuring both corpus similarity and corpus homogeneity. The measure is compared with a rank-based measure and shown to outperform it. Some results are presented. A method for evaluating the accuracy of the measure is introduced and some results of using the measure are presented.
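
The abstract does not spell out how the measure is computed, so the following is only a minimal sketch of one way a χ²-based comparison of word frequency lists might look. The function names, the `top_n` cutoff, and the alternating-document split used for homogeneity are all assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter


def chi_square_distance(tokens_a, tokens_b, top_n=500):
    """Chi-square statistic over the top_n most frequent words of the joint corpus.

    Lower values suggest more similar word frequency profiles.
    top_n is an illustrative assumption, not a value from the paper.
    """
    size_a, size_b = len(tokens_a), len(tokens_b)
    if size_a == 0 or size_b == 0:
        raise ValueError("both corpora must be non-empty")

    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    joint = freq_a + freq_b
    total = size_a + size_b

    chi2 = 0.0
    for word, joint_count in joint.most_common(top_n):
        # Expected counts if both corpora were samples of one population.
        expected_a = joint_count * size_a / total
        expected_b = joint_count * size_b / total
        chi2 += (freq_a[word] - expected_a) ** 2 / expected_a
        chi2 += (freq_b[word] - expected_b) ** 2 / expected_b
    return chi2


def homogeneity(documents, top_n=500):
    """Within-corpus score: compare two halves of the same corpus.

    documents is a list of token lists. Splitting by alternating
    documents is an assumption chosen for simplicity.
    """
    half_a = [tok for doc in documents[0::2] for tok in doc]
    half_b = [tok for doc in documents[1::2] for tok in doc]
    return chi_square_distance(half_a, half_b, top_n)
```

Under these assumptions, a between-corpus score is only meaningful relative to the within-corpus scores of each corpus: a large value between two corpora says little if each corpus is itself very heterogeneous, which is the point the abstract makes about interpreting similarity in the light of homogeneity.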
