
Word Clustering and Disambiguation Based on Co-occurrence Data


Abstract

We address the problem of clustering words (or constructing a thesaurus) based on co-occurrence data, and using the acquired word classes to improve the accuracy of syntactic disambiguation. We view this problem as that of estimating a joint probability distribution specifying the joint probabilities of word pairs, such as noun-verb pairs. We propose an efficient algorithm based on the Minimum Description Length (MDL) principle for estimating such a probability distribution. Our method is a natural extension of those proposed in (Brown et al., 1992) and (Li and Abe, 1996), and overcomes their drawbacks while retaining their advantages. We then combined this clustering method with the disambiguation method of (Li and Abe, 1995) to derive a disambiguation method that makes use of both automatically constructed thesauruses and a hand-made thesaurus. The overall disambiguation accuracy achieved by our method is 85.2%, which compares favorably against the accuracy (82.4%) obtained by the state-of-the-art disambiguation method of (Brill and Resnik, 1994).
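The MDL idea in the abstract can be illustrated with a small sketch. The abstract does not spell out the model or the coding scheme, so everything below is an assumption made for illustration: a hard class model P(n, v) = P(Cn, Cv) P(n | Cn) P(v | Cv) fitted by maximum likelihood, and a conventional two-part code that charges (k / 2) log2 N bits for k free parameters and N observed pairs. The cost of encoding the partition itself is left out. The function name `description_length` and the toy data are hypothetical, and this is not the authors' algorithm.

```python
import math
from collections import Counter


def description_length(pairs, noun_class, verb_class):
    """Two-part code length in bits of noun-verb pairs under a hard class model.

    Assumed model (illustrative, not taken from the abstract):
        P(n, v) = P(Cn, Cv) * P(n | Cn) * P(v | Cv)
    with every factor estimated by relative frequency.
    """
    n_obs = len(pairs)
    class_pair = Counter((noun_class[n], verb_class[v]) for n, v in pairs)
    noun_freq = Counter(n for n, _ in pairs)
    verb_freq = Counter(v for _, v in pairs)
    noun_class_freq = Counter(noun_class[n] for n, _ in pairs)
    verb_class_freq = Counter(verb_class[v] for _, v in pairs)

    # Data cost: negative log-likelihood of the observed pairs.
    data_bits = 0.0
    for n, v in pairs:
        cn, cv = noun_class[n], verb_class[v]
        p = (class_pair[(cn, cv)] / n_obs
             * noun_freq[n] / noun_class_freq[cn]
             * verb_freq[v] / verb_class_freq[cv])
        data_bits -= math.log2(p)

    # Parameter cost: (k / 2) * log2(N), an assumed standard MDL penalty.
    k_n = len(set(noun_class.values()))
    k_v = len(set(verb_class.values()))
    free_params = ((k_n * k_v - 1)
                   + (len(noun_class) - k_n)
                   + (len(verb_class) - k_v))
    param_bits = 0.5 * free_params * math.log2(n_obs)
    return data_bits + param_bits


# Hypothetical toy co-occurrence data: 20 noun-verb pairs.
pairs = [("wine", "drink"), ("beer", "drink"),
         ("bread", "eat"), ("rice", "eat")] * 5
verb_class = {"drink": "V1", "eat": "V2"}

# Candidate 1: every noun in its own class.
singletons = {"wine": "N1", "beer": "N2", "bread": "N3", "rice": "N4"}
# Candidate 2: nouns that share a verb are merged.
merged = {"wine": "N1", "beer": "N1", "bread": "N2", "rice": "N2"}

dl_singletons = description_length(pairs, singletons, verb_class)
dl_merged = description_length(pairs, merged, verb_class)
```

On this toy data both partitions fit the observations equally well, but the merged partition needs fewer parameters, so its total description length is shorter. An MDL-based clustering procedure would prefer it for that reason.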
