Journal: Sadhana: Academy Proceedings in Engineering Science

Word Sense Disambiguation in Bengali language using unsupervised methodology with modifications

Abstract

In this work, Word Sense Disambiguation (WSD) for the Bengali language is implemented using an unsupervised methodology. In the first phase of the experiment, sentences are clustered using the Maximum Entropy method, and the clusters are manually labelled with their innate senses so that these sense-tagged clusters can serve as sense inventories for further experiments. In the next phase, when a test instance is to be disambiguated, the Cosine Similarity Measure is used to find its closeness to each of the sense-tagged clusters. The test instance is assigned the sense of the cluster from which its distance is minimum. This is taken as the baseline strategy, which achieves 35% accuracy on the WSD task. Two extensions are then applied over this baseline: (a) Principal Component Analysis (PCA) over the feature vector, which achieves 52% accuracy on the WSD task, and (b) Context Expansion of the sentences using the Bengali WordNet coupled with PCA, which achieves 61% accuracy on the WSD task. The data sets used in this work are obtained from the Bengali corpus developed under the Technology Development for the Indian Languages (TDIL) project of the Government of India, and the lexical knowledge base used in the work (i.e., the Bengali WordNet) was developed at the Indian Statistical Institute, Kolkata, under the Indradhanush Project of the DeitY, Government of India. The challenges and pitfalls of this work are also described in detail in the pre-conclusion section.
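
The baseline assignment step described in the abstract (choose the sense-tagged cluster closest to the test sentence under cosine similarity) can be illustrated with a minimal sketch. This is not the authors' implementation. The bag-of-words features, the summed-count cluster representation, the function names, and the English placeholder tokens are all assumptions made for illustration; the abstract does not specify how sentences or clusters are represented.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def assign_sense(test_tokens, sense_clusters):
    """Return (sense, similarity) for the cluster closest to the test sentence.

    sense_clusters maps a manually assigned sense label to the list of
    tokenised sentences in that cluster. Each cluster is summarised by the
    sum of its sentences' term counts (an assumption, see the lead-in).
    Maximum cosine similarity is equivalent to minimum cosine distance.
    """
    test_vec = Counter(test_tokens)
    best_sense, best_sim = None, -1.0
    for sense, sentences in sense_clusters.items():
        cluster_vec = Counter()
        for sentence in sentences:
            cluster_vec.update(sentence)
        sim = cosine_similarity(test_vec, cluster_vec)
        if sim > best_sim:
            best_sense, best_sim = sense, sim
    return best_sense, best_sim


# Hypothetical toy data: English placeholders stand in for Bengali tokens.
clusters = {
    "bank/finance": [["money", "deposit", "bank"], ["bank", "loan", "interest"]],
    "bank/river": [["river", "bank", "water"], ["boat", "river", "bank"]],
}
sense, sim = assign_sense(["loan", "from", "bank", "interest"], clusters)
print(sense)
```

In this toy example the test sentence shares "loan", "bank", and "interest" with the finance cluster but only "bank" with the river cluster, so the finance sense is selected. The PCA and WordNet-based context expansion extensions would change how the vectors are built, not this final nearest-cluster step.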