Journal: ACM Transactions on Asian Language Information Processing

Keyword Extraction from Arabic Documents using Term Equivalence Classes


Abstract

The rapid growth of the Internet and other computing facilities in recent years has resulted in the creation of a large amount of text in electronic form, which has increased the interest in and importance of automatic text processing applications, including keyword extraction and term indexing. Although keywords are very useful for many applications, most documents available online are not provided with keywords. We describe a method for extracting keywords from Arabic documents. This method identifies the keywords by combining linguistic and statistical analysis of the text without using prior knowledge from its domain or information from any related corpus. The text is preprocessed to extract the main linguistic information, such as the roots and morphological patterns of derivative words. A cleaning phase is then applied to eliminate the meaningless words from the text. The most frequent terms are clustered into equivalence classes in which the derivative words generated from the same root and the non-derivative words generated from the same stem are placed together, and their counts are accumulated. A vector space model is then used to capture the most frequent N-grams in the text. Experiments carried out using a real-world dataset show that the proposed method achieves an average precision of 31% and an average recall of 53% when tested against manually assigned keywords.
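The equivalence-class step and the precision/recall evaluation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the lookup table `TOY_ANALYSIS`, the stop-word set, the Latin transliterations, and the function names are all hypothetical stand-ins for the output of a real Arabic morphological analyzer, and exact string matching against the manual keywords is an assumption about how the scores might be computed.

```python
from collections import Counter, defaultdict

# Hypothetical analyzer output (transliterated): word -> (kind, key).
# "root" keys group derivative words; "stem" keys group non-derivative words.
TOY_ANALYSIS = {
    "kitab": ("root", "ktb"),       # book
    "katib": ("root", "ktb"),       # writer
    "maktaba": ("root", "ktb"),     # library
    "dars": ("root", "drs"),        # lesson
    "madrasa": ("root", "drs"),     # school
    "al-internet": ("stem", "internet"),  # non-derivative loanword
}
STOP_WORDS = {"fi", "min", "ala"}   # illustrative "meaningless" words


def equivalence_classes(tokens, analysis=TOY_ANALYSIS, stop_words=STOP_WORDS):
    """Group tokens by shared root or stem and accumulate their counts."""
    counts = Counter()
    members = defaultdict(set)
    for tok in tokens:
        if tok in stop_words:       # cleaning phase: drop stop words
            continue
        # Assumption: words unknown to the analyzer form their own stem class.
        key = analysis.get(tok, ("stem", tok))
        counts[key] += 1
        members[key].add(tok)
    return counts, members


def precision_recall(extracted, manual):
    """Score extracted keywords against manual ones by exact match."""
    extracted, manual = set(extracted), set(manual)
    hits = len(extracted & manual)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(manual) if manual else 0.0
    return precision, recall


tokens = ["kitab", "fi", "maktaba", "katib", "madrasa",
          "al-internet", "dars", "kitab"]
counts, members = equivalence_classes(tokens)
for key, count in counts.most_common():
    print(key, count, sorted(members[key]))
```

In this toy input the three words derived from the root `ktb` are counted as one class, so the class outranks any single surface form; that is the effect the abstract attributes to accumulating counts within an equivalence class.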
