Conference: International Conference on Neural Information Processing

A Fast Algorithm for Local Rank Distance: Application to Arabic Native Language Identification


Abstract

A novel distance measure for strings, termed Local Rank Distance (LRD), was recently introduced. LRD is inspired by rank distance, but it is designed to conform to more general principles while being better adapted to specific data types, such as DNA strings or text. More precisely, LRD measures the local displacement of character n-grams between two strings. Local Rank Distance has already demonstrated promising results in computational biology and native language identification, but the algorithm originally used to compute it is computationally expensive. In this paper, an efficient algorithm for LRD is proposed. The main efficiency improvement is to build a positional inverted index for the character n-grams in one of the compared strings. Then, for each n-gram in the other string, a binary search is used to find the position of the nearest matching n-gram in the positional inverted index. The proposed algorithm is more than two orders of magnitude faster than the original algorithm. An application of the described algorithm is also presented: state-of-the-art results are reported for Arabic native language identification from text documents.
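The two steps described in the abstract (a positional inverted index over one string, then a binary search for each n-gram of the other) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation. The abstract does not give the exact LRD formula, so the code assumes that each n-gram contributes the offset to its nearest match, that this offset is capped at a maximum `max_offset` (also charged when there is no match at all), and that the distance sums both directions. The names `build_positional_index`, `nearest_offset`, `one_sided` and `local_rank_distance` and the default parameter values are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def build_positional_index(s, n):
    """Map each character n-gram of s to the sorted list of its start positions."""
    index = defaultdict(list)
    for i in range(len(s) - n + 1):
        # Positions are appended in increasing order, so each list stays sorted.
        index[s[i:i + n]].append(i)
    return index


def nearest_offset(positions, i, max_offset):
    """Distance from i to the closest value in sorted positions, capped at max_offset."""
    k = bisect_left(positions, i)  # first stored position >= i
    best = max_offset
    if k < len(positions):
        best = min(best, positions[k] - i)      # nearest match at or to the right
    if k > 0:
        best = min(best, i - positions[k - 1])  # nearest match to the left
    return best


def one_sided(x, y, n, max_offset):
    """Sum, over the n-grams of x, of the capped offset to the nearest match in y."""
    index = build_positional_index(y, n)
    total = 0
    for i in range(len(x) - n + 1):
        positions = index.get(x[i:i + n])
        # Assumption: an n-gram with no match in y is charged max_offset.
        total += nearest_offset(positions, i, max_offset) if positions else max_offset
    return total


def local_rank_distance(x, y, n=2, max_offset=5):
    """Assumed symmetric form: offsets from x to y plus offsets from y to x."""
    return one_sided(x, y, n, max_offset) + one_sided(y, x, n, max_offset)
```

Under these assumptions, `local_rank_distance("abcd", "abcd")` is 0, and each lookup costs one dictionary access plus a binary search over the occurrences of a single n-gram rather than a scan over the other string.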
