Journal: Australian Journal of Intelligent Information Processing Systems

Using Frequent Max Substring Technique for Thai Text Indexing


Abstract

This paper proposes a Thai text indexing method that uses a frequent max substring technique to improve indexing efficiency. Thai is an undelimited language: text is written as a string of symbols without explicit word delimiters. Therefore, some pre-processing may be needed to discover important patterns before indexing can be performed. In this paper, the frequent max substring technique is proposed as a promising alternative for indexing Thai texts. The technique extracts long, frequently occurring substrings, called frequent max substrings, from Thai texts and uses them as indexing terms. It extracts patterns of interest without considering context and focuses on substrings that occur frequently, in order to reduce the number of insignificant indexing terms. It is also language-independent and does not rely on any dictionary or grammatical knowledge of the language. A new data structure, called the Frequent Suffix Trie, is also proposed to ensure exhaustive enumeration of substrings in support of extracting the frequent max substrings. The frequent max substrings are then used as indexing terms, together with their number of occurrences and positions, to form an index. To illustrate the proposed technique, experimental studies and comparison results on indexing Thai texts are presented. The results show that the frequent max substring technique provides a more efficient way of storing indexing terms, because only the frequent max substrings are indexed.
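The idea in the abstract can be illustrated with a minimal Python sketch. It rests on assumptions that the abstract does not confirm: a substring is treated here as a "frequent max substring" if it occurs at least `min_freq` times (overlapping occurrences counted) and cannot be extended by one character on either side while remaining that frequent. The paper's exact definition may differ. The function name `frequent_max_substrings` and its parameters are hypothetical, and the sketch enumerates substrings by brute force rather than with the Frequent Suffix Trie the paper proposes.

```python
from collections import defaultdict


def frequent_max_substrings(text, min_freq=2, max_len=None):
    """Map each assumed 'frequent max substring' to its start positions.

    Assumed definition (not taken from the paper): a substring that occurs
    at least min_freq times and has no one-character extension, to the left
    or to the right, that also occurs at least min_freq times.
    """
    n = len(text)
    if max_len is None:
        max_len = n

    # Brute-force enumeration of every substring up to max_len characters,
    # recording where each one starts (a stand-in for the suffix trie).
    positions = defaultdict(list)
    for start in range(n):
        for end in range(start + 1, min(n, start + max_len) + 1):
            positions[text[start:end]].append(start)

    # Keep only substrings that meet the frequency threshold.
    frequent = {s: p for s, p in positions.items() if len(p) >= min_freq}

    # A frequent substring is not maximal if a frequent substring that is
    # one character longer starts or ends with it.
    covered = set()
    for t in frequent:
        if len(t) > 1:
            covered.add(t[:-1])
            covered.add(t[1:])

    return {s: p for s, p in frequent.items() if s not in covered}


# Index entries: term -> positions; the occurrence count is len(positions).
index = frequent_max_substrings("abcabcx", min_freq=2)
print(index)  # {'abc': [0, 3]}

# The same call works on undelimited Thai text, character by character.
thai_index = frequent_max_substrings("การเรียนการสอน", min_freq=2)
print(thai_index)
```

In the first call, the shorter repeats `a`, `ab`, and `bc` are dropped because the longer repeat `abc` already covers them, which is how this kind of filtering keeps the number of indexing terms down. The brute-force enumeration is quadratic in the text length, so it is suitable only for short illustrative inputs.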

Bibliographic details

  • Author affiliations

    School of Information Technology, Murdoch University South St, Murdoch, Western Australia 6150;

    School of Information Technology, Murdoch University South St, Murdoch, Western Australia 6150;

    School of Information Technology, Murdoch University South St, Murdoch, Western Australia 6150;

  • Original format: PDF
  • Language of text: eng

  • Date added: 2022-08-17 13:25:38
