Using Frequent Max Substring Technique for Thai Text Indexing

Todsanai Chumwatana; Kok Wai Wong; Hong Xie

Abstract

This paper proposes a Thai text indexing method that uses a frequent max substring technique to improve the efficiency of indexing Thai texts. Thai is considered an undelimited language: text is written as a string of symbols without explicit word delimiters. Some pre-processing may therefore be needed to discover important patterns before indexing can be performed. In this paper, the frequent max substring technique is proposed as a promising alternative for indexing Thai texts. The technique extracts indexing terms as long and frequent substrings, called frequent max substrings, from Thai texts. It extracts the patterns of interest without considering context, and it keeps only substrings that occur frequently in the texts in order to reduce the number of insignificant indexing terms. It is also language-independent and does not rely on any dictionary or grammatical knowledge of the language. A new data structure, called the Frequent Suffix Trie, is also proposed to ensure exhaustive enumeration of substrings in support of extracting the frequent max substrings. The frequent max substrings are then used as indexing terms, together with their number of occurrences and their positions, to form an index. To illustrate the proposed technique, experimental studies and comparison results on indexing Thai texts are presented. The results show that the frequent max substring technique provides a more efficient way of storing indexing terms, because only the frequent max substrings are indexed.
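The idea of indexing long, frequent substrings together with their counts and positions can be illustrated with a minimal sketch. It is not the authors' algorithm: it enumerates substrings by brute force instead of using the Frequent Suffix Trie, and it assumes a definition the abstract does not spell out, namely that a substring is frequent if it occurs at least `min_freq` times and is "max" if no longer frequent substring contains it. The function name `frequent_max_substrings` and the parameter `min_freq` are hypothetical.

```python
from collections import defaultdict


def frequent_max_substrings(text, min_freq=2):
    """Map each frequent max substring to the list of its start positions.

    Assumed definition (not taken from the paper): a substring is
    frequent if it occurs at least min_freq times (overlaps allowed),
    and it is "max" if no longer frequent substring contains it.
    """
    positions = defaultdict(list)
    n = len(text)
    # Brute-force enumeration of all O(n^2) substrings; the paper's
    # Frequent Suffix Trie is meant to do this far more economically.
    for start in range(n):
        for end in range(start + 1, n + 1):
            positions[text[start:end]].append(start)

    # Keep only substrings that meet the frequency threshold.
    frequent = {s: p for s, p in positions.items() if len(p) >= min_freq}

    # Drop every frequent substring contained in a longer frequent one.
    return {
        s: p
        for s, p in frequent.items()
        if not any(s != t and s in t for t in frequent)
    }


if __name__ == "__main__":
    index = frequent_max_substrings("xabyabz", min_freq=2)
    print(index)  # {'ab': [1, 4]}
    for term, starts in index.items():
        print(term, len(starts), starts)
```

Each index entry carries the term, its number of occurrences (the length of the position list), and its positions, matching the three pieces of information the abstract says form the index. No dictionary or word segmentation is involved, so the same function can be applied to a Thai string unchanged.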


Bibliographic details

Source
Australian Journal of Intelligent Information Processing Systems | 2012, Issue 2 | pp. 13-28 | 16 pages
Authors
Todsanai Chumwatana; Kok Wai Wong; Hong Xie
Affiliation
School of Information Technology, Murdoch University, South St, Murdoch, Western Australia 6150 (listed for all three authors)
Original format: PDF
Language: English
Date added to database: 2022-08-17 13:25:38

Similar documents

1. Using Frequent Substring Mining Techniques for Indexing Genome Sequences: A Comparison of Frequent Substring and Frequent Max Substring Algorithms [J]. Todsanai Chumwatana. Journal of Advances in Information Technology, 2016, Issue 4.
2. A SOM-Based Document Clustering Using Frequent Max Substrings for Non-Segmented Texts [J]. Todsanai Chumwatana, Kok Wai Wong, Hong Xie. Journal of Intelligent Learning Systems and Applications, 2010, Issue 3.
3. Using Adaptive Automata in Grammar Based Text Compression to Identify Frequent Substrings [J]. Newton Kiyotaka Miura, Joao Jose Neto. International Journal of Computer Science & Information Technology (IJCSIT), 2017, Issue 2.
4. An automatic indexing technique for Thai texts using frequent max substring [C]. Chumwatana Todsanai, Wong Kok Wai, Xie Hong. Natural Language Processing, 2009. SNLP '09, 2009.
5. Spatio-temporal frequent pattern mining for public safety: Concepts and Techniques [D]. Mohan, Pradeep. 2012.
6. Unsupervised Mining of Frequent Tags for Clinical Eligibility Text Indexing [O]. Riccardo Miotto, Chunhua Weng.
7. A frequent max substring technique for Thai text indexing [O]. Chumwatana Todsanai. 2011.
