An automatic indexing technique for Thai texts using frequent max substring

Abstract

Thai is a non-segmented language: words are strings of symbols without explicit word boundaries, and the structure of written Thai is highly ambiguous. As a result, indexing has become a main issue in Thai text retrieval. To construct an inverted index for Thai texts, an index term extraction technique is usually required to segment the texts into index terms. Although index terms can be specified manually by experts, this process is time-consuming and labor-intensive. Word segmentation is one of many techniques used to extract index terms from Thai texts automatically. However, most word segmentation techniques require linguistic knowledge, and preparing them is time-consuming. The n-gram based approach is another automatic index term extraction method that is often used for indexing Asian languages, including Thai. It is language-independent and requires no linguistic knowledge or dictionary. Although the n-gram approach outperforms many indexing techniques for Asian languages in terms of retrieval effectiveness, it suffers from large storage requirements and long retrieval times. In this paper we present frequent max substring mining to extract index terms from Thai texts. Our method is language-independent and does not rely on any dictionary or grammatical knowledge. Frequent max substring mining is based on text mining, the process of discovering useful information or knowledge from unstructured texts. The approach analyzes frequent max substring sets to extract all long, frequently occurring substrings. We aim to employ the frequent max substring mining algorithm to address the drawbacks of the n-gram based approach: keeping only frequent max substrings reduces the disk space required to store index terms and reduces retrieval time, which helps cope with the rapid growth of Thai texts.
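The idea of keeping only long, frequently occurring substrings as index terms can be illustrated with a small Python sketch. The abstract does not give a formal definition or the authors' algorithm, so the sketch assumes a working definition: a substring is kept if it occurs at least `min_freq` times and is not contained in any longer frequent substring with the same count. The function names, the `min_freq` and `max_len` parameters, and the brute-force enumeration are all illustrative assumptions, not the paper's method.

```python
from collections import Counter, defaultdict


def substring_counts(text, max_len=None):
    """Count occurrences (overlaps included) of every substring up to max_len chars."""
    n = len(text)
    limit = max_len or n
    counts = Counter()
    for start in range(n):
        for end in range(start + 1, min(n, start + limit) + 1):
            counts[text[start:end]] += 1
    return counts


def frequent_max_substrings(text, min_freq=2, max_len=None):
    """Return {substring: count} under an ASSUMED working definition.

    Assumption (not taken from the abstract): a substring qualifies if it
    occurs at least min_freq times and no longer frequent substring that
    contains it has the same count.
    """
    counts = substring_counts(text, max_len)
    frequent = {s: c for s, c in counts.items() if c >= min_freq}
    return {
        s: c
        for s, c in frequent.items()
        # Drop s when a longer frequent substring with equal count subsumes it.
        if not any(len(t) > len(s) and s in t and frequent[t] == c for t in frequent)
    }


def build_inverted_index(docs, min_freq=2, max_len=None):
    """Map each extracted term to the set of document ids it was mined from."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in frequent_max_substrings(text, min_freq, max_len):
            index[term].add(doc_id)
    return dict(index)


print(frequent_max_substrings("banana"))  # {'a': 3, 'ana': 2}
```

In "banana", the substrings "n", "an" and "na" also occur twice, but each is contained in "ana", which occurs equally often, so only "ana" is stored alongside "a". A fixed-length n-gram index would instead store every n-gram, which is the storage overhead the abstract refers to. The enumeration here is quadratic in the text length and is only suitable for small inputs.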

Bibliographic details

Source: International Symposium on Natural Language Processing, 2009, 6 pages
Authors: Todsanai Chumwatana; Kok Wai Wong; Hong Xie
Original format: PDF
Chinese Library Classification: TP312-53

Similar literature

1. Using Frequent Max Substring Technique for Thai Text Indexing [J]. Todsanai Chumwatana, Kok Wai Wong, Hong Xie. Australian Journal of Intelligent Information Processing Systems. 2012, No. 2
2. Using Frequent Substring Mining Techniques for Indexing Genome Sequences: A Comparison of Frequent Substring and Frequent Max Substring Algorithms [J]. Todsanai Chumwatana. Journal of Advances in Information Technology. 2016, No. 4
3. A SOM-Based Document Clustering Using Frequent Max Substrings for Non-Segmented Texts [J]. Todsanai Chumwatana, Kok Wai Wong, Hong Xie. Journal of Intelligent Learning Systems and Applications. 2010, No. 3
4. An automatic indexing technique for Thai texts using frequent max substring [C]. Chumwatana Todsanai, Wong Kok Wai, Xie Hong. Natural Language Processing, 2009. SNLP '09. 2009
5. An experiment in automatic indexing with Korean texts: A comparison of syntactico-statistical and manual methods [D]. Seo, Eun-Gyoung. 1993
6. Unsupervised Mining of Frequent Tags for Clinical Eligibility Text Indexing [O]. Riccardo Miotto, Chunhua Weng
7. An automatic indexing technique for Thai texts using frequent max substring [O]. Chumwatana, T., Wong, K.W., Xie, H. 2009
