Source: International Symposium on Natural Language Processing

An automatic indexing technique for Thai texts using frequent max substring


Abstract

Thai is a non-segmented language: words are written as strings of symbols without explicit word boundaries, and the structure of written Thai is highly ambiguous. This makes indexing a central issue in Thai text retrieval. To construct an inverted index for Thai texts, an index term extraction technique is usually required to segment the texts into index terms. Although index terms can be specified manually by experts, this process is very time-consuming and labor-intensive. Word segmentation is one of the many techniques used to extract index terms from Thai texts automatically. However, most word segmentation techniques require linguistic knowledge, and preparing these approaches is time-consuming. The n-gram based approach is another automatic index term extraction method that is often used as an indexing technique for Asian languages, including Thai. This approach is language-independent and does not require any linguistic knowledge or dictionary. Although the n-gram approach outperforms many indexing techniques for Asian languages in terms of retrieval effectiveness, its disadvantage is that it requires large storage space and long retrieval time. In this paper we present frequent max substring mining to extract index terms from Thai texts. Our method is language-independent and does not rely on any dictionary or grammatical knowledge. Frequent max substring mining is based on text mining, the process of discovering useful information or knowledge from unstructured texts. The approach analyzes frequent max substring sets to extract all long, frequently occurring substrings. We aim to employ the frequent max substring mining algorithm to address the drawback of the n-gram based approach: by keeping only frequent max substrings, it reduces the disk space required for storing index terms and the retrieval time, in order to cope with the rapid growth of Thai texts.
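
The contrast between n-gram indexing and frequent max substring indexing can be illustrated with a small Python sketch. The abstract does not give the algorithm itself, so everything here is an assumption made for illustration: the function names, the `min_freq` and `max_len` parameters, and in particular the working definition of a frequent max substring as a substring that occurs at least `min_freq` times and cannot be extended by one character on either side without its occurrence count dropping. The brute-force enumeration is only meant to show the idea, not the authors' mining procedure.

```python
from collections import defaultdict


def char_ngrams(text, n):
    """Return all overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_index(docs, n):
    """Build an inverted index: n-gram -> set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for gram in char_ngrams(text, n):
            index[gram].add(doc_id)
    return index


def substring_counts(text, max_len):
    """Count (overlapping) occurrences of every substring up to max_len chars."""
    counts = defaultdict(int)
    for i in range(len(text)):
        for j in range(i + 1, min(len(text), i + max_len) + 1):
            counts[text[i:j]] += 1
    return counts


def frequent_max_substrings(text, min_freq=2, max_len=20):
    """Naive sketch under an ASSUMED definition (not taken from the paper):
    keep substrings with count >= min_freq that have no one-character
    extension (left or right) with the same count."""
    counts = substring_counts(text, max_len)
    frequent = {s: c for s, c in counts.items() if c >= min_freq}
    non_maximal = set()
    for s, c in frequent.items():
        # s makes its one-character-shorter prefix and suffix redundant
        # whenever they occur exactly as often as s does.
        for shorter in (s[:-1], s[1:]):
            if shorter and frequent.get(shorter) == c:
                non_maximal.add(shorter)
    return {s: c for s, c in frequent.items() if s not in non_maximal}


if __name__ == "__main__":
    sample = "abcXabcYabc"  # hypothetical unsegmented text
    print(sorted(ngram_index([sample], 2)))            # every distinct bigram
    print(frequent_max_substrings(sample, min_freq=2)) # only the long repeated unit
```

On the toy string, the bigram index stores six distinct terms, while the frequent max substring set keeps the single repeated unit `abc`, which is the storage saving the abstract aims for. For real Thai text, note that Python strings are indexed by code point, so a substring boundary may fall between a base consonant and its vowel or tone mark; handling that is outside this sketch.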
