DISTRIBUTED MULTI-LINGUAL CONTENT BASED TEXT MINING DML – CBTM

Abstract

With the explosion of information on the internet, extracting knowledge is getting more complex as media-based data in the form of images, audio streams and videos replaces textual data. A comprehensive methodology covering all forms of data is therefore needed, one that can provide the contents of the data in a short period of time. Text mining tools and algorithms are becoming increasingly popular as many books, texts and documents are converted to soft-copy versions and made globally accessible. Though this trend is predominantly in the English language, the need has arisen for such an approach for other languages too, as many ancient and out-of-print texts in different languages are getting ‘softer’ versions for preservation and for the extraction of information and knowledge. In the context of Indian languages this need is more pronounced, as many texts are available in different languages, scripts and dialects, and in material forms ranging from palm leaves to stone carvings, holding a wealth of information in a variety of disciplines. In this paper, we propose a novel content-based approach, termed CBTM (Content-Based Text-Mining), for knowledge discovery in multilingual texts, and demonstrate it on textual data in the first instance. The proposed methodology uses keywords and patterns stored in the form of gif strings so that extensions to other forms of data are possible. Potential applications of this approach in a distributed environment are also highlighted. We have used newspaper advertisements to demonstrate the system.