Frontiers in Applied Mathematics and Statistics

Text Classification Using the N-Gram Graph Representation Model Over High Frequency Data Streams


Abstract

A prominent challenge in our information age is classification over high-frequency data streams. In this research, we propose an innovative and highly accurate text stream classification model that is designed in an elastic, distributed way and is capable of serving a text load whose frequency fluctuates. In this classification model, text is represented as N-Gram Graphs, and classification is performed using text preprocessing, graph similarity and feature classification techniques, following the supervised machine learning approach. The work involves the analysis of many variations of the proposed model and its parameters, such as various representations of text as N-Gram Graphs, graph comparison metrics and classification methods, in order to arrive at the most accurate setup. To deal with scalability, availability and timely response under high-frequency text, we employ the Beam programming model. With the Beam programming model, the classification process is expressed as a sequence of distinct tasks, which facilitates the distributed implementation of the most computationally demanding tasks of the inference stage. The proposed model and the various parameters that constitute it are evaluated experimentally, and the high-frequency stream is emulated using two public datasets (20NewsGroup and Reuters-21578) that are commonly used in the literature for text classification.
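To make the N-Gram Graph idea concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes character n-grams as nodes, a fixed co-occurrence window for edges, and a simple containment-style similarity; the abstract does not specify any of these choices. The names `ngram_graph`, `containment_similarity` and `classify`, as well as the parameters `n` and `window`, are hypothetical. For brevity the sketch assigns the label of the most similar class graph, whereas the abstract describes feeding graph similarities into a separate feature classification step.

```python
from collections import Counter


def ngram_graph(text, n=3, window=3):
    """Build a character n-gram graph as a Counter of weighted, undirected edges.

    Assumption: two n-grams are linked when they start within `window`
    positions of each other; the edge weight counts how often that happens.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        for other in grams[i + 1:i + 1 + window]:
            # Sort the pair so that (a, b) and (b, a) are the same edge.
            edges[tuple(sorted((gram, other)))] += 1
    return edges


def containment_similarity(g1, g2):
    """Share of the smaller graph's edges that also appear in the other graph."""
    if not g1 or not g2:
        return 0.0
    shared = len(g1.keys() & g2.keys())
    return shared / min(len(g1), len(g2))


def classify(text, class_graphs, n=3, window=3):
    """Return the label whose class graph is most similar to the text's graph.

    Simplification: nearest class graph instead of a trained classifier
    over similarity features.
    """
    doc = ngram_graph(text, n, window)
    return max(
        class_graphs,
        key=lambda label: containment_similarity(doc, class_graphs[label]),
    )


if __name__ == "__main__":
    # Toy class graphs built from one made-up sentence per class.
    class_graphs = {
        "sports": ngram_graph("the team won the football match"),
        "finance": ngram_graph("the bank raised interest rates"),
    }
    print(classify("football team match", class_graphs))
```

In a fuller setup, a class graph would be built by merging the graphs of all training documents of that class (for example by summing the `Counter` objects), and the per-class similarity scores would form the feature vector passed to a supervised classifier.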
