Conference: International Conference on Knowledge Discovery and Information Retrieval

Combining N-gram based Similarity Analysis with Sentiment Analysis in Web Content Classification



Abstract

This research concerns the development of web content detection systems that can automatically classify any web page into pre-defined content categories. Our work is motivated by practical experience and by the observation that certain categories of web pages, such as those that contain hatred and violence, are much harder to classify with good accuracy even when both content and structural features are already taken into account. To further improve the performance of detection systems, we bring web sentiment features into the classification models. In addition, we incorporate an n-gram representation into our classification approach, based on the assumption that n-grams capture more local context information in text and thus could help to enhance topic similarity analysis. Unlike most studies, which consider only the presence or frequency count of n-grams, we use tf-idf weighted n-grams to build the content classification models. Our results show that unigram-based models, although a much simpler approach, retain their unique value and effectiveness in web content classification. Higher-order n-gram-based approaches, especially 5-gram-based models that combine topic similarity features with sentiment features, bring significant improvement in precision for the Violence category and the two Racism-related web categories.
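To make the idea of combining tf-idf weighted n-gram similarity with a sentiment feature more concrete, the following is a minimal Python sketch and not the authors' implementation. Everything specific in it is an assumption made for illustration: word-level n-grams, whitespace tokenization, a smoothed idf formula, cosine similarity as the topic similarity measure, and the hypothetical `NEGATIVE_WORDS` lexicon and `features` helper. The abstract does not state which of these the paper actually uses.

```python
import math
from collections import Counter


def word_ngrams(text, n):
    """Return the word-level n-grams (as tuples) found in `text`."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_vectors(docs, n):
    """Build one sparse tf-idf vector (a dict) per document over its n-grams."""
    counts = [Counter(word_ngrams(d, n)) for d in docs]
    df = Counter(g for c in counts for g in c)  # document frequency per n-gram
    num_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1
        # Assumed smoothed idf: n-grams present in every document keep weight 1.
        vectors.append({
            g: (tf / total) * (math.log((1 + num_docs) / (1 + df[g])) + 1)
            for g, tf in c.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy lexicon; a real system would use a proper sentiment resource.
NEGATIVE_WORDS = {"hate", "attack", "destroy"}


def negative_ratio(text):
    """Fraction of tokens that appear in the toy negative-word lexicon."""
    tokens = text.lower().split()
    return sum(t in NEGATIVE_WORDS for t in tokens) / len(tokens) if tokens else 0.0


def features(page, reference, n=1):
    """Return [n-gram topic similarity to a reference text, sentiment score]."""
    page_vec, ref_vec = tfidf_vectors([page, reference], n)
    return [cosine(page_vec, ref_vec), negative_ratio(page)]
```

In this sketch, calling `features(page_text, category_reference_text, n=5)` would yield a two-element feature vector that a downstream classifier could consume. Raising `n` makes the similarity stricter, since longer word sequences must match exactly, which is one plausible reading of why higher-order n-grams could favor precision.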
