Journal: Data Mining and Knowledge Discovery

Exploiting link structure for web page genre identification



Abstract

As the World Wide Web develops at an unprecedented pace, identifying web page genre has attracted increasing attention because of its importance in web search. A common approach to genre identification uses textual features that can be extracted directly from a web page, that is, On-Page features. The extracted features are then fed into a machine learning algorithm that performs the classification. However, this approach may be ineffective when the web page contains limited textual information (e.g., the page consists mostly of images). In this study, we address genre identification for such pages. We propose a framework that uses On-Page features while also considering information in neighboring pages, that is, the pages connected to the original page by backward and forward links. We first introduce a graph-based model called GenreSim, which selects an appropriate set of neighboring pages. We then construct a multiple classifier combination module that uses information from the selected neighboring pages together with On-Page features to improve genre identification performance. Experiments on well-known corpora yield favorable results, indicating that the proposed framework is effective, particularly for web pages with limited textual information.
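The abstract does not specify how GenreSim scores neighbors or how the classifiers are combined, so the following is only an illustrative sketch of the general idea: blend a page's own genre scores with similarity-weighted scores from selected neighboring pages. The function names, the similarity threshold, and the mixing weight `alpha` are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

Scores = Dict[str, float]            # genre label -> classifier score
Neighbor = Tuple[float, Scores]      # (similarity weight, neighbor's genre scores)


def select_neighbors(neighbors: List[Neighbor], threshold: float) -> List[Neighbor]:
    """Keep neighbors whose similarity meets the threshold.

    Hypothetical stand-in for a neighbor-selection step; not the GenreSim model.
    """
    return [(w, s) for w, s in neighbors if w >= threshold]


def combine_scores(on_page: Scores, neighbors: List[Neighbor], alpha: float = 0.5) -> Scores:
    """Blend On-Page scores with the weighted average of neighbor scores.

    alpha is an assumed mixing weight: 1.0 uses only On-Page scores.
    """
    total = sum(w for w, _ in neighbors)
    if total <= 0:
        # No usable neighbors: fall back to the page's own scores.
        return dict(on_page)
    genres = set(on_page)
    for _, s in neighbors:
        genres |= set(s)
    combined = {}
    for g in genres:
        neighbor_avg = sum(w * s.get(g, 0.0) for w, s in neighbors) / total
        combined[g] = alpha * on_page.get(g, 0.0) + (1 - alpha) * neighbor_avg
    return combined


def predict_genre(scores: Scores) -> str:
    """Return the genre with the highest combined score."""
    return max(scores, key=scores.get)


# Example: an image-heavy page whose own text gives a weak, uncertain signal.
on_page = {"shop": 0.4, "blog": 0.6}
neighbors = [
    (0.9, {"shop": 0.9, "blog": 0.1}),   # strongly related neighbor
    (0.1, {"shop": 0.2, "blog": 0.8}),   # weakly related neighbor, filtered out
]
selected = select_neighbors(neighbors, threshold=0.5)
combined = combine_scores(on_page, selected, alpha=0.5)
predicted = predict_genre(combined)
```

In this toy example the single retained neighbor shifts the prediction away from the page's own weak On-Page signal, which is the kind of effect the framework aims for on pages with little text.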
