International Conference on Web Information Systems and Technologies

WEB PAGE CLASSIFICATION CONSIDERING PAGE GROUP STRUCTURE FOR BUILDING A HIGH-QUALITY HOMEPAGE COLLECTION



Abstract

We propose a web page classification method that takes page group structure into account in order to build a high-quality homepage collection. We use a support vector machine (SVM) with textual features obtained from each page and its surrounding pages. The surrounding pages are grouped according to connection type (in-link, out-link, or directory entry) and relative URL hierarchy (same, upper, or lower), and an independent feature subset is generated from each group. The feature subsets are then concatenated to form the feature set of a classifier. Experimental results on the ResJ-01 data set, which was manually created by the authors, and on the WebKB data set show that the proposed features are effective compared with a baseline and some prior work. By tuning the classifiers, we then build a three-way classifier that combines a recall-assured classifier and a precision-assured classifier to accurately select the pages that need manual assessment to assure the required quality. This approach is also shown to be effective in reducing the amount of manual assessment.
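The grouping and three-way decision described in the abstract might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the helper names (`dir_depth`, `relative_level`, `build_features`, `three_way`) are hypothetical, the URL hierarchy is approximated by comparing directory depth, the feature subsets are concatenated by prefixing each word with its group name, and the three-way rule is one plausible reading of how a recall-assured and a precision-assured classifier could be combined. The SVM training step itself is omitted.

```python
from collections import Counter
from urllib.parse import urlparse


def dir_depth(url):
    """Number of directory segments in the URL path (the file name is ignored)."""
    path = urlparse(url).path
    return len([seg for seg in path.split("/")[:-1] if seg])


def relative_level(target_url, other_url):
    """Assumed definition: compare directory depth only, giving 'same', 'upper', or 'lower'."""
    target_depth, other_depth = dir_depth(target_url), dir_depth(other_url)
    if other_depth == target_depth:
        return "same"
    return "upper" if other_depth < target_depth else "lower"


def build_features(target_url, target_words, neighbors):
    """Build one sparse feature vector for the target page.

    neighbors is an iterable of (connection_type, url, words) tuples, where
    connection_type is 'in-link', 'out-link', or 'directory'.
    Prefixing each word with its group keeps the subsets independent, which is
    equivalent to concatenating one feature subset per group.
    """
    features = Counter()
    for word in target_words:
        features[f"self:{word}"] += 1
    for connection_type, url, words in neighbors:
        group = f"{connection_type}/{relative_level(target_url, url)}"
        for word in words:
            features[f"{group}:{word}"] += 1
    return features


def three_way(accepted_by_recall_assured, accepted_by_precision_assured):
    """Assumed combination rule for the two tuned classifiers.

    A page accepted by the precision-assured classifier is taken as positive,
    a page rejected by the recall-assured classifier is taken as negative,
    and everything in between is sent to manual assessment.
    """
    if accepted_by_precision_assured:
        return "positive"
    if not accepted_by_recall_assured:
        return "negative"
    return "needs manual assessment"


# Hypothetical example page with one in-link and one out-link neighbor.
example = build_features(
    "http://example.org/lab/index.html",
    ["research", "lab"],
    [
        ("in-link", "http://example.org/index.html", ["lab", "members"]),
        ("out-link", "http://example.org/lab/pubs/list.html", ["papers"]),
    ],
)
```

In this sketch the resulting sparse vector could be passed to any SVM implementation after vectorization; only the pages labeled "needs manual assessment" would then require a human decision.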
