Conference: International Conference on Web Engineering

Web Page Structured Content Detection Using Supervised Machine Learning

Abstract

In this paper we present a comparative study of several supervised machine learning techniques, including homogeneous and heterogeneous ensembles, for classifying content and noise in web pages. We specifically tackle the detection of content in semi-structured data (e.g., e-commerce search results) under two settings: a controlled environment containing only documents with structured content, and an open environment where the web page being processed may or may not have structured content. The features are obtained automatically from a preexisting, publicly available extraction technique that processes web pages as a sequence of tag paths; the features are therefore extracted from these sequences rather than from the DOM tree. Besides comparing the performance of the different models, we also conducted extensive feature selection and combination experiments. We achieved an average F-score of about 93% in the controlled setting and 91% in the open setting.
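To make the idea of treating a page as a sequence of tag paths more concrete, the following is a minimal sketch in Python using only the standard library. It is not the extraction technique the abstract refers to, which is not specified here. The names `TagPathCollector`, `tag_path_sequence`, and `encode`, and the choice to number paths in order of first appearance, are assumptions made purely for illustration.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Records the root-to-node tag path of every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []   # tags that are currently open
        self.paths = []   # one tag path per opening tag

    def handle_starttag(self, tag, attrs):
        self.paths.append("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_path_sequence(html):
    """Return the page as a list of tag paths (hypothetical helper)."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def encode(paths):
    """Map each distinct path to an integer, numbered by first appearance (an assumption)."""
    codes = {}
    return [codes.setdefault(path, len(codes)) for path in paths]


page = ("<html><body><ul>"
        "<li><a href='#'>A</a></li>"
        "<li><a href='#'>B</a></li>"
        "</ul><p>footer<br>x</p></body></html>")
paths = tag_path_sequence(page)
codes = encode(paths)
```

In this toy page the two list items produce the same pair of paths (`html/body/ul/li` and `html/body/ul/li/a`), so the encoded sequence contains a repeating run. Repetition of this kind is the sort of regularity a classifier could exploit as a feature without building the DOM tree.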