IEEE Annual Consumer Communications and Networking Conference

Machine-Learning directed Article Detection on the Web using DOM and text-based features


Abstract

The burgeoning trend of digital advertising has led websites to carry large amounts of irrelevant content on their pages, such as page navigation, ads, distractions, and promotional videos. Compared with desktop, the limited screen space of a mobile device makes this ineffectual content more costly: it magnifies the visual complexity of the web page and disrupts the user's focus on the useful content. To solve this issue, Reader mode renders a clutter-free version of the web page by stripping out all insignificant elements, letting users read the content they are actually interested in. The existing implementation in our browser involves two modules: an Article Detection module that runs heuristics to detect whether a web page is an article, followed by an Article Extraction module that extracts the main content of the web page by removing all unwanted elements. Our paper focuses on improving the accuracy of the article detection logic by proposing a machine-learning-based solution that significantly surpasses the present heuristic-based model. To eliminate false positives on web pages with meaningless content such as product descriptions, we extended our previous solution [1] by incorporating backward-elimination heuristics for the web page title and extracting strong Boolean predictors. Our approach also solves the problem of detecting reader mode on Google's AMP HTML web pages, where the main content is rendered in an iframe. With this model, we achieved a precision of 0.99 and a recall of 0.94, which outperforms the state-of-the-art techniques used by the major Android-based web browsers on the market today.
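As a rough illustration of what "DOM and text-based features" with Boolean predictors might look like, the sketch below computes a few page-level counts using only Python's standard library, plus the precision and recall metrics the abstract reports. The abstract does not list the paper's actual features or classifier, so every name here (`FeatureParser`, `extract_features`, `link_density`, `has_article_tag`, `has_iframe`) is a hypothetical stand-in and not the authors' method.

```python
from html.parser import HTMLParser


class FeatureParser(HTMLParser):
    """Collects a few illustrative DOM and text counts (hypothetical features)."""

    def __init__(self):
        super().__init__()
        self.tag_counts = {}   # tag name -> number of opening tags seen
        self.text_chars = 0    # visible text characters (approximate)
        self.link_chars = 0    # visible text characters inside <a> elements
        self._open_links = 0
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] = self.tag_counts.get(tag, 0) + 1
        if tag == "a":
            self._open_links += 1
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._open_links > 0:
            self._open_links -= 1
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script and style bodies
        n = len(data.strip())
        self.text_chars += n
        if self._open_links:
            self.link_chars += n


def extract_features(html):
    """Return a dict of assumed numeric and Boolean predictors for one page."""
    parser = FeatureParser()
    parser.feed(html)
    parser.close()
    text = parser.text_chars
    return {
        "paragraph_count": parser.tag_counts.get("p", 0),
        "text_chars": text,
        "link_density": parser.link_chars / text if text else 0.0,
        "has_article_tag": "article" in parser.tag_counts,  # Boolean predictor
        "has_iframe": "iframe" in parser.tag_counts,        # e.g. AMP-style pages
    }


def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (truthy = article)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

A feature dict like this could be fed to any binary classifier. A high precision with a somewhat lower recall, as reported in the abstract, means the detector rarely labels a non-article page as an article but misses some true articles.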