Machine-Learning directed Article Detection on the Web using DOM and text-based features


Abstract

The burgeoning trend of digital advertising has led websites to fill their pages with large amounts of irrelevant content such as page navigation, ads, distractions, and promotional videos. Compared with the desktop, the limited screen space of a mobile device makes this ineffectual content more costly: it magnifies the visual complexity of a web page and disrupts the user's focus on the useful content. To solve this issue, Reader mode renders a clutter-free version of a web page by stripping out all insignificant elements, letting users read the content they are actually interested in. The existing implementation in our browser involves two modules: an Article Detection module that runs heuristics to detect whether a web page is an article, followed by an Article Extraction module that extracts the main content of the web page by removing all unwanted elements. Our paper focuses on improving the accuracy of the article detection logic by proposing a machine-learning-based solution that significantly surpasses the present heuristic-based model. To eliminate false positives on web pages with meaningless content such as product descriptions, we extended our previous solution [1] by incorporating backward-elimination heuristics for the web page title and extracting strong Boolean predictors. Our approach also solves the problem of detecting reader mode on Google's AMP HTML web pages, where the main content is rendered in an iframe. With this model, we achieved a precision of 0.99 and a recall of 0.94, which outperforms the state-of-the-art techniques used by the major Android-based web browsers on the market today.
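
The abstract does not list the actual features, so the following is only a minimal sketch of what "DOM and text-based features" with Boolean predictors might look like. Every feature name here (paragraph_count, link_density, has_article_tag, has_og_type_article, and so on) is a hypothetical choice made for illustration, not the authors' feature set. The sketch uses only Python's standard html.parser module; the resulting dictionary could then be passed to a classifier such as a random forest, which the keywords suggest but the abstract does not describe.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class FeatureCollector(HTMLParser):
    """Collects a few illustrative DOM and text statistics (hypothetical features)."""

    SKIPPED = {"script", "style"}  # text inside these tags is not visible content

    def __init__(self):
        super().__init__()
        self.stack = []            # currently open tags
        self.tag_counts = {}       # tag name -> number of start tags seen
        self.paragraph_chars = 0   # visible characters inside <p>
        self.link_chars = 0        # visible characters inside <a>
        self.total_chars = 0       # all visible characters outside <title>
        self.title = ""
        self.og_type_article = False

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] = self.tag_counts.get(tag, 0) + 1
        if tag == "meta":
            attr = dict(attrs)
            # Open Graph convention: <meta property="og:type" content="article">
            if attr.get("property") == "og:type" and attr.get("content") == "article":
                self.og_type_article = True
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or any(t in self.SKIPPED for t in self.stack):
            return
        if "title" in self.stack:
            self.title += text
            return
        self.total_chars += len(text)
        if "p" in self.stack:
            self.paragraph_chars += len(text)
        if "a" in self.stack:
            self.link_chars += len(text)


def extract_features(html):
    """Return a dict of numeric and Boolean features for one HTML page."""
    collector = FeatureCollector()
    collector.feed(html)
    collector.close()
    total = collector.total_chars
    return {
        "paragraph_count": collector.tag_counts.get("p", 0),
        "paragraph_chars": collector.paragraph_chars,
        "link_density": collector.link_chars / total if total else 0.0,
        "title_words": len(collector.title.split()),
        "has_article_tag": "article" in collector.tag_counts,        # Boolean predictor
        "has_og_type_article": collector.og_type_article,            # Boolean predictor
    }


SAMPLE = """<html><head><title>Local bakery wins award</title>
<meta property="og:type" content="article"></head>
<body><nav><a href="/">Home</a><a href="/news">News</a></nav>
<article><p>The bakery on Main Street won a regional award.</p>
<p>Judges praised its sourdough.</p></article></body></html>"""

features = extract_features(SAMPLE)
```

A low link density combined with several text-heavy paragraphs is a common signal of article-like pages, while navigation or product-listing pages tend to show the opposite. Whether the paper uses these particular signals is not stated in the abstract.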


Bibliographic details

Source
IEEE Annual Consumer Communications and Networking Conference, 2021, pp. 1-5 (5 pages)
Authors
Shobhit Mathur; Pritam Nikam; Harshita Patidar; Rohan Bapusaheb Gaikwad; Preeti Narayan Nayak
Original format
PDF
Keywords
Visualization; Navigation; Web pages; Feature extraction; Browsers; Random forests; Videos


