qRead: A fast and accurate article extraction method from web pages using partition features optimizations


Abstract

We present qRead, a new method for real-time, high-accuracy content extraction from web pages. Earlier approaches to content extraction include empirical filtering rules, Document Object Model (DOM) trees, and machine learning models. These methods have met with some success, but they may not satisfy the requirement of real-time extraction with high accuracy. For example, constructing a DOM tree for a complex web page is time-consuming, and machine learning models can add unnecessary complexity. Unlike these approaches, qRead uses segment densities and similarities to identify the main content. Specifically, qRead first filters out obvious junk content, removes HTML tags, and partitions the remaining text into natural segments. It then identifies the main content by combining the highest ratio of words to lines in a segment with the similarity between the segment and the title. Extensive experiments show that qRead achieves 96.8% accuracy on Chinese web pages with an average extraction time of 13.20 milliseconds, and 93.6% accuracy on English web pages with an average extraction time of 11.37 milliseconds, a substantial improvement in accuracy over previous approaches that also meets the real-time extraction requirement.
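The pipeline described in the abstract (filter junk, remove tags, partition into segments, then score each segment by word-to-line density and title similarity) can be illustrated with a rough Python sketch. This is not the authors' implementation: the abstract does not give the filtering rules, the segment definition, the similarity measure, or how density and similarity are combined. The regular expressions, the choice of container tags as segment boundaries, the Jaccard overlap, the `density * (1 + similarity)` score, and all function names below are assumptions made for illustration only. The `\w+` tokenizer also assumes whitespace-delimited text, so it would need a word segmenter for Chinese pages.

```python
import html
import re

# Every pattern, name, and formula here is an illustrative assumption,
# not something specified in the qRead abstract.
JUNK_RE = re.compile(r"(?is)<(script|style|noscript)\b.*?</\1\s*>|<!--.*?-->")
TITLE_RE = re.compile(r"(?is)<title[^>]*>(.*?)</title>")
# Assumption: container tags mark segment boundaries.
CONTAINER_RE = re.compile(
    r"(?i)</?(?:div|table|ul|ol|section|article|nav|header|footer)\b[^>]*>"
)
# Assumption: these closing tags (and <br>) end a line inside a segment.
LINE_RE = re.compile(r"(?i)</(?:p|li|tr|h[1-6])\s*>|<br\b[^>]*>")
TAG_RE = re.compile(r"<[^>]+>")


def tokenize(text):
    """Lowercased word tokens (assumes whitespace-delimited languages)."""
    return re.findall(r"\w+", text.lower())


def split_segments(page):
    """Filter junk, remove tags, and return segments as lists of lines."""
    body = JUNK_RE.sub(" ", page)
    body = TITLE_RE.sub(" ", body)
    body = re.sub(r"\s+", " ", body)        # source line breaks carry no meaning
    body = CONTAINER_RE.sub("\n\n", body)   # blank line = segment boundary
    body = LINE_RE.sub("\n", body)          # single newline = line boundary
    body = html.unescape(TAG_RE.sub("", body))
    segments = []
    for chunk in re.split(r"\n[ \t]*\n", body):
        lines = [ln.strip() for ln in chunk.split("\n") if ln.strip()]
        if lines:
            segments.append(lines)
    return segments


def extract_main_content(page):
    """Return the text of the best-scoring segment, or '' if none is found."""
    match = TITLE_RE.search(page)
    title_words = set(tokenize(html.unescape(match.group(1)))) if match else set()
    best_lines, best_score = [], 0.0
    for lines in split_segments(page):
        words = tokenize(" ".join(lines))
        density = len(words) / len(lines)   # words per line in the segment
        union = title_words | set(words)
        # Assumption: Jaccard overlap stands in for the title similarity.
        similarity = len(title_words & set(words)) / len(union) if union else 0.0
        score = density * (1.0 + similarity)
        if score > best_score:
            best_lines, best_score = lines, score
    return "\n".join(best_lines)


# Hypothetical page used only to exercise the sketch.
SAMPLE_PAGE = """<html><head><title>Fast article extraction</title>
<style>body{color:red}</style></head><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
<div id="main"><p>Fast article extraction keeps only the main text of a web page.</p>
<p>It drops menus, advertisements, and scripts before scoring the remaining segments.</p></div>
<div id="footer"><p>Contact</p><p>Privacy</p></div>
<script>var x = "<p>junk</p>";</script></body></html>"""

if __name__ == "__main__":
    print(extract_main_content(SAMPLE_PAGE))
```

On the sample page, the navigation and footer segments have few words per line and share no words with the title, so the two-paragraph article segment is selected. Because the sketch relies only on regular expressions and never builds a DOM tree, it follows the spirit of the abstract's argument about extraction speed, but it makes no claim about the accuracy or timing figures reported for qRead.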


Bibliographic details

Source
2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management | 2015 | pp. 364-371 | 8 pages
Conference location: Lisbon (PT)
Authors
Jingwen Wang; Jie Wang;
Author affiliations

Department of Computer Science, University of Massachusetts, Lowell, 01854, U.S.A.;

Department of Computer Science, University of Massachusetts, Lowell, 01854, U.S.A.;

Original format: PDF
Language: English
Keywords
Web pages; HTML; Feature extraction; Real-time systems; Navigation; Layout;


Similar literature

1. Fast and accurate parameter extraction for different types of fuel cells with decomposition and nature-inspired optimization method [J]. Gong Wenyin, Yan Xuesong, Hu Chengyu. Energy Conversion & Management. 2018, Issue OCTa
2. Fast and accurate side-chain topology and energy refinement (FASTER) as a new method for protein structure optimization [J]. Desmet J, Spriet J, Lasters I. Proteins: Structure, Function, and Genetics. 2002, Issue 1
3. Text Categorization Optimization By A Hybrid Approach Using Multiple Feature Selection And Feature Extraction Methods [J]. K. Rajeswari, Sneha Nakil, Neha Patil. International Journal of Engineering Research and Applications. 2014, Issue 5
4. qRead: A Fast and Accurate Article Extraction Method from Web Pages using Partition Features Optimizations [C]. Jingwen Wang, Jie Wang. International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management. 2015
5. Methods for faster feature matching using the scale-invariant feature transform [D]. Treen, Geoffrey. 2010
6. A Fast Learning Method for Accurate and Robust Lane Detection Using Two-Stage Feature Extraction with YOLO v3 [O]. Xiang Zhang, Wei Yang, Xiaolin Tang. 2018
7. A Fast and Accurate Multi-level B-Spline Approximation with Adaptive Lattice Partitioning and Subregions Transformation Methods [O]. Masataka SEO, Yen-Wei CHEN. 2011
