Journal: Journal of wavelet theory and applications

Optimized Template Detection and Extraction Algorithm for Web Scraping of Dynamic Web Pages

Abstract

A huge number of dynamic websites span the World Wide Web and contain manifold text information that may be helpful for data analysis. Although these dynamic websites generate thousands to millions of web pages for a given user search criterion, they still represent only a few unique web templates. Identifying the underlying web templates may therefore lead to efficient and optimized extraction of information from these web pages. Different techniques such as text_MDL, the MinHash Jaccard coefficient and the MinHash Dice coefficient have been used for web template detection and extraction. These path-based techniques are complex and are not feasible when the number of web documents is very large. A feature-based web data extraction algorithm is presented that identifies the underlying template of web pages represented as DOM trees by clustering similar web pages together based on their feature similarity and then comparing other dynamic web pages with the identified template. It enables processing and extraction of large data sets from heterogeneous web pages in reliable time. We applied the proposed algorithm to live dynamic web pages of a patent portal, compared the results against existing web extraction algorithms, and found it to be more efficient in terms of throughput.
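
The abstract does not say which features or which clustering procedure the authors use. As a minimal sketch of the general idea only, the Python below assumes that a page's features are the set of root-to-node tag paths in its DOM, that similarity is the plain Jaccard coefficient of two such sets, and that clustering is a greedy single pass with a fixed threshold. The names `TagPathCollector`, `page_features`, `jaccard`, `cluster_pages` and the `0.8` threshold are hypothetical and are not taken from the paper.

```python
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths seen in an HTML document."""

    # Void elements have no closing tag, so they must not stay on the stack.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
            "link", "meta", "source", "track", "wbr"}

    def __init__(self):
        super().__init__()
        self.stack = []        # currently open tags, outermost first
        self.features = set()  # tag paths such as "html/body/div"

    def handle_starttag(self, tag, attrs):
        self.features.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def page_features(html):
    """Return the structural feature set (assumed: tag paths) of one page."""
    collector = TagPathCollector()
    collector.feed(html)
    return frozenset(collector.features)


def jaccard(a, b):
    """Jaccard coefficient of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold=0.8):
    """Greedy single-pass clustering.

    The first page of each cluster serves as that cluster's template; a later
    page joins the first cluster whose template it matches at or above the
    threshold, otherwise it starts a new cluster.
    """
    clusters = []  # list of (template_features, [page indices])
    for index, html in enumerate(pages):
        features = page_features(html)
        for template, members in clusters:
            if jaccard(features, template) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((features, [index]))
    return clusters


if __name__ == "__main__":
    pages = [
        "<html><body><div><h1>Patent A</h1><p>text</p></div></body></html>",
        "<html><body><div><h1>Patent B</h1><p>other</p></div></body></html>",
        "<html><body><table><tr><td>x</td></tr></table></body></html>",
    ]
    for template, members in cluster_pages(pages):
        print(sorted(template), members)
```

In this sketch the two pages that share the `div/h1/p` layout fall into one cluster and the table page forms its own. Comparing each new page against one stored template per cluster, rather than against every page seen so far, is the kind of saving the abstract attributes to template identification, although the paper's actual comparison step may differ.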
