TEXT: Automatic Template Extraction from Heterogeneous Web Pages

Kim Chulyun; Shim Kyuseok

Abstract

The World Wide Web is the most useful source of information. To achieve high publishing productivity, the web pages of many websites are automatically populated by filling common templates with content. Templates give readers easy access to the content through a consistent structure. For machines, however, templates are considered harmful because their irrelevant terms degrade the accuracy and performance of web applications. Template detection techniques have therefore received a lot of attention recently as a way to improve the performance of search engines and of clustering and classification of web documents. In this paper, we present novel algorithms for extracting templates from a large number of web documents that are generated from heterogeneous templates. We cluster the web documents based on the similarity of the underlying template structures, so that the template for each cluster is extracted simultaneously. We develop a novel goodness measure, with a fast approximation, for clustering and provide a comprehensive analysis of our algorithm. Our experimental results on real-life data sets confirm the effectiveness and robustness of our algorithm compared to state-of-the-art template detection algorithms.
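The abstract does not spell out the algorithm, so the following is only a minimal illustrative sketch of the general idea behind the MinHash keyword: estimating how similar two pages are structurally without comparing them exhaustively. It assumes that a page's structure can be represented as the set of its root-to-node tag paths and that structural similarity is the Jaccard similarity of those sets. Both are assumptions made for this example, not the paper's stated method. All names (`PathCollector`, `tag_paths`, `minhash_signature`, `estimate_jaccard`) are hypothetical.

```python
import hashlib
from html.parser import HTMLParser

# Tags that never have a closing tag (a partial list; an assumption of this sketch).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths, e.g. 'html/body/div'."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the set of tag paths of an HTML string (text content is ignored)."""
    collector = PathCollector()
    collector.feed(html)
    return collector.paths


def _hash(seed, item):
    """Deterministic 64-bit hash of a string, parameterised by an integer seed."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over the set."""
    if not items:
        raise ValueError("cannot build a MinHash signature of an empty set")
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Two pages built from the same (toy) template, and one from a different template.
page_a = "<html><body><div><h1>News</h1><p>Story one</p></div></body></html>"
page_b = "<html><body><div><h1>News</h1><p>Another story</p></div></body></html>"
page_c = "<html><body><table><tr><td>Price</td></tr></table></body></html>"

sig_a, sig_b, sig_c = (minhash_signature(tag_paths(p)) for p in (page_a, page_b, page_c))
same_template = estimate_jaccard(sig_a, sig_b)
different_template = estimate_jaccard(sig_a, sig_c)
```

In this toy input, `page_a` and `page_b` differ only in text, so their path sets are identical and the estimate is exactly 1.0. `page_a` and `page_c` share 2 of 8 distinct paths (true Jaccard similarity 0.25), and the estimate approximates that value with an error that shrinks as `num_hashes` grows.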


Bibliographic details

Source: Knowledge and Data Engineering, IEEE Transactions on | 2011, No. 4 | pp. 612-626 | 15 pages
Authors: Kim Chulyun; Shim Kyuseok
Affiliation: Seoul National University, Seoul
Original format: PDF
Language: English
Keywords: MinHash; template extraction; clustering; minimum description length principle

