Journal: IEEE Transactions on Knowledge and Data Engineering

WISDOM: Web intrapage informative structure mining based on document object model


Abstract

To increase the commercial value and accessibility of pages, most content sites tend to publish their pages with intrasite redundant information, such as navigation panels, advertisements, and copyright announcements. Such redundant information increases the index size of general search engines and causes page topics to drift. In this paper, we study the problem of mining the intrapage informative structure in news Web sites in order to find and eliminate redundant information. Note that the intrapage informative structure is a subset of the original Web page and is composed of a set of fine-grained and informative blocks. The intrapage informative structures of pages in a news Web site contain only anchors linking to news pages or bodies of news articles. We propose an intrapage informative structure mining system called WISDOM (Web intrapage informative structure mining based on the document object model), which applies Information Theory to DOM tree knowledge in order to build the structure. WISDOM splits a DOM tree into many small subtrees and applies a top-down informative block searching algorithm to select a set of candidate informative blocks. The structure is then built by expanding this set using the proposed merging methods. Experiments on several real news Web sites show high precision and recall rates, which validate WISDOM's practical applicability.
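The abstract does not give the scoring formula or the search procedure, so the following is only a rough sketch of what a top-down informative block search over a DOM-like tree might look like. It is not WISDOM's actual algorithm. Everything here is an assumption made for illustration: the simplified `Node` class, the use of normalized term entropy across the pages of one site as the information measure, the averaging in `block_score`, the `threshold` value, and the toy pages. The merging step mentioned in the abstract is not covered.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical simplified DOM node: a tag, its own text, child nodes."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)

    def terms(self):
        """Return all lowercase whitespace-separated terms in this subtree."""
        found = self.text.lower().split()
        for child in self.children:
            found.extend(child.terms())
        return found


def term_entropy(pages):
    """Normalized entropy of each term's frequency distribution over pages.

    Assumption: a value near 1 means the term is spread evenly over all
    pages (likely redundant, e.g. navigation); a value near 0 means it is
    concentrated in few pages (likely informative).
    """
    counts = [Counter(page.terms()) for page in pages]
    vocabulary = set().union(*counts)
    base = math.log(len(pages)) if len(pages) > 1 else 1.0
    entropy = {}
    for term in vocabulary:
        freqs = [c[term] for c in counts]
        total = sum(freqs)
        h = -sum((f / total) * math.log(f / total) for f in freqs if f)
        entropy[term] = h / base
    return entropy


def block_score(node, entropy):
    """Mean term entropy of a subtree; empty subtrees count as redundant."""
    terms = node.terms()
    if not terms:
        return 1.0
    return sum(entropy.get(t, 1.0) for t in terms) / len(terms)


def find_informative_blocks(node, entropy, threshold=0.5):
    """Top-down search: keep a subtree if its score is low enough,
    otherwise descend into its children."""
    if block_score(node, entropy) <= threshold:
        return [node]
    blocks = []
    for child in node.children:
        blocks.extend(find_informative_blocks(child, entropy, threshold))
    return blocks


# Toy site: every page shares the same navigation text, articles differ.
articles = ["election results announced", "storm hits coast", "team wins final"]
pages = [
    Node("body", children=[
        Node("div", "home sports world weather"),
        Node("div", article),
    ])
    for article in articles
]
entropy = term_entropy(pages)
blocks = find_informative_blocks(pages[0], entropy)
print([block.text for block in blocks])
```

On this toy input the navigation block is rejected because its terms occur evenly on every page, while the article block is kept because its terms occur on one page only. A real system would also need HTML parsing and the merging step that expands the candidate set, neither of which is attempted here.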
