Journal: SIGKDD Explorations

Harnessing the Wisdom of the Crowds for Accurate Web Page Clipping

Abstract

Clipping Web pages, namely extracting the informative clips (areas) from Web pages, has many applications, such as Web printing and e-reading on small handheld devices. Although many existing methods attempt to address this task, most of them either work only on certain types of Web pages (e.g., news- and blog-like pages) or operate semi-automatically, requiring extra user effort to adjust the outputs. The problem of clipping any type of Web page accurately and fully automatically remains largely open. To this end, we harness the wisdom of the crowds to provide accurate recommendations of informative clips on any given Web page. Specifically, we leverage knowledge of how previous users clipped similar Web pages. This knowledge repository can be represented as a transaction database in which each transaction contains the clips selected by one user on a certain Web page. We then formulate a new pattern mining problem on the transaction database, mining the top-1 qualified pattern, to produce this recommendation. Here, the recommendation considers not only pattern support but also pattern occupancy (proposed in this work). High support requires that a pattern appear frequently in the database, while high occupancy requires that a pattern occupy a large portion of the transactions it appears in. Together, the two criteria lead to recommendations that are both precise and complete. Additionally, we explore the properties of occupancy to further prune the search space for efficient pattern mining. Finally, we show the effectiveness of the proposed algorithm on a human-labeled ground truth dataset consisting of 2000 Web pages from 100 major Web sites, and demonstrate its efficiency on large synthetic datasets.
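
The interplay between support and occupancy described in the abstract can be illustrated with a small Python sketch. The abstract does not give the formal definitions, so everything here is an assumption made for illustration: occupancy is taken as the average fraction of a supporting transaction that the pattern covers, the quality score is taken as support plus a weighted occupancy, and the names `support`, `occupancy`, `top1_pattern`, `min_support`, and `weight` are hypothetical. The search is plain brute-force enumeration over a non-empty toy database, not the pruned mining algorithm proposed in the paper.

```python
from itertools import combinations


def support(pattern, db):
    """Fraction of transactions that contain every clip in the pattern."""
    return sum(1 for t in db if pattern <= t) / len(db)


def occupancy(pattern, db):
    """Average share of each supporting transaction covered by the pattern.

    Assumed definition; the abstract only states the idea informally.
    """
    shares = [len(pattern) / len(t) for t in db if pattern <= t]
    return sum(shares) / len(shares) if shares else 0.0


def top1_pattern(db, min_support=0.5, weight=1.0):
    """Brute-force search for the best-scoring pattern (no pruning)."""
    items = sorted(set().union(*db))
    best, best_score = None, float("-inf")
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            pattern = frozenset(combo)
            sup = support(pattern, db)
            if sup < min_support:
                continue  # too rare to be a trustworthy recommendation
            # Assumed scoring: support plus weighted occupancy.
            score = sup + weight * occupancy(pattern, db)
            if score > best_score:
                best, best_score = pattern, score
    return best, best_score


# Each transaction: the clips one (hypothetical) user selected on a page.
db = [
    frozenset({"title", "body", "image"}),
    frozenset({"title", "body"}),
    frozenset({"title", "body", "comments"}),
    frozenset({"title", "sidebar"}),
]

best, score = top1_pattern(db)
print(sorted(best), round(score, 3))
```

On this toy database, support alone would favor the single clip `title`, which appears in every transaction but covers only a small part of each. Under the assumed score, the pattern containing `title` and `body` is selected instead, because it is still frequent and covers most of what users actually clipped.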
