Conference: 2012 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery

Exploiting Attribute Redundancy in Extracting Open Source Forge Websites

Abstract

Open Source Forge (OSF) websites provide information on massive open source software projects, and extracting these web data is important for open source research. Traditional extraction methods detect the page template by string matching among pages, which is time-consuming. A recent work published in VLDB exploits entities that are redundant across websites to detect the web page coordinates of these entities. Its experiments give good results when these coordinates are used to extract other entities of the target site. However, OSF websites have few redundant project entities. This paper proposes a modified version of that redundancy-based method tailored for OSF websites. It relies on a similar yet weaker presumption: that entity attributes, rather than whole entities, are redundant. Like the previous work, we construct a seed database to detect the web page coordinates of the redundancies, but entirely at the attribute level. In addition, we apply attribute name verification to reduce false positives during extraction. The experimental results indicate that our approach is competent at extracting OSF websites, a scenario in which the previous method cannot be applied.
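
The attribute-level idea described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: it assumes each page has already been parsed into a mapping from a node path (standing in for a "web page coordinate") to a `(label, value)` pair, and the names `SEED`, `LABELS`, `detect_coordinates`, and `extract` are hypothetical. Coordinates are chosen by simple majority voting over seed matches, which is a simplification chosen here for illustration.

```python
from collections import Counter

# Hypothetical seed database: attribute name -> values already known from other sites.
SEED = {
    "license": {"GPL", "MIT", "Apache-2.0"},
    "language": {"Java", "Python", "C++"},
}

# Hypothetical label vocabulary used for attribute-name verification.
LABELS = {
    "license": {"license", "licence"},
    "language": {"language", "programming language"},
}


def detect_coordinates(pages, seed):
    """For each attribute, pick the path whose value most often matches a seed value."""
    votes = {attr: Counter() for attr in seed}
    for page in pages:
        for path, (_label, value) in page.items():
            for attr, values in seed.items():
                if value in values:
                    votes[attr][path] += 1
    # Attributes with no seed match on any page get no coordinate.
    return {attr: c.most_common(1)[0][0] for attr, c in votes.items() if c}


def extract(page, coords, labels):
    """Read each attribute at its detected path; keep it only if the label verifies."""
    record = {}
    for attr, path in coords.items():
        if path not in page:
            continue
        label, value = page[path]
        # Attribute-name verification: the on-page label must name this attribute.
        if label.strip().rstrip(":").lower() in labels[attr]:
            record[attr] = value
    return record


if __name__ == "__main__":
    # Assumed sample pages whose attribute values overlap with the seed database.
    samples = [
        {"/div[1]/span[2]": ("License:", "GPL"),
         "/div[2]/span[2]": ("Language:", "Java")},
        {"/div[1]/span[2]": ("License:", "MIT"),
         "/div[2]/span[2]": ("Language:", "Python")},
    ]
    coords = detect_coordinates(samples, SEED)
    # A page with an unseen license value and a different field at the language path.
    target = {"/div[1]/span[2]": ("License:", "BSD"),
              "/div[2]/span[2]": ("Downloads:", "1024")}
    print(extract(target, coords, LABELS))
```

In this sketch the unseen value `BSD` is still extracted because it sits at the coordinate learned from seed matches, while the value at the language coordinate is rejected because its label fails verification, which is how name checking can cut false positives.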