International Conference on Web Information Systems Engineering (WISE 2004), 22-24 November 2004, Brisbane, Australia

A Two-Phase Sampling Technique to Improve the Accuracy of Text Similarities in the Categorisation of Hidden Web Databases


Abstract

A large amount of high-quality, specialised information on the Web is stored in document databases and is not indexed by general-purpose search engines such as Google and Yahoo. Such information is generated dynamically in response to queries submitted to these databases, which are referred to as Hidden Web databases. This paper presents a Two-Phase Sampling (2PS) technique that detects Web page templates in documents randomly sampled from a database. It generates terms and frequencies that summarise the database content with improved accuracy. We then use these statistics to improve the accuracy of text similarity computation in categorisation. Experimental results show that 2PS effectively eliminates terms contained in Web page templates and generates terms and frequencies with improved accuracy. We also demonstrate that 2PS improves the accuracy of the text similarity computation required for database categorisation.
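
The idea in the abstract can be illustrated with a minimal sketch. It is not the 2PS algorithm itself: the abstract does not describe how templates are detected, so the sketch assumes, purely for illustration, that a term appearing in every sampled document belongs to the page template. The function names, the `threshold` parameter, the sample documents, and the choice of cosine similarity are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase, whitespace-split tokenisation; a real system would also strip markup.
    return text.lower().split()


def template_terms(docs, threshold=1.0):
    # ASSUMPTION: a crude stand-in for template detection. A term is treated as
    # template text if it occurs in at least `threshold` of the sampled documents.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    return {term for term, n in doc_freq.items() if n / len(docs) >= threshold}


def content_summary(docs, threshold=1.0):
    # Term frequencies over the sample, with the assumed template terms removed.
    boilerplate = template_terms(docs, threshold)
    freqs = Counter()
    for doc in docs:
        freqs.update(t for t in tokenize(doc) if t not in boilerplate)
    return freqs


def cosine_similarity(a, b):
    # Cosine similarity between two term-frequency vectors stored as Counters.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical sample: every page shares the navigation text "home search contact".
sampled = [
    "home search contact heart disease treatment",
    "home search contact diabetes treatment options",
    "home search contact cancer screening",
]
category = Counter(tokenize("disease treatment cancer diabetes"))

summary = content_summary(sampled)
raw = Counter(t for doc in sampled for t in tokenize(doc))

print("with template terms:   ", cosine_similarity(raw, category))
print("without template terms:", cosine_similarity(summary, category))
```

In this toy sample the navigation terms inflate the norm of the raw frequency vector, so removing them raises the similarity between the database summary and the category description. This is the kind of effect the abstract attributes to eliminating template terms.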
