A Two-Phase Sampling Technique to Improve the Accuracy of Text Similarities in the Categorisation of Hidden Web Databases

Abstract

A large amount of high-quality, specialised information on the Web is stored in document databases and is not indexed by general-purpose search engines such as Google and Yahoo. Such information is generated dynamically in response to queries submitted to these databases, which are referred to as Hidden Web databases. This paper presents a Two-Phase Sampling (2PS) technique that detects Web page templates in documents randomly sampled from a database. It generates terms and frequencies that summarise the database content with improved accuracy. We then use these statistics to improve the accuracy of text similarity computation in categorisation. Experimental results show that 2PS effectively eliminates terms contained in Web page templates and generates terms and frequencies with improved accuracy. We also demonstrate that 2PS improves the accuracy of the text similarity computation required for database categorisation.
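
The abstract does not say how 2PS detects templates, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a simple heuristic in which a line repeated across most sampled pages counts as template text. The function names, the `min_share` threshold, the sample pages and the category term list are all hypothetical and chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def detect_template_lines(sampled_pages, min_share=0.8):
    """Treat a line as template text if it occurs in at least min_share of pages.

    This line-repetition heuristic is an assumption made for illustration.
    """
    line_counts = Counter()
    for page in sampled_pages:
        # Count each distinct non-empty line once per page.
        line_counts.update({ln.strip() for ln in page.splitlines() if ln.strip()})
    threshold = min_share * len(sampled_pages)
    return {line for line, count in line_counts.items() if count >= threshold}


def summarise(sampled_pages, template_lines):
    """Return term frequencies over the sampled pages, skipping template lines."""
    frequencies = Counter()
    for page in sampled_pages:
        for line in page.splitlines():
            stripped = line.strip()
            if stripped and stripped not in template_lines:
                frequencies.update(tokenize(stripped))
    return frequencies


def cosine_similarity(freq_a, freq_b):
    """Cosine similarity between two term-frequency mappings."""
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sampled result pages sharing a navigation bar and a footer.
pages = [
    "Home | Search | Help\nAspirin reduces fever and pain\nCopyright Example Library",
    "Home | Search | Help\nInsulin regulates blood glucose\nCopyright Example Library",
    "Home | Search | Help\nVaccines prevent viral infection\nCopyright Example Library",
]
# Hypothetical term list describing a "health" category.
category = Counter(tokenize("medicine health aspirin insulin vaccines infection pain"))

template = detect_template_lines(pages)
raw_summary = summarise(pages, set())        # template terms left in
clean_summary = summarise(pages, template)   # template terms removed

print(sorted(template))
print(cosine_similarity(raw_summary, category))
print(cosine_similarity(clean_summary, category))
```

In this toy data the template terms ("home", "search", "copyright" and so on) dominate the raw frequencies, so dropping them raises the similarity between the database summary and the category description. That is the effect the abstract attributes to 2PS.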

Bibliographic details

Source
International Conference on Web Information Systems Engineering (WISE 2004); 22-24 November 2004; Brisbane (AU) | 2004 | pp. 516-527 | 12 pages
Conference location: Brisbane (AU)
Authors
Yih-Ling Hedley; Muhammad Younas; Anne James; Mark Sanderson;
Author affiliation

School of Mathematical and Information Sciences, Coventry University, Priory Street, Coventry CV1 5FB, UK;

Original format: PDF
Language: English
Chinese Library Classification: Computer networks

Similar literature

1. CCReSD: concept-based categorisation of Hidden Web databases [J]. Yih-Ling Hedley, Muhammad Younas, Anne James. International Journal of High Performance Computing and Networking, 2007, No. 1a2

2. Exploring the feasibility and accuracy of Latent Semantic Analysis based text mining techniques to detect similarity between patent documents and scientific publications [J]. Tom Magerman, Bart Van Looy, Xiaoyan Song. Scientometrics, 2010, No. 2

3. Avoidance of Ranking Capabilities in Retrieval of Queries on Hidden-Web Text Databases [J]. S K.Rubeena, T. Srinivasa Rao. International Journal of Engineering Research and Applications, 2013, No. 5

4. A Two-Phase Sampling Technique to Improve the Accuracy of Text Similarities in the Categorisation of Hidden Web Databases [C]. Yih-Ling Hedley, Muhammad Younas, Anne James. International Conference on Web Information Systems Engineering (WISE 2004), 2004

5. Classifying and searching hidden-web text databases [D]. Ipeirotis, Panagiotis G. 2004

6. Improving low-accuracy protein structures using enhanced sampling techniques [O]. Tianwu Zang, Tianqi Ma, Qinghua Wang

7. A Two-Phase Sampling Technique for Information Extraction from Hidden Web Databases [O]. Y. L. Hedley, M. Younas, A. James. 2004

8. Improved Fluid Dynamics Similarity, Analysis and Verification. Part 3 - Two-Phase Flow in Vibrating Discharge Lines Final Report, 29 Jun. 1965 - 28 Jun. 1968 [R]. Griebe, R. W., Schoenhals, R. J., Winter, E. R. 1968

