Near-Duplicates Detection for Vietnamese Documents In Large Database

Abstract

Near-duplicate documents exacerbate the problem of information overload, and detecting them has attracted considerable attention from both industry and academia. In this paper, we address this problem for Vietnamese documents, which, to the best of our knowledge, has not been done before. Most current algorithms were designed for English and are not directly applicable to Vietnamese, a monosyllabic language. We propose combining Charikar's algorithm [2] with a weighting scheme and Vietnamese-specific features to handle these language intricacies. Experimental results indicate that our scheme is effective at detecting near-duplicates in a corpus of Vietnamese documents.
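
The abstract does not spell out the weighting scheme or the Vietnamese-specific features, so the following is only a minimal sketch of the general idea: a Charikar-style fingerprint in which each feature votes on every bit with a weight. The feature choice (overlapping pairs of whitespace-separated syllables), the term-frequency weight, the optional `weight` callback, and all function names are assumptions made for illustration, not the authors' method.

```python
import hashlib
from collections import Counter

BITS = 64  # fingerprint width; an assumed value


def features(text, n=2):
    """Overlapping n-grams of whitespace-separated syllables (assumed feature choice)."""
    syllables = text.lower().split()
    if len(syllables) < n:
        return [" ".join(syllables)] if syllables else []
    return [" ".join(syllables[i:i + n]) for i in range(len(syllables) - n + 1)]


def hash64(feature):
    """Stable 64-bit hash of a feature string."""
    digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash(text, weight=None):
    """Charikar-style fingerprint: each feature adds +w or -w to every bit position."""
    counts = Counter(features(text))
    totals = [0.0] * BITS
    for feat, tf in counts.items():
        # Hypothetical weighting: term frequency times an optional per-feature weight.
        w = tf * (weight(feat) if weight else 1.0)
        h = hash64(feat)
        for bit in range(BITS):
            totals[bit] += w if (h >> bit) & 1 else -w
    fingerprint = 0
    for bit in range(BITS):
        if totals[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


# Usage: a small Hamming distance suggests near-duplicate documents.
doc_a = "Hà Nội là thủ đô của nước Việt Nam"
doc_b = "Hà Nội là thủ đô của Việt Nam"
print(hamming(simhash(doc_a), simhash(doc_b)))
```

Splitting on whitespace yields syllables rather than words in Vietnamese, which is why the sketch groups adjacent syllables into n-grams. A real system would likely replace this with proper word segmentation and a tuned distance threshold.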

Bibliographic details

Source
International Conference on Advanced Language Processing and Web Information Technology | 2008 | 6 pages
Authors
Cong Thanh Truong; The Duy Bui; Bao Son Pham
Original format: PDF
Chinese Library Classification: TP312-53
Keywords
Charikar; LSH; Near-duplicate Vietnamese detection; Weighting scheme; Hash scheme

Similar literature

1. Fingerprint-Based Near-Duplicate Document Detection with Applications to SNS Spam Detection [J]. Phuc-Tran Ho, Sung-Ryul Kim. International Journal of Distributed Sensor Networks. 2014, No. 3
2. Evaluating the Efficiency of CPUs, GPUs and FPGAs on a Near-Duplicate Document Detection Via OpenCL [J]. Ercan Canhasi. Journal of computer sciences. 2018, No. 5
3. Evaluating the Efficiency of CPUs, GPUs and FPGAs on a Near-Duplicate Document Detection Via OpenCL [J]. Canhasi Ercan. Journal of computer sciences. 2018, No. 5
4. Near-Duplicates Detection for Vietnamese Documents In Large Database [C]. Cong Thanh Truong, The Duy Bui, Bao Son Pham. International Conference on Advanced Language Processing and Web Information Technology. 2008
5. Layered hypervideo document database system: Design and modeling of hypervideo document database [D]. Yoon, Kyoungro. 1999
6. Large expert-curated database for benchmarking document similarity detection in biomedical literature search [O]. Peter Brown, RELISH Consortium, Yaoqi Zhou
7. XNDDF: Towards a Framework for Flexible Near-Duplicate Document Detection Using Supervised and Unsupervised Learning [O]. Pamulaparty Lavanya, Guru Rao C.V., Rao M. Sreenivasa. 2015
8. Building Vietnamese Herbal Database Towards Big Data Science in Nature-Based Medicine [R]. Le, L. T. 2018
