On Automatically Tagging Web Documents from Examples

Abstract

An emerging need in information retrieval is to identify a set of documents conforming to an abstract description. This task presents two major challenges to existing methods of document retrieval and classification. First, similarity based on overall content is less effective, because documents produced for similar functions (e.g. a presidential speech or a government ministry white paper) may vary greatly in both content and subject. Second, the function of a document may be defined by user interests or by the specific data set, through a set of existing examples, and so cannot be described with standard categories. Additionally, the increasing volume and complexity of document collections demand new scalable computational solutions. We conducted a case study using web-archived data from the Latin American Government Documents Archive (LAGDA) to illustrate these problems and challenges. We propose a new hybrid approach based on Naive Bayes inference that uses mixed n-gram models obtained from a training set to classify documents in the corpus. The approach has been developed to exploit parallel processing for large-scale data sets. Preliminary results are promising, with improved accuracy for this type of retrieval problem.
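To make the idea of Naive Bayes inference over mixed n-gram models concrete, the following is a minimal sketch, not the authors' implementation. It assumes word-level n-grams of orders 1 through `max_n`, multinomial likelihoods, and add-one (Laplace) smoothing, none of which the abstract specifies. The names `mixed_ngrams` and `NgramNaiveBayes`, the labels, and the toy training sentences are all hypothetical and invented for illustration.

```python
import math
from collections import Counter, defaultdict


def mixed_ngrams(text, max_n=2):
    """Return word n-grams of every order from 1 to max_n (assumed feature set)."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


class NgramNaiveBayes:
    """Multinomial Naive Bayes over mixed n-gram counts (illustrative only)."""

    def __init__(self, max_n=2, alpha=1.0):
        self.max_n = max_n
        self.alpha = alpha                         # smoothing constant (assumption)
        self.feature_counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter()                 # label -> number of examples
        self.vocab = set()

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            grams = mixed_ngrams(text, self.max_n)
            self.feature_counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def log_scores(self, text):
        """Unnormalized log posterior of each label given the document."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n_docs / total_docs)  # log prior
            for gram in mixed_ngrams(text, self.max_n):
                if gram in self.vocab:  # ignore n-grams never seen in training
                    score += math.log((counts[gram] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Invented toy examples standing in for a user-supplied training set.
train_docs = [
    "my fellow citizens tonight i address the nation",
    "fellow citizens i speak to the nation today",
    "the ministry reports budget figures for the fiscal year",
    "this white paper presents ministry budget policy",
]
train_labels = ["speech", "speech", "report", "report"]

model = NgramNaiveBayes(max_n=2).fit(train_docs, train_labels)
speech_guess = model.predict("i address my fellow citizens")
report_guess = model.predict("ministry budget white paper")
```

Because the model state is a set of additive counts, counts computed on separate shards of a corpus can be summed afterwards, which is one plausible way such a classifier could exploit parallel processing. The abstract does not say how the authors parallelize their approach.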
