Published in: ACM Transactions on Asian Language Information Processing

Revisiting Document Length Hypotheses: A Comparative Study of Japanese Newspaper and Patent Retrieval


Abstract

This article describes NTCIR-4 experiments on the CLIR J-J (Japanese monolingual newspaper retrieval) and patent tasks, focusing on a comparison of two test collections and two retrieval approaches in light of document length hypotheses. TF*IDF outperformed the language modeling approach in the CLIR J-J task, whereas the language modeling approach performed better in the patent task. Analysis of the document length distributions of relevant and retrieved documents in the NTCIR-3 and NTCIR-4 collections suggests that a different document length hypothesis underlies each task/collection. Given these hypotheses, TF*IDF is easily adapted to patent retrieval tasks, and document length prior probabilities are applied to the language modeling approach. For the patent task, task-specific techniques, such as IPC priors and different indexing strategies, are evaluated and reported. To facilitate retrieval from large patent collections, a simple distributed search strategy is applied and found to be efficient, despite a slight deterioration in effectiveness. When document length normalization is controlled, TF*IDF performs similarly to the language modeling runs against the patent collection, whereas the language modeling approach does not perform as well as TF*IDF against the CLIR J-J collection, despite calibration. Comparative studies illustrate the different document length characteristics of the two test collections.
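The abstract names two ways of handling document length: controlling length normalization in TF*IDF, and adding a document length prior to a language model. The abstract gives no formulas, so the following is only a minimal sketch under stated assumptions. The BM25-style weighting with its `b` parameter, the Dirichlet smoothing with `mu`, the prior proportional to document length, and the toy corpus are all illustrative choices, not the article's actual formulation.

```python
import math
from collections import Counter

# Hypothetical toy corpus: two tokenized documents of very different lengths.
docs = {
    "short": "patent claim retrieval".split(),
    "long": ("patent retrieval " * 3 + "filler " * 20).split(),
}


def tfidf_score(query, doc, docs, b=0.75, k1=1.2):
    """BM25-style TF*IDF (an assumed stand-in for the article's weighting).

    b in [0, 1] controls document length normalization:
    b=0 ignores length, b=1 normalizes fully by length / average length.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs.values()) / n_docs
    tf = Counter(doc)
    norm = 1 - b + b * len(doc) / avg_len
    score = 0.0
    for term in query:
        df = sum(1 for d in docs.values() if term in d)
        if df == 0 or tf[term] == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * norm)
    return score


def lm_score(query, doc, docs, mu=100.0, use_length_prior=False):
    """Query log-likelihood with Dirichlet smoothing (assumed smoothing method).

    With use_length_prior=True, adds log P(d) where P(d) is taken to be
    proportional to document length (one possible length prior).
    """
    collection = Counter(t for d in docs.values() for t in d)
    total = sum(collection.values())
    tf = Counter(doc)
    score = 0.0
    for term in query:
        p_coll = collection[term] / total
        if p_coll == 0:
            continue  # skip terms unseen in the whole collection
        score += math.log((tf[term] + mu * p_coll) / (len(doc) + mu))
    if use_length_prior:
        score += math.log(len(doc) / total)
    return score


query = ["patent", "retrieval"]
for b in (0.0, 1.0):
    ranked = sorted(docs, key=lambda name: tfidf_score(query, docs[name], docs, b=b),
                    reverse=True)
    print(f"TF*IDF, b={b}: {ranked}")
for prior in (False, True):
    scores = {name: lm_score(query, d, docs, use_length_prior=prior)
              for name, d in docs.items()}
    print(f"LM, length prior={prior}: {scores}")
```

In this sketch, lowering `b` favors longer documents and raising it favors shorter ones, while the length prior shifts language model scores toward longer documents. Which direction suits a given collection is exactly the kind of question the document length hypotheses address.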
