Automatic Term Extraction from Newspaper Corpora: Making the Most of Specificity and Common Features

Abstract

The first step of any terminological work is to set up a reliable, specialized corpus composed of documents written by specialists, and then to apply automatic term extraction (ATE) methods to this corpus in order to retrieve a first list of potential terms. The experiment described in this paper departs from that usual process: the corpus used for this study was built from newspaper articles retrieved from the Web using a short list of keywords. The general intuition behind this research is that ATE techniques based on corpus comparison can be used to capture both similarities and dissimilarities between corpora. Dissimilarities are exploited through a termhood measure and similarities through word embeddings. Our initial results, which were validated manually, show that combining a traditional ATE method that focuses on dissimilarities between corpora with newer methods that exploit similarities (more specifically, distributional features of candidates) leads to promising results.
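The combination described in the abstract can be illustrated with a minimal sketch. The abstract does not name the termhood measure, the embedding model, or the way the two signals are combined, so everything below is an assumption made for illustration: the termhood score is a smoothed log-ratio of relative frequencies between a target and a reference corpus, the embeddings are hand-written toy vectors, and the function names, the `seed_terms` list, and the `alpha` weight are all hypothetical.

```python
import math
from collections import Counter


def specificity_scores(target_tokens, reference_tokens):
    """Assumed termhood proxy: log-ratio of add-one-smoothed relative
    frequencies in the target corpus versus the reference corpus."""
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    vocab = set(target) | set(reference)
    t_total = sum(target.values()) + len(vocab)
    r_total = sum(reference.values()) + len(vocab)
    return {
        w: math.log(((target[w] + 1) / t_total) / ((reference[w] + 1) / r_total))
        for w in target
    }


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(scores, embeddings, seed_terms, alpha=0.5):
    """Hypothetical blend: alpha * specificity + (1 - alpha) * mean cosine
    similarity to the seed terms (a candidate is not compared to itself)."""
    blended = {}
    for word, score in scores.items():
        if word not in embeddings:
            continue
        seeds = [embeddings[t] for t in seed_terms if t in embeddings and t != word]
        sim = sum(cosine(embeddings[word], s) for s in seeds) / len(seeds) if seeds else 0.0
        blended[word] = alpha * score + (1 - alpha) * sim
    return sorted(blended, key=blended.get, reverse=True)


# Toy data, invented for this sketch only.
target = "carbon emissions rise as carbon tax debate grows".split()
reference = "the debate over the match grows as the fans rise".split()
embeddings = {
    "carbon": [1.0, 0.0],
    "emissions": [0.9, 0.1],
    "tax": [0.8, 0.2],
    "debate": [0.0, 1.0],
}

scores = specificity_scores(target, reference)
ranked = rank_candidates(scores, embeddings, seed_terms=["emissions"])
print(ranked)
```

The linear blend is deliberately crude, since a log-ratio and a cosine similarity are not on the same scale. In practice the two scores would likely need normalizing, or the embeddings would be used to filter or re-rank the candidates proposed by the termhood measure.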