Journal: Machine Translation

Improving retrieval performance of translation memories using morphosyntactic analyses and generalized suffix arrays

Abstract

Since the 1990s, translation memory (TM) systems have been among the most widely used tools in computer-aided translation. However, most commercially available systems still consider two segments similar only if their character sequences are identical or differ only marginally in spelling. Linguistic similarities are disregarded, so that semantically identical or similar segments with different (morpho)syntactic structures are retrieved with a lower similarity value than expected, or not at all. The iMem (Intelligent Translation Memories) research project aimed at improving the retrieval of precisely these segments by analyzing their (morpho)syntactic structure and identifying the longest common substrings (LCS) between two sentences by means of generalized suffix arrays. The results of the morphosyntactic analysis are stored in the so-called iMem-TM, an independent relational database that is connected to a commercial, non-linguistically enhanced TM via its API. Base words were used for building the suffixes, to increase the probability of finding a larger number of LCS between the two sentences. Furthermore, an existing algorithm for generalized suffix arrays was enhanced by an additional array that records which sentence each suffix derives from. In this way, identical repeating LCS within the same sentence are ignored, whereas identical repeating LCS between two different sentences are still considered. If more than one identical repeating LCS exists between the sentences, the best-matching LCS is determined by calculating the positional difference for each of them and choosing the one with the minimal positional difference.
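The retrieval idea in the abstract can be illustrated with a small sketch. This is not the iMem implementation: the function names (`build_gsa`, `lcp`, `best_common_run`) are hypothetical, the suffix array is built by naive sorting rather than by the algorithm the authors extended, and the input sentences are assumed to be already tokenized and reduced to base words. The sketch keeps an `owner` array alongside the suffix array so that matches inside one sentence are ignored, and it breaks ties between repeated matches by minimal positional difference.

```python
from typing import List, Optional, Tuple


def build_gsa(a: List[str], b: List[str]) -> Tuple[List[int], List[int]]:
    """Naive generalized suffix array over two token lists.

    Returns two parallel lists in sorted suffix order: the start
    position of each suffix and its owner (0 = sentence a, 1 = b).
    """
    sents = (a, b)
    entries = [(sents[o][i:], o, i) for o in (0, 1) for i in range(len(sents[o]))]
    entries.sort()  # lexicographic comparison of token lists
    starts = [i for _, _, i in entries]
    owner = [o for _, o, _ in entries]
    return starts, owner


def lcp(x: List[str], y: List[str]) -> int:
    """Length of the longest common prefix of two token lists."""
    n = 0
    while n < len(x) and n < len(y) and x[n] == y[n]:
        n += 1
    return n


def best_common_run(a: List[str], b: List[str]) -> Optional[Tuple[List[str], int, int]]:
    """Longest common token run between a and b, or None if there is none.

    Returns (tokens, position in a, position in b). When the run occurs
    more than once, the occurrence pair with the smallest positional
    difference is chosen.
    """
    sents = (a, b)
    starts, owner = build_gsa(a, b)
    suf = [sents[o][s:] for s, o in zip(starts, owner)]
    n = len(suf)
    # LCP of each suffix with its predecessor in sorted order.
    lcps = [0] + [lcp(suf[k - 1], suf[k]) for k in range(1, n)]
    # Only neighbours owned by different sentences count toward the maximum.
    length = max(
        (lcps[k] for k in range(1, n) if owner[k] != owner[k - 1]), default=0
    )
    if length == 0:
        return None
    best = None
    k = 0
    while k < n:
        # Suffixes sharing a prefix of `length` tokens are contiguous.
        end = k + 1
        while end < n and lcps[end] >= length:
            end += 1
        pos_a = [starts[r] for r in range(k, end) if owner[r] == 0]
        pos_b = [starts[r] for r in range(k, end) if owner[r] == 1]
        for i in pos_a:
            for j in pos_b:
                cand = (abs(i - j), i, j)
                if best is None or cand < best:
                    best = cand
        k = end
    _, i, j = best
    return a[i:i + length], i, j


a = "press the button then press the button again".split()
b = "now press the button".split()
print(best_common_run(a, b))
```

In this example the run "press the button" occurs twice in the first sentence (positions 0 and 4) and once in the second (position 1); the occurrence at position 0 is selected because its positional difference is smaller. Sorting whole suffixes costs far more than a proper suffix array construction, so the sketch is only suitable for illustrating the role of the owner array and the tie-breaking rule.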