
A METHOD FOR DIACRITISATION OF TEXTS WRITTEN IN LATIN- OR CYRILLIC-DERIVED ALPHABETS


Abstract

The invention relates to a method for recovering diacritical marks in texts written in any language that uses a Latin- or Cyrillic-derived alphabet with diacritical marks. The embodiment presented in this document treats diacritisation as a classification task and draws on multiple information sources: topical information provided by text categorisation, information on the semantic proximity of particular words in the text, and morphological information. At word level, the classification task is limited to calculating a semantic score for each possible interpretation of a word. The actual recovery of diacritical marks is carried out only at sentence level (or possibly at a higher level, such as the paragraph or the entire text), on the assumption that users adapting a text to a non-diacritised setting consistently follow one of the existing conventions rather than switching between them.
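The two-level scheme described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the patent: the two conventions (`strip` and `digraph`), the toy lexicon, and the `TOY_SCORES` lookup, which merely stands in for a word-level semantic score that would in practice be derived from topical, semantic-proximity and morphological information. The sketch scores each candidate interpretation per word, then commits to a single convention for the whole sentence.

```python
from typing import Callable, Dict, List

# Illustrative conventions for writing without diacritics (assumed, not taken
# from the patent): each maps a diacritised character to its plain replacement.
CONVENTIONS: Dict[str, Dict[str, str]] = {
    "strip": {"č": "c", "ć": "c", "š": "s", "ž": "z", "đ": "d"},
    "digraph": {"č": "c", "ć": "c", "š": "s", "ž": "z", "đ": "dj"},
}


def degrade(word: str, mapping: Dict[str, str]) -> str:
    """Rewrite a diacritised word the way one convention would."""
    return "".join(mapping.get(ch, ch) for ch in word)


def build_index(lexicon: List[str], mapping: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each degraded form to the lexicon words that could have produced it."""
    index: Dict[str, List[str]] = {}
    for word in lexicon:
        index.setdefault(degrade(word, mapping), []).append(word)
    return index


def restore_sentence(
    tokens: List[str],
    lexicon: List[str],
    score: Callable[[str], float],
) -> List[str]:
    """Pick one convention for the whole sentence, then restore each word."""
    best_total = float("-inf")
    best_output = list(tokens)
    for mapping in CONVENTIONS.values():
        index = build_index(lexicon, mapping)
        total = 0.0
        output: List[str] = []
        for token in tokens:
            candidates = index.get(token, [])
            if not candidates:
                # Unknown under this convention: keep the token, add no score.
                output.append(token)
                continue
            # Word level: only score the interpretations, keep the best one.
            best = max(candidates, key=score)
            total += score(best)
            output.append(best)
        # Sentence level: the convention with the highest total score wins.
        if total > best_total:
            best_total, best_output = total, output
    return best_output


# Toy data (hypothetical): a tiny lexicon and fixed word-level scores.
LEXICON = ["đak", "ide", "u", "kuću", "kucu"]
TOY_SCORES = {"đak": 0.9, "ide": 1.0, "u": 1.0, "kuću": 0.8, "kucu": 0.2}

if __name__ == "__main__":
    sentence = ["djak", "ide", "u", "kucu"]
    print(restore_sentence(sentence, LEXICON, lambda w: TOY_SCORES.get(w, 0.0)))
```

In this toy run, the token `djak` can only be explained under the `digraph` convention, so that convention obtains the higher sentence total and is applied to every word. Deferring the decision to sentence level is what allows one unambiguous word to settle the convention for its ambiguous neighbours.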
