Using canonical forms to develop a dictionary of names in a text

Abstract

Descriptive canonical forms of entity types are created by scanning one or more documents in a database of a computer system to identify proper names that appear in the documents as raw names. Each raw name has zero or more proper names, zero or more medial substrings, zero or more leading substrings, and zero or more trailing substrings. The raw names are "cleaned" and "split" until certain "cleaning and splitting conditions" are no longer met, which yields a list of clean and split candidate names. Anchor names, which are names that unambiguously represent an entity type, are selected from this list. Each anchor name has one or more entity-type attribute values. Variant names, which are clean and split candidate names that share one or more attribute values with the anchor name, are combined with the anchor name to create an equivalence group of names that refer to the same entity. A canonical form is generated for the group from a subset of the anchor name's attributes. A canonical form is created in this manner for every clean and split candidate name on the list.
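
The pipeline described in the abstract (clean and split, select anchors, attach variants, assign a canonical form) can be illustrated with a minimal Python sketch. It is not the patented method. The substring lists `LEADING`, `TRAILING`, and `MEDIAL`, the use of lowercase tokens as attribute values, the rule that a name with two or more tokens counts as an anchor, and the choice of the anchor name itself as the canonical form are all assumptions made for illustration; the abstract does not specify any of them.

```python
# Hypothetical substring lists; the abstract does not enumerate them.
LEADING = ("Mr. ", "Mrs. ", "Dr. ")
TRAILING = (" Jr.", " Sr.")
MEDIAL = (" and ", " & ")


def clean_and_split(raw):
    """Clean and split a raw name until no condition applies any more."""
    pending, done = [raw], []
    while pending:
        name = pending.pop().strip(" ,")
        medial = next((m for m in MEDIAL if m in name), None)
        lead = next((s for s in LEADING if name.startswith(s)), None)
        trail = next((s for s in TRAILING if name.endswith(s)), None)
        if medial:    # splitting condition: a medial substring joins two names
            pending.extend(name.split(medial, 1))
        elif lead:    # cleaning condition: drop a leading substring
            pending.append(name[len(lead):])
        elif trail:   # cleaning condition: drop a trailing substring
            pending.append(name[:-len(trail)])
        elif name:    # nothing left to clean or split
            done.append(name)
    return done


def attributes(name):
    """Assumed attribute values: the lowercase tokens of the name."""
    return set(name.lower().split())


def build_dictionary(raw_names):
    """Map every clean and split candidate name to a canonical form."""
    candidates = []
    for raw in raw_names:
        for name in clean_and_split(raw):
            if name not in candidates:
                candidates.append(name)

    # Assumption: a candidate with two or more tokens is treated as an anchor.
    anchors = [c for c in candidates if len(attributes(c)) >= 2]

    dictionary = {}
    for name in candidates:
        if name in anchors:
            dictionary[name] = name  # canonical form taken from the anchor itself
        else:
            # A variant joins the first anchor that shares an attribute value;
            # a name with no matching anchor stays its own canonical form.
            dictionary[name] = next(
                (a for a in anchors if attributes(a) & attributes(name)), name
            )
    return dictionary


names = build_dictionary(["Dr. Alice Smith and Mr. Bob Jones", "Smith", "Mr. Jones"])
for variant, canonical in names.items():
    print(f"{variant} -> {canonical}")
```

In this sketch a bare surname such as "Smith" is attached to the first anchor that contains it, so two anchors sharing a surname would be resolved arbitrarily. A fuller treatment would need additional attribute values to disambiguate such cases.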
