System and method of automatic discovery of terms in a document that are relevant to a given target topic

Abstract

A computer program product is provided as an automatic mining system to discover terms that are relevant to a given target topic from a large database of unstructured information such as the World Wide Web (WWW). The automatic mining system operates in three stages: the first stage is carried out by a new terms discoverer, which discovers the terms in a document; the second stage is carried out by a candidate terms discoverer, which discovers potentially relevant terms; and the third stage is carried out by a relevant terms discoverer, which refines or tests the discovered relevance to filter out false relevance. The new terms discoverer includes a system for the automatic mining of patterns and relations, a system for the automatic mining of new relationships, and a system for selecting new terms from relations. In one embodiment, the system for the automatic mining of patterns and relations identifies a set of related terms on the WWW with a high degree of confidence, using a duality concept, and includes a terms database and two identifiers: a relation identifier and a pattern identifier. The system for the automatic mining of new relationships includes a database, a knowledge module, and a statistics module. The knowledge module includes a stemming unit, a synonym check unit, and a domain knowledge check unit. The candidate terms discoverer includes a metadata extractor, a document vector module, an association module, a filtering module, and a database. The relevant terms discoverer includes a stop word filter and a system for the automatic construction of a generalization-specialization hierarchy of terms, which comprises a terms database, an augmentation module, a generalization detection module, and a hierarchy database.
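The three-stage flow described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the algorithms used inside each stage, so everything below is an assumption made for illustration: the function names are hypothetical, stage 1 is reduced to plain tokenization, stage 2 to document-level co-occurrence counting with the target topic, and stage 3 to a stop word filter plus a frequency threshold. It is not the patented method, only a toy pipeline with the same overall shape.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on"}


def discover_terms(document: str) -> list[str]:
    """Stage 1 stand-in: split a document into lowercase alphabetic terms."""
    return re.findall(r"[a-z]+", document.lower())


def discover_candidates(documents: list[str], target: str) -> Counter:
    """Stage 2 stand-in: count, per term, the documents in which it co-occurs with the target."""
    counts: Counter = Counter()
    for doc in documents:
        terms = set(discover_terms(doc))
        if target in terms:
            counts.update(terms - {target})
    return counts


def filter_relevant(candidates: Counter, min_docs: int = 2) -> list[str]:
    """Stage 3 stand-in: drop stop words and terms seen with the target in too few documents."""
    return sorted(
        term
        for term, n in candidates.items()
        if n >= min_docs and term not in STOP_WORDS
    )


def mine_relevant_terms(documents: list[str], target: str, min_docs: int = 2) -> list[str]:
    """Run the three toy stages in sequence for one target topic."""
    return filter_relevant(discover_candidates(documents, target), min_docs)


if __name__ == "__main__":
    docs = [
        "The java compiler and the jvm",
        "Java runs on the jvm",
        "Coffee is a drink",
    ]
    print(mine_relevant_terms(docs, "java"))
```

In this sketch the stop word filter plays the role the abstract assigns to the relevant terms discoverer; the stemming, synonym check, domain knowledge check, and hierarchy construction components named in the abstract are not modeled.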
