TOPIC SPECIFIC LANGUAGE MODELS BUILT FROM LARGE NUMBERS OF DOCUMENTS

Abstract

A language model is formed or improved using data from a large collection of documents, such as web data. The collection is queried with queries formed from the language model, and the information retrieved is used to improve the language model, which in turn improves the queries. As data is received from the collection, it is compared to a rejection model that captures what rejected documents typically look like, and any document that matches the rejection model is rejected. The remaining documents are characterized to determine whether they add information to the language model, whether they are relevant, and whether they should be independently rejected. Rejected documents are used to update the rejection model; accepted documents are used to update the language model. Each iteration improves the language model, and the documents may be analyzed again using the improved language model.
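The loop described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the unigram count model, the token-overlap score, the thresholds, the in-memory list standing in for the document collection, and all names (`UnigramModel`, `build_topic_model`, `reject_thr`, `relevance_thr`). In this sketch, relevant documents that add no new terms are simply skipped rather than rejected, which is one possible reading of the text.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an illustrative simplification)."""
    return text.lower().split()


class UnigramModel:
    """Hypothetical stand-in for a language model: plain term counts."""

    def __init__(self, seed_text=""):
        self.counts = Counter(tokenize(seed_text))

    def update(self, tokens):
        self.counts.update(tokens)

    def score(self, tokens):
        """Fraction of the tokens that the model has already seen."""
        if not tokens:
            return 0.0
        return sum(1 for t in tokens if t in self.counts) / len(tokens)

    def top_terms(self, k):
        return [term for term, _ in self.counts.most_common(k)]


def build_topic_model(corpus, seed_text, reject_seed,
                      rounds=3, k=5, reject_thr=0.5, relevance_thr=0.3):
    topic = UnigramModel(seed_text)     # language model being improved
    reject = UnigramModel(reject_seed)  # model of what rejected documents look like
    accepted, rejected = set(), set()

    for _ in range(rounds):
        # Form the query from the current language model.
        query = set(topic.top_terms(k))
        hits = [i for i, doc in enumerate(corpus)
                if i not in accepted and i not in rejected
                and query & set(tokenize(doc))]
        if not hits:
            break
        for i in hits:
            tokens = tokenize(corpus[i])
            # First test: does the document resemble the rejection model?
            if reject.score(tokens) >= reject_thr:
                reject.update(tokens)
                rejected.add(i)
                continue
            relevant = topic.score(tokens) >= relevance_thr
            adds_info = any(t not in topic.counts for t in tokens)
            if not relevant:
                # Independently rejected: feeds the rejection model.
                reject.update(tokens)
                rejected.add(i)
            elif adds_info:
                # Accepted: feeds the language model, which changes the next query.
                topic.update(tokens)
                accepted.add(i)
            # Relevant but redundant documents are skipped (assumption).
    return topic, reject, accepted, rejected


corpus = [
    "cardiac surgery repairs heart valve",
    "buy cheap heart pills discount",
    "valve replacement recovery",
    "football match results today",
]
topic, reject, accepted, rejected = build_topic_model(
    corpus, seed_text="heart cardiac surgery", reject_seed="buy cheap discount")
print(sorted(accepted), sorted(rejected))
```

In this toy run the third document shares no term with the seed text, so the first query cannot retrieve it. It becomes reachable only after the first accepted document adds "valve" to the model, which is the query-improvement effect the abstract describes.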
