Method and apparatus for inferring the topical content of a document based upon its lexical content without supervision

Abstract

An iterative method of determining the topical content of a document using a computer. The processing unit of the computer determines the topical content of documents presented to it in machine-readable form, using information stored in computer memory. That information includes word clusters, a lexicon, and association strength values. The processing unit begins by generating an observed feature vector for the document being characterized, which indicates which words of the lexicon appear in the document. The processing unit then makes an initial prediction of the topical content of the document in the form of a topic belief vector. It uses the topic belief vector and the association strength values to predict which words of the lexicon should appear in the document; this prediction is represented by a predicted feature vector. The predicted feature vector is then compared to the observed feature vector to measure how well the topic belief vector models the topical content of the document. If the topic belief vector adequately models the topical content of the document, the processing unit's task is complete. If it does not, the processing unit determines how the topic belief vector should be modified to better model the topical content, and the prediction and comparison are repeated.
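The loop described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how topic beliefs and association strengths are combined, how fit is measured, or how the belief vector is updated, so everything specific here is an assumption made for illustration: the noisy-OR combination in `predict_features`, the mean log-likelihood score in `fit_score`, the clamped gradient-ascent update, and the parameters `threshold`, `step`, and `max_iters`. All function names are hypothetical.

```python
import math

EPS = 1e-9  # guards against log(0) and division by zero


def predict_features(belief, strengths):
    """Predicted feature vector from a topic belief vector.

    strengths[j][k] is the association strength between lexicon word j
    and word cluster k. ASSUMPTION: topics combine by noisy-OR.
    """
    predicted = []
    for word_strengths in strengths:
        prob_absent = 1.0
        for m_k, c_jk in zip(belief, word_strengths):
            prob_absent *= 1.0 - m_k * c_jk
        predicted.append(1.0 - prob_absent)
    return predicted


def fit_score(observed, predicted):
    """Mean log-likelihood of the observed binary vector (assumed measure)."""
    total = 0.0
    for d_j, r_j in zip(observed, predicted):
        p = r_j if d_j else 1.0 - r_j
        total += math.log(max(p, EPS))
    return total / len(observed)


def infer_topics(observed, strengths, threshold=-0.2, step=0.1, max_iters=500):
    """Iteratively adjust the topic belief vector until it fits well enough."""
    num_topics = len(strengths[0])
    belief = [0.5] * num_topics  # initial prediction of topical content
    for _ in range(max_iters):
        predicted = predict_features(belief, strengths)
        if fit_score(observed, predicted) >= threshold:
            break  # belief vector adequately models the document
        # ASSUMPTION: improve the belief vector by gradient ascent on the
        # (unnormalized) log-likelihood, clamping beliefs to [0, 1].
        grad = [0.0] * num_topics
        for j, (d_j, r_j) in enumerate(zip(observed, predicted)):
            p = r_j if d_j else 1.0 - r_j
            sign = 1.0 if d_j else -1.0
            for k in range(num_topics):
                # d r_j / d m_k = c_jk * prod over l != k of (1 - m_l * c_jl)
                others = 1.0
                for l in range(num_topics):
                    if l != k:
                        others *= 1.0 - belief[l] * strengths[j][l]
                grad[k] += sign * strengths[j][k] * others / max(p, EPS)
        belief = [min(1.0, max(0.0, m + step * g)) for m, g in zip(belief, grad)]
    return belief


# Toy lexicon of four words and two clusters (made-up strengths).
strengths = [[0.9, 0.0], [0.8, 0.0], [0.0, 0.9], [0.0, 0.8]]
observed = [1, 1, 0, 0]  # only the first two lexicon words appear
print(infer_topics(observed, strengths))
```

In this toy case the document contains only words tied to the first cluster, so the belief for the first topic should rise toward 1 while the belief for the second falls toward 0.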

Bibliographic details

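  • Publication number: US5659766A

  • Patent type:

  • Publication date: 1997-08-19

  • Original format: PDF

  • Applicant/assignee: XEROX CORPORATION

  • Application number: US19940307221

  • Inventors: MARTI A. HEARST; ERIC SAUND

  • Filing date: 1994-09-16

  • Classification: G06F17/28; G06F17/30

  • Country: US

  • Added to database: 2022-08-22 03:09:35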
