Method and system for normalizing dirty text in a document

Abstract

A method and system for normalizing dirty text in a document. The present invention creates a thesaurus that evolves over time as new document collections are analyzed. This thesaurus, which is used by an editor, contains standard terms and phrases together with their known variations. Documents are run through the editor, which replaces misspelled words or phrases, joined words, and ad hoc abbreviations with standard terms from the thesaurus. The present invention also enables normalization of documents in cases where the list of standard terms must be inferred from the document corpus itself. The normalizer facilitates data mining applications that cannot function properly with dirty text, resulting in more accurate analysis of documents. As the thesaurus evolves and collects more words and phrases, the process of generating it becomes more automated.
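The replacement step described in the abstract can be illustrated with a minimal Python sketch. This is not the patented method: the `THESAURUS` contents, the `normalize` function, the two-pass design, and the use of `difflib` similarity with a 0.85 cutoff for unseen misspellings are all assumptions made for illustration.

```python
import re
from difflib import get_close_matches

# Hypothetical thesaurus: each standard term maps to known dirty variations
# (misspellings, joined words, ad hoc abbreviations).
THESAURUS = {
    "hard drive": ["harddrive", "hard drv", "hdd"],
    "password": ["passwd", "pwd", "pasword"],
    "customer": ["cust", "custmer"],
}


def normalize(text, thesaurus, cutoff=0.85):
    """Replace known variations with standard terms, then fuzzy-match leftovers."""
    # Invert the thesaurus: variation -> standard term.
    lookup = {v.lower(): std for std, vs in thesaurus.items() for v in vs}

    # Pass 1: exact, case-insensitive replacement of known variations.
    # Longest variations go first so multi-word entries are handled
    # before any shorter entries.
    for variant in sorted(lookup, key=len, reverse=True):
        pattern = r"\b" + re.escape(variant) + r"\b"
        text = re.sub(pattern, lambda m, s=lookup[variant]: s,
                      text, flags=re.IGNORECASE)

    # Pass 2: words not listed as variations are compared against the
    # single-word standard terms; a close enough match is replaced.
    single_word_terms = [t for t in thesaurus if " " not in t]

    def fix(match):
        word = match.group(0)
        close = get_close_matches(word.lower(), single_word_terms,
                                  n=1, cutoff=cutoff)
        return close[0] if close else word

    return re.sub(r"[A-Za-z]+", fix, text)


if __name__ == "__main__":
    print(normalize("Cust reset pwd after harddrive failure", THESAURUS))
    # prints: customer reset password after hard drive failure
```

In this sketch a variation confirmed by the fuzzy pass could be appended to the thesaurus entry, which is one simple way to picture the thesaurus growing as more documents are processed.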
