Journal: Knowledge and Information Systems

String alignment for automated document versioning

Abstract

The automated analysis of documents is an important task given the rapid increase in the availability of digital texts. Automatic text processing systems often encode documents as vectors of term occurrence frequencies, a representation that facilitates the classification and clustering of documents. Historically, this approach derives from the related field of data mining, where database entries are commonly represented as points in a vector space. While this lineage has certainly contributed to the development of text processing, there are situations where document collections do not conform to this clustered structure, and where the vector representation may be unsuitable for text analysis. As a proof of concept, we previously presented a framework in which the optimal alignments of documents could be used for visualising the relationships within small sets of documents. In this paper we develop this approach further by using it to automatically generate the version histories of various document collections. For comparison, version histories are also generated using conventional methods of document representation. To facilitate this comparison, we propose a simple procedure for evaluating the accuracy of the generated version histories.
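To make the contrast between the two representations concrete, the following is a minimal sketch, not the authors' method. It assumes whitespace tokenisation, raw term counts with cosine similarity for the vector representation, and a unit-cost global alignment (Levenshtein distance over word tokens) as a stand-in for "optimal alignment". The function names `tf_cosine` and `alignment_distance` and the sample sentences are hypothetical.

```python
from collections import Counter
from math import sqrt


def tf_cosine(a: str, b: str) -> float:
    """Cosine similarity between term-frequency vectors of two texts."""
    ta, tb = Counter(a.split()), Counter(b.split())
    dot = sum(ta[w] * tb[w] for w in ta)
    norm = sqrt(sum(v * v for v in ta.values())) * sqrt(sum(v * v for v in tb.values()))
    return dot / norm if norm else 0.0


def alignment_distance(a: str, b: str) -> int:
    """Unit-cost global alignment (Levenshtein) distance over word tokens."""
    x, y = a.split(), b.split()
    prev = list(range(len(y) + 1))  # cost of aligning an empty prefix of x
    for i, wx in enumerate(x, start=1):
        curr = [i]
        for j, wy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete wx
                curr[j - 1] + 1,           # insert wy
                prev[j - 1] + (wx != wy),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


# Hypothetical toy "versions" of one sentence.
v1 = "the cat chased the dog"
v2 = "the dog chased the cat"      # same words, different order
v3 = "the cat chased the big dog"  # one word inserted

for name, other in (("v2", v2), ("v3", v3)):
    print(name, round(tf_cosine(v1, other), 3), alignment_distance(v1, other))
```

In this toy case `v1` and `v2` contain the same words, so their term-frequency vectors coincide, while the alignment distance still registers the two swapped words. Order-sensitive signals of this kind are what an alignment-based version history can exploit and a bag-of-words representation discards.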
