Venue: Annual Meeting of the Association for Computational Linguistics (ACL 2012)

Text Segmentation by Language Using Minimum Description Length


Abstract

The problem addressed in this paper is to segment a given multilingual document into segments for each language and then identify the language of each segment. The problem was motivated by an attempt to collect a large amount of linguistic data for non-major languages from the web. The problem is formulated in terms of obtaining the minimum description length of a text, and the proposed solution finds the segments and their languages through dynamic programming. Empirical results demonstrating the potential of this approach are presented for experiments using texts taken from the Universal Declaration of Human Rights and Wikipedia, covering more than 200 languages.
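
As a rough illustration of the formulation described in the abstract, the sketch below picks the segmentation and language labels that minimize a total code length, using dynamic programming over character positions. It is a toy under stated assumptions, not the paper's method: the add-one-smoothed character unigram models, the fixed `boundary_bits` header charged per segment, and the names `train_unigram` and `segment_by_language` are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def train_unigram(sample, alphabet):
    """Per-character code lengths (bits) under an add-one-smoothed unigram model.

    Assumption: a unigram model stands in for whatever language model is used.
    """
    counts = Counter(sample)
    total = len(sample) + len(alphabet)
    return {ch: -math.log2((counts[ch] + 1) / total) for ch in alphabet}


def segment_by_language(text, samples, boundary_bits=8.0):
    """Return (start, end, language) spans that minimize the total code length.

    `samples` maps a language name to training text (hypothetical interface).
    `boundary_bits` is an assumed flat cost for encoding a segment boundary.
    """
    alphabet = set(text)
    for sample in samples.values():
        alphabet |= set(sample)
    models = {lang: train_unigram(s, alphabet) for lang, s in samples.items()}

    n = len(text)
    # prefix[lang][i] = bits needed to encode text[:i] under that language
    prefix = {}
    for lang, bits in models.items():
        acc = [0.0]
        for ch in text:
            acc.append(acc[-1] + bits[ch])
        prefix[lang] = acc

    # Every segment also pays for its boundary and its language label.
    header = boundary_bits + math.log2(len(models))

    best = [0.0] + [math.inf] * n  # best[i] = minimum bits to encode text[:i]
    back = [None] * (n + 1)        # back[i] = (start of last segment, language)
    for i in range(1, n + 1):
        for j in range(i):
            for lang, acc in prefix.items():
                cost = best[j] + header + acc[i] - acc[j]
                if cost < best[i]:
                    best[i] = cost
                    back[i] = (j, lang)

    # Walk the back-pointers from the end of the text to recover the spans.
    spans = []
    i = n
    while i > 0:
        j, lang = back[i]
        spans.append((j, i, lang))
        i = j
    return spans[::-1]
```

Calling `segment_by_language(text, {"en": english_sample, "ru": russian_sample})` returns contiguous spans covering `text`. The per-segment header is what discourages spurious language switches: a new segment is opened only when the bits it saves exceed the cost of describing the boundary and label. This version runs in time quadratic in the text length, which is adequate only for short inputs.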
