Conference: Hawaii International Conference on System Sciences, Annual
Linguini: Language Identification for Multilingual Documents

Abstract

In this paper we present Linguini, a vector-space-based categorizer tailored for high-precision language identification. We show how its accuracy depends on the size of the input document, the set of languages under consideration, and the features used. We found that Linguini could identify the language of documents as short as 5-10% of the size of average Web documents with 100% accuracy. We also show how to determine whether a document is in two or more languages, and in what proportions, without incurring any appreciable computational overhead beyond the monolingual analysis. This approach can be applied to subject-categorization systems: when such a system recommends two or more categories, it can distinguish the case where the document belongs strongly to all of them from the case where it really belongs to none.
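
The abstract does not say which features or similarity measure Linguini uses, so the following is only a minimal sketch of the general idea of a vector-space language identifier, not the authors' implementation. It assumes character trigram counts as features and cosine similarity as the score. The names `TRAINING`, `ngram_vector`, `identify`, and `mixture_share` are hypothetical, and the one-sentence training texts are toy data. The two-language estimate projects the document onto the plane spanned by two language profiles, reusing the monolingual scores plus one precomputable profile-to-profile similarity, which is one plausible way to keep the extra cost small.

```python
import math
from collections import Counter

# Toy training data (assumption): real profiles would be built from large corpora.
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and this is the house where the people live",
    "fr": "le renard brun saute par dessus le chien paresseux et ceci est la maison des gens",
}

def ngram_vector(text, n=3):
    """Count overlapping character n-grams in the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# One feature vector per language.
profiles = {lang: ngram_vector(text) for lang, text in TRAINING.items()}

def score_all(text, profiles):
    """Monolingual analysis: similarity of the document to every language profile."""
    doc = ngram_vector(text)
    return {lang: cosine(doc, prof) for lang, prof in profiles.items()}

def identify(text, profiles):
    """Return the language whose profile is most similar to the document."""
    scores = score_all(text, profiles)
    return max(scores, key=scores.get)

def mixture_share(text, profiles, lang_a, lang_b):
    """Rough share of lang_a in a document assumed to mix lang_a and lang_b.

    Projects the document onto the span of the two profiles and returns
    lang_a's weight as a fraction of the total (None if neither matches).
    """
    scores = score_all(text, profiles)
    p, q = scores[lang_a], scores[lang_b]
    c = cosine(profiles[lang_a], profiles[lang_b])  # can be precomputed per pair
    w_a = max(p - c * q, 0.0)  # negative weights are clipped to zero
    w_b = max(q - c * p, 0.0)
    total = w_a + w_b
    return w_a / total if total else None
```

With these toy profiles, `identify("this is the dog", profiles)` picks `"en"`, and `mixture_share("this is the dog le chien est paresseux", profiles, "en", "fr")` returns a value strictly between 0 and 1. The share is a weight on normalized profile vectors, so it is only a rough indicator of proportion, not a calibrated word or character fraction.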