Linguini: Language Identification for Multilingual Documents

Abstract

This paper presents Linguini, a vector-space based categorizer tailored for high-precision language identification. We show how accuracy depends on the size of the input document, the set of languages under consideration, and the features used. We found that Linguini could identify the language of documents as short as 5-10% of the size of average Web documents with 100% accuracy. We also show how to determine whether a document is in two or more languages, and in what proportions, without incurring any appreciable computational overhead beyond the monolingual analysis. This approach can also be applied to subject-categorization systems: when the system recommends two or more categories, it can distinguish documents that belong strongly to all of them from documents that really belong to none.
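
To make the vector-space idea concrete, here is a minimal Python sketch of monolingual language identification. It is not Linguini's implementation. The character-trigram features, the cosine-similarity scoring, the function names (`ngram_vector`, `cosine`, `identify`) and the toy training sentences are all assumptions chosen for illustration, since the abstract does not specify them.

```python
from collections import Counter
from math import sqrt


def ngram_vector(text, n=3):
    """Build a sparse vector of overlapping character n-gram counts.

    Character trigrams are an assumed feature choice for this sketch.
    """
    text = " ".join(text.lower().split())  # normalise case and whitespace
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify(text, profiles):
    """Return the best-scoring language and the score for every language."""
    doc = ngram_vector(text)
    scores = {lang: cosine(doc, prof) for lang, prof in profiles.items()}
    return max(scores, key=scores.get), scores


# Toy training data (hypothetical); a real system would use large corpora.
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "fr": "le renard brun saute par dessus le chien paresseux et puis le chien dort au soleil",
}
PROFILES = {lang: ngram_vector(text) for lang, text in TRAINING.items()}

best, scores = identify("le chien est dans la maison", PROFILES)
print(best, scores)
```

Each language is represented by one feature vector, and a document is assigned to the language whose vector lies closest in angle. Because the per-language scores are all computed in a single pass, they are also available for further analysis at no extra cost, which is in the spirit of the abstract's remark about low overhead for multilingual detection.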

Bibliographic details

Source: Hawaii International Conference on System Sciences, Annual | 1999 | 11 pages
Author: John M. Prager
Original format: PDF
Chinese Library Classification: N94-53
