Venue: European Conference on Machine Learning and Knowledge Discovery in Databases

Boot-Strapping Language Identifiers for Short Colloquial Postings


Abstract

There is tremendous interest in mining the abundant user-generated content on the web. Many analysis techniques are language dependent and rely on accurate language identification as a building block. Even though there is already research on language identification, it has focused on very 'clean', editorially managed corpora, on a limited number of languages, and on relatively large documents. These are not the characteristics of the content found in, say, Twitter or Facebook postings, which are short and riddled with vernacular. In this paper, we propose an automated, unsupervised, scalable solution based on publicly available data. To this end we thoroughly evaluate the use of Wikipedia to build language identifiers for a large number of languages (52) and a large corpus, and conduct a large-scale study of the best-known algorithms for automated language identification, quantifying how accuracy varies with document size, language (model) profile size, and the number of languages tested. Then, we show the value of using Wikipedia to train a language identifier directly applicable to Twitter. Finally, we augment the language models and customize them for Twitter by combining our Wikipedia models with location information from tweets. This method provides a massive amount of automatically labeled data, acting as a bootstrapping mechanism that, as we show empirically, boosts the accuracy of the models. With this work we provide a guide and a publicly available tool to the mining community for language identification on web and social data.
