Venue: European Conference on Machine Learning and Knowledge Discovery in Databases

Boot-Strapping Language Identifiers for Short Colloquial Postings


Abstract

There is tremendous interest in mining the abundant user-generated content on the web. Many analysis techniques are language dependent and rely on accurate language identification as a building block. Even though there is already research on language identification, it has focused on very 'clean', editorially managed corpora, on a limited number of languages, and on relatively large documents. These are not the characteristics of the content found in, say, Twitter or Facebook postings, which are short and riddled with vernacular. In this paper, we propose an automated, unsupervised, scalable solution based on publicly available data. To this end we thoroughly evaluate the use of Wikipedia to build language identifiers for a large number of languages (52) and a large corpus, and conduct a large-scale study of the best-known algorithms for automated language identification, quantifying how accuracy varies with document size, language (model) profile size, and the number of languages tested. Then, we show the value of using Wikipedia to train a language identifier directly applicable to Twitter. Finally, we augment the language models and customize them for Twitter by combining our Wikipedia models with location information from tweets. This method provides a massive amount of automatically labeled data, acting as a bootstrapping mechanism that, as we show empirically, boosts the accuracy of the models. With this work we provide a guide and a publicly available tool to the mining community for language identification on web and social data.
