International ACM SIGIR Conference on Research and Development in Information Retrieval

Predicting Quality Flaws in User-generated Content: The Case of Wikipedia



Abstract

The detection and improvement of low-quality information is a key concern in Web applications that are based on user-generated content; a popular example is the online encyclopedia Wikipedia. Existing research on quality assessment of user-generated content deals with classifying content as either high-quality or low-quality. This paper goes one step further: it targets the prediction of quality flaws, thereby providing specific indications of the respects in which low-quality content needs improvement. The prediction is based on user-defined cleanup tags, which are commonly used in many Web applications to mark content that has shortcomings. We apply this approach to the English Wikipedia, the largest and most popular user-generated knowledge source on the Web. We present an automatic mining approach to identify the existing cleanup tags, which provides us with a training corpus of labeled Wikipedia articles. We argue that common binary or multiclass classification approaches are ineffective for the prediction of quality flaws and hence cast quality flaw prediction as a one-class classification problem. We develop a quality flaw model and employ a dedicated machine learning approach to predict Wikipedia's most important quality flaws. Since in the Wikipedia setting the acquisition of significant test data is intricate, we analyze the effects of a biased sample selection. In this regard we illustrate classifier effectiveness as a function of the flaw distribution in order to cope with the unknown (real-world) flaw-specific class imbalances. The flaw prediction performance is evaluated on 10,000 Wikipedia articles that have been tagged with the ten most frequent quality flaws: given test data with little noise, four flaws can be detected with a precision close to 1.
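The abstract's key modeling decision is to train on positive examples only: articles carrying a given cleanup tag are known to exhibit that flaw, while untagged articles are not reliable negatives. A minimal sketch of that one-class setup, using scikit-learn's OneClassSVM with tf-idf features; the classifier choice, feature set, and sample texts are illustrative assumptions, not the authors' actual implementation (the paper uses a dedicated quality flaw model with richer features):

```python
# One-class classification sketch for a single quality flaw.
# Training data contains only flawed articles, mirroring the paper's setup
# where untagged Wikipedia articles cannot be trusted as flaw-free.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# Placeholder corpora; in the paper, positives are mined automatically
# from articles tagged with a specific cleanup tag.
flawed_articles = [
    "This groundbreaking product revolutionizes the entire industry.",
    "The company offers unmatched, best-in-class solutions worldwide.",
]
unseen_articles = [
    "The Battle of Hastings was fought on 14 October 1066.",
    "Our award-winning services deliver incredible value to customers.",
]

# Simple bag-of-words features stand in for the paper's flaw-specific
# feature model (content, structure, network, edit-history features).
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(flawed_articles)

# nu bounds the fraction of training points treated as outliers; it would
# need per-flaw tuning, since the real-world class imbalance is unknown.
clf = OneClassSVM(kernel="linear", nu=0.1)
clf.fit(X_train)

X_test = vectorizer.transform(unseen_articles)
# +1: predicted to exhibit the flaw (inside the learned class); -1: not.
print(clf.predict(X_test))
```

The one-class framing also explains the paper's emphasis on biased sample selection: because negatives are never observed at training time, evaluation must account for how the unknown flaw distribution shifts the decision threshold.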
