International ACM SIGIR Conference on Research and Development in Information Retrieval

Predicting Quality Flaws in User-generated Content: The Case of Wikipedia



Abstract

The detection and improvement of low-quality information is a key concern in Web applications that are based on user-generated content; a popular example is the online encyclopedia Wikipedia. Existing research on quality assessment of user-generated content deals with classifying content as either high-quality or low-quality. This paper goes one step further: it targets the prediction of quality flaws, thereby providing specific indications of the respects in which low-quality content needs improvement. The prediction is based on user-defined cleanup tags, which are commonly used in many Web applications to mark content that has shortcomings. We apply this approach to the English Wikipedia, the largest and most popular user-generated knowledge source on the Web. We present an automatic mining approach to identify the existing cleanup tags, which provides us with a training corpus of labeled Wikipedia articles. We argue that common binary or multiclass classification approaches are ineffective for the prediction of quality flaws and hence cast quality flaw prediction as a one-class classification problem. We develop a quality flaw model and employ a dedicated machine learning approach to predict Wikipedia's most important quality flaws. Since in the Wikipedia setting the acquisition of significant test data is intricate, we analyze the effects of a biased sample selection. In this regard we illustrate classifier effectiveness as a function of the flaw distribution in order to cope with the unknown (real-world) flaw-specific class imbalances. The flaw prediction performance is evaluated on 10,000 Wikipedia articles that have been tagged with the ten most frequent quality flaws: given test data with little noise, four flaws can be detected with a precision close to 1.
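The abstract's key modeling decision is to train on positive examples only: articles carrying a given cleanup tag are known to exhibit that flaw, while untagged articles are not reliable negatives. A minimal sketch of that one-class setup, using scikit-learn's OneClassSVM with tf-idf features; the classifier choice, feature set, and sample texts are illustrative assumptions, not the authors' actual implementation (the paper uses a dedicated quality flaw model with richer features):

```python
# One-class classification sketch for a single quality flaw.
# Training data contains only flawed articles, mirroring the paper's setup
# where untagged Wikipedia articles cannot be trusted as flaw-free.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# Placeholder corpora; in the paper, positives are mined automatically
# from articles tagged with a specific cleanup tag.
flawed_articles = [
    "This groundbreaking product revolutionizes the entire industry.",
    "The company offers unmatched, best-in-class solutions worldwide.",
]
unseen_articles = [
    "The Battle of Hastings was fought on 14 October 1066.",
    "Our award-winning services deliver incredible value to customers.",
]

# Simple bag-of-words features stand in for the paper's flaw-specific
# feature model (content, structure, network, edit-history features).
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(flawed_articles)

# nu bounds the fraction of training points treated as outliers; it would
# need per-flaw tuning, since the real-world class imbalance is unknown.
clf = OneClassSVM(kernel="linear", nu=0.1)
clf.fit(X_train)

X_test = vectorizer.transform(unseen_articles)
# +1: predicted to exhibit the flaw (inside the learned class); -1: not.
print(clf.predict(X_test))
```

The one-class framing also explains the paper's emphasis on biased sample selection: because negatives are never observed at training time, evaluation must account for how the unknown flaw distribution shifts the decision threshold.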
