'Bend the truth': Benchmark dataset for fake news detection in Urdu language and its evaluation

Amjad Maaz; Sidorov Grigori; Zhila Alisa; Gomez-Adorno Helena; Voronkov Ilia; Gelbukh Alexander

Abstract

The paper presents a new corpus for fake news detection in the Urdu language, along with a baseline classification and its evaluation. With the escalating use of the Internet worldwide and the substantially increasing impact of readily available ambiguous information, the challenge of quickly identifying fake news in digital media in various languages becomes more acute. We provide a manually assembled and verified dataset containing 900 news articles, 500 annotated as real and 400 as fake, allowing the investigation of automated fake news detection approaches in Urdu. The news articles in the truthful subset come from legitimate news sources, and their validity has been manually verified. For the fake subset, the known difficulty of finding fake news was solved by hiring professional journalists who are native speakers of Urdu and were instructed to intentionally write deceptive news articles. The dataset covers 5 topics: (i) Business, (ii) Health, (iii) Showbiz, (iv) Sports, and (v) Technology. To establish our Urdu dataset as a benchmark, we performed baseline classification. We crafted a variety of text representation feature sets, including word n-grams, character n-grams, function word n-grams, and their combinations. After applying a variety of feature weighting schemes, we ran a series of classifiers on the train-test split. The results show sizable performance gains with the AdaBoost classifier, which reaches 0.87 F1(Fake) and 0.90 F1(Real). We report the results under several metrics to allow convenient comparison with future research. The dataset is publicly available for research purposes.
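As a rough illustration of two ingredients named in the abstract, the sketch below extracts word and character n-gram counts from a text and computes a per-class F1 score of the kind reported as F1(Fake) and F1(Real). It is a minimal sketch and not the authors' implementation: the function names, the `w:`/`c:` feature prefixes, the n-gram sizes, and the toy labels are all assumptions made for illustration, and no feature weighting or AdaBoost training is shown.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return word n-grams of a whitespace-tokenized text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return character n-grams, including spaces, of a text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_counts(text, word_n=2, char_n=3):
    """Combine word and character n-gram counts into one feature bag.

    The "w:" and "c:" prefixes (hypothetical) keep the two feature
    families apart when they are merged into a single Counter.
    """
    feats = Counter()
    feats.update("w:" + g for g in word_ngrams(text, word_n))
    feats.update("c:" + g for g in char_ngrams(text, char_n))
    return feats


def f1_for_class(y_true, y_pred, label):
    """F1 score treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for illustration only; these are not taken from the dataset.
y_true = ["fake", "fake", "real", "real"]
y_pred = ["fake", "real", "real", "real"]
f1_fake = f1_for_class(y_true, y_pred, "fake")
f1_real = f1_for_class(y_true, y_pred, "real")
```

Reporting F1 separately for each class, as the abstract does, makes it visible when a classifier handles one class better than the other, which matters here because the corpus has 500 real and 400 fake articles.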

Bibliographic record

Source
Journal of intelligent & fuzzy systems: Applications in Engineering and Technology | 2020, Issue 2 | 13 pages
Authors
Amjad Maaz; Sidorov Grigori; Zhila Alisa; Gomez-Adorno Helena; Voronkov Ilia; Gelbukh Alexander
Author affiliations

Inst Politecn Nacl Ctr Invest Comp CIC Mexico City DF Mexico;

Inst Politecn Nacl Ctr Invest Comp CIC Mexico City DF Mexico;

Inst Politecn Nacl Ctr Invest Comp CIC Mexico City DF Mexico;

Univ Nacl Autonoma Mexico Inst Invest Matemat Aplicadas & Sistemas IIMAS Mexico City DF Mexico;

Moscow Inst Phys & Technol Moscow Russia;

Inst Politecn Nacl Ctr Invest Comp CIC Mexico City DF Mexico;

Indexing information
Original format: PDF
Language of text: eng
Chinese Library Classification: Automation systems
Keywords
Fake news detection; Urdu corpus; language resources; benchmark dataset; classification; machine learning;
