Journal of intelligent & fuzzy systems: Applications in Engineering and Technology

'Bend the truth': Benchmark dataset for fake news detection in Urdu language and its evaluation



Abstract

This paper presents a new corpus for fake news detection in the Urdu language, along with a baseline classification and its evaluation. As Internet use escalates worldwide and the impact of readily available ambiguous information grows, the challenge of quickly identifying fake news in digital media across languages becomes more acute. We provide a manually assembled and verified dataset of 900 news articles, 500 annotated as real and 400 as fake, which enables the investigation of automated fake news detection approaches in Urdu. The news articles in the truthful subset come from legitimate news sources, and their validity has been manually verified. For the fake subset, we addressed the known difficulty of finding fake news by hiring professional journalists, native speakers of Urdu, who were instructed to intentionally write deceptive news articles. The dataset covers five topics: (i) Business, (ii) Health, (iii) Showbiz, (iv) Sports, and (v) Technology. To establish our Urdu dataset as a benchmark, we performed baseline classification. We crafted a variety of text representation feature sets, including word n-grams, character n-grams, functional word n-grams, and their combinations. After applying a variety of feature weighting schemes, we ran a series of classifiers on the train-test split. The results show sizable performance gains with the AdaBoost classifier, which achieves an F1 of 0.87 on the Fake class and 0.90 on the Real class. We report results on several metrics to make comparison with future research convenient. The dataset is publicly available for research purposes.
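The feature sets and per-class scores mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: whitespace tokenization, the n-gram sizes, raw counts as the weighting scheme, and the helper names `word_ngrams`, `char_ngrams`, and `f1_for_class` are all assumptions made for illustration, and the toy labels below are not taken from the dataset.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return word n-grams, assuming simple whitespace tokenization."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return character n-grams over the raw string (spaces included)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def f1_for_class(y_true, y_pred, label):
    """F1 score treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Raw counts are one possible (assumed) weighting scheme for a feature vector.
features = Counter(word_ngrams("a b a b", 2) + char_ngrams("abab", 3))

# Toy labels, for illustration only.
y_true = ["fake", "fake", "real", "real"]
y_pred = ["fake", "real", "real", "real"]
f1_fake = f1_for_class(y_true, y_pred, "fake")
f1_real = f1_for_class(y_true, y_pred, "real")
```

Reporting F1 separately for the Fake and Real classes, as the abstract does, shows how a classifier handles each class even though the two subsets differ in size (400 versus 500 articles).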
