Conference of the European Chapter of the Association for Computational Linguistics

Building Representative Corpora from Illiterate Communities: A Review of Challenges and Mitigation Strategies for Developing Countries

Abstract

Most well-established data collection methods currently adopted in NLP depend on the assumption of speaker literacy. Consequently, the collected corpora largely fail to represent swathes of the global population, which tend to be some of the most vulnerable and marginalised people in society, and often live in rural developing areas. Such underrepresented groups are thus not only ignored when making modeling and system design decisions, but also prevented from benefiting from development outcomes achieved through data-driven NLP. This paper aims to address the under-representation of illiterate communities in NLP corpora: we identify potential biases and ethical issues that might arise when collecting data from rural communities with high illiteracy rates in Low-Income Countries, and propose a set of practical mitigation strategies to help future work.
