9th International Conference on Language Resources and Evaluation
Designing and Evaluating a Reliable Corpus of Web Genres via Crowd-Sourcing

Abstract

Research in Natural Language Processing often relies on large collections of manually annotated documents. However, there is currently no reliable genre-annotated corpus of web pages for use in Automatic Genre Identification (AGI). In AGI, documents are classified by genre rather than by topic or subject. The major shortcoming of the available web genre collections is their relatively low inter-coder agreement. The reliability of annotated data is essential to the reliability of the research results built on it. In this paper, we present the first reliably annotated web genre corpus. We developed precise and consistent annotation guidelines built on well-defined and well-recognized categories. To annotate the corpus, we used crowd-sourcing, which is a novel approach in genre annotation. We computed chance-corrected inter-annotator agreement both overall and for each individual category. The results show that the corpus has been annotated reliably.
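
The abstract does not name the agreement coefficient that was used. As a minimal sketch of what an overall and per-category chance-corrected agreement computation can look like, the following assumes Fleiss' kappa, a common choice when each item is labelled by the same number of annotators. The function names and the vote-count input layout (one row per web page, one column per genre, each cell the number of annotators who chose that genre) are hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def _check(counts: Sequence[Sequence[int]]) -> int:
    """Return the number of raters per item; every item must have the same count."""
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    return n_raters


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Overall Fleiss' kappa for an items x categories table of vote counts."""
    n_raters = _check(counts)
    n_items = len(counts)
    n_cats = len(counts[0])
    # Share of all ratings that fell into each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_cats)]
    # Observed agreement: mean proportion of agreeing rater pairs per item.
    p_obs = sum((sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
                for row in counts) / n_items
    # Agreement expected by chance from the category shares.
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        return float("nan")  # only one category was ever used
    return (p_obs - p_exp) / (1.0 - p_exp)


def category_kappas(counts: Sequence[Sequence[int]]) -> List[float]:
    """Per-category Fleiss' kappa (each category versus all others)."""
    n_raters = _check(counts)
    n_items = len(counts)
    result = []
    for j in range(len(counts[0])):
        p = sum(row[j] for row in counts) / (n_items * n_raters)
        if p in (0.0, 1.0):
            result.append(float("nan"))  # category never or always chosen
            continue
        disagree = sum(row[j] * (n_raters - row[j]) for row in counts)
        result.append(1.0 - disagree / (n_items * n_raters * (n_raters - 1) * p * (1 - p)))
    return result


if __name__ == "__main__":
    # Hypothetical votes: 4 pages, 3 annotators each, 2 genres.
    votes = [[3, 0], [0, 3], [2, 1], [0, 3]]
    print(fleiss_kappa(votes))
    print(category_kappas(votes))
```

Other chance-corrected coefficients, such as Krippendorff's alpha, handle unequal numbers of annotators per item, which can matter for crowd-sourced labels; the sketch above does not cover that case.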
