Conference: International Conference on Language Resources and Evaluation

The WeSearch Corpus, Treebank, and Treecache: A Comprehensive Sample of User-Generated Content


Abstract

We present the WeSearch Data Collection (WDC), a freely redistributable, partly annotated, comprehensive sample of User-Generated Content. The WDC contains data extracted from a range of genres of varying formality (user forums, product review sites, blogs and Wikipedia) and covers two different domains (NLP and Linux). In this article, we describe the data selection and extraction process, with a focus on the extraction of linguistic content from different sources. We present the format of the syntacto-semantic annotations found in this resource, report initial parsing results for these data, and offer some reflections following a first round of treebanking.