Journal: Data

From a Smoking Gun to Spent Fuel: Principled Subsampling Methods for Building Big Language Data Corpora from Monitor Corpora

Abstract

With the influence of Big Data culture on qualitative data collection, acquisition, and processing, it is becoming increasingly important that social scientists understand the complexity underlying data collection and the resulting models and analyses. Systematic approaches for creating computationally tractable models need to be employed in order to create representative, specialized reference corpora subsampled from Big Language Data sources. Even more importantly, any such method must be tested and vetted for its reproducibility and consistency in generating a representative model of a particular population in question. This article considers and tests one such method for downsampling digitally accessible Big Language Data, to determine both how to operationalize this form of corpus model creation and whether the method is reproducible. Using the U.S. Nuclear Regulatory Commission’s public documentation database as a test source, the sampling method’s procedure was evaluated to assess variation in the rate at which documents were deemed fit for inclusion in or exclusion from the corpus across four iterations. After performing multiple sampling iterations, the approach pioneered by the Tobacco Documents Corpus creators was deemed to be reproducible and valid using a two-proportion z-test at the 99% confidence level at each stage of the evaluation process, leading to a final mean rejection ratio of 23.5875 and variance of 0.891 for the documents sampled and evaluated for inclusion into the final text-based model. The findings of this study indicate that such a principled sampling method is viable, underscoring the need for an approach to creating language-based models that accounts for extralinguistic factors and linguistic characteristics of documents.
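As a rough illustration of the kind of check the abstract describes, the following Python sketch implements a standard pooled two-proportion z-test and applies it to the rejection counts of two sampling iterations. The function name, the counts in the usage example, and the choice of a two-sided test are assumptions made for illustration; the abstract does not give the per-iteration counts or the exact test configuration used in the study.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion z-test for H0: p1 == p2.

    x1, x2: number of rejected documents in each sampling iteration.
    n1, n2: number of documents evaluated in each iteration.
    Returns (z statistic, two-sided p-value).
    """
    if n1 <= 0 or n2 <= 0:
        raise ValueError("sample sizes must be positive")
    p1 = x1 / n1
    p2 = x2 / n2
    # Pooled proportion under the null hypothesis of equal rejection rates.
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:
        # Both samples are all-rejected or all-accepted: no evidence of a difference.
        return 0.0, 1.0
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts, NOT taken from the study: 240 of 1000 documents
    # rejected in one iteration versus 230 of 1000 in another.
    z, p = two_proportion_z_test(240, 1000, 230, 1000)
    alpha = 0.01  # corresponds to the 99% confidence level
    consistent = p >= alpha
    print(f"z = {z:.4f}, p = {p:.4f}, rates consistent at 99%: {consistent}")
```

Under this reading, failing to reject the null hypothesis at the 0.01 level for each pair of iterations is what would support the claim that the rejection rate is stable across repetitions of the sampling procedure.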
