Journal: Social Science Computer Review

Automatic Coding of Text Answers to Open-Ended Questions: Should You Double Code the Training Data?



Abstract

Open-ended questions in surveys are often manually coded into one of several classes (or categories). When the data are too large to code all texts manually, a statistical (or machine) learning model must be trained on a manually coded subset of texts. Uncoded texts are then coded automatically using the trained model. The quality of automatic coding depends on the trained statistical model, and the model in turn relies on the manually coded data on which it is trained. While survey scientists are acutely aware that manual coding is not always accurate, it is not clear how double coding affects the classification errors of the statistical learning model. We investigate several budget allocation strategies when there is a limited budget for manual classification: single coding versus various options for double coding in which the number of training texts is reduced to maintain the fixed budget. Under a fixed budget, double coding improved the prediction of the learning algorithm when the coding error is greater than about 20-35%, depending on the data. Among double-coding strategies, paying for an expert to resolve differences performed best. When no expert is available, removing differences from the training data outperformed other double-coding strategies. When there is no budget constraint and the texts have already been double coded, all double-coding strategies generally outperformed single coding. As under a fixed budget, having an expert resolve disagreements in the training texts improves accuracy most, followed by removing differences.
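The budget trade-off described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' method and trains no classifier. It only shows how a fixed number of manual codings translates into training-set size and label noise under three of the strategies named above. Several things are assumptions made for illustration: the function names, the coder error model (independent coders who pick a wrong class uniformly at random), an expert who is always correct, and an expert cost of one coding per resolved text.

```python
import random


def simulate_coder(true_label, classes, error_rate, rng):
    """Return the true label, or a different class with probability error_rate.

    Assumption: wrong labels are drawn uniformly from the other classes.
    """
    if rng.random() < error_rate:
        return rng.choice([c for c in classes if c != true_label])
    return true_label


def single_coding(truth, classes, error_rate, budget, rng):
    """Spend the whole budget on one code per text."""
    return [(i, simulate_coder(truth[i], classes, error_rate, rng))
            for i in range(budget)]


def double_coding_remove(truth, classes, error_rate, budget, rng):
    """Code budget // 2 texts twice and drop texts where the coders disagree."""
    kept = []
    for i in range(budget // 2):
        a = simulate_coder(truth[i], classes, error_rate, rng)
        b = simulate_coder(truth[i], classes, error_rate, rng)
        if a == b:
            kept.append((i, a))
    return kept


def double_coding_expert(truth, classes, error_rate, budget, rng, expert_cost=1):
    """Code texts twice; an expert resolves disagreements.

    Assumptions: the expert is always correct and each resolution costs
    expert_cost codings taken from the same budget.
    """
    kept, remaining, i = [], budget, 0
    while remaining >= 2:
        a = simulate_coder(truth[i], classes, error_rate, rng)
        b = simulate_coder(truth[i], classes, error_rate, rng)
        remaining -= 2
        if a == b:
            kept.append((i, a))
        elif remaining >= expert_cost:
            kept.append((i, truth[i]))  # expert supplies the correct label
            remaining -= expert_cost
        i += 1
    return kept


def label_noise(training, truth):
    """Fraction of training labels that differ from the true label."""
    if not training:
        return 0.0
    wrong = sum(1 for i, label in training if label != truth[i])
    return wrong / len(training)


if __name__ == "__main__":
    rng = random.Random(0)
    classes = ["A", "B", "C"]
    budget = 2000
    truth = [rng.choice(classes) for _ in range(budget)]
    strategies = (single_coding, double_coding_remove, double_coding_expert)
    for strategy in strategies:
        training = strategy(truth, classes, 0.3, budget, rng)
        print(strategy.__name__, len(training), round(label_noise(training, truth), 3))
```

Under these assumptions, double coding yields fewer training texts with cleaner labels, while single coding yields more texts with noisier labels. Which of the two gives the better classifier is the empirical question the article addresses. The simulation does not answer it.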
