
Characterizing Text Difficulty with Word Frequencies



Abstract

Natural language processing (NLP) methods have been widely adopted for readability assessment and have greatly improved predictive accuracy. In the present study, we examine a well-established feature, the frequency of a word in common language use, and systematically explore how such a word-level feature is best used to characterize the reading level of a text, which is a text-level classification problem. Traditionally, such word-level features are simply averaged over all words of a given text; we show that a richer representation leads to significantly better predictive models. A basic approach that adds a feature for the standard deviation already shows clear gains, and we explore two more complex options that systematically integrate more frequency information: (i) encoding separate means for the words of a text according to the frequency band of the language in which they occur, and (ii) encoding the mean of each cluster of words obtained by agglomerative hierarchical clustering of the words in the text based on their frequency. The former organizes frequency around general language characteristics, whereas the latter aims to lose as little information as possible about the distribution of word frequencies in a given text. To investigate the generalizability of the results, we compare cross-validation experiments within a corpus with cross-corpus experiments that test on the Common Core State Standards reference texts. We also contrast two different frequency norms and compare frequency with a measure of contextual diversity.
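The three frequency representations described in the abstract (mean plus standard deviation, per-band means, and per-cluster means) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy log-frequency values in `TOY_NORMS`, the band cut points, the use of centroid linkage for the clustering, and all function names.

```python
from statistics import mean, pstdev

# Hypothetical log-frequency norms for illustration only (not real corpus data).
TOY_NORMS = {"the": 7.5, "of": 7.2, "cat": 4.8, "sat": 4.4,
             "mat": 3.9, "ubiquitous": 2.9}

def lookup(tokens, norms):
    """Log frequencies of the tokens found in the norms; unknown words are skipped."""
    return [norms[t] for t in tokens if t in norms]

def mean_sd(freqs):
    """Baseline feature (mean) plus the standard deviation of word frequencies."""
    return mean(freqs), pstdev(freqs)

def band_means(freqs, band_edges):
    """Mean frequency of the text's words inside each language-level band.

    band_edges are ascending cut points, giving len(band_edges) + 1 bands.
    An empty band yields 0.0, which is an assumed default.
    """
    bands = [[] for _ in range(len(band_edges) + 1)]
    for f in freqs:
        idx = sum(f >= edge for edge in band_edges)  # index of the band f falls in
        bands[idx].append(f)
    return [mean(b) if b else 0.0 for b in bands]

def cluster_means(freqs, k):
    """Agglomerative clustering of 1-D values down to k clusters (centroid linkage).

    Starts with one cluster per word and repeatedly merges the two clusters
    whose means are closest. In one dimension that pair is always adjacent
    in sorted order. Returns the cluster means in ascending order; texts with
    fewer than k known words return fewer than k values.
    """
    clusters = [[f] for f in sorted(freqs)]
    while len(clusters) > k:
        i = min(range(len(clusters) - 1),
                key=lambda j: mean(clusters[j + 1]) - mean(clusters[j]))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return [mean(c) for c in clusters]

if __name__ == "__main__":
    freqs = lookup("the cat sat on the mat".split(), TOY_NORMS)
    print("mean, sd:", mean_sd(freqs))
    print("band means:", band_means(freqs, [4.5, 6.0]))
    print("cluster means:", cluster_means(freqs, 2))
```

The band features depend on cut points fixed in advance for the language as a whole, while the cluster features adapt to the frequency distribution of each individual text, which mirrors the contrast the abstract draws between the two options. Each feature vector would then be passed to a text-level classifier, which this sketch leaves out.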
