Conference: International Conference on Computational Linguistics
Automatic Prediction of Text Aesthetics and Interestingness

Abstract

This paper investigates the problem of automated text aesthetics prediction. The availability of user-generated content and ratings, e.g. on Flickr, has spurred research in aesthetics prediction for non-text domains, particularly photographic images. This problem, however, has not yet been explored for the text domain. Because text aesthetics is highly subjective, it is difficult to compile human-annotated data with a fair degree of inter-annotator agreement through methods such as crowdsourcing. The availability of the Kindle "popular highlights" data has motivated us to compile a dataset of human-annotated, aesthetically pleasing and interesting text passages. We then take a supervised classification approach to predicting text aesthetics, constructing a real-valued feature vector from each text passage. In particular, the features that we use for this classification task are word length, repetitions, polarity, part-of-speech, semantic distances, and topic generality and diversity. A traditional binary classification approach is not effective in this case because the non-highlighted passages surrounding the highlighted ones do not necessarily represent the other extreme, text of unpleasant quality. Due to the absence of real negative class samples, we employ the mapping convergence (MC) algorithm, in which training can be initiated with instances from the positive class only. On each successive iteration, the algorithm selects new strong negative samples from the unlabeled class and retrains itself. The results show that the MC algorithm, with a Gaussian kernel for the mapping phase and a linear kernel for the convergence phase, yields the best results, achieving satisfactory accuracy, precision and recall values of about 74%, 42% and 54%, respectively.
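The positive-and-unlabeled training loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the paper uses a Gaussian kernel for the mapping phase and a linear kernel for the convergence phase, whereas this sketch substitutes a distance-to-centroid score for the mapping phase and a nearest-centroid rule (which has a linear decision boundary) for the convergence phase, so that it runs with the standard library alone. All function names, the `strong_fraction` parameter, and the toy 2-D data are assumptions made for illustration.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mapping_stage(positives, unlabeled, strong_fraction=0.25):
    """Mapping phase (stand-in): treat the unlabeled points farthest from
    the positive centroid as 'strong negatives'."""
    c = centroid(positives)
    ranked = sorted(unlabeled, key=lambda u: dist(u, c))
    n_strong = max(1, int(len(ranked) * strong_fraction))
    return ranked[-n_strong:], ranked[:-n_strong]  # (negatives, remaining)

def mapping_convergence(positives, unlabeled, strong_fraction=0.25, max_iter=50):
    """Convergence phase (stand-in): retrain a nearest-centroid classifier,
    moving unlabeled points it predicts as negative into the negative set,
    until no further points are moved."""
    negatives, remaining = mapping_stage(positives, unlabeled, strong_fraction)
    pos_c = centroid(positives)
    for _ in range(max_iter):
        neg_c = centroid(negatives)
        new_neg = [u for u in remaining if dist(u, neg_c) < dist(u, pos_c)]
        if not new_neg:
            break  # converged: the boundary no longer captures new negatives
        remaining = [u for u in remaining if dist(u, neg_c) >= dist(u, pos_c)]
        negatives = negatives + new_neg
    neg_c = centroid(negatives)

    def is_positive(x):
        return dist(x, pos_c) <= dist(x, neg_c)

    return is_positive, negatives

# Toy data (hypothetical): labeled positives near the origin; the unlabeled
# pool mixes hidden positives with points from a distant cluster.
positives = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.2), (0.1, -0.2)]
unlabeled = [(0.1, 0.1), (-0.2, 0.0), (4.8, 5.1), (5.2, 4.9),
             (5.0, 5.0), (3.0, 3.0), (0.3, -0.1), (4.0, 4.5)]

is_positive, negatives = mapping_convergence(positives, unlabeled)
print(is_positive((0.2, 0.2)), is_positive((4.5, 4.5)), len(negatives))
```

In this sketch the negative set starts from the two farthest unlabeled points and grows on each pass, mirroring the abstract's description of selecting new strong negatives and retraining. In a real setting the 2-D points would be replaced by the feature vectors built from each passage.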