Conference on Intelligent Text Processing and Computational Linguistics; CICLing 2014
Simple TF·IDF Is Not the Best You Can Get for Regionalism Classification*

Abstract

In widely spoken languages such as English or Spanish, some words are characteristic of a particular region. For example, cooker is typically used in the UK, while stove is preferred for the same concept in the US. Identifying the words a region cultivates involves discriminating them from the set of words common to all regions. This poses a problem: a term's frequency should be salient enough to be considered important, yet being a common term tames this salience. This is the known problem of Term Frequency versus Inverse Document Frequency; nevertheless, typical TF·IDF applications do not include weighting factors. In this work we propose several alternative formulae empirically, and then conclude that we need to search a broader space; we therefore propose using Genetic Programming to find a suitable expression composed of TF and IDF terms that maximizes the discrimination of such terms given a reduced bootstrapping set of examples labeled for each region (400). We present performance examples for the Spanish variations across the Americas and Spain.
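
The tension between TF and IDF described in the abstract can be illustrated with a minimal sketch. It is not the formula from the paper: it assumes each region's text is treated as a single document, and the function name `weighted_tf_idf` and the exponents `alpha` and `beta` are hypothetical weighting factors introduced only for illustration. The toy corpus is invented.

```python
import math
from collections import Counter


def weighted_tf_idf(region_tokens, alpha=1.0, beta=1.0):
    """Score each term per region as tf**alpha * idf**beta.

    region_tokens: dict mapping a region name to its list of tokens.
    alpha, beta: hypothetical non-negative weighting exponents
    (an assumption for this sketch, not taken from the paper).
    """
    n_regions = len(region_tokens)
    counts = {region: Counter(tokens) for region, tokens in region_tokens.items()}

    # Document frequency: in how many regions does each term occur?
    df = Counter()
    for term_counts in counts.values():
        df.update(term_counts.keys())

    scores = {}
    for region, term_counts in counts.items():
        total = sum(term_counts.values())
        scores[region] = {
            # Relative frequency raised to alpha, log-IDF raised to beta.
            term: (n / total) ** alpha * math.log(n_regions / df[term]) ** beta
            for term, n in term_counts.items()
        }
    return scores


# Invented toy corpus: one short text per region.
corpus = {
    "UK": "the cooker is in the kitchen".split(),
    "US": "the stove is in the kitchen".split(),
}
scores = weighted_tf_idf(corpus)
uk_ranked = sorted(scores["UK"], key=scores["UK"].get, reverse=True)
```

In this toy setup, terms shared by both regions receive an IDF of zero, so only the region-specific word keeps a positive score. With plain TF·IDF the balance between the two factors is fixed; exponents such as `alpha` and `beta` are one simple way to vary it, whereas the abstract proposes searching over a broader space of expressions with Genetic Programming.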