ACM Transactions on Asian and Low-Resource Language Information Processing

Wasf-Vec: Topology-based Word Embedding for Modern Standard Arabic and Iraqi Dialect Ontology


Abstract

Word clustering is a serious challenge in low-resource languages. Since words that share semantics are expected to cluster together, it is common to use a feature-vector representation generated by a distributional-theory-based word embedding method. The goal of this work is to utilize Modern Standard Arabic (MSA) to improve clustering of the low-resource Iraqi dialect vocabulary. We began with a new Dialect Fast Stemming Algorithm (DFSA) that utilizes the MSA data; the proposed algorithm achieved 0.85 accuracy as measured by the F1 score. Then, the distributional-theory-based word embedding method and a new simple yet effective feature vector, named Wasf-Vec word embedding, are tested. The Wasf-Vec word representation utilizes a word's topological features. The difference between Wasf-Vec and distributional-theory-based word embedding is that Wasf-Vec captures relations that are not contextually based. The embedding is followed by an analysis of how the dialect words cluster among other MSA words. The analysis is based on word semantic relations that are well supported by solid linguistic theories, shedding light on the strong and weak word-relation representations identified by each embedding method. The analysis is handled by visualizing the feature vectors in two-dimensional (2D) space: the feature vectors of the distributional-theory-based word embedding method are projected into 2D space using the t-SNE algorithm, while the Wasf-Vec feature vectors are plotted directly in 2D space. A word's nearest neighbors and the distance histograms of the plotted words are examined. To validate the word classification used in this article, the produced classes are employed in Class-based Language Modeling (CBLM). Wasf-Vec CBLM achieved 7% lower perplexity (pp) than the distributional-theory-based word embedding CBLM. This result is significant when working with low-resource languages.
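The CBLM evaluation described above can be illustrated with a minimal sketch. A class-based bigram model factors the word bigram probability as P(w_i | w_{i-1}) = P(c(w_i) | c(w_{i-1})) · P(w_i | c(w_i)), and perplexity is the exponentiated average negative log-probability over the test bigrams. The word-to-class map, class names, and probabilities below are all hypothetical toy values; in the paper the classes come from Wasf-Vec or distributional-embedding clustering.

```python
import math

# Hypothetical word-to-class map produced by some clustering step.
word_class = {"cat": "ANIMAL", "dog": "ANIMAL", "runs": "VERB", "sleeps": "VERB"}

# Class-based bigram factorization:
#   P(w_i | w_{i-1}) = P(c(w_i) | c(w_{i-1})) * P(w_i | c(w_i))
class_bigram = {("ANIMAL", "VERB"): 0.9, ("VERB", "ANIMAL"): 0.1}
word_given_class = {"cat": 0.5, "dog": 0.5, "runs": 0.6, "sleeps": 0.4}

def perplexity(tokens):
    """Perplexity of a token sequence under the class-based bigram model."""
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = class_bigram[(word_class[prev], word_class[cur])] * word_given_class[cur]
        log_prob += math.log(p)
    n = len(tokens) - 1  # number of bigram predictions
    return math.exp(-log_prob / n)

print(perplexity(["cat", "runs"]))  # 1 / (0.9 * 0.6) ≈ 1.852
```

Lower perplexity indicates that the class assignments help the model predict the next word, which is how the reported 7% improvement of Wasf-Vec classes over distributional-embedding classes is measured.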