Turkish lexicon expansion by using finite state automata

MUSTAFA BURAK ÖZTÜRK; BURCU CAN BUĞLALILAR

Abstract

Turkish is an agglutinative language with rich morphology. A Turkish verb can have thousands of different word forms. Therefore, sparsity becomes an issue in many Turkish natural language processing (NLP) applications. This article presents a model for Turkish lexicon expansion. We aimed to expand the lexicon with a morphological segmentation system by reversing the segmentation task into a generation task. Our model uses finite-state automata (FSA) to incorporate orthographic features and morphotactic rules. We extracted orthographic features by capturing the phonological operations that are applied to a word whenever a suffix is added. Each FSA state corresponds to either a stem category or a suffix category. Stems are clustered by part of speech (i.e., noun, verb, or adjective), and suffixes are clustered by their allomorphic features. From only a few thousand Turkish stems, we generated approximately 1 million word forms with an accuracy of 82.36%, which will help to reduce the number of out-of-vocabulary words in other NLP applications. Although our experiments were performed on Turkish, the same model is also applicable to other agglutinative languages such as Hungarian and Finnish.
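The generation idea in the abstract can be illustrated with a toy sketch. It is not the authors' automaton: the states (`NOUN`, `PLURAL`, `LOCATIVE`), the two suffix functions, and the transition table are hypothetical choices made for illustration, and the phonology is reduced to two well-known Turkish alternations (back/front vowel harmony and d/t voicing assimilation). Each state stands for a stem or suffix category, and each transition attaches the allomorph that fits the word built so far.

```python
# Toy sketch only: state names, suffix inventory, and rules are assumptions,
# not the categories or rules used in the paper.
VOWELS = set("aeıioöuü")
BACK_VOWELS = set("aıou")
VOICELESS = set("fstkçşhp")


def last_vowel(word):
    """Return the last vowel of a word (used for vowel harmony)."""
    for ch in reversed(word):
        if ch in VOWELS:
            return ch
    raise ValueError(f"no vowel in {word!r}")


def plural(word):
    """Attach the plural allomorph -lar/-ler chosen by vowel harmony."""
    return word + ("lar" if last_vowel(word) in BACK_VOWELS else "ler")


def locative(word):
    """Attach the locative allomorph -da/-de/-ta/-te."""
    consonant = "t" if word[-1] in VOICELESS else "d"
    vowel = "a" if last_vowel(word) in BACK_VOWELS else "e"
    return word + consonant + vowel


# Hypothetical morphotactics: state -> list of (suffix function, next state).
TRANSITIONS = {
    "NOUN": [(plural, "PLURAL"), (locative, "LOCATIVE")],
    "PLURAL": [(locative, "LOCATIVE")],
    "LOCATIVE": [],
}


def generate(stem, start="NOUN"):
    """Walk every path of the automaton from `start` and collect word forms."""
    forms = []
    stack = [(stem, start)]
    while stack:
        word, state = stack.pop()
        for attach, next_state in TRANSITIONS[state]:
            new_word = attach(word)
            forms.append(new_word)
            stack.append((new_word, next_state))
    return sorted(forms)


if __name__ == "__main__":
    for stem in ("ev", "kitap", "okul"):
        print(stem, generate(stem))
```

With this table, each noun stem yields three forms (plural, locative, and plural plus locative). A realistic lexicon expander would need many more suffix categories and phonological operations, which is what lets a few thousand stems grow into roughly a million forms.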

Bibliographic details

Source: Turkish Journal of Electrical Engineering and Computer Sciences, 2019, Issue 2, 16 pages
Authors: MUSTAFA BURAK ÖZTÜRK; BURCU CAN BUĞLALILAR
Original format: PDF
Classification (CLC): Industrial economics
Keywords: morphology; lexicon expansion; morphological generation; finite-state automata

