Modern Artificial Intelligence and Cognitive Science Conference

COMPOUND SENTENCE SEGMENTATION AND SENTENCE BOUNDARY DETECTION IN URDU



Abstract

The raw Urdu corpus comprises irregular and long sentences that must be properly segmented to be useful in Natural Language Engineering (NLE). This makes Compound Sentence Segmentation (CSS) a timely and vital research topic. Existing online text-processing tools are developed mostly for computationally well-developed languages such as English, Japanese, and Spanish, where sentence segmentation is done largely on the basis of delimiters. Our proposed approach uses special characters as sentence delimiters, together with computationally extracted sentence-end letters and sentence-end words, as identifiers for segmenting long and compound sentences. The raw, un-annotated input text is passed through preprocessing and word segmentation. Urdu word segmentation is itself a complex task, involving knotty problems such as space insertion and space deletion. Main and subordinate clauses are identified and marked for subsequent processing. The resultant text is then further processed to identify, extract, and segment long as well as compound sentences into regular Urdu sentences. Urdu computational research is in its infancy; our work is pioneering in Urdu CSS, and the results achieved by our proposed approach are promising. For experimentation, we used a general-genre raw Urdu corpus containing 2,616 sentences and 291,503 words. We achieved a 34% improvement, reducing the average sentence length from 111 to 38 words per sentence (w/s). This nearly tripled the number of sentences, yielding 7,536 shorter and computationally easier-to-manage sentences. The reliability and coherence of the resultant text were verified by Urdu language experts.
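The two-pass idea described in the abstract, splitting first on explicit delimiter characters and then on extracted sentence-end words, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `END_WORDS` list is a hypothetical stand-in for the sentence-end words the paper extracts computationally from the corpus.

```python
import re

# Common Urdu sentence delimiters: full stop "۔", question mark "؟", "!".
DELIMITERS = "۔؟!"

# Hypothetical sentence-end words (e.g. common verb endings). The paper
# derives these from the corpus; this fixed list is illustrative only.
END_WORDS = {"ہے", "تھا", "گا"}

def segment(text):
    """Two-pass segmentation: split on delimiter characters, then split
    long clauses after a known sentence-end word followed by more text."""
    # Pass 1: split on explicit delimiters, keeping non-empty pieces.
    pieces = [p.strip() for p in re.split(f"[{DELIMITERS}]", text) if p.strip()]
    sentences = []
    for piece in pieces:
        words = piece.split()
        start = 0
        for i, w in enumerate(words):
            # Pass 2: treat an end word as an implicit sentence boundary
            # when it is not already the last word of the piece.
            if w in END_WORDS and i + 1 < len(words):
                sentences.append(" ".join(words[start:i + 1]))
                start = i + 1
        sentences.append(" ".join(words[start:]))
    return sentences
```

A compound sentence such as "وہ گھر گیا تھا وہ خوش ہے۔" is first stripped of the trailing "۔" and then split after the end word "تھا", yielding two shorter sentences, which mirrors the reduction in average sentence length the paper reports.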
