Modern Artificial Intelligence and Cognitive Science Conference

COMPOUND SENTENCE SEGMENTATION AND SENTENCE BOUNDARY DETECTION IN URDU



Abstract

The raw Urdu corpus comprises irregular and long sentences that must be properly segmented to be useful in Natural Language Engineering (NLE). This makes Compound Sentence Segmentation (CSS) a timely and vital research topic. Existing online text-processing tools are developed mostly for computationally well-developed languages such as English, Japanese, and Spanish, where sentence segmentation is done largely on the basis of delimiters. Our proposed approach uses special characters as sentence delimiters, together with computationally extracted sentence-end letters and sentence-end words, as identifiers for segmenting long and compound sentences. The raw, un-annotated input text is passed through preprocessing and word segmentation. Urdu word segmentation is itself a complex task, involving knotty problems such as space insertion and space deletion. Main and subordinate clauses are identified and marked for subsequent processing. The resultant text is then further processed to identify, extract, and segment long as well as compound sentences into regular Urdu sentences. Urdu computational research is in its infancy; our work is pioneering in Urdu CSS, and the results achieved by our proposed approach are promising. For experimentation, we used a general-genre raw Urdu corpus containing 2,616 sentences and 291,503 words. We achieved a 34% improvement, reducing the average sentence length from 111 to 38 words per sentence (w/s). This nearly tripled the number of sentences, yielding 7,536 shorter and computationally easier-to-manage sentences. The reliability and coherence of the resultant text were verified by Urdu language experts.
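The two-pass idea described in the abstract, splitting first on explicit delimiter characters and then on extracted sentence-end words, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `END_WORDS` list is a hypothetical stand-in for the sentence-end words the paper extracts computationally from the corpus.

```python
import re

# Common Urdu sentence delimiters: full stop "۔", question mark "؟", "!".
DELIMITERS = "۔؟!"

# Hypothetical sentence-end words (e.g. common verb endings). The paper
# derives these from the corpus; this fixed list is illustrative only.
END_WORDS = {"ہے", "تھا", "گا"}

def segment(text):
    """Two-pass segmentation: split on delimiter characters, then split
    long clauses after a known sentence-end word followed by more text."""
    # Pass 1: split on explicit delimiters, keeping non-empty pieces.
    pieces = [p.strip() for p in re.split(f"[{DELIMITERS}]", text) if p.strip()]
    sentences = []
    for piece in pieces:
        words = piece.split()
        start = 0
        for i, w in enumerate(words):
            # Pass 2: treat an end word as an implicit sentence boundary
            # when it is not already the last word of the piece.
            if w in END_WORDS and i + 1 < len(words):
                sentences.append(" ".join(words[start:i + 1]))
                start = i + 1
        sentences.append(" ".join(words[start:]))
    return sentences
```

A compound sentence such as "وہ گھر گیا تھا وہ خوش ہے۔" is first stripped of the trailing "۔" and then split after the end word "تھا", yielding two shorter sentences, which mirrors the reduction in average sentence length the paper reports.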
