Source: Journal of Emerging Technologies in Web Intelligence

Automatic Text Summarization System for Punjabi Language


Abstract

This paper concentrates on a single-document, multi-news Punjabi extractive summarizer. Although a lot of research is going on in the field of multi-document news summarization, not a single paper was found in the literature on single-document multi-news summarization for any language. This is the first time such a system has been developed for the Punjabi language, and it is available online at http://pts.learnpunjabi.org/. Punjab is one of the Indian states, and Punjabi is its official language. Punjabi is an under-resourced language. Various linguistic resources for Punjabi were also developed for the first time as part of this project, such as a Punjabi noun morph, a Punjabi stemmer, Punjabi named entity recognition, Punjabi keyword identification, and normalization of Punjabi nouns. A Punjabi document (such as a single page of a Punjabi e-newspaper) can contain hundreds of news items of varying length. Based on the compression ratio selected by the user, the system starts by extracting the headline of each news item, the lines just after the headlines, and other lines according to their importance. Sentences are selected on the basis of their statistical and linguistic features. The system comprises two main phases: pre-processing and processing. The pre-processing phase represents the Punjabi text in a structured way. In the processing phase, the features that decide the importance of sentences are determined and calculated. The statistical features include Punjabi keyword identification, relative sentence length, and numbered data. The linguistic features used to select important sentences are: Punjabi headline identification, identification of lines just after headlines, identification of Punjabi nouns, identification of Punjabi proper nouns, identification of common English-Punjabi nouns, identification of Punjabi cue phrases, and identification of title keywords in sentences. Sentence scores are determined from a sentence-feature-weight equation.
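
The abstract does not spell out the sentence-feature-weight equation. A minimal sketch, assuming the score is a weighted sum of feature values, might look like the following. The weight values, the function names, and the exact definitions of the two statistical features (relative sentence length and numbered data) are illustrative assumptions, not the authors' implementation.

```python
import re
from typing import List

# Hypothetical weights for illustration only; the paper learns its
# weights by regression on manually summarized documents.
WEIGHTS = {"relative_length": 0.5, "numbered_data": 1.0}


def relative_length(sentence: str, longest_word_count: int) -> float:
    """Word count of the sentence divided by that of the longest sentence."""
    if longest_word_count == 0:
        return 0.0
    return len(sentence.split()) / longest_word_count


def numbered_data(sentence: str) -> float:
    """1.0 if the sentence contains a digit, else 0.0 (assumed binary form).

    In Python 3, \\d on str patterns matches any Unicode decimal digit,
    so Gurmukhi digits are detected as well as ASCII ones.
    """
    return 1.0 if re.search(r"\d", sentence) else 0.0


def score_sentences(sentences: List[str]) -> List[float]:
    """Score each sentence as the weighted sum of its feature values."""
    longest = max((len(s.split()) for s in sentences), default=0)
    scores = []
    for s in sentences:
        features = {
            "relative_length": relative_length(s, longest),
            "numbered_data": numbered_data(s),
        }
        scores.append(sum(WEIGHTS[name] * value for name, value in features.items()))
    return scores
```

The linguistic features listed above (headlines, nouns, cue phrases, title keywords) would enter the same sum as further entries in `features`, each with its own weight.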
Feature weights are determined using mathematical regression: the feature values of manually summarized Punjabi documents are treated as independent input values, and the corresponding dependent output values are provided. In the training phase, manual summaries of fifty news documents are made by assigning fuzzy scores to the sentences of those documents; regression is then applied to find the feature weights, and the average values of the feature weights are calculated. High-scoring sentences are selected for the final summary. Coherence in the final summary is maintained by ordering the selected sentences as they appear in the input text, at the selected compression ratio. This extractive Punjabi summarizer is available online.
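
The training and selection steps can be sketched as below. This is a sketch under stated assumptions: the abstract says only "mathematical regression", so ordinary least squares without an intercept is assumed; the manually assigned fuzzy sentence scores are assumed to be the dependent values; and the compression ratio is assumed to be the fraction of sentences kept. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def fit_weights(features: Sequence[Sequence[float]], targets: Sequence[float]) -> List[float]:
    """Least-squares weights w minimizing sum((features[i] . w - targets[i]) ** 2).

    Solves the normal equations (X^T X) w = X^T y by Gauss-Jordan elimination.
    """
    k = len(features[0])
    # Augmented matrix [X^T X | X^T y].
    a = [
        [sum(row[i] * row[j] for row in features) for j in range(k)]
        + [sum(row[i] * y for row, y in zip(features, targets))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting: use the row with the largest entry in this column.
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("features are linearly dependent; weights are not unique")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def average_weights(per_document: Sequence[Sequence[float]]) -> List[float]:
    """Average the weight vectors fitted on individual training documents."""
    n = len(per_document)
    return [sum(w[i] for w in per_document) / n for i in range(len(per_document[0]))]


def select_summary(sentences: Sequence[str], scores: Sequence[float], ratio: float) -> List[str]:
    """Keep the top-scoring fraction of sentences, restored to input order."""
    keep = max(1, math.ceil(len(sentences) * ratio)) if sentences else 0
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:keep])]
```

In this sketch, `fit_weights` would be called once per manually summarized document, `average_weights` would combine the resulting vectors, and `select_summary` would apply the user's ratio to the scored sentences of a new document.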
