Journal: ACM Transactions on Asian Language Information Processing

An Information-Extraction System for Urdu-A Resource-Poor Language

Abstract

There has been an increase in the amount of multilingual text on the Internet due to the proliferation of news sources and blogs. The Urdu language, in particular, has experienced explosive growth on the Web. Text mining for information discovery, which includes tasks such as identifying topics, relationships and events, and sentiment analysis, requires sophisticated natural language processing (NLP). NLP systems begin with modules such as word segmentation, part-of-speech tagging, and morphological analysis and progress to modules such as shallow parsing and named entity tagging. While there have been considerable advances in developing such comprehensive NLP systems for English, the work for Urdu is still in its infancy. The tasks of interest in Urdu NLP include analyzing data sources such as blogs and comments on news articles to provide insight into social and human behavior. All of this requires a robust NLP system. The objective of this work is to develop an NLP infrastructure for Urdu that is customizable and capable of providing basic analysis on which more advanced information extraction tools can be built. This system assimilates resources from various online sources to facilitate improved named entity tagging and Urdu-to-English transliteration. The annotated data required to train the learning models used here is acquired by standardizing the currently limited resources available for Urdu. Techniques such as bootstrap learning and resource sharing from a syntactically similar language, Hindi, are explored to augment the available annotated Urdu data. Each of the new Urdu text processing modules has been integrated into a general text-mining platform. The evaluations performed demonstrate that the accuracies have either met or exceeded the state of the art.
