An Information-Extraction System for Urdu-A Resource-Poor Language

SMRUTHI MUKUND; ROHINI SRIHARI; ERIK PETERSON

Abstract

There has been an increase in the amount of multilingual text on the Internet due to the proliferation of news sources and blogs. The Urdu language, in particular, has experienced explosive growth on the Web. Text mining for information discovery, which includes tasks such as identifying topics, relationships, and events, and sentiment analysis, requires sophisticated natural language processing (NLP). NLP systems begin with modules such as word segmentation, part-of-speech tagging, and morphological analysis and progress to modules such as shallow parsing and named entity tagging. While there have been considerable advances in developing such comprehensive NLP systems for English, the work for Urdu is still in its infancy. The tasks of interest in Urdu NLP include analyzing data sources such as blogs and comments on news articles to provide insight into social and human behavior. All of this requires a robust NLP system. The objective of this work is to develop an NLP infrastructure for Urdu that is customizable and capable of providing basic analysis on which more advanced information extraction tools can be built. This system assimilates resources from various online sources to facilitate improved named entity tagging and Urdu-to-English transliteration. The annotated data required to train the learning models used here is acquired by standardizing the currently limited resources available for Urdu. Techniques such as bootstrap learning and resource sharing from a syntactically similar language, Hindi, are explored to augment the available annotated Urdu data. Each of the new Urdu text processing modules has been integrated into a general text-mining platform. The evaluations performed demonstrate that the accuracies have either met or exceeded the state of the art.

Bibliographic details

Source
ACM Transactions on Asian Language Information Processing, 2010, Issue 4, pp. 61-103 (43 pages)
Authors
SMRUTHI MUKUND; ROHINI SRIHARI; ERIK PETERSON
Author affiliations

State University of New York at Buffalo;

State University of New York at Buffalo;

Janya, Inc.
Original format: PDF
Language: English
Keywords
Urdu natural language processing; named entity tagging; part of speech tagging; shallow parsing; transliterations; bootstrap learning; text mining

