Conference: International Conference on Applications of Natural Language to Information Systems

A System for Adaptive Information Extraction from Highly Informal Text

Abstract

We present a first version of ADO, a system for Adaptive Data Organization, that is, information extraction from highly informal text: short text messages, classified ads, tweets, etc. It is built on a modular architecture that transparently integrates off-the-shelf NLP tools, general string-processing and machine-learning procedures, and processes tailored to a domain. The system is called adaptive because it implements a semi-supervised approach: knowledge resources are initially built by hand and are then updated automatically with data fed from the corpus. This allows ADO to adapt to rapidly changing user-generated language. To estimate the impact of future developments, we have carried out a preliminary evaluation of the system on a small corpus of Spanish-language classified advertisements from the real estate domain. This evaluation shows that tokenization and chunking can be handled well by simple techniques, but normalization, morphosyntactic tagging, and semantic tagging require either more complex techniques or a larger training corpus.
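The abstract gives no implementation details, so the following is only a minimal Python sketch of the general idea it describes: a modular pipeline whose hand-built lexicon is extended automatically from the corpus. Every name here (tokenize, AdaptiveLexicon, run_pipeline, min_count) is hypothetical, and the update rule (promote a frequent unknown token when it is a prefix of exactly one known canonical form) is an assumption chosen for illustration, not ADO's actual method.

```python
import re
from collections import Counter
from typing import Callable, Dict, List, Set

Stage = Callable[[List[str]], List[str]]


def tokenize(text: str) -> List[str]:
    # Lowercase, then split into word-like runs and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())


class AdaptiveLexicon:
    """Hand-seeded variant -> canonical map that grows from corpus evidence.

    Hypothetical illustration only; not the resource format used by ADO.
    """

    def __init__(self, seed: Dict[str, str], canonical: Set[str], min_count: int = 2):
        self.entries = dict(seed)        # hand-built knowledge resource
        self.canonical = set(canonical)  # known canonical forms
        self.unknown: Counter = Counter()  # unseen tokens observed in the corpus
        self.min_count = min_count

    def normalize(self, tokens: List[str]) -> List[str]:
        out = []
        for tok in tokens:
            if tok in self.entries:
                out.append(self.entries[tok])
            else:
                if tok.isalpha() and tok not in self.canonical:
                    self.unknown[tok] += 1
                out.append(tok)
        return out

    def update_from_corpus(self) -> List[str]:
        # Assumed rule: a frequent unknown token of 3+ letters that is a prefix
        # of exactly one canonical form is treated as its abbreviation.
        learned = []
        for tok, count in list(self.unknown.items()):
            if count < self.min_count or len(tok) < 3:
                continue
            matches = [c for c in self.canonical if c.startswith(tok)]
            if len(matches) == 1:
                self.entries[tok] = matches[0]
                del self.unknown[tok]
                learned.append(tok)
        return sorted(learned)


def run_pipeline(text: str, stages: List[Stage]) -> List[str]:
    # Each stage maps a token list to a token list, so stages can be swapped.
    tokens = tokenize(text)
    for stage in stages:
        tokens = stage(tokens)
    return tokens


lexicon = AdaptiveLexicon(
    seed={"dorm": "dormitorio"},
    canonical={"dormitorio", "cochera"},
)
for ad in ["Vendo depto 2 dorm y coch", "Alquilo casa 3 dorm, coch doble"]:
    run_pipeline(ad, [lexicon.normalize])
print(lexicon.update_from_corpus())
print(run_pipeline("casa con coch", [lexicon.normalize]))
```

In this sketch the abbreviation "coch" is absent from the seed lexicon but appears in both sample ads, so the update step would map it to "cochera" and later ads would be normalized accordingly. A real system would need a far more careful promotion criterion than prefix matching.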
