Conference: ESF Exploratory Workshop on Pattern Detection and Discovery

Modeling Information in Textual Data Combining Labeled and Unlabeled Data


Abstract

The paper describes two approaches to modeling word normalization (such as replacing "wrote" or "writing" by "write") based on recurring patterns in two sources: the word suffix and the context of the word obtained from texts. To collect patterns, we first represent the data using two independent feature sets and then find the patterns responsible for a particular word mapping. The modeling is based on a set of hand-labeled words of the form (word, normalized word) and on texts from 28 novels obtained from the Web, which are used to obtain word contexts. Since hand-labeling is a demanding task, we investigate the possibility of improving our modeling by gradually adding unlabeled examples. Namely, we use the initial model based on word suffix to predict the labels. Then we enlarge the training set with the examples with predicted labels for which the model is most certain. The experiments show that this helps the context-based approach while largely hurting the suffix-based approach. To get an idea of the influence of the number of labeled rather than unlabeled examples, we give a comparison with the situation in which simply more labeled data is provided.
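The self-labeling loop described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the features, the learner, or the certainty measure, so everything here is an assumption made for illustration: the rule representation (ending to strip, ending to add), the `MAX_SUFFIX` cap, the frequency-based confidence, and the function names `rule_for`, `train`, `predict`, and `self_train`. It shows only the suffix-based side and is not the authors' implementation.

```python
from collections import Counter, defaultdict

MAX_SUFFIX = 4  # assumption: longest word ending used as a feature


def rule_for(word, normalized):
    """Express a (word, normalized word) pair as (ending to strip, ending to add)."""
    i = 0
    while i < min(len(word), len(normalized)) and word[i] == normalized[i]:
        i += 1
    return word[i:], normalized[i:]


def apply_rule(word, rule):
    """Apply a rule, or return None if the word lacks the ending to strip."""
    strip, add = rule
    if not word.endswith(strip):
        return None
    return word[:len(word) - len(strip)] + add


def train(pairs):
    """Count, for every word suffix, the rules it was seen with."""
    model = defaultdict(Counter)
    for word, normalized in pairs:
        rule = rule_for(word, normalized)
        for k in range(1, min(MAX_SUFFIX, len(word)) + 1):
            model[word[-k:]][rule] += 1
    return model


def predict(model, word):
    """Return (normalized word, confidence) using the longest known suffix."""
    for k in range(min(MAX_SUFFIX, len(word)), 0, -1):
        counts = model.get(word[-k:])
        if not counts:
            continue
        total = sum(counts.values())
        for rule, n in counts.most_common():
            out = apply_rule(word, rule)
            if out is not None:
                return out, n / total  # share of examples backing this rule
    return word, 0.0  # unseen suffix: leave the word unchanged, no confidence


def self_train(labeled, unlabeled, rounds=3, per_round=2):
    """Repeatedly add the most confidently predicted unlabeled words."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)
        scored = sorted(((predict(model, w), w) for w in pool),
                        key=lambda item: item[0][1], reverse=True)
        chosen = [(w, out) for (out, conf), w in scored[:per_round] if conf > 0]
        if not chosen:
            break
        labeled.extend(chosen)
        picked = {w for w, _ in chosen}
        pool = [w for w in pool if w not in picked]
    return train(labeled), labeled
```

Calling `self_train` with a few hand-labeled pairs and a list of unlabeled words returns the retrained model together with the enlarged training set. Note that the abstract reports this kind of enlargement helping the context-based approach while largely hurting the suffix-based one, so the sketch shows the mechanism and makes no claim about its benefit.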
