ESF Exploratory Workshop on Pattern Detection and Discovery, Sep 16-19, 2002, London, UK

Modeling Information in Textual Data Combining Labeled and Unlabeled Data


Abstract

The paper describes two approaches to modeling word normalization (such as replacing "wrote" or "writing" by "write") based on recurring patterns in two sources: the word suffix and the context of the word obtained from texts. To collect patterns, we first represent the data using two independent feature sets and then find the patterns responsible for a particular word mapping. The modeling is based on a set of hand-labeled words of the form (word, normalized word) and on texts from 28 novels obtained from the Web, which are used to get the word contexts. Since hand-labeling is a demanding task, we investigate the possibility of improving our modeling by gradually adding unlabeled examples. Namely, we use the initial model based on word suffix to predict the labels. We then enlarge the training set with the examples with predicted labels for which the model is most certain. The experiments show that this helps the context-based approach while largely hurting the suffix-based approach. To get an idea of the influence of the number of labeled rather than unlabeled examples, we give a comparison with the situation in which simply more labeled data is provided.
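The procedure described in the abstract (train a suffix-based model on hand-labeled pairs, label the unlabeled words, then add the most confidently labeled ones to the training set) can be illustrated with a minimal sketch. This is not the authors' model: the abstract does not specify the learner, so the suffix-count classifier, the encoding of a mapping as a (strip, add) pair, the confidence measure, and every function name and parameter below are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict

# Illustrative sketch only; the learner and all names here are assumptions,
# not the method used in the paper.

def transformation(word, lemma):
    """Encode (word, normalized word) as (suffix to strip, suffix to add)."""
    i = 0
    while i < min(len(word), len(lemma)) and word[i] == lemma[i]:
        i += 1
    return word[i:], lemma[i:]

def apply_transformation(word, label):
    """Apply a (strip, add) label to a word that ends with `strip`."""
    strip, add = label
    return word[:len(word) - len(strip)] + add

def train(pairs, max_suffix=4):
    """For each word suffix, count the transformations seen with it."""
    model = defaultdict(Counter)
    for word, lemma in pairs:
        label = transformation(word, lemma)
        for k in range(1, min(max_suffix, len(word)) + 1):
            model[word[-k:]][label] += 1
    return model

def predict(model, word, max_suffix=4):
    """Return (label, confidence) from the longest known suffix."""
    for k in range(min(max_suffix, len(word)), 0, -1):
        counts = model.get(word[-k:])
        if counts:
            label, n = counts.most_common(1)[0]
            return label, n / sum(counts.values())
    return None, 0.0

def self_train(labeled, unlabeled, rounds=3, per_round=2):
    """Repeatedly add the most confidently labeled unlabeled words."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)
        candidates = []
        for word in pool:
            label, confidence = predict(model, word)
            # Keep only predictions whose strip part actually fits the word.
            if label is not None and word.endswith(label[0]):
                candidates.append((confidence, word, label))
        candidates.sort(key=lambda c: c[0], reverse=True)
        chosen = candidates[:per_round]
        if not chosen:
            break
        for _conf, word, label in chosen:
            labeled.append((word, apply_transformation(word, label)))
            pool.remove(word)
    return train(labeled), labeled

# Toy data, invented for this example.
seed = [("writing", "write"), ("making", "make"),
        ("walked", "walk"), ("talked", "talk")]
model, enlarged = self_train(seed, ["taking", "jumped", "baking"])
```

Because the added labels are the model's own predictions, any systematic error in the initial model is fed back into training, which is one plausible reading of why the abstract reports that the technique can hurt as well as help.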
