Modeling Information in Textual Data Combining Labeled and Unlabeled Data


Abstract

The paper describes two approaches to modeling word normalization (such as replacing "wrote" or "writing" with "write") based on recurring patterns in two sources: the word suffix and the context of the word obtained from texts. To collect patterns, we first represent the data using two independent feature sets and then find the patterns responsible for a particular word mapping. The modeling is based on a set of hand-labeled words of the form (word, normalized word) and on texts from 28 novels obtained from the Web, which are used to obtain word contexts. Since hand-labeling is a demanding task, we investigate the possibility of improving our modeling by gradually adding unlabeled examples. Namely, we use the initial model based on the word suffix to predict the labels. We then enlarge the training set with the examples whose predicted labels the model is most certain about. The experiments show that this helps the context-based approach while largely hurting the suffix-based approach. To get an idea of the influence of using labeled rather than unlabeled examples, we give a comparison with the situation in which more labeled data is simply provided.
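
The procedure of enlarging the training set with confidently predicted examples can be sketched as a small self-training loop. This is a minimal illustration and not the paper's implementation: the suffix-vote model, the confidence score, and all names (`transformation`, `train`, `predict`, `self_train`, `max_suffix`, `per_round`) are assumptions introduced here, and the context-based feature set is omitted.

```python
from collections import Counter, defaultdict


def transformation(word, lemma):
    """Encode a (word, normalized word) pair as a label: (suffix to strip, suffix to add)."""
    i = 0
    while i < min(len(word), len(lemma)) and word[i] == lemma[i]:
        i += 1
    return (word[i:], lemma[i:])


def apply_transformation(word, label):
    """Apply a (strip, add) label to a word; leave the word unchanged if it does not fit."""
    strip, add = label
    if strip and not word.endswith(strip):
        return word
    return word[:len(word) - len(strip)] + add


def train(pairs, max_suffix=4):
    """For every word suffix of up to max_suffix characters, count the labels seen with it."""
    counts = defaultdict(Counter)
    for word, lemma in pairs:
        label = transformation(word, lemma)
        for k in range(1, min(max_suffix, len(word)) + 1):
            counts[word[-k:]][label] += 1
    return counts


def predict(counts, word, max_suffix=4):
    """Return (label, confidence) using the longest suffix of the word seen in training."""
    for k in range(min(max_suffix, len(word)), 0, -1):
        votes = counts.get(word[-k:])
        if votes:
            label, n = votes.most_common(1)[0]
            # Assumed confidence score: smoothed share of the winning label.
            return label, n / (sum(votes.values()) + 1)
    return ("", ""), 0.0  # unseen suffix: predict "no change" with zero confidence


def self_train(labeled, unlabeled, rounds=3, per_round=2):
    """Repeatedly add the most confidently labeled unlabeled words to the training set."""
    train_set = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        model = train(train_set)
        scored = sorted(((predict(model, w), w) for w in pool),
                        key=lambda item: item[0][1], reverse=True)
        chosen = scored[:per_round]
        for (label, _conf), w in chosen:
            train_set.append((w, apply_transformation(w, label)))
        chosen_words = {w for _, w in chosen}
        pool = [w for w in pool if w not in chosen_words]
    return train(train_set), train_set


if __name__ == "__main__":
    labeled = [("writing", "write"), ("making", "make"),
               ("walked", "walk"), ("talked", "talk")]
    model, enlarged = self_train(labeled, ["asked", "taking"], rounds=1, per_round=1)
    print(enlarged[-1])
```

Because each round trusts its own predictions, a wrong but confident label is fed back into training, which is one plausible way for such a loop to hurt a model, as the abstract reports for the suffix-based approach.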

Bibliographic record

Source: ESF Exploratory Workshop on Pattern Detection and Discovery, 2002, 10 pages
Author: Dunja Mladenic
Original format: PDF
Chinese Library Classification: TN971.5

Similar literature

1. Combining labeled and unlabeled data with graph embedding [J]. Haitao Zhao. Neurocomputing. 2006, No. 16-18
2. A method for training lane detection models using unlabeled data and unaligned labels [J]. Research Disclosure. 2020, No. 674
3. Using Labeled and Unlabeled Data for Probabilistic Modeling of Face Orientation [J]. Shumeet Baluja. International Journal of Pattern Recognition and Artificial Intelligence. 2000, No. 8
4. Modeling Information in Textual Data Combining Labeled and Unlabeled Data [C]. Dunja Mladenic. ESF Exploratory Workshop on Pattern Detection and Discovery, Sep 16-19, 2002, London, UK. 2002
5. Combining labeled and unlabeled data in statistical natural language parsing [D]. Sarkar, Anoop. 2002
6. A combined approach to data mining of textual and structured data to identify cancer-related targets [O]. Pavel Pospisil, Lakshmanan K Iyer, S James Adelstein. 2006
7. Combining Labeled and Unlabeled Data with Word-Class Distribution Learning [O]. Yanjun Qi, Koray Kavukcuoglu, Ronan Collobert. 2010
8. Cognitive Study of Learning with Labeled and Unlabeled Data [R]. Zhu, X., Rogers, T. T. 2012
