Modeling Information in Textual Data Combining Labeled and Unlabeled Data


Abstract

The paper describes two approaches to modeling word normalization (such as replacing "wrote" or "writing" by "write") based on recurring patterns in two sources: the word suffix and the context of the word obtained from texts. To collect patterns, we first represent the data using two independent feature sets and then find the patterns responsible for a particular word mapping. The modeling is based on a set of hand-labeled words of the form (word, normalized word) and on texts from 28 novels obtained from the Web, which are used to get the word context. Since hand-labeling is a demanding task, we investigate the possibility of improving our modeling by gradually adding unlabeled examples. Namely, we use the initial model based on word suffix to predict the labels. Then we enlarge the training set with the examples whose predicted labels the model is most certain about. The experiments show that this helps the context-based approach while largely hurting the suffix-based approach. To gauge the effect of adding labeled rather than unlabeled examples, we give a comparison with the situation in which more labeled data is simply provided.
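
The loop described in the abstract (train on labeled pairs, predict labels for unlabeled words, keep the most confident predictions, retrain) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the paper's method: the abstract does not name the learner, so a simple suffix vote stands in for it, and `MAX_SUFFIX`, `suffix_rule`, `train`, `predict`, `self_train`, `rounds`, and `per_round` are hypothetical names. The label for each pair is encoded as a (suffix to strip, suffix to append) rule.

```python
from collections import Counter, defaultdict

MAX_SUFFIX = 4  # assumed cap on the suffix length used as a feature


def suffix_rule(word, lemma):
    """Encode a (word, normalized word) pair as (suffix to strip, suffix to append)."""
    i = 0
    while i < min(len(word), len(lemma)) and word[i] == lemma[i]:
        i += 1
    return word[i:], lemma[i:]


def apply_rule(word, rule):
    """Strip the first part of the rule from the word's end, then append the second."""
    strip, append = rule
    return word[: len(word) - len(strip)] + append


def train(pairs):
    """Count, for every word-final suffix, which rules it was seen with."""
    counts = defaultdict(Counter)
    for word, lemma in pairs:
        rule = suffix_rule(word, lemma)
        for k in range(1, min(MAX_SUFFIX, len(word)) + 1):
            counts[word[-k:]][rule] += 1
    return counts


def predict(counts, word):
    """Return (normalized word, confidence) using the longest known suffix, else None."""
    for k in range(min(MAX_SUFFIX, len(word)), 0, -1):
        votes = counts.get(word[-k:])
        if not votes:
            continue
        rule, n = votes.most_common(1)[0]
        if word.endswith(rule[0]):  # the rule must be applicable to this word
            return apply_rule(word, rule), n / sum(votes.values())
    return None


def self_train(labeled, unlabeled, rounds=3, per_round=2):
    """Grow the training set with the most confidently predicted unlabeled words."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        counts = train(labeled)
        scored = []
        for word in pool:
            guess = predict(counts, word)
            if guess is not None:
                scored.append((guess[1], word, guess[0]))
        if not scored:
            break
        scored.sort(key=lambda t: -t[0])  # most confident first
        for _conf, word, lemma in scored[:per_round]:
            labeled.append((word, lemma))
            pool.remove(word)
    return train(labeled), labeled


seed = [("writing", "write"), ("making", "make"),
        ("walked", "walk"), ("talked", "talk")]
model, grown = self_train(seed, ["taking", "jumped", "cats"])
```

In this toy run the seed pairs let the model label "taking" and "jumped", while "cats" stays unlabeled because none of its suffixes was seen. Since predicted labels are never checked, a wrong early prediction is reinforced in later rounds, which is one plausible way such a loop can hurt a model.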

Bibliographic details

Source: ESF Exploratory Workshop on Pattern Detection and Discovery, Sep 16-19, 2002, London, UK | 2002 | pp. 170-179 | 10 pages
Conference location: London (GB)
Author: Dunja Mladenic
Author affiliation: J. Stefan Institute, Ljubljana, Slovenia and Carnegie Mellon University, Pittsburgh, USA
Original format: PDF
Language: eng
Chinese Library Classification: Radio electronics, telecommunications technology

Similar literature

1. Combining labeled and unlabeled data with graph embedding [J]. Haitao Zhao. Neurocomputing. 2006, nos. 16-18
2. A method for training lane detection models using unlabeled data and unaligned labels [J]. Research Disclosure. 2020, no. 674
3. Using Labeled and Unlabeled Data for Probabilistic Modeling of Face Orientation [J]. Shumeet Baluja. International Journal of Pattern Recognition and Artificial Intelligence. 2000, no. 8
4. Modeling Information in Textual Data Combining Labeled and Unlabeled Data [C]. Dunja Mladenic. ESF Exploratory Workshop on Pattern Detection and Discovery. 2002
5. Combining labeled and unlabeled data in statistical natural language parsing [D]. Sarkar, Anoop. 2002
6. A combined approach to data mining of textual and structured data to identify cancer-related targets [O]. Pavel Pospisil, Lakshmanan K Iyer, S James Adelstein. 2006
7. Combining Labeled and Unlabeled Data with Word-Class Distribution Learning [O]. Yanjun Qi, Koray Kavukcuoglu, Ronan Collobert. 2010
8. Cognitive Study of Learning with Labeled and Unlabeled Data [R]. Zhu, X., Rogers, T. T. 2012
