Taming the Unstructured: Creating Structured Content from Partially Labeled Schematic Text Sequences

Abstract

Numerous data sources, such as classified ads in online newspapers, electronic product catalogs, and postal addresses, are rife with unstructured text content. Typically, such content is characterized by attribute value sequences that share a common schema. In addition, each sequence is unstructured free text without any separators between the attribute values. Hidden Markov Models (HMMs) have been used to create structured content from such text sequences by identifying and extracting the attribute values occurring in them. Existing HMM-based approaches to creating structured content from text sequences use either completely labeled or completely unlabeled training data. The HMMs resulting from these two dominant approaches present contrasting trade-offs between labeling effort and the recall/precision of the extracted attribute values. In this paper we propose an HMM-based algorithm that uses partially labeled training data to create structured content from text sequences. By exploiting the observation that partially labeled sequences give rise to independent subsequences, we compose the HMMs corresponding to these subsequences to create structured content from the complete sequence. An interesting aspect of our approach is that it gives rise to a family of HMMs spanning the trade-off spectrum. We present experimental evidence of the effectiveness of our algorithm on real-life data sets and demonstrate that it is indeed possible to bootstrap structured content creation from schematic text data sources using HMMs that require limited labeling effort, without compromising on recall and precision.
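
As an illustration of the general idea only, and not the paper's algorithm, the sketch below decodes a toy address with a standard Viterbi pass over an HMM whose states are attribute names. The optional `fixed` argument clamps tokens whose label is already known; by the Markov property, a clamped position makes the stretches on either side of it independent of each other, which is one way to read the "independent subsequences" observation. All state names, probabilities, the `floor` smoothing value, and the `viterbi` helper itself are assumptions made up for this example.

```python
import math

def viterbi(tokens, states, start_p, trans_p, emit_p, fixed=None, floor=1e-6):
    """Return the most likely state per token.

    fixed maps a token position to a known state (a partial label).
    Unseen tokens get the small emission probability `floor` (an assumption).
    """
    fixed = fixed or {}

    def allowed(i):
        # A labeled position admits only its known state.
        return [fixed[i]] if i in fixed else states

    def emit(state, token):
        return math.log(emit_p[state].get(token, floor))

    # best[i][s] = (log-probability of the best path ending in s at i, previous state)
    best = [{s: (math.log(start_p[s]) + emit(s, tokens[0]), None) for s in allowed(0)}]
    for i in range(1, len(tokens)):
        layer = {}
        for s in allowed(i):
            score, prev = max(
                (best[i - 1][p][0] + math.log(trans_p[p][s]), p) for p in best[i - 1]
            )
            layer[s] = (score + emit(s, tokens[i]), prev)
        best.append(layer)

    # Backtrack from the best final state.
    state = max(best[-1], key=lambda s: best[-1][s][0])
    path = [state]
    for i in range(len(tokens) - 1, 0, -1):
        state = best[i][state][1]
        path.append(state)
    return path[::-1]

# Toy model (hypothetical values chosen for illustration).
STATES = ["number", "street", "city"]
START = {"number": 0.8, "street": 0.1, "city": 0.1}
TRANS = {
    "number": {"number": 0.1, "street": 0.8, "city": 0.1},
    "street": {"number": 0.05, "street": 0.55, "city": 0.4},
    "city": {"number": 0.05, "street": 0.05, "city": 0.9},
}
EMIT = {
    "number": {"42": 0.9},
    "street": {"main": 0.4, "street": 0.4},
    "city": {"springfield": 0.5, "main": 0.05},
}

tokens = ["42", "main", "street", "springfield"]
unconstrained = viterbi(tokens, STATES, START, TRANS, EMIT)
# Clamp position 1 to a known label; the rest is still decoded.
constrained = viterbi(tokens, STATES, START, TRANS, EMIT, fixed={1: "city"})
print(list(zip(tokens, unconstrained)))
print(list(zip(tokens, constrained)))
```

In this toy setting, each known label removes one position from the search, so supplying more labels moves the decoder from the fully unlabeled end of the spectrum toward the fully labeled end.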
