Taming the Unstructured: Creating Structured Content from Partially Labeled Schematic Text Sequences

Abstract

Numerous data sources, such as classified ads in online newspapers, electronic product catalogs, and postal addresses, are rife with unstructured text content. Typically, such content is characterized by attribute value sequences having a common schema. In addition, each sequence is unstructured free text without any separators between the attribute values. Hidden Markov Models (HMMs) have been used for creating structured content from such text sequences by identifying and extracting the attribute values occurring in them. Extant HMM-based approaches to creating structured content from text sequences use either completely labeled or completely unlabeled training data. The HMMs resulting from these two dominant approaches present contrasting trade-offs with respect to labeling effort and the recall/precision of the extracted attribute values. In this paper we propose an HMM-based algorithm that uses partially labeled training data for creating structured content from text sequences. By exploiting the observation that partially labeled sequences give rise to independent subsequences, we compose the HMMs corresponding to these subsequences to create structured content from the complete sequence. An interesting aspect of our approach is that it gives rise to a family of HMMs spanning the trade-off spectrum. We present experimental evidence of the effectiveness of our algorithm on real-life data sets and demonstrate that it is indeed possible to bootstrap structured content creation from schematic text data sources using HMMs that require limited labeling effort, without compromising on the recall/precision performance metrics.
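
As a rough illustration of the idea in the abstract (not the authors' algorithm, whose details are not given on this page), the sketch below runs Viterbi decoding over a toy HMM for postal addresses and lets tokens with known labels pin the state at their positions. The state names, all probabilities, the smoothing floor, and the `viterbi` helper are assumptions made up for this example.

```python
import math

FLOOR = 1e-6  # assumed smoothing probability for tokens a state has never emitted


def viterbi(tokens, states, start_p, trans_p, emit_p, fixed=None):
    """Return the most likely state path for `tokens`.

    `fixed` maps a token position to a known label (partial labeling);
    at those positions every other state is ruled out.
    """
    fixed = fixed or {}

    def allowed(i):
        return [fixed[i]] if i in fixed else states

    def emit(state, token):
        return math.log(emit_p[state].get(token, FLOOR))

    # best[i][s] = (log-probability of the best path ending in s at i, backpointer)
    best = [{s: (math.log(start_p[s]) + emit(s, tokens[0]), None) for s in allowed(0)}]
    for i in range(1, len(tokens)):
        column = {}
        for s in allowed(i):
            prev, score = max(
                ((p, best[i - 1][p][0] + math.log(trans_p[p][s])) for p in best[i - 1]),
                key=lambda pair: pair[1],
            )
            column[s] = (score + emit(s, tokens[i]), prev)
        best.append(column)

    # Backtrack from the best final state.
    state = max(best[-1], key=lambda s: best[-1][s][0])
    path = [state]
    for i in range(len(tokens) - 1, 0, -1):
        state = best[i][state][1]
        path.append(state)
    return path[::-1]


# Toy model: every number below is invented for illustration.
states = ["HOUSE", "STREET", "CITY"]
start_p = {"HOUSE": 0.8, "STREET": 0.1, "CITY": 0.1}
trans_p = {
    "HOUSE": {"HOUSE": 0.1, "STREET": 0.8, "CITY": 0.1},
    "STREET": {"HOUSE": 0.05, "STREET": 0.6, "CITY": 0.35},
    "CITY": {"HOUSE": 0.05, "STREET": 0.05, "CITY": 0.9},
}
emit_p = {
    "HOUSE": {"221": 0.5, "10": 0.5},
    "STREET": {"baker": 0.3, "street": 0.3, "road": 0.4},
    "CITY": {"london": 0.5, "paris": 0.5},
}

tokens = ["10", "london", "road", "paris"]  # "london" is a street name here

unlabeled = viterbi(tokens, states, start_p, trans_p, emit_p)
partially_labeled = viterbi(tokens, states, start_p, trans_p, emit_p, fixed={1: "STREET"})

print(unlabeled)          # ['HOUSE', 'CITY', 'STREET', 'CITY']
print(partially_labeled)  # ['HOUSE', 'STREET', 'STREET', 'CITY']
```

In this toy case a single label on "london" is enough to correct the segmentation. Because of the Markov property, a pinned state also makes the tokens before and after it conditionally independent, so the two sides could be decoded separately; that is one plausible reading of the "independent subsequences" the abstract mentions.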

Bibliographic details

Source: OTM (On the Move) Confederated International Conference: CoopIS (Cooperative Information Systems), DOA (Distributed Objects and Applications), and ODBASE (Ontologies, DataBases and Applications of SEmantics) 2004, Part 2; October 25-29, 2004; Agia Napa (CY) | 2004 | pp. 909-926 | 18 pages
Conference location: Agia Napa (CY)
Authors: Saikat Mukherjee; I.V. Ramakrishnan
Author affiliation: Department of Computer Science, Stony Brook University, Stony Brook, NY 11794, U.S.A.
Original format: PDF
Text language: English
Chinese Library Classification: Computer networks

Similar literature

1. Proposed Architecture for Automatic Conversion of Unstructured Text Data into Structured Text Data on the Web [J]. CH. Madhusudhan, K. Mrithyunjaya Rao. International Journal of Computer Science and Network Security. 2013, Issue 12
2. An Ontology Text Mining to Conversion of Unstructured to Structure Text in D-Matrix [J]. Radhika Y. Deore. Indian Journal of Scientific Research. 2015, Issue 1
3. ETBE (ethyl tert-butyl ether) and TAME (tert-amyl methyl ether) affect microbial community structure and function in soils [J]. Johanna Bartling, Jürgen Esperschütz, Berndt-Michael Wilke. Journal of Hazardous Materials. 2011, Issue 1-3
4. Taming the Unstructured: Creating Structured Content from Partially Labeled Schematic Text Sequences [C]. Saikat Mukherjee, I.V. Ramakrishnan. On the Move Confederated International Conference. 2004
5. Validating a theory-based model of L2 reading comprehension: Relative contributions of content-specific schematic knowledge and L2 vocabulary knowledge to comprehending a science text [D]. Oh, Eunjou. 2010
6. Using natural language processing to extract structured epilepsy data from unstructured clinic letters: development and validation of the ExECT (extraction of epilepsy clinical text) system [O]. Beata Fonferko-Shadrach, Arron S Lacey, Angus Roberts. 2019
7. Collecting Indicators of Compromise from Unstructured Text of Cybersecurity Articles using Neural-Based Sequence Labelling [O]. Zi Long, Lianzhi Tan, Shengping Zhou. 2019
8. Security Classification Using Automated Learning (SCALE): Optimizing Statistical Natural Language Processing Techniques to Assign Security Labels to Unstructured Text [R]. Brown, J. D., Charlebois, D. 2010
