Conference: SIAM International Conference on Data Mining

Exploiting Structured Reference Data for Unsupervised Text Segmentation with Conditional Random Fields

Abstract

Text segmentation is the process of converting information in unstructured text into structured records. It is an important problem because structured data is amenable to efficient query processing. Conditional random fields (CRFs) are a class of discriminative probabilistic models that are gaining acceptance as effective machinery for text segmentation. An important aspect of CRFs is learning the model parameters from labeled training data, and labeling can be a labor-intensive process. One can avoid the labeling step by using structured reference tables whose data domains coincide with those of the input text to be segmented. In other words, the labels in training data drawn from reference tables "come for free". Inspired by recent work on using reference tables to train hidden Markov models (HMMs), we developed an unsupervised technique for text segmentation with CRFs using reference tables. We assume that the text sequences to be segmented arrive in batches and that all sequences in a batch conform to the same attribute order. We build a CRF model for each attribute in the reference table, use these models to decide the attribute order of a batch of input sequences, derive labeled training data from the reference table according to that order, and train a global CRF model to segment the input sequences in the batch. Preliminary experimental results indicate that our technique works well in practice.
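
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the reference table, the function names, and the sample batch are hypothetical, and the per-attribute CRF models are replaced by a much simpler vocabulary-lookup stand-in that orders attributes by where their known tokens tend to appear in the batch. The sketch only shows how labels "come for free" once an attribute order has been chosen.

```python
# Hypothetical reference table: each row is one structured record.
REFERENCE_ROWS = [
    {"name": "Ann Lee", "city": "Stony Brook", "state": "NY"},
    {"name": "Raj Patel", "city": "Palo Alto", "state": "CA"},
]


def guess_attribute_order(batch, rows):
    """Stand-in for the per-attribute CRF models (an assumption, not the
    paper's method): order attributes by the mean relative position of the
    batch tokens found in each attribute's reference vocabulary."""
    vocab = {
        attr: {tok.lower() for row in rows for tok in row[attr].split()}
        for attr in rows[0]
    }
    positions = {attr: [] for attr in vocab}
    for text in batch:
        tokens = text.split()
        for index, token in enumerate(tokens):
            for attr, words in vocab.items():
                if token.lower() in words:
                    positions[attr].append(index / len(tokens))
    # Attributes never matched in the batch are left out of the order.
    seen = [attr for attr in positions if positions[attr]]
    return sorted(seen, key=lambda a: sum(positions[a]) / len(positions[a]))


def derive_labeled_sequences(rows, attribute_order):
    """Concatenate each row's fields in the given order and label every token
    with the attribute it came from, so no manual labeling is needed."""
    labeled = []
    for row in rows:
        tokens, labels = [], []
        for attribute in attribute_order:
            for token in row[attribute].split():
                tokens.append(token)
                labels.append(attribute)
        labeled.append((tokens, labels))
    return labeled


# Hypothetical batch whose sequences share one attribute order.
batch = ["NY Stony Brook Ann Lee", "CA Palo Alto Raj Patel"]
order = guess_attribute_order(batch, REFERENCE_ROWS)
training_data = derive_labeled_sequences(REFERENCE_ROWS, order)
print(order)
print(training_data[0])
```

The resulting (tokens, labels) pairs are the kind of training data that a global sequence model, such as a CRF, could then be trained on to segment the batch.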
