Conference: IEEE International Conference on Data Mining Workshops

A Probabilistic Address Parser Using Conditional Random Fields and Stochastic Regular Grammar


Abstract

Automatic semantic annotation of data from databases or the web is an important preprocessing step for data cleansing and record linkage. It can be used to resolve imperfect field alignment in a database or to identify comparable fields for matching records from multiple sources. The annotation process is not trivial because data values may be noisy, containing abbreviations, variations, or misspellings. In particular, a lexicon-based approach usually yields overlapping features. In this work, we present a probabilistic address parser based on linear-chain conditional random fields (CRFs), which allow more expressive token-level features than hidden Markov models (HMMs). In addition, we propose two general enhancement techniques to improve performance. The first takes the original semi-structure of the data into account. The second post-processes the parser's output sequences by combining their conditional probability with a score function based on a learned stochastic regular grammar (SRG) that captures segment-level dependencies. Experiments compared the CRF parser with an HMM parser and a semi-Markov CRF parser on two real-world datasets. The CRF parser outperformed both the HMM parser and the semi-Markov CRF parser on both datasets in terms of classification accuracy. Leveraging the structure of the data and combining the linear-chain CRF with the SRG further improved the parser, which achieved an accuracy of 97% on a postal dataset and 96% on a company dataset.
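The post-processing idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the form of the score function or of the grammar, so everything below is an assumption made for illustration: the SRG is approximated by smoothed transition probabilities between segment labels, the combination is a weighted sum of log-probabilities, and the label names, the `weight` parameter, and the candidate probabilities are hypothetical.

```python
import math
from collections import defaultdict
from itertools import groupby

START, END = "<s>", "</s>"  # hypothetical boundary symbols


def to_segments(labels):
    """Collapse runs of identical token labels into one segment label each."""
    return [label for label, _ in groupby(labels)]


def learn_srg(training_label_sequences, smoothing=1.0):
    """Estimate segment-to-segment transition probabilities.

    Assumption: a stochastic regular grammar is approximated here by a
    first-order model with one state per segment label and add-one smoothing.
    """
    counts = defaultdict(lambda: defaultdict(float))
    symbols = {END}
    for labels in training_label_sequences:
        path = [START] + to_segments(labels) + [END]
        symbols.update(path[1:])
        for prev, nxt in zip(path, path[1:]):
            counts[prev][nxt] += 1.0

    def log_prob(prev, nxt):
        total = sum(counts[prev].values()) + smoothing * len(symbols)
        return math.log((counts[prev][nxt] + smoothing) / total)

    return log_prob


def srg_log_score(labels, log_prob):
    """Log-probability of the segment sequence under the learned grammar."""
    path = [START] + to_segments(labels) + [END]
    return sum(log_prob(a, b) for a, b in zip(path, path[1:]))


def rerank(candidates, log_prob, weight=1.0):
    """Pick the candidate maximizing CRF log-probability + weight * SRG score.

    candidates: list of (token_labels, crf_log_probability) pairs, e.g. the
    n-best outputs of a linear-chain CRF (values here are made up).
    """
    return max(
        candidates,
        key=lambda c: c[1] + weight * srg_log_score(c[0], log_prob),
    )[0]


# Toy training label sequences (hypothetical label set).
training = [
    ["NUMBER", "STREET", "STREET", "CITY", "POSTCODE"],
    ["NUMBER", "STREET", "CITY", "CITY", "POSTCODE"],
    ["NUMBER", "STREET", "STREET", "CITY", "POSTCODE"],
]
log_prob = learn_srg(training)

# Two candidate labelings with made-up CRF conditional probabilities.
candidates = [
    (["NUMBER", "STREET", "CITY", "STREET", "POSTCODE"], math.log(0.40)),
    (["NUMBER", "STREET", "STREET", "CITY", "POSTCODE"], math.log(0.35)),
]
best = rerank(candidates, log_prob)
```

In this toy setup the first candidate has the higher CRF probability, but its segment order (a street segment after a city segment) never occurs in the training sequences, so the grammar score penalizes it and the second candidate is selected. A real system would obtain the candidates and their probabilities from a trained CRF and would tune the weighting on held-out data.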
