Journal: 《计算机应用与软件》 (Computer Applications and Software)

A Web Information Extraction Method Based on Learning the Construction of Regular Expressions

         

Abstract

Regular expressions have been a common tool in information extraction for many years. However, constructing regular expressions that are both high in quality and high in complexity usually demands substantial manual effort. To address this, the paper proposes an algorithm based on regular-expression state transitions that learns how complex regular expressions are constructed. The algorithm takes as input an initial regular expression together with positive and negative example samples. The initial expression is passed through two classes of state transitions, disjunction separation and merge crossover, which yields a set of candidate regular expressions. Each candidate's extraction quality is scored with the F-measure, and a greedy heuristic strategy selects one best regular expression as the output. The algorithm is evaluated on several datasets. The experiments show that its performance and accuracy both exceed those of conventional machine learning methods, and that it remains effective on small training sets and across datasets.
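The select-by-F-measure loop described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not define the two transition classes, so the candidate generator is passed in as a parameter, and `toy_transform` below is a purely hypothetical stand-in that narrows `\w` to a more specific character class. The names `f_score`, `greedy_refine`, and `toy_transform`, the use of full-string matching as the extraction criterion, and the sample data are all assumptions made for illustration.

```python
import re
from typing import Callable, List, Tuple


def f_score(pattern: str, positives: List[str], negatives: List[str]) -> float:
    # F1 of the regex viewed as a classifier (assumption: a sample counts as
    # "extracted" when the whole string matches the pattern).
    rx = re.compile(pattern)
    tp = sum(1 for s in positives if rx.fullmatch(s))
    fp = sum(1 for s in negatives if rx.fullmatch(s))
    fn = len(positives) - tp
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def greedy_refine(
    initial: str,
    positives: List[str],
    negatives: List[str],
    transform: Callable[[str], List[str]],
    max_steps: int = 20,
) -> Tuple[str, float]:
    # Greedy hill climbing: at each step move to the best-scoring candidate,
    # and stop as soon as no candidate strictly improves the F-score.
    best, best_f = initial, f_score(initial, positives, negatives)
    for _ in range(max_steps):
        scored = [(f_score(c, positives, negatives), c) for c in transform(best)]
        if not scored:
            break
        cand_f, cand = max(scored)
        if cand_f <= best_f:
            break
        best, best_f = cand, cand_f
    return best, best_f


def toy_transform(pattern: str) -> List[str]:
    # Hypothetical candidate generator (NOT the paper's transitions):
    # replace one occurrence of \w with \d or [a-z].
    candidates = []
    for m in re.finditer(r"\\w", pattern):
        for repl in (r"\d", "[a-z]"):
            candidates.append(pattern[: m.start()] + repl + pattern[m.end():])
    return candidates


if __name__ == "__main__":
    positives = ["010-1234", "021-5678"]   # made-up positive samples
    negatives = ["abc-def", "tel-1234"]    # made-up negative samples
    print(greedy_refine(r"\w+-\w+", positives, negatives, toy_transform))
```

Keeping the candidate generator as a parameter means the same loop could be reused with real implementations of disjunction separation and merge crossover once their definitions are taken from the full paper.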
