Conference: ACM/IEEE-CS Joint Conference on Digital Libraries

Automatic document metadata extraction using support vector machines

Abstract

Automatic metadata generation provides scalability and usability for digital libraries and their collections. Machine learning methods offer robust and adaptable automatic metadata extraction. We describe a Support Vector Machine classification-based method for extracting metadata from the header part of research papers and show that it outperforms other machine learning methods on the same task. The method first classifies each line of the header into one or more of 15 classes. An iterative convergence procedure then improves the line classification by using the class labels predicted for each line's neighboring lines in the previous round. Further metadata extraction is done by seeking the best chunk boundaries within each line. We found that discovering and using the structural patterns of the data, together with domain-based word clustering, can improve metadata extraction performance. An appropriate feature normalization also greatly improves classification performance. Our metadata extraction method was originally designed to improve the metadata extraction quality of the digital libraries Citeseer [17] and EbizSearch [24]. We believe it can be generalized to other digital libraries.
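The iterative convergence procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (iterative_relabel, toy_classify), the four labels, the single-label simplification (the paper allows a line to belong to more than one class), and the sample header lines are all assumptions made for illustration, and a hand-written rule function stands in for the trained SVM classifiers.

```python
from typing import Callable, List, Optional

# Hypothetical classifier signature: (line text, label predicted for the
# previous line, label predicted for the next line) -> label for this line.
# In the paper this role is played by SVM classifiers; here it is a stand-in.
Classifier = Callable[[str, Optional[str], Optional[str]], str]


def iterative_relabel(lines: List[str], classify: Classifier,
                      max_rounds: int = 10) -> List[str]:
    """Relabel lines using neighbor labels from the previous round
    until the labels stop changing or max_rounds is reached."""
    # Round 0: classify every line independently, with no neighbor context.
    labels = [classify(line, None, None) for line in lines]
    for _ in range(max_rounds):
        new_labels = []
        for i, line in enumerate(lines):
            prev_label = labels[i - 1] if i > 0 else None
            next_label = labels[i + 1] if i + 1 < len(lines) else None
            new_labels.append(classify(line, prev_label, next_label))
        if new_labels == labels:  # converged: no label changed this round
            break
        labels = new_labels
    return labels


def toy_classify(line: str, prev_label: Optional[str],
                 next_label: Optional[str]) -> str:
    """Toy rule-based stand-in for a trained line classifier (assumption)."""
    text = line.lower()
    if "@" in text:
        return "email"
    if "university" in text:
        return "affiliation"
    # Contextual rule: a line directly above an affiliation is likely an author.
    if next_label == "affiliation":
        return "author"
    return "title"


# Made-up header lines, for illustration only.
header = [
    "Automatic Document Metadata Extraction",
    "Jane Doe",
    "Example University",
    "jane@example.org",
]
print(iterative_relabel(header, toy_classify))
# ['title', 'author', 'affiliation', 'email']
```

In this toy run, the independent first pass labels "Jane Doe" as a title; the second round corrects it to an author line because the line below it was predicted to be an affiliation, and the third round confirms that no label changes.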
