Journal: Expert Systems with Applications

Using linguistic features to automatically extract web page title


Abstract

Existing methods for extracting titles from HTML web pages mostly rely on visual and structural features. However, this approach fails on service-based web pages because advertisements are often given more visual emphasis than the main headlines. To improve on the current state of the art, we propose a novel method that combines statistical features, linguistic knowledge, and text segmentation. Using an annotated English corpus, we learn the morphosyntactic characteristics of known titles and define part-of-speech tag patterns that help to extract candidate phrases from the web page. To evaluate the proposed method, we compared results on two datasets, Titler and Mopsi, and evaluated the extracted features using four classifiers: Naive Bayes, k-NN, SVM, and clustering. Experimental results show that the proposed method outperforms the solution used by Google, improving the score from 0.58 to 0.85 on the Titler corpus and from 0.43 to 0.55 on the Mopsi dataset, and offers a readily available solution for the title extraction problem. © 2017 Elsevier Ltd. All rights reserved.
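
As a rough illustration of how part-of-speech tag patterns can select candidate phrases, the sketch below scans already-tagged tokens and keeps runs that match one pattern. The pattern itself (optional adjectives followed by one or more nouns), the function name `candidate_phrases`, and the sample input are assumptions made for this example. The abstract does not list the patterns learned from the annotated corpus, and this is not the authors' implementation.

```python
# Penn Treebank style tags; the tag sets and the pattern are illustrative assumptions.
ADJECTIVE_TAGS = {"JJ"}
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def candidate_phrases(tagged):
    """Return phrases matching the hypothetical pattern: adjectives* nouns+.

    `tagged` is a list of (word, tag) pairs produced by any POS tagger.
    """
    phrases = []
    i, n = 0, len(tagged)
    while i < n:
        # Consume any leading adjectives.
        j = i
        while j < n and tagged[j][1] in ADJECTIVE_TAGS:
            j += 1
        # Consume the nouns that follow.
        k = j
        while k < n and tagged[k][1] in NOUN_TAGS:
            k += 1
        if k > j:
            # At least one noun was found, so the span i..k matches the pattern.
            phrases.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:
            # No match starting here; skip past what was scanned.
            i = max(j, i + 1)
    return phrases


# Hand-tagged sample input (hypothetical, not taken from Titler or Mopsi).
sample = [
    ("Best", "JJ"), ("pizza", "NN"), ("restaurant", "NN"),
    ("in", "IN"), ("Helsinki", "NNP"),
]
print(candidate_phrases(sample))
```

In a full pipeline, candidates like these would still need to be scored, for example with the statistical features and classifiers the abstract mentions.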
