Using linguistic features to automatically extract web page title

Gali Najlah; Mariescu-Istodor Radu; Franti Pasi

Abstract

Existing methods for extracting titles from HTML web pages mostly rely on visual and structural features. However, this approach fails in the case of service-based web pages because advertisements are often given more visual emphasis than the main headlines. To improve on the current state of the art, we propose a novel method that combines statistical features, linguistic knowledge, and text segmentation. Using an annotated English corpus, we learn the morphosyntactic characteristics of known titles and define part-of-speech tag patterns that help to extract candidate phrases from the web page. To evaluate the proposed method, we compared two datasets, Titler and Mopsi, and evaluated the extracted features using four classifiers: Naive Bayes, k-NN, SVM, and clustering. Experimental results show that the proposed method outperforms the solution used by Google, improving the score from 0.58 to 0.85 on the Titler corpus and from 0.43 to 0.55 on the Mopsi dataset, and offers a readily available solution for the title extraction problem. © 2017 Elsevier Ltd. All rights reserved.
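
As a rough illustration of the candidate-phrase step mentioned in the abstract, the sketch below scans a sequence of already-tagged tokens and keeps every run that fits a simple part-of-speech pattern. The abstract does not list the actual tag patterns, so the pattern used here (optional adjectives followed by one or more nouns), the function name candidate_phrases, and the sample sentence are all illustrative assumptions. The input is assumed to carry Penn Treebank tags produced by some external tagger.

```python
# Hypothetical tag classes (Penn Treebank tags); not taken from the paper.
ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def candidate_phrases(tagged_tokens):
    """Return phrases matching the assumed pattern: adjectives* nouns+.

    tagged_tokens is a list of (word, tag) pairs.
    """
    phrases = []
    i, n = 0, len(tagged_tokens)
    while i < n:
        # Consume an optional run of adjectives starting at position i.
        j = i
        while j < n and tagged_tokens[j][1] in ADJECTIVE_TAGS:
            j += 1
        # Consume the run of nouns that must follow.
        k = j
        while k < n and tagged_tokens[k][1] in NOUN_TAGS:
            k += 1
        if k > j:
            # At least one noun was found, so tokens i..k-1 form a candidate.
            phrases.append(" ".join(word for word, _ in tagged_tokens[i:k]))
            i = k
        else:
            # No noun followed; skip past what was scanned and continue.
            i = max(j, i + 1)
    return phrases


# Made-up example sentence, tagged by hand for illustration.
tagged = [
    ("Visit", "VB"), ("our", "PRP$"), ("new", "JJ"), ("pizza", "NN"),
    ("restaurant", "NN"), ("in", "IN"), ("Joensuu", "NNP"),
]
print(candidate_phrases(tagged))
```

In a fuller pipeline, candidates like these would then be scored with features and passed to a classifier; that stage is not sketched here.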

Bibliographic record

Source
Expert Systems with Applications | 2017, issue 8 | pp. 296-312 | 17 pages
Authors
Gali Najlah; Mariescu-Istodor Radu; Franti Pasi
Author affiliations

Univ Eastern Finland, Sch Comp, Machine Learning Grp, FI-80101 Joensuu, Finland;

Univ Eastern Finland, Sch Comp, Machine Learning Grp, FI-80101 Joensuu, Finland;

Univ Eastern Finland, Sch Comp, Machine Learning Grp, FI-80101 Joensuu, Finland;

Original format: PDF
Language: English
Keywords
Web content mining; Information extraction; Title extraction; Natural language processing; Machine learning;

