Classifying Websites into Non-topical Categories

Abstract

With organizations from different sectors of the economy widely present on the web, detecting which sector a given website belongs to is both important and challenging. In this paper, we study the problem of classifying websites into four non-topical categories: public, private, non-profit and commercial franchise. Our work treats each website and all of its pages as a single entity and classifies the entire website, as opposed to a single page or a set of pages. We analyze both textual features, including terms, part-of-speech bigrams and named entities, and structural features, including the link structure of the site and URL patterns. Our experiments on a large set of websites related to weight loss and obesity control, under a multi-label classification setting using an SVM classifier, reveal that with careful selection and treatment of keyword-based features, one can achieve an F-measure of 70%, and that adding structural, part-of-speech and named-entity-based features further improves the F-measure to 74%. The improvement is more significant when textual features are not accurate or sufficient.
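
To make the site-level setup more concrete, the following is a minimal sketch, not the authors' implementation. It pools the pages of one site into a single feature bag (word terms plus two simple URL-pattern features) and computes a micro-averaged F-measure over label sets. The function names, the feature naming scheme, the choice of URL features, and the use of micro-averaging are all assumptions made for illustration; the abstract does not specify them, and the SVM training step is omitted.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# The four non-topical categories named in the abstract.
LABELS = ["public", "private", "non-profit", "commercial franchise"]


def site_features(pages):
    """Pool all pages of one website into a single feature bag.

    pages: list of (url, text) tuples belonging to the same site.
    Feature names ("term=", "tld=", "path=") are hypothetical.
    """
    feats = Counter()
    for url, text in pages:
        # Textual features: lowercase word terms, counted across all pages.
        for term in re.findall(r"[a-z]+", text.lower()):
            feats["term=" + term] += 1
        # Structural features (assumed): top-level domain and first path segment.
        parsed = urlparse(url)
        host = parsed.hostname or ""
        feats["tld=" + host.rsplit(".", 1)[-1]] += 1
        segments = [s for s in parsed.path.split("/") if s]
        if segments:
            feats["path=" + segments[0].lower()] += 1
    return feats


def micro_f_measure(true_sets, pred_sets):
    """Micro-averaged F-measure for multi-label predictions.

    true_sets, pred_sets: parallel lists of label sets, one per website.
    Micro-averaging is an assumption; the abstract does not say which
    averaging was used.
    """
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    fp = sum(len(p - t) for t, p in zip(true_sets, pred_sets))
    fn = sum(len(t - p) for t, p in zip(true_sets, pred_sets))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In a full pipeline, the feature bags would be vectorized and passed to one binary SVM per category (a common way to handle multi-label classification), with the F-measure computed on held-out sites.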