Conference: Canadian Conference on Artificial Intelligence

Selective Retrieval for Categorization of Semi-structured Web Resources


Abstract

A typical online content directory contains factual information about entities (e.g., a company's address) together with entity categories (e.g., the company's industries). The categories are a salient element of the system because they let users browse for entities of a chosen type. Assigning categories manually can be challenging, since an entity may belong to only a few out of hundreds of categories (e.g., all possible industry types). We instead propose augmenting this process with an automatic categorization system that suggests categories based on the entity's home page. To improve accuracy, the system follows links extracted from the home page and uses the retrieved content to expand the entity's term profile. A multi-label classification system then uses the profile to assign categories to the entity. The key element of the system is a link ranking module, which uses home page features (e.g., the position and anchor text of links) to select the links most likely to improve the categorization results. Evaluation on a data set of nearly ten thousand company home pages confirmed that link ranking lets the system limit retrieval and processing costs enough to respond in real time while still outperforming the categorization results of baseline systems.
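
The pipeline described in the abstract (rank the home page links, retrieve only the top few, merge their text into a term profile) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract names link position and anchor text as features but gives no scoring function, so the `INFORMATIVE_ANCHORS` word list, the scoring formula in `score_link`, the `budget` parameter, and every function name here are assumptions made for illustration. Page retrieval and the multi-label classifier are left out, and linked page text is passed in as plain strings.

```python
import re
from collections import Counter
from dataclasses import dataclass

# Hypothetical anchor-text cues; the abstract does not list the real features.
INFORMATIVE_ANCHORS = {"about", "products", "services", "company", "solutions"}


@dataclass
class Link:
    url: str
    anchor_text: str
    position: int  # 0-based order of appearance on the home page


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def score_link(link, total_links):
    """Assumed heuristic: earlier links and informative anchors score higher."""
    position_score = 1.0 - link.position / max(total_links, 1)
    anchor_score = sum(
        1.0 for token in tokenize(link.anchor_text) if token in INFORMATIVE_ANCHORS
    )
    return position_score + anchor_score


def select_links(links, budget):
    """Return the `budget` highest-scoring links, bounding the retrieval cost."""
    ranked = sorted(links, key=lambda l: score_link(l, len(links)), reverse=True)
    return ranked[:budget]


def build_term_profile(home_text, linked_texts):
    """Expand the home page term counts with terms from the retrieved pages."""
    profile = Counter(tokenize(home_text))
    for text in linked_texts:
        profile.update(tokenize(text))
    return profile


links = [
    Link("/careers", "Careers", 0),
    Link("/about", "About us", 1),
    Link("/products", "Our products", 2),
    Link("/privacy", "Privacy policy", 3),
]
selected = select_links(links, budget=2)
profile = build_term_profile("Acme builds robots", ["Acme robots for factories"])
```

With this toy scoring, `selected` holds the `/about` and `/products` links, and `profile` would then serve as the input to a multi-label classifier. In a real system the hand-written score would presumably be replaced by a ranking model learned from data.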
